Here's a typical ColdFusion support ticket of the kind we get in Macromedia Support. In this case, the server stopped responding or crashed over a weekend, and I was sent the log files and server settings to review for clues about what happened.

I'm providing this just as an example of how I go about drawing conclusions and reconstructing the events that transpired. Maybe it will help you when you need to troubleshoot your own CF servers. This case isn't complete, so when further progress is made, I'll try to update the critical info here. Names and private info have been removed.

Background: this is a ColdFusion MX 6.1 Updater 1 server configuration on Windows 2000 with IIS 5.

Problem Report

We had one of our CF sites go down this weekend and we don't know why the services were simply offline. I'd like some help doing a post-mortem analysis to figure out why the services stopped.

Analysis

I believe that the server entered a state of unresponsiveness where the running request pool became saturated with slow or possibly hung web threads, forcing other incoming requests to queue up.
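To make the saturation idea concrete, here is a minimal Java sketch. It is only an analogy using a plain `ThreadPoolExecutor`, not ColdFusion's or JRun's actual request handling, and the class and method names (`SaturatedPool`, `queuedWhenSaturated`) are made up for illustration. Every worker in a fixed-size pool is occupied by a "hung" request, so every later request just sits in the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: a fixed pool stands in for the
// "simultaneous requests" limit, and latches stand in for hung pages.
final class SaturatedPool {

    // Fills every worker with a request that never finishes on its own,
    // submits extra requests, and reports how many are left waiting.
    static int queuedWhenSaturated(int poolSize, int extraRequests)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        final CountDownLatch started = new CountDownLatch(poolSize);
        final CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < poolSize; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // the "hung" request
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // every worker is now busy
            for (int i = 0; i < extraRequests; i++) {
                pool.execute(() -> { }); // a fast page that cannot get a thread
            }
            return pool.getQueue().size();
        } finally {
            release.countDown(); // un-hang the workers so the JVM can exit
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("queued requests: " + queuedWhenSaturated(2, 3));
    }
}
```

With a pool of 2 and 3 extra requests, all 3 stay queued even though they would finish instantly if a thread were free. From the outside that looks like a server that is running but not answering.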




The server was started on Wednesday 06/15 at 23:35 and ran continuously until Monday 06/20 at 06:53:53. The process was running during that whole period, so it wasn't a crash, but most likely a hang.

You have the ColdFusion Administrator configured to log slow pages that exceed 45 seconds. Over the weekend some pages completed after exceeding 45 seconds, but not many: only 2 on Friday, 2 on Saturday, and none on Sunday. The last slow page, a scheduled task that runs at 11:30 pm every night, was logged on Saturday, which means the server was still processing requests at least until about midnight Sat/Sun. ColdFusion does not log slow pages that exceed the timeout if they *never* complete, so there's a remote chance that some web threads got into a hung state, and that enough of them accumulated until the running request pool was full of nothing but hung or extremely slow pages.




None of the log files show any entries at all for Sunday the 19th. On its face this is good, but it might also be a clue supporting the unresponsiveness. When a server is unresponsive, though, users usually click STOP in their browser after a while, and that causes a Connection Reset entry in default-event.log. Oddly, the last connection reset error occurs Friday evening at 7 PM. So we know the server was handling requests at least until Saturday night, and that no one clicked STOP in the browser after Friday night. There are neither slow pages nor connection resets for Sunday the 19th, so it's difficult to know what was going on that day.




My best suggestion at this point is to prepare the ColdFusion server so that a series of thread dumps can be generated. Instructions are provided here: http://www.macromedia.com/go/tn_18339




You could use the instructions to start ColdFusion in the manner required for thread dumps, and leave it running that way indefinitely until the server hangs again, if it does at all. If you find the server in an unresponsive state, press Ctrl+Break while the focus is on the command window to write a thread dump to the output file. Do two or three thread dumps in a row, about 15 seconds apart, and send them to me. I will be able to analyze them and compare what's happening between dumps. For example, I can see whether a given thread is stuck or hung, because it will be doing the same thing in each dump.
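If you want to do a first pass on the dumps yourself, the comparison can be sketched roughly like this in Java. This is not a Macromedia tool, just a minimal sketch. It assumes the common JVM dump layout (a header line starting with the quoted thread name, followed by indented `at ...` frame lines), and the class name, thread names and stack frames in the sample are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: lists threads whose stack is identical in every dump.
final class ThreadDumpDiff {

    // Parses one dump into thread name -> stack frames (as one string).
    // Assumes a header line beginning with a quoted thread name, followed
    // by indented "at ..." lines; anything else is ignored.
    static Map<String, String> parse(String dump) {
        Map<String, String> stacks = new LinkedHashMap<>();
        String current = null;
        StringBuilder frames = new StringBuilder();
        for (String line : dump.split("\\r?\\n")) {
            if (line.startsWith("\"")) {
                if (current != null) stacks.put(current, frames.toString());
                int end = line.indexOf('"', 1);
                current = end > 0 ? line.substring(1, end) : line.substring(1);
                frames.setLength(0);
            } else if (current != null && line.trim().startsWith("at ")) {
                frames.append(line.trim()).append('\n');
            }
        }
        if (current != null) stacks.put(current, frames.toString());
        return stacks;
    }

    // Returns the threads that have the same non-empty stack in all dumps.
    static List<String> unchangedThreads(List<String> dumps) {
        List<String> result = new ArrayList<>();
        if (dumps.size() < 2) return result; // nothing to compare against
        Map<String, String> first = parse(dumps.get(0));
        List<Map<String, String>> rest = new ArrayList<>();
        for (String d : dumps.subList(1, dumps.size())) rest.add(parse(d));
        for (Map.Entry<String, String> e : first.entrySet()) {
            if (e.getValue().isEmpty()) continue;
            boolean same = true;
            for (Map<String, String> other : rest) {
                if (!e.getValue().equals(other.get(e.getKey()))) {
                    same = false;
                    break;
                }
            }
            if (same) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample dumps, taken "15 seconds apart".
        String dump1 = "\"jrpp-1\" prio=5 runnable\n"
                + "\tat com.example.SlowQuery.fetch(SlowQuery.java:42)\n\n"
                + "\"jrpp-2\" prio=5 runnable\n"
                + "\tat com.example.Page.render(Page.java:10)\n";
        String dump2 = "\"jrpp-1\" prio=5 runnable\n"
                + "\tat com.example.SlowQuery.fetch(SlowQuery.java:42)\n\n"
                + "\"jrpp-2\" prio=5 runnable\n"
                + "\tat com.example.Page.flush(Page.java:77)\n";
        System.out.println(unchangedThreads(Arrays.asList(dump1, dump2)));
    }
}
```

Treat the result as a list of candidates rather than a verdict. Idle threads that are simply waiting for work also show the same stack in every dump, so each match still needs a look at what the thread is actually doing.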




Otherwise, the server settings do look good as far as Simultaneous Requests and Timeout. There are a dozen SQL datasources and 1 MS Access DSN (I think it's Access), so that's fine. There are 2 daily scheduled tasks, so that's fine. 1 Java CFX, no worries there. And a whole bunch of Verity collections, but nothing to indicate a problem with that either.




You have the following hotfixes installed:
  • hf55681_61.jar; ColdFusion MX 6.1 Updater: Hot fix for ColdFusion not Responding to Requests
  • hf56580_611.jar; ColdFusion MX 6.1 Updater: Hot fix for cfdump throwing unknown type error for cfcatch structure
  • hf59763_611.jar; ColdFusion MX 6.1 Updater: Hot fix to improve memory utilization when enabling Debug Logging
  • hf59993_611.jar; ColdFusion MX 6.1 Updater: Hot fix for client variables
  • jrun-hotfix-57510-updater4.jar; Hot fix for JRun 4.0 and JRun 4.0-based servers (Run ProxyService become unresponsive under light load)


These are the critical ones for your version of ColdFusion, so you're in good shape there and we can rule out those bugs from the possible causes.




Together, this leans more towards bottlenecks in the application causing slowdowns. Note that some types of ColdFusion operations will not time out at all. The most widely known case: once a request has connected to a database and sent the SQL statement, then so long as ColdFusion is waiting for all the data from the database, that web thread will *not* time out during the query. Say you wrote a query that took three hours but had the usual 60 second timeout. The query would run for those three hours, and as soon as it completed, the CF server timeout would be checked when the next line of code started. Since 60 seconds were exceeded, a Timeout Error would occur at the end of those 3 hours. It's not a helpful behavior, but that's the way it is.
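The timeout behavior described above can be modeled with a small Java sketch. This is a toy model of "the timeout is only checked between steps", not ColdFusion's real implementation. The names (`BetweenStepsTimeout`, `FakeClock`, `RequestTimedOut`) are invented, and a fake clock is used so the three-hour "query" finishes instantly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: the deadline is examined only before each step,
// so a step that blocks for hours is never interrupted.
final class BetweenStepsTimeout {

    // Fake clock, in seconds, so the example runs instantly.
    static final class FakeClock {
        long seconds = 0;
        void advance(long s) { seconds += s; }
    }

    static final class RequestTimedOut extends RuntimeException {
        final long elapsedSeconds;
        RequestTimedOut(long elapsedSeconds) {
            super("request exceeded timeout after " + elapsedSeconds + " s");
            this.elapsedSeconds = elapsedSeconds;
        }
    }

    // Runs the steps in order, checking the timeout before each one starts.
    static void runRequest(FakeClock clock, long timeoutSeconds, List<Runnable> steps) {
        long start = clock.seconds;
        for (Runnable step : steps) {
            long elapsed = clock.seconds - start;
            if (elapsed > timeoutSeconds) throw new RequestTimedOut(elapsed);
            step.run(); // nothing watches the clock while this runs
        }
    }

    public static void main(String[] args) {
        final FakeClock clock = new FakeClock();
        List<Runnable> steps = new ArrayList<>();
        steps.add(() -> clock.advance(3 * 60 * 60)); // the three-hour "query"
        steps.add(() -> clock.advance(1));           // the next line of code
        try {
            runRequest(clock, 60, steps);
        } catch (RequestTimedOut e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the 60 second limit is only noticed 10800 seconds in, when the second step is about to start. The practical upshot is that the thread stays tied up for the full duration of the slow operation no matter what the timeout setting says.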

You should also be aware of a recently identified bug: the CFFTP "timeout" attribute is broken for all operations, and a hotfix is pending. Please check your code for FTP operations.




That's all I have. Please set up ColdFusion to be ready to take thread dumps and send any that are generated back to me.