Thanks, Neil.

Just checked: the apache2 MPM is prefork already.

It crashed again today, so I checked the logs.

The scgi error log shows lots of these:

    IOError caught while writing request ([Errno 32] Broken pipe)

then two of these:

    IOError while closing connection ignored: [Errno 32] Broken pipe

right before scgi froze up.

Seconds before these, memcached connections timed out several times and
mysql deadlocked several times, and the scgi access log shows requests
returning in seconds or tens of seconds (normally less than 0.1 second).

The apache error log shows tons of:

    [notice] child pid 24880 exit signal Segmentation fault (11)

then a couple of:

    [error] [client ... ] (70007)The timeout specified has expired: scgi: The timeout specified has expired: scgi: can't connect to server, referer: ...

and the pattern repeats until scgi is restarted.

I think I've pinpointed the problem now: off-line batch processing jobs
in crontab induced prolonged mysql table locks, and the time-outs somehow
caused a chain of failures in scgi and apache.

I'll write to this list again when the issue is resolved; otherwise I'll
discuss it in private.

- bo

On Jul 31, 2005, at 1:02 AM, Neil Schemenauer wrote:

> On Sat, Jul 30, 2005 at 02:04:26PM +0800, Bo Yang wrote:
>
>> It happened three times around the time mysql backup cron
>> jobs run around 3 a.m. each day. mysqldump was configured to lock up
>> each table in turn, so some of the unlucky write requests might have
>> to wait for a few minutes to get through. The backups took about 15
>> minutes, but scgi froze right there, all the way to the morning when
>> I woke up in panic.
>>
>
> Some ideas:
>
>   * Use a tool like gdb or strace to see where the Apache process is
>     hanging.
>
>   * If you don't have time to use gdb or strace, kill the process
>     with SIGQUIT (should produce a core dump) and use gdb on the
>     core file.
>
>   * Try a different MPM (multi-process module) for Apache.
>     Prefork should be the most robust.
>
>   * Try removing the connection retry logic from mod_scgi (remove
>     the entire "if" block that contains "goto restart"). There
>     could be a bug in that code, although I don't see it.
>
>   * If you have a test machine, perhaps you could try to reproduce
>     the hang (e.g. use "ab" and add some sleeps in the Quixote
>     application to simulate locked tables).
>
>
>   Neil
>
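
For the last idea in the quoted list (adding sleeps to the Quixote
application to simulate locked tables), a minimal sketch could look like
the following. It is plain Python and uses no Quixote or scgi API;
simulate_table_lock and save_comment are hypothetical names, and the
probability and delay values are made-up numbers to tune on a test machine.

```python
import random
import time
from functools import wraps


def simulate_table_lock(probability=0.1, min_delay=1.0, max_delay=30.0,
                        sleep=time.sleep, rng=random.random):
    """Decorator that occasionally stalls a handler before it runs.

    With the given probability, the wrapped handler sleeps for a delay
    between min_delay and max_delay seconds, roughly as if it were
    waiting on a table locked by a backup job. The sleep and rng
    arguments exist only so the behaviour can be tested without waiting.
    """
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                # Pick a stall length inside [min_delay, max_delay).
                delay = min_delay + (max_delay - min_delay) * rng()
                sleep(delay)
            return handler(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical stand-in for a handler that writes to MySQL.
@simulate_table_lock(probability=0.2, min_delay=5.0, max_delay=60.0)
def save_comment(request):
    return "saved"
```

Wrapping a write handler this way on a test machine and then driving the
site with "ab" should make some requests stall much as they would behind a
mysqldump table lock. Whether that is enough to reproduce the freeze is
untested.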