From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7

Description of problem:
Occasionally, all user processes will freeze when the system is under moderate to heavy load. vmstat running on the console will stop producing output. Characters typed at the console will echo to the screen but have no other effect. No error messages appear on the console or in the logs on disk. The machine is still pingable, and network connections (such as telnet) seem to be established, but no login banner ever appears. I can still use Alt+SysRq to produce output or reboot the machine.

This situation occurs randomly; sometimes the machine does not freeze even during heavy load. It rarely goes more than a week, however, without freezing. This situation occurs with earlier versions of the kernel as well, going back to at least kernel-2.4.21-27.0.2.ELsmp.

The system specs are as follows:
Dell PowerEdge 2600 w/2GB RAM, 2xPentium Xeon 3.06GHz, RAID5 on embedded PERC 4/Di, 3xSeagate ST373307LC w/latest firmware (DS09)
Intel PRO/1000 (running at 100Mb/s)

I have also experimented with different kernel, driver, and firmware revisions for various system components, especially with the PERC 4/Di (currently using megaraid2 v2.10.8.2-RH1 with 251S:1.07 firmware), but with no effect. Disabling write caching also has no effect. Changing BIOS settings, such as disabling sequential memory access and hyperthreading, also has no effect. Dell diagnostics check out; memory tests with memtest86+ check out.

I've tried to artificially load the machine to coerce it into a freeze situation, but have been unable to. It only seems to happen during normal operation outside of my maintenance windows! At this point, I am not sure what the problem could be. The machine had been in service for about 1 year as an e-mail server before exhibiting this problem, although in general it has not always been as heavily loaded as it is now.
I did manage to snag some screen captures of various Alt-SysRq dumps before the watchdog kicked in and will attach them to this report. Does anyone have any ideas on this one? It seems to be a real showstopper. Thanks! -S

Version-Release number of selected component (if applicable):
kernel-2.4.21-37.ELsmp

How reproducible:
Couldn't Reproduce

Steps to Reproduce:
1. Wait a random amount of time once the system has at least a moderate load.
2. There is a chance that the machine will freeze.
3.

Actual Results: User-space processes freeze, but kernel is still responsive to Alt-SysRq and external pings.

Expected Results: User processes should continue to respond.

Additional info:
I'm not sure how useful this info is; it is the last bit of information provided by top before the freeze. I will attach some more output from Alt-SysRq that will hopefully be more useful to those who can read it.

 09:20:13 up 3 days, 3:30, 14 users, load average: 6.37, 6.22, 5.11
250 processes: 249 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    5.1%   17.9%    8.2%   0.0%     0.5%   62.4%    5.7%
           cpu00    3.5%   14.5%    7.1%   0.0%     1.1%   73.1%    0.3%
           cpu01    5.1%   22.7%    7.5%   0.0%     0.0%   61.9%    2.5%
           cpu02    8.5%   12.7%   10.1%   0.0%     0.7%   57.5%   10.1%
           cpu03    3.1%   21.7%    7.9%   0.0%     0.1%   56.9%    9.9%
Mem:  2055312k av, 2037808k used,   17504k free,       0k shrd,    8280k buff
                   1594384k actv,  302100k in_d,   30824k in_c
Swap:  755044k av,       0k used,  755044k free                 1363708k cached
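For anyone skimming the top snapshot above: the "total" row reports iowait at 62.4%, well above every other state, which is at least consistent with tasks waiting on I/O rather than using CPU. A minimal Python sketch for pulling that figure out of a saved line follows. The function name parse_cpu_states is hypothetical, and the column order is assumed from the "CPU states" header in the snapshot; this is an illustration, not part of any existing tool.

```python
# Column order assumed from the "CPU states" header row in the top snapshot.
FIELDS = ("user", "nice", "system", "irq", "softirq", "iowait", "idle")


def parse_cpu_states(line):
    """Split one row such as 'total 5.1% 17.9% ...' into (label, {state: percent})."""
    label, *values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} percentages, got {len(values)}")
    states = {name: float(value.rstrip("%")) for name, value in zip(FIELDS, values)}
    return label, states


# The "total" row copied from the report.
label, states = parse_cpu_states("total 5.1% 17.9% 8.2% 0.0% 0.5% 62.4% 5.7%")

# The state with the largest share of CPU time in this sample.
dominant = max(states, key=states.get)
print(label, dominant, states[dominant])
```

A single sample like this cannot identify a cause on its own; it only suggests where to look next.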
Created attachment 121021 [details] CPU flag dump from Alt+SysRq+P, with tail-end of vmstat output
Created attachment 121022 [details] Memory dump from Alt+SysRq+M
Created attachment 121023 [details] Task dump from Alt+SysRq+T
This is the first I have heard about this problem. You said that it is not reproducible, yet it happens? Can you get the full AltSysrq-W and AltSysrq-T outputs from /var/log/messages so I can see what is happening when the system freezes? Thanks, Larry Woodman
I did not know about AltSysRq-W; I'll try to get that information the next time the freeze occurs. None of the Alt-SysRq output goes to the syslog because it is dead by that point. Is there another way I can capture the output of these commands, since it scrolls off the screen? Thanks! -S
OK, can you try to set up a serial console so that you can capture the necessary debugging information? Also, exactly how reproducible is this problem? Larry Woodman
I will do as you suggest and work to set up a serial console. The problem is not reproducible "on demand." I basically have to wait for it to freeze and hopefully be in a position to capture the necessary information before the watchdog resets the machine. It usually freezes about once a week, so I may not be able to post more info until then.
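As a rough illustration of the capture step, the Python sketch below copies lines from a serial port to a log with a timestamp on each line, so that SysRq output is saved instead of scrolling off the screen. It is meant to run on a second machine attached to the server's serial console. The function name capture_console is hypothetical, and the sketch assumes the port speed and line settings have already been configured by other means.

```python
import time


def capture_console(source, sink, clock=time.time):
    """Copy each line from a binary `source` to a text `sink`, prefixed with a timestamp.

    `source` is any iterable of byte lines (for example a serial device opened
    in binary mode); `sink` is any text file-like object. Returns the number
    of lines copied.
    """
    count = 0
    for raw in source:
        # Console output is expected to be ASCII; replace anything that is not.
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        sink.write(f"{stamp} {line}\n")
        # Flush per line so the log survives if the capture host is interrupted.
        sink.flush()
        count += 1
    return count
```

On the capture host, source could be the serial device opened in binary mode (the path, such as /dev/ttyS0, is an assumption and depends on the hardware) and sink a log file opened for appending.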
OK and please make sure this is the latest RHEL3-U6 kernel. Larry
OK, it froze again, and I was able to retrieve the data you requested. I will attach it below. (I am using the latest RHEL3-U6 kernel.) -S
Created attachment 121353 [details] output from Alt+SysRq+P, M, W, T, & P again, all in one document, obtained from the serial console during a freeze situation
Could this be another deadlock scenario, as discussed in bug #122077? The symptoms are identical, and there are plenty of references to dquot in the SysRq dumps I just posted. I also ran across this thread: https://listman.redhat.com/archives/ext3-users/2003-November/msg00024.html which included a patch for another ext3/quota deadlock scenario, but it doesn't seem to be incorporated into my kernel source tree (stock RHEL3). -S
OK, I think this bug is a duplicate of bug #122252. I applied the kernel patch that was posted there and have not had any freezes since then. My uptime is the highest it's ever been since the bug started manifesting itself, so I think it's safe to say this is the fix but will keep an eye on it. This bug is very old, and the patch was posted a year ago. Is there a technical reason why this has not been included in a Red Hat kernel update? -S
*** This bug has been marked as a duplicate of 122252 ***