Bug 173135

Summary: user-space processes freeze under moderate to heavy load
Product: Red Hat Enterprise Linux 3
Reporter: strovato
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED DUPLICATE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: moixa, petrides, sct
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-01-23 14:46:44 UTC
Attachments (Description: Flags):
CPU flag dump from Alt+SysRq+P, with tail-end of vmstat output: none
Memory dump from Alt+SysRq+M: none
Task dump from Alt+SysRq+T: none
output from Alt+SysRq+P, M, W, T, & P again, all in one document: none

Description strovato 2005-11-14 15:24:56 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7

Description of problem:
Occasionally, all user processes will freeze when the system is under moderate to heavy load.  vmstat running on the console will stop producing output.  Characters typed at the console will echo to the screen but have no other effect.  No error messages appear on the console or in the logs on disk.  The machine is still pingable, and network connections (such as telnet) seem to be established, but no login banner ever appears.  I can still use Alt+SysRq to produce output or reboot the machine.

This situation occurs randomly; sometimes the machine does not freeze even during heavy load.  It rarely goes more than a week, however, without freezing.

This situation occurs with earlier versions of the kernel as well, going back to at least kernel-2.4.21-27.0.2.ELsmp.  The system specs are as follows:

Dell PowerEdge 2600 w/2GB RAM,
2xPentium Xeon 3.06GHz,
RAID5 on embedded PERC 4/Di,
3xSeagate ST373307LC w/latest firmware (DS09)
Intel PRO/1000 (running at 100 Mb/s)

I have also experimented with different kernel, driver, and firmware revisions for various system components, especially with the PERC 4/Di (currently using megaraid2 v2.10.8.2-RH1 with 251S:1.07 firmware) but with no effect.  Disabling write caching also has no effect.  Changing BIOS settings, such as disabling sequential memory access and hyperthreading, also has no effect.

Dell diagnostics check out; memory tests with memtest86+ check out.  I've tried to artificially load the machine to coerce it into a freeze situation, but have been unable to.  It only seems to happen during normal operation outside of my maintenance windows!

At this point, I am not sure what the problem could be.  The machine had been in service for about 1 year as an e-mail server before exhibiting this problem, although in general it has not always been as heavily loaded as it is now.  I did manage to snag some screen captures of various Alt-SysRq dumps before the watchdog kicked in and will attach them to this report.

Does anyone have any ideas on this one?  It seems to be a real showstopper.  Thanks!

-S


Version-Release number of selected component (if applicable):
kernel-2.4.21-37.ELsmp

How reproducible:
Couldn't Reproduce

Steps to Reproduce:
1.Wait a random amount of time once the system has at least a moderate load.
2.There is a chance that the machine will freeze.

Actual Results:  User-space processes freeze, but kernel is still responsive to Alt-SysRq and external pings.

Expected Results:  User processes should continue to respond.

Additional info:

I'm not sure how useful this info is; it is the last bit of information provided by top before the freeze.  I will attach some more output from Alt-SysRq that will hopefully be more useful to those who can read it.

 09:20:13  up 3 days,  3:30, 14 users,  load average: 6.37, 6.22, 5.11
250 processes: 249 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    5.1%   17.9%    8.2%   0.0%     0.5%   62.4%    5.7%
           cpu00    3.5%   14.5%    7.1%   0.0%     1.1%   73.1%    0.3%
           cpu01    5.1%   22.7%    7.5%   0.0%     0.0%   61.9%    2.5%
           cpu02    8.5%   12.7%   10.1%   0.0%     0.7%   57.5%   10.1%
           cpu03    3.1%   21.7%    7.9%   0.0%     0.1%   56.9%    9.9%
Mem:  2055312k av, 2037808k used,   17504k free,       0k shrd,    8280k buff
                   1594384k actv,  302100k in_d,   30824k in_c
Swap:  755044k av,       0k used,  755044k free                 1363708k cached
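
The "17504k free" figure in the snapshot above can look alarming, but most of the used memory is page cache. As a rough cross-check, the sketch below adds the free, buffer, and cached figures from the top output to estimate how much RAM could be reclaimed without swapping. The function name available_kb is only illustrative, and the result is an upper bound, since dirty or pinned cache pages are not immediately reclaimable.

```shell
#!/bin/sh
# Rough estimate of reclaimable memory: free + buffers + page cache (all in kB).
# available_kb is a hypothetical helper name, not part of any standard tool.
available_kb() {
    # $1 = free, $2 = buff, $3 = cached, as printed by top
    echo $(( $1 + $2 + $3 ))
}

# Values copied from the top snapshot above.
available_kb 17504 8280 1363708
```

With these inputs the sum is 1389492k of the 2055312k total, and swap is unused, so the snapshot by itself does not point to memory exhaustion; the 62.4% iowait figure is the more notable value.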

Comment 1 strovato 2005-11-14 15:28:52 UTC
Created attachment 121021 [details]
CPU flag dump from Alt+SysRq+P, with tail-end of vmstat output

Comment 2 strovato 2005-11-14 15:29:34 UTC
Created attachment 121022 [details]
Memory dump from Alt+SysRq+M

Comment 3 strovato 2005-11-14 15:30:10 UTC
Created attachment 121023 [details]
Task dump from Alt+SysRq+T

Comment 4 Larry Woodman 2005-11-16 15:27:38 UTC
This is the first I have heard about this problem.  You said that it is not
reproducible yet it happens???  Can you get the full AltSysrq-W and AltSysrq-T
outputs from /var/log/messages so I can see what is happening when the system
freezes?

Thanks, Larry Woodman


Comment 5 strovato 2005-11-16 15:34:21 UTC
I did not know about AltSysRq-W; I'll try to get that information the next time
the freeze occurs.

None of the Alt-SysRq output goes to the syslog because syslog is dead by that
point.  Is there another way I can capture the output of these commands, since it
scrolls off the screen?

Thanks!

-S

Comment 6 Larry Woodman 2005-11-16 16:40:01 UTC
OK, can you try to set up a serial console so that you can capture the necessary
debugging information?

Also, exactly how reproducible is this problem?

Larry Woodman
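
For readers following along, a serial console on a 2.4 kernel is normally enabled with console= boot parameters. The fragment below is a minimal sketch, not the configuration actually used on this machine: the port (ttyS0), the speed (9600n8), and the root=LABEL=/ argument are assumptions, and the existing arguments on the kernel line should be left as they are.

```
# /boot/grub/grub.conf (fragment): append the two console= parameters
# to the "kernel" line of the boot entry in use.
# ttyS0, 9600n8 and root=LABEL=/ are assumed values for illustration.
kernel /vmlinuz-2.4.21-37.ELsmp ro root=LABEL=/ console=tty0 console=ttyS0,9600n8
```

Kernel messages, including the Alt+SysRq dumps, are then written to both the screen and the serial port, where a second machine attached by a null-modem cable can log them with a terminal program.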


Comment 7 strovato 2005-11-16 16:43:38 UTC
I will do as you suggest and work to set up a serial console.

The problem is not reproducible "on demand." I basically have to wait for it to
freeze and hopefully be in a position to capture the necessary information
before the watchdog resets the machine.  It usually freezes about once a week,
so I may not be able to post more info until then.

Comment 8 Larry Woodman 2005-11-16 18:39:29 UTC
OK and please make sure this is the latest RHEL3-U6 kernel.

Larry


Comment 9 strovato 2005-11-22 16:15:03 UTC
OK, it froze again, and I was able to retrieve the data you requested.  I will
attach it below.  (I am using the latest RHEL3-U6 kernel.)

-S

Comment 10 strovato 2005-11-22 16:16:49 UTC
Created attachment 121353 [details]
output from Alt+SysRq+P, M, W, T, & P again, all in one document

output from Alt+SysRq+P, M, W, T, & P again, all in one document, obtained from
the serial console during a freeze situation

Comment 11 strovato 2005-11-22 22:44:56 UTC
Could this be another deadlock scenario, as discussed in bug #122077?  The
symptoms are identical, and there are plenty of references to dquot in the SysRq
dumps I just posted.  I also ran across this thread:
https://listman.redhat.com/archives/ext3-users/2003-November/msg00024.html
which included a patch for another ext3/quota deadlock scenario, but it doesn't
seem to be incorporated into my kernel source tree (stock RHEL3).

-S

Comment 12 strovato 2005-12-01 18:04:15 UTC
OK, I think this bug is a duplicate of bug #122252.  I applied the kernel patch
that was posted there and have not had any freezes since then.  My uptime is the
highest it's ever been since the bug started manifesting itself, so I think it's
safe to say this is the fix but will keep an eye on it.

This bug is very old, and the patch was posted a year ago.  Is there a technical
reason why this has not been included in a Red Hat kernel update?

-S

Comment 13 strovato 2006-01-23 14:46:44 UTC

*** This bug has been marked as a duplicate of 122252 ***