Red Hat Bugzilla – Bug 113351
System non-usable after 2 days of stress
Last modified: 2007-11-30 17:07:00 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2)
Description of problem:
I was stress-testing an IBM x440 8-way system over the weekend.
When I arrived on Monday the system was in a non-usable state. The
non-disk-based tests were still running but the others had stopped.
Top was still running, but an NFS copy was not.
I was unable to log on to the system via the console or ssh. It
would take my username and password but would not give me a shell. My
one open shell was lost to an uptime command that never returned and
that I could not kill.
Top showed very high load averages (17-20) with all the CPUs idle.
It seemed like new processes were not able to start. It also
showed there was lots of free memory left.
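One possible reading of a high load average alongside idle CPUs, offered as an assumption rather than a diagnosis of this bug, is that tasks were stuck in uninterruptible sleep (state D), which Linux counts toward the load average. A minimal Python sketch for listing such tasks from /proc follows; the function names are hypothetical, and it has to be run from a session that can still start processes.

```python
import os

def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]

def uninterruptible_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable while scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling uninterruptible_tasks() a few times in a row and comparing the results would show whether the same processes stay blocked.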
After rebooting the box to check /var/log/messages, I found there was
lots of free disk space, so I don't know what is going on. Just that
the box was unusable. Commands would not return and users could not
log on.
I will attach the messages (I didn't see anything good in there).
I am working to test on other boxes ASAP to see if I can reproduce
the problem somewhere else. The test I am running is tools10 based
(kernel compile, NFS copy, copy CD to disk, hell hound, do some pings).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install AS3.0 update CDs re0108
3. Wait a weekend
Actual Results: The system was non-usable
Expected Results: The system should have behaved in a usable fashion.
Working to retest system.
Created attachment 96914 [details]
var/log/messages from system
Created attachment 96919 [details]
Had to remove some of the data due to size constraints
Well, there's nothing to work with here.
Please reproduce the hang state, and then forward the outputs of:
Alt-Sysrq-p (several in a row)
Before starting your tests, make sure /proc/sys/kernel/sysrq is set
to 1, or that "kernel.sysrq" is set to 1 in /etc/sysctl.conf.
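As a rough aid for the check above, here is a minimal Python sketch that reads the two places mentioned and reports whether the value 1 is set. The helper names are hypothetical, and the sysctl.conf parsing assumes the usual "key = value" lines with "#" or ";" comments.

```python
def sysrq_from_proc(text):
    """Parse the contents of /proc/sys/kernel/sysrq into an int."""
    return int(text.strip())

def sysrq_from_sysctl_conf(text):
    """Return the last kernel.sysrq value found in sysctl.conf text, or None."""
    value = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        key, sep, val = line.partition("=")
        if sep and key.strip() == "kernel.sysrq":
            try:
                value = int(val.strip())  # a later setting overrides an earlier one
            except ValueError:
                continue  # ignore a malformed value
    return value

def sysrq_ready(proc_path="/proc/sys/kernel/sysrq",
                conf_path="/etc/sysctl.conf"):
    """Return (running_ok, persistent_ok) for the value 1 requested above."""
    with open(proc_path) as f:
        running = sysrq_from_proc(f.read())
    try:
        with open(conf_path) as f:
            persistent = sysrq_from_sysctl_conf(f.read())
    except OSError:
        persistent = None  # no readable sysctl.conf
    return running == 1, persistent == 1
```

Running sysrq_ready() before starting the tests would show whether SysRq is enabled on the running kernel and whether the setting will survive a reboot.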
I am currently testing with 2 systems to see the issue again and to
get the above outputs.
Update to my last request:
Please do the Alt-Sysrq-w last, as it is possible that it will
hang the console (and never return) if one of the CPUs is spinning
on a lock with interrupts disabled. So, if you get the same hang,
please do the Alt-Sysrq's in this order:
Alt-Sysrq-p (several in a row)
Well, I have been testing 2 systems for 6 days and haven't seen the
issue again. If I see it again and am able to capture any debug output,
I will post again. Thanks.
This sounds a bit like bug 117210.
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
For more information of the RHEL errata support policy, please visit:
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.