Bug 113351

Summary: System non-usable after 2 days of stress
Product: Red Hat Enterprise Linux 3
Reporter: keith mannth <kmannth>
Component: kernel
Assignee: Dave Anderson <anderson>
Status: CLOSED WONTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: barryn, jamesclv, lcm, petrides
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-10-19 19:31:27 UTC
Attachments:
Description                     Flags
var/log/messages from system    none
first var/log/messages          none

Description keith mannth 2004-01-12 23:59:15 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2)
Gecko/20030716

Description of problem:
  I was stress-testing an IBM x440 8-way system over the weekend.
When I arrived on Monday the system was in a non-usable state. The
non-disk-based tests were still running but the others had stopped.
Top was still running, but an nfs copy was not.
  I was unable to log on to the system via the console or ssh. It
would take my username and password but would not give me a shell. My
one open shell was lost to an uptime command that never returned and
that I could not kill.
  Top showed very high load averages (17-20) with all the CPUs idle.
It seems like new processes were not able to start. It also showed
there was lots of free memory left.
  After rebooting the box to check /var/log/messages, I found there
was lots of free disk space, so I don't know what is going on. Just
that the box was unusable: commands would not return and users could
not log on. I will attach the messages (I didn't see anything good in
there, but who knows).
  I am working to test on other boxes asap to see if I can reproduce
the problem somewhere else. The test I am running is tools10 based
(kernel compile, nfs copy, copy cd to disk, hell hound, do some pings
loop).
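
A note on the symptom above, offered as one possible reading rather than a diagnosis: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a load of 17-20 with idle CPUs is consistent with many tasks blocked in the kernel, for example on I/O. The sketch below counts such tasks from /proc. It assumes a modern Python 3 and a shell that still responds; the function names are hypothetical and nothing here comes from the original report.

```python
import os


def parse_state(stat_line):
    """Return the one-letter task state from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces
    or ')', so split on the last ')' before reading the state field.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def blocked_pids(proc_root="/proc"):
    """List PIDs currently in uninterruptible sleep (state 'D')."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                if parse_state(f.read()) == "D":
                    pids.append(int(name))
        except (OSError, IndexError):
            continue  # task exited or the line was unreadable
    return pids
```

Calling blocked_pids() periodically during a stress run and logging the result would show whether the number of D-state tasks climbs before the box becomes unusable.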

Version-Release number of selected component (if applicable):
kernel-2.4.21-7.EL

How reproducible:
Didn't try

Steps to Reproduce:
1. Install AS3.0 update CDs re0108
2. Run tests
3. Wait a weekend
    

Actual Results:    The system was non-usable

Expected Results:    The system should have behaved in a usable fashion.

Additional info:

  Working to retest system.

Comment 1 keith mannth 2004-01-13 00:00:27 UTC
Created attachment 96914 [details]
var/log/messages from system

Comment 2 keith mannth 2004-01-13 00:14:34 UTC
Created attachment 96919 [details]
first var/log/messages

I had to remove some of the data due to size constraints.

Comment 3 Dave Anderson 2004-01-15 13:21:16 UTC
Well, there's nothing to work with here.

Please reproduce the hang state, and then forward the outputs
of:

  Alt-Sysrq-m
  Alt-Sysrq-p (several in a row)
  Alt-Sysrq-w
  Alt-Sysrq-t

Before starting your tests, make sure /proc/sys/kernel/sysrq is set
to 1, or that "kernel.sysrq" is set to 1 in /etc/sysctl.conf.
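
A minimal sketch of a pre-flight check for the setting described above, assuming a modern Python 3 is available on the test box. The helper names are hypothetical and not part of any Red Hat tool; the script only reads the two locations named above and reports whether SysRq looks enabled.

```python
import re
from pathlib import Path

PROC_SYSRQ = Path("/proc/sys/kernel/sysrq")  # setting of the running kernel
SYSCTL_CONF = Path("/etc/sysctl.conf")       # setting applied at boot


def sysrq_value_enabled(text):
    """True if the text holds a non-zero integer (0 means disabled)."""
    try:
        return int(text.strip()) != 0
    except ValueError:
        return False


def sysctl_conf_enables_sysrq(conf_text):
    """True if the last uncommented kernel.sysrq line sets a non-zero value."""
    enabled = False
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        m = re.match(r"kernel\.sysrq\s*=\s*(\S+)$", line)
        if m:
            enabled = sysrq_value_enabled(m.group(1))
    return enabled


def read_or_empty(path):
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    print("running kernel:", sysrq_value_enabled(read_or_empty(PROC_SYSRQ)))
    print("sysctl.conf:   ", sysctl_conf_enables_sysrq(read_or_empty(SYSCTL_CONF)))
```

Running this before the stress tests start avoids reproducing the hang only to find that the SysRq keys are ignored.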

 


Comment 4 keith mannth 2004-01-15 18:16:44 UTC
  I am currently testing with 2 systems to see the issue again and to
get the above outputs.  


Comment 5 Dave Anderson 2004-01-16 18:32:54 UTC
update to my last request:

please do the Alt-Sysrq-w last, as it is possible that it will
hang the console (and never return) if one of the cpus is spinning
on a lock with interrupts disabled.  So, if you get the same hang
please do the Alt-Sysrq's in this order:

Alt-Sysrq-m
Alt-Sysrq-p (several in a row)
Alt-Sysrq-t
Alt-Sysrq-w
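
The order above can also be sketched as a script, with some loud assumptions: that /proc/sysrq-trigger exists on the kernel under test (not confirmed for this RHEL 3 kernel), that a root shell is still responsive, and that "several in a row" means three. The names SAFE_ORDER and send_sysrq are hypothetical. When no shell responds, the keyboard combinations remain the only option.

```python
import time

# Order from the comment above: 'w' goes last because it may hang the console.
# Three 'p' requests is an assumption standing in for "several in a row".
SAFE_ORDER = ["m", "p", "p", "p", "t", "w"]

# Assumption: this trigger file is present on the kernel being tested.
TRIGGER = "/proc/sysrq-trigger"


def send_sysrq(keys=SAFE_ORDER, trigger=TRIGGER, delay=1.0):
    """Write each SysRq key to the trigger file, in order (needs root)."""
    for key in keys:
        with open(trigger, "w") as f:
            f.write(key)
        time.sleep(delay)  # give the kernel time to print each dump
```

Calling send_sysrq() as root would request the dumps in the recommended order; the output would land in the kernel log and on the console, as it does for the keyboard combinations.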
 

Comment 6 keith mannth 2004-01-20 00:37:00 UTC
  Well, I have been testing 2 systems for 6 days and haven't seen the
issue again.  If I see it again and am able to capture any debug output
I will post again. Thanks.

Comment 7 Johan Walles 2004-03-19 13:51:01 UTC
This sounds a bit like bug 117210.


Comment 8 RHEL Program Management 2007-10-19 19:31:27 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.