140910 – mysterious system hangs

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Dave Anderson
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	140911
Depends On:
Blocks:

Reported:	2004-11-26 11:36 UTC by Jozsef Szabados
Modified:	2007-11-30 22:07 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-19 19:13:04 UTC
Target Upstream Version:
Embargoed:

Attachments
Output from sysrq (1.37 MB, text/plain) 2005-04-07 12:55 UTC, Petter Reinholdtsen

Description Jozsef Szabados 2004-11-26 11:36:31 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Opera 7.50  [en]

Description of problem:
The system stops working but still appears to be running: on the
console one can enter a username, but no password prompt follows; on
the network it listens on all ports but returns no banner and resets
the connection after ~5 minutes; no entries appear in any log file
afterwards. The problem occurs on 4 machines running this kernel. The
logs show no problem: there are no entries in audit, process
accounting or syslog relevant to the hang.

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.EL

How reproducible:
Didn't try

Steps to Reproduce:
1. I don't know why it happens. It has occurred 1-4 times a month.

Additional info:

There are 4 machines:
1. MAIL: IDE disk, with the home partition mounted over NFS.
2. FILE: 3ware 9000 SATA controller (8 disks, RAID10 array), uses a
FireWire-attached IDE disk for backups, home partition exported over
NFS.
3. WWW: IDE disk, does not use NFS.

All machines use Pentium 4 processors.

4. Workstation: I don't know any of its hardware parameters; I'm not
managing it.

Comment 1 Jozsef Szabados 2004-11-26 11:40:36 UTC

*** Bug 140911 has been marked as a duplicate of this bug. ***

Comment 2 Trond H. Amundsen 2004-11-30 11:57:02 UTC

We have a similar problem with our loghost, which is running
(obviously) syslogd but also agetty at ttyS0 with kernel console set
to ttyS0. The symptoms described in this bug report very much
resemble the problem we have, where an strace on syslogd shows that
it is waiting for open() on ttyS0 (/dev/console). No agetty was
running (configured in /etc/inittab), and with syslogd unable to
respond all processes trying to syslog would hang as well.

Perhaps there is a race in the kernel serial driver and the serial
console? All we know is that syslogd was waiting for open(ttyS0,..)
and that agetty wasn't running, and that killing and restarting
syslogd fixed the problem. Killing syslogd made agetty start, and
restarting syslogd fixed the rest.

We got error messages on the console:

1. When logging out from the serial console on ttyS0:

    Warning: null TTY for (04:40) in tty_fasync
    Warning: null TTY for (04:40) in tty_fasync

2. When the last of the syslogd processes (yes, there were 8 of them)
   was killed:

    rs_close: bad serial port count; tty->count is 1, state->count is 6

Jozsef, does this in any way resemble your problem, and could it be
that we have experienced the same bug?

Comment 3 Jozsef Szabados 2004-12-01 15:08:42 UTC

We are using the system's built-in mingetty, but syslogd.conf isn't the
default. This config contains only file logging; there is no console
redirection.

It could be the same bug, but we could not resolve the problem from
userspace, only by hard-resetting the system. Because of this I also
suspect that it is a kernel bug.

Further investigation suggests that it:
1. could be caused by heavy swap usage
2. could be a SysV shared memory, semaphore or message queue limitation
(/proc/sys/kernel/[sem,shmmni,..])
3. could be the VM settings (/proc/sys/vm/*)
4. could be something else I don't know about :)

Comment 4 Dave Anderson 2004-12-03 13:43:45 UTC

It's impossible to determine what's happening based upon the
information available so far.  Given that the system is capable of
receiving keyboard interrupts, it should be able to respond to
Alt-SysRq input on the console.  The next time the hang occurs, please
send the output from Alt-SysRq-m, Alt-SysRq-t, Alt-SysRq-p, and
Alt-SysRq-w, in that order.

Also, make sure that /proc/sys/kernel/sysrq is equal to 1.  If it is
not, then "echo 1 > /proc/sys/kernel/sysrq", or set it permanently
to 1 in /etc/sysctl.conf:

kernel.sysrq = 1
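
The check on /proc/sys/kernel/sysrq can also be scripted before the next
hang. A minimal sketch in Python, assuming the standard
/proc/sys/kernel/sysrq file; the function names are illustrative and do
not belong to any tool mentioned in this report. It treats only the
literal value 1 as enabled, matching the instruction above.

```python
def sysrq_value_ok(text):
    """Return True if the file contents equal 1 (SysRq fully enabled)."""
    return text.strip() == "1"


def sysrq_enabled(path="/proc/sys/kernel/sysrq"):
    """Read the sysctl file at `path` and report whether SysRq is enabled."""
    with open(path) as f:
        return sysrq_value_ok(f.read())
```

Calling sysrq_enabled() on the affected host returns False if the
setting still needs to be changed.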

Comment 5 Petter Reinholdtsen 2005-04-07 12:55:57 UTC

Created attachment 112804 [details]
Output from sysrq


We had this hang yet another time, and this time we were able to extract
the sysrq output.  The attached file includes the output in sequence.
The state dump took a long time.  Because of this, the register dump was
not taken directly after the state dump, but a few minutes later, when I
discovered that the state dump had finished.

Comment 6 Petter Reinholdtsen 2005-05-18 16:06:47 UTC

Did the output from sysrq make it easier to find the problem?  Any clues
on how to avoid the problem would be very much appreciated.

Comment 7 Dave Anderson 2005-05-18 18:27:16 UTC


SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  2876 min:     0 low:     0 high:     0
Zone:Normal freepages:  1322 min:  1279 low:  4544 high:  6304
Zone:HighMem freepages:   571 min:   255 low:  6654 high:  9981
Free pages:        4768 (   570 HighMem)
( Active: 350863/84675, inactive_laundry: 18465, inactive_clean: 11537, free: 4769 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2876
  aa:633 ac:42957 id:7056 il:3268 ic:3391 fr:1324
  aa:44890 ac:262383 id:77621 il:15197 ic:8146 fr:567
2*4kB 1*8kB 2*16kB 2*32kB 2*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11504kB)
82*4kB 92*8kB 44*16kB 2*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 5288kB)
302*4kB 6*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2264kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
140228 pages of slabcache
4060 pages of kernel stacks
0 lowmem pagetables, 15994 highmem pagetables
Free swap:       4192504kB
655339 pages of RAM
425963 pages of HIGHMEM
13876 reserved pages
477812 pages shared
0 pages swap cached

The system does not appear to be stopped or in any kind of 
"hard" hang.  It certainly is strapped for memory, although
there is page reclamation going on in order to keep the system
running.  In fact, kswapd is currently blocked because there 
is enough memory on the inactive clean (ic:) plus the free (fr:)
lists of both the normal and high zones that collectively are
greater than the "low:" values for those zones.  So at the time
of the alt-sysrq-t, kswapd was not actively reclaiming memory from
either zone's respective page caches.
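
The watermark comparison described above can be checked by hand. A
minimal sketch in Python, using the per-zone ic:, fr: and low: values
from the Mem-info dump; the dictionary layout and the function name are
illustrative, and the comparison only mirrors the description in this
comment, not the kernel's actual kswapd code.

```python
# Per-zone values copied from the Alt-SysRq-m output (units: pages).
zones = {
    "Normal": {"low": 4544, "inactive_clean": 3391, "free": 1324},
    "HighMem": {"low": 6654, "inactive_clean": 8146, "free": 567},
}


def above_low_watermark(zone):
    """True if inactive-clean plus free pages exceed the zone's low: value."""
    return zone["inactive_clean"] + zone["free"] > zone["low"]


for name, zone in zones.items():
    available = zone["inactive_clean"] + zone["free"]
    print(f"{name}: {available} pages vs low {zone['low']}: "
          f"{above_low_watermark(zone)}")
```

Both zones come out above their low: value (4715 vs 4544 and 8713 vs
6654), which is consistent with kswapd being idle at that moment.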

What is a bit troubling is the amount of pages being used by
the slabcache (140228), all of which comes from the normal
zone, which when fully populated can have a maximum of
~225000 pages (~896MB).  Whenever the slabcache consumes
more than about 50% of the normal zone, there's potentially
a problem.  A "cat /proc/slabinfo" at the time of the hang
(if possible) might yield some clues.
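
The slabcache share of the normal zone follows from the two figures
above. A minimal sketch in Python; the ~225000-page ceiling is the
approximate value quoted in this comment, so the result is only a rough
estimate, and the variable names are illustrative.

```python
SLAB_PAGES = 140228            # "pages of slabcache" from the SysRq-m output
NORMAL_ZONE_MAX_PAGES = 225000  # approximate ceiling quoted in this comment
WARN_FRACTION = 0.5            # "more than about 50%" rule of thumb

slab_fraction = SLAB_PAGES / NORMAL_ZONE_MAX_PAGES
print(f"slabcache uses {slab_fraction:.1%} of the normal zone")
print("over threshold:", slab_fraction > WARN_FRACTION)
```

With these inputs the slabcache accounts for roughly 62% of the normal
zone, above the 50% rule of thumb.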

However, swap is not even being used, because the page reclamation
process from each zone's page cache seems to be doing enough
to satisfy the memory requirements.  Furthermore, the alt-sysrq-w
at the end of the output shows 3 processors idle, and the
4th one doing a syslog read.

What is hard to understand, though, is why there are so many
crond processes running.  The system at the time of the alt-sysrq-t
had 2030 processes, and 1963 of them are "crond" processes, which
I've never seen before.

Is that a "normal" situation in your configuration?  (Or has
crond somehow gone wild?)
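
A per-command process count like the one above can also be taken on a
system that is still responsive. A minimal sketch in Python, assuming the
usual /proc/[pid]/stat layout in which the command name is enclosed in
parentheses; the function names are illustrative, and this is not how the
figures in this comment were obtained (those come from the Alt-SysRq-t
output).

```python
import glob
from collections import Counter


def comm_from_stat(stat_line):
    """Extract the command name, which /proc/[pid]/stat puts in parentheses."""
    return stat_line[stat_line.index("(") + 1:stat_line.rindex(")")]


def count_commands(stat_lines):
    """Count processes per command name from an iterable of stat lines."""
    return Counter(comm_from_stat(line) for line in stat_lines)


def read_stat_lines(pattern="/proc/[0-9]*/stat"):
    """Yield the stat line of every process, skipping ones that exit mid-scan."""
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                yield f.read()
        except OSError:
            continue
```

For example, count_commands(read_stat_lines()).most_common(5) lists the
five most numerous command names, which would make a runaway number of
crond processes stand out.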

Lastly, this is a RHEL3-U3 kernel.  Numerous memory-handling
updates have gone into the RHEL3-U4 kernel, as well as into the
soon-to-be-released RHEL3-U5 kernel.  Before doing much more
with this case, the kernel will have to be updated.

Comment 8 Petter Reinholdtsen 2005-05-18 18:43:41 UTC

crond was blocked, trying to syslog.  sysklogd was hanging, and thus all
the crond processes got stuck.  I believe sysklogd was hanging because of
some serial console problem, or something like that.  This was definitely
not a normal situation.  The number of processes had been growing linearly
for quite some time when we discovered this.

You can see this pattern from our munin graphs at
<URL: http://yggdrasil.uio.no/munin/uio.no/hvelvet.uio.no.html >
Notice how there are spikes in November, December, March and April.

Comment 9 Dave Anderson 2005-05-18 19:02:33 UTC

Jason,

Does this sound like something that could be associated
with your init-dev patch?

Comment 10 Dave Anderson 2005-05-18 19:24:36 UTC

Nonetheless, an upgrade to RHEL3-U5 is in order here, due to several
fixes in the tty area.  It was released to RHN this morning as: 

RHSA-2005:294 - Updated kernel packages available for Red Hat Enterprise Linux 3 Update 5

Comment 11 Petter Reinholdtsen 2005-05-26 09:04:42 UTC

We upgraded our log host to a new kernel on 2005-05-20 (kernel version
2.4.21-32.ELsmp), and the problem with blocked processes recurred
during the night.  So the new kernel does not seem to make any
difference for us.

Comment 12 RHEL Program Management 2007-10-19 19:13:04 UTC

This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
