From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Opera 7.50 [en]

Description of problem:
The system stops working while still appearing half alive: on the console a username can be entered, but no password prompt ever appears; on the network every port still accepts connections but no banner is sent back, and the connection is reset after ~5 minutes; nothing is written to any log file afterwards. The problem occurs on 4 machines running this kernel. The logs show nothing: there is no entry in audit, process accounting, or syslog relevant to the hang.

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.EL

How reproducible:
Didn't try.

Steps to Reproduce:
1. I don't know why it happens. It has occurred 1-4 times a month.

Additional info:
There are 4 machines:
1. MAIL: IDE disk, home partition mounted over NFS.
2. FILE: 3ware 9000 SATA controller (8 disks in a RAID10 array), a FireWire-attached IDE disk used for backups, home partition exported over NFS.
3. WWW: IDE disk, does not use NFS.
4. Workstation: I don't know the hardware parameters, I'm not managing it.
All machines use Pentium 4 processors.
*** Bug 140911 has been marked as a duplicate of this bug. ***
We have a similar problem with our loghost, which is running (obviously) syslogd but also agetty on ttyS0, with the kernel console set to ttyS0. The symptoms described in this bug report very much resemble the problem we have: an strace on syslogd shows that it is waiting for open() on ttyS0 (/dev/console). No agetty was running (although it is configured in /etc/inittab), and with syslogd unable to respond, all processes trying to syslog would hang as well. Perhaps there is a race between the kernel serial driver and the serial console? All we know is that syslogd was waiting on open(ttyS0, ...), that agetty wasn't running, and that killing and restarting syslogd fixed the problem: killing syslogd made agetty start, and restarting syslogd fixed the rest.

We got error messages on the console:

1. When logging out from the serial console on ttyS0:
Warning: null TTY for (04:40) in tty_fasync
Warning: null TTY for (04:40) in tty_fasync

2. When the last of the syslogd processes (yes, there were 8 of them) was killed:
rs_close: bad serial port count; tty->count is 1, state->count is 6

Jozsef, does this in any way resemble your problem, and could it be that we have experienced the same bug?
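For reference, a minimal way to check for this state from userspace (just a sketch, assuming a stock sysklogd whose daemon process is named syslogd and a serial console on ttyS0):

  # Find the syslogd PID(s) and see which syscall each one is blocked in
  pidof syslogd
  strace -p <PID>    # in our case: a pending open() on the console device that never returned
  # Check which device nodes are involved
  ls -l /dev/console /dev/ttyS0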
We are using the system's built-in mingetty, but syslogd.conf is not the default: in this configuration only file logging is used and there is no console redirection. It could be the same bug, but we could not resolve the problem from userspace, only by hard-resetting the system. Because of this I also suspect a kernel bug. Further investigation suggests possible causes:
1. heavy swap usage
2. the SysV shared memory, semaphore, or message queue limits (/proc/sys/kernel/[sem,shmmni,...])
3. the VM settings (/proc/sys/vm/*)
4. I don't know :)
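For what it's worth, the limits and settings in points 1-3 can be read back on a running box with plain procfs and util-linux tools (generic commands, not specific to this bug):

  # Point 2: SysV IPC limits and what is actually in use
  ipcs -l
  ipcs -a
  # Point 1: swap usage and memory pressure
  free -m
  vmstat 5 3
  # Point 3: current VM tunables
  sysctl -a | grep '^vm\.'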
It's impossible to determine what's happening based on the information available so far. Given that the system is still capable of receiving keyboard interrupts, it should be able to respond to Alt-SysRq input on the console. The next time the hang occurs, please send the output from Alt-SysRq-m, Alt-SysRq-t, Alt-SysRq-p, and Alt-SysRq-w, in that order. Also, make sure that /proc/sys/kernel/sysrq is equal to 1. If it is not, then "echo 1 > /proc/sys/kernel/sysrq", or set it permanently in /etc/sysctl.conf:

kernel.sysrq = 1
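A minimal sketch of the capture procedure (standard Magic SysRq handling; adjust for your console setup):

  # Enable the magic SysRq key for the running kernel
  echo 1 > /proc/sys/kernel/sysrq
  # To persist across reboots, set "kernel.sysrq = 1" in /etc/sysctl.conf and reload:
  sysctl -p
  # During the hang, on the console, press in order:
  #   Alt-SysRq-m   (memory info)
  #   Alt-SysRq-t   (task state dump)
  #   Alt-SysRq-p   (register dump)
  #   Alt-SysRq-w   (per-CPU state)
  # The output goes to the kernel log / console and can be read back with:
  dmesg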
Created attachment 112804 [details]
Output from sysrq

We had this hang yet another time, and this time we were able to extract the sysrq output. The attached file includes the output in sequence. The state dump took a long time; because of that, the register dump was not taken directly after the state dump but a few minutes later, when I discovered that the state dump had finished.
Did the output from sysrq make it easier to find the problem? Any clues on how to avoid the problem would be very much appreciated.
SysRq : Show Memory
Mem-info:
Zone:DMA     freepages:  2876 min:     0 low:     0 high:     0
Zone:Normal  freepages:  1322 min:  1279 low:  4544 high:  6304
Zone:HighMem freepages:   571 min:   255 low:  6654 high:  9981
Free pages:        4768 (   570 HighMem)
( Active: 350863/84675, inactive_laundry: 18465, inactive_clean: 11537, free: 4769 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2876
  aa:633 ac:42957 id:7056 il:3268 ic:3391 fr:1324
  aa:44890 ac:262383 id:77621 il:15197 ic:8146 fr:567
2*4kB 1*8kB 2*16kB 2*32kB 2*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11504kB)
82*4kB 92*8kB 44*16kB 2*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 5288kB)
302*4kB 6*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2264kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
140228 pages of slabcache
4060 pages of kernel stacks
0 lowmem pagetables, 15994 highmem pagetables
Free swap:       4192504kB
655339 pages of RAM
425963 pages of HIGHMEM
13876 reserved pages
477812 pages shared
0 pages swap cached

The system does not appear to be stopped or in any kind of "hard" hang. It certainly is strapped for memory, although there is page reclamation going on in order to keep the system running. In fact, kswapd is currently blocked because there is enough memory on the inactive clean (ic:) plus the free (fr:) lists of both the normal and high zones that collectively are greater than the "low:" values for those zones. So at the time of the alt-sysrq-t, kswapd was not actively reclaiming memory from either zone's respective page caches.

What is a bit troubling is the number of pages being used by the slabcache (140228), all of which come from the normal zone, which when fully populated can have a maximum of ~225000 pages (~896MB). Whenever the slabcache consumes more than about 50% of the normal zone, there's potentially a problem. A "cat /proc/slabinfo" at the time of the hang (if possible) might yield some clues. However, swap is not even being used, because the page reclamation process from each zone's page cache seems to be doing enough to satisfy the memory requirements. Furthermore, the alt-sysrq-w at the end of the output shows 3 processors idle, and the 4th one doing a syslog read.

What is hard to understand, though, is why there are so many crond processes running. The system at the time of the alt-sysrq-t had 2030 processes, and 1963 of them are "crond" processes, which I've never seen before. Is that a "normal" situation in your configuration? (Or has crond somehow gone wild?)

Lastly, this is a RHEL3-U3 kernel. Numerous memory-handling updates have gone into the RHEL3-U4 kernel, as well as into the soon-to-be-released RHEL3-U5 kernel. Before doing much more with this case, the kernel will have to be updated.
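As a rough illustration of the slabcache concern above (the arithmetic uses the numbers from this dump; the awk column layout assumes the 2.4 "slabinfo - version: 1.1" format, so treat it as a sketch):

  # 140228 slab pages out of a ~225000-page normal zone is already ~62%:
  #   140228 / 225000 = 0.62
  # At the time of the next hang, dump the slab statistics:
  cat /proc/slabinfo
  # Rough view of the biggest consumers: columns 6 and 7 are num_slabs and
  # pages-per-slab in the 1.1 format, so their product is total pages per cache
  awk 'NR > 1 { printf "%-24s %d\n", $1, $6 * $7 }' /proc/slabinfo | sort -rn -k2 | head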
crond was blocked trying to syslog. sysklogd was hanging, and thus all the crond processes got stuck. I believe sysklogd was hanging because of some serial console problem, or something like that. This was definitely not a normal situation. The number of processes had been growing linearly for quite some time when we discovered this. You can see the pattern in our munin graphs at <URL: http://yggdrasil.uio.no/munin/uio.no/hvelvet.uio.no.html >. Notice how there are spikes in November, December, March, and April.
Jason, Does this sound like something that could be associated with your init-dev patch?
Nonetheless, an upgrade to RHEL3-U5 is in order here, due to several fixes in the tty area. It was released to RHN this morning as: RHSA-2005:294 - Updated kernel packages available for Red Hat Enterprise Linux 3 Update 5
We upgraded our log host to a new kernel on 2005-05-20 (kernel version 2.4.21-32.ELsmp), and the problem with blocked processes repeated itself last night. So the new kernel does not seem to make any difference for us.
This bug is filed against RHEL 3, which is in its maintenance phase. During the maintenance phase, only security errata and select mission-critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.