From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020412 Debian/0.9.9-6

Description of problem:
Roughly once a day our RHAS2.1 crashes. It is pingable after crashing, but it doesn't respond to keyboard input, and the screen is black. It is a DELL Poweredge 2550 with two 1.4GHz PIII CPUs, 1GB memory and 2GB swap.

Version-Release number of selected component (if applicable):
Linux version 2.4.9-e.3smp (bhcompile.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-108.1)) #1 SMP Fri May 3 16:48:54 EDT 2002

How reproducible:
Always

Steps to Reproduce:
1. Get a DELL PE 2550
2. Install RHAS2.1
3. Don't know, but possibly load it heavily

Actual Results:
After something like 0-24 hours, the machine goes dead, except that it's still pingable.

Expected Results:
It should just keep on running...

Additional info:
See also bug 67609.
Created attachment 63024 [details] /var/log/messages
Attached the /var/log/messages from the machine. Those lines about waitpid() failing with errno=512 look weird, as does the attempt to load the ^[J^S@ l\234G module. Don't know if they are related to the crashing though.
I have installed your netdump 0.6.6-1 thingy on it, and successfully tested it using your crash.o kernel module. However, after one of our "real" crashes, I don't get any crash dump. Also, after a crash, it seems not to respond to alt-sysrq 'k'ill, 's'ync or 'u'mount, but to 'b'oot. A blind guess of mine is that maybe the scheduler freaks out in some way and stops handing out time slices. Do you have any ideas about what I could possibly do to (dis)prove that theory?
Regarding the load of this machine, it regularly goes up over 30, and I have seen it reach 100.
I take back what I said about not responding to SysRq. Not only did the machine respond to all SysRq requests, but it also sent the output (success) to our netdump receiving server. So if we could just have a SysRq key combo for "you have crashed, please dump core" (bug 68435) I could probably provide you with lots more information, giving you a fighting chance of actually *doing* anything about this bug.
Would be worth posting the sysrq-t, sysrq-m, and sysrq-p output that got dumped via netdump to this bug report.
Will do that as soon as we get another crash. It's late Friday afternoon here, so it may not happen before the weekend. Your concern is much appreciated btw.
I've been working with Johan via email for a while now. We have tried doubling the values of /proc/sys/vm/freepages and this works well most of the time. With two memory gobbling programs running they can still lock the system. Sysrq-M at that point will hang the box hard. Attaching an nmi_watchdog oops and partial netdump of the hard hang. System is 1GB Ram, 2GB swap.
Created attachment 65964 [details] nmi_watchdog oops
Created attachment 65965 [details] partial netdump bzipped.
This does not appear to be a crash. It appears that kswapd consumes the CPU when memory gets very low and memory gobbling programs continue to run and consume memory. We have heard about this problem before and have successfully reproduced it. Doubling the /proc/sys/vm/freepages from 638 1279 1914 to 1276 2552 3814 appears to work around the problem in all cases for a 256MB system. I will experiment with a much larger system and determine the optimal freepages values. Larry Woodman
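For anyone who wants to try the same workaround on a different machine, here is a minimal sketch of the arithmetic. It assumes the file holds three space-separated integers, as in the values quoted above; the function name `scale_freepages` and the field names min/low/high are my own labels, not something taken from this report. Note that exact doubling gives slightly different numbers than the ones quoted above.

```python
def scale_freepages(current, factor=2):
    """Scale the three freepages thresholds by an integer factor.

    `current` is the text read from /proc/sys/vm/freepages, assumed to be
    three space-separated integers (labelled min, low, high here).
    """
    values = [int(v) for v in current.split()]
    if len(values) != 3:
        raise ValueError("expected three values: min low high")
    return " ".join(str(v * factor) for v in values)


if __name__ == "__main__":
    # Starting values quoted in this report for the 256MB test system.
    # The result would then be written back to /proc/sys/vm/freepages as root.
    print(scale_freepages("638 1279 1914"))
```

The sketch only computes the new string; writing it back is left as a manual step so nothing is changed on the system by accident.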
Looks like you are using ClearCase binary-only modules. Is that correct?
We are using Clearcase. I was not aware that Clearcase was loading any kernel modules, and using lsmod while using XClearcase I cannot see which module should have been loaded by Clearcase:

johan@transwarp:~/views/CR444_transwarp_johan$ /sbin/lsmod
Module                  Size  Used by    Not tainted
i810_audio             25440   1  (autoclean)
ac97_codec             13664   0  (autoclean) [i810_audio]
soundcore               7940   2  (autoclean) [i810_audio]
radeon                105944   0
autofs                 13796   0  (autoclean) (unused)
nfs                    91936   7  (autoclean)
lockd                  61184   1  (autoclean) [nfs]
sunrpc                 86000   1  (autoclean) [nfs lockd]
3c59x                  32264   1
ide-scsi               10464   0
ide-cd                 35296   0
cdrom                  35520   0  [ide-cd]
mousedev                5824   1
hid                    22272   0  (unused)
input                   6560   0  [mousedev hid]
usb-uhci               26948   0  (unused)
usbcore                68864   1  [hid usb-uhci]
ext3                   73536   3
jbd                    55048   3  [ext3]
aic7xxx               127200   4
sd_mod                 13468   4
scsi_mod              124988   3  [ide-scsi aic7xxx sd_mod]
Created attachment 77304 [details] Some logs caught with netdump and Alt-SysRq-TMPSUB
ClearCase loads the mvfs module when you use it.
Then how come it isn't loaded when I use xclearcase? And the kernel says "not tainted" while using xclearcase (or am I misreading the output from lsmod?). Have you found any trace of an mvfs module, or do you just suspect one from general knowledge about Clearcase? Here's the output from 'clearcase -version' in case that helps:

ClearCase version 4.0 (Fri Mar 03 19:21:10 EST 2000)
MVFS is not installed.
cleartool V4.0 (Fri Mar 3 12:11:01 EST 2000)
db_server V4.0 (Fri Mar 3 12:08:52 EST 2000)
VOB database schema version: 53
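One way to double-check the reading of the lsmod listing is to parse it mechanically. The following is only a sketch: the function name `loaded_modules` is hypothetical, and it assumes the header line carries either "Not tainted" or a "Tainted" marker in the last column, which is how the listing above looks but is not confirmed for every modutils version.

```python
def loaded_modules(lsmod_text):
    """Parse lsmod output into (set of module names, tainted flag).

    The tainted flag is False for "Not tainted", True for "Tainted",
    and None when the header line gives no indication (assumption).
    """
    lines = [ln for ln in lsmod_text.splitlines() if ln.strip()]
    if not lines:
        return set(), None
    tainted = None
    body = lines
    if lines[0].split()[0] == "Module":
        header, body = lines[0], lines[1:]
        if "Not tainted" in header:
            tainted = False
        elif "Tainted" in header:
            tainted = True
    # The first column of every remaining line is the module name.
    return {ln.split()[0] for ln in body}, tainted


if __name__ == "__main__":
    # A few lines copied from the listing in this report.
    sample = """Module                  Size  Used by    Not tainted
nfs                    91936   7  (autoclean)
lockd                  61184   1  (autoclean) [nfs]
sunrpc                 86000   1  (autoclean) [nfs lockd]
"""
    names, tainted = loaded_modules(sample)
    print("mvfs loaded:", "mvfs" in names)
    print("tainted:", tainted)
```

Run against the full listing, this would report whether an mvfs entry is present at the moment lsmod was taken; it says nothing about modules that were loaded and unloaded earlier.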
We have the same symptoms on two Dell PowerEdge 1650s, each with 2 x 1.4 GHz Pentium III CPUs, 2GB RAM, a PERC RAID adapter and a shared disk. The machines are connected as a cluster via a serial cable. Normally the first system hangs after about 24 hours (as described above), and two to three hours later the second system hangs too. The kernel on all our machines is 2.4.9-e16smp. Neither the system load nor the memory usage is very high, since the systems are not yet in production. With this problem we cannot put the machines into production :-( Three other dual Pentium III machines without an external RAID are working just fine. The cluster logs and syslogs give no hints. The next time the machines hang I'll try to get some dumps. Netdump is not possible since the network adapters are not supported.
Is this still a problem with the latest AS2.1 kernel errata, e.24? Larry Woodman
We haven't upgraded the kernel because we are in the middle of a release cycle. We'll do a kernel upgrade towards the end of July. Since I'll be on vacation then, I probably won't be able to answer this question until the middle of August. Jorit, if you can answer this before that, please do.
The 'problem' has been removed by no longer using the cluster software. I'm sorry I cannot help. The solution now used is a simple heartbeat between the two machines and it works fine. Sorry. Jorit
I just ran my stress test overnight, and the machine was fine when I got back to work this morning. I'm running kernel 2.4.9-e.34smp.
Created attachment 102780 [details] Stress test
This is the stress test I used. I don't know what's necessary, but to bring my system to its knees I ran two instances of it in parallel, like this:

    while true ; do date ; ./oom ; done

That doesn't bring my system down any more, but I'm attaching it for reference.
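The attached oom program itself is not reproduced in this report. As a rough stand-in for the kind of memory gobbling program discussed in this bug, here is a hypothetical sketch that allocates memory in chunks and touches every page. The function name `gobble`, the chunk size, the 4096-byte page size and the hard limit are all assumptions of mine; the limit is there so the sketch cannot take a machine down by accident.

```python
def gobble(chunk_mb=16, limit_mb=64):
    """Allocate memory in chunk_mb steps up to limit_mb; return MB held.

    Hypothetical stand-in for a memory gobbling test program. The limit
    keeps the sketch harmless; raise it only on a disposable test box.
    """
    page = 4096  # assumed page size
    chunk_bytes = chunk_mb * 1024 * 1024
    held = []
    while (len(held) + 1) * chunk_mb <= limit_mb:
        try:
            block = bytearray(chunk_bytes)
        except MemoryError:
            break
        # Write one byte per page so the kernel has to back each page.
        for offset in range(0, chunk_bytes, page):
            block[offset] = 1
        held.append(block)
    return len(held) * chunk_mb


if __name__ == "__main__":
    print(gobble(), "MB allocated")
```

Running two copies of something like this in a loop, with a limit above physical memory, would approximate the load pattern described above, though it may well behave differently from the real attachment.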