Description of problem:
System freezes while the linpack stress test has been running for a few hours. The system is then no longer reachable over the network (no rsh, no ping).

Version-Release number of selected component (if applicable):
Test system: Tyan GT24 (S2891)
BIOS: v2.02n10
BMC: rev. 17
CPUs: 2 x AMD Opteron 280
Mem: 8 x 1GB PC3200
HDD: 2 x 80GB SATA Hitachi
Interconnect: 1 x InfiniBand HCA (Voltaire)
OS: RHEL4U2
Kernel: 2.6.9-22.0.2

How reproducible:
It doesn't happen every time.

Steps to Reproduce:
1. Run linpack for over 24h.

Actual results:
Some servers crash, some do not.

Expected results:
No crash.

Additional info:
We think that the BIO mempool is full, but the application keeps submitting more bios. During the linpack stress test the system swaps about 3GB. Attached is the call trace from one crashed machine.
Created attachment 134028 [details] Call trace from crashed machine
Created attachment 134135 [details] another error log from another machine with similar symptoms

Here is another error log from a different machine (same type) which also crashed.
Looks like the patch submitted in the LKML thread "[PATCH] dm: Fix deadlock under high i/o load in raid1 setup." addresses exactly this issue:
http://opensubscriber.com/message/linux-kernel%40vger.kernel.org/4640513.html

Any chance this goes into the RHEL4 kernel?

Regards,
Erich
Did you get a crash dump from this machine? The reason I ask is that it looks like the machine took an NMI watchdog timeout panic because this CPU was stuck in a spinlock with interrupts disabled. Evidently someone else held the zone->lock, so this CPU starved without taking timer interrupts long enough to trigger the NMI watchdog panic:

    static struct page *
    buffered_rmqueue(struct zone *zone, int order, int gfp_flags)
    {
            ...
            if (page == NULL) {
                    spin_lock_irqsave(&zone->lock, flags);
                    page = __rmqueue(zone, order);
                    spin_unlock_irqrestore(&zone->lock, flags);
            }
No, we have no crash dump for this machine.
I have looked at the patch posted by Erich, but I'm not sure it will help us, because we are not using dm-raid; we are using mdadm. Could this issue also happen with mdadm, or is the bug specific to dm-raid?
Created attachment 134221 [details] raid1_mempool_race.patch

In theory this patch should solve the issue in drivers/md/raid1.c, similarly to what was posted to LKML. My attempt to reproduce the bug led straight into another lockup (ext3 related). I will check bugzilla for something similar and eventually post the report in another ticket...
It looks like another bug report exists for the same problem (bug #149088).
Erich, did you verify that the patch in comment #7 fixes this problem? The NMI watchdog panic attached in comment #2 is certainly a different problem, but this patch might be the fix for the memory allocation failure attached in comment #1, and that could very well cause the system to hang.

Larry Woodman
Hi Larry,

I'm trying to reproduce the first (kswapd-related) freeze but haven't succeeded yet; it's a pretty rare event. I'm still trying with the original kernel. This should actually occur faster on single-core machines (IMHO), so we switched testing to single-core nodes. Once the reproducer works, I'll try with the patch. And I'll keep you updated, of course.

Regards,
Erich
Erich or Benedikt, can you try increasing /proc/sys/vm/min_free_kbytes to 4 times its default value and see if this prevents this hang from happening? This is what was done in the upstream kernel and does prevent the system from totally exhausting RAM. Thanks, Larry Woodman
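For anyone wanting to try Larry's suggestion, a minimal sketch of the tuning step follows. The default value of min_free_kbytes is computed at boot from the amount of installed RAM, so read the current value rather than hard-coding one; this change requires root and does not persist across a reboot (add vm.min_free_kbytes to /etc/sysctl.conf for that).

```shell
#!/bin/sh
# Read the current (default) min_free_kbytes value; it varies with RAM size.
default=$(cat /proc/sys/vm/min_free_kbytes)

# Quadruple it, as suggested above, to keep more free memory in reserve.
echo $((default * 4)) > /proc/sys/vm/min_free_kbytes

# Confirm the new setting took effect.
cat /proc/sys/vm/min_free_kbytes
```

Since this only changes a watermark, it is safe to apply on a running system and revert by writing the old value back.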
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.