LTC Owner is: dvhltc.com
LTC Originator is: dvhltc.com

ltcrt3 is an LS20 with 4 GB RAM. Kernbench hangs on the second maximal load run when run from a shell in rt-sshd - which means all the compiles were run as FIFO 99 as well! (Did that accidentally while trying to repro this issue.)

KERNBENCH OUTPUT:

Running kernbench tests
4 cpus found
Cleaning source tree...
Making defconfig...
Making dep if needed...
Caching kernel source in ram...
Half load is 2 jobs, changing to 3 as a kernel compile won't guarantee 2 jobs
Performing 5 runs of
make -j 3
make -j 16
make -j
Warmup half load run...
Half load -j3 run number 1...
Half load -j3 run number 2...
Half load -j3 run number 3...
Half load -j3 run number 4...
Half load -j3 run number 5...
Warmup optimal load run...
Optimal load -j16 run number 1...
Optimal load -j16 run number 2...
Optimal load -j16 run number 3...
Optimal load -j16 run number 4...
Optimal load -j16 run number 5...
Warmup maximal load run...
Maximal load -j run number 1...
Maximal load -j run number 2...

I saw two distinct traces, but was only able to get at them from the remote console - cropped and inverted with gimp and gocr'd here (errors probable :-). Cleaned up below; characters gocr couldn't make out are shown as '?'.

dvhart@aeon:~$ gocr -i Desktop/Screenshot.png

[<ffffffff810c1351>] __page_cache_alloc+0x?e/0x72
[<ffffffff81012e43>] __do_page_cache_readahead+0x?7/0x205
[<ffffffff810cc754>] do_page_cache_readahead+0x?2/0x?f
[<ffffffff810137da>] filemap_nopage+0x183/0x382
[<ffffffff81008818>] __handle_mm_fault+0x226/0x?2b
[<ffffffff81068101>] do_page_fault+0x?6d/0x?13
[<ffffffff81066add>] error_exit+0xa/0x?4
[<0000555555557510>]
Code: 8b 02 f6 c4 04 74 05 41 ff c4 eb 35 8b 02 66 85 c0 71 05 41
RIP [<ffffffff81081787>] show_mem+0x?f/0x144
RSP <ffff81010118bb28>
CR2: 0000000005a00000
usb 2-2: new full speed USB device using ohci_hcd and address 5
usb 2-2: configuration #1 chosen from 1 choice
hub 2-2:1.0: USB hub found
hub 2-2:1.0: 4 ports detected
usb 2-2.1: new full speed USB device using ohci_hcd and address 6
usb 2-2.1: configuration #1 chosen from 1 choice
input: IBM PPC I/F as /class/input/input6
input: USB HID v1.10 Keyboard [IBM PPC I/F] on usb-0000:01:00.1-2.1
input: IBM PPC I/F as /class/input/input7
input: USB HID v1.10 Mouse [IBM PPC I/F] on usb-0000:01:00.1-2.1

dvhart@aeon:~$ gocr -i Desktop/Screenshot-1.png

R10: ffff81011100f840 R11: ffffffff8182ad88 R12: 0000000000015371
R13: 0000000000001411 R14: 0000000000000012 R15: 0000000000000001
FS: 00002b1cc63e0240(0000) GS:ffff810111c1b2c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000005a00000 CR3: 00000000ece68000 CR4: 00000000000006e0
Process kthread (pid: 51, threadinfo ffff81011100e000, task ffff8101117ff040)
Stack: 0000000000000001 00000000000000d0 ffff8100000111e0 ffff8101117ff040
ffff81011100fb10 ffffffff810ca2a1 0000002000000001 000000010000003c
ffff81011100fad8 0000000000000000 0000000000000000 00000000000000d0
Call Trace:
[<ffffffff810ca2a1>] out_of_memory+0x71/0x2cd
[<ffffffff8100f375>] __alloc_pages+0x238/0x2c5
[<ffffffff81015ea0>] alloc_pages_current+0xa8/0xb0
[<ffffffff8103d132>] __get_free_pages+0x?/0x71
[<ffffffff8101f11d>] copy_process+0x?d/0x152a
[<ffffffff8103268e>] do_fork+0xc?/0x1c6
[<ffffffff8105ff01>] kernel_thread+0x?1/0x?e
[<ffffffff8105ff5e>] child_rip+0xa/0x12
Code: 8b 02 f6 c4 04 74 05 41 ff c4 eb 35 8b 02 66 85 c0 71 05 41
RIP [<ffffffff81081787>] show_mem+0x?f/0x144
RSP <ffff81011100faa0>
CR2: 0000000005a00000

Comment #1 was with 2.6.21-14ibm.
The reason I thought I had seen the second trace on ltcrt3 was that it shows up (via a different allocation path) in the Intellistation A-Pro hang traces I was seeing:

Jun 1 11:17:23 apro-gs kernel: sig_exit invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0
Jun 1 11:17:23 apro-gs kernel:
Jun 1 11:17:23 apro-gs kernel: Call Trace:
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8106d5da>] dump_trace+0xaa/0x32a
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8106d89b>] show_trace+0x41/0x5c
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8106d8cb>] dump_stack+0x15/0x17
Jun 1 11:17:23 apro-gs kernel: [<ffffffff810ca2a2>] out_of_memory+0x74/0x2cd
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8100f375>] __alloc_pages+0x238/0x2c5
Jun 1 11:17:23 apro-gs kernel: [<ffffffff81015ea0>] alloc_pages_current+0xa8/0xb0
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8103d932>] __get_free_pages+0xe/0x71
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8101f99d>] copy_process+0xcd/0x152a
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8103268e>] do_fork+0xcc/0x1c6
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8106cfa2>] sys_clone+0x23/0x25
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8105f437>] ptregscall_common+0x67/0xb0
Jun 1 11:17:23 apro-gs kernel: [<ffffffff8105f29e>] tracesys+0xdc/0xe1
Jun 1 11:17:23 apro-gs kernel: [<00000035f0a94679>]
Jun 1 11:17:24 apro-gs kernel:

This time I ran on 2.6.21-23.el5rt and didn't run kernbench from an rt shell :-) I appended /proc/meminfo to a log every 2 seconds (from the rt shell) while kernbench ran from the normal shell, until it hung the box again (a minimal sketch of that kind of polling logger follows this comment). This time I saw no memory-related traces. The log is rather interesting: things run fine for the first 585 iterations (1170 seconds), and then MemFree starts dropping at 13.5 MB/sec until it has fallen from 3.18 GB free to 25 MB - at which point the box stopped responding and no more data was logged.

I should note that during this run I saw several occurrences of this known trace:

BUG: at kernel/softirq.c:639 __tasklet_action()
Call Trace:
[<ffffffff8106d5da>] dump_trace+0xaa/0x32a
[<ffffffff8106d89b>] show_trace+0x41/0x5c
[<ffffffff8106d8cb>] dump_stack+0x15/0x17
[<ffffffff81094a97>] __tasklet_action+0xdf/0x12e
[<ffffffff81094f76>] tasklet_action+0x27/0x29
[<ffffffff8109530a>] ksoftirqd+0x16c/0x271
[<ffffffff81033d4d>] kthread+0xf5/0x128
[<ffffffff8105ff68>] child_rip+0xa/0x12

Note that a short time after MemFree started dropping, SwapFree tanked at an even faster rate, from 2 GB free to 0 in 80 seconds (25 MB/sec).

A kernbench run on the stock RHEL5 2.6.18-8.el5 kernel also starts rapidly losing memory as the maximal loads start, but after dropping from 3 GB to 30 MB of free mem, and from 2 GB to 1 GB of free swap, the system recovered to its nearly-all-free state just before the second maximal run started. The second run never dropped below 2 GB of free memory. The third run dropped as low as 20 MB mem free and 700 MB swap free (that was what was displayed, anyway - I didn't log this run... wishing now that I had...). The fourth run took mem down to 20 MB, and swap all the way down to 4 kB, and then hung. It is in that state now: rt-sshd responds, and the SOL console allows me to type - but neither can execute commands in a timely manner. I'll post an additional comment if the system recovers.

The kernbench README says it requires 2 GB to complete. Perhaps this is inaccurate, as it appears to require nearly 6 GB (4 GB RAM + 2 GB swap). Or perhaps it will simply eat as much as is available, given that the maximal load option passes -j to make with no numeric argument, so there is no limit on the number of parallel jobs make can fork off.
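For reference, a minimal sketch of the kind of /proc/meminfo polling logger described above - an illustration, not the actual script used during the run; it just appends a timestamped snapshot of the whole file every 2 seconds:

/* meminfo-log.c: poll /proc/meminfo every 2 seconds and append each
 * snapshot, timestamped, to stdout (redirect to a file to keep a log).
 * Illustrative sketch only - not the script used in the original run. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	char line[256];

	for (;;) {
		FILE *f = fopen("/proc/meminfo", "r");
		if (!f)
			return 1;
		printf("--- %ld ---\n", (long)time(NULL));
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);	/* copy the whole snapshot */
		fclose(f);
		fflush(stdout);			/* keep the log current if the box dies */
		sleep(2);			/* 2-second sample interval */
	}
}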
I hit the hang again while running kernbench. The OOM messages appeared. The system became absolutely unresponsive - I could not log in even using rt-sshd - but the system could still be pinged. I have yet to see any message indicating that the OOM killer has killed a particular process.

kernbench ran to completion on rt-cypress (an LS21). It has 8 GB of RAM.

A note from the kernbench README, pointed out to me by rostedt on irc: "Ideally it should be run in single user mode on a non-journalled filesystem. To compare results it should always be run in the same kernel tree."

rostedt also pointed out that ext3 is really bad for RT anyway - it does things like "wait-on-bit" (which my task dump showed a lot of tasks doing) and has busy spin locks. We need to ask Ted about the function inverted_lock() in fs/jbd/commit.c - on RT it can lock up the system if a process with higher priority than the kjournald thread is doing disk access (see the sketch after this comment).

I just ran 3 kernbench runs on an ext2 filesystem. All three hung during the maximal load runs, with memory behavior similar to the prior runs.
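For reference, inverted_lock() is a trylock-and-retry helper along the lines below (paraphrased from memory of the 2.6-era fs/jbd/commit.c; jbd_trylock_bh_state() and j_list_lock are the identifiers used there, but treat the exact body as approximate). Callers loop on it, which is the RT problem:

/* Paraphrased from fs/jbd/commit.c (2.6-era) - approximate, see above.
 * Try to take the buffer state lock; on failure, drop j_list_lock,
 * yield, and tell the caller to retry from the top of its loop. */
static int inverted_lock(journal_t *journal, struct buffer_head *bh)
{
	if (!jbd_trylock_bh_state(bh)) {
		spin_unlock(&journal->j_list_lock);
		schedule();	/* on -rt, if the looping task outranks the
				 * bh state lock holder, schedule() can keep
				 * picking us again and we livelock */
		return 0;
	}
	return 1;
}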
Created attachment 156317 [details]
Trace 1 Screenshot with 2.6.21-14ibm

The OCR wasn't very good after all; attaching both trace screenshots - better than nothing...
Created attachment 156318 [details] Trace 2 Screenshot
Created attachment 156319 [details] meminfo log - from 2.6.21-23.el5rt
----- Additional Comments From dvhltc.com 2007-06-06 13:14 EDT -------
After a discussion with Dave Hansen I am back to thinking this is a valid OOM situation. The most likely reason we are only seeing this now is that RHEL5 includes a more modern gcc, which is more memory-hungry than previous versions. 'make -j' is unbounded in both the amount of memory it will consume and the number of tasks it will fork; it is essentially a fork bomb. Folks are also telling me that the -j runs are not a typical way to run kernbench. I suggest we remove them by appending the -M option to the kernbench command in run_auto.sh in our test harness.

As for the RAS aspect of this: we need to discuss how an RT platform should respond to a fork bomb such as this (one conventional containment knob is sketched below).
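Purely as an illustration of the kind of bound we might want - not a proposed fix, and the limit value here is arbitrary - RLIMIT_NPROC caps the per-user task count, so an unbounded 'make -j' starts failing fork() instead of exhausting memory:

/* nproc-cap.c: run a command under an arbitrary RLIMIT_NPROC cap.
 * Illustration only; the 512 limit is made up for the example. */
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	struct rlimit rl = { .rlim_cur = 512, .rlim_max = 512 };

	if (setrlimit(RLIMIT_NPROC, &rl)) {	/* cap tasks for this uid */
		perror("setrlimit");
		return 1;
	}
	if (argc > 1)
		execvp(argv[1], &argv[1]);	/* e.g. "make -j" */
	perror("execvp");
	return 1;
}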
changed:

What    |Removed       |Added
----------------------------------------------------------------------------
Severity|ship issue    |high
Impact  |Installability|RAS
Priority|P2            |P3

------- Additional Comments From dvhltc.com 2007-06-06 13:15 EDT -------
Recategorizing as RAS, high, and P3. This is not a blocker for java entry, and I don't think it is even a ship/block issue for RHEL5-RT.
----- Additional Comments From nivedita.com (prefers email at niv.com) 2007-06-06 13:49 EDT -------
Actually, I don't even mind recategorizing this as a test bug and removing that option from the kernbench run. Will -M only contain it if there is less than 2 GB, or some other memory threshold? Sorry, I haven't checked that - I might be misremembering. So I gather the current status is that it's too hostile a test?
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-06 13:55 EDT -------
I suspect I even saw this on an 8G box last night. I still think the OOM killer should kill the right thing rather than hang the box. This still seems like a pretty bad issue to me.
changed:

What    |Removed                     |Added
----------------------------------------------------------------------------
Priority|P3                          |P2
Summary |RH242868 - [FOCUS] Kernbench|RH242868 - Kernbench hangs
        |hangs smaller memory        |smaller memory
        |configurations              |configurations

------- Additional Comments From dvhltc.com 2007-06-06 14:02 EDT -------
Dropping FOCUS from IBM's side, as this seems to be a common issue on RHEL5 and RHEL5-RT. I am bumping it back up to P2, though: even though this is likely a valid OOM situation, we feel the machine should recover from an OOM by killing off the right processes. Red Hat's thoughts on this would be appreciated.
changed:

What  |Removed|Added
----------------------------------------------------------------------------
Status|OPEN   |NEEDINFO

------- Additional Comments From dvhltc.com 2007-06-06 14:02 EDT -------
Marking as NEEDINFO, as we want to hear what Red Hat's take is on this bug.
Can someone attach the show_mem() output from when the OOM kill occurs? I need it to debug this problem.

Larry Woodman
Created attachment 156835 [details] oom-out.txt
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-12 19:36 EDT -------
OOM output

Here is the OOM output seen (attached). Additionally, here is the meminfo data from right before an OOM was seen:

MemTotal:      8100056 kB
MemFree:        108412 kB
Buffers:           460 kB
Cached:           1584 kB
SwapCached:       1144 kB
Active:           2556 kB
Inactive:         1096 kB
SwapTotal:     2040212 kB
SwapFree:      2002784 kB
Dirty:               0 kB
Writeback:           0 kB
AnonPages:        1148 kB
Mapped:            840 kB
Slab:          6120372 kB
SReclaimable:    86620 kB
SUnreclaim:    6033752 kB
PageTables:       4488 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   6090240 kB
Committed_AS:    86092 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      1416 kB
VmallocChunk: 34359736783 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

The Slab and SUnreclaim values are huge. In other runs we've noticed that the filp slab just grows and grows; occasionally it gets cleaned up and the system recovers, but not always. We're not sure why the slab isn't reclaiming objects promptly. We've verified that RCU is still functioning, so now we're digging into the slab and alien-cache details (a sketch of how to watch the filp cache follows this comment).
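A minimal sketch of one way to watch the filp cache from userspace - illustrative only, assuming the 2.6-era /proc/slabinfo layout where the fields after the cache name are active_objs, num_objs, objsize, etc.:

/* filp-watch.c: print the filp line from /proc/slabinfo every 5 seconds.
 * Illustrative sketch; assumes the 2.6-era slabinfo format
 * (name <active_objs> <num_objs> <objsize> ...). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[512];

	for (;;) {
		FILE *f = fopen("/proc/slabinfo", "r");
		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "filp ", 5))
				fputs(line, stdout);	/* watch it grow */
		fclose(f);
		sleep(5);
	}
}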
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-12 23:00 EDT -------
A few of the things Keith and I have uncovered:

1) When the system gets into its funky state, we do not get a call to file_free_rcu for every call to file_free. This suggests that something in RCU is not functioning properly.
2) RCU completed values continue to increment while it's in the funky state, so RCU isn't completely horked.
3) rcu_process_callbacks doesn't seem to be called in the funky state (which explains #1).
4) rcu_advance_callbacks is called, but doesn't seem to schedule the rcu tasklet.

From this, we're guessing something is still broken w/ the tasklet code. The RCU tasklet is marked TASKLET_STATE_SCHED, but it doesn't seem to be called. I'm worried that somehow it has been marked scheduled without being added to the list, which would cause exactly this problem (see the sketch below).
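To make the suspected failure mode concrete, here is a standalone sketch (userspace C11, with the kernel's per-cpu list, softirq raise, and atomic bit ops stubbed out) of the contract tasklet_schedule() relies on - the TASKLET_STATE_SCHED bit may only be set while the tasklet is actually queued:

/* Standalone sketch of the tasklet_schedule() contract; the kernel
 * primitives are stubbed so this compiles and runs in userspace. */
#include <stdatomic.h>
#include <stdio.h>

#define TASKLET_STATE_SCHED 0

struct tasklet_struct {
	struct tasklet_struct *next;
	atomic_ulong state;		/* SCHED bit lives here */
	void (*func)(unsigned long);
	unsigned long data;
};

static struct tasklet_struct *queue;	/* stand-in for the per-cpu list */

/* stand-in for the kernel's atomic test_and_set_bit() */
static int test_and_set_bit(int bit, atomic_ulong *word)
{
	return (atomic_fetch_or(word, 1UL << bit) >> bit) & 1;
}

static void __tasklet_schedule(struct tasklet_struct *t)
{
	t->next = queue;	/* in the kernel: list add + raise softirq */
	queue = t;
}

static void tasklet_schedule(struct tasklet_struct *t)
{
	if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
		__tasklet_schedule(t);
	/* else: the bit was already set, so the tasklet MUST already be
	 * queued. If anything lets the bit survive while the tasklet is
	 * NOT on the list, every later call takes this silent early
	 * return and the tasklet (here, RCU's) never runs again - the
	 * funky state described above. */
}

static void rcu_cb(unsigned long data) { (void)data; }

int main(void)
{
	struct tasklet_struct t = { .func = rcu_cb };

	tasklet_schedule(&t);	/* queues it and sets the SCHED bit */
	tasklet_schedule(&t);	/* no-op: bit set, assumed already queued */
	printf("queued: %s\n", queue == &t ? "yes" : "no");
	return 0;
}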
Created attachment 156844 [details] tasklet2.patch
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-12 23:40 EDT -------
Possible fix for tasklet race that would keep RCU from running

I tried to explain the race in the bug. It turns out my last tasklet race fix was incomplete, so rather than having unscheduled tasklets on the list, we get scheduled tasklets that are not on the list! This patch *should* resolve it.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-12 23:50 EDT -------
Crud. The above fix did not resolve the issue. Keith still got into the funky state w/ that patch.
----- Additional Comments From mannthey.com (prefers email at kmannth.com) 2007-06-13 13:13 EDT -------
Our kprobing last night showed that rcu_check_callbacks was getting called and that tasklet_schedule was getting called (a sketch of that kind of instrumentation follows this comment). The problem was that the rcu_tasklet state was 1, i.e. scheduled. The rcu callbacks were not moving along and the rcu done list was never updated, so the rcu callback tasklet was never getting run, and all the new callers just see that the tasklet is scheduled and return right away.

The thought is (as John mentioned) that rcu_data.rcu_tasklet is marked as scheduled but no softirq was raised to kick off tasklet_action, due to madness in the tasklet code. (Or perhaps something is borked in the softirq code?) Further debugging is needed.

The "funky" state is just as described: tasklet_action is never called for the rcu tasklet. The slab grows and grows - you see SUnreclaim grow to fill memory. Modules cannot be removed, though you can still add modules (kprobes) as long as you don't try an rmmod (rmmod hangs, and everything module-related gets borked).
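For reference, the kprobing described above can be done with a trivial module along these lines - a sketch, not our exact instrumentation; it only counts hits on rcu_check_callbacks, using the stock kprobes .symbol_name API:

/* rcuprobe.c: sketch of the kprobe instrumentation - count calls to
 * rcu_check_callbacks and log occasionally. The racy counter is fine
 * for eyeballing; this is not the exact module we used. */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

static unsigned long hits;

static int pre_handler(struct kprobe *p, struct pt_regs *regs)
{
	if (++hits % 1000 == 0)		/* don't flood the log */
		printk(KERN_INFO "rcu_check_callbacks: %lu hits\n", hits);
	return 0;
}

static struct kprobe kp = {
	.symbol_name = "rcu_check_callbacks",
	.pre_handler = pre_handler,
};

static int __init probe_init(void)
{
	return register_kprobe(&kp);
}

static void __exit probe_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");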
Created attachment 156940 [details] rostdet.patch
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-13 19:24 EDT -------
Steven's patch for softirq issues

This patch from Steven seems to resolve the issue for Keith. I'm currently building a test kernel with it and the tasklet race fix above.
Also, please grab a "cat /proc/slabinfo" when the OOM kill occurs. The problem is that the slab cache sucked up most of the memory:

Slab: 6120372 kB

Larry Woodman
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-13 20:12 EDT -------
Hey Larry,
We've dug down pretty far on this one and I suspect the issue has finally been resolved. You're right, the slab was very large, and it was usually the filp slab growing w/o bound. This was tracked back to the RCU code not running the rcu tasklet. That seems to be resolved w/ Steven's patch (attached above), and the earlier tasklet fix (attached a few comments up) is probably a good idea too.
Created attachment 156942 [details] slab.final
----- Additional Comments From mannthey.com (prefers email at kmannth.com) 2007-06-13 20:29 EDT -------
Example of slabinfo

This is a slabinfo capture from near the end of a run. You can see the slab is very large; the system was idle when this snapshot was taken. Note the massive amount of filp structs. Steven's patch fixes this issue.
changed:

What  |Removed|Added
----------------------------------------------------------------------------
Status|OPEN   |ASSIGNED

------- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-14 16:21 EDT -------
Overnight testing looked good. Marking this fixed.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-18 19:28 EDT -------
Additional testing info: ran kernbench in a loop for 4 days and the box is fine w/ -23ibm3.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-06-20 15:11 EDT -------
Verified fixed in 2.6.21-31.el5rt (ran overnight w/o issues).