Bug 242868 - [FOCUS] Kernbench hangs smaller memory configurations
[FOCUS] Kernbench hangs smaller memory configurations
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel (Show other bugs)
1.0
i386 Linux
low Severity urgent
: ---
: ---
Assigned To: Larry Woodman
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2007-06-06 01:46 EDT by IBM Bug Proxy
Modified: 2008-02-27 14:58 EST (History)
0 users

See Also:
Fixed In Version: latest RHEL-5-RT
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-06-20 14:50:45 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Trace 1 Screenshot (15.81 KB, image/png)
2007-06-06 01:50 EDT, IBM Bug Proxy
no flags Details
Trace 2 Screenshot (16.80 KB, image/png)
2007-06-06 01:50 EDT, IBM Bug Proxy
no flags Details
meminfo log - from 2.6.21-23.el5rt (490.65 KB, text/plain)
2007-06-06 01:51 EDT, IBM Bug Proxy
no flags Details
oom-out.txt (121.09 KB, text/plain)
2007-06-12 19:40 EDT, IBM Bug Proxy
no flags Details
tasklet2.patch (1.17 KB, text/plain)
2007-06-12 23:45 EDT, IBM Bug Proxy
no flags Details
rostdet.patch (922 bytes, text/plain)
2007-06-13 19:30 EDT, IBM Bug Proxy
no flags Details
slab.final (15.17 KB, text/plain)
2007-06-13 20:30 EDT, IBM Bug Proxy
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
IBM Linux Technology Center 35223 None None None Never

  None (edit)
Description IBM Bug Proxy 2007-06-06 01:46:57 EDT
LTC Owner is: dvhltc@us.ibm.com
LTC Originator is: dvhltc@us.ibm.com


ltcrt3 is an LS20 with 4 GB RAM.  Kernbech hangs on the second maximum load run
when ran from a shell in rt-sshd - this means all the compiles were run as FIFO
99 as well!  (Did that accidentally while trying to repro this issue).

KERNBENCH OUTPUT:

Running kernbench tests 

4 cpus found
Cleaning source tree...
Making defconfig...
Making dep if needed...
Caching kernel source in ram...
Half load is 2 jobs, changing to 3 as a kernel compile won't guarantee 2 jobs

Performing 5 runs of
make -j 3
make -j 16
make -j

Warmup half load run...

Half load -j3 run number 1...
Half load -j3 run number 2...
Half load -j3 run number 3...
Half load -j3 run number 4...
Half load -j3 run number 5...
Warmup optimal load run...
Optimal load -j16 run number 1...
Optimal load -j16 run number 2...
Optimal load -j16 run number 3...
Optimal load -j16 run number 4...
Optimal load -j16 run number 5...
Warmup maximal load run...
Maximal load -j run number 1...
Maximal load -j run number 2...



I saw two distinct traces, but was only able to get at them from the remote
console - cropped and inverted with gimp and gocr'd here (errors probable :-)

dvhart@aeon:~$ gocr -i Desktop/Screenshot.png 
[<ffffffff810c1351>] _ age_cache_alloc+0 _ e/0x7Z
[<ffffffff8101Ze43>] _ do _ age_cache_readahead+0 m 7/0 xZ 05
[<ffffffff810cc754>] do _ age_cache_readahead+0 _ Z/0 _ f
[<ffffffff810137da>] file nm p_nopage+0x183/0 x3 8Z
[<ffffffff81008818>] _ handle_ mn _fault+0 xZ 26/0 d 2b
[<ffffffff81068101>] do _ age_fault+0 d 6d/0 _ 13
[<ffffffff81066add>] error_exit+0 dJ /0 _ 4
[<0000555555557510>]



Codei 8b 0Z f6 c4 04 74 05 41 ff c4 eb 35 8b 0Z 66 85 c0 71 05 41
RIP  [<ffffffff81081787>] show_ me m+0 _ f/0x144
RSP <ffff81010118bbZ8>
CRZi 0000000005a00000
usb Z-Z: new full speed USB device using ohci_hcd and address 5
usb Z-Z: configuration #1 chosen from 1 choice
hub Z-Z:1.0: USB hub found
hub Z-Z:1.0: 4 ports detected
usb Z-Z.1: new full speed USB device using ohci_hcd and address 6
usb Z-Z.1: configuration #1 chosen from 1 choice
input: IBff PPC I/F as /class/input/input6
input: USB HID v1.10 Keyboard [IBff PPC I/F] on usb-0000:01:00.1-Z.1
input: IBff PPC I/F as /class/input/input7
input: USB HID v1.10 fb use [IBff PPC I/F] on usb-0000:01:00.1-Z.1



dvhart@aeon:~$ gocr -i Desktop/Screenshot-1.png 
R10i ffff81011100f840 R11i ffffffff818Zad88 R1Zi 0000000000015371
R13i 0000000000001411 R14i 000000000000001Z R15i 0000000000000001
FSi  0000Zb1cc63e0Z40(0000) GSiffff810111c1bZc0(0000) XnIGSi0000000000000000
CSi  0010 DSi 0018 ESi 0018 CR0i 000000008005003b
CRZi 0000000005a00000 CR3i 00000000ece68000 CR4i 00000000000006e0
Process Xthread (pid: 51, threadinfo ffff81011100e000, tasX ffff8101117ff040)
StacXi  0000000000000001 00000000000000d0 ffff8100000111e0 ffff8101117ff040
ffff81011100fb10 ffffffff810caZa1 000000Z000000001 000000010000003c
ffff81011100fad8 0000000000000000 0000000000000000 00000000000000d0
Call N acei
[<ffffffff810caZa1>] out_of_ me nD ry+0x71/0 xZ cd
[<ffffffff8100f375>] _ alloc _ ages+0 xZ 38/0 xZ c5
[<ffffffff81015ea0>] alloc _ ages_current+0 a 8/0xb0
[<ffffffff8103d13Z>] _ et_free _ ages+0 x /0x71
[<ffffffff8101f11d>] copy _ rocess+0 x d/0x15Za
[<ffffffff8103268e>] do_forX+0 x ç/0x1ç6
[<ffffffff8105ff01>] Xernel_thread+0 _ 1/0 d e
[<ffffffff8105ff5e>] child_rip+0 dJ /0x1Z



Codei 8b 0Z f6 c4 04 74 05 41 ff c4 eb 35 8b 0Z 66 85 c0 71 05 41
RIP  [<ffffffff81081787>] show_ me m+0 _ f/0x144
RSP <ffff81011100faa0>
CRZi 0000000005a00000

Comment #1 was with 2.6.21-14ibm




The reason I thought I had seen the second trace on ltcrt3 was that
it's showing up (via different allocation path) on the Intellistation
A-Pro hang traces I was seeing:

Jun  1 11:17:23 apro-gs kernel: sig_exit invoked oom-killer: gfp_mask=0xd0,
order=1, oomkilladj=0
Jun  1 11:17:23 apro-gs kernel:
Jun  1 11:17:23 apro-gs kernel: Call Trace:
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8106d5da>] dump_trace+0xaa/0x32a
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8106d89b>] show_trace+0x41/0x5c
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8106d8cb>] dump_stack+0x15/0x17
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff810ca2a2>] out_of_memory+0x74/0x2cd
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8100f375>] __alloc_pages+0x238/0x2c5
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff81015ea0>] alloc_pages_current+0xa8/0xb0
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8103d932>] __get_free_pages+0xe/0x71
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8101f99d>] copy_process+0xcd/0x152a
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8103268e>] do_fork+0xcc/0x1c6
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8106cfa2>] sys_clone+0x23/0x25
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8105f437>] ptregscall_common+0x67/0xb0
Jun  1 11:17:23 apro-gs kernel:  [<ffffffff8105f29e>] tracesys+0xdc/0xe1
Jun  1 11:17:23 apro-gs kernel:  [<00000035f0a94679>]
Jun  1 11:17:24 apro-gs kernel:



This time I ran on 2.6.21-23.el5rt and didn't run kernbench from an rt shell
:-)

I appended /proc/meminfo into a log every 2 seconds (from the rt shell) while
kernbench ran from the normal shell until it hung the box again.  This time I
saw no memory related traces.  The log is rather interesting.  Things run fine
for the first 585 iterations (1170 seconds) and then it started losing 13.5 MB
/ sec from MemFree until it had dropped from 3.18 GB free to 25 MB - at which
point the box stopped responding and no more data was logged.

I should note that during this run I saw several occurences of this known
trace:

BUG: at kernel/softirq.c:639 __tasklet_action()

Call Trace:
 [<ffffffff8106d5da>] dump_trace+0xaa/0x32a
 [<ffffffff8106d89b>] show_trace+0x41/0x5c
 [<ffffffff8106d8cb>] dump_stack+0x15/0x17
 [<ffffffff81094a97>] __tasklet_action+0xdf/0x12e
 [<ffffffff81094f76>] tasklet_action+0x27/0x29
 [<ffffffff8109530a>] ksoftirqd+0x16c/0x271
 [<ffffffff81033d4d>] kthread+0xf5/0x128
 [<ffffffff8105ff68>] child_rip+0xa/0x12


Note, a short time after memfree started dropping, swapfree tanked at an even
faster rate, from 2 GB free to 0 in 80 seconds (25 MB/sec).



A kernbench run on the stock rhel5 2.6.18-8.el5 kernel also starts rapidly
dropping memory as the maximal loads start, but after dropping from 3 GB to 30
MB of free mem, and 2 GB to 1 GB of free swap, the system recovered to it's
nearly all free state just before the second maximal run started.  The second
run never dropped below 2 GB of free memory.  The third run dropped as low as 20
MB mem free and 700 MB swap free (that was displayed anyway, I didn't log this
run ... wishing now that I had...).  The fourth run too mem down to 20 MB, and
swap all the way down to 4 kB, and then hung.  It is currently in this state
now, rtsshd responds, and the SOL console allows me to type - neither can
execute commands in a timely manner.  I'll post an additional comment if the
system recovers.

The kernbench readme says it requires 2 GB to complete.  Perhaps this is
inaccurate, as it appears to require nearly 6 (4 GB RAM + 2 GB Swap).  Or
perhaps it will eat as much as is available given that the maximal load option
passes -j to make with no numeric argument, so there is no limit set on the
number of parallel tasks make can fork off.

I hit the hang again while running kernbench. The OOM messages appeared. System
becomes absolutely unresponsive, could not log in even using rt-sshd. But the
system could be pinged. Yet to see any message that indicates the OOM killer has
killed a particular process.

kernbench ran to completion on rt-cypress (LS21). It has 8GB of RAM.

A note from the Kernbench README, pointed out to me by rostedt on irc:

"Ideally it should be run in single user mode on a non-journalled filesystem.
To compare results it should always be run in the same kernel tree."

rostedt also pointed out that ext3 is really bad for RT anyway - it does things
like "wait-on-bit" which my task dump showed that a lot of stuff was doing and
has busy spin locks.  We need to ask Ted about the function "inverted_lock" in
fs/jbd/commit.c - on RT it can lockup the system if a higher priority process
than the kjournal thread is doing disk access.


I just ran 3 kernbench runs in an ext2 filesystem.  All three hung during the
maximal load runs, with similar memory behavior to the prior runs.
Comment 1 IBM Bug Proxy 2007-06-06 01:50:05 EDT
Created attachment 156317 [details]
Trace 1 Screenshot

with 2.6.21-14ibm

The OCR wasn't very good afterall, attaching both trace screenshots - better
than nothing...
Comment 2 IBM Bug Proxy 2007-06-06 01:50:44 EDT
Created attachment 156318 [details]
Trace 2 Screenshot
Comment 3 IBM Bug Proxy 2007-06-06 01:51:45 EDT
Created attachment 156319 [details]
meminfo log - from 2.6.21-23.el5rt
Comment 4 IBM Bug Proxy 2007-06-06 13:20:18 EDT
----- Additional Comments From dvhltc@us.ibm.com  2007-06-06 13:14 EDT -------
After a discussion with Dave Hansen I am back to thinking this is a valid OOM
situation.  The most likely reason that we are only seeing this now is that
RHEL5 includes a more modern gcc, which is more memory hungry than previous
versions.

'make -j' is unbounded as to the amount of memory it will consume and the number
of tasks it will fork.  It is essentially a fork bomb.

Folks are also telling me that running the -j runs is not a typical way to run
kernbench.  I suggest we remove them by appending the -M option to the kernbench
command in run_auto.sh in our test harness.

As for the RAS aspect of this. We need to discuss how a RT platform should
respond to a fork bomb such as this. 
Comment 5 IBM Bug Proxy 2007-06-06 13:20:51 EDT
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|ship issue                  |high
             Impact|Installability              |RAS
           Priority|P2                          |P3




------- Additional Comments From dvhltc@us.ibm.com  2007-06-06 13:15 EDT -------
Recategorizing as RAS, high, and P3.  This is not a blocker for java entry, and
I don't think it is even a ship/block issue for RHEL5-RT. 
Comment 6 IBM Bug Proxy 2007-06-06 13:50:40 EDT
----- Additional Comments From nivedita@us.ibm.com (prefers email at niv@us.ibm.com)  2007-06-06 13:49 EDT -------
Actually, I don't even mind recategorizing as test bug and removing that option
from the kernbench run. -M will only contain it if less than 2GB or some other
memory threshold? Sorry, not checked that - might be misremembering. 

So I'm gathering the current status to be it's too hostile a test? 
Comment 7 IBM Bug Proxy 2007-06-06 14:00:27 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-06 13:55 EDT -------
I suspect I even saw this on an 8G box last night. I still worry that the OOM
killer should kill the right thing and not hang the box. This still seems like a
pretty bad issue to me. 
Comment 8 IBM Bug Proxy 2007-06-06 14:05:19 EDT
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
            Summary|RH242868 - [FOCUS] Kernbench|RH242868 - Kernbench hangs
                   |hangs smaller memory        |smaller memory
                   |configurations              |configurations




------- Additional Comments From dvhltc@us.ibm.com  2007-06-06 14:02 EDT -------
Dropping FOCUS from IBM's side as this seems to be a common issue on RHEL5 and
RHEL5-RT.  I am bumping back up to P2 though as even though this is likely a
valid OOM situation - we feel the machine should recover from an OOM by killing
off the right processes.  RedHat's thoughts on this would be appreciated. 
Comment 9 IBM Bug Proxy 2007-06-06 14:05:52 EDT
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |NEEDINFO




------- Additional Comments From dvhltc@us.ibm.com  2007-06-06 14:02 EDT -------
Marking as NEEDINFO, as we want to hear what RedHat's take is on this bug. 
Comment 10 Larry Woodman 2007-06-11 15:44:08 EDT
Can someone attach the show_mem() output when the OOM kill occur?  I need it to
debug this problem.

Larry Woodman
Comment 11 IBM Bug Proxy 2007-06-12 19:40:26 EDT
Created attachment 156835 [details]
oom-out.txt
Comment 12 IBM Bug Proxy 2007-06-12 19:40:32 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-12 19:36 EDT -------
 
OOM output

Here is the OOM output seen.

Additionally, here is the meminfo data right before an OOM was seen:
MemTotal:      8100056 kB
MemFree:	108412 kB
Buffers:	   460 kB
Cached: 	  1584 kB
SwapCached:	  1144 kB
Active: 	  2556 kB
Inactive:	  1096 kB
SwapTotal:     2040212 kB
SwapFree:      2002784 kB
Dirty:		     0 kB
Writeback:	     0 kB
AnonPages:	  1148 kB
Mapped: 	   840 kB
Slab:	       6120372 kB
SReclaimable:	 86620 kB
SUnreclaim:    6033752 kB
PageTables:	  4488 kB
NFS_Unstable:	     0 kB
Bounce: 	     0 kB
CommitLimit:   6090240 kB
Committed_AS:	 86092 kB
VmallocTotal: 34359738367 kB
VmallocUsed:	  1416 kB
VmallocChunk: 34359736783 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:	  2048 kB

The Slab and Sunreclaim are large. In other runs, we've noticed the filp slab
seemingly is growing and growing. Occasionally it will be cleaned up and the
system will recover, but not always.

We're not sure why the slab doesn't seem to reclaim things promptly. We've
verified that RCU is still functioning, so now we're digging in in the slab and
aliencache details. 
Comment 13 IBM Bug Proxy 2007-06-12 23:05:22 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-12 23:00 EDT -------
Few of the things Keith and I have uncovered:
1) When the system gets into its funky state, we do not get a call to
file_free_rcu for ever call to file_free. This points that something in RCU is
not functioning properly.
2) RCU completed values continue to increment while its in its funky state, so
RCU isn't completely horked.
3) rcu_process_callbacks doesn't seem to be called in the funky state (explains #1)
4) rcu_advance_callbacks is called, but doesn't seem to schedule the rcu tasklet.

From this, we're guessing something is still broken w/ the tasklet code. The RCU
tasklet is marked as TASKLET_STATE_SCHED, but it doesn't seem to be called.

I'm worried that some how it has been marked scheduled, but has not been added
to the list, which would cause such a problem. 
Comment 14 IBM Bug Proxy 2007-06-12 23:45:21 EDT
Created attachment 156844 [details]
tasklet2.patch
Comment 15 IBM Bug Proxy 2007-06-12 23:45:27 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-12 23:40 EDT -------
 
Possible fix for tasklet race that would keep RCU from running

I tried to explain the race in the bug. Ends up my last tasklet race fix was
incomplete. So that rather then having unscheduled tasklets on the list, we get
scheduled tasklets not on the list! This patch *should* resolve it. 
Comment 16 IBM Bug Proxy 2007-06-12 23:55:21 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-12 23:50 EDT -------
Crud. The above fix did not resolve the issue. Keith still got into the funky
state w/ that patch. 
Comment 17 IBM Bug Proxy 2007-06-13 13:16:49 EDT
----- Additional Comments From mannthey@us.ibm.com (prefers email at kmannth@us.ibm.com)  2007-06-13 13:13 EDT -------
Our kprobing last night showed that rcu_check_callbacks was getting called and
that tasklet_schedule was getting called.   The problems was that the
rcu_tasklet state was 1 or scheduled.  The rcu callback were not going on and
the rcu done list was never updated, so the rcu callback tasklet was not getting
ran and all the new callers just see that the taskelt is scheduled and return
right away. 

The thought is (as John mentioned) that the rcu_data.tcu_tasklet is marked as
scheduled but no soft irq was raised to kick off tasklet_action due to madness
in the tasklet code.   (OR perhaps something is borked with the softirq code?)
Further debug is needed.   


The "funky" state is just as described.  The tasklet_action is never called for
the rcu tasklet.  For sure the slab grows and grows you see SUnreclaim grow to
fill memory.  Modules may not be removed.  You can add more modules (kprobe) as
long as you do not try a rmmod (rmmod hangs and every thing modules gets borked.) 
Comment 18 IBM Bug Proxy 2007-06-13 19:30:44 EDT
Created attachment 156940 [details]
rostdet.patch
Comment 19 IBM Bug Proxy 2007-06-13 19:30:49 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-13 19:24 EDT -------
 
Steven's patch for softirq issues

This patch from Steven seems to resolve the issue for Keith. I'm currently
building a test kernel with it and the tasklet race fix above. 
Comment 20 Larry Woodman 2007-06-13 20:00:07 EDT
Also, please grab a "cat /proc/slabiunfo" when the OOMkill occurs.  The problem
is that the slabcache sucked up most of the memory:

Slab:	       6120372 kB


Larry Woodman
Comment 21 IBM Bug Proxy 2007-06-13 20:15:23 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-13 20:12 EDT -------
Hey Larry,  We've dug down pretty far on this one and I suspect the issue has
been finally resolved. You're right, the slab was very large, and it was usually
caused by the filp slab growing w/o bound. This was tracked back to the RCU code
not running the rcu tasklet. This seems to be resolved w/ Steven's patch
(attached above), and the earlier tasklet fix  (attached a few more comments up)
is probably a good idea too. 
Comment 22 IBM Bug Proxy 2007-06-13 20:30:27 EDT
Created attachment 156942 [details]
slab.final
Comment 23 IBM Bug Proxy 2007-06-13 20:30:44 EDT
----- Additional Comments From mannthey@us.ibm.com (prefers email at kmannth@us.ibm.com)  2007-06-13 20:29 EDT -------
 
Example of slabinfo

This is a slabinfo capture deal the end for a box running.  You can see the
slab is very large.  The system was idle when this stapshot was taken.	You can
see the massive abmout of filp structs. 

Steven's patch fixes this issue. 
Comment 24 IBM Bug Proxy 2007-06-14 16:27:33 EDT
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |ASSIGNED




------- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-14 16:21 EDT -------
Overnight testing looked good. Marking this fixed. 
Comment 25 IBM Bug Proxy 2007-06-18 19:30:43 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-18 19:28 EDT -------
Additional testing info: Ran kernbench in a loop for 4 days and the box is fine
w/ -23ibm3 
Comment 26 IBM Bug Proxy 2007-06-20 15:19:17 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-06-20 15:11 EDT -------
Verified fixed in 2.6.21-31.el5rt (ran overnight w/o issues) 

Note You need to log in before you can comment on or make changes to this bug.