LTC Owner is: jstultz.com
LTC Originator is: dbhilley.com

Problem description:
Stressing the network interface causes a kernel oops:

c1f2bc2n13 kernel: Unable to handle kernel paging request at ffff81051fbe06b4 RIP:
c1f2bc2n13 kernel: [<ffffffff802dbaed>] slab_get_obj+0x14/0x21
c1f2bc2n13 kernel: PGD 8063 PUD 0
c1f2bc2n13 kernel: Oops: 0000 [1] PREEMPT SMP

If this is a customer issue, please indicate the impact to the customer:

If this is not an installation problem, describe any custom patches installed:
GPFS is compiled and loaded as a kernel module.

Provide output from "uname -a", if possible:
Linux c1f2bc2n13 2.6.21-31.el5rt #1 SMP PREEMPT RT Mon Jun 18 16:44:12 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Hardware Environment
Machine type (p650, x235, SF2, etc.): IBM eServer BladeCenter LS20 -[8850Z47]-
Cpu type (Power4, Power5, IA-64, etc.): x86_64
Describe any special hardware you think might be relevant to this problem:

Please provide contact information if the submitter is not the primary contact.
Please provide access information for the machine if it is available.

Is this reproducible? Yes
If so, how long does it (did it) take to reproduce it? Describe the steps:
Simply stress the network interface using iperf for a few hours. I've done this several times.
If not, describe how the bug was encountered:

Is the system (not just the application) hung? If so, describe how you determined this:

Did the system produce an OOPS message on the console?
If so, copy it here:

On Jul 13:
c1f2bc2n13 kernel: Unable to handle kernel paging request at ffff81051fbe06b4 RIP:
c1f2bc2n13 kernel: [<ffffffff802dbaed>] slab_get_obj+0x14/0x21
c1f2bc2n13 kernel: PGD 8063 PUD 0
c1f2bc2n13 kernel: Oops: 0000 [1] PREEMPT SMP

On Jul 8:
c1f2bc2n13 kernel: Unable to handle kernel paging request at ffff81051fbe06b4 RIP:
c1f2bc2n13 kernel: [<ffffffff802dbaed>] slab_get_obj+0x14/0x21
c1f2bc2n13 kernel: PGD 8063 PUD 0
c1f2bc2n13 kernel: Oops: 0000 [1] PREEMPT SMP

Is the system sitting in a debugger right now? If so, how long may it stay there?

Additional information:
Since this is caused by stressing the network, it may be related to bug #36015 (RH246711).

Reproduced again with a slightly different function in the backtrace and more info in /var/log/messages after reboot:

c1f2bc2n13 kernel: Unable to handle kernel paging request at ffff810174607e80 RIP:
c1f2bc2n13 kernel: [<ffffffff8022f8fe>] __alloc_skb+0xbe/0x140
c1f2bc2n13 kernel: PGD 8063 PUD 0
c1f2bc2n13 kernel: Oops: 0002 [1] PREEMPT SMP
c1f2bc2n13 kernel: CPU 3
c1f2bc2n13 kernel: Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport sg shpchp pcspkr amd_rng i2c_amd756 i2c_core k8temp hwmon tg3 rtc_cmos rtc_core rtc_lib serio_raw usb_storage qla2xxx scsi_transport_fc mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
c1f2bc2n13 kernel: Pid: 46, comm: softirq-net-rx/ Not tainted 2.6.21-31.el5rt #1
c1f2bc2n13 kernel: RIP: 0010:[<ffffffff8022f8fe>] [<ffffffff8022f8fe>] __alloc_skb+0xbe/0x140
c1f2bc2n13 kernel: RSP: 0018:ffff81011fdcfcc0 EFLAGS: 00010212
c1f2bc2n13 kernel: RAX: ffff810174607e80 RBX: ffff810174607800 RCX: 0000000000000000
c1f2bc2n13 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff810072a29a90
c1f2bc2n13 kernel: RBP: ffff81011fdcfd00 R08: 0000000000000000 R09: ffff810072a299c0
c1f2bc2n13 kernel: R10: ffff81011fdcfbb0 R11: ffff81007cb29e30 R12: 0000000000000680
c1f2bc2n13 kernel: R13: ffff810072a299c0 R14: 0000000000000680 R15: 0000000000000020
c1f2bc2n13 kernel: FS: 0000000040800940(0000) GS:ffff81011fd9f6c0(0000) knlGS:0000000000000000
c1f2bc2n13 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
c1f2bc2n13 kernel: CR2: ffff810174607e80 CR3: 00000001155cc000 CR4: 00000000000006e0
c1f2bc2n13 kernel: Process softirq-net-rx/ (pid: 46, threadinfo ffff81011fdce000, task ffff810037c77040)
c1f2bc2n13 kernel: Stack: ffff81011fdcfd20 0000000080236999 ffff81011f838480 ffff81007fe88000
c1f2bc2n13 kernel:  ffff81007fe88900 ffff81007db718d0 ffff81007cad31a0 0000000000000642
c1f2bc2n13 kernel:  ffff81011fdcfd20 ffffffff80429e12 ffff81007fe88000 00000000000000c5
c1f2bc2n13 kernel: Call Trace:
c1f2bc2n13 kernel: [<ffffffff80429e12>] __netdev_alloc_skb+0x31/0x4e
c1f2bc2n13 kernel: [<ffffffff88234e1d>] :tg3:tg3_alloc_rx_skb+0xe1/0x177
c1f2bc2n13 kernel: [<ffffffff8823c668>] :tg3:tg3_poll+0x470/0xa39
c1f2bc2n13 kernel: [<ffffffff8020c441>] net_rx_action+0xbe/0x1f2
c1f2bc2n13 kernel: [<ffffffff80295ab0>] ksoftirqd+0x16c/0x271
c1f2bc2n13 kernel: [<ffffffff80233d76>] kthread+0xf5/0x128
c1f2bc2n13 kernel: [<ffffffff8025ff68>] child_rip+0xa/0x12
c1f2bc2n13 kernel:
c1f2bc2n13 kernel:
c1f2bc2n13 kernel: Code: c7 00 01 00 00 00 66 c7 40 04 00 00 66 c7 40 06 00 00 66 c7
c1f2bc2n13 kernel: RIP [<ffffffff8022f8fe>] __alloc_skb+0xbe/0x140
c1f2bc2n13 kernel: RSP <ffff81011fdcfcc0>
c1f2bc2n13 kernel: CR2: ffff810174607e80

Again, tg3 is involved, which makes me wonder if it is possibly related to bug #36015 (RH246711).
Changed state to NEEDINFO as IBM is trying to isolate whether this problem occurs if GPFS is not loaded.
----- Additional Comments From dbhilley.com 2007-07-19 15:19 EDT ------- Yes, it occurs when GPFS is not loaded. Check the second backtrace I posted as it contains a list of "modules linked in." You can verify that mmfs is not among them.
Can you give us any suggestions on how to reproduce?
Can it be reproduced on a different NIC, e.g. an e1000?
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-24 21:02 EDT ------- I'm working to reproduce w/ iperf.
----- Additional Comments From dbhilley.com 2007-07-25 00:53 EDT ------- My iperf testing setup is very simple. I have two nodes, and I run a unidirectional client and server on both nodes with a very long runtime value. All of the other settings are left at their defaults (TCP, as fast as possible, default buffer size, etc.).

On node1 I'd run:
$ iperf -s -p 9999
$ iperf -c node2 -p 9999 -t 9999999

On node2 I'd run:
$ iperf -s -p 9999
$ iperf -c node1 -p 9999 -t 9999999

I just run this continuously, and within 2-24 hours one of the nodes will almost always crash. Sometimes I get nothing in the messages output; sometimes I get some variation on the "Unable to handle kernel paging request" oops (as posted). Other configurations, such as UDP, UDP capped at 100 Mbit/s, and a bidirectional client on one machine with the server on the other, have also elicited crashes.

As far as testing with another NIC: these are LS20 blades with two built-in Broadcom gigE NICs. I'm not sure adding another NIC is feasible.
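The two iperf commands above can be wrapped in a small helper to run the cross-stress on each node. This is only a sketch of the setup described in the comment; the wrapper function and the DRY_RUN switch are my additions (so the script can be sanity-checked on a box without iperf), while the hostnames, port, and -t value come from the comment:

```shell
#!/bin/sh
# Cross-stress sketch: run once on each node, with PEER set to the other
# node's hostname. DRY_RUN=1 (the default here) only prints the commands;
# unset it on real hardware to actually start the iperf server and client.
PEER=${PEER:-node2}
PORT=9999
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@" & fi
}
run iperf -s -p "$PORT"                     # server: receive from the peer
run iperf -c "$PEER" -p "$PORT" -t 9999999  # client: transmit to the peer
wait
```

On node2 the same script would be invoked with PEER=node1, matching the symmetric setup above.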
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-27 17:42 EDT ------- I tried the iperf test for two days over loopback on a single box and did not reproduce the issue. It seems the actual network driver needs to be in use. I'll set up a multi-system test as described soon.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-27 20:37 EDT ------- Currently running half duplex version of the above, with server on ltcrt18 and client on ltcrt15.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-27 20:51 EDT ------- Cool, that didn't take long to trigger. The issue is fairly quickly reproduced with one-way transfers, with the client being the one that crashes.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-30 19:08 EDT ------- I'm checking to see if this can be reproduced w/ upstream 2.6.22.1-rt9
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-30 21:24 EDT ------- Could not easily reproduce w/ 2.6.22.1-rt9.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-31 21:06 EDT ------- With 2.6.21-31, while trying to reproduce, I got the following:

tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]

over and over, finally followed by a few:

PM: Writing back config space on device 0000:02:01.0 at offset b (was 164814e4, writing 3011014)
PM: Writing back config space on device 0000:02:01.0 at offset 3 (was 804000, writing 804010)
PM: Writing back config space on device 0000:02:01.0 at offset 2 (was 2000000, writing 2000010)
PM: Writing back config space on device 0000:02:01.0 at offset 1 (was 2b00000, writing 2b00146)
PM: Writing back config space on device 0000:02:01.0 at offset 0 (was 164914e4, writing 16a814e4)
tg3: eth0: Link is down.

followed by some more "PM: Writing back..." messages, then finally:

tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.

The box was still up and usable, but network connections were dropped.
changed:
           What    |Removed |Added
----------------------------------------------------------------------------
           Priority|P2      |P3

------- Additional Comments From dbhilley.com 2007-07-31 23:09 EDT ------- I got that same repeated output on bug #36015, which I believe may be related:

tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]

except in that case the messages repeated for hours and hours and the interface never came back up; I had to restart the machines remotely. When encountering this bug, however, I never got those messages. In the last excerpt I posted, where __alloc_skb was the offending function, there was no output at all after the backtrace, and the next entry is the syslogd restart (12+ hours later). In the first two excerpts I posted, where slab_get_obj was the offending function, I didn't even get a backtrace or a list of modules linked in: the log ends there.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-08-01 18:57 EDT ------- Possibly related, triggered on a 2.6.21.4-rt10 based kernel running iperf:

BUG: at mm/slab.c:2795 slab_put_obj()

Call Trace:
[<ffffffff810af10a>] free_block+0x14a/0x220
[<ffffffff810af2c9>] drain_array+0xe9/0x150
[<ffffffff810af44a>] __drain_alien_cache+0x8a/0xa0
[<ffffffff810b0a5d>] cache_reap+0x12d/0x2c0
[<ffffffff810b0930>] cache_reap+0x0/0x2c0
[<ffffffff810567d1>] run_workqueue+0xc1/0x120
[<ffffffff8105784b>] worker_thread+0x15b/0x1a0
[<ffffffff8103bd50>] default_wake_function+0x0/0x20
[<ffffffff81037586>] __wake_up_common+0x46/0x80
[<ffffffff810576f0>] worker_thread+0x0/0x1a0
[<ffffffff8105a5f9>] kthread+0x129/0x160
[<ffffffff8100b538>] child_rip+0xa/0x12
[<ffffffff8105a4d0>] kthread+0x0/0x160
[<ffffffff8100b52e>] child_rip+0x0/0x12
---------------------------
| preempt count: 00000000 ]
| 0-level deep critical section nesting:
----------------------------------------
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-08-03 16:48 EDT ------- I'm still hammering on this, but so far have not narrowed it down much. I have found that running "iperf -c <client> -t 999999 -P 4" helps reproduce the issue much faster.
Created attachment 160683 [details] out
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-08-03 21:17 EDT ------- dmesg and sysrq-t output.

I got into a situation where the box hung, then after a bit recovered (this happens about 1 out of every 10 times). I also ran sysrq-t while the box was hung and printing "tg3: eth0: transmit timed out, resetting" messages. This is the resulting dmesg output.
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-08-06 19:57 EDT ------- Tried to get a better grip on the issue; bisected down to the following:

2.6.21.4-rt10: Does hang.
2.6.21.4-rt11: Does hang.
2.6.21.5-rt14: Does hang.
2.6.21.5-rt15: Does hang.
2.6.21.5-rt16: N/A (doesn't build)
2.6.21.5-rt17: Does not hang.
2.6.21.6-rt19: Does not hang.

I'm working to narrow down the -rt15 to -rt17 diff.
Created attachment 160793 [details] mini17.patch
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-08-06 22:52 EDT ------- Steven's "restrict irqd from migrating" patch: apparent fix for the issue. I'm still testing, but with this patch, 2.6.21.4-rt10 appears not to hang under iperf stress testing. The issue, of course, is that this binds irqds to one CPU of the affinity mask so they do not migrate, which makes it possible for an RT task to starve the interrupt, should it land on the selected CPU. I've previously sent patches to Ingo basically reverting this patch. :( Talking with Steven, he suggested we look at stopping softirqs from being called from irqds. I'm still mulling over my own thoughts as to what the right fix would be.
changed:
           What      |Removed  |Added
----------------------------------------------------------------------------
           Status    |ASSIGNED |FIXEDAWAITINGTEST
           Resolution|         |FIX_ALREADY_AVAIL

------- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-08-07 22:00 EDT ------- I'm going to mark this as fixed; however, it's likely we'll need to open a new bug to track the softirq migration vs. starvation issue.
Created attachment 161080 [details] Steven's recent patch against -31

Patch against RH's -31 kernel from Steven to avoid calling softirqs directly from irqds. This resolves the network hang as effectively as the previous patch did, while avoiding IRQ starvation.
------- Comment From jstultz.com 2007-08-13 15:21 EDT------- With the last patch attached, I had the box fall over under heavy stress testing. Not sure whether this is caused by starvation or not. Continuing testing; will provide more details when I can.
------- Comment From jstultz.com 2007-08-13 20:13 EDT------- Verified as a starvation issue. Steven's patch still improves on -31, so it should likely go in, but it is still possible for a single RT hog to lock up the box (after some time).
The RT starvation issue is due to RT hogs preempting an IRQ thread that is bound to a CPU. If the IRQ affinity mask is set to more than one CPU, the expected behaviour (from the user's point of view) is that the IRQ thread would migrate over to the other CPUs. Unfortunately, if we bind the IRQ thread, this prevents the migration from happening, and thus we get IRQ starvation. The fix is simply to stop the hard IRQ thread from running any softirq routines, until we can make it possible for the softirq routines to migrate themselves. We can still starve softirqs (and possibly more than one, if locks are involved), but that's a separate issue. I'll attach a patch that prevents the IRQ thread from running softirq routines, so that it no longer needs to be bound to fewer CPUs than the IRQ affinity mask specifies.
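The binding mechanism behind the starvation can be illustrated from userspace. This is only a sketch, not the kernel-side code: it pins a process to CPU 0 with taskset (analogous to the patch binding the IRQ thread to one CPU of its affinity mask) and reads back the kernel's view of the allowed-CPU mask from /proc:

```shell
# Illustration only: pin a shell to CPU 0, mimicking a bound IRQ thread,
# and show the kernel now reports a single-CPU allowed mask. A thread
# restricted this way cannot migrate to other CPUs in the IRQ affinity
# mask, so an RT hog monopolizing CPU 0 starves it indefinitely.
taskset -c 0 sh -c 'grep Cpus_allowed_list /proc/self/status'
```

The same Cpus_allowed_list field on an unpinned process would show the full CPU range, which is the migration freedom the user expects from a multi-CPU IRQ affinity mask.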
Created attachment 172422 [details] Don't run the softirq routine from the hard IRQ thread
Has this bug been seen since the patch listed in #27 was applied (2.6.21-36)?
------- Comment From jstultz.com 2008-03-25 19:29 EDT------- This fix has been included in RH's tree, so it can be marked accepted.