Description of problem:
There have been numerous CPU soft lockups that were tracked down to http://bugzilla.kernel.org/show_bug.cgi?id=8668

How reproducible:
Very, see: http://www.spinics.net/lists/netdev/msg35100.html

Steps to Reproduce:
Run tc-crash.sh <http://www.spinics.net/lists/netdev/binpKJ7TGP5BK.bin>

Actual results:
CPU soft lockups like:

 [<c043ea0f>] softlockup_tick+0x98/0xa6
 [<c0408b7d>] timer_interrupt+0x504/0x557
 [<c043ec43>] handle_IRQ_event+0x27/0x51
 [<c043ed00>] __do_IRQ+0x93/0xe8
 [<c040672b>] do_IRQ+0x93/0xae
 [<c053a04d>] evtchn_do_upcall+0x64/0x9b
 [<c0404ec5>] hypervisor_callback+0x3d/0x48
 [<c053999c>] force_evtchn_callback+0xa/0xc
 [<c0424601>] try_to_del_timer_sync+0x44/0x4a
 [<c0424611>] del_timer_sync+0xa/0x14
 [<ee5814ad>] htb_destroy+0x20/0x74 [sch_htb]
 [<c05aa29e>] qdisc_destroy+0x41/0x8a
 [<c05ab9cc>] tc_get_qdisc+0x169/0x1a3
 [<c05ab863>] tc_get_qdisc+0x0/0x1a3
 [<c05a35e1>] rtnetlink_rcv_msg+0x1b7/0x1dc
 [<c05af8a4>] netlink_run_queue+0x63/0xfa
 [<c05a342a>] rtnetlink_rcv_msg+0x0/0x1dc
 [<c05a33e9>] rtnetlink_rcv+0x25/0x3d
 [<c05afd1d>] netlink_data_ready+0xf/0x44
 [<c05aed71>] netlink_sendskb+0x19/0x30
 [<c05afd01>] netlink_sendmsg+0x277/0x284
 [<c0593172>] sock_sendmsg+0xce/0xe8
 [<c042cc1d>] autoremove_wake_function+0x0/0x2d
 [<c044427d>] find_get_page+0x37/0x3c
 [<c0446cde>] filemap_nopage+0x192/0x314
BUG: soft lockup detected on CPU#0!
 [<c043ea0f>] softlockup_tick+0x98/0xa6
 [<c0408b7d>] timer_interrupt+0x504/0x557
 [<c043ec43>] handle_IRQ_event+0x27/0x51
 [<c043ed00>] __do_IRQ+0x93/0xe8
 [<c040672b>] do_IRQ+0x93/0xae
 [<c053a04d>] evtchn_do_upcall+0x64/0x9b
 [<ee581ed3>] htb_rate_timer+0x0/0xbf [sch_htb]
 [<c0404ec5>] hypervisor_callback+0x3d/0x48
 [<ee581ed3>] htb_rate_timer+0x0/0xbf [sch_htb]
 [<c05f4f47>] _spin_lock_bh+0xf/0x18
 [<ee581eec>] htb_rate_timer+0x19/0xbf [sch_htb]
 [<ee581ed3>] htb_rate_timer+0x0/0xbf [sch_htb]
 [<c0424a25>] run_timer_softirq+0x101/0x15c
 [<c041ffa7>] __do_softirq+0x5e/0xc3
 [<c040679c>] do_softirq+0x56/0xae
 [<c040673d>] do_IRQ+0xa5/0xae
 [<c053a04d>] evtchn_do_upcall+0x64/0x9b
 [<c0404ec5>] hypervisor_callback+0x3d/0x48
 [<c0407fd1>] raw_safe_halt+0x8c/0xaf
 [<c0402bca>] xen_idle+0x22/0x2e
 [<c0402ce9>] cpu_idle+0x91/0xab
 [<c06cc799>] start_kernel+0x381/0x388
 =======================
BUG: soft lockup detected on CPU#3!
The problem still happens with RHEL-5.2.
---
On RHEL-5.0 it takes < 5 min to reproduce for us. On RHEL-5.2 it takes a little longer, ~7-10 minutes.
---
Note that we accelerate the test (and crash) by running iperf in parallel with the test script. So, we load up the DomU/Dom0 network interface with DomU <--> DomU traffic to accelerate the soft-lockup.
---
The hosts are para-virtualized with no pci-device pass-through.
---
Provide issue repro information:

sun-x4200-1.gsslab.rdu.redhat.com   dom0
guest1                              domU
guest2                              domU

Install a RHEL5.2 dom0 and two RHEL5.2 domUs.
Start the iperf test between the domUs:

guest1: # iperf -s
guest2: # iperf -t 0 -c guest1
dom0:   # ./tc-crash.sh
---
Hi there James,

Yeah, the upstream bz#8668 seems to be the same problem here and I believe the patch should apply directly, no rejects. I'm wondering here if you have already tried it and what were the results. Could you let me know?
---
Flavio,

I haven't done anything with the patch as I wasn't certain this was the same thing, although in retrospect I probably could have tried it since I have the reproducer... Anyway, I can try that, but apparently Amazon was having issues with the patch which they don't normally have problems with...

- James
---
Flavio,

So, I installed the kernel with the patch and still received the softlockups... I realize you just hopped on this, but Amazon is looking for a solution soon. I appreciate your help.

- James
---
Hi James,

I've checked your srpm and noticed an old patch applied with some missing stuff. Instead, use this one merged in upstream:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0929c2dd83317813425b937fbc0041013b8685ff

I did some work on it to apply on 2.6.18-92.1.18.el5 which is the version you were working on. Please, replace the old patch with the attached patch and try again. Let me know your results.

thanks,
Flavio
---
Flavio,

Nope, no go... Looks like it may have taken longer to reproduce, but I still received the softlockups...

- James
---
Created attachment 325834: first patch
Created attachment 325835: second patch backported from upstream
Created attachment 325837: rhel-soft-lockups.txt
Flavio,

Just to be clear, what's the status here? It looks like you have two patches, but the last comment from James makes it seem like this isn't entirely sufficient. I just want to see if your patches were finally proved to fix the issue, or if there is further work needed.

Thanks,
Chris Lalancette
Hi Chris,

The first patch didn't fix the problem, so I thought it was because the patch was different from the one merged upstream. I did some work on the upstream version to apply it on 2.6.18-92.1.18.el5, but it still hangs. Now we are working to see if something changed with the patch applied and this is another issue.

In the last test we saw many instances of the message below:

# dmesg | grep BUG | head -n 1
BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)

arch/i386/kernel/smp-xen.c
514 int smp_call_function (void (*func) (void *info), void *info, int nonatomic,
515                         int wait)
...
528         /* Can deadlock when called with interrupts disabled */
529         WARN_ON(irqs_disabled());

Another thing: in the original issue there were two CPUs stuck, but now we only have one CPU stuck, so I believe we have another issue.

James is working to get some sysrq+t and sysrq+w output.

Thoughts?
Flavio
The warning message seems to be triggered by sysrq+w:

Dec 5 09:19:36 sun-x4200-1 kernel: SysRq : Show CPUs
Dec 5 09:19:36 sun-x4200-1 kernel: BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)
Dec 5 09:20:36 sun-x4200-1 kernel: SysRq : Show CPUs
Dec 5 09:20:36 sun-x4200-1 kernel: BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)
Dec 5 09:22:44 sun-x4200-1 kernel: SysRq : Show CPUs
Dec 5 09:22:44 sun-x4200-1 kernel: BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)
...

Flavio
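The observation that every smp_call_function() warning follows a "SysRq : Show CPUs" line can be checked mechanically on a saved log. The Python below is only a sketch of that check: the function name and the LOG sample are made up for illustration, and the sample simply repeats the six log lines quoted in this comment. It assumes the syslog timestamp format seen in that excerpt.

```python
import re

# Syslog timestamp prefix as it appears in the excerpt, e.g. "Dec 5 09:19:36".
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")


def warnings_without_sysrq(lines):
    """Return timestamps of smp_call_function warnings that are NOT directly
    preceded by a SysRq line carrying the same timestamp."""
    unexplained = []
    previous = ""
    for line in lines:
        if "smp-xen.c:529/smp_call_function()" in line:
            stamp = STAMP.match(line)
            prev_stamp = STAMP.match(previous)
            same_second = (
                stamp is not None
                and prev_stamp is not None
                and stamp.group(1) == prev_stamp.group(1)
            )
            if not ("SysRq :" in previous and same_second):
                unexplained.append(stamp.group(1) if stamp else line)
        previous = line
    return unexplained


# Sample input: the log lines quoted in the comment, nothing more.
WARNING = ("kernel: BUG: warning at arch/i386/kernel/smp-xen.c:529/"
           "smp_call_function() (Not tainted)")
LOG = [
    "Dec 5 09:19:36 sun-x4200-1 kernel: SysRq : Show CPUs",
    "Dec 5 09:19:36 sun-x4200-1 " + WARNING,
    "Dec 5 09:20:36 sun-x4200-1 kernel: SysRq : Show CPUs",
    "Dec 5 09:20:36 sun-x4200-1 " + WARNING,
    "Dec 5 09:22:44 sun-x4200-1 kernel: SysRq : Show CPUs",
    "Dec 5 09:22:44 sun-x4200-1 " + WARNING,
]

print(warnings_without_sysrq(LOG))  # prints []
```

An empty result is consistent with the warnings being a side effect of the sysrq request itself; any timestamp in the result would point at a warning that needs a different explanation.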
Created attachment 325918: NET_SCHED-sch_htb-use-generic-estimator.patch

Hi,

The vmcore shows both CPUs stuck on the same stack trace as before. That happened because the patch applied fixed the generic estimator, but sch_htb wasn't using it on RHEL-5. The attached patch comes from upstream and converts sch_htb to use the generic estimator, so _both_ patches need to be applied.

I could reproduce the problem locally and easily in 10 minutes; with both patches it is still running after half an hour without problems.

Can you verify if these two patches work for you?

thanks, (and thanks to James for getting the vmcore)
Flavio
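For readers following the traces in the description: the first one is stuck in htb_destroy -> del_timer_sync -> try_to_del_timer_sync, while the one reported for CPU#0 is stuck in htb_rate_timer -> _spin_lock_bh. del_timer_sync() waits for a running timer handler to finish, so if the caller holds a lock that the handler needs, neither side can make progress. The Python below is a toy user-space model of that pattern only, not kernel code and not a description of what the attached patches do. Which lock the destroy path actually holds is an assumption here, and the function name and timeout are hypothetical.

```python
import threading


def destroy_times_out(hold_lock_while_waiting, wait_timeout=0.5):
    """Return True if the 'destroy' side gave up waiting for the handler.

    Toy model: 'lock' stands in for whatever lock the destroy path holds
    (an assumption), the worker thread for the timer handler, and join()
    for del_timer_sync() waiting until a running handler has finished.
    """
    lock = threading.Lock()
    handler_running = threading.Event()

    def timer_handler():
        handler_running.set()
        with lock:   # like htb_rate_timer -> _spin_lock_bh in the trace
            pass     # a real handler would do its work here

    handler = threading.Thread(target=timer_handler, daemon=True)

    lock.acquire()                 # destroy path takes the lock
    handler.start()
    handler_running.wait()         # handler is now running "on the other CPU"
    if not hold_lock_while_waiting:
        lock.release()             # safe ordering: drop the lock before waiting
    handler.join(wait_timeout)     # wait for the handler to finish
    timed_out = handler.is_alive()
    if hold_lock_while_waiting:
        lock.release()             # only the model can back out of the wait
    handler.join()                 # let the handler finish before returning
    return timed_out


print(destroy_times_out(True))   # True: handler never finished while the lock was held
print(destroy_times_out(False))  # False: handler finished once the lock was free
```

In the model the waiter gives up after a timeout; the kernel paths in the traces have no such timeout, which matches both CPUs being reported by the soft-lockup detector.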
Forgot to mention the upstream commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ee39e10c27ca5293c72addb95bff864095e19904

Flavio
Can we get word from engineering if the patches attached are good and get these changes committed? Thanks!

- James
So what exactly is going on with this bug? I'm looking at the patch that Flavio attached and it doesn't match the upstream commit he mentioned in comment 10. What exactly is it that Amazon has been testing?
There are two patches:
- comment#3 adds the first patch, fixing the gen_estimator deadlock; the upstream link is in comment#1
- comment#9 adds the other patch, converting sch_htb to use the generic estimator; the upstream link is in comment#10

Flavio
Thanks, Flavio. OK, I've gone over it and it looks good to me. I'm a bit concerned about the use of list_empty outside the control of the est_lock, but I don't think it's a catastrophic problem; it'll probably just cause a minor perf degradation. I'll look more closely and pursue it upstream if need be. I've posted this to rhkl. Thanks guys!
in kernel-2.6.18-129.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
Updating PM score.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html