Bug 474797 - [RHEL 5] gen_estimator deadlock fix
Summary: [RHEL 5] gen_estimator deadlock fix
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Neil Horman
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 481746 483701 485920 546229
Reported: 2008-12-05 12:28 UTC by Flavio Leitner
Modified: 2018-10-27 11:41 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:10:07 UTC
Target Upstream Version:
Embargoed:


Attachments
first patch (3.73 KB, patch), 2008-12-05 12:35 UTC, Flavio Leitner
second patch backported from upstream (5.96 KB, patch), 2008-12-05 12:37 UTC, Flavio Leitner
rhel-soft-lockups.txt (7.80 KB, text/plain), 2008-12-05 12:39 UTC, Flavio Leitner
NET_SCHED-sch_htb-use-generic-estimator.patch (6.09 KB, patch), 2008-12-05 22:14 UTC, Flavio Leitner


Links
System: Red Hat Product Errata
ID: RHSA-2009:1243
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 08:53:34 UTC

Description Flavio Leitner 2008-12-05 12:28:33 UTC
Description of problem: 

There have been numerous CPU soft lockups, which were tracked down to
http://bugzilla.kernel.org/show_bug.cgi?id=8668  

How reproducible: Very, see: http://www.spinics.net/lists/netdev/msg35100.html

Steps to Reproduce: Run tc-crash.sh <http://www.spinics.net/lists/netdev/binpKJ7TGP5BK.bin>

Actual results: CPU Soft-lockup like:

[<c043ea0f>] softlockup_tick+0x98/0xa6
[<c0408b7d>] timer_interrupt+0x504/0x557
[<c043ec43>] handle_IRQ_event+0x27/0x51
[<c043ed00>] __do_IRQ+0x93/0xe8
[<c040672b>] do_IRQ+0x93/0xae
[<c053a04d>] evtchn_do_upcall+0x64/0x9b
[<c0404ec5>] hypervisor_callback+0x3d/0x48
[<c053999c>] force_evtchn_callback+0xa/0xc
[<c0424601>] try_to_del_timer_sync+0x44/0x4a
[<c0424611>] del_timer_sync+0xa/0x14
[<ee5814ad>] htb_destroy+0x20/0x74 [sch_htb]
[<c05aa29e>] qdisc_destroy+0x41/0x8a
[<c05ab9cc>] tc_get_qdisc+0x169/0x1a3
[<c05ab863>] tc_get_qdisc+0x0/0x1a3
[<c05a35e1>] rtnetlink_rcv_msg+0x1b7/0x1dc
[<c05af8a4>] netlink_run_queue+0x63/0xfa
[<c05a342a>] rtnetlink_rcv_msg+0x0/0x1dc
[<c05a33e9>] rtnetlink_rcv+0x25/0x3d
[<c05afd1d>] netlink_data_ready+0xf/0x44
[<c05aed71>] netlink_sendskb+0x19/0x30
[<c05afd01>] netlink_sendmsg+0x277/0x284
[<c0593172>] sock_sendmsg+0xce/0xe8
[<c042cc1d>] autoremove_wake_function+0x0/0x2d
[<c044427d>] find_get_page+0x37/0x3c
[<c0446cde>] filemap_nopage+0x192/0x314
BUG: soft lockup detected on CPU#0!
[<c043ea0f>] softlockup_tick+0x98/0xa6
[<c0408b7d>] timer_interrupt+0x504/0x557
[<c043ec43>] handle_IRQ_event+0x27/0x51
[<c043ed00>] __do_IRQ+0x93/0xe8
[<c040672b>] do_IRQ+0x93/0xae
[<c053a04d>] evtchn_do_upcall+0x64/0x9b
[<ee581ed3>] htb_rate_timer+0x0/0xbf [sch_htb]
[<c0404ec5>] hypervisor_callback+0x3d/0x48
[<ee581ed3>] htb_rate_timer+0x0/0xbf [sch_htb]
[<c05f4f47>] _spin_lock_bh+0xf/0x18
[<ee581eec>] htb_rate_timer+0x19/0xbf [sch_htb]
[<ee581ed3>] htb_rate_timer+0x0/0xbf [sch_htb]
[<c0424a25>] run_timer_softirq+0x101/0x15c
[<c041ffa7>] __do_softirq+0x5e/0xc3
[<c040679c>] do_softirq+0x56/0xae
[<c040673d>] do_IRQ+0xa5/0xae
[<c053a04d>] evtchn_do_upcall+0x64/0x9b
[<c0404ec5>] hypervisor_callback+0x3d/0x48
[<c0407fd1>] raw_safe_halt+0x8c/0xaf
[<c0402bca>] xen_idle+0x22/0x2e
[<c0402ce9>] cpu_idle+0x91/0xab
[<c06cc799>] start_kernel+0x381/0x388
=======================
BUG: soft lockup detected on CPU#3!
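
Read together, the two traces above suggest a wait cycle: one CPU sits in del_timer_sync() (called from htb_destroy), waiting for the timer handler to return, while another CPU sits in htb_rate_timer spinning in _spin_lock_bh. The sketch below is only a user-space model of that pattern, not kernel code. It assumes, without proof from the traces alone, that the destroy path already holds the lock the handler is spinning on. The names cpu_model and destroy_completes are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct cpu_model {
    int  lock_owner;       /* -1 = lock free, otherwise id of the holding CPU */
    bool handler_running;  /* timer handler has fired and not yet returned */
    bool handler_has_lock; /* the handler (CPU1) currently owns the lock */
};

/*
 * Steps two modelled CPUs in lockstep.
 * CPU0 is the destroy path: it waits for the handler to return, in the
 * way del_timer_sync() does. CPU1 is the timer handler: it needs the
 * lock before it can do its work and return.
 * lock_held_during_sync = true models the assumed buggy ordering, where
 * CPU0 takes the lock before it starts waiting.
 * Returns true if CPU0 finishes within max_steps, false if both CPUs
 * are still spinning (a modelled soft lockup).
 */
bool destroy_completes(bool lock_held_during_sync, int max_steps)
{
    struct cpu_model m = { -1, true, false };

    if (lock_held_during_sync)
        m.lock_owner = 0;              /* CPU0 holds the lock while waiting */

    for (int step = 0; step < max_steps; step++) {
        /* CPU1: the timer handler */
        if (m.handler_has_lock) {
            /* work done: drop the lock and return from the handler */
            m.lock_owner = -1;
            m.handler_has_lock = false;
            m.handler_running = false;
        } else if (m.lock_owner == -1) {
            /* lock is free: take it */
            m.lock_owner = 1;
            m.handler_has_lock = true;
        }
        /* otherwise CPU1 keeps spinning on the lock */

        /* CPU0: the sync wait ends once the handler has returned */
        if (!m.handler_running)
            return true;
    }
    return false;                      /* no progress on either CPU */
}
```

In this model, destroy_completes(true, 1000) never makes progress, because each CPU waits for something only the other can release, while destroy_completes(false, 1000) finishes after the handler takes and drops the lock. The model only illustrates the shape of the wait cycle; it says nothing about how the upstream patches actually restructure the code.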

Comment 1 Flavio Leitner 2008-12-05 12:34:30 UTC
The problem still happens with RHEL-5.2.
---
On RHEL-5.0 it takes < 5 min to reproduce for us.
On RHEL-5.2 it takes a little longer ~ 7-10 minutes.  
---
Note that we accelerate the test (and crash) by running iperf in parallel 
with the test script.  So, we load up the DomU/Dom0 network interface with 
DomU <--> DomU traffic to accelerate the soft-lockup.
---
The hosts are para-virtualized with no pci-device pass-through.
---

Provide issue repro information:
        
sun-x4200-1.gsslab.rdu.redhat.com dom0
guest1 domU
guest2 domU

Install a RHEL 5.2 dom0 and two RHEL 5.2 domUs

Start iperf test between domU's

guest1: # iperf -s
guest2: # iperf -t 0 -c guest1
dom0:   # ./tc-crash.sh
---

Hi there James,

Yeah, the upstream bz#8668 seems to be the same problem here and I believe the
patch should apply directly, no rejects.

I'm wondering here if you have already tried it and what were the results.
Could you let me know?
---
Flavio, I haven't done anything with the patch as I wasn't certain this was the same thing, although in retrospect I probably could have tried it since I have the reproducer... Anyway, I can try that, but apparently Amazon was having issues with the patch which they don't normally have problems with...

- James
---
Flavio, So, I installed the kernel with the patch and still received the softlockups... I realize you just hopped on this, but Amazon is looking for a solution soon. I appreciate your help.

- James
---
Hi James,

I've checked your srpm and noticed an old patch applied with some missing
stuff. Instead, use this one merged in upstream:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0929c2dd83317813425b937fbc0041013b8685ff

I did some work on it to apply on 2.6.18-92.1.18.el5 which is the version you
were working on.

Please, replace the old patch with the attached patch and try again.
Let me know your results.

thanks,
Flavio
---
Flavio, Nope, no go... Looks like it may have taken longer to reproduce, but I still received the softlockups...

- James
---

Comment 2 Flavio Leitner 2008-12-05 12:35:46 UTC
Created attachment 325834 [details]
first patch

Comment 3 Flavio Leitner 2008-12-05 12:37:38 UTC
Created attachment 325835 [details]
second patch backported from upstream

Comment 4 Flavio Leitner 2008-12-05 12:39:19 UTC
Created attachment 325837 [details]
rhel-soft-lockups.txt

Comment 6 Chris Lalancette 2008-12-05 14:00:42 UTC
Flavio,
     Just to be clear, what's the status here?  It looks like you have two patches, but the last comment from James makes it seem like this isn't entirely sufficient.  I just want to see if your patches were finally proved to fix the issue, or if there is further work needed.

Thanks,
Chris Lalancette

Comment 7 Flavio Leitner 2008-12-05 16:25:12 UTC
Hi Chris,

The first patch didn't fix the problem, so I thought it was because the patch
was different from the one merged upstream. I did some work on the
upstream version so that it applies to 2.6.18-92.1.18.el5, but it still hangs.

Now we are working to see whether something changed with the patch applied and
this is a different issue. In the last test we saw many instances of the message below:

 # dmesg | grep BUG | head -n 1
BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)

arch/i386/kernel/smp-xen.c
514 int smp_call_function (void (*func) (void *info), void *info, int nonatomic,
515                         int wait)
528         /* Can deadlock when called with interrupts disabled */
529         WARN_ON(irqs_disabled());

Another thing is that in the original issue two CPUs were stuck, but now 
only one CPU is stuck, so I believe we have another issue. James is working 
to get some sysrq+t and sysrq+w output.

thoughts?

Flavio

Comment 8 Flavio Leitner 2008-12-05 16:40:22 UTC
The warning message seems to be triggered by sysrq+w
Dec  5 09:19:36 sun-x4200-1 kernel: SysRq : Show CPUs
Dec  5 09:19:36 sun-x4200-1 kernel: BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)

Dec  5 09:20:36 sun-x4200-1 kernel: SysRq : Show CPUs
Dec  5 09:20:36 sun-x4200-1 kernel: BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)

Dec  5 09:22:44 sun-x4200-1 kernel: SysRq : Show CPUs
Dec  5 09:22:44 sun-x4200-1 kernel: BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)
...

Flavio

Comment 9 Flavio Leitner 2008-12-05 22:14:56 UTC
Created attachment 325918 [details]
NET_SCHED-sch_htb-use-generic-estimator.patch

Hi,

The vmcore shows both CPUs stuck on the same stack trace as before. That 
happened because the patch applied fixed the generic estimator, but sch_htb 
wasn't using it on RHEL-5. The attached patch comes from upstream and converts
sch_htb to use the generic estimator, so _both_ patches need to be applied.

I could reproduce the problem locally and easily in 10 minutes; now it is
still running after half an hour without problems.

Can you verify whether these two patches work for you?
thanks,

(and thanks to James for getting the vmcore)
Flavio

Comment 10 Flavio Leitner 2008-12-05 22:17:57 UTC
Forgot to mention the upstream commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ee39e10c27ca5293c72addb95bff864095e19904

Flavio

Comment 13 James G. Brown III 2008-12-09 14:25:22 UTC
Can we get word from engineering if the patches attached are good and get these changes committed? Thanks!

- James

Comment 16 Neil Horman 2009-01-08 19:35:11 UTC
So what exactly is going on with this bug? I'm looking at the patch that Flavio attached and it doesn't match the upstream commit he mentioned in comment 10.  What exactly is it that Amazon has been testing?

Comment 17 Flavio Leitner 2009-01-08 20:11:56 UTC
There are two patches:

- comment#3 adds the first patch, fixing the gen_estimator deadlock; the 
  upstream link is in comment#1

- comment#9 adds the other patch, converting sch_htb to use the generic
  estimator; the upstream link is in comment#10

Flavio

Comment 18 Neil Horman 2009-01-09 01:30:27 UTC
Thanks, Flavio.  OK, I've gone over it and it looks good to me.  I'm a bit concerned about the use of list_empty outside the control of the est_lock, but I don't think it's a catastrophic problem; it'll probably just cause a minor perf degradation.  I'll look more closely and pursue it upstream if need be.  I've posted this to rhkl.  Thanks guys!

Comment 24 Don Zickus 2009-01-27 16:02:26 UTC
in kernel-2.6.18-129.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 25 RHEL Program Management 2009-02-16 15:09:15 UTC
Updating PM score.

Comment 30 errata-xmlrpc 2009-09-02 08:10:07 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

