Bug 480937 - RHEL-4: Deadlock in Xen netfront driver.
Summary: RHEL-4: Deadlock in Xen netfront driver.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel-xen
Version: 4.7.z
Hardware: All
OS: Linux
low
medium
Target Milestone: rc
: ---
Assignee: Andrew Jones
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 458302
Reported: 2009-01-21 13:49 UTC by Ian Campbell
Modified: 2011-02-16 16:03 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 480939 (view as bug list)
Environment:
Last Closed: 2011-02-16 16:03:40 UTC
Target Upstream Version:
Embargoed:


Attachments
xen-unstable 14844:abea8d171503 backported to 2.6.9-78.0.13.EL (3.80 KB, patch)
2009-01-21 13:49 UTC, Ian Campbell
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.9-78.0.13.EL (1.54 KB, patch)
2009-01-21 13:50 UTC, Ian Campbell


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0263 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update 2011-02-16 15:14:55 UTC

Description Ian Campbell 2009-01-21 13:49:14 UTC
Created attachment 329602 [details]
xen-unstable 14844:abea8d171503 backported to 2.6.9-78.0.13.EL

Description of problem:

Some time ago Jeremy Fitzhardinge discovered a couple of potential deadlocks in the Xen netfront code using lockdep on pvops [0]. These were fixed in xen-unstable with 14844:abea8d171503 [1], with an update based on comments by Herbert Xu in 14851:22460cfaca71 [2].

[0] http://lists.xensource.com/archives/html/xen-devel/2007-04/msg00339.html
[1] http://xenbits.xensource.com/xen-unstable.hg?cs=abea8d171503
[2] http://xenbits.xensource.com/xen-unstable.hg?cs=22460cfaca71

We have been seeing very occasional hangs during boot of RHEL 4 and RHEL 5 at the 'Bringing up interface eth0' stage during automated testing for some time now, and recently we were able to obtain a backtrace of one:

Bringing up interface eth0:  SysRq : HELP : loglevel0-8 reBoot Crash tErm kIll saK showMem showPc unRaw Sync showTasks Unmount shoWcpus 

SysRq : Show Regs

Pid: 695, comm:                   ip
EIP: 0061:[<c026f5b7>] CPU: 0
EIP is at _spin_lock+0x29/0x34
 EFLAGS: 00000286    Not tainted  (2.6.9-78.0.8.EL.xs5.1.0.39xenU)
EAX: cf1102d0 EBX: cf1102d0 ECX: f5392000 EDX: cf110100
ESI: cf110000 EDI: c0323fa0 EBP: c0323000 DS: 007b ES: 007b
 [<d0881b2b>] netif_poll+0x41/0x64c [xennet]
 [<c01187f0>] __wake_up_common+0x2f/0x4b
 [<c01188d2>] complete+0x24/0x37
 [<c012d9aa>] __rcu_process_callbacks+0xf7/0x110
 [<c021c74b>] net_rx_action+0xde/0x1e1
 [<c01212ac>] __do_softirq+0x64/0xdd
 [<c010a35a>] do_softirq+0x61/0x89
 =======================
 [<c0109ba6>] do_IRQ+0x1a8/0x1b5
 [<c01fc940>] evtchn_do_upcall+0x84/0xb8
 [<c01075c8>] hypervisor_callback+0x2c/0x34
 [<c014007b>] free_pages_bulk+0x12b/0x1d2
 [<c01fc8ba>] force_evtchn_callback+0xa/0xc
 [<d088074c>] network_open+0x10a/0x121 [xennet]
 [<c021b66c>] dev_open+0x2f/0x6c
 [<c021ce31>] dev_change_flags+0x4d/0xf0
 [<c025497b>] devinet_ioctl+0x2ac/0x61e
 [<c025663b>] inet_ioctl+0x77/0xa1
 [<c0213881>] sock_ioctl+0x283/0x2ae
 [<c016c976>] sys_ioctl+0x22c/0x272
 [<c01504c9>] sys_munmap+0x48/0x63
 [<c010740f>] syscall_call+0x7/0xb

Ignoring the spurious entries due to stack pollution (__wake_up_common, complete, __rcu_process_callbacks), this stack trace precisely matches the second issue described by Jeremy:

"rx_lock can also be used in softirq context, so it should be taken/released
   with spin_(un)lock_bh."

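Here network_open() has taken rx_lock with plain spin_lock(), which leaves softirqs enabled. An event channel upcall then arrives on the same CPU, netif_poll() is called in softirq context on top of network_open(), and it tries to take rx_lock again. The lock holder cannot run until netif_poll() returns, so netif_poll() spins forever.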
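The interaction can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it is not the netfront code and not the backported patch, and every toy_* name below is hypothetical. It models one CPU, one non-recursive lock standing in for rx_lock, and a softirq that runs on interrupt exit unless bottom halves are disabled. A real spinlock would spin forever in the failing case; the model records a flag instead so the program can terminate.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model. None of these names are real kernel APIs. */
struct toy_cpu {
    bool rx_lock_held;     /* stands in for the netfront rx_lock */
    bool bh_disabled;      /* true while bottom halves are disabled */
    bool softirq_pending;  /* softirq raised but not yet run */
    bool deadlocked;       /* poll handler found the lock held by its own CPU */
    int  polls_completed;  /* number of times the poll handler finished */
};

/* Stands in for netif_poll(): needs rx_lock in softirq context. */
static void toy_poll(struct toy_cpu *cpu)
{
    if (cpu->rx_lock_held) {
        /* The owner is the code we interrupted; it cannot release the
         * lock until we return, so a real spinlock would never succeed. */
        cpu->deadlocked = true;
        return;
    }
    cpu->rx_lock_held = true;
    cpu->polls_completed++;
    cpu->rx_lock_held = false;
}

/* An interrupt raises the softirq; it runs on exit unless BHs are off. */
static void toy_interrupt(struct toy_cpu *cpu)
{
    cpu->softirq_pending = true;
    if (!cpu->bh_disabled) {
        cpu->softirq_pending = false;
        toy_poll(cpu);
    }
}

static void toy_spin_lock(struct toy_cpu *cpu)
{
    assert(!cpu->rx_lock_held);  /* the lock is not recursive */
    cpu->rx_lock_held = true;
}

static void toy_spin_unlock(struct toy_cpu *cpu)
{
    cpu->rx_lock_held = false;
}

/* Models the _bh variants: disable bottom halves around the lock. */
static void toy_spin_lock_bh(struct toy_cpu *cpu)
{
    cpu->bh_disabled = true;
    toy_spin_lock(cpu);
}

static void toy_spin_unlock_bh(struct toy_cpu *cpu)
{
    toy_spin_unlock(cpu);
    cpu->bh_disabled = false;
    if (cpu->softirq_pending) {  /* deferred softirq runs now */
        cpu->softirq_pending = false;
        toy_poll(cpu);
    }
}

/* Stands in for network_open(): an event arrives while rx_lock is held. */
static struct toy_cpu toy_network_open(bool use_bh)
{
    struct toy_cpu cpu = {0};

    if (use_bh)
        toy_spin_lock_bh(&cpu);
    else
        toy_spin_lock(&cpu);

    toy_interrupt(&cpu);  /* e.g. the upcall seen in the backtrace */

    if (use_bh)
        toy_spin_unlock_bh(&cpu);
    else
        toy_spin_unlock(&cpu);

    return cpu;
}
```

In this model, toy_network_open(false) ends with deadlocked set and no completed poll, while toy_network_open(true) defers the poll until the lock has been released and completes it once. That is the behaviour the spin_(un)lock_bh change quoted above is meant to provide.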

Version-Release number of selected component (if applicable):

2.6.9-78.0.8.EL

Confirmed by inspection to still be present in 2.6.9-78.0.13.EL and also in RHEL 5 2.6.18-92.1.22.el5 and 2.6.18-128.el5.

How reproducible:

Our automated testing probably does several dozen RHEL 4 and RHEL 5 installs/boots each week, and we have only ever seen this a very small number of times. It seems to be extremely rare and very hard to trigger deliberately; I have certainly been unable to do so.

Comment 1 Ian Campbell 2009-01-21 13:50:05 UTC
Created attachment 329603 [details]
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.9-78.0.13.EL

Comment 6 Andrew Jones 2009-07-01 18:27:20 UTC
This is a difficult bug to recreate, but the proposed patches have been integrated into a test build at http://people.redhat.com/drjones/virttest/1-2/. The build is available for anyone who has seen the bug and would like to test the patches to see if it goes away.

Comment 8 RHEL Program Management 2010-10-12 17:51:15 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Vivek Goyal 2010-10-13 16:11:38 UTC
Committed in 89.42.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 12 errata-xmlrpc 2011-02-16 16:03:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html

