Bug 480939 - RHEL-5: Deadlock in Xen netfront driver.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: All Linux
Priority: low Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Chris Lalancette
QA Contact: Martin Jenner
Keywords: ZStream
Duplicates: 567418
Depends On:
Blocks: 574672
Reported: 2009-01-21 08:56 EST by Ian Campbell
Modified: 2013-01-10 21:30 EST (History)

Doc Type: Bug Fix
Clone Of: 480937
Last Closed: 2009-09-02 04:55:00 EDT


Attachments
xen-unstable.hg 14844:abea8d171503 backported to 2.6.18-92.1.22.el5 (3.81 KB, patch)
2009-01-21 08:57 EST, Ian Campbell
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.18-92.1.22.el5 (1.55 KB, patch)
2009-01-21 08:57 EST, Ian Campbell

Description Ian Campbell 2009-01-21 08:56:09 EST
+++ This bug was initially created as a clone of Bug #480937 +++

Created an attachment (id=329602)
xen-unstable 14844:abea8d171503 backported to 2.6.9-78.0.13.EL

Description of problem:

Some time ago Jeremy Fitzhardinge discovered a couple of potential deadlocks in the Xen netfront code using lockdep on pvops [0]. This was fixed in xen-unstable with 14844:abea8d171503 [1], with an update based on comments by Herbert Xu in 14851:22460cfaca71 [2].

[0] http://lists.xensource.com/archives/html/xen-devel/2007-04/msg00339.html
[1] http://xenbits.xensource.com/xen-unstable.hg?cs=abea8d171503
[2] http://xenbits.xensource.com/xen-unstable.hg?cs=22460cfaca71

We have been seeing very occasional hangs during boot of RHEL 4 and RHEL 5 at the 'Bringing up interface eth0' stage during automated testing for some time now, and recently were able to obtain a backtrace of one:

Bringing up interface eth0:  SysRq : HELP : loglevel0-8 reBoot Crash tErm kIll saK showMem showPc unRaw Sync showTasks Unmount shoWcpus 

SysRq : Show Regs

Pid: 695, comm:                   ip
EIP: 0061:[<c026f5b7>] CPU: 0
EIP is at _spin_lock+0x29/0x34
 EFLAGS: 00000286    Not tainted  (2.6.9-78.0.8.EL.xs5.1.0.39xenU)
EAX: cf1102d0 EBX: cf1102d0 ECX: f5392000 EDX: cf110100
ESI: cf110000 EDI: c0323fa0 EBP: c0323000 DS: 007b ES: 007b
 [<d0881b2b>] netif_poll+0x41/0x64c [xennet]
 [<c01187f0>] __wake_up_common+0x2f/0x4b
 [<c01188d2>] complete+0x24/0x37
 [<c012d9aa>] __rcu_process_callbacks+0xf7/0x110
 [<c021c74b>] net_rx_action+0xde/0x1e1
 [<c01212ac>] __do_softirq+0x64/0xdd
 [<c010a35a>] do_softirq+0x61/0x89
 =======================
 [<c0109ba6>] do_IRQ+0x1a8/0x1b5
 [<c01fc940>] evtchn_do_upcall+0x84/0xb8
 [<c01075c8>] hypervisor_callback+0x2c/0x34
 [<c014007b>] free_pages_bulk+0x12b/0x1d2
 [<c01fc8ba>] force_evtchn_callback+0xa/0xc
 [<d088074c>] network_open+0x10a/0x121 [xennet]
 [<c021b66c>] dev_open+0x2f/0x6c
 [<c021ce31>] dev_change_flags+0x4d/0xf0
 [<c025497b>] devinet_ioctl+0x2ac/0x61e
 [<c025663b>] inet_ioctl+0x77/0xa1
 [<c0213881>] sock_ioctl+0x283/0x2ae
 [<c016c976>] sys_ioctl+0x22c/0x272
 [<c01504c9>] sys_munmap+0x48/0x63
 [<c010740f>] syscall_call+0x7/0xb

Ignoring the spurious entries due to stack pollution (__wake_up_common, complete, __rcu_process_callbacks), this stack trace precisely matches the second issue described by Jeremy:

"rx_lock can also be used in softirq context, so it should be taken/released
   with spin_(un)lock_bh."

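Here network_open() has taken rx_lock with plain spin_lock(). While it still holds the lock, an event channel upcall arrives and netif_poll() runs in softirq context on the same CPU, where it tries to take rx_lock again and spins forever.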
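
As an illustration only, the interleaving in the backtrace can be modelled in userspace. The sketch below is a toy single-CPU model, not the netfront code and not the attached patches: every toy_* name is hypothetical, the "lock" is a plain flag, and a would-be infinite spin is recorded in a flag instead of actually hanging. It only aims to show why disabling bottom halves around the critical section (the spin_lock_bh() approach quoted above) avoids the self-deadlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model. None of these names are kernel APIs. */
struct toy_cpu {
    bool rx_lock_held;    /* stands in for the netfront rx_lock */
    bool bh_disabled;     /* stands in for "bottom halves disabled" */
    bool softirq_pending; /* stands in for a raised NET_RX softirq */
    bool deadlocked;      /* set where real code would spin forever */
    int  polls_completed;
};

/* Stands in for netif_poll(): runs in softirq context, needs rx_lock. */
static void toy_poll(struct toy_cpu *cpu)
{
    if (cpu->rx_lock_held) {
        /* The holder is the code we interrupted on this same CPU,
         * so it can never run again to release the lock. */
        cpu->deadlocked = true;
        return;
    }
    cpu->rx_lock_held = true;
    cpu->polls_completed++;
    cpu->rx_lock_held = false;
}

/* Stands in for an event channel upcall: raises the softirq and runs
 * it on interrupt exit unless bottom halves are disabled. */
static void toy_interrupt(struct toy_cpu *cpu)
{
    cpu->softirq_pending = true;
    if (!cpu->bh_disabled) {
        cpu->softirq_pending = false;
        toy_poll(cpu);
    }
}

/* Stands in for network_open(); use_bh selects the _bh-style locking. */
static void toy_open(struct toy_cpu *cpu, bool use_bh)
{
    assert(!cpu->rx_lock_held);
    if (use_bh)
        cpu->bh_disabled = true;   /* spin_lock_bh-style */
    cpu->rx_lock_held = true;

    toy_interrupt(cpu);            /* upcall lands inside the critical section */

    if (cpu->deadlocked)
        return;                    /* real code would never get this far */

    cpu->rx_lock_held = false;
    if (use_bh) {
        cpu->bh_disabled = false;  /* spin_unlock_bh-style */
        if (cpu->softirq_pending) {
            /* Deferred softirq runs only now, after the lock is free. */
            cpu->softirq_pending = false;
            toy_poll(cpu);
        }
    }
}

/* Returns true if the modelled open path deadlocks. */
static bool toy_run(bool use_bh)
{
    struct toy_cpu cpu = {0};
    toy_open(&cpu, use_bh);
    return cpu.deadlocked;
}
```

In this model toy_run(false) reports the deadlock and toy_run(true) does not, because the poll is deferred until after the lock has been dropped. The real fix is in the attached backports; this sketch makes no claim about their exact contents.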

Version-Release number of selected component (if applicable):

2.6.9-78.0.8.EL

Confirmed by inspection to still be present in 2.6.9-78.0.13.EL and also in RHEL 5 2.6.18-92.1.22.el5 and 2.6.18-128.el5.

How reproducible:

Our automated testing probably does several dozen RHEL 4 and RHEL 5 installs/boots each week, and we've only ever seen this a very small number of times. It seems to be extremely rare and very hard to trigger deliberately; certainly I've been unable to.

--- Additional comment from ijc@hellion.org.uk on 2009-01-21 08:50:05 EDT ---

Created an attachment (id=329603)
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.9-78.0.13.EL
Comment 1 Ian Campbell 2009-01-21 08:57:04 EST
Created attachment 329605
xen-unstable.hg 14844:abea8d171503 backported to 2.6.18-92.1.22.el5
Comment 2 Ian Campbell 2009-01-21 08:57:46 EST
Created attachment 329606
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.18-92.1.22.el5
Comment 3 Chris Lalancette 2009-01-23 05:32:21 EST
I've uploaded a test kernel that contains this fix (along with several others)
to this location:

http://people.redhat.com/clalance/virttest

Could the original reporter try out the test kernels there, and report back if
it fixes the problem?

Thanks,
Chris Lalancette
Comment 4 Ian Campbell 2009-01-23 05:58:34 EST
I'll give it a go but the issue is exceedingly rare so I doubt it would reproduce anyway. I have every confidence in the fix ;-)
Comment 5 Ian Campbell 2009-01-23 06:06:52 EST
Well, FWIW I can confirm that it booted without hanging.
Comment 10 errata-xmlrpc 2009-09-02 04:55:00 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
Comment 12 Bill Burns 2010-02-23 06:19:25 EST
*** Bug 567418 has been marked as a duplicate of this bug. ***
