Red Hat Bugzilla – Bug 480939
RHEL-5: Deadlock in Xen netfront driver.
Last modified: 2013-01-10 21:30:11 EST
+++ This bug was initially created as a clone of Bug #480937 +++
Created an attachment (id=329602)
xen-unstable 14844:abea8d171503 backported to 2.6.9-78.0.13.EL
Description of problem:
Some time ago Jeremy Fitzhardinge discovered a couple of potential deadlocks in the Xen netfront code using lockdep on pvops. These were fixed in xen-unstable with 14844:abea8d171503, with an update based on comments by Herbert Xu in 14851:22460cfaca71.
For some time now we have been seeing very occasional hangs during boot of RHEL 4 and RHEL 5, at the 'Bringing up interface eth0' stage, during automated testing. Recently we were able to obtain a backtrace of one:
Bringing up interface eth0: SysRq : HELP : loglevel0-8 reBoot Crash tErm kIll saK showMem showPc unRaw Sync showTasks Unmount shoWcpus
SysRq : Show Regs
Pid: 695, comm: ip
EIP: 0061:[<c026f5b7>] CPU: 0
EIP is at _spin_lock+0x29/0x34
EFLAGS: 00000286 Not tainted (2.6.9-78.0.8.EL.xs220.127.116.11xenU)
EAX: cf1102d0 EBX: cf1102d0 ECX: f5392000 EDX: cf110100
ESI: cf110000 EDI: c0323fa0 EBP: c0323000 DS: 007b ES: 007b
[<d0881b2b>] netif_poll+0x41/0x64c [xennet]
[<d088074c>] network_open+0x10a/0x121 [xennet]
Ignoring the spurious entries due to stack pollution (__wake_up_common, complete, __rcu_process_callbacks), this stack trace precisely matches the second issue described by Jeremy:
"rx_lock can also be used in softirq context, so it should be taken/released
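Here network_open() has taken rx_lock with a plain spin_lock(). netif_poll() is then called in softirq context on the same CPU, on top of network_open(), and tries to take rx_lock again, so it spins forever on a lock that its own CPU already holds.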
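The sequence in the backtrace can be illustrated with a small user-space model. This is only a sketch: struct toy_cpu, toy_poll(), toy_run_softirq() and toy_open() are hypothetical names invented for the illustration, not netfront code, and the use_bh path merely stands in for the kernel's softirq-disabling lock variants (spin_lock_bh()/spin_unlock_bh()), which is presumably the kind of change the attached patches make. The patches themselves are authoritative.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model of the hang; hypothetical names, not kernel code. */
struct toy_cpu {
    bool rx_lock_held;     /* stands in for the netfront rx_lock */
    int  bh_disable_count; /* > 0 means softirqs are deferred */
    bool softirq_pending;  /* an RX softirq has been raised */
    bool deadlocked;       /* poll found the lock held by its own CPU */
    int  polls_done;       /* polls that completed normally */
};

/* Stands in for netif_poll() running in softirq context. */
static void toy_poll(struct toy_cpu *cpu)
{
    if (cpu->rx_lock_held) {
        /* A real spin_lock() would spin here forever: the holder is the
         * interrupted code on this same CPU, which cannot run again. */
        cpu->deadlocked = true;
        return;
    }
    cpu->rx_lock_held = true;
    cpu->polls_done++;
    cpu->rx_lock_held = false;
}

/* Run the pending softirq, unless bottom halves are disabled. */
static void toy_run_softirq(struct toy_cpu *cpu)
{
    if (cpu->softirq_pending && cpu->bh_disable_count == 0) {
        cpu->softirq_pending = false;
        toy_poll(cpu);
    }
}

/* Stands in for network_open(); use_bh selects the softirq-safe locking. */
static void toy_open(struct toy_cpu *cpu, bool use_bh)
{
    if (use_bh)
        cpu->bh_disable_count++;
    cpu->rx_lock_held = true;

    /* An interrupt arrives while the lock is held and raises the softirq. */
    cpu->softirq_pending = true;
    toy_run_softirq(cpu);

    cpu->rx_lock_held = false;
    if (use_bh) {
        assert(cpu->bh_disable_count > 0);
        cpu->bh_disable_count--;
    }
    toy_run_softirq(cpu); /* deferred work runs once it is safe */
}
```

Calling toy_open() on a zero-initialised struct toy_cpu with use_bh set to false leaves deadlocked set, mirroring the hang. With use_bh set to true the poll is deferred until the lock has been dropped and completes normally. The model also hints at why the hang is so rare: the interrupt has to arrive inside the short window in which the lock is held.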
Version-Release number of selected component (if applicable):
Confirmed by inspection to still be present in 2.6.9-78.0.13.EL and also in RHEL 5 2.6.18-92.1.22.el5 and 2.6.18-128.el5.
Our automated testing probably does several dozen RHEL 4 and RHEL 5 installs/boots each week, and we have seen this only a very small number of times in total. It seems to be extremely rare and very hard to trigger deliberately; I have certainly been unable to.
--- Additional comment from email@example.com on 2009-01-21 08:50:05 EDT ---
Created an attachment (id=329603)
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.9-78.0.13.EL
Created attachment 329605 [details]
xen-unstable.hg 14844:abea8d171503 backported to 2.6.18-92.1.22.el5
Created attachment 329606 [details]
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.18-92.1.22.el5
I've uploaded a test kernel that contains this fix (along with several others)
to this location:
Could the original reporter try out the test kernels there, and report back
whether they fix the problem?
I'll give it a go, but the issue is exceedingly rare, so I doubt it would reproduce anyway. I have every confidence in the fix ;-)
Well, FWIW I can confirm that it booted without hanging.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
*** Bug 567418 has been marked as a duplicate of this bug. ***