+++ This bug was initially created as a clone of Bug #480937 +++

Created an attachment (id=329602)
xen-unstable 14844:abea8d171503 backported to 2.6.9-78.0.13.EL

Description of problem:

Some time ago Jeremy Fitzhardinge discovered a couple of potential deadlocks in the Xen netfront code using lockdep on pvops [0]. This was fixed in xen-unstable with 14844:abea8d171503 [1], with an update based on comments by Herbert Xu in 14851:22460cfaca71 [2].

[0] http://lists.xensource.com/archives/html/xen-devel/2007-04/msg00339.html
[1] http://xenbits.xensource.com/xen-unstable.hg?cs=abea8d171503
[2] http://xenbits.xensource.com/xen-unstable.hg?cs=22460cfaca71

For some time now we have been seeing very occasional hangs during automated testing of RHEL 4 and RHEL 5, at the 'Bringing up interface eth0' stage of the boot. Recently we were able to obtain a backtrace of one:

Bringing up interface eth0:
SysRq : HELP : loglevel0-8 reBoot Crash tErm kIll saK showMem showPc unRaw Sync showTasks Unmount shoWcpus
SysRq : Show Regs

Pid: 695, comm: ip
EIP: 0061:[<c026f5b7>] CPU: 0
EIP is at _spin_lock+0x29/0x34
 EFLAGS: 00000286 Not tainted (2.6.9-78.0.8.EL.xs5.1.0.39xenU)
EAX: cf1102d0 EBX: cf1102d0 ECX: f5392000 EDX: cf110100
ESI: cf110000 EDI: c0323fa0 EBP: c0323000 DS: 007b ES: 007b
 [<d0881b2b>] netif_poll+0x41/0x64c [xennet]
 [<c01187f0>] __wake_up_common+0x2f/0x4b
 [<c01188d2>] complete+0x24/0x37
 [<c012d9aa>] __rcu_process_callbacks+0xf7/0x110
 [<c021c74b>] net_rx_action+0xde/0x1e1
 [<c01212ac>] __do_softirq+0x64/0xdd
 [<c010a35a>] do_softirq+0x61/0x89
 =======================
 [<c0109ba6>] do_IRQ+0x1a8/0x1b5
 [<c01fc940>] evtchn_do_upcall+0x84/0xb8
 [<c01075c8>] hypervisor_callback+0x2c/0x34
 [<c014007b>] free_pages_bulk+0x12b/0x1d2
 [<c01fc8ba>] force_evtchn_callback+0xa/0xc
 [<d088074c>] network_open+0x10a/0x121 [xennet]
 [<c021b66c>] dev_open+0x2f/0x6c
 [<c021ce31>] dev_change_flags+0x4d/0xf0
 [<c025497b>] devinet_ioctl+0x2ac/0x61e
 [<c025663b>] inet_ioctl+0x77/0xa1
 [<c0213881>] sock_ioctl+0x283/0x2ae
 [<c016c976>] sys_ioctl+0x22c/0x272
 [<c01504c9>] sys_munmap+0x48/0x63
 [<c010740f>] syscall_call+0x7/0xb

Ignoring the spurious entries due to stack pollution (__wake_up_common, complete, __rcu_process_callbacks), this stack trace precisely matches the second issue described by Jeremy: "rx_lock can also be used in softirq context, so it should be taken/released with spin_(un)lock_bh."

Here network_open() has taken rx_lock with plain spin_lock(). While it still holds the lock, an event channel upcall interrupts it, and on the way out of the interrupt netif_poll() is called in softirq context and tries to take rx_lock again. Since the lock holder cannot run until the softirq returns, netif_poll() spins forever.

Version-Release number of selected component (if applicable):

2.6.9-78.0.8.EL

Confirmed by inspection to still be present in 2.6.9-78.0.13.EL and also in RHEL 5 2.6.18-92.1.22.el5 and 2.6.18-128.el5.

How reproducible:

Our automated testing probably does several dozen RHEL 4 and RHEL 5 installs/boots each week, and we've seen this only a very small number of times ever. It seems to be extremely rare and very hard to trigger deliberately; I've certainly been unable to.

--- Additional comment from ijc.uk on 2009-01-21 08:50:05 EDT ---

Created an attachment (id=329603)
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.9-78.0.13.EL
Created attachment 329605
xen-unstable.hg 14844:abea8d171503 backported to 2.6.18-92.1.22.el5
Created attachment 329606
xen-unstable.hg 14851:22460cfaca71 backported to 2.6.18-92.1.22.el5
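For readers who want to see why the plain spin_lock() in network_open() can hang and why the spin_(un)lock_bh variants quoted in the description avoid it, here is a minimal user-space sketch. It is a toy single-CPU model, not the netfront code and not the contents of the attached patches: every toy_* name, the bh_disabled counter, and the deadlocked flag are hypothetical stand-ins invented for illustration. Instead of actually spinning, the model records that the softirq handler would have spun forever.

```c
#include <stdbool.h>

/* Toy single-CPU model. All names are hypothetical, not kernel APIs. */
struct toy_lock { bool held; };

static struct toy_lock rx_lock;
static int  bh_disabled;      /* > 0 means softirqs are deferred        */
static bool softirq_pending;  /* softirq raised but not yet run         */
static bool deadlocked;       /* set when the poll handler would spin   */

/* Stand-in for netif_poll(): runs in softirq context, needs rx_lock. */
static void toy_poll(void)
{
    if (rx_lock.held) {
        /* A real spin_lock() would spin here forever: the holder is
         * the interrupted code on this same CPU and cannot run. */
        deadlocked = true;
        return;
    }
    rx_lock.held = true;
    /* ... receive processing would happen here ... */
    rx_lock.held = false;
}

/* Stand-in for an event channel upcall: raises the softirq and runs it
 * on interrupt exit unless bottom halves are currently disabled. */
static void toy_interrupt(void)
{
    softirq_pending = true;
    if (bh_disabled == 0) {
        softirq_pending = false;
        toy_poll();
    }
}

/* Re-enable bottom halves and run any softirq deferred in the meantime. */
static void toy_bh_enable(void)
{
    if (--bh_disabled == 0 && softirq_pending) {
        softirq_pending = false;
        toy_poll();
    }
}

/* Stand-in for network_open(). With use_bh == false it models plain
 * spin_lock(); with use_bh == true it models spin_lock_bh().
 * Returns true if the softirq handler would have deadlocked. */
static bool toy_open(bool use_bh)
{
    deadlocked = false;
    if (use_bh)
        bh_disabled++;        /* spin_lock_bh: disable BHs first     */
    rx_lock.held = true;

    toy_interrupt();          /* upcall arrives while lock is held   */

    rx_lock.held = false;
    if (use_bh)
        toy_bh_enable();      /* spin_unlock_bh: deferred poll runs  */
    return deadlocked;
}
```

In this model, toy_open(false) reports a deadlock because the poll handler runs while rx_lock is still held, whereas toy_open(true) defers the poll until after the lock has been released. The real race also needs the interrupt to land inside the short window where the lock is held, which is consistent with how rarely the hang was seen.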
I've uploaded a test kernel that contains this fix (along with several others) to this location: http://people.redhat.com/clalance/virttest Could the original reporter try out the test kernels there, and report back if it fixes the problem? Thanks, Chris Lalancette
I'll give it a go but the issue is exceedingly rare so I doubt it would reproduce anyway. I have every confidence in the fix ;-)
Well, FWIW I can confirm that it booted without hanging.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html
*** Bug 567418 has been marked as a duplicate of this bug. ***