Created attachment 329602 [details]
xen-unstable 14844:abea8d171503 backported to 2.6.9-78.0.13.EL

Description of problem:

Some time ago Jeremy Fitzhardinge discovered a couple of potential deadlocks in the Xen netfront code using lockdep on pvops [0]. This was fixed in xen-unstable with 14844:abea8d171503 [1], with an update based on comments by Herbert Xu in 14851:22460cfaca71 [2].

[0] http://lists.xensource.com/archives/html/xen-devel/2007-04/msg00339.html
[1] http://xenbits.xensource.com/xen-unstable.hg?cs=abea8d171503
[2] http://xenbits.xensource.com/xen-unstable.hg?cs=22460cfaca71

For some time now we have been seeing very occasional hangs during boot of RHEL 4 and RHEL 5, at the 'Bringing up interface eth0' stage, during automated testing. We were recently able to obtain a backtrace of one:

Bringing up interface eth0:
SysRq : HELP : loglevel0-8 reBoot Crash tErm kIll saK showMem showPc unRaw Sync showTasks Unmount shoWcpus
SysRq : Show Regs

Pid: 695, comm: ip
EIP: 0061:[<c026f5b7>] CPU: 0
EIP is at _spin_lock+0x29/0x34
 EFLAGS: 00000286 Not tainted (2.6.9-78.0.8.EL.xs5.1.0.39xenU)
EAX: cf1102d0 EBX: cf1102d0 ECX: f5392000 EDX: cf110100
ESI: cf110000 EDI: c0323fa0 EBP: c0323000 DS: 007b ES: 007b
 [<d0881b2b>] netif_poll+0x41/0x64c [xennet]
 [<c01187f0>] __wake_up_common+0x2f/0x4b
 [<c01188d2>] complete+0x24/0x37
 [<c012d9aa>] __rcu_process_callbacks+0xf7/0x110
 [<c021c74b>] net_rx_action+0xde/0x1e1
 [<c01212ac>] __do_softirq+0x64/0xdd
 [<c010a35a>] do_softirq+0x61/0x89
 =======================
 [<c0109ba6>] do_IRQ+0x1a8/0x1b5
 [<c01fc940>] evtchn_do_upcall+0x84/0xb8
 [<c01075c8>] hypervisor_callback+0x2c/0x34
 [<c014007b>] free_pages_bulk+0x12b/0x1d2
 [<c01fc8ba>] force_evtchn_callback+0xa/0xc
 [<d088074c>] network_open+0x10a/0x121 [xennet]
 [<c021b66c>] dev_open+0x2f/0x6c
 [<c021ce31>] dev_change_flags+0x4d/0xf0
 [<c025497b>] devinet_ioctl+0x2ac/0x61e
 [<c025663b>] inet_ioctl+0x77/0xa1
 [<c0213881>] sock_ioctl+0x283/0x2ae
 [<c016c976>] sys_ioctl+0x22c/0x272
 [<c01504c9>] sys_munmap+0x48/0x63
 [<c010740f>] syscall_call+0x7/0xb

Ignoring the spurious entries due to stack pollution (__wake_up_common, complete, __rcu_process_callbacks), this stack trace precisely matches the second issue described by Jeremy:

"rx_lock can also be used in softirq context, so it should be taken/released with spin_(un)lock_bh."

Here network_open() has taken rx_lock with plain spin_lock(). While it still holds the lock, netif_poll() is called in softirq context on the same CPU and tries to take rx_lock again, so it spins forever.

Version-Release number of selected component (if applicable):

2.6.9-78.0.8.EL

Confirmed by inspection to still be present in 2.6.9-78.0.13.EL, and also in RHEL 5 2.6.18-92.1.22.el5 and 2.6.18-128.el5.

How reproducible:

Our automated testing probably does several dozen RHEL 4 and RHEL 5 installs/boots each week, and we have only ever seen this a very small number of times. It seems to be extremely rare and very hard to trigger deliberately; I have certainly been unable to.
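To make the failure mode in the description easier to follow, here is a minimal userspace sketch of the locking pattern. It is a toy single-CPU model, not the netfront code and not the backported patch: every name in it (toy_cpu, toy_open, toy_poll and so on) is hypothetical, and it only assumes what the description states, namely that the open path holds rx_lock when an upcall arrives and that the poll path takes the same lock in softirq context.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model. NOT kernel code; all names are hypothetical. */
struct toy_cpu {
    bool rx_lock_held;     /* stands in for rx_lock */
    bool bh_disabled;      /* stands in for "bottom halves disabled" */
    bool softirq_pending;  /* a raised but not yet run RX softirq */
    int  polls_done;       /* completed poll passes */
    bool deadlocked;       /* set where a real kernel would spin forever */
};

/* Models the poll path: it needs rx_lock. If the context it interrupted
 * already holds the lock on this CPU, a real spinlock never becomes free. */
static void toy_poll(struct toy_cpu *c)
{
    if (c->rx_lock_held) {
        c->deadlocked = true;
        return;
    }
    c->rx_lock_held = true;
    c->polls_done++;
    c->rx_lock_held = false;
}

/* Models an upcall: raise the softirq and run it on interrupt exit,
 * unless bottom halves are currently disabled. */
static void toy_interrupt(struct toy_cpu *c)
{
    c->softirq_pending = true;
    if (!c->bh_disabled) {
        c->softirq_pending = false;
        toy_poll(c);
    }
}

/* Models re-enabling bottom halves: deferred softirqs run now. */
static void toy_bh_enable(struct toy_cpu *c)
{
    c->bh_disabled = false;
    if (c->softirq_pending) {
        c->softirq_pending = false;
        toy_poll(c);
    }
}

/* Models the open path. use_bh selects the spin_lock_bh()-style variant
 * over the plain spin_lock()-style one. Returns true if no deadlock. */
static bool toy_open(struct toy_cpu *c, bool use_bh)
{
    assert(!c->rx_lock_held);   /* precondition: lock is free on entry */
    if (use_bh)
        c->bh_disabled = true;
    c->rx_lock_held = true;
    toy_interrupt(c);           /* upcall lands inside the critical section */
    c->rx_lock_held = false;
    if (use_bh)
        toy_bh_enable(c);
    return !c->deadlocked;
}
```

In this model, calling toy_open() with use_bh set to false reports the deadlock, because the poll runs while the lock is still held. With use_bh set to true the poll is deferred until after the lock is dropped and completes once. That is the behaviour the quoted comment asks for by using spin_(un)lock_bh on rx_lock.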
Created attachment 329603 [details] xen-unstable.hg 14851:22460cfaca71 backported to 2.6.9-78.0.13.EL
This is a difficult bug to recreate, but the proposed patches have been integrated into a test build at http://people.redhat.com/drjones/virttest/1-2/. The build is available for anyone who has seen the bug and would like to test the patches to see if it goes away.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.42.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html