Description of problem:
When removing a slave tg3 interface (and most probably others, see below) with VLAN support from a bond, a "scheduling while atomic" bug is triggered and shortly afterwards the system deadlocks.

Version-Release number of selected component (if applicable):
kernel 2.6.18-194

How reproducible:
Easily, just follow the steps.

Steps to Reproduce:
1. Set up a bond, bond1, with 2 interfaces that use the tg3 driver.
2. Set up a VLAN interface, bond1.2, on bond1.
3. Remove any interface from the bond.

Actual results:
The machine printk()s from schedule() that we're "scheduling while atomic" and freezes shortly afterwards.

Expected results:
Works as expected.

Additional info:
BUG: scheduling while atomic: ifdown-eth/0x00000100/16060

Call Trace:
 [<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
 [<ffffffff8003ddd5>] lock_timer_base+0x1b/0x3c
 [<ffffffff8001cb9f>] __mod_timer+0x100/0x10f
 [<ffffffff800648ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80098e91>] process_timeout+0x0/0x5
 [<ffffffff882f25b2>] :tg3:tg3_napi_disable+0x2a/0x41
 [<ffffffff882fb437>] :tg3:tg3_vlan_rx_register+0x46/0x10a
 [<ffffffff8879da37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
 [<ffffffff8879fd27>] :bonding:bond_release+0x2e2/0x3e8
 [<ffffffff800655ab>] __down_write_nested+0x12/0x92
 [<ffffffff887a8245>] :bonding:bonding_store_slaves+0x25c/0x2f7
 [<ffffffff8010da64>] sysfs_write_file+0xb9/0xe8
 [<ffffffff80016a49>] vfs_write+0xce/0x174
 [<ffffffff80017316>] sys_write+0x45/0x6e
 [<ffffffff8005e116>] system_call+0x7e/0x83

This seems to happen because bond_del_vlans_from_slave() takes write_lock_bh(&bond->lock) and then calls the driver's vlan_rx_register function, which might schedule(). Taking this lock is not necessary, because we already hold the rtnl mutex, acquired earlier.

Patch attached; it is a simple backport of upstream commit 03dc2f4c525afb9488edb687c2e1f7057d59b40e.
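For readers unfamiliar with this class of bug, here is a rough userspace sketch of the pattern described above. It is not kernel code and not the attached patch: every name in it (model_lock_bh, model_schedule, release_slave_buggy and so on) is made up for illustration, and the counter only loosely stands in for the kernel's atomic-context tracking. It assumes, as the report states, that the caller already holds a sleepable mutex (rtnl) that makes the extra lock redundant.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace model only: none of these names are kernel APIs. */
static int atomic_depth; /* loosely models "we are in atomic context" */
static int bug_count;    /* number of "scheduling while atomic" reports */

/* Stand-ins for write_lock_bh()/write_unlock_bh() on bond->lock. */
static void model_lock_bh(void)
{
    atomic_depth++;
}

static void model_unlock_bh(void)
{
    assert(atomic_depth > 0);
    atomic_depth--;
}

/* Stand-in for schedule(): sleeping is only legal outside atomic context. */
static void model_schedule(void)
{
    if (atomic_depth > 0) {
        bug_count++;
        fprintf(stderr, "BUG: scheduling while atomic (depth=%d)\n",
                atomic_depth);
    }
}

/* Stand-in for a driver's vlan_rx_register hook that may sleep,
 * as tg3_napi_disable() does in the call trace. */
static void model_vlan_rx_register(void)
{
    model_schedule();
}

/* Pre-patch shape: the driver hook runs with the lock held. */
static int release_slave_buggy(void)
{
    bug_count = 0;
    model_lock_bh();
    model_vlan_rx_register(); /* sleeps while "atomic": reported as a bug */
    model_unlock_bh();
    return bug_count;
}

/* Post-patch shape: no lock taken here; the caller is assumed to hold
 * a sleepable mutex (rtnl in the report), so the hook may sleep. */
static int release_slave_fixed(void)
{
    bug_count = 0;
    model_vlan_rx_register();
    return bug_count;
}
```

In this model, release_slave_buggy() reports one violation and release_slave_fixed() reports none. The real fix depends on the rtnl mutex actually being held on every path into bond_del_vlans_from_slave(), which this sketch simply assumes.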
Created attachment 441531 [details] proposed patch
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Was this patch tested and confirmed to fix the issue? This would be an easy fix to add, but I would like confirmation that the problem is actually resolved.
(In reply to comment #4)
> Was this patch tested and confirmed to fix the issue? This would be an easy
> fix to add, but I would like confirmation that the problem is actually
> resolved.

Hello, Veaceslav? Can you ping the customer again? If we know this helps them we can push it into RHEL5.6 -- otherwise I cannot justify it.
My test kernels have been updated to include a patch for this bugzilla. Please test them and report back your results.

http://people.redhat.com/agospoda/#rhel5

Without immediate feedback there is a good chance that this or any other fix for this driver will not be included in the upcoming update.
The same issue was seen with mlx4 cards (schedule()ing while holding a spinlock), and the patch fixed it. The original reporter disappeared, so I think it's fair enough to say that the patch fixes the bug.
Excellent. Thank you for getting feedback from a customer!
Patch included in kernel-2.6.18-232.el5.

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcome.
An advisory has been issued which should help resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html