Bug 627974 - Scheduling while atomic when removing slave tg3 interface from bonding
Summary: Scheduling while atomic when removing slave tg3 interface from bonding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Andy Gospodarek
QA Contact: Hangbin Liu
URL:
Whiteboard:
Depends On:
Blocks: 652561
 
Reported: 2010-08-27 14:29 UTC by Veaceslav Falico
Modified: 2018-11-14 18:58 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 21:13:25 UTC
Target Upstream Version:
Embargoed:


Attachments
proposed patch (1.48 KB, patch)
2010-08-27 14:30 UTC, Veaceslav Falico


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHSA-2011:0017 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update | 2011-01-13 10:37:42 UTC

Description Veaceslav Falico 2010-08-27 14:29:38 UTC
Description of problem:
When removing a slave tg3 interface with VLAN support (and most probably interfaces using other drivers, see below) from a bond, a "scheduling while atomic" bug is triggered and the system deadlocks shortly afterwards.

Version-Release number of selected component (if applicable):
kernel 2.6.18-194

How reproducible:
Easily, just follow the steps.


Steps to Reproduce:
1. Set up a bond, bond1, with 2 interfaces that use the tg3 driver.
2. Set up a VLAN interface, bond1.2, on top of bond1.
3. Remove any slave interface from the bond.
  
Actual results:
schedule() prints a "BUG: scheduling while atomic" message, and shortly afterwards the machine freezes.


Expected results:
The slave interface is removed from the bond without any warning, and the system keeps running.


Additional info:
BUG: scheduling while atomic: ifdown-eth/0x00000100/16060

Call Trace:
[<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
[<ffffffff8003ddd5>] lock_timer_base+0x1b/0x3c
[<ffffffff8001cb9f>] __mod_timer+0x100/0x10f
[<ffffffff800648ab>] schedule_timeout+0x8a/0xad
[<ffffffff80098e91>] process_timeout+0x0/0x5
[<ffffffff882f25b2>] :tg3:tg3_napi_disable+0x2a/0x41
[<ffffffff882fb437>] :tg3:tg3_vlan_rx_register+0x46/0x10a
[<ffffffff8879da37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
[<ffffffff8879fd27>] :bonding:bond_release+0x2e2/0x3e8
[<ffffffff800655ab>] __down_write_nested+0x12/0x92
[<ffffffff887a8245>] :bonding:bonding_store_slaves+0x25c/0x2f7
[<ffffffff8010da64>] sysfs_write_file+0xb9/0xe8
[<ffffffff80016a49>] vfs_write+0xce/0x174
[<ffffffff80017316>] sys_write+0x45/0x6e
[<ffffffff8005e116>] system_call+0x7e/0x83

This appears to happen because bond_del_vlans_from_slave() takes write_lock_bh(&bond->lock) and then calls the driver's vlan_rx_register function, which may call schedule(). Taking this lock is unnecessary here, because the rtnl mutex acquired earlier is still held.

Patch attached; it is a simple backport of upstream commit 03dc2f4c525afb9488edb687c2e1f7057d59b40e.
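
The locking problem described in this report can be illustrated with a small userspace model. This is only a sketch, not kernel code and not the attached patch: every model_* name is hypothetical, and the counter merely imitates how disabling bottom halves raises the preempt count. The offset 0x100 is chosen to match the 0x00000100 value printed in the BUG line above.

```c
/*
 * Userspace model only. All model_* names are hypothetical stand-ins
 * for the kernel functions named in the report.
 */
#include <assert.h>
#include <stdio.h>

#define MODEL_SOFTIRQ_OFFSET 0x100u /* assumed to mirror the 0x100 in the trace */

unsigned int model_preempt_count; /* 0 means "sleeping is allowed" */
int model_atomic_sleep_bugs;      /* how many times we slept while atomic */

/* Stand-in for write_lock_bh(): disabling bottom halves raises the count. */
void model_write_lock_bh(void)
{
    model_preempt_count += MODEL_SOFTIRQ_OFFSET;
}

/* Stand-in for write_unlock_bh(). */
void model_write_unlock_bh(void)
{
    assert(model_preempt_count >= MODEL_SOFTIRQ_OFFSET);
    model_preempt_count -= MODEL_SOFTIRQ_OFFSET;
}

/* Stand-in for schedule(): complains if called with a non-zero count. */
void model_schedule(const char *comm)
{
    if (model_preempt_count != 0) {
        model_atomic_sleep_bugs++;
        printf("BUG: scheduling while atomic: %s/0x%08x\n",
               comm, model_preempt_count);
    }
}

/* Stand-in for a driver's vlan_rx_register callback, which may sleep. */
void model_vlan_rx_register(void)
{
    model_schedule("ifdown-eth");
}

/* Problem path: the callback runs with the lock held. Returns bug count. */
int model_del_vlans_locked(void)
{
    model_atomic_sleep_bugs = 0;
    model_write_lock_bh();
    model_vlan_rx_register();
    model_write_unlock_bh();
    return model_atomic_sleep_bugs;
}

/* Proposed path: no bond lock; the caller is assumed to hold rtnl. */
int model_del_vlans_unlocked(void)
{
    model_atomic_sleep_bugs = 0;
    model_vlan_rx_register();
    return model_atomic_sleep_bugs;
}
```

In this model, model_del_vlans_locked() reports one violation, while model_del_vlans_unlocked() reports none. That mirrors the reasoning above: the callback itself is unchanged, and only the atomic context around it is removed.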

Comment 1 Veaceslav Falico 2010-08-27 14:30:32 UTC
Created attachment 441531
proposed patch

Comment 3 RHEL Program Management 2010-08-27 15:09:54 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Andy Gospodarek 2010-09-15 19:27:42 UTC
Was this patch tested and confirmed to fix the issue?  This would be an easy fix to add, but I would like confirmation that the problem is actually resolved.

Comment 6 Andy Gospodarek 2010-10-15 15:24:56 UTC
(In reply to comment #4)
> Was this patch tested and confirmed to fix the issue?  This would be an easy
> fix to add, but I would like confirmation that the problem is actually
> resolved.

Hello, Veaceslav?  Can you ping the customer again?  If we know this helps them we can push it into RHEL5.6 -- otherwise I cannot justify it.

Comment 8 Andy Gospodarek 2010-10-26 18:59:51 UTC
My test kernels have been updated to include a patch for this bugzilla.
Please test them and report back your results.

http://people.redhat.com/agospoda/#rhel5

Without immediate feedback there is a good chance that this fix, or any other fix for this driver, will not be included in the upcoming update.

Comment 11 Veaceslav Falico 2010-11-11 18:04:33 UTC
The same issue was seen with mlx4 cards (schedule()ing while holding a spinlock), and the patch fixed it. The original reporter disappeared, so I think it's fair enough to say that the patch fixes the bug.

Comment 12 Andy Gospodarek 2010-11-11 18:37:28 UTC
Excellent.  Thank you for getting feedback from a customer!

Comment 16 Jarod Wilson 2010-11-16 16:57:40 UTC
in kernel-2.6.18-232.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 21 errata-xmlrpc 2011-01-13 21:13:25 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

