627974 – Scheduling while atomic when removing slave tg3 interface from bonding

Summary: Scheduling while atomic when removing slave tg3 interface from bonding

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Andy Gospodarek
QA Contact:	Hangbin Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	652561

Reported:	2010-08-27 14:29 UTC by Veaceslav Falico
Modified:	2018-11-14 18:58 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-01-13 21:13:25 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
proposed patch (1.48 KB, patch) 2010-08-27 14:30 UTC, Veaceslav Falico	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0017	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update	2011-01-13 10:37:42 UTC

Description Veaceslav Falico 2010-08-27 14:29:38 UTC

Description of problem:
When removing a tg3 slave interface (and most probably others, see below) with VLAN support from a bond, a "scheduling while atomic" bug is triggered and, shortly afterwards, the system deadlocks.

Version-Release number of selected component (if applicable):
kernel 2.6.18-194

How reproducible:
Easily, just follow the steps.


Steps to Reproduce:
1. Set up a bond, bond1, with 2 slave interfaces that use the tg3 driver.
2. Set up a VLAN interface, bond1.2, on top of bond1.
3. Remove either slave interface from the bond.
  
Actual results:
schedule() printk()s a "scheduling while atomic" warning, and shortly afterwards the machine freezes.


Expected results:
The slave interface is removed from the bond without any warning or freeze.


Additional info:
BUG: scheduling while atomic: ifdown-eth/0x00000100/16060

Call Trace:
[<ffffffff8006343d>] __sched_text_start+0x7d/0xbd6
[<ffffffff8003ddd5>] lock_timer_base+0x1b/0x3c
[<ffffffff8001cb9f>] __mod_timer+0x100/0x10f
[<ffffffff800648ab>] schedule_timeout+0x8a/0xad
[<ffffffff80098e91>] process_timeout+0x0/0x5
[<ffffffff882f25b2>] :tg3:tg3_napi_disable+0x2a/0x41
[<ffffffff882fb437>] :tg3:tg3_vlan_rx_register+0x46/0x10a
[<ffffffff8879da37>] :bonding:bond_del_vlans_from_slave+0xa6/0xb9
[<ffffffff8879fd27>] :bonding:bond_release+0x2e2/0x3e8
[<ffffffff800655ab>] __down_write_nested+0x12/0x92
[<ffffffff887a8245>] :bonding:bonding_store_slaves+0x25c/0x2f7
[<ffffffff8010da64>] sysfs_write_file+0xb9/0xe8
[<ffffffff80016a49>] vfs_write+0xce/0x174
[<ffffffff80017316>] sys_write+0x45/0x6e
[<ffffffff8005e116>] system_call+0x7e/0x83

It seems to happen because bond_del_vlans_from_slave() takes write_lock_bh(&bond->lock) and then calls the driver's vlan_rx_register function, which may schedule(). Taking this lock is not necessary, because we already hold the rtnl mutex, which was acquired earlier.

Patch attached; it is a simple backport of upstream commit 03dc2f4c525afb9488edb687c2e1f7057d59b40e.

Comment 1 Veaceslav Falico 2010-08-27 14:30:32 UTC

Created attachment 441531
proposed patch

Comment 3 RHEL Program Management 2010-08-27 15:09:54 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Andy Gospodarek 2010-09-15 19:27:42 UTC

Was this patch tested and confirmed to fix the issue?  This would be an easy fix to add, but I would like confirmation that the problem is actually resolved.

Comment 6 Andy Gospodarek 2010-10-15 15:24:56 UTC

(In reply to comment #4)
> Was this patch tested and confirmed to fix the issue?  This would be an easy
> fix to add, but I would like confirmation that the problem is actually
> resolved.

Hello, Veaceslav?  Can you ping the customer again?  If we know this helps them we can push it into RHEL5.6 -- otherwise I cannot justify it.

Comment 8 Andy Gospodarek 2010-10-26 18:59:51 UTC

My test kernels have been updated to include a patch for this bugzilla.
Please test them and report back your results.

http://people.redhat.com/agospoda/#rhel5

Without immediate feedback there is a good chance this or any other fix for this driver will not be included in the upcoming update.  Please test them and report back your results.

Comment 11 Veaceslav Falico 2010-11-11 18:04:33 UTC

The same issue was seen with mlx4 cards (schedule()ing while holding a spinlock), and the patch fixed it. The original reporter disappeared, so I think it's fair enough to say that the patch fixes the bug.

Comment 12 Andy Gospodarek 2010-11-11 18:37:28 UTC

Excellent.  Thank you for getting feedback from a customer!

Comment 16 Jarod Wilson 2010-11-16 16:57:40 UTC

Included in kernel-2.6.18-232.el5.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 21 errata-xmlrpc 2011-01-13 21:13:25 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html
