Bug 445016

Summary:	occasional ASSERT_RTNL with bonding driver
Product:	Red Hat Enterprise Linux 5	Reporter:	nate.dailey
Component:	kernel	Assignee:	Andy Gospodarek <agospoda>
Status:	CLOSED DUPLICATE	QA Contact:	Martin Jenner <mjenner>
Severity:	medium	Docs Contact:
Priority:	low
Version:	5.2	CC:	agospoda, dzickus, ijc, peterm, simon.mcgrath
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-10-07 13:20:46 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description nate.dailey 2008-05-02 18:59:07 UTC

Description of problem:

See messages like this:

May  2 13:46:54 fhloston kernel: RTNL: assertion failed at net/core/fib_rules.c
(388)
May  2 13:46:54 fhloston kernel:
May  2 13:46:54 fhloston kernel: Call Trace:
May  2 13:46:54 fhloston kernel:  [<ffffffff80220d2d>] fib_rules_event+0x3d/0xff
May  2 13:46:54 fhloston kernel:  [<ffffffff80066b91>] notifier_call_chain+0x20/0x32
May  2 13:46:54 fhloston kernel:  [<ffffffff802141f3>] netdev_state_change+0x21/0x33
May  2 13:46:54 fhloston kernel:  [<ffffffff8021cab8>]
__linkwatch_run_queue+0x1b1/0x1e7
May  2 13:46:54 fhloston kernel:  [<ffffffff8021cb18>] linkwatch_event+0x2a/0x30
May  2 13:46:54 fhloston kernel:  [<ffffffff8004ce70>] run_workqueue+0x94/0xe4
May  2 13:46:54 fhloston kernel:  [<ffffffff80049785>] worker_thread+0x0/0x122
May  2 13:46:54 fhloston kernel:  [<ffffffff80049875>] worker_thread+0xf0/0x122
May  2 13:46:54 fhloston kernel:  [<ffffffff8008abd6>] default_wake_function+0x0/0xe
May  2 13:46:54 fhloston kernel:  [<ffffffff8003253d>] kthread+0xfe/0x132
May  2 13:46:54 fhloston kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
May  2 13:46:54 fhloston kernel:  [<ffffffff8003243f>] kthread+0x0/0x132
May  2 13:46:54 fhloston kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Version-Release number of selected component (if applicable):

RHEL5 update 2 kernel 91

How reproducible:

Only happens after seeing a message like:

May  2 13:46:52 fhloston kernel: bonding: bond1: Unable to update slaves because
interface is down.

But after that, it happens a lot.

Steps to Reproduce:
1. configure a bond made up of, for example, eth0 and eth2, then service network
restart
2. ifdown bond0
3. ifdown eth0
4. service network restart

Actual results:

ASSERT_RTNL messages.

Expected results:

No ASSERT_RTNL messages.

Additional info:

This is a bug in bond_sysfs.c. The following patch takes care of it:

--- bond_sysfs.c.orig   2008-05-02 13:04:03.000000000 -0400
+++ bond_sysfs.c        2008-05-02 14:56:14.000000000 -0400
@@ -255,7 +255,7 @@ static ssize_t bonding_store_slaves(stru
                       ": %s: Unable to update slaves because interface is down.\n",
                       bond->dev->name);
                ret = -EPERM;
-               goto out;
+               goto out_nolock;
        }

        /* Note:  We can't hold bond->lock here, as bond_create grabs it. */
@@ -374,6 +374,7 @@ err_no_cmd:
 out:
        up_write(&(bonding_rwsem));
        rtnl_unlock();
+out_nolock:
        return ret;
 }

The problem is that the "Unable to update slaves" code jumps to a label that
release a mutex which had never been taken. This screws up the mutex's count,
which in turn causes the ASSERT_RTNL errors.

Comment 1 Andy Gospodarek 2008-05-14 20:11:07 UTC

This doesn't look like the same code that I see for the latest RHEL5.

I don't see any returns before taking the rtnl_lock so we always need to release it.

Comment 2 nate.dailey 2008-05-15 12:24:35 UTC

The latest RHEL5 update 2 code that I have access to is kernel 2.6.18-92.el5.
Here's a snippet from the version of bond_sysfs.c from this kernel:

    243 static ssize_t bonding_store_slaves(struct class_device *cd, const char
*buffer, size_t count)
    244 {
    245 	char command[IFNAMSIZ + 1] = { 0, };
    246 	char *ifname;
    247 	int i, res, found, ret = count;
    248 	struct slave *slave;
    249 	struct net_device *dev = NULL;
    250 	struct bonding *bond = to_bond(cd);
    251 
    252 	/* Quick sanity check -- is the bond interface up? */
    253 	if (!(bond->dev->flags & IFF_UP)) {
    254 		printk(KERN_ERR DRV_NAME
    255 		       ": %s: Unable to update slaves because interface is down.\n",
    256 		       bond->dev->name);
    257 		ret = -EPERM;
    258 		goto out;
    259 	}
    260 
    261 	/* Note:  We can't hold bond->lock here, as bond_create grabs it. */
    262 
    263 	rtnl_lock();
    264 	down_write(&(bonding_rwsem));

,,,


    374 out:
    375 	up_write(&(bonding_rwsem));
    376 	rtnl_unlock();
    377 	return ret;
    378 }

Note that the "goto out;" on line 258 happens before taking the rtnl_lock. Aftet
the "out" label, the lock is released (though it was never taken). Same thing
for the bonding_rwsem.

Comment 3 Andy Gospodarek 2008-05-15 20:52:32 UTC

Ah yes, thank you.  I was looking at my latest test kernels, but they do not
contain this exact code anymore since mine have support for IPoIB + bonding.

Comment 4 Andy Gospodarek 2008-10-07 13:20:46 UTC

This should be resolved with the changes for bug 450219.

*** This bug has been marked as a duplicate of bug 450219 ***