Description of problem:
Some error conditions handled by the bonding sysfs interface on RHEL 5.2 can leave
the rtnl_lock unbalanced, either by locking and returning without unlocking,
or by unlocking when it didn't lock.
If you see kernel messages about bonding followed by failing RTNL assertions,
the issues pointed out here are very likely the cause.
kernel: bonding: bond1: Unable to update slaves because interface is down.
kernel: RTNL: assertion failed at net/core/fib_rules.c (388)
... repeats many times and at several points in the code.
That "unable to update slaves" message is precisely the point where
bonding_store_slaves() does an unbalanced unlock.
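To illustrate the kind of imbalance described above, here is a minimal userspace sketch in C. It is not the bonding driver code and not the attached patch. mock_rtnl_lock(), mock_rtnl_unlock(), the rtnl_depth counter, depth_after() and both store_slaves_*() functions are hypothetical stand-ins, and the only assumption is that an early error exit reaches an unlock that was never paired with a lock.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for the RTNL mutex: a plain depth counter.
 * 0 means balanced, negative means "unlocked without locking",
 * positive means "returned while still holding the lock". */
static int rtnl_depth;

static void mock_rtnl_lock(void)   { rtnl_depth++; }
static void mock_rtnl_unlock(void) { rtnl_depth--; }

/* Unbalanced shape: the "interface is down" check bails out before the
 * lock is taken, but the shared exit label still unlocks. */
static int store_slaves_buggy(int bond_is_up, const char *buf)
{
    int ret = (int)strlen(buf);

    if (!bond_is_up) {
        ret = -EPERM;
        goto out;               /* lock has NOT been taken yet */
    }

    mock_rtnl_lock();
    /* ... slave add/remove work would happen here ... */
out:
    mock_rtnl_unlock();         /* also runs on the early error exit */
    return ret;
}

/* Balanced shape: the early exit skips the unlock, so every unlock is
 * paired with exactly one lock on every path. */
static int store_slaves_balanced(int bond_is_up, const char *buf)
{
    int ret = (int)strlen(buf);

    if (!bond_is_up) {
        ret = -EPERM;
        goto out;               /* nothing to undo */
    }

    mock_rtnl_lock();
    /* ... slave add/remove work would happen here ... */
    mock_rtnl_unlock();
out:
    return ret;
}

typedef int (*store_fn)(int bond_is_up, const char *buf);

/* Reset the counter, run one store call, and report the lock depth left
 * behind. The "+eth1\n" input is only an example string. */
static int depth_after(store_fn fn, int bond_is_up)
{
    assert(fn != NULL);
    rtnl_depth = 0;
    (void)fn(bond_is_up, "+eth1\n");
    return rtnl_depth;
}
```

Calling depth_after(store_slaves_buggy, 0) leaves the counter below zero, which corresponds to the unbalanced unlock on the "interface is down" path, while store_slaves_balanced leaves it at zero on both paths.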
Version-Release number of selected component (if applicable):
First noticed on 2.6.18-92.el5, but the code changes happened over a few
earlier commits starting at -62.el5 (bug 268001).
How reproducible:
Looking at the code I guess 100%. :)
Steps to Reproduce:
It hit a customer; I have not been able to try it in the lab yet. I expect that
adding a NIC to a bond that is down would trigger it easily.
Actual results:
The RTNL lock cannot be trusted after those errors, and I gather networking
would get quite racy.
Expected results:
RTNL lock nicely managed.
Additional info:
Patch attached; it is already in a test package at a customer site. I will post
results as soon as the tests are done. Looking at the code for
bonding_store_bonds() and bonding_store_slaves(), it is easy to spot those
problems.
Created attachment 308494
Patch to correct the rtnl lock usage.
Patch is posted and ACKed.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
Looks like this has fairly high impact on the customers that have reported the
issue and there's potential for more customers to hit this problem. Proposing
for EUS as a result.
You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 445016 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.