Description of problem:
tlb_clear_slave drops the lock before calling tlb_init_slave. This allows a race with tlb_choose_channel, which may be setting "head" at the same moment that tlb_init_slave is clearing it. The result is a corrupted hash table.

Version-Release number of selected component (if applicable):
Appears in U3 and earlier.

How reproducible:
Not easily. Requires a PCI hotplug system that supports bonded network devices.

Steps to Reproduce:
I'm trying to dig this up. The fellow who discovered the bug is no longer reachable.

Actual results:
Oops resulting from hash table corruption.

Expected results:

Additional info:
This patch was submitted upstream and first appears in kernel 2.6.16. The git details follow:

commit 5af47b2ff124fdad9ba84baeb9f7eeebeb227b43
tree 1085c636295cd3f9ade5611f9519d83731e27cdc
parent 9a6301c114aaab1df6de6fad9899bb89852a7592
author Jay Vosburgh <fubar.com> Mon, 09 Jan 2006 12:14:00 -0800
committer Jeff Garzik <jgarzik> Thu, 12 Jan 2006 16:35:39 -0500

[PATCH] bonding: UPDATED hash-table corruption in bond_alb.c

I believe I see the race Michael refers to (tlb_choose_channel may set head, which tlb_init_slave clears), although I was not able to reproduce it. I have updated his patch for the current netdev-2.6.git tree and added a version update. His original comment follows:

Our systems have been crashing during testing of PCI HotPlug support in the various networking components. We've faulted in the bonding driver due to a bug in bond_alb.c:tlb_clear_slave(). In that routine, the last modification to the TLB hash table is made without protection of the lock, allowing a race that can lead tlb_choose_channel() to select an invalid table element.

-J

Signed-off-by: Jeff Garzik <jgarzik>
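For readers unfamiliar with the driver, the ordering problem described above can be illustrated with a small userspace sketch. This is not the bond_alb.c code and not the attached patch: every name here (demo_bond, demo_choose_channel, demo_clear_slave, and so on) is hypothetical, a pthread mutex stands in for the driver's hash-table lock, and the table is reduced to a single slave. It only shows the general idea that the reset of "head" has to happen while the lock is still held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define DEMO_HASH_SIZE  8u
#define DEMO_NULL_INDEX 0xffffffffu

/* Hypothetical stand-ins for the driver's table entry and per-slave info. */
struct demo_entry {
    uint32_t next;      /* next entry on the slave's list */
    int assigned;       /* 1 while the entry is linked to the slave */
};

struct demo_bond {
    pthread_mutex_t lock;                    /* stands in for the hash-table lock */
    struct demo_entry table[DEMO_HASH_SIZE];
    uint32_t head;                           /* first entry on the slave's list */
};

static void demo_init(struct demo_bond *b)
{
    pthread_mutex_init(&b->lock, NULL);
    for (uint32_t i = 0; i < DEMO_HASH_SIZE; i++) {
        b->table[i].next = DEMO_NULL_INDEX;
        b->table[i].assigned = 0;
    }
    b->head = DEMO_NULL_INDEX;
}

/* Analogue of the "choose channel" path: links an entry at the list head. */
static void demo_choose_channel(struct demo_bond *b, uint32_t index)
{
    index %= DEMO_HASH_SIZE;
    pthread_mutex_lock(&b->lock);
    if (!b->table[index].assigned) {
        b->table[index].next = b->head;
        b->table[index].assigned = 1;
        b->head = index;
    }
    pthread_mutex_unlock(&b->lock);
}

/* Analogue of the "clear slave" path, with the reset of head kept under the lock. */
static void demo_clear_slave(struct demo_bond *b)
{
    pthread_mutex_lock(&b->lock);
    uint32_t index = b->head;
    while (index != DEMO_NULL_INDEX) {
        uint32_t next = b->table[index].next;
        b->table[index].next = DEMO_NULL_INDEX;
        b->table[index].assigned = 0;
        index = next;
    }
    /*
     * The racy ordering would unlock first and reset head afterwards,
     * which could overwrite a head that demo_choose_channel() had just
     * set in between. Resetting it here closes that window.
     */
    b->head = DEMO_NULL_INDEX;
    pthread_mutex_unlock(&b->lock);
}

/* Walks the list under the lock and checks that every linked entry is assigned. */
static int demo_list_length(struct demo_bond *b)
{
    int n = 0;
    pthread_mutex_lock(&b->lock);
    for (uint32_t i = b->head; i != DEMO_NULL_INDEX; i = b->table[i].next) {
        assert(b->table[i].assigned);
        n++;
    }
    pthread_mutex_unlock(&b->lock);
    return n;
}
```

In this sketch, calling demo_choose_channel() for two distinct indices, then demo_clear_slave(), leaves head at DEMO_NULL_INDEX and every entry unassigned, and a later demo_choose_channel() starts a fresh one-element list. With the unlock moved above the reset of head, a concurrent demo_choose_channel() could link an entry that the late reset then orphans, which is the kind of inconsistency the report describes.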
Created attachment 127472 [details] patch for 2.6.9-34.14
Committed in stream U4 build 34.18. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/ However, there is a *serious* slab corruption issue with this kernel, so it should not be released to customers under any circumstances. I'll update this bug when the kernel is stable again.
We've identified the corruption as specific to x86-64 SMP kernel builds 34.16 and 34.17. All other builds are safe for consumption.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0575.html