Bug 188296

Summary: tlb_clear_slave races with tlb_choose_channel
Product: Red Hat Enterprise Linux 4 Reporter: Kimball Murray <kimball.murray>
Component: kernel Assignee: Kimball Murray <kmurray>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0 CC: jbaron
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0575 Doc Type: Bug Fix
Last Closed: 2006-08-10 23:06:32 UTC Type: ---
Bug Depends On:    
Bug Blocks: 181409    
Attachments:
Description Flags
patch for 2.6.9-34.14 none

Description Kimball Murray 2006-04-07 18:03:01 UTC
Description of problem:
tlb_clear_slave drops the lock before calling tlb_init_slave.  This allows a
race with tlb_choose_channel, which may be setting "head" at the same moment
that tlb_init_slave is clearing it.  The result is a corrupted hash table.
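
To make the window concrete, here is a minimal userspace sketch of the locking
pattern, not the kernel code and not the attached patch.  The names
bond_model, model_choose_channel and model_clear_slave, the table size and the
pthread mutex are all assumptions made for illustration.  The only point it
tries to show is that the per-slave "head" is reset while the same lock that
model_choose_channel takes is still held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define TLB_HASH_SIZE  256            /* hypothetical table size */
#define TLB_NULL_INDEX 0xffffffffu    /* sentinel meaning "no entry" */
#define NUM_SLAVES     2              /* hypothetical slave count */

struct tlb_entry {
    uint32_t next;   /* next index on the owning slave's list */
    uint32_t prev;   /* previous index on that list */
    int owner;       /* owning slave id, or -1 if unassigned */
};

struct bond_model {
    pthread_mutex_t lock;                 /* protects everything below */
    struct tlb_entry table[TLB_HASH_SIZE];
    uint32_t head[NUM_SLAVES];            /* per-slave list head */
};

static void entry_reset(struct tlb_entry *e)
{
    e->next = TLB_NULL_INDEX;
    e->prev = TLB_NULL_INDEX;
    e->owner = -1;
}

void bond_model_init(struct bond_model *b)
{
    pthread_mutex_init(&b->lock, NULL);
    for (uint32_t i = 0; i < TLB_HASH_SIZE; i++)
        entry_reset(&b->table[i]);
    for (int s = 0; s < NUM_SLAVES; s++)
        b->head[s] = TLB_NULL_INDEX;
}

/* Assign hash bucket idx to slave if it is free; return the owner. */
int model_choose_channel(struct bond_model *b, uint32_t idx, int slave)
{
    assert(idx < TLB_HASH_SIZE && slave >= 0 && slave < NUM_SLAVES);
    pthread_mutex_lock(&b->lock);
    struct tlb_entry *e = &b->table[idx];
    if (e->owner < 0) {
        /* Push the entry on the front of the slave's list: sets head. */
        e->owner = slave;
        e->next = b->head[slave];
        e->prev = TLB_NULL_INDEX;
        if (e->next != TLB_NULL_INDEX)
            b->table[e->next].prev = idx;
        b->head[slave] = idx;
    }
    int owner = e->owner;
    pthread_mutex_unlock(&b->lock);
    return owner;
}

/* Release every entry owned by slave and reset its head. */
void model_clear_slave(struct bond_model *b, int slave)
{
    assert(slave >= 0 && slave < NUM_SLAVES);
    pthread_mutex_lock(&b->lock);
    uint32_t idx = b->head[slave];
    while (idx != TLB_NULL_INDEX) {
        uint32_t next = b->table[idx].next;
        entry_reset(&b->table[idx]);
        idx = next;
    }
    /* Reset head while still holding the lock.  Doing this after the
     * unlock is the window described in this report. */
    b->head[slave] = TLB_NULL_INDEX;
    pthread_mutex_unlock(&b->lock);
}
```

If the head reset were moved below the unlock, a concurrent
model_choose_channel could link a new entry at head in between, and the reset
would then orphan an entry that still names the slave as its owner.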

Version-Release number of selected component (if applicable):
Appears in U3 and earlier.

How reproducible:
Not too easy.  Need a PCI hotplug system that supports bonded network devices.


Steps to Reproduce:

I'm trying to dig this up.  The fellow who discovered the bug is no longer
reachable.


Actual results:
Oops resulting from hash table corruption.

Expected results:


Additional info:
This patch was submitted upstream, and first appears in kernel 2.6.16.  The git
details follow:

commit 5af47b2ff124fdad9ba84baeb9f7eeebeb227b43
tree 1085c636295cd3f9ade5611f9519d83731e27cdc
parent 9a6301c114aaab1df6de6fad9899bb89852a7592
author Jay Vosburgh <fubar.com> Mon, 09 Jan 2006 12:14:00 -0800
committer Jeff Garzik <jgarzik> Thu, 12 Jan 2006 16:35:39 -0500

    [PATCH] bonding: UPDATED hash-table corruption in bond_alb.c
    
    I believe I see the race Michael refers to (tlb_choose_channel
    may set head, which tlb_init_slave clears), although I was not able to
    reproduce it.  I have updated his patch for the current netdev-2.6.git
    tree and added a version update.  His original comment follows:
    
    Our systems have been crashing during testing of PCI HotPlug
    support in the various networking components.  We've faulted in
    the bonding driver due to a bug in bond_alb.c:tlb_clear_slave()
    
    In that routine, the last modification to the TLB hash table is
    made without protection of the lock, allowing a race that can lead
    tlb_choose_channel() to select an invalid table element.
    
    -J
    
    Signed-off-by: Jeff Garzik <jgarzik>

Comment 1 Kimball Murray 2006-04-07 18:03:01 UTC
Created attachment 127472 [details]
patch for 2.6.9-34.14

Comment 2 Jason Baron 2006-04-18 19:51:22 UTC
Committed in stream U4 build 34.18. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/. However, there is a *serious*
slab corruption issue with this kernel, and thus it should not be released to
customers under any circumstances. I'll update this bug when the kernel is
stable again.


Comment 3 Jason Baron 2006-04-19 19:53:17 UTC
We've identified the corruption as specific to x86-64 smp kernel builds 34.16
and 34.17. All other builds are safe for consumption.


Comment 6 Red Hat Bugzilla 2006-08-10 23:06:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html