Bug 144477

Summary: bonding mode=6 + dhcp doesn't work correctly
Product: Red Hat Enterprise Linux 4 Reporter: Danny Trinh <danny_trinh>
Component: kernel Assignee: John W. Linville <linville>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0 CC: davej, jbaron, jturner, notting, tao, thomas_chenault, wwlinuxengineering
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0132 Doc Type: Bug Fix
Last Closed: 2006-03-07 18:37:58 UTC Type: ---
Bug Blocks: 168429    
Attachments:
Description                                                     Flags
dmesg of failed system                                          none
/var/log/messages of failed system                              none
Content of /proc/net/bonding/bond0                              none
This patch implements the workaround described in comment 12.   none
jwltest-bond_alb-mac-collision.patch                            none

Description Danny Trinh 2005-01-07 14:57:31 UTC
Description of problem:
After modifying modprobe.conf and creating ifcfg-bond0, ifcfg-eth0, and
ifcfg-eth1, I rebooted the system. Only eth0 is active, and a lot of
warnings and errors appear in dmesg and /var/log/messages.

Below are samples of errors:
bonding: bond0: link status definitely up for interface eth1.
bonding: Warning: the permanent HWaddr of eth0 - 00:06:5B:0F:6F:F8 -
is still in use by bond0. Set the HWaddr of eth0 to a different
address to avoid conflicts.
bonding: bond0: releasing active interface eth0
bonding: bond0: making interface eth1 the new active one.
bonding: Warning: the hw address of slave eth0 is in use by the bond;
giving it the hw address of eth1
bonding: bond0: enslaving eth0 as an active interface with a down link.
bonding: bond0: releasing active interface eth1
bonding: bond0: now running without any active interface !
bonding: Error: the hw address of slave eth1 is not unique - cannot
enslave it!<3>bonding: Error: the hw address of slave eth1 is not
unique - cannot enslave it!<6>e1000: eth0: e1000_watchdog: NIC Link is
Up 1000 Mbps Full Duplex
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: making interface eth0 the new active one.


Version-Release number of selected component (if applicable):
RHEL4-prerc2

How reproducible:
always

Steps to Reproduce:
A) Do a fresh install of RHEL4-pre-rc2.
B) Modify modprobe.conf to create an alias named bond0 for the bonding
module, with options miimon=200 and mode=6 (i.e. balance-alb).
C) Create the ifcfg-bond0, ifcfg-eth0 and ifcfg-eth1 configuration files
per the Red Hat manuals. Note that I'm using DHCP for the bond0
interface, whereas the Red Hat manuals don't specify either DHCP or
static, so that shouldn't be a problem.
D) Reboot the server or restart the network using "service network restart".
  
Actual results:
Only eth0 is active, and a lot of warnings and errors appear in dmesg and
/var/log/messages.

Expected results:
Both eth0 and eth1 should be enslaved to bond0, with no bonding errors logged.

Additional info:
- Bonding mode=6 works fine if I use a static IP address.
- Bonding mode=6 fails only when I run "service network restart" or
reboot the system.
- However, if I manually turn on network by doing:
   service network stop
   ifconfig bond0 up
   ifenslave bond0 eth0
   ifenslave bond0 eth1
It works as expected
- I also tried bonding mode=1. This works fine with or without DHCP.
- I checked bug#91399 and added TYPE=Bonding to ifcfg-bond0. It
still failed as described above.

Comment 1 Danny Trinh 2005-01-07 15:01:11 UTC
Created attachment 109468 [details]
dmesg of failed system

Comment 2 Danny Trinh 2005-01-07 15:01:41 UTC
Created attachment 109469 [details]
/var/log/messages of failed system

Comment 3 Danny Trinh 2005-01-07 15:07:50 UTC
Created attachment 109470 [details]
Content of /proc/net/bonding/bond0

Comment 4 Jeremy Katz 2005-01-07 17:03:51 UTC
*** Bug 144473 has been marked as a duplicate of this bug. ***

Comment 5 Bill Nottingham 2005-01-07 17:43:58 UTC
...
- However, if I manually turn on network by doing:
   service network stop
   ifconfig bond0 up
   ifenslave bond0 eth0
   ifenslave bond0 eth1
It works as expected
...

At what stage here are you getting the IP address via DHCP?

Comment 6 Danny Trinh 2005-01-07 18:31:17 UTC
I received the IP address after "ifconfig bond0 up". Below is the output of
ifconfig.
bond0     Link encap:Ethernet  HWaddr 00:00:00:00:00:00  
          inet addr:10.9.162.144  Bcast:10.9.167.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Comment 7 Bill Nottingham 2005-01-07 20:37:43 UTC
ifconfig bond0 up doesn't request an IP address... that would be the
IP that would already be on the downed interface.

Comment 8 Danny Trinh 2005-01-07 21:26:28 UTC
I think you're right; that is the IP that was already on the downed
interface. But if I don't do it that way, only one NIC (eth0) comes up. See
comment#3.



Comment 9 Amit Bhutani 2005-01-25 20:59:28 UTC
Changing the title to reflect the Update in which a fix for this
issue has been committed (by RH) or is being tracked.

Comment 11 Bill Nottingham 2005-06-17 22:00:14 UTC
I can't reproduce this here. With a config of:

ifcfg-bond0:
DEVICE=bond0
TYPE=Bonding
BOOTPROTO=dhcp
ONBOOT=yes

ifcfg-eth0:
DEVICE=eth0
MASTER=bond0
SLAVE=yes
HWADDR=<whatever>

ifcfg-eth1:
DEVICE=eth1
MASTER=bond0
SLAVE=yes
HWADDR=<whatever>

and:
alias bond0 bonding
options bonding miimon=500 mode=6

the interface works for me.

Comment 12 Amit Bhutani 2005-06-17 22:23:03 UTC
Bill - Thomas Chenault from the Networking team at Dell did a root cause
analysis of this issue. Please review it and see if it sheds light on the
issue.

###############################################################################
Symptom:
One of the physical adapters that is configured to be a member of a channel 
bonding team is not enslaved to the team following a reboot or restart of the 
network. The problem has been observed on a two-member balance-alb bond using 
DHCP for address assignment. See "steps to reproduce" for further detail.


Cause:
The problem is caused by a race between the bonding driver and the ifup 
script. The ifup script brings the bond interface up and enslaves physical 
adapters early when DHCP is in use. After a DHCP lease has been acquired each 
of the physical interfaces is, in turn, removed from and re-enslaved to the 
bond. It is during the re-enslaving process that the second adapter may fail 
to join the team.

When the first interface is removed from the bond the interface's MAC address 
remains in use by the bond. After a brief delay, bonding then attempts to 
assign the team's MAC address, which is also the first interface's MAC 
address, to the second interface. Assuming that this reassignment is 
successful, when the first interface attempts to rejoin the team its MAC 
address is already in use by the team and the second physical adapter. Bonding 
assigns the MAC address of the second physical adapter to the first and allows 
it to join the team. Now the second physical adapter is removed from the team. 
When the second physical adapter attempts to rejoin the team there is a chance 
that its MAC address will still be in use by the first physical adapter. If 
the MAC addresses do collide at this point, the second adapter will be denied 
admittance to the team.

The race condition occurs at the point when the second physical adapter 
attempts to rejoin the team. If the adapter has been out of the team for a 
long enough duration, the MAC address of the first adapter will have been 
changed to match that of the team and no collision will occur. On the other 
hand, if the second adapter attempts to rejoin the team too quickly, its MAC 
address will still be in use by the first adapter and the collision occurs.
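
The sequence above can be sketched as a toy shell model. This is an illustration only, not the bonding driver's logic: the function name simulate_reenslave, the fast/slow argument, and the MAC values are all hypothetical, and the model assumes a released adapter reverts to its permanent MAC address.

```shell
#!/bin/sh
# Toy model of the re-enslave sequence described above (not driver code).
# "fast": eth1 rejoins before the delayed MAC update reaches eth0.
# "slow": eth1 rejoins after it.
simulate_reenslave() (
    timing=$1
    mac_a="00:00:00:00:00:0a"   # made-up permanent MAC of eth0
    mac_b="00:00:00:00:00:0b"   # made-up permanent MAC of eth1
    bond=$mac_a; eth0=$mac_a; eth1=$mac_b

    # 1. eth0 is released; the bond keeps using eth0's MAC, and after a
    #    brief delay eth1 is given the team's MAC.
    eth1=$bond

    # 2. eth0 rejoins; its MAC is in use by the bond and by eth1, so it
    #    is given eth1's permanent MAC instead.
    eth0=$mac_b

    # 3. eth1 is released (assumption: it reverts to its permanent MAC).
    eth1=$mac_b

    # 4. The delayed update hands the team's MAC to eth0, but only if
    #    enough time passes before eth1 tries to rejoin.
    if [ "$timing" = "slow" ]; then
        eth0=$bond
    fi

    # 5. eth1 rejoins; sharing a MAC with eth0 means it is refused.
    if [ "$eth1" = "$eth0" ]; then
        echo "eth1 refused: MAC $eth1 still in use by eth0"
    else
        echo "eth1 enslaved: eth0=$eth0 eth1=$eth1"
    fi
)

simulate_reenslave fast
simulate_reenslave slow
```

In this model the only difference between the two runs is whether step 4 happens before step 5, which is the window the work-around below tries to widen.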

Work-around:
Adding a "sleep 2" immediately following line 420 of 
/etc/sysconfig/network-scripts/ifup resolves the problem in my test scenario.

Hardware:
    Focus (PE 420SC)
    BIOS A00
    Intel P4, 3.4GHz 
    2GB memory
    LOM (tg3, eth0, not used in bond)
    slot2 - Broadcom 5704 (tg3, eth1, eth2, these are the bonded interfaces)
    slot4 - Adaptec ASC-39320

Steps to Reproduce:
1. Install Red Hat Enterprise Linux 4 on a server with more than one physical 
or logical processor and boot to kernel 2.6.9-5.ELsmp.
2. Install two Ethernet adapters.
3. Create a configuration file (/etc/sysconfig/network-scripts/ifcfg-bond0) for
a channel bonding interface and set its BOOTPROTO entry to "dhcp".
4. Create configuration files for each of the physical adapters such that 
SLAVE=yes and MASTER=bond0.
5. Connect the network adapters to a network segment that has a DHCP server and
reboot.
6. Once the server boots, check the flags of each of the network interfaces 
with ifconfig and/or ifconfig -a.

If the problem has been successfully reproduced, the second physical adapter 
will not have its SLAVE property set. Note that, because the problem is rooted 
in a race condition, it may not be possible to reproduce in some 
configurations.
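
Besides the ifconfig flags, the bonding status file (/proc/net/bonding/bond0, attached in comment 3) can be checked for each expected adapter. A minimal sketch, assuming the file lists every enslaved adapter on a "Slave Interface: <name>" line; the helper name missing_slaves is hypothetical.

```shell
#!/bin/sh
# Print each expected slave that is absent from a bonding status file.
# usage: missing_slaves STATUS_FILE IFACE...
missing_slaves() {
    status_file=$1
    shift
    for iface in "$@"; do
        # An enslaved adapter appears as a "Slave Interface: <name>" line.
        if ! grep -q "^Slave Interface: $iface[[:space:]]*\$" "$status_file"; then
            echo "$iface"
        fi
    done
}

# Example, using the bonded interfaces from the hardware list above:
# missing_slaves /proc/net/bonding/bond0 eth1 eth2
```

If the problem has been reproduced, the helper should print the name of the adapter that was refused; empty output means every listed adapter is enslaved.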

Comment 13 Bill Nottingham 2005-06-17 22:40:06 UTC
Attempting to guess how long the kernel might take to respond seems fraught with
danger; this seems like a flaw in the driver.

Comment 16 John W. Linville 2005-07-11 19:08:32 UTC
Could I see the actual changed copy of /etc/sysconfig/network-scripts/ifup (or 
a diff) from comment 12?  I think the line numbers may differ between his copy 
and what I'm looking at... 

Comment 17 Thomas Chenault 2005-07-11 21:12:54 UTC
Created attachment 116624 [details]
This patch implements the workaround described in comment 12.

Diff requested in comment 16 attached as ifuphack.patch.

Comment 18 John W. Linville 2005-07-12 17:54:30 UTC
Created attachment 116667 [details]
jwltest-bond_alb-mac-collision.patch

Experimental patch to improve handling of MAC address collision during
ifenslave for bonding mode 6...

Comment 19 John W. Linville 2005-07-12 18:00:29 UTC
Test kernels w/ above patch are available here: 
 
   http://people.redhat.com/linville/kernels/rhel4/ 
 
Please give them a try (w/ unpatched ifup) to see if they improve the 
situation, and post the results here...thanks! 

Comment 20 Ritesh Raj Sarraf 2005-07-14 15:54:41 UTC
The test kernels have fixed the bug.

Comment 21 Thomas Chenault 2005-07-14 16:12:14 UTC
In testing with kernel 2.6.9-11.27.EL.jwltest.45smp I have not been able to 
reproduce the problem. The issue appears to be resolved.

Comment 24 John W. Linville 2005-07-28 19:02:30 UTC
Patch posted upstream on 7/28...awaiting commentary and/or acceptance...  

Comment 25 John W. Linville 2005-07-29 14:48:45 UTC
Seems to have been accepted upstream... (that was quick!)  I'll propose this 
for U3... 

Comment 30 Ritesh Raj Sarraf 2005-09-02 13:13:14 UTC
This defect isn't fixed at all.
I've tried it on RHEL 4 U2 B1 x86_64 with kernel 2.6.9-17.ELsmp. It is not fixed
in this kernel.


John Linville's kernel packages have this fixed.
http://people.redhat.com/linville/kernels/rhel4/
The patch needs to be included in Red Hat's default kernels.
The exact patch is jwltest-bond_alb-mac-collision.patch.


Comment 31 John W. Linville 2005-09-02 14:14:02 UTC
Please see comment 25, "I'll propose this for U3..." :-) 

Comment 32 Samuel Benjamin 2005-10-10 20:17:19 UTC
This fix is ready and verified by Dell. Please add the acks to move this forward
for inclusion in U3. Thanks.

Comment 37 Jay Turner 2006-01-03 20:14:22 UTC
Please test and confirm resolution with 2.6.9-27.EL or later.  Thanks.

Comment 38 Ritesh Raj Sarraf 2006-01-04 14:09:58 UTC
Thanks, it's fixed in RHEL3 U7 Beta1.

Comment 39 Samuel Benjamin 2006-02-06 20:38:39 UTC
Does the previous comment mean this is fixed in RHEL4-U3? RHEL3 it86692/bug#178885 is
proposed for U8.

Comment 40 Charles Rose 2006-02-09 07:09:00 UTC
Comment #38 above is a typo. It should read RHEL4 U3 Beta1.

I have confirmed that RHEL4 U3 Beta kernel-2.6.9-30.EL fixes this issue. 

Comment 42 Red Hat Bugzilla 2006-03-07 18:37:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html