Bug 144477 - bonding mode=6 + dhcp doesn't work correctly
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686 Linux
Priority: medium, Severity: medium
Assigned To: John W. Linville
QA Contact: Brian Brock
Duplicates: 144473
Blocks: 168429
Reported: 2005-01-07 09:57 EST by Danny Trinh
Modified: 2007-11-30 17:07 EST
Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Last Closed: 2006-03-07 13:37:58 EST


Attachments
dmesg of failed system (15.12 KB, text/plain), 2005-01-07 10:01 EST, Danny Trinh
/var/log/messages of failed system (152.49 KB, text/plain), 2005-01-07 10:01 EST, Danny Trinh
Content of /proc/net/bonding/bond0 (326 bytes, text/plain), 2005-01-07 10:07 EST, Danny Trinh
This patch implements the workaround described in comment 12. (215 bytes, patch), 2005-07-11 17:12 EDT, Thomas Chenault
jwltest-bond_alb-mac-collision.patch (853 bytes, patch), 2005-07-12 13:54 EDT, John W. Linville

Description Danny Trinh 2005-01-07 09:57:31 EST
Description of problem:
After modifying modprobe.conf and creating ifcfg-bond0, ifcfg-eth0, and
ifcfg-eth1, I rebooted the system. Only eth0 is active, and a lot of
warnings and errors appear in dmesg and /var/log/messages.

Below are samples of errors:
bonding: bond0: link status definitely up for interface eth1.
bonding: Warning: the permanent HWaddr of eth0 - 00:06:5B:0F:6F:F8 -
is still in use by bond0. Set the HWaddr of eth0 to a different
address to avoid conflicts.
bonding: bond0: releasing active interface eth0
bonding: bond0: making interface eth1 the new active one.
bonding: Warning: the hw address of slave eth0 is in use by the bond;
giving it the hw address of eth1
bonding: bond0: enslaving eth0 as an active interface with a down link.
bonding: bond0: releasing active interface eth1
bonding: bond0: now running without any active interface !
bonding: Error: the hw address of slave eth1 is not unique - cannot
enslave it!<3>bonding: Error: the hw address of slave eth1 is not
unique - cannot enslave it!<6>e1000: eth0: e1000_watchdog: NIC Link is
Up 1000 Mbps Full Duplex
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: making interface eth0 the new active one.


Version-Release number of selected component (if applicable):
RHEL4-prerc2

How reproducible:
always

Steps to Reproduce:
1) Do a fresh install of RHEL4-pre-rc2.
2) Modify your modprobe.conf file to create an alias for the bonding
module named bond0, with options miimon=200 and mode=6 (i.e. balance-alb).
3) Create ifcfg-bond0, ifcfg-eth0 and ifcfg-eth1 configuration files
per the Red Hat manuals. Please note that I'm using DHCP for the bond0
interface, whereas the Red Hat manuals don't specify either DHCP or
static, so that shouldn't be a problem.
4) Reboot the server or restart the network with "service network restart".
  
Actual results:
Only eth0 is active, and there are a lot of warnings and errors in dmesg and
/var/log/messages.

Expected results:
Both eth0 and eth1 should be active in bond0, without the warnings and
errors shown above.

Additional info:
- Bonding mode=6 works fine if I use a static IP address.
- Bonding mode=6 only fails when I do "service network restart" or
when I reboot the system.
- However, if I manually turn on network by doing:
   service network stop
   ifconfig bond0 up
   ifenslave bond0 eth0
   ifenslave bond0 eth1
It works as expected
- Also, I tried using bonding mode=1. This works fine with or
without DHCP.
- I checked bug #91399 and added TYPE=Bonding to ifcfg-bond0. It
still fails as described above.
Comment 1 Danny Trinh 2005-01-07 10:01:11 EST
Created attachment 109468 [details]
dmesg of failed system
Comment 2 Danny Trinh 2005-01-07 10:01:41 EST
Created attachment 109469 [details]
/var/log/messages of failed system
Comment 3 Danny Trinh 2005-01-07 10:07:50 EST
Created attachment 109470 [details]
Content of /proc/net/bonding/bond0
Comment 4 Jeremy Katz 2005-01-07 12:03:51 EST
*** Bug 144473 has been marked as a duplicate of this bug. ***
Comment 5 Bill Nottingham 2005-01-07 12:43:58 EST
...
- However, if I manually turn on network by doing:
   service network stop
   ifconfig bond0 up
   ifenslave bond0 eth0
   ifenslave bond0 eth1
It works as expected
...

At what stage here are you getting the IP address via DHCP?
Comment 6 Danny Trinh 2005-01-07 13:31:17 EST
I received the IP address after "ifconfig bond0 up". Below is the output of
ifconfig.
bond0     Link encap:Ethernet  HWaddr 00:00:00:00:00:00  
          inet addr:10.9.162.144  Bcast:10.9.167.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Comment 7 Bill Nottingham 2005-01-07 15:37:43 EST
ifconfig bond0 up doesn't request an IP address... that would be the
IP that would already be on the downed interface.
Comment 8 Danny Trinh 2005-01-07 16:26:28 EST
I think you're right; that is the IP that was already on the downed interface.
But if I don't do it that way, only one NIC (eth0) comes up. See
comment #3.

Comment 9 Amit Bhutani 2005-01-25 15:59:28 EST
Changing the title to reflect the Update in which a fix for this
issue has been committed (by RH) or is being tracked.
Comment 11 Bill Nottingham 2005-06-17 18:00:14 EDT
I can't reproduce this here. With a config of:

ifcfg-bond0:
DEVICE=bond0
TYPE=Bonding
BOOTPROTO=dhcp
ONBOOT=yes

ifcfg-eth0:
DEVICE=eth0
MASTER=bond0
SLAVE=yes
HWADDR=<whatever>

ifcfg-eth1:
DEVICE=eth1
MASTER=bond0
SLAVE=yes
HWADDR=<whatever>

and:
alias bond0 bonding
options bonding miimon=500 mode=6

the interface works for me.
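
For anyone comparing their own files against the configuration above, here is a minimal Python sketch that parses ifcfg-style KEY=VALUE text and reports slave files that do not point at the bond. The helper names parse_ifcfg and check_slaves are hypothetical and are not part of initscripts. Real ifcfg files are shell fragments, so quoting or variable expansion beyond the simple case is not handled. The HWADDR lines are left out because their values are elided above.

```python
def parse_ifcfg(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def check_slaves(bond_text, slave_texts):
    """Return a list of problems found; an empty list means none were found."""
    bond_name = parse_ifcfg(bond_text).get("DEVICE", "")
    problems = []
    for text in slave_texts:
        cfg = parse_ifcfg(text)
        dev = cfg.get("DEVICE", "<unknown>")
        if cfg.get("MASTER") != bond_name:
            problems.append(f"{dev}: MASTER is not {bond_name}")
        if cfg.get("SLAVE", "").lower() != "yes":
            problems.append(f"{dev}: SLAVE is not yes")
    return problems


# File contents copied from the configuration in this comment (HWADDR omitted).
BOND0 = "DEVICE=bond0\nTYPE=Bonding\nBOOTPROTO=dhcp\nONBOOT=yes\n"
ETH0 = "DEVICE=eth0\nMASTER=bond0\nSLAVE=yes\n"
ETH1 = "DEVICE=eth1\nMASTER=bond0\nSLAVE=yes\n"

print(check_slaves(BOND0, [ETH0, ETH1]))  # prints []
```

A clean result only says the files are consistent with each other; it says nothing about the enslave timing discussed later in this report.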
Comment 12 Amit Bhutani 2005-06-17 18:23:03 EDT
Bill - Thomas Chenault from the Networking team at Dell did a root cause
analysis of this issue. Please review it and see if it sheds light on the
issue.

###############################################################################
Symptom:
One of the physical adapters that is configured to be a member of a channel 
bonding team is not enslaved to the team following a reboot or restart of the 
network. The problem has been observed on a two-member balance-alb bond using 
DHCP for address assignment. See "steps to reproduce" for further detail.


Cause:
The problem is caused by a race between the bonding driver and the ifup 
script. The ifup script brings the bond interface up and enslaves physical 
adapters early when DHCP is in use. After a DHCP lease has been acquired each 
of the physical interfaces is, in turn, removed from and re-enslaved to the 
bond. It is during the re-enslaving process that the second adapter may fail 
to join the team.

When the first interface is removed from the bond the interface's MAC address 
remains in use by the bond. After a brief delay, bonding then attempts to 
assign the team's MAC address, which is also the first interface's MAC 
address, to the second interface. Assuming that this reassignment is 
successful, when the first interface attempts to rejoin the team its MAC 
address is already in use by the team and the second physical adapter. Bonding 
assigns the MAC address of the second physical adapter to the first and allows 
it to join the team. Now the second physical adapter is removed from the team. 
When the second physical adapter attempts to rejoin the team there is a chance 
that its MAC address will still be in use by the first physical adapter. If 
the MAC addresses do collide at this point, the second adapter will be denied 
admittance to the team.

The race condition occurs at the point when the second physical adapter 
attempts to rejoin the team. If the adapter has been out of the team for a 
long enough duration, the MAC address of the first adapter will have been 
changed to match that of the team and no collision will occur. On the other 
hand, if the second adapter attempts to rejoin the team too quickly, its MAC 
address will still be in use by the first adapter and the collision occurs.
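
The sequence described above can be traced with a toy model. This is only a sketch of the address bookkeeping as described in this analysis, not the bonding driver's code. The MAC values are placeholders, the names Adapter and second_adapter_rejoins are hypothetical, and it assumes that an adapter returns to its permanent address when it is removed from the team. The handover_finished flag stands in for the timing: whether the first adapter has already been switched to the team's address before the second adapter tries to rejoin.

```python
from dataclasses import dataclass

# Placeholder addresses, for illustration only.
MAC_FIRST = "02:00:00:00:00:01"
MAC_SECOND = "02:00:00:00:00:02"


@dataclass
class Adapter:
    name: str
    permanent_mac: str
    current_mac: str


def second_adapter_rejoins(handover_finished: bool) -> bool:
    """Return True if the second adapter is admitted, False if it is refused."""
    first = Adapter("first", MAC_FIRST, MAC_FIRST)
    second = Adapter("second", MAC_SECOND, MAC_SECOND)
    team_mac = first.permanent_mac  # the team uses the first adapter's address

    # 1. The first adapter leaves the team; after a brief delay the second
    #    adapter is given the team's address.
    second.current_mac = team_mac

    # 2. The first adapter rejoins. Its own address is in use by the team and
    #    the second adapter, so it is given the second adapter's address.
    first.current_mac = second.permanent_mac

    # 3. The second adapter leaves the team.
    #    Assumption: it goes back to its permanent address.
    second.current_mac = second.permanent_mac

    # 4. Deferred step: the first adapter is switched to the team's address.
    #    Whether this has happened yet is the race.
    if handover_finished:
        first.current_mac = team_mac

    # 5. The second adapter rejoins; it is refused if its address is not unique.
    return second.current_mac != first.current_mac


for finished in (True, False):
    print(finished, second_adapter_rejoins(finished))
```

With the handover finished the addresses differ and the second adapter is admitted; otherwise both adapters hold the second adapter's address and the rejoin is refused, which matches the "not unique - cannot enslave it" error in the description.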

Work-around:
Adding a "sleep 2" immediately following line 420 of 
/etc/sysconfig/network-scripts/ifup resolves the problem in my test scenario.
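
A fixed sleep guesses how long the address change takes. As a hedged alternative for a test rig, the sketch below polls for a condition with a timeout instead. It is written for a current Python 3 rather than the Python shipped with RHEL 4, the helper names wait_until and read_mac are hypothetical, and reading /sys/class/net/<interface>/address assumes sysfs is mounted. It has not been tried against the ifup script discussed here.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Returns True if the predicate became true, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def read_mac(ifname, sysfs_root="/sys/class/net"):
    """Read an interface's current MAC address from sysfs (assumed mounted)."""
    with open(f"{sysfs_root}/{ifname}/address") as handle:
        return handle.read().strip().lower()
```

One possible use, given the analysis above, would be to wait until the first adapter's address matches the bond's before re-enslaving the second adapter, for example wait_until(lambda: read_mac("eth1") == read_mac("bond0")) with the interface names from this test system. Which condition is actually sufficient is an assumption, and the real fix belongs in the driver.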

Hardware:
    Focus (PE 420SC)
    BIOS A00
    Intel P4, 3.4GHz 
    2GB memory
    LOM (tg3, eth0, not used in bond)
    slot2 - Broadcom 5704 (tg3, eth1, eth2, these are the bonded interfaces)
    slot4 - Adaptec ASC-39320

Steps to Reproduce:
1. Install Red Hat Enterprise Linux 4 on a server with more than one physical
or logical processor and boot to kernel 2.6.9-5.ELsmp.
2. Install two Ethernet adapters.
3. Create a configuration file (/etc/sysconfig/network-scripts/ifcfg-bond0) for
a channel bonding interface and set its BOOTPROTO entry to "dhcp".
4. Create configuration files for each of the physical adapters such that
SLAVE=yes and MASTER=bond0.
5. Connect the network adapters to a network segment that has a DHCP server and
reboot.
6. Once the server boots, check the flags of each of the network interfaces 
with ifconfig and/or ifconfig  -a. 

If the problem has been successfully reproduced, the second physical adapter 
will not have its SLAVE property set. Note that, because the problem is rooted 
in a race condition, it may not be possible to reproduce in some 
configurations.
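
Besides the ifconfig flags, the check in step 6 can be made against the text of /proc/net/bonding/bond0. The following is a minimal sketch, assuming the file lists each enslaved adapter on a line beginning with "Slave Interface:". The function names are hypothetical, and SAMPLE is made-up text for illustration, not the content of the attachment on this bug.

```python
def slave_interfaces(proc_text):
    """Return interface names from 'Slave Interface: <name>' lines."""
    slaves = []
    for line in proc_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Slave Interface":
            slaves.append(value.strip())
    return slaves


def missing_slaves(proc_text, expected):
    """Return the expected interface names that are not enslaved."""
    present = set(slave_interfaces(proc_text))
    return [name for name in expected if name not in present]


# Hypothetical sample in which only one of the two adapters joined the team.
SAMPLE = """\
Bonding Mode: adaptive load balancing
MII Status: up

Slave Interface: eth1
MII Status: up
"""

print(missing_slaves(SAMPLE, ["eth1", "eth2"]))  # prints ['eth2']
```

On a live system the text would come from reading /proc/net/bonding/bond0. Because the problem is a race, a single clean result does not show that the configuration is unaffected.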
Comment 13 Bill Nottingham 2005-06-17 18:40:06 EDT
Attempting to guess how long the kernel might take to respond seems fraught with
danger; this seems like a flaw in the driver.
Comment 16 John W. Linville 2005-07-11 15:08:32 EDT
Could I see the actual changed copy of /etc/sysconfig/network-scripts/ifup (or
a diff) from comment 12?  I think the line numbers may differ between his copy
and what I'm looking at...
Comment 17 Thomas Chenault 2005-07-11 17:12:54 EDT
Created attachment 116624 [details]
This patch implements the workaround described in comment 12.

Diff requested in comment 16 attached as ifuphack.patch.
Comment 18 John W. Linville 2005-07-12 13:54:30 EDT
Created attachment 116667 [details]
jwltest-bond_alb-mac-collision.patch

Experimental patch to improve handling of MAC address collision during
ifenslave for bonding mode 6...
Comment 19 John W. Linville 2005-07-12 14:00:29 EDT
Test kernels w/ above patch are available here: 
 
   http://people.redhat.com/linville/kernels/rhel4/ 
 
Please give them a try (w/ unpatched ifup) to see if they improve the 
situation, and post the results here...thanks! 
Comment 20 Ritesh Raj Sarraf 2005-07-14 11:54:41 EDT
The test kernels have fixed the bug.
Comment 21 Thomas Chenault 2005-07-14 12:12:14 EDT
In testing with kernel 2.6.9-11.27.EL.jwltest.45smp I have not been able to
reproduce the problem. The issue appears to be resolved.
Comment 24 John W. Linville 2005-07-28 15:02:30 EDT
Patch posted upstream on 7/28...awaiting commentary and/or acceptance...  
Comment 25 John W. Linville 2005-07-29 10:48:45 EDT
Seems to have been accepted upstream... (that was quick!)  I'll propose this 
for U3... 
Comment 30 Ritesh Raj Sarraf 2005-09-02 09:13:14 EDT
This defect isn't fixed at all.
I've tried it on RHEL 4 U2 B1 x86_64 with kernel 2.6.9-17.ELsmp. It is not fixed
in this kernel.


John Linville's kernel packages have this fixed.
http://people.redhat.com/linville/kernels/rhel4/
The patch needs to be included in Red Hat's default kernels.
The exact patch is jwltest-bond_alb-mac-collision.patch.
Comment 31 John W. Linville 2005-09-02 10:14:02 EDT
Please see comment 25, "I'll propose this for U3..." :-) 
Comment 32 Samuel Benjamin 2005-10-10 16:17:19 EDT
This fix is ready and verified by Dell. Please add the acks to move this forward
for inclusion in U3. Thanks.
Comment 37 Jay Turner 2006-01-03 15:14:22 EST
Please test and confirm resolution with 2.6.9-27.EL or later.  Thanks.
Comment 38 Ritesh Raj Sarraf 2006-01-04 09:09:58 EST
Thanks, it's fixed in RHEL3 U7 Beta1.
Comment 39 Samuel Benjamin 2006-02-06 15:38:39 EST
Does the previous comment mean this is fixed in RHEL4 U3? For RHEL3,
it86692/bug #178885 is proposed for U8.
Comment 40 Charles Rose 2006-02-09 02:09:00 EST
Comment #38 above is a typo. It should read RHEL4 U3 Beta1.

I have confirmed that RHEL4 U3 Beta kernel-2.6.9-30.EL fixes this issue. 
Comment 42 Red Hat Bugzilla 2006-03-07 13:37:58 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html
