Bug 144477
| Summary: | bonding mode=6 + dhcp doesn't work correctly |
|---|---|
| Product: | Red Hat Enterprise Linux 4 |
| Component: | kernel |
| Version: | 4.0 |
| Hardware: | i686 |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | medium |
| Reporter: | Danny Trinh <danny_trinh> |
| Assignee: | John W. Linville <linville> |
| QA Contact: | Brian Brock <bbrock> |
| CC: | davej, jbaron, jturner, notting, tao, thomas_chenault, wwlinuxengineering |
| Fixed In Version: | RHSA-2006-0132 |
| Doc Type: | Bug Fix |
| Last Closed: | 2006-03-07 18:37:58 UTC |
| Bug Blocks: | 168429 |
Description
Danny Trinh
2005-01-07 14:57:31 UTC
Created attachment 109468 [details]
dmesg of failed system
Created attachment 109469 [details]
/var/log/messages of failed system
Created attachment 109470 [details]
Content of /proc/net/bonding/bond0
*** Bug 144473 has been marked as a duplicate of this bug. ***

... - However, if I manually turn on network by doing:

    service network stop
    ifconfig bond0 up
    ifenslave bond0 eth0
    ifenslave bond0 eth1

It works as expected ...

At what stage here are you getting the IP address via DHCP?

I received the IP address after "ifconfig bond0 up". Below is the output of ifconfig.

    bond0     Link encap:Ethernet  HWaddr 00:00:00:00:00:00
              inet addr:10.9.162.144  Bcast:10.9.167.255  Mask:255.255.248.0
              UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
              RX packets:0 errors:0 dropped:0 overruns:0 frame:0
              TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

ifconfig bond0 up doesn't request an IP address... that would be the IP that would already be on the downed interface.

I think you are right, that is the IP that was already on the downed interface. But if I don't do it that way, I see that only one NIC (eth0) is up. Check comment#3.

Changing the title to reflect the Update in which a fix for this issue has been committed (from RH) or is being tracked for..

I can't reproduce this here. With a config of:

    ifcfg-bond0:
    DEVICE=bond0
    TYPE=Bonding
    BOOTPROTO=dhcp
    ONBOOT=yes

    ifcfg-eth0:
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    HWADDR=<whatever>

    ifcfg-eth1:
    DEVICE=eth1
    MASTER=bond0
    SLAVE=yes
    HWADDR=<whatever>

and:

    alias bond0 bonding
    options bonding miimon=500 mode=6

the interface works for me.

Bill- Thomas Chenault from the Networking team at Dell did a root cause analysis of this issue. Please review it and see if it sheds light on the issue.

###############################################################################

Symptom: One of the physical adapters that is configured to be a member of a channel bonding team is not enslaved to the team following a reboot or restart of the network. The problem has been observed on a two-member balance-alb bond using DHCP for address assignment. See "steps to reproduce" for further detail.
Cause: The problem is caused by a race between the bonding driver and the ifup script. The ifup script brings the bond interface up and enslaves physical adapters early when DHCP is in use. After a DHCP lease has been acquired each of the physical interfaces is, in turn, removed from and re-enslaved to the bond. It is during the re-enslaving process that the second adapter may fail to join the team.

When the first interface is removed from the bond the interface's MAC address remains in use by the bond. After a brief delay, bonding then attempts to assign the team's MAC address, which is also the first interface's MAC address, to the second interface. Assuming that this reassignment is successful, when the first interface attempts to rejoin the team its MAC address is already in use by the team and the second physical adapter. Bonding assigns the MAC address of the second physical adapter to the first and allows it to join the team.

Now the second physical adapter is removed from the team. When the second physical adapter attempts to rejoin the team there is a chance that its MAC address will still be in use by the first physical adapter. If the MAC addresses do collide at this point, the second adapter will be denied admittance to the team.

The race condition occurs at the point when the second physical adapter attempts to rejoin the team. If the adapter has been out of the team for a long enough duration, the MAC address of the first adapter will have been changed to match that of the team and no collision will occur. On the other hand, if the second adapter attempts to rejoin the team too quickly, its MAC address will still be in use by the first adapter and the collision occurs.

Work-around: Adding a "sleep 2" immediately following line 420 of /etc/sysconfig/network-scripts/ifup resolves the problem in my test scenario.
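The ordering problem described under Cause can be sketched as a toy model. This is only an illustration of the sequence in the analysis above, not the bonding driver's code: the MAC values, the `Nic` class, the function name and the handover delay are all hypothetical, and the model assumes that an adapter removed from the team reverts to its permanent MAC address.

```python
from dataclasses import dataclass

# Hypothetical permanent MAC addresses, for illustration only.
FIRST_MAC = "00:00:00:00:00:0a"
SECOND_MAC = "00:00:00:00:00:0b"


@dataclass
class Nic:
    name: str
    permanent_mac: str
    current_mac: str


def second_adapter_admitted(wait_s: float, handover_delay_s: float) -> bool:
    """Replay the remove/re-enslave sequence from the analysis.

    wait_s: how long the second adapter stays out of the team.
    handover_delay_s: hypothetical time bonding needs to move the
        team's MAC address back onto the first adapter.
    """
    team_mac = FIRST_MAC  # the team uses the first adapter's MAC
    first = Nic("first", FIRST_MAC, FIRST_MAC)
    second = Nic("second", SECOND_MAC, SECOND_MAC)

    # First adapter is removed; bonding moves the team MAC to the second.
    second.current_mac = team_mac
    # First adapter rejoins; its own MAC is taken, so it gets the second's.
    first.current_mac = second.permanent_mac
    # Second adapter is removed (assumed to revert to its permanent MAC).
    second.current_mac = second.permanent_mac
    # Only after the handover delay does the first adapter take the team MAC.
    if wait_s >= handover_delay_s:
        first.current_mac = team_mac

    # The second adapter is denied if its MAC collides with the first's.
    return second.current_mac != first.current_mac
```

With a hypothetical handover delay of 1 second, `second_adapter_admitted(0.0, 1.0)` returns `False` and `second_adapter_admitted(2.0, 1.0)` returns `True`, which is the effect the "sleep 2" work-around relies on.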
Hardware:
    Focus (PE 420SC)
    BIOS A00
    Intel P4, 3.4GHz
    2GB memory
    LOM (tg3, eth0, not used in bond)
    slot2 - Broadcom 5704 (tg3, eth1, eth2, these are the bonded interfaces)
    slot4 - Adaptec ASC-39320

Steps to Reproduce:
1. Install Red Hat Enterprise Linux 4 on a server with more than one physical or logical processor and boot to kernel 2.6.9-5.ELsmp.
2. Install two Ethernet adapters.
3. Create a configuration file (/etc/sysconfig/network-scripts/ifcfg-bond0) for a channel bonding interface and set its BOOTPROTO entry to "dhcp".
4. Create configuration files for each of the physical adapters such that SLAVE=yes and MASTER=bond0.
5. Connect the network adapters to a network segment that has a DHCP server and reboot.
6. Once the server boots, check the flags of each of the network interfaces with ifconfig and/or ifconfig -a. If the problem has been successfully reproduced, the second physical adapter will not have its SLAVE property set.

Note that, because the problem is rooted in a race condition, it may not be possible to reproduce in some configurations.

Attempting to guess how long the kernel might take to respond seems fraught with danger; this seems like a flaw in the driver.

Could I see the actual changed copy of /etc/sysconfig/network-scripts/ifup (or a diff) from comment 12? I think the line numbers may differ between his copy and what I'm looking at...

Created attachment 116624 [details]
This patch implements the workaround described in comment 12.

Diff requested in comment 16 attached as ifuphack.patch.

Created attachment 116667 [details]
jwltest-bond_alb-mac-collision.patch
Experimental patch to improve handling of MAC address collision during
ifenslave for bonding mode 6...
Test kernels w/ above patch are available here:
http://people.redhat.com/linville/kernels/rhel4/
Please give them a try (w/ unpatched ifup) to see if they improve the situation, and post the results here...thanks!

The test kernels have fixed the bug. In testing with kernel 2.6.9-11.27.EL.jwltest.45smp I have not been able to reproduce the problem. The issue appears to be resolved.

Patch posted upstream on 7/28...awaiting commentary and/or acceptance...

Seems to have been accepted upstream... (that was quick!) I'll propose this for U3...

This defect isn't fixed at all. I've tried it on RHEL 4 U2 B1 x86_64 with kernel 2.6.9-17.Elsmp. It is not fixed in this kernel. John Linville's kernel packages have this fixed.
http://people.redhat.com/linville/kernels/rhel4/
The patch needs to be included in Red Hat's default kernels. This is the exact patch: jwltest-bond_alb-mac-collision.patch

Please see comment 25, "I'll propose this for U3..." :-)

This fix is ready and verified by Dell. Please add the acks to move this forward for inclusion in U3. Thanks.

Please test and confirm resolution with 2.6.9-27.EL or later. Thanks.

Thanks, it's fixed in RHEL3 U7 Beta1.

Does previous comment mean fixed in RHEL4-U3? RHEL3 it86692/bug#178885 is proposed for U8.

Comment #38 above is a typo. It should read RHEL4 U3 Beta1. I have confirmed that RHEL4 U3 Beta kernel-2.6.9-30.EL fixes this issue.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2006-0132.html
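When confirming the fix on a given kernel, the enslavement check from the steps to reproduce can also be scripted against /proc/net/bonding/bond0, which lists each slave on a "Slave Interface:" line. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the sample text is made up for illustration (it is not the content of attachment 109470).

```python
def enslaved_interfaces(status_text: str) -> list[str]:
    """Return interface names found on 'Slave Interface:' lines."""
    prefix = "Slave Interface:"
    return [
        line.split(":", 1)[1].strip()
        for line in status_text.splitlines()
        if line.strip().startswith(prefix)
    ]


def missing_slaves(status_text: str, expected: list[str]) -> list[str]:
    """Return the expected interfaces that are not enslaved."""
    present = set(enslaved_interfaces(status_text))
    return [name for name in expected if name not in present]


# Hypothetical sample resembling a bond where the second adapter
# failed to rejoin; not taken from the attachments on this bug.
SAMPLE_STATUS = """\
Bonding Mode: adaptive load balancing
MII Status: up

Slave Interface: eth0
MII Status: up
"""
```

On a live system the text would come from reading /proc/net/bonding/bond0; for the sample above, `missing_slaves(SAMPLE_STATUS, ["eth0", "eth1"])` returns `["eth1"]`.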