Bug 144477
| Summary: | bonding mode=6 + dhcp doesn't work correctly | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Danny Trinh <danny_trinh> |
| Component: | kernel | Assignee: | John W. Linville <linville> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.0 | CC: | davej, jbaron, jturner, notting, tao, thomas_chenault, wwlinuxengineering |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHSA-2006-0132 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-03-07 18:37:58 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 168429 | | |
Description
Danny Trinh
2005-01-07 14:57:31 UTC
Created attachment 109468 [details]
dmesg of failed system
Created attachment 109469 [details]
/var/log/messages of failed system
Created attachment 109470 [details]
Content of /proc/net/bonding/bond0
*** Bug 144473 has been marked as a duplicate of this bug. ***

... - However, if I manually turn on the network by doing:
service network stop
ifconfig bond0 up
ifenslave bond0 eth0
ifenslave bond0 eth1
it works as expected ...

At what stage here are you getting the IP address via DHCP?

I received the IP address after "ifconfig bond0 up". Below is the output of ifconfig.
bond0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:10.9.162.144 Bcast:10.9.167.255
Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

ifconfig bond0 up doesn't request an IP address... that would be the IP that was already on the downed interface.

I think you are right; that is the IP that was already on the downed interface. But if I don't do it that way, I see that only one NIC (eth0) is up. Check comment #3.

Changing the title to reflect the Update in which a fix for this issue has been committed (from RH) or is being tracked for.

I can't reproduce this here. With a config of:
ifcfg-bond0:
DEVICE=bond0
TYPE=Bonding
BOOTPROTO=dhcp
ONBOOT=yes
ifcfg-eth0:
DEVICE=eth0
MASTER=bond0
SLAVE=yes
HWADDR=<whatever>
ifcfg-eth1:
DEVICE=eth1
MASTER=bond0
SLAVE=yes
HWADDR=<whatever>
and:
alias bond0 bonding
options bonding miimon=500 mode=6
the interface works for me.

Bill- Thomas Chenault from the Networking team at Dell did a root cause analysis of this issue. Please review it and see if it sheds light on the issue.
###############################################################################
Symptom:
One of the physical adapters that is configured to be a member of a channel
bonding team is not enslaved to the team following a reboot or restart of the
network. The problem has been observed on a two-member balance-alb bond using
DHCP for address assignment. See "steps to reproduce" for further detail.
Cause:
The problem is caused by a race between the bonding driver and the ifup
script. The ifup script brings the bond interface up and enslaves physical
adapters early when DHCP is in use. After a DHCP lease has been acquired each
of the physical interfaces is, in turn, removed from and re-enslaved to the
bond. It is during the re-enslaving process that the second adapter may fail
to join the team.
When the first interface is removed from the bond the interface's MAC address
remains in use by the bond. After a brief delay, bonding then attempts to
assign the team's MAC address, which is also the first interface's MAC
address, to the second interface. Assuming that this reassignment is
successful, when the first interface attempts to rejoin the team its MAC
address is already in use by the team and the second physical adapter. Bonding
assigns the MAC address of the second physical adapter to the first and allows
it to join the team. Now the second physical adapter is removed from the team.
When the second physical adapter attempts to rejoin the team there is a chance
that its MAC address will still be in use by the first physical adapter. If
the MAC addresses do collide at this point, the second adapter will be denied
admittance to the team.
The race condition occurs at the point when the second physical adapter
attempts to rejoin the team. If the adapter has been out of the team for a
long enough duration, the MAC address of the first adapter will have been
changed to match that of the team and no collision will occur. On the other
hand, if the second adapter attempts to rejoin the team too quickly, its MAC
address will still be in use by the first adapter and the collision occurs.
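
The timing dependence described above can be illustrated with a toy model. This is only a sketch of the reasoning in this analysis, not the bonding driver's code: the MAC values, the function name, and the two delay parameters are hypothetical, and the deferred MAC reassignment is reduced to a single comparison.

```python
# Toy model of the re-enslave race. All names and values are hypothetical.
TEAM_MAC = "00:00:00:00:00:0a"   # the bond's MAC (originally the first adapter's)
OTHER_MAC = "00:00:00:00:00:0b"  # the second adapter's own MAC


def second_adapter_rejoins(rejoin_delay: float, swap_delay: float) -> bool:
    """Return True if the second adapter is admitted back into the bond.

    The model starts at the moment the second adapter has been released:
    the first adapter is in the bond but still carries the second
    adapter's MAC, and the driver will give it the team MAC once
    swap_delay seconds have passed.
    """
    first_adapter_mac = OTHER_MAC
    if rejoin_delay >= swap_delay:
        # The deferred reassignment has already happened.
        first_adapter_mac = TEAM_MAC
    second_adapter_mac = OTHER_MAC
    # Admission is refused when the two MAC addresses collide.
    return first_adapter_mac != second_adapter_mac


print(second_adapter_rejoins(rejoin_delay=2.0, swap_delay=0.5))
print(second_adapter_rejoins(rejoin_delay=0.1, swap_delay=0.5))
```

In this model, rejoining after the swap succeeds and rejoining before it fails, which matches the observation that the failure depends on how quickly ifup re-enslaves the second adapter.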
Work-around:
Adding a "sleep 2" immediately following line 420 of
/etc/sysconfig/network-scripts/ifup resolves the problem in my test scenario.
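
As an illustration only, the fixed delay could in principle be replaced by a bounded wait that checks for the condition itself. The sketch below is not part of ifup (which is a shell script) and was not tested on the affected systems; the helper names, the timeout values, and the use of /sys/class/net/<iface>/address to read the current MAC address are assumptions.

```python
import time
from typing import Callable


def read_mac(iface: str) -> str:
    """Read an interface's current MAC address from sysfs (assumed path)."""
    with open(f"/sys/class/net/{iface}/address") as f:
        return f.read().strip().lower()


def wait_for_distinct_macs(
    enslaved: str,
    candidate: str,
    timeout: float = 5.0,
    interval: float = 0.1,
    get_mac: Callable[[str], str] = read_mac,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the two interfaces report different MAC addresses.

    Returns True as soon as the addresses differ, or False if they
    are still equal once roughly `timeout` seconds have been spent waiting.
    """
    waited = 0.0
    while True:
        if get_mac(enslaved) != get_mac(candidate):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

A caller would re-enslave the second adapter only after `wait_for_distinct_macs("eth1", "eth2")` returns True, using the interface names from the hardware list below. Like the sleep, this only hides the race from the script's side.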
Hardware:
Focus (PE 420SC)
BIOS A00
Intel P4, 3.4GHz
2GB memory
LOM (tg3, eth0, not used in bond)
slot2 - Broadcom 5704 (tg3, eth1, eth2, these are the bonded interfaces)
slot4 - Adaptec ASC-39320
Steps to Reproduce:
1. Install Red Hat Enterprise Linux 4 on a server with more than one physical
or logical processor and boot to kernel 2.6.9-5.ELsmp.
2. Install two Ethernet adapters.
3. Create a configuration file (/etc/sysconfig/network-scripts/ifcfg-bond0) for
a channel bonding interface and set its BOOTPROTO entry to "dhcp".
4. Create configuration files for each of the physical adapters such that
SLAVE=yes and MASTER=bond0.
5. Connect the network adapters to a network segment that has a DHCP server and
reboot.
6. Once the server boots, check the flags of each of the network interfaces
with ifconfig and/or ifconfig -a.
If the problem has been successfully reproduced, the second physical adapter
will not have its SLAVE property set. Note that, because the problem is rooted
in a race condition, it may not be possible to reproduce in some
configurations.
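
Besides the ifconfig flags, the check in step 6 can be made against the content of /proc/net/bonding/bond0 (the file attached to this bug). The following is a minimal sketch that assumes each enslaved adapter appears there on a line of the form "Slave Interface: <name>"; the function names and the sample text are hypothetical.

```python
from typing import List


def enslaved_interfaces(bonding_status: str) -> List[str]:
    """Return the interface names listed as slaves in the bonding status text."""
    slaves = []
    for line in bonding_status.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Slave Interface":
            slaves.append(value.strip())
    return slaves


def missing_slaves(bonding_status: str, expected: List[str]) -> List[str]:
    """Return the expected interfaces that are not enslaved."""
    present = set(enslaved_interfaces(bonding_status))
    return [iface for iface in expected if iface not in present]


# Hypothetical sample in which only eth1 joined the bond.
sample = """Bonding Mode: adaptive load balancing
MII Status: up

Slave Interface: eth1
MII Status: up
"""
print(missing_slaves(sample, ["eth1", "eth2"]))
```

On a live system the text would come from reading /proc/net/bonding/bond0; a non-empty result indicates that the problem has been reproduced.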

Attempting to guess how long the kernel might take to respond seems fraught with danger; this seems like a flaw in the driver.

Could I see the actual changed copy of /etc/sysconfig/network-scripts/ifup (or a diff) from comment 12? I think the line numbers may differ between his copy and what I'm looking at...

Created attachment 116624 [details]
This patch implements the workaround described in comment 12.

Diff requested in comment 16 attached as ifuphack.patch.

Created attachment 116667 [details]
jwltest-bond_alb-mac-collision.patch

Experimental patch to improve handling of MAC address collision during ifenslave for bonding mode 6...

Test kernels w/ above patch are available here:
http://people.redhat.com/linville/kernels/rhel4/
Please give them a try (w/ unpatched ifup) to see if they improve the situation, and post the results here...thanks!

The test kernels have fixed the bug.

In testing with kernel 2.6.9-11.27.EL.jwltest.45smp I have not been able to reproduce the problem. The issue appears to be resolved.

Patch posted upstream on 7/28...awaiting commentary and/or acceptance...

Seems to have been accepted upstream... (that was quick!) I'll propose this for U3...

This defect isn't fixed at all. I've tried it on RHEL 4 U2 B1 x86_64 with kernel 2.6.9-17.Elsmp. It is not fixed in this kernel. John Linville's kernel packages have this fixed.
http://people.redhat.com/linville/kernels/rhel4/
The patch needs to be included in Red Hat's default kernels. This is the exact patch: jwltest-bond_alb-mac-collision.patch

Please see comment 25, "I'll propose this for U3..." :-)

This fix is ready and verified by Dell. Please add the acks to move this forward for inclusion in U3. Thanks.

Please test and confirm resolution with 2.6.9-27.EL or later. Thanks.

Thanks, it's fixed in RHEL3 U7 Beta1.

Does the previous comment mean fixed in RHEL4-U3? RHEL3 it86692/bug#178885 is proposed for U8.

Comment #38 above is a typo. It should read RHEL4 U3 Beta1.

I have confirmed that RHEL4 U3 Beta kernel-2.6.9-30.EL fixes this issue.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2006-0132.html