Bug 613828
| Summary: | bond0 only works in promisc mode | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | jghobrial |
| Component: | kernel | Assignee: | Andy Gospodarek <agospoda> |
| Status: | CLOSED DUPLICATE | QA Contact: | Network QE <network-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 5.5 | CC: | agospoda, anton, cdupuis, clusterman, cww, dhoward, dwu, gbarros, GR-Linux-NIC-Dev, herrold, hjia, imusayev, ivan.borghetti, jentrena, jpirko, jwest, jwilson, peterm, rajesh.borundia, redhat.com, syeghiay, tao, tcamuso, vincew |
| Target Milestone: | rc | Keywords: | Reopened, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-11-02 15:21:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
jghobrial
2010-07-12 21:53:02 UTC
My real issue is that one of the NICs had no link and was contributing to this problem. Too bad there are no explicit error messages when a bond is started with a slave whose link is unavailable; the machine would have had no network connectivity even though the bond was not properly running. After further investigation, I can confirm that rebooting with the bond started at boot, using the normal Red Hat startup methods, causes the network to not work at all. I'll do some further debugging.

This definitely seems odd. I've seen RR-mode bonding work just fine, so I know it is not totally broken. With a mix of possible bonding and routing issues together, this might be a bit tricky to debug. Are the two IP addresses used (one for eth0 and one for bond0) in the same subnet? Are they in the same broadcast domain? What about the hosts trying to connect? Where are they?

The key will be to first make sure that bonding is working properly. This is best accomplished by taking down eth0 and only using bond0 with a host on the same network. Is there any chance that your NICs do not support configuration of their MAC address, so bond0 only works correctly when the slaves are told to receive all traffic rather than only traffic destined for the bond0 interface's MAC address?

Once you can confirm this is working, you can bring up eth0 and take a look at some of the sysctl options that control ARP, as ARP can be problematic if eth0 and bond0 are on the same broadcast domain. Things like:

arp_filter - BOOLEAN

  1 - Allows you to have multiple network interfaces on the same subnet, and have the ARPs for each interface be answered based on whether or not the kernel would route a packet from the ARP'd IP out that interface (therefore you must use source-based routing for this to work). In other words, it allows control of which cards (usually 1) will respond to an ARP request.

  0 - (default) The kernel can respond to ARP requests with addresses from other interfaces. This may seem wrong but it usually makes sense, because it increases the chance of successful communication. IP addresses are owned by the complete host on Linux, not by particular interfaces. Only for more complex setups like load balancing does this behaviour cause problems.

  arp_filter for the interface will be enabled if at least one of conf/{all,interface}/arp_filter is set to TRUE; it will be disabled otherwise.

arp_announce - INTEGER

  Define different restriction levels for announcing the local source IP address from IP packets in ARP requests sent on an interface:

  0 - (default) Use any local address, configured on any interface.

  1 - Try to avoid local addresses that are not in the target's subnet for this interface. This mode is useful when target hosts reachable via this interface require the source IP address in ARP requests to be part of their logical network configured on the receiving interface. When we generate the request we will check all our subnets that include the target IP and will preserve the source address if it is from such a subnet. If there is no such subnet we select the source address according to the rules for level 2.

  2 - Always use the best local address for this target. In this mode we ignore the source address in the IP packet and try to select the local address that we prefer for talks with the target host. Such a local address is selected by looking for primary IP addresses on all our subnets on the outgoing interface that include the target IP address. If no suitable local address is found we select the first local address we have on the outgoing interface or on all other interfaces, with the hope that we will receive a reply to our request, sometimes regardless of the source IP address we announce.

  The max value from conf/{all,interface}/arp_announce is used. Increasing the restriction level gives more chance of receiving an answer from the resolved target, while decreasing the level announces more valid sender information.

arp_ignore - INTEGER

  Define different modes for sending replies in response to received ARP requests that resolve local target IP addresses:

  0 - (default) Reply for any local target IP address, configured on any interface.

  1 - Reply only if the target IP address is a local address configured on the incoming interface.

  2 - Reply only if the target IP address is a local address configured on the incoming interface and the sender's IP address is part of the same subnet on this interface.

  3 - Do not reply for local addresses configured with scope host; only resolutions for global and link addresses are replied to.

  4-7 - Reserved.

  8 - Do not reply for any local address.

  The max value from conf/{all,interface}/arp_ignore is used when an ARP request is received on the {interface}.

I set the following in my /etc/sysctl.conf on all hosts with more than one network interface:

```
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_ignore = 1
```

and taking interfaces up and down does not impact anything. I also do not know what drivers and NICs are being used, or even the kernel version. That bit of info would be helpful.

RHEL 5.5
Linux 2.6.18-194.8.1.el5 #1 SMP Wed Jun 23 10:52:51 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Please note this is my custom modprobe.conf ordering, as the original install had the netxen_nic NICs listed before the igb NICs. I've also modified the MAC addresses in the ifcfg-eth? files to get them reordered for my sanity (a sketch of that HWADDR pinning follows the alias list below).
```
alias eth0 igb
alias eth1 igb
alias eth2 netxen_nic
alias eth3 netxen_nic
alias eth4 netxen_nic
alias eth5 netxen_nic
alias eth6 netxen_nic
alias eth7 netxen_nic
alias eth8 netxen_nic
alias eth9 netxen_nic
```
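For reference, RHEL 5 pins an interface name to a particular port via the HWADDR line in that interface's ifcfg file. A minimal sketch, with an illustrative MAC address:

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 -- MAC address is illustrative
DEVICE=eth0
HWADDR=02:00:00:AA:BB:01   # pins the name "eth0" to the port with this MAC
```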
eth0 and bond0 are on the same subnet and connected to the same switch
eth1, eth2, eth3 are slaves to bond0
eth6, eth7, eth8, and eth9 are connected to a different switch, each with its own non-routable address. In this case they are being used for AoE (ATA over Ethernet) purposes.
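For reference, a RHEL 5 round-robin bond over those slaves would typically be configured along these lines. This is a sketch only; the reporter's actual ifcfg files are not shown in the thread, and the address is illustrative:

```
# /etc/sysconfig/network-scripts/ifcfg-bond0 -- address is illustrative
DEVICE=bond0
IPADDR=192.168.1.10
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=0 miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth1 (same pattern for eth2 and eth3)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes
```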
> Is there any chance that your NICs do not support
> configuration of their MAC address, so bond0 only works correctly when the
> slaves are told to receive all traffic rather than traffic destined for the
> bond0 interface's MAC address?
How would I know if they don't support this?
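One way to check, not shown in the thread, is to try setting a MAC address by hand on an idle slave and reading it back; the interface name and MAC value below are illustrative:

```
ifconfig eth1 down
ifconfig eth1 hw ether 02:00:00:00:00:01   # illustrative locally administered MAC
ifconfig eth1 | grep -i hwaddr             # if the old MAC is still shown,
                                           # the driver ignores MAC configuration
ifconfig eth1 up
```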
```
lspci | grep Ethernet
04:00.0 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.1 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.2 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.3 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
05:00.0 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
05:00.1 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
05:00.2 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
05:00.3 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
07:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
07:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
```
I'll try the arp options and see what happens.
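For reference, those ARP settings can also be applied immediately, without a reboot; a minimal sketch using the stock sysctl tool (the values persist across reboots only if they are also added to /etc/sysctl.conf):

```
sysctl -w net.ipv4.conf.all.arp_filter=1
sysctl -w net.ipv4.conf.all.arp_ignore=1
# verify the running values
sysctl net.ipv4.conf.all.arp_filter net.ipv4.conf.all.arp_ignore
```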
Ah, netxen NICs. This might explain the problem. Older versions of the netxen driver didn't handle the MAC address being set, and I wonder if there are still some lingering issues. The two upstream commits that I wanted to make sure were included in RHEL were:

```
commit 5d09e534bbb94e1fdc8e4783ac822bc172465a91
Author: Narender Kumar <narender.kumar>
Date:   Fri Nov 20 22:08:57 2009 +0000

    netxen : fix BOND_MODE_TLB/ALB mode.

commit 3d0a3cc9d72047e4baa76021c897f64fc84cc543
Author: Dhananjay Phadke <dhananjay>
Date:   Tue May 5 19:05:08 2009 +0000

    netxen: fix bonding support
```

but both appear to be included in RHEL 5.5 and both appear to be applied. I just tried this on some netxen hardware we have, and mode 0 bonding worked just fine.

```
[root@hp-dl580g7-01 network-scripts]# lspci -s 0000:04:00.0 -n
04:00.0 0200: 4040:0100 (rev 42)
[root@hp-dl580g7-01 network-scripts]# lspci -vv -s 0000:04:00.0
04:00.0 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
        Subsystem: Hewlett-Packard Company NC375i Integrated Quad Port Multifunction Gigabit Server Adapter
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 154
        Region 0: Memory at d0000000 (64-bit, non-prefetchable) [size=2M]
        Region 4: Memory at d2000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [40] MSI-X: Enable+ Mask- TabSize=64
                Vector table: BAR=0 offset=00090000
                PBA: BAR=0 offset=00090800
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a0] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s <64ns, L1 <1us
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                Device: MaxPayload 128 bytes, MaxReadReq 256 bytes
                Link: Supported Speed unknown, Width x8, ASPM L0s, Port 0
                Link: Latency L0s <64ns, L1 <1us
                Link: ASPM L0s Enabled RCB 64 bytes CommClk- ExtSynch-
                Link: Speed unknown, Width x8
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 75-73-48-6e-61-46-69-59
[root@hp-dl580g7-01 network-scripts]# ethtool -i eth0
driver: netxen_nic
version: 4.0.65
firmware-version: 4.0.520
bus-info: 0000:04:00.0
[root@hp-dl580g7-01 network-scripts]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 1000
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: d8:d3:85:62:c2:54

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: d8:d3:85:62:c2:56
[root@hp-dl580g7-01 network-scripts]# more ifcfg-bond0
DEVICE=bond0
BOOTPROTO=dhcp
ONBOOT=yes
BONDING_OPTS="mode=0 miimon=1000"
[root@hp-dl580g7-01 network-scripts]# more /etc/modprobe.conf
alias eth0 netxen_nic
alias eth1 netxen_nic
alias eth2 netxen_nic
alias eth3 netxen_nic
alias scsi_hostadapter cciss
alias scsi_hostadapter1 ata_piix
alias bond0 bonding
```

Can you cut and paste the output of:

```
# ethtool -i eth0
```

Maybe your card needs new firmware?
Hi Andy,

Output from 'ethtool -i eth4':

```
driver: netxen_nic
version: 4.0.65
firmware-version: 4.0.520
bus-info: 0000:04:00.0
```

Firmware is the same version as yours. FYI, the customer doesn't experience the issue on all of their netxen-equipped servers, only on some of them. But they are experiencing the issue on a DL580 G7, the same server model you tried (they are using active-backup mode 1 bonding, though). They are experiencing the issue even with only one NIC in the bond:

```
$ cat proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth4
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth4
MII Status: up
Link Failure Count: 0
Permanent HW addr: d8:d3:85:62:43:70

$ cat etc/sysconfig/network-scripts/ifcfg-bond0
#
# bond0 interface configuration file
#
DEVICE=bond0
IPADDR=x.xxx.xxx.xx (customer information, hidden)
NETMASK=255.255.255.0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="miimon=100 mode=1"

$ cat etc/modprobe.conf
alias eth0 bnx2
alias eth1 bnx2
alias eth2 bnx2
alias eth3 bnx2
alias eth4 netxen_nic
alias eth5 netxen_nic
alias eth6 netxen_nic
alias eth7 netxen_nic
alias scsi_hostadapter cciss
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 qla2xxx
alias scsi_hostadapter3 usb-storage
options ipv6 disable=1
# configuration updates during build process
alias bond0 bonding
# alias bond1 bonding
# disable ipv6
alias net-pf-10 off
```

lspci entry:

```
04:00.0 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.1 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.2 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
04:00.3 Ethernet controller: NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter (rev 42)
...
04:00.0 0200: 4040:0100 (rev 42)
        Subsystem: 103c:705a
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 170
        Region 0: Memory at a0200000 (64-bit, non-prefetchable) [size=2M]
        Region 4: Memory at a2000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [40] MSI-X: Enable+ Mask- TabSize=64
                Vector table: BAR=0 offset=00090000
                PBA: BAR=0 offset=00090800
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a0] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s <64ns, L1 <1us
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                Device: MaxPayload 128 bytes, MaxReadReq 4096 bytes
                Link: Supported Speed unknown, Width x8, ASPM L0s, Port 0
                Link: Latency L0s <64ns, L1 <1us
                Link: ASPM L0s Enabled RCB 64 bytes CommClk- ExtSynch-
                Link: Speed unknown, Width x8
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 75-73-48-6e-61-46-69-59
```

I also tried active-backup and it did not have any problems. What does the dmidecode information look like?
Is it different on servers that work compared to those that do not?

```
# dmidecode 2.10
SMBIOS 2.6 present.
308 structures occupying 8294 bytes.
Table at 0xBF7FD000.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: Hewlett-Packard
        Version: P65
        Release Date: 02/09/2010
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                5.25"/360 kB floppy services are supported (int 13h)
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Function key-initiated network boot is supported
                Targeted content distribution is supported
        Firmware Revision: 1.5
```

If the BIOS versions are the same on all of them, I would suggest they check their switch configuration. As odd as that seems, I have seen quite a few bonding cases resolved due to switch configuration issues. Bonding can be quite sensitive sometimes.

Unfortunately, the server that doesn't experience the issue is a completely different model (DL380 G6) using different NetXen cards (10-gigabit ones, instead of gigabit). This is the dmidecode of the DL580 G7 that has the issue:

```
# dmidecode 2.10
SMBIOS 2.6 present.
308 structures occupying 8320 bytes.
Table at 0x7F7FD000.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: Hewlett-Packard
        Version: P65
        Release Date: 05/07/2010
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                5.25"/360 kB floppy services are supported (int 13h)
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Function key-initiated network boot is supported
                Targeted content distribution is supported
        Firmware Revision: 1.5
```

I'm going to request that they upgrade the BIOS of the server and try again, but according to HP, the new one only adds support for the latest Xeon processors... Let's try.

I'm not sure a BIOS update is the best thing, as I'm running an *OLDER* BIOS than they are. Right now I'm short on suggestions other than double-and-triple checking the switches. Though it seems unlikely that this is the case, based on the fact that there are 3 reports of this on RHEL 5.5, it might be something to consider.

I've also heard reports that performing a 'service network restart' will make bonding work. If that is the case for anyone, I would also encourage them to try a simple:

```
# ifconfig bond0 down ; ifconfig bond0 up
```

and

```
# ifdown bond0 ; ifup bond0
```

and see if the device functions properly after that. Reports that either one of those do or do not work will help narrow down the area of code where we can look for problems.
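A sketch of that triage as a single script; this is a hypothetical helper, not from the thread, and the probe target is illustrative (it should be a reachable host on bond0's subnet):

```
#!/bin/sh
# Probe bond0, then try each recovery method and report the first that works.
TARGET=192.168.1.1   # illustrative: a reachable host on bond0's subnet

alive() { ping -c 3 -W 2 "$TARGET" >/dev/null 2>&1; }

alive && { echo "bond0 already passing traffic"; exit 0; }

ifconfig bond0 down; ifconfig bond0 up; sleep 5
alive && { echo "recovered by: ifconfig bond0 down/up"; exit 0; }

ifdown bond0; ifup bond0; sleep 5
alive && { echo "recovered by: ifdown bond0 / ifup bond0"; exit 0; }

service network restart; sleep 5
alive && { echo "recovered by: service network restart"; exit 0; }

echo "bond0 still not passing traffic"
exit 1
```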
When the customer replaces the RHEL-provided GPL netxen_nic driver with the commercial QLogic/HP nx_nic one, bonding works like a charm. The customer has tried several bonding configurations; none that involves a NetXen NIC works with the netxen_nic driver, and they always work with the nx_nic one.

(In reply to comment #11)
> When the customer replaces the RHEL provided GPL netxen_nic driver by the
> QLogic/HP, commercial nx_nic one, bonding works like a charm.
>
> Customer has tried several bonding configurations, none that involves a Netxen
> nic works with the netxen_nic driver, and they always work with the nx_nic one.

Did they even try this from comment #10?

I've also heard reports that performing a 'service network restart' will make bonding work. If that is the case for anyone, I would also encourage them to try a simple:

```
# ifconfig bond0 down ; ifconfig bond0 up
```

and

```
# ifdown bond0 ; ifup bond0
```

Depending on which one of those procedures works, I may be able to come up with a way to make the GPL driver work. Otherwise we are at the mercy of HP to post those changes upstream so they can be used in our driver as well.

Andy, I managed to set up a reproducer. hp-dl580g7-01.lab.bos.redhat.com has been set up with eth0 as the only member of a simple mode 1 bond. eth0 is the only NIC connected on that box, but that setup is enough to reproduce the issue. When the system boots up there's no network connectivity. A simple /etc/init.d/network restart restores the network connectivity.

I forgot to mention that you'll need to use console access for logging into the server.

(In reply to comment #13)
> Andy, I managed to setup a reproducer.
> hp-dl580g7-01.lab.bos.redhat.com has been setup with eth0 as the only member of
> a simple, mode 1 bonding. eth0 is the only nic connected on that box, but that
> setup is enough to reproduce the issue.
>
> When the system boots up there's no network connectivity. A simple
> /etc/init.d/network restart restores the network connectivity.

OK, I will take a look right now.

This is interesting. I'm quite sure I used DHCP for my test, but using a static IP this fails as described.

More relevant info: entering the commands

```
# ifconfig eth0 down && ifconfig eth0 up
```

puts the interface in a state where it will start receiving frames. My guess is that NAPI is involved here and not all MSI-X queues are started right away. I'm guessing that initscripts or dhclient does an ifdown/ifup at some point, and that is why I didn't see this there.

As an aside: I also do not have to reboot to reproduce the failure. A simple:

```
# service network stop && rmmod netxen_nic && service network start
```

will also reproduce the problem once the NIC is working again.

After some code inspection, and after looking at ethtool stats before and after sending some ping floods:

```
[root@hp-dl580g7-01 ~]# ethtool -S eth0
NIC statistics:
     xmit_called: 100
     xmit_finished: 100
     rx_dropped: 0
     tx_dropped: 0
     csummed: 1
     rx_pkts: 3
     lro_pkts: 0
     rx_bytes: 231
     tx_bytes: 8910
[root@hp-dl580g7-01 ~]# ping -f 10.16.47.254
PING 10.16.47.254 (10.16.47.254) 56(84) bytes of data.
..............................................
--- 10.16.47.254 ping statistics ---
46 packets transmitted, 0 received, 100% packet loss, time 830ms
[root@hp-dl580g7-01 ~]# ethtool -S eth0
NIC statistics:
     xmit_called: 103
     xmit_finished: 103
     rx_dropped: 0
     tx_dropped: 0
     csummed: 1
     rx_pkts: 3
     lro_pkts: 0
     rx_bytes: 231
     tx_bytes: 9036
```

it appears this is probably not a NAPI issue, and more likely an issue with the hardware initialization on probe + open vs. probe + open + close + open.

Here is the configuration on the box that could reproduce the bonding failure on a fresh boot, or the first time after the module was loaded:

```
[root@hp-dl580g7-01 ~]# more /etc/sysconfig/network-scripts/ifcfg-bond0
# NetXen Incorporated NX3031 Multifunction 1/10-Gigabit Server Adapter
DEVICE=bond0
BOOTPROTO=none
IPADDR=10.16.42.162
NETMASK=255.255.248.0
USERCTL=no
ONBOOT=yes
BONDING_OPTS="miimon=100 mode=1"
[root@hp-dl580g7-01 ~]# more /etc/modprobe.conf
alias eth0 netxen_nic
alias eth1 netxen_nic
alias eth2 netxen_nic
alias eth3 netxen_nic
alias scsi_hostadapter cciss
alias scsi_hostadapter1 ata_piix
alias bond0 bonding
```

Interestingly, when removing bonding from the configuration, the system works just fine. Something that is done in the init process for bonding must be the cause of the device failure.

I verified that MSI-X has nothing to do with this issue by booting with pci=nomsi and still seeing the failure.

We have a newer netxen_nic (4.0.73) that fixes a similar issue. Would you like to give it a try?

Ameen, is there a specific patch from upstream that may have resolved this? If so, you can feel free to post just the full SHA1 object value or a URL to the link in Linus' tree at http://git.kernel.org/linus and we can take a look at it.

Andy, here you go:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d49c9640975355c79f346869831bf9780d185de0

Andy, another customer has this issue, and our GPS on-site engineer verified that it worked well with the kernel patched with commit d49c9640975355c79f346869831bf9780d185de0. They also found that running "ifdown bond0 ; ifup bond0" manually after boot made the network work.

Thanks, Ameen, for the patch, and Mark for the positive test feedback. I was quite sure this was somehow related to the multicast list in the hardware (the usual culprit when promisc mode works but standard mode does not). I will work to see if I can get this added to the next RHEL5 update.

Looks like this patch was included in the RHEL 5.6 update planned for bug 562937. Closing this as a duplicate.

*** This bug has been marked as a duplicate of bug 562937 ***

Andy, thanks for the update. Does this have to wait until RHEL 5.6? We had the exact issue this weekend when trying to configure bonding on a DL580 G7. After a few lost hours we found this bug. We ended up having to add additional network cards to the server to meet our objective over the weekend. Seeing that this was identified in July and a known fix was available in August does not make me a happy customer, particularly when you tell me I have to wait until at least January of 2011.

(In reply to comment #33)
> Does this have to wait until RHEL 5.6? We had the exact issue this weekend
> when trying to configure bonding on a DL580 G7. After a few lost hours we
> found this bug. We ended up having to add additional network cards to the
> server to meet our objective over the weekend. Seeing that this was identified
> in July and a known fix was available in August does not make me a happy
> customer; particularly when you tell me I have to wait until at least January
> of 2011.

Greg, the wheels are already in motion to try and have this resolved in a 5.5 errata kernel before 5.6 ships.

I'm having the same issue with a ProLiant DL585 G7 and Red Hat Linux 5.5, kernel 2.6.18-194.26.1.el5 #1 SMP. The only way it works is by restarting the network service after boot, or unloading and reloading the bonding module, or ifdown/ifup bond0.

Ivan, if you are using a netxen-based card, this should be fixed in 2.6.18-194.27.1.el5 and in the kernel that ships with RHEL 5.6. Please let us know if those kernels or later do not resolve the issue.

I hate to be the one who brings the bad news, but this issue is back in RHEL 5.7, specifically the latest kernel 2.6.18-274.7.1.el5. My setup is as follows: I have 2 x DL585 G7 with 4 NetXen NICs on each.

eth0 and eth1 = bond0 (mode=1 miimon=100)
eth2 and eth3 = bond1 (mode=1 miimon=100)

eth0/eth1 are connected to a Juniper switch and function normally. eth2/eth3 on host1 are connected to eth2/eth3 on host2 (crossover cable). With this setup, bonding is not stable and exhibits the same issues noted in this BZ. "ifdown bond1 && ifup bond1" addresses the issue sometimes, but not always.

(In reply to comment #40)
> I hate to be the one who brings the bad news, but this issue is back in
> RHEL 5.7, specifically the latest kernel 2.6.18-274.7.1.el5.
>
> My setup is as follows: I have 2 x DL585 G7 with 4 NetXen NICs on each.
>
> eth0 and eth1 = bond0 (mode=1 miimon=100)
> eth2 and eth3 = bond1 (mode=1 miimon=100)
>
> eth0/eth1 are connected to a Juniper switch and function normally.
> eth2/eth3 on host1 are connected to eth2/eth3 on host2 (crossover cable).
>
> With this setup, bonding is not stable and exhibits the same issues noted
> in this BZ. "ifdown bond1 && ifup bond1" addresses the issue sometimes,
> but not always.

Are you saying that bonding only works in promisc mode, or not at all?
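One quick way to answer that question on an affected box, not shown in the thread, is to watch whether traffic flows only while something holds the interface in promiscuous mode; tcpdump does that by default while it runs. A minimal sketch (addresses are illustrative):

```
# On the affected host: watch for inbound ICMP on the bond.
# tcpdump puts the interface into promiscuous mode while it runs.
tcpdump -i bond0 -n icmp

# From another host on the same subnet: ping the bond0 address.
# If pings succeed only while tcpdump is running, the bond works only in
# promisc mode (an RX MAC/multicast filter problem, consistent with the fix
# above); if they never succeed, the interface is not passing traffic at all.
```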