Bug 584872
| Summary: | NIC bonding arp monitoring method doesn't work when a bond is added to a bridge | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Tom <xiaohm> |
| Component: | kernel | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 5.4 | CC: | agospoda, bigsow, david, fleitner, jcavallaro, jinzishuai, m.vandelande, pamadio, peterm, purpleidea, wmealing, wu_chulin, xen-maint |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-27 18:25:43 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Tom
2010-04-22 15:47:20 UTC
I missed the following two steps in how to reproduce the problem:
a) append "alias bond0 bonding" at the end of /etc/modprobe.conf
b) add "GATEWAYDEV=bridge1" at the end of /etc/sysconfig/network

Patches were just posted to netdev that may resolve this: http://permalink.gmane.org/gmane.linux.network/159403

Andy, thanks for providing the patch in such a short time. I haven't had time to verify it yet. Does Red Hat have any plan to provide an updated bonding.ko or an updated kernel that includes this fix? I can build this kernel module myself and apply it; however, I am wondering whether that is a proper way of applying this kind of kernel patch in a production environment. Regards, Hongming

Hongming (Tom), right now we do not have immediate plans to include this (but I suspect we will if enough people demand it). Since we *just* released RHEL 5.5, RHEL 5.6 would be the earliest a backported version of the patch in comment #2 would be included.

Hi Andy, I tried to apply your patch and saw the following error:

[hongming@dom0 linux-2.6.18.x86_64]$ patch -p1 < ~/rpmbuild/SPECS/arp.patch
patching file drivers/net/bonding/bond_main.c
Hunk #1 succeeded at 2000 (offset 60 lines).
patch: **** malformed patch at line 20: net_device *orig_dev)

where lines 19 and 20 of the patch file are:

#line 19: -static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct
#line 20: net_device *orig_dev)

I am guessing the patch file was reformatted by the HTML page, which caused this single line to be split into lines 19 and 20. To avoid this problem, can you attach the patch file directly here? Thanks, Hongming

Andy, see my previous message. I think the patch you provided simply cannot be applied to Red Hat 5.4 directly. Can you confirm? Regards, Hongming

Tom, you are correct. The patch cannot be directly applied to RHEL 5.4. Someone (most likely me) will have to do the work to add this feature to RHEL 5.6. I do not think this feature will be added to RHEL 5.4 or RHEL 5.5.
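A quick way to spot this kind of HTML-mangled diff before running `patch` is to look for body lines that do not start with a legal unified-diff prefix. This is a rough sketch, not part of the original thread; the sample below is a made-up miniature hunk (the hunk header is invented) that reproduces the wrapped `bond_arp_rcv` line quoted in the comment:

```shell
#!/bin/sh
# Detect lines in a unified diff that were probably re-wrapped by an HTML
# viewer: every body line should start with '+', '-', a space, '@@ ',
# '--- ', '+++ ', 'diff ', or 'index '. Anything else is suspicious.
f=$(mktemp)
cat > "$f" <<'EOF'
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2000,2 +2000,2 @@
-static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct
net_device *orig_dev)
+static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev)
EOF
# Count lines that match none of the legal prefixes (the wrapped
# "net_device *orig_dev)" line is the one that gets flagged here).
bad=$(grep -cvE '^(--- |\+\+\+ |@@ |[-+ ]|diff |index )' "$f")
echo "suspicious (wrapped?) lines: $bad"
rm -f "$f"
```

This is only a heuristic (genuinely empty lines in a diff would also be flagged), but a nonzero count is a good hint to re-download the patch as a raw attachment before applying it.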
Andy, thanks for the information. What can I do to ensure that this feature is really scheduled for 5.6? I want to verify whether the patch you provided really solves the problem or not. What should I do? Download the latest Fedora source and apply your patch? If that is not the way, could you give some instructions? Regards, Hongming

Hi there, I am having similar problems on RHEL 6 with bridging+bonding, although in my case I am running KVM instead of Xen. Basically the ARP table gets screwed up and I cannot ping the VM IP address from other machines. An ARP request for the VM IP actually returns the MAC address of eth0 on the physical host instead of the VM's MAC address.

(In reply to comment #13)
> Hi there,
>
> I am having similar problems on RHEL 6 with bridging+bonding, although in my
> case I am running KVM instead of Xen.
> Basically the ARP table gets screwed up and I cannot ping the VM IP address
> from other machines. An ARP request for the VM IP actually returns the MAC
> address of eth0 on the physical host instead of the VM's MAC address.

Active-backup with ARP monitoring on a bond placed in a bridge will simply not work. I would suggest disabling ARP monitoring or using 802.3ad (mode 4) bonding. Any mode where the switch might broadcast frames to an inactive link is one that could cause problems and will ruin the forwarding database in the kernel's bridge.

Thank you Andy.

I was using:
==> ifcfg-bond0 <==
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BRIDGE=br0
BONDING_OPTS="mode=6 miimon=100"

This is not active-backup mode but balance-alb (mode 6). Do I still need to disable ARP monitoring? I think the default is arp_interval=0 and it only applies to active-backup mode, right? Anyway, the real question is: can I ever put a bridge on a bond in mode 6, and if so, how should I do it? Thank you very much.

(In reply to comment #15)
> Thank you Andy.
>
> I was using:
> ==> ifcfg-bond0 <==
> DEVICE=bond0
> ONBOOT=yes
> BOOTPROTO=none
> USERCTL=no
> BRIDGE=br0
> BONDING_OPTS="mode=6 miimon=100"
>
> This is not active-backup mode but balance-alb (mode 6). Do I still need
> to disable ARP monitoring? I think the default is arp_interval=0 and it only
> applies to active-backup mode, right?
> Anyway, the real question is: can I ever put a bridge on a bond in mode 6,
> and if so, how should I do it? Thank you very much.

Sorry, I assumed you were using active-backup with ARP monitoring, since this bug was trying to address those issues. I have seen reports of mode 5 (balance-tlb) and mode 4 (802.3ad) working well in a bridge. Mode 6 is not a good solution because of the ARP frames that it sends trying to direct and balance traffic.

Thank you. I can confirm that both mode 1 and mode 5 work with a bridge:
* mode 1: BONDING_OPTS="mode=1 primary=eth0 miimon=100"
* mode 5: BONDING_OPTS="mode=5 miimon=100"
Mode 6 indeed does not work with bridging due to the ARP problem.

However, I do see about a 30-second delay at link failure before the connection resumes in mode 5. There is no obvious delay in mode 1. Is this the expected behavior? Thanks a lot.

(In reply to comment #17)
> Thank you.
> I can confirm that both mode 1 and mode 5 work with a bridge:
> * mode 1: BONDING_OPTS="mode=1 primary=eth0 miimon=100"
> * mode 5: BONDING_OPTS="mode=5 miimon=100"
> Mode 6 indeed does not work with bridging due to the ARP problem.
>
> However, I do see about a 30-second delay at link failure before the
> connection resumes in mode 5. There is no obvious delay in mode 1. Is this
> the expected behavior?
>
> Thanks a lot.

The 30-second delay sounds a lot like spanning tree. I would guess you tested this with the same switch and host, so I suspect that is not the case.
My guess is that 30 seconds is the timeout for the forwarding database in the kernel's bridge, and, due to the way mode 5 transmits on all interfaces, you are still getting a short period where the forwarding database is wrong. Can you compare the output of:

brctl showmacs br0

(or similar) when the system is working, after the failover when it does not have connectivity, and after 30 seconds when it works again? Output of:

brctl show

at any time would also be helpful.

My interpretation of your words is that the patch given in comment #2 either wasn't backported to 6.0 or doesn't work as expected after being backported to 6.0. Can you confirm?

(In reply to comment #14)
> (In reply to comment #13)
> > Hi there,
> >
> > I am having similar problems on RHEL 6 with bridging+bonding, although in
> > my case I am running KVM instead of Xen.
> > Basically the ARP table gets screwed up and I cannot ping the VM IP address
> > from other machines. An ARP request for the VM IP actually returns the MAC
> > address of eth0 on the physical host instead of the VM's MAC address.
>
> Active-backup with ARP monitoring on a bond placed in a bridge will simply
> not work. I would suggest disabling ARP monitoring or using 802.3ad (mode 4)
> bonding.
> Any mode where the switch might broadcast frames to an inactive link is one
> that could cause problems and will ruin the forwarding database in the
> kernel's bridge.

What does your backport look like? You uploaded an srpm to brew without a cvs or git reference, so I can't see it, and the upstream patch from comment 2 is completely un-appliable to RHEL5 without major modification. Please attach the patch here.
I'm tempted to say we should just close this as a WONTFIX, given that the working upstream solution is way too invasive to take into RHEL5 at this late stage in the lifecycle, and anything else is more or less just a hack (coupled with the fact that bridging + bonding has non-fixable problems in other operational modes). But please attach the patch here; maybe it is contained and safe enough that we can take it.

(In reply to comment #21)
> What does your backport look like? You uploaded an srpm to brew without a
> cvs or git reference, so I can't see it, and the upstream patch from comment
> 2 is completely un-appliable to RHEL5 without major modification. Please
> attach the patch here.

Hi Neil, indeed, I already asked wmealing (the backport author) to attach the patch here. He is in the APAC timezone afaik, so it might take a while for him to attach it. fbl

Ok, please don't clear the needinfo flag when updating a bz without the needed info. Thank you.

Thank you Wade. I'm hesitant to introduce that changeset this late in RHEL5's life cycle, especially since it introduces changes to the common receive path. If the customer is the only one that's tested it, and only to confirm that the one problem is fixed, I'm really not comfortable with the change unless it gets lots more testing. What would be even better is if we could just convince the customer to use 802.3ad mode, so that they won't have this problem at all. Is that a possibility?

I think I'm experiencing this bug on the latest 6.2. Can someone confirm whether this should be fixed? I don't have all the information to fully figure out if this is what is occurring. If so, could someone bump this to 6.2 and high severity, as this is a big regression. Thanks, James

On 6.3, bonding mode 1 with the following settings doesn't work:

BONDING_OPTS="mode=1 arp_interval=100 arp_validate=all arp_ip_target=172.16.117.10,172.16.117.11,172.16.117.20,172.16.117.21"

In my case the bonding interface is not connected to a bridge.
Best regards, Maurits

(In reply to comment #34)
> On 6.3, bonding mode 1 with the following settings doesn't work:
>
> BONDING_OPTS="mode=1 arp_interval=100 arp_validate=all
> arp_ip_target=172.16.117.10,172.16.117.11,172.16.117.20,172.16.117.21"
>
> In my case the bonding interface is not connected to a bridge.
>
> Best regards,
>
> Maurits

Without any additional information there is no way anyone can really help out. I'm also not sure this is the best place to ask for support, since you are not using bridging. I would suggest opening a new bug to address this. Just as a note, ARP monitoring does work on 6.3, but it behaves differently from 6.2, as the hosts used for monitoring must be on the same subnet as the bond device.

I've been able to reproduce active-backup bonding failing whenever a bonding interface is added to a bridge AND arp_validate is set to
- active(1): the current active slave interface starts flapping up and down
- all (3): the current active slave interface goes down and stays down
Situation:

```
     HOST1                                 HOST2
+---------------+                     +---------------+
| eth1 \        +-----[ switch1 ]-----+ / eth2        |
|       --bond0 |                     | bond0--       |
| eth0 /        +-----[ switch2 ]-----+ \ eth0        |
+---------------+                     +---------------+
```
Running 6.3 with kernel 2.6.32-279.2.1
How to reproduce:
1) Configure 2 hosts with bond0 interfaces, mode=active-backup(1), each other as arp_ip_target, arp_validate=0
2) This setup works; according to /proc/net/bonding/bond0 *), on host2 eth0 is down and eth2 is up. Ping works. Fine.
3) Set arp_validate=3. This setup works.
4) Set arp_validate=0, and (on host2) add bond0 to a bridge
[root@host2 ~]# echo 0 > /sys/class/net/bond0/bonding/arp_validate
[root@host2 ~]# ifconfig bond0 0.0.0.0
[root@host2 ~]# brctl addbr br0
[root@host2 ~]# brctl addif br0 bond0
[root@host2 ~]# ifconfig br0 10.254.239.200/24
This setup still works
5) Now set arp_validate=3 on bond0:
[root@host2 ~]# echo 3 > /sys/class/net/bond0/bonding/arp_validate
6) /var/log/messages reports:
Aug 10 11:02:34 brug01 kernel: bonding: bond0: setting arp_validate to all (3).
Aug 10 11:02:34 brug01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Aug 10 11:02:34 brug01 kernel: device eth0 left promiscuous mode
Aug 10 11:02:34 brug01 kernel: bonding: bond0: now running without any active interface !
Aug 10 11:02:34 brug01 kernel: br0: port 1(bond0) entering disabled state
7) This setup has stopped working.
/proc/net/bonding/bond0 **) says both eth0 and eth2 are down.
8) With 2 UTP cables instead of switches this behaviour remains.
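The failure in steps 5 to 7 can be guarded against before enabling arp_validate. The sketch below is hypothetical (the `check_bond` helper and the `$ROOT` indirection are mine, used so the sysfs layout can be simulated for testing; on a real host you would run it with `ROOT=`, i.e. against the real `/sys`):

```shell
#!/bin/sh
# Warn when a bond is enslaved to a bridge while arp_validate is enabled,
# the combination shown above to take all slaves down in active-backup mode.
# For testing, build a fake sysfs tree under a temp dir instead of /sys.
ROOT=${ROOT:-$(mktemp -d)}
mkdir -p "$ROOT/sys/class/net/bond0/bonding" \
         "$ROOT/sys/class/net/br0/brif/bond0"
echo "all 3" > "$ROOT/sys/class/net/bond0/bonding/arp_validate"

check_bond() {
    bond=$1
    # sysfs reports e.g. "all 3"; keep only the numeric value.
    val=$(awk '{print $NF}' "$ROOT/sys/class/net/$bond/bonding/arp_validate")
    in_bridge=no
    # A bridge lists its ports under /sys/class/net/<bridge>/brif/<port>.
    for d in "$ROOT"/sys/class/net/*/brif/"$bond"; do
        [ -e "$d" ] && in_bridge=yes
    done
    if [ "$in_bridge" = yes ] && [ "$val" != 0 ]; then
        echo "WARNING: $bond is in a bridge with arp_validate=$val"
    fi
}

check_bond bond0
```

On the simulated tree above this prints a warning, since bond0 is a port of br0 and arp_validate is 3.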
--
*) Output from /proc/net/bonding/bond0 with arp_validate=3 in active-backup mode WITHOUT bridge
[root@host2 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.254.240.14
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:95:a4
Slave queue ID: 0
Slave Interface: eth2
MII Status: down
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:30:48:73:b0:24
Slave queue ID: 0
--
**) Output from /proc/net/bonding/bond0 with arp_validate=3 in active-backup mode WITH bridge:
[root@brug01 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: None
MII Status: down
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.254.240.14
Slave Interface: eth0
MII Status: down
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:11:0a:5f:95:a4
Slave queue ID: 0
Slave Interface: eth2
MII Status: down
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:30:48:73:b0:24
Slave queue ID: 0
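A small parser for this /proc/net/bonding output makes the failed state easy to spot from scripts. This is a sketch, not from the thread, using a canned sample of the fields shown above; on a real host you would point `$f` at /proc/net/bonding/bond0 instead:

```shell
#!/bin/sh
# Extract the active slave and any down slaves from bonding status output.
f=$(mktemp)
cat > "$f" <<'EOF'
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: None
MII Status: down
Slave Interface: eth0
MII Status: down
Slave Interface: eth2
MII Status: down
EOF
active=$(awk -F': ' '/^Currently Active Slave/ {print $2}' "$f")
# Remember the last "Slave Interface:" seen; print it when its per-slave
# "MII Status: down" line follows (the first, bond-level MII line is skipped
# because no slave has been seen yet).
down_slaves=$(awk '/^Slave Interface:/ {s=$3}
                   /^MII Status: down/ && s != "" {print s; s=""}' "$f" | tr '\n' ' ')
echo "active slave: $active"
echo "down slaves: $down_slaves"
rm -f "$f"
```

With the sample above, the active slave is None and both eth0 and eth2 are reported down, which is exactly the broken state described in step 7.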
> Just as a note, ARP monitoring does work on 6.3 but will be different from 6.2
> as the hosts used for monitoring must be on the same subnet as the bond device.
Thanks,
I just modified the ifcfg-bond0 file to use IP targets on the same subnet as the bonding interface. It looks like this works.
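For reference, a same-subnet arp_ip_target configuration along those lines might look like the fragment below. The addresses are placeholders, not taken from the thread:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative values only)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.10.5
NETMASK=255.255.255.0
# On 6.3 the arp_ip_target hosts must sit on the same subnet as the bond
# itself (the /24 above), per the note earlier in this bug.
BONDING_OPTS="mode=1 arp_interval=100 arp_ip_target=192.168.10.1,192.168.10.2"
```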
I want to know whether the problem has been solved or not; I am also encountering this problem now.

I think I have a similar problem on RHEL 7.2. I set up bonding + VLAN on a KVM host. My bonding mode is active-backup; when I turn off one of the bonding slave devices, everything is OK, but only on the host. On my KVM guest machines I have connection problems: I lose the connection to them. What is interesting is that I can ping the gateway from my guest machine, but I cannot ping the guest machine from any other host.
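Summing up the advice given in this thread: when the bond sits in a bridge, the suggested configurations avoid ARP monitoring entirely and use MII link monitoring with a mode that tolerates the bridge, such as 802.3ad. A sketch of such a configuration follows; interface names are placeholders, and it assumes LACP is configured on the switch ports:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
BRIDGE=br0
# 802.3ad (mode 4) with MII monitoring instead of arp_interval/arp_validate;
# requires LACP on the corresponding switch ports.
BONDING_OPTS="mode=4 miimon=100 lacp_rate=1"
```

Active-backup (mode 1) with plain miimon, as confirmed working earlier in the thread, is the simpler alternative when the switch cannot do LACP.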