Red Hat Bugzilla – Bug 487763
Adding bonding in balance-alb mode to bridge causes host network connectivity to be lost
Last modified: 2016-04-26 10:52:03 EDT
Description of problem:
Suppose we have two network interfaces.
The first one is bonding in balance-alb mode (bond0), the second is bridge interface (br0).
Adding the bond0 interface to the bridge causes network connectivity to be lost.
Version-Release number of selected component (if applicable):
# uname -a
Linux host 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:59:00 EST 2009 i686 athlon i386 GNU/Linux
Steps to Reproduce:
# cat /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
# modprobe bond0
# ifconfig bond0 up
# ifenslave bond0 eth1
# ifenslave bond0 eth0
# brctl addif br0 bond0
# brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.0030485bab35       no              bond0
Then assign an IP address to br0 by whatever means is convenient.
# ping a.b.c.d
PING a.b.c.d (a.b.c.d) 56(84) bytes of data.
64 bytes from a.b.c.d: icmp_seq=1 ttl=64 time=5.53 ms
64 bytes from a.b.c.d: icmp_seq=2 ttl=64 time=0.223 ms
64 bytes from a.b.c.d: icmp_seq=3 ttl=64 time=0.222 ms
64 bytes from a.b.c.d: icmp_seq=4 ttl=64 time=0.237 ms
64 bytes from a.b.c.d: icmp_seq=5 ttl=64 time=0.240 ms
64 bytes from a.b.c.d: icmp_seq=6 ttl=64 time=0.223 ms
64 bytes from a.b.c.d: icmp_seq=49 ttl=64 time=0.223 ms
64 bytes from a.b.c.d: icmp_seq=50 ttl=64 time=0.227 ms
64 bytes from a.b.c.d: icmp_seq=51 ttl=64 time=0.219 ms
64 bytes from a.b.c.d: icmp_seq=52 ttl=64 time=0.260 ms
64 bytes from a.b.c.d: icmp_seq=53 ttl=64 time=0.235 ms
64 bytes from a.b.c.d: icmp_seq=54 ttl=64 time=0.240 ms
64 bytes from a.b.c.d: icmp_seq=89 ttl=64 time=0.222 ms
64 bytes from a.b.c.d: icmp_seq=90 ttl=64 time=0.232 ms
64 bytes from a.b.c.d: icmp_seq=91 ttl=64 time=0.237 ms
64 bytes from a.b.c.d: icmp_seq=92 ttl=64 time=0.229 ms
64 bytes from a.b.c.d: icmp_seq=93 ttl=64 time=0.233 ms
64 bytes from a.b.c.d: icmp_seq=94 ttl=64 time=0.224 ms
--- a.b.c.d ping statistics ---
99 packets transmitted, 18 received, 81% packet loss, time 97994ms
rtt min/avg/max/mdev = 0.219/0.525/5.534/1.215 ms
I've investigated this issue and here are my results:
- Adding an interface to the bridge makes the bridge code add the
interface's MAC address to the forwarding database (fdb) and mark that entry
as "local".
- While handling an ingress packet, the bridge code looks up the destination MAC in the fdb and checks whether the matching entry is marked "local". If it is, the packet is treated as local; if not, the packet is forwarded or flooded.
- When the bonding interface is in balance-alb mode, an ingress packet may have
a destination MAC address equal to the MAC of any of the physical interfaces in the bond. But the bridge code marked only one of those MAC
addresses as "local", so a packet addressed to an unmarked MAC is not
considered local.
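The behaviour described above can be sketched with a small model. This is purely illustrative Python, not the kernel code; the `Fdb` class and the MAC addresses are assumptions chosen to mirror the thread's scenario:

```python
# Illustrative model (NOT the kernel bridge code): how the bridge
# forwarding database decides whether an ingress frame is "local".

class Fdb:
    def __init__(self):
        self.entries = {}  # mac -> is_local

    def add_local(self, mac):
        # Adding a port marks that port's MAC as a "local" fdb entry.
        self.entries[mac] = True

    def classify(self, dst_mac):
        """Return how the bridge treats a frame with this destination MAC."""
        if self.entries.get(dst_mac):
            return "local"         # delivered to the host's own stack
        return "forward/flood"     # sent out other bridge ports instead

# bond0 in balance-alb: the bond's MAC equals one slave's MAC, but
# replies may arrive addressed to the other slave's MAC because of
# receive load balancing. Example (made-up) addresses:
BOND_MAC = "00:30:48:c6:a0:e8"   # eth0 / bond0 (the only MAC marked local)
SLAVE_MAC = "00:30:48:c6:a0:e9"  # eth1 (never marked local -> the bug)

fdb = Fdb()
fdb.add_local(BOND_MAC)

print(fdb.classify(BOND_MAC))    # -> local
print(fdb.classify(SLAVE_MAC))   # -> forward/flood (host never sees it)
```

This matches the intermittent ping loss above: only replies that happen to be addressed to the bond's own MAC reach the host.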
Created attachment 333526 [details]
look through all interfaces in bond
Maybe this patch is not a solution or a workaround, but rather an additional illustration of what I've said above. It is meant to make my point easier to understand.
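The idea of the attached patch ("look through all interfaces in bond") can be modelled roughly as follows. This is a sketch of the concept only; the function name and data structures are assumptions, not the patch's actual code:

```python
# Rough model of the idea behind attachment 333526: when deciding
# whether a destination MAC is local, also accept the MAC of any
# slave of a bond that is enslaved to the bridge.

def is_local(dst_mac, bridge_local_macs, bond_slave_macs):
    # Original behaviour: only MACs the bridge marked local match.
    if dst_mac in bridge_local_macs:
        return True
    # Patched behaviour: any slave MAC of the attached bond also counts.
    return dst_mac in bond_slave_macs

bridge_local = {"00:30:48:c6:a0:e8"}                  # bond0's MAC
slaves = {"00:30:48:c6:a0:e8", "00:30:48:c6:a0:e9"}   # eth0, eth1 (made-up)

# A frame addressed to the second slave's MAC is now treated as local:
print(is_local("00:30:48:c6:a0:e9", bridge_local, slaves))  # -> True
```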
I have the exact same problem. If the MAC address in use is not the primary interface's, connectivity breaks. In the example below, ping fails; the arp cache shows that a secondary MAC is being used. After deleting the arp entry and adding a static entry with the primary interface's MAC, ping works.
This is a big problem for us as alb has proved a lifesaver for us in a fairly complicated environment. We need to deploy KVM but this bridging/bonding issue is a big problem. Is there any workaround? Thanks - John
jsullivan@jaspav:~$ ping 172.30.10.22
PING 172.30.10.22 (172.30.10.22) 56(84) bytes of data.
--- 172.30.10.22 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3000ms
jsullivan@jaspav:~$ arp -n
Address        HWtype  HWaddress          Flags Mask  Iface
172.30.10.64   ether   00:30:48:C6:91:2C  C           eth0
172.30.10.22   ether   00:30:48:C6:A0:E9  C           eth0
172.30.10.9    ether   00:1F:FE:49:3D:E0  C           eth0
172.30.10.8    ether   00:1F:FE:49:69:A0  C           eth0
192.168.223.1  ether   00:D0:CF:04:05:BE  C           wlan0
172.30.10.1    ether   00:15:17:90:3D:7E  C           eth0
jsullivan@jaspav:~$ sudo arp -d 172.30.10.22
jsullivan@jaspav:~$ sudo arp -s 172.30.10.22 00:30:48:C6:A0:E8
jsullivan@jaspav:~$ ping 172.30.10.22
PING 172.30.10.22 (172.30.10.22) 56(84) bytes of data.
64 bytes from 172.30.10.22: icmp_seq=1 ttl=64 time=0.133 ms
64 bytes from 172.30.10.22: icmp_seq=2 ttl=64 time=0.122 ms
64 bytes from 172.30.10.22: icmp_seq=3 ttl=64 time=0.128 ms
--- 172.30.10.22 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.122/0.127/0.133/0.013 ms
I've just tested that rhel5.3 behaviour is the same as current upstream (2.6.29-rc6)
My apologies for my ignorance but does that mean it is the same as 2.6.18 or that it has been fixed? Thanks - John
(In reply to comment #5)
> My apologies for my ignorance but does that mean it is the same as 2.6.18 or
> that it has been fixed? Thanks - John
Yes John, it is the same. I'm currently working on a solution that I want to push upstream.
draft of patch sent for comments:
I'm currently building rhel5.4 kernel with its backport for testing.
Created attachment 335216 [details]
first draft of the patch - kabi breaking
This patch solves the issue but breaks kABI (which doesn't pose a problem for now...). I'm now considering another approach that might be easier and would avoid breaking kABI.
One comment upstream that I found interesting was the idea that bonding should be the one to fix this, since it is the code doing the MAC rewriting. I tend to agree, but that might be too much code in skb_bond_should_drop to run for each frame.
You could probably work around this temporarily by using some ebtables prerouting/input rules to rewrite the MAC addresses from the slave devices to the MAC of the primary interface.
Yes Andy, you are right. I've tested it and it seems to work fine. It needs to be done with the following command:
ebtables -t nat -I PREROUTING -d MACOFSLAVE -j dnat --to-destination MACOFBONDDEVICE
Created attachment 337207 [details]
This is a simple perl script that works around this issue by using ebtables dnat for MAC rewriting.
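A rough re-sketch of what such a workaround script presumably does, based on the ebtables command given above. The function, MAC addresses, and rule shape here are illustrative assumptions; the attached perl script is the authoritative version:

```python
# Illustrative sketch: emit one ebtables dnat rule per bond slave whose
# MAC differs from the bond's MAC, rewriting the destination MAC to the
# bond's MAC (same shape as the command quoted in the thread).

def ebtables_rules(bond_mac, slave_macs):
    rules = []
    for mac in slave_macs:
        if mac.lower() == bond_mac.lower():
            continue  # the bond's own MAC needs no rewriting
        rules.append(
            "ebtables -t nat -I PREROUTING -d %s -j dnat --to-destination %s"
            % (mac, bond_mac)
        )
    return rules

# Example with made-up MACs; on a real host these would be read from
# /sys/class/net/bond0/bonding/slaves and /sys/class/net/<slave>/address.
for rule in ebtables_rules("00:30:48:c6:a0:e8",
                           ["00:30:48:c6:a0:e8", "00:30:48:c6:a0:e9"]):
    print(rule)
```

Note that the rules must be refreshed if slaves are added or removed from the bond.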
The script attached in comment #11 looks like a good one. One thing to note, is that ebtables is unfortunately not included as a base package in RHEL5. It can be easily obtained from EPEL however.
You can either add the EPEL yum repo to your system (I do this on ALL of my RHEL5 systems), or download the ebtables rpm directly:
I can't remember if there are any dependencies, but you can always use the 'yum localinstall' command to install the rpm and have the deps satisfied automatically.
My customer is running into this issue, and I have temporarily had to split up their bond and pass multiple interfaces to their VMs instead.
Any packages not shipped with RHEL are not allowed by this customer (Common Criteria and other package approval forms needed), so the ebtables workaround is probably not an option.
Is there an actual kernel patch in works for this as I saw for 4.9?
The only other possible workaround is to have them use Mode 0 for their bond if mode 6 is the only one that has this problem...
Well, I missed the patch posted in the comment right above mine, so ignore that :p I'm assuming it's too late for this to be considered for 5.4?
Can anybody comment on that? As far as I can see, there were patches for 2.6.30.
Does that mean this problem is probably solved in 2.6.30?
(In reply to comment #16)
> Can anybody comment on that. As far as I can see, there were patches for
> Does that means that this problem is probably solved in 2.6.30 ?
Yes, it is. Patches have also been posted for both rhel5 and rhel4.
> Thank you.
Thank you, Jiri, for the comment.
Can you please also specify which rhel5 kernel version contains these patches?
Thank you once again.
Thanks. Already found 2.6.18-128.1.10.el5
(In reply to comment #19)
> Thanks. Already found 2.6.18-128.1.10.el5
I doubt that. This wasn't even proposed for z-stream. The patch is in the queue for rhel5.5 so it's not in any rhel kernel atm.
> Thank you.
Any news about your patch. Maybe you can send me this patch over email, I'll try to test it on our systems. Or https://bugzilla.redhat.com/attachment.cgi?id=335216 is the final version of this patch ?
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
------- Comment From email@example.com 2009-10-12 18:26 EDT-------
> You can download this test kernel from http://people.redhat.com/dzickus/el5
The above kernel fixes the issue.
------- Comment From firstname.lastname@example.org 2009-11-16 14:43 EDT-------
Applied the following patch to the 2.6.18-164.el5 kernel; the problem disappears.
Can this patch be considered for zstream kernel please?
(In reply to comment #29)
> ------- Comment From email@example.com 2009-11-16 14:43 EDT-------
> Applied the following patch to 2.6.18-164.el5 kernel, The problem disappears.
> Can this patch be considered for zstream kernel please?
Hi... In conversation with one of the kernel developers (agospoda) the following was mentioned:
"Jiri's patch will resolve the case where a bridge simply contains a
balance-alb bond device. Since the bridge interface will only send and
receive traffic based on its own MAC address (which should be the same
as the bond), there will be no issues with the hosts connectivity
through the bridged device.
Problems arise when one uses balance-alb in a bridge and expects it
to properly forward the traffic from another host that is connected to
another port on the host that is included in the bridge. That is really
what 506770 is about. The 'receive load balancing' part of balance-alb
(which is really what differentiates it from balance-tlb) is the problem
and one of the main reasons we suggest using balance-tlb instead of
balance-alb when the host is using bonding.
As you can hopefully see the patch in bug 487763 does fix a problem that
customers complain about, but I do not feel it is wise to place a
balance-alb mode bonding interface inside a bridge as the code is right now."
So it would be good to know from IBM what is expected from this fix. I know IBM has tested Jiri's patch and it seems to resolve the issue you currently have, but in the context of Andy's comments above, is it solving enough of your bonding issues to be useful?
------- Comment From firstname.lastname@example.org 2009-11-17 14:20 EDT-------
Our use case does not require adding any other port to the bridge to which the alb bond is attached.
Hence I am OK with this patch, since it should not affect us. But without this patch or some other solution, we will be handicapped.
(In reply to comment #32)
> ------- Comment From email@example.com 2009-11-17 14:20 EDT-------
> our use case does not require us to add one other port to the bridge to which
> the alb-bond is associated.
> Hence I am ok with this patch since it should not effect us. But without this
> patch or any other solution, we will be handicapped.
Can I ask why you are putting the bond in a bridge if there are no other ports in it? I cannot come up with a compelling technical reason (spanning-tree usage doesn't seem that critical), so there must be something I'm missing.
------- Comment From firstname.lastname@example.org 2009-11-18 11:10 EDT-------
the bond has two ports. Each port is connected through a different switch. So this gives redundancy if any one port fails. Now the bond is in the bridge because the guest VMs use the bridge to reach the outside world.
(In reply to comment #34)
> ------- Comment From email@example.com 2009-11-18 11:10 EDT-------
> the bond has two ports. Each port is connected through a different switch. So
> this gives redundancy if any one port fails. Now the bond is in the bridge
> because the guest VMs use the bridge to reach the outside world.
The 'virt case' is exactly the problem that I was trying to address with bug 506770. I observed connectivity problems between guests and external hosts if any host on the network did not send traffic for longer than the forwarding-database age time on the switch, or as soon as the guest sent broadcast traffic. A simple 'arping -b <remote ip>' from the guest could break it. This is a problem with balance-alb and balance-rr.
In a multi-switch configuration this might not be as much of an issue, especially if, under normal circumstances, the switches are in different broadcast domains or traffic somehow doesn't flow between the two switches, causing forwarding-database moves.
Proposing for 5.4.z because this solves the customer's problem with bonding + VMs.
------- Comment From firstname.lastname@example.org 2010-02-01 03:08 EDT-------
Will this bug be fixed in 5.4.z or 5.5 ?? What is the status of this bug?
(In reply to comment #37)
> ------- Comment From email@example.com 2010-02-01 03:08 EDT-------
> Will this bug be fixed in 5.4.z or 5.5 ?? What is the status of this bug?
Yes, the fix is included in the 5.5 kernel. Whether it will also be in 5.4.z has not been determined yet.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.