1. Proposed title of this feature request
* Improve network failover times on active-backup bonds in complex networks.

3. What is the nature and description of the request?

Please see the diagram below. The Hypervisor is connected to the network via a bond device, which masters two physical interfaces, each connected to a different switch for maximum availability.

+----+               +----+         +-----------+
|SW A+---------------+SW B+---------+  CLIENT   |
+-+--+               +--+-+         +-----------+
  |                     |
  |                     |
  |                     |
  |                     |
+-+--+               +--+-+
|SW C+---------------+SW D|
+-+--+               +--+-+
  |     MODE 1 BOND     |
  +----------+----------+
             |
+------------+------------+
|        HYPERVISOR       |
+-------------------------+

Considerations:
1. SW A, B, C and D run RSTP.
2. The Hypervisor bridge does not run STP, because it only supports legacy STP, with a minimum convergence time of 2x the Forwarding Time, which is too long for mission-critical applications.

Now consider the following scenario:
1. The link between SW C and SW D is blocked by STP, to avoid loops.
2. The topology for CLIENT <-> VM communication is CLIENT -> SW B -> SW D -> Hypervisor bridge -> VM.
3. The link between the Hypervisor and SW D fails.
4. The Hypervisor bond switches its active slave towards SW C.
5. The STP topology remains unchanged; no TCN is sent by any switch/bridge.
6. CLIENT sends a frame to the VM:
6.1. The frame reaches SW B; its MAC table points to the port connecting to SW D. The frame is sent.
6.2. The frame reaches SW D; its MAC table no longer contains the destination MAC of the VM, as the carrier on the link towards the Hypervisor is down and the entry was wiped.
6.3. With no other ports to forward the frame to, the frame is dropped.

Comments:
* Until SW B's MAC entry for the VM ages out or is updated by a frame arriving via SW A, the CLIENT will fail to reach the VM. This can take up to 5 minutes.
* Since the RHV bridge only supports legacy STP, it does not participate in the STP topology, so no TCNs (which would help re-establish connectivity faster) are sent when the failover occurs.
* In more complex scenarios, this can be worse.

The kernel does a very similar job for all the devices/vlans with IP addresses on top of the bond.

From https://www.kernel.org/doc/Documentation/networking/bonding.txt

"In bonding version 2.6.2 or later, when a failover
occurs in active-backup mode, bonding will issue one
or more gratuitous ARPs on the newly active slave.
One gratuitous ARP is issued for the bonding master
interface and each VLAN interfaces configured above
it, provided that the interface has at least one IP
address configured. Gratuitous ARPs issued for VLAN
interfaces are tagged with the appropriate VLAN id."

If we could do the same for VMs, it would provide higher network availability for the VMs. A mechanism similar to what qemu does at the end of migrations would suffice.

Additionally, this might be extended to other bonding modes/topologies that could suffer from similar problems, or be user-configurable as a bonding option in the Administration Portal.

4. Why does the customer need this? (List the business requirements here)
* The actual failover time for VMs can reach up to 5 minutes of outage (the MAC aging time in the switches).

5. How would the customer like to achieve this? (List the functional requirements here)
* maintain a data structure mapping physical devices to respective tap devices
* monitor physical interface link changes (via netlink?)
* on bond failover, send a frame on behalf of VMs related to those tap devices, which is flooded and updates all switches' tables (see the sketch at the end of this comment)

6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
* Unplug the cable from the active slave of the active-backup device and tcpdump for a RARP for each VM running in that logical network.

10. List any affected packages or components.
* vdsm, supervdsm, possibly up to libvirt and qemu
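For illustration only, here is the sketch referenced in item 5: a minimal example of sending a qemu-style RARP announce on behalf of a VM's MAC so that all switches re-learn it on the new path. It assumes Python with raw AF_PACKET sockets and root privileges; the bridge name and MAC address below are made up.

#!/usr/bin/env python3
# Sketch only: build and broadcast a RARP frame sourced from a VM's MAC,
# mimicking what qemu_announce_self() emits after a migration.
import socket
import struct

def send_rarp_announce(ifname, vm_mac):
    """Broadcast a RARP 'reverse request' with vm_mac as the source on ifname."""
    mac = bytes.fromhex(vm_mac.replace(':', ''))
    bcast = b'\xff' * 6
    eth = bcast + mac + struct.pack('!H', 0x8035)       # EtherType 0x8035 = RARP
    rarp = struct.pack('!HHBBH', 1, 0x0800, 6, 4, 3)    # htype, ptype, hlen, plen, op=3
    rarp += mac + b'\x00' * 4                           # sender MAC, sender IP 0.0.0.0
    rarp += mac + b'\x00' * 4                           # target MAC, target IP 0.0.0.0
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(eth + rarp)                              # switches flood this and re-learn the MAC

if __name__ == '__main__':
    # Hypothetical bridge name and VM MAC, for illustration only.
    send_rarp_announce('ovirtmgmt', '00:1a:4a:61:18:99')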
(In reply to Germano Veit Michel from comment #0)
> 5. How would the customer like to achieve this? (List the functional
> requirements here)
> * maintain a data structure mapping physical devices to respective tap
> devices
> * monitor physical interface link changes (via netlink?)
> * on bond failover, send a frame on behalf of VMs related to those tap
> devices, which is flooded and updates all switches' tables.

Who would maintain that data structure? vdsm?
And how would vdsm know about the change in the link status? I think networking is transparent to vdsm, it either works or not. But I am nowhere near a network expert to weigh in on this.
(In reply to Marina from comment #4)
> Who would maintain that data structure? vdsm?
> And how would vdsm know about the change in the link status? I think
> networking is transparent to vdsm, it either works or not.

Libvirt or vdsm sounds the most appropriate to me. With libvirt, perhaps other products could also use the functionality (OpenStack?), so that would be a plus.

Any userspace code can listen on a netlink socket and watch for the bond events. Qemu already generates the required probe frame at the end of migration, so the code is already there. We could also just expose qemu_announce_self() via some command and let libvirt/vdsm use it too.

Another option would be to implement this in the kernel, but I doubt that is the correct approach, since this can be done from userspace.
Thanks to Marina, the description in comment #0 above can be simplified :)

When a VM migrates, we send a probe frame to notify the whole network that the VM's MAC has changed location. When a Host bond fails over, the traffic may likewise move to a different switch, so we also need to notify the whole network: the newly active slave might sit in a completely different position in the topology (same as after a migration).
(In reply to Germano Veit Michel from comment #0)
> The kernel does a very similar job for all the devices/vlans with IP
> addresses on top of the bond.
>
> From https://www.kernel.org/doc/Documentation/networking/bonding.txt
>
> "In bonding version 2.6.2 or later, when a failover
> occurs in active-backup mode, bonding will issue one
> or more gratuitous ARPs on the newly active slave.
> One gratuitous ARP is issued for the bonding master
> interface and each VLAN interfaces configured above
> it, provided that the interface has at least one IP
> address configured. Gratuitous ARPs issued for VLAN
> interfaces are tagged with the appropriate VLAN id."
>
> If we could do the same for VMs, it would provide higher network
> availability for the VMs.

Hannes, could the kernel do the same for a bridge? Should openvswitch do that?

The use case does not seem to be limited to VMs; it holds for any MAC address reachable via the bond.
Hi Dan,
The comment above was cut; restoring the needinfo flag.
Sorry for misusing my mouse above. :)

I don't think that arp_notify or ndisc_notify sysctls help you here as they only broadcast garp requests for locally changed mac addresses. In the case of a bridge we would have to maintain information about which MAC addresses are certainly "synthesized" addresses and act accordingly, because sending out GARP on behalf of other systems feels a bit like a policy violation to me.

I wonder if libvirt could simply use arping based on its configuration files and emit the proper GARP packets again. I will think about this more.
I thought about it a bit more: if a kernel solution is to be done, we should also think about more complex topologies, which seem entirely plausible to me. For example, there could be a bridge within a VM that connects either nested VMs or containers that should share an L2 segment.

Also, because of different timing settings, we cannot simply walk the bridge's fdb table and announce the MAC addresses found there: we don't know whether the outer switches' MAC aging times are in sync with the bridge's. If this is supposed to be fail-safe, we would somehow also need to propagate those events up into the VMs, which makes it all a bit more complicated.

Maybe I am overthinking/over-engineering this, but a simple knob doesn't seem to be the solution to me so far.
(In reply to Hannes Frederic Sowa from comment #12)
> Sorry for misusing my mouse above. :)
>
> I don't think that arp_notify or ndisc_notify sysctls help you here as they
> only broadcast garp requests for locally changed mac addresses. In the case
> of a bridge we would have to maintain information about which MAC addresses
> are certainly "synthesized" addresses and act accordingly, because sending
> out GARP on behalf of other systems feels a bit like a policy violation to
> me.

I understand - for example, an in-guest bridge may funnel traffic from completely different parts of the network into the bond. Sending GARP for those is more likely to be out of date.

I suppose that one could limit GARP for mac addresses of tap devices connected to the bridge, and fresh enough in the bridge cache, and avoid complex topologies.
(In reply to Dan Kenigsberg from comment #14)
> I understand - for example, an in-guest bridge may funnel traffic from
> completely different parts of the network into the bond. Sending GARP for
> those is more likely to be out of date.

Yes, we need to maintain correct behavior even in the strangest setups users can come up with. For example, you also don't know whether a VM uses SR-IOV alongside virtio with a bridge to the host.

> I suppose that one could limit GARP for mac addresses of tap devices
> connected to the bridge, and fresh enough in the bridge cache, and avoid
> complex topologies.

Even tap devices get used to connect to physical networks. We once had examples on the upstream list of users doing so. If at all, those interfaces must be specifically marked. Anyway, as it is very easy to get all the mac addresses and interface information from the kernel, I wonder why this can't be done in user space?
(In reply to Hannes Frederic Sowa from comment #15)
> Even tap devices get used to connect to physical networks. We once had
> examples on the upstream list of users doing so. If at all, those interfaces
> must be specifically marked. Anyway, as it is very easy to get all the mac
> addresses and interface information from the kernel, I wonder why this can't
> be done in user space?

It can be done in userspace. In one of the tickets attached to this BZ we implemented this using a combination of bash scripts and some C code copied out of qemu.

Dan, let me know if you need help testing/experimenting with this. I thought the first step was to expose qemu's qemu_announce_self().
Hannes,

Maybe you can help on this one =D

eth1 --|
       |---- bond0
eth2 --|

The active slave is eth2. I pull its cable and I get:

# ip -s -d monitor link
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 state DOWN
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 20 perm_hwaddr 00:1a:4a:61:18:1a queue_id 0
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 state DOWN
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state BACKUP mii_status DOWN link_failure_count 21 perm_hwaddr 00:1a:4a:61:18:1a queue_id 0
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 state DOWN
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond_slave state BACKUP mii_status DOWN link_failure_count 21 perm_hwaddr 00:1a:4a:61:18:1a queue_id 0
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state BACKUP mii_status UP link_failure_count 20 perm_hwaddr 00:1a:4a:61:18:19 queue_id 0
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 20 perm_hwaddr 00:1a:4a:61:18:19 queue_id 0
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode active-backup active_slave eth1 miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 bridge_slave
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode active-backup active_slave eth1 miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 bridge_slave
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode active-backup active_slave eth1 miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 bridge_slave

Summary:
3 down events for eth2 (cable pulled)
2 up events for eth1 (nothing done, it just became the active slave)
3 up events for bond0

Currently I'm listening for up events on bond0, but I'm getting 3 of these. I don't want to trigger the announce for the VMs 3 times, and I would prefer not to implement logic to handle this as it may change in the future.

Any idea what I can filter on to get a single, reliable event?
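One possible approach, sketched below purely as an illustration: ignore the per-slave carrier events entirely and only react when the bond's active_slave value actually changes, which collapses the duplicate bond0 notifications into a single event. The sketch assumes pyroute2 is available and that the RTM_NEWLINK notifications for the bond carry the kernel's IFLA_BOND_ACTIVE_SLAVE attribute (the monitor output above suggests they do, since "active_slave eth1" is part of each bond0 message); attribute names may vary between pyroute2 versions.

# Sketch only: deduplicate failover notifications by tracking the bond's
# active slave instead of raw link up/down events.
from pyroute2 import IPRoute

def watch_active_slave(bond_name):
    with IPRoute() as ipr:
        ipr.bind()                      # subscribe to rtnetlink notifications
        current = None
        while True:
            for msg in ipr.get():       # blocks until messages arrive
                if msg.get('event') != 'RTM_NEWLINK':
                    continue
                if msg.get_attr('IFLA_IFNAME') != bond_name:
                    continue
                linkinfo = msg.get_attr('IFLA_LINKINFO')
                if linkinfo is None or linkinfo.get_attr('IFLA_INFO_KIND') != 'bond':
                    continue
                data = linkinfo.get_attr('IFLA_INFO_DATA')
                active = data.get_attr('IFLA_BOND_ACTIVE_SLAVE') if data else None
                if active is not None and active != current:
                    current = active    # repeated bond0 messages carry the same
                    yield active        # active_slave, so this fires only once

if __name__ == '__main__':
    for slave_ifindex in watch_active_slave('bond0'):
        print('active slave is now ifindex %d, announce the VMs here' % slave_ifindex)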
Hello, is there any progress with this bugzilla?
QEMU has recently been split into sub-components, and as a one-time operation to avoid breaking tools we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks
I wonder if qemu now has everything we need for this, but we need some wiring above it.
Last year we added the QMP 'announce-self' command to trigger a set of ARPs/RARPs on demand.
If you could use that, then does it solve the problem?
(In reply to Dr. David Alan Gilbert from comment #85)
> I wonder if qemu now has everything we need for this, but we need some
> wiring above it.
> Last year we added the QMP 'announce-self' command to trigger a set of
> ARPs/RARPs on demand.
> If you could use that, then does it solve the problem?

IIUC, one of the possible solutions would be to have libvirt (or maybe another daemon, but libvirt seems like a good candidate) listen for NOTIFY_PEERS rtnl events and send 'announce-self' commands to qemu. So yes.

Another option that has been discussed is to add RSTP.
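For reference, a minimal sketch of driving 'announce-self' over QMP. The socket path is hypothetical, and in an RHV deployment libvirt owns the monitor, so in practice this would go through libvirt rather than a direct connection.

# Sketch only: ask a running qemu to re-announce its NICs via QMP.
import json
import socket

def qmp_announce_self(qmp_socket_path):
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(qmp_socket_path)
        f = s.makefile('rw', buffering=1)
        f.readline()                                    # consume the QMP greeting
        for cmd in ({'execute': 'qmp_capabilities'},    # enter command mode
                    {'execute': 'announce-self'}):      # emit the RARP/ARP announcements
            f.write(json.dumps(cmd) + '\n')
            print(f.readline().strip())                 # print the reply (or an async event)

if __name__ == '__main__':
    qmp_announce_self('/tmp/qmp-example.sock')          # hypothetical socket path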
Bulk update - Move RHEL-AV bugs to RHEL