1. Proposed title of this feature request
* Improve network failover times on active-backup bonds in complex networks.

3. What is the nature and description of the request?

Please see the diagram below. The Hypervisor is connected to the network via a bond device, which masters two physical interfaces, each connected to a different switch for maximum availability.

+----+               +----+         +-----------+
|SW A+---------------+SW B+---------+  CLIENT   |
+-+--+               +--+-+         +-----------+
  |                     |
  |                     |
  |                     |
  |                     |
+-+--+               +--+-+
|SW C+---------------+SW D|
+-+--+               +--+-+
  |     MODE 1 BOND     |
  +----------+----------+
             |
+------------+------------+
|        HYPERVISOR       |
+-------------------------+

Considerations:
1. SW A, B, C and D run RSTP.
2. The Hypervisor bridge does not run STP, because it only supports legacy STP, with a minimum convergence time of 2x the Forwarding Time, which is too long for mission-critical applications.

Now consider the following scenario:
1. The link between SW C and SW D is blocked by STP, to avoid loops.
2. The topology for CLIENT <-> VM communication is CLIENT -> SW B -> SW D -> Hypervisor bridge -> VM.
3. The link between the Hypervisor and SW D fails.
4. The Hypervisor bond switches its active slave towards SW C.
5. The STP topology remains unchanged; no TCN is sent by any switch/bridge.
6. CLIENT sends a frame to the VM:
6.1. The frame reaches SW B; its MAC table points to the port connecting to SW D. The frame is sent.
6.2. The frame reaches SW D; its MAC table no longer contains the destination MAC of the VM, as the carrier on the link towards the Hypervisor is down and the entry was wiped.
6.3. With no other ports to forward the frame to, the frame is dropped.

Comments:
* Until SW B's MAC entry for the VM ages out or is updated by a frame arriving via SW A, the CLIENT will fail to reach the VM. This can take up to 5 minutes.
* Since the RHV bridge only supports legacy STP, it does not participate in the STP topology, so no TCNs (which would help re-establish connectivity faster) are sent when the failover occurs.
* In more complex scenarios, this can be worse.

The kernel does a very similar job for all the devices/vlans with IP addresses on top of the bond.

From https://www.kernel.org/doc/Documentation/networking/bonding.txt

"In bonding version 2.6.2 or later, when a failover
occurs in active-backup mode, bonding will issue one
or more gratuitous ARPs on the newly active slave.
One gratuitous ARP is issued for the bonding master
interface and each VLAN interfaces configured above
it, provided that the interface has at least one IP
address configured. Gratuitous ARPs issued for VLAN
interfaces are tagged with the appropriate VLAN id."

If we could do the same for VMs, it would provide higher network availability for the VMs. A mechanism similar to what qemu does at the end of migrations would suffice.

Additionally, this might be extended to other bonding modes/topologies that could suffer from similar problems, or be user-configurable as a bonding option in the Administration Portal.

4. Why does the customer need this? (List the business requirements here)
* The actual failover time for VMs can reach up to 5 minutes of outage (the MAC aging time in the switches).

5. How would the customer like to achieve this? (List the functional requirements here)
* maintain a data structure mapping physical devices to respective tap devices
* monitor physical interface link changes (via netlink?)
* on bond failover, send a frame on behalf of VMs related to those tap devices, which is flooded and updates all switches' tables (see the sketch at the end of this comment)

6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
* Unplug the cable from the active slave of the active-backup device and tcpdump for a RARP for each VM running in that logical network.

10. List any affected packages or components.
* vdsm, supervdsm, possibly up to libvirt and qemu
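For illustration only, here is the sketch referenced in item 5: a minimal example of sending a qemu-style RARP announce on behalf of a VM's MAC so that all switches re-learn it on the new path. It assumes Python with raw AF_PACKET sockets and root privileges; the bridge name and MAC address below are made up.

#!/usr/bin/env python3
# Sketch only: build and broadcast a RARP frame sourced from a VM's MAC,
# mimicking what qemu_announce_self() emits after a migration.
import socket
import struct

def send_rarp_announce(ifname, vm_mac):
    """Broadcast a RARP 'reverse request' with vm_mac as the source on ifname."""
    mac = bytes.fromhex(vm_mac.replace(':', ''))
    bcast = b'\xff' * 6
    eth = bcast + mac + struct.pack('!H', 0x8035)       # EtherType 0x8035 = RARP
    rarp = struct.pack('!HHBBH', 1, 0x0800, 6, 4, 3)    # htype, ptype, hlen, plen, op=3
    rarp += mac + b'\x00' * 4                           # sender MAC, sender IP 0.0.0.0
    rarp += mac + b'\x00' * 4                           # target MAC, target IP 0.0.0.0
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(eth + rarp)                              # switches flood this and re-learn the MAC

if __name__ == '__main__':
    # Hypothetical bridge name and VM MAC, for illustration only.
    send_rarp_announce('ovirtmgmt', '00:1a:4a:61:18:99')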
(In reply to Germano Veit Michel from comment #0)
> 5. How would the customer like to achieve this? (List the functional
> requirements here)
> * maintain a data structure mapping physical devices to respective tap
> devices
> * monitor physical interface link changes (via netlink?)
> * on bond failover, send a frame on behalf of VMs related to those tap
> devices, which is flooded and updates all switches' tables.

Who would maintain that data structure? vdsm?
And how would vdsm know about the change in the link status? I think networking is transparent to vdsm, it either works or not. But I am nowhere near a network expert to weigh in on this.
(In reply to Marina from comment #4)
> Who would maintain that data structure? vdsm?
> And how would vdsm know about the change in the link status? I think
> networking is transparent to vdsm, it either works or not.

Libvirt or vdsm sounds the most appropriate to me. With libvirt, perhaps other products could also use the functionality (OpenStack?), so that would be a plus.

Any userspace code can listen on a netlink socket and watch for the bond events. Qemu already generates the required probe frame at the end of migration, so the code is already there. We could also just expose qemu_announce_self() via some command and let libvirt/vdsm use it too.

Another option would be to implement this in the kernel, but I doubt that is the correct approach, since this can be done from userspace.
Thanks to Marina, the description in comment #0 above can be simplified :)

When a VM migrates, we send a probe frame to notify the whole network that the VM's MAC has changed location. When a Host bond fails over, the traffic may likewise move to a different switch, so we also need to notify the whole network: the newly active slave might sit in a completely different position in the topology (same as after a migration).
(In reply to Germano Veit Michel from comment #0)
> The kernel does a very similar job for all the devices/vlans with IP
> addresses on top of the bond.
>
> From https://www.kernel.org/doc/Documentation/networking/bonding.txt
>
> "In bonding version 2.6.2 or later, when a failover
> occurs in active-backup mode, bonding will issue one
> or more gratuitous ARPs on the newly active slave.
> One gratuitous ARP is issued for the bonding master
> interface and each VLAN interfaces configured above
> it, provided that the interface has at least one IP
> address configured. Gratuitous ARPs issued for VLAN
> interfaces are tagged with the appropriate VLAN id."
>
> If we could do the same for VMs, it would provide higher network
> availability for the VMs.

Hannes, could the kernel do the same for a bridge? Should openvswitch do that?

The use case does not seem to be limited to VMs; it holds for any MAC address reachable via the bond.
Hi Dan,
The comment above was cut; restoring the needinfo flag.
Sorry for misusing my mouse above. :)

I don't think that arp_notify or ndisc_notify sysctls help you here as they only broadcast garp requests for locally changed mac addresses. In the case of a bridge we would have to maintain information about which MAC addresses are certainly "synthesized" addresses and act accordingly, because sending out GARP on behalf of other systems feels a bit like a policy violation to me.

I wonder if libvirt could simply use arping based on its configuration files and emit the proper GARP packets again. I will think about this more.
I thought about it a bit more: if a kernel solution is to be done, we should also think about more complex topologies, which seem entirely plausible to me. For example, there could be a bridge within a VM that connects either nested VMs or containers that should share an L2 segment.

Also, because of different timing settings, we cannot simply walk the bridge's fdb table and announce the MAC addresses found there: we don't know whether the outer switches' MAC aging times are in sync with the bridge's. If this is supposed to be fail-safe, we would somehow also need to propagate those events up into the VMs, which makes it all a bit more complicated.

Maybe I am overthinking/over-engineering this, but a simple knob doesn't seem to be the solution to me so far.
(In reply to Hannes Frederic Sowa from comment #12)
> Sorry for misusing my mouse above. :)
>
> I don't think that arp_notify or ndisc_notify sysctls help you here as they
> only broadcast garp requests for locally changed mac addresses. In the case
> of a bridge we would have to maintain information about which MAC addresses
> are certainly "synthesized" addresses and act accordingly, because sending
> out GARP on behalf of other systems feels a bit like a policy violation to
> me.

I understand - for example, an in-guest bridge may funnel traffic from completely different parts of the network into the bond. Sending GARP for those is more likely to be out of date.

I suppose that one could limit GARP for mac addresses of tap devices connected to the bridge, and fresh enough in the bridge cache, and avoid complex topologies.
(In reply to Dan Kenigsberg from comment #14)
> I understand - for example, an in-guest bridge may funnel traffic from
> completely different parts of the network into the bond. Sending GARP for
> those is more likely to be out of date.

Yes, we need to maintain correct behavior even in the strangest setups users can come up with. For example, you also don't know whether a VM uses SR-IOV alongside virtio with a bridge to the host.

> I suppose that one could limit GARP for mac addresses of tap devices
> connected to the bridge, and fresh enough in the bridge cache, and avoid
> complex topologies.

Even tap devices get used to connect to physical networks. We once had examples on the upstream list of users doing so. If at all, those interfaces must be specifically marked. Anyway, as it is very easy to get all the mac addresses and interface information from the kernel, I wonder why this can't be done in user space?
(In reply to Hannes Frederic Sowa from comment #15)
> Even tap devices get used to connect to physical networks. We once had
> examples on the upstream list of users doing so. If at all, those interfaces
> must be specifically marked. Anyway, as it is very easy to get all the mac
> addresses and interface information from the kernel, I wonder why this can't
> be done in user space?

It can be done in userspace. In one of the tickets attached to this BZ we implemented this using a combination of bash scripts and some C code copied out of qemu.

Dan, let me know if you need help testing/experimenting with this. I thought the first step was to expose qemu's qemu_announce_self().
Hannes,

Maybe you can help on this one =D

eth1 --|
       |---- bond0
eth2 --|

The active slave is eth2. I pull its cable and I get:

# ip -s -d monitor link
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 state DOWN
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 20 perm_hwaddr 00:1a:4a:61:18:1a queue_id 0
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 state DOWN
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state BACKUP mii_status DOWN link_failure_count 21 perm_hwaddr 00:1a:4a:61:18:1a queue_id 0
4: eth2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 state DOWN
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 0
    bond_slave state BACKUP mii_status DOWN link_failure_count 21 perm_hwaddr 00:1a:4a:61:18:1a queue_id 0
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state BACKUP mii_status UP link_failure_count 20 perm_hwaddr 00:1a:4a:61:18:19 queue_id 0
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond_slave state ACTIVE mii_status UP link_failure_count 20 perm_hwaddr 00:1a:4a:61:18:19 queue_id 0
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode active-backup active_slave eth1 miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 bridge_slave
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode active-backup active_slave eth1 miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 bridge_slave
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovirtmgmt state UP
    link/ether 00:1a:4a:61:18:19 brd ff:ff:ff:ff:ff:ff promiscuity 1
    bond mode active-backup active_slave eth1 miimon 100 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 bridge_slave

Summary:
3 down events for eth2 (cable pulled)
2 up events for eth1 (nothing done, it just became the active slave)
3 up events for bond0

Currently I'm listening for up events on bond0, but I'm getting 3 of these. I don't want to trigger the announce for the VMs 3 times, and I would prefer not to implement logic to handle this as it may change in the future.

Any idea what I can filter on to get a single, reliable event?
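One possible approach, sketched below purely as an illustration: ignore the per-slave carrier events entirely and only react when the bond's active_slave value actually changes, which collapses the duplicate bond0 notifications into a single event. The sketch assumes pyroute2 is available and that the RTM_NEWLINK notifications for the bond carry the kernel's IFLA_BOND_ACTIVE_SLAVE attribute (the monitor output above suggests they do, since "active_slave eth1" is part of each bond0 message); attribute names may vary between pyroute2 versions.

# Sketch only: deduplicate failover notifications by tracking the bond's
# active slave instead of raw link up/down events.
from pyroute2 import IPRoute

def watch_active_slave(bond_name):
    with IPRoute() as ipr:
        ipr.bind()                      # subscribe to rtnetlink notifications
        current = None
        while True:
            for msg in ipr.get():       # blocks until messages arrive
                if msg.get('event') != 'RTM_NEWLINK':
                    continue
                if msg.get_attr('IFLA_IFNAME') != bond_name:
                    continue
                linkinfo = msg.get_attr('IFLA_LINKINFO')
                if linkinfo is None or linkinfo.get_attr('IFLA_INFO_KIND') != 'bond':
                    continue
                data = linkinfo.get_attr('IFLA_INFO_DATA')
                active = data.get_attr('IFLA_BOND_ACTIVE_SLAVE') if data else None
                if active is not None and active != current:
                    current = active    # repeated bond0 messages carry the same
                    yield active        # active_slave, so this fires only once

if __name__ == '__main__':
    for slave_ifindex in watch_active_slave('bond0'):
        print('active slave is now ifindex %d, announce the VMs here' % slave_ifindex)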
Hello, is there any progress with this bugzilla?
QEMU has recently been split into sub-components, and as a one-time operation to avoid breaking tools we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks
I wonder if qemu now has everything we need for this, but we need some wiring above it.
Last year we added the QMP 'announce-self' command to trigger a set of ARPs/RARPs on demand.
If you could use that, then does it solve the problem?
(In reply to Dr. David Alan Gilbert from comment #85)
> I wonder if qemu now has everything we need for this, but we need some
> wiring above it.
> Last year we added the QMP 'announce-self' command to trigger a set of
> ARPs/RARPs on demand.
> If you could use that, then does it solve the problem?

IIUC, one of the possible solutions would be to have libvirt (or maybe another daemon, but libvirt seems like a good candidate) listen for NOTIFY_PEERS rtnl events and send 'announce-self' commands to qemu. So yes.

Another option that has been discussed is to add RSTP.
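For reference, a minimal sketch of driving 'announce-self' over QMP. The socket path is hypothetical, and in an RHV deployment libvirt owns the monitor, so in practice this would go through libvirt rather than a direct connection.

# Sketch only: ask a running qemu to re-announce its NICs via QMP.
import json
import socket

def qmp_announce_self(qmp_socket_path):
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(qmp_socket_path)
        f = s.makefile('rw', buffering=1)
        f.readline()                                    # consume the QMP greeting
        for cmd in ({'execute': 'qmp_capabilities'},    # enter command mode
                    {'execute': 'announce-self'}):      # emit the RARP/ARP announcements
            f.write(json.dumps(cmd) + '\n')
            print(f.readline().strip())                 # print the reply (or an async event)

if __name__ == '__main__':
    qmp_announce_self('/tmp/qmp-example.sock')          # hypothetical socket path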
Bulk update - Move RHEL-AV bugs to RHEL