Description of problem:

This issue can be reproduced with the upstream tempest test test_server_connectivity_live_migration, but the test needs to be updated with this change (otherwise it wrongly passes): https://review.opendev.org/c/openstack/tempest/+/880719

It only fails on BGP setups, which is why the component is initially set to ovn-bgp-agent, although the fix may be implemented in neutron or somewhere else.

The manual reproduction is simple:
- create a VM connected to a provider network with external connectivity
- start a ping from the VM to an external IP (8.8.8.8) - by default one ping is sent per second
- run the following command: openstack server migrate --live-migration vm0
- stop the ping command and check the number of pings not replied

In a non-BGP setup, only one ping is lost (~1 second of connectivity downtime). In a BGP setup, the downtime is between 15 and 20 seconds.

The reason is that the default GW's MAC address changes when the VM is migrated to a different compute, because in BGP setups that MAC corresponds to the compute's br-ex interface. This change doesn't happen immediately in the VM's ARP table; it only happens when the VM sends an ARP request asking for the MAC of that default GW.
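As a side note, the "check the number of pings not replied" step in the reproduction above can be quantified by parsing the ping output. A minimal Python sketch (the helper name and the sample output are made up for illustration), assuming the default 1-second ping interval so each missing icmp_seq is roughly one second of downtime:

```python
import re

def count_lost_pings(ping_output: str) -> int:
    """Estimate lost pings from `ping` output by counting the gaps
    in the icmp_seq numbers of the replies that were received."""
    seqs = [int(m) for m in re.findall(r"icmp_seq=(\d+)", ping_output)]
    if not seqs:
        return 0
    # With the default 1 s interval, each missing sequence number is
    # roughly one second of connectivity downtime.
    return (seqs[-1] - seqs[0] + 1) - len(seqs)

# Example: replies for seq 22-23 and 42-43 -> seq 24..41 missing (~18 s downtime)
sample = "\n".join(
    f"64 bytes from 8.8.8.8: icmp_seq={s} ttl=51 time=73 ms"
    for s in (22, 23, 42, 43)
)
print(count_lost_pings(sample))  # -> 18
```

This matches the captures below: replies stop at seq 23 and resume at seq 42, i.e. roughly 18 seconds of downtime.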
Packets captured at the VM eth0 interface before the migration from comp-0 to comp-1 (the MAC a6:e1:df:19:b3:45 corresponds to the comp-0 br-ex interface) show that the pings are successfully replied:

09:41:20.683835 fa:16:3e:27:ba:5b > a6:e1:df:19:b3:45, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 2493, offset 0, flags [DF], proto ICMP (1), length 84)
    172.24.100.16 > 8.8.8.8: ICMP echo request, id 1, seq 22, length 64
09:41:20.756554 a6:e1:df:19:b3:45 > fa:16:3e:27:ba:5b, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 51, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 172.24.100.16: ICMP echo reply, id 1, seq 22, length 64
09:41:21.685383 fa:16:3e:27:ba:5b > a6:e1:df:19:b3:45, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 3418, offset 0, flags [DF], proto ICMP (1), length 84)
    172.24.100.16 > 8.8.8.8: ICMP echo request, id 1, seq 23, length 64
09:41:21.757542 a6:e1:df:19:b3:45 > fa:16:3e:27:ba:5b, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 51, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 172.24.100.16: ICMP echo reply, id 1, seq 23, length 64

When the VM is migrated to comp-1, the following ARP is captured (46:fd:fb:5d:e1:41 is the comp-1 br-ex MAC):

09:41:22.794658 fa:16:3e:27:ba:5b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.100.16 tell 172.24.100.16, length 28
09:41:23.028863 46:fd:fb:5d:e1:41 > fa:16:3e:27:ba:5b, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 172.24.100.16 is-at 46:fd:fb:5d:e1:41, length 28

After that, pings are not replied during ~17 seconds - they are sent to the wrong destination MAC (a6:e1:df:19:b3:45 is from comp-0 br-ex, but the VM is running on comp-1 now):

09:41:23.734689 fa:16:3e:27:ba:5b > a6:e1:df:19:b3:45, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 4027, offset 0, flags [DF], proto ICMP (1), length 84)
    172.24.100.16 > 8.8.8.8: ICMP echo request, id 1, seq 25, length 64
09:41:24.758616 fa:16:3e:27:ba:5b > a6:e1:df:19:b3:45, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 4042, offset 0, flags [DF], proto ICMP (1), length 84)
    172.24.100.16 > 8.8.8.8: ICMP echo request, id 1, seq 26, length 64
...
09:41:40.118865 fa:16:3e:27:ba:5b > a6:e1:df:19:b3:45, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 12990, offset 0, flags [DF], proto ICMP (1), length 84)
    172.24.100.16 > 8.8.8.8: ICMP echo request, id 1, seq 41, length 64

Then, the following ARP exchange fixes the problem with the destination MAC (46:fd:fb:5d:e1:41 is from comp-1 br-ex):

09:41:41.143189 fa:16:3e:27:ba:5b > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 172.24.100.1 tell 172.24.100.16, length 28
09:41:41.917087 46:fd:fb:5d:e1:41 > fa:16:3e:27:ba:5b, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 172.24.100.1 is-at 46:fd:fb:5d:e1:41, length 28
09:41:41.917114 fa:16:3e:27:ba:5b > 46:fd:fb:5d:e1:41, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 13055, offset 0, flags [DF], proto ICMP (1), length 84)
    172.24.100.16 > 8.8.8.8: ICMP echo request, id 1, seq 42, length 64
09:41:41.990308 46:fd:fb:5d:e1:41 > fa:16:3e:27:ba:5b, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 51, id 0, offset 0, flags [none], proto ICMP (1), length 84)
    8.8.8.8 > 172.24.100.16: ICMP echo reply, id 1, seq 42, length 64

The tempest test test_server_connectivity_live_migration also covers the scenario of a VM with a port from a tenant network and with a FIP. It fails too. I will add a comment when I test the scenario with a tenant network and no FIP.

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230404.n.1

How reproducible:
100%

Actual results:
Connectivity downtime of 15+ seconds

Expected results:
Lower connectivity downtime during/after live-migration
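The stale-MAC window in captures like the ones above can be spotted programmatically by extracting the timestamp and destination MAC from each frame header and watching for the flip from the comp-0 to the comp-1 br-ex MAC. A small Python sketch (the function name is made up; the sample lines are abbreviated frame headers from the capture above):

```python
import re

# tcpdump frame header: "<timestamp> <src MAC> > <dst MAC>, ..."
LINE_RE = re.compile(r"^(\d+:\d+:\d+\.\d+) ([0-9a-f:]{17}) > ([0-9a-f:]{17})")

def dst_mac_timeline(capture: str):
    """Return (timestamp, destination MAC) pairs for each frame header line."""
    return [(m.group(1), m.group(3))
            for line in capture.splitlines()
            if (m := LINE_RE.match(line))]

capture = """\
09:41:23.734689 fa:16:3e:27:ba:5b > a6:e1:df:19:b3:45, ethertype IPv4 (0x0800), length 98:
09:41:40.118865 fa:16:3e:27:ba:5b > a6:e1:df:19:b3:45, ethertype IPv4 (0x0800), length 98:
09:41:41.917114 fa:16:3e:27:ba:5b > 46:fd:fb:5d:e1:41, ethertype IPv4 (0x0800), length 98:
"""
for ts, dst in dst_mac_timeline(capture):
    print(ts, dst)
```

On these sample lines the destination stays at the stale comp-0 MAC (a6:e1:df:19:b3:45) until 09:41:41, when it flips to the comp-1 MAC (46:fd:fb:5d:e1:41), matching the ~17 s downtime described above.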
The downtime without FIP is 3 seconds or less. Even if the VM is migrated to a compute in a different rack (connected to a different leaf or leaves), the downtime is low because the destination MAC corresponds to the router gateway (typically the IP X.X.X.1), which doesn't change during the migration.
This bug only occurs when no other VM is running on the destination compute. If another VM was running on that compute before the VM under test is migrated, the flows from [1] already exist and the measured downtime is 2 seconds or less. If no previous VM was running on that compute, these flows do not exist until they are created by the sync process.

[1]
[root@cmp-1-0 ~]# ovs-ofctl dump-flows br-ex
 cookie=0x3e7, duration=75.948s, table=0, n_packets=1, n_bytes=90, priority=900,ip,in_port="patch-provnet-4" actions=mod_dl_dst:46:fd:fb:5d:e1:41,NORMAL
 cookie=0x3e7, duration=75.938s, table=0, n_packets=0, n_bytes=0, priority=900,ipv6,in_port="patch-provnet-4" actions=mod_dl_dst:46:fd:fb:5d:e1:41,NORMAL
 cookie=0x0, duration=592819.333s, table=0, n_packets=6151, n_bytes=823797, priority=0 actions=NORMAL
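Whether the destination compute already has the flows from [1] installed can be checked by looking for the mod_dl_dst action in the br-ex flow dump. A minimal Python sketch over the dump text (the helper name is made up; the sample dump is trimmed from [1]):

```python
def has_mod_dl_dst_flow(dump: str, mac: str) -> bool:
    """True if any flow in the dump rewrites the destination MAC
    to the given br-ex MAC (i.e. the sync process already ran)."""
    return any(f"mod_dl_dst:{mac}" in line for line in dump.splitlines())

# Trimmed output of `ovs-ofctl dump-flows br-ex` from [1]
dump = """\
 cookie=0x3e7, duration=75.948s, table=0, n_packets=1, n_bytes=90, priority=900,ip,in_port="patch-provnet-4" actions=mod_dl_dst:46:fd:fb:5d:e1:41,NORMAL
 cookie=0x0, duration=592819.333s, table=0, n_packets=6151, n_bytes=823797, priority=0 actions=NORMAL
"""
print(has_mod_dl_dst_flow(dump, "46:fd:fb:5d:e1:41"))  # -> True
```

If this returns False for the destination compute's br-ex MAC right after the migration, the VM's traffic will keep hitting the stale MAC until the sync process installs the flows.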
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2023:4577