Description of problem:
A customer is live-migrating an instance with "openstack server migrate --live-migration --host" and observes a burst of packet loss (measured with ping) during the migration. The migration itself completes successfully.

Version-Release number of selected component (if applicable):
Distro:
[redhat-release] Red Hat Enterprise Linux release 8.2 (Ootpa)
[rhosp-release] Red Hat OpenStack Platform release 16.1.0 GA (Train)
[os-release] Red Hat Enterprise Linux 8.2 (Ootpa)

network-scripts-openvswitch2.13-2.13.0-39.el8fdp.x86_64  Wed Jul 22 11:38:29 2020
openvswitch2.13-2.13.0-39.el8fdp.x86_64  Wed Jul 22 11:39:05 2020
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch  Wed Jul 22 11:38:33 2020
rhosp-openvswitch-2.13-8.el8ost.noarch  Wed Jul 22 11:39:46 2020
puppet-nova-15.5.1-0.20200608173427.c15e37c.el8ost.noarch  Wed Jul 22 11:39:28 2020
python3-novaclient-15.1.0-0.20200310162625.cd396b8.el8ost.noarch  Wed Jul 22 11:34:50 2020
puppet-neutron-15.5.1-0.20200514103419.0a45ec7.el8ost.noarch  Wed Jul 22 11:39:29 2020
python3-neutronclient-6.14.0-0.20200310192910.115f60f.el8ost.noarch  Wed Jul 22 11:35:09 2020

How reproducible:
Always. Create a new instance and live-migrate it from one compute node to another.

Steps to Reproduce:
1. Create an instance
2. Live-migrate the instance
3. Observe packet loss during the migration
4. Instance migrated

Actual results:
Instance migrates, but with packet loss

Expected results:
Instance migrates without packet loss

Additional info:
instance ID: cf6dc934-4cd6-4195-8a81-54201bb2c4b7
request: req-f5910bd3-2063-464b-8c1d-c00f8d4e423c

ovn_controller.log:
2020-12-02T10:47:55.031474209+01:00 stderr F 2020-12-02T09:47:55Z|00221|binding|INFO|Dropped 22 log messages in last 18 seconds (most recently, 18 seconds ago) due to excessive rate
2020-12-02T10:47:55.031554442+01:00 stderr F 2020-12-02T09:47:55Z|00222|binding|INFO|Not claiming lport 428caa60-f0cc-4b3e-878f-f048b28173ba, chassis xxxx requested-chassis <hostname>.<domain>

ovs-vswitchd.log:
2020-12-02T02:22:01.692Z|00212|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-12-02T09:45:20.809Z|00213|bridge|INFO|bridge br-int: added interface tap428caa60-f0 on port 37
2020-12-02T09:45:20.818Z|00214|bridge|INFO|bridge br-int: added interface patch-br-int-to-provnet-ad580e33-e2a2-4f33-85f3-3b699b158e88 on port 38
2020-12-02T09:45:20.818Z|00215|bridge|INFO|bridge br-ex: added interface patch-provnet-ad580e33-e2a2-4f33-85f3-3b699b158e88-to-br-int on port 12
2020-12-02T09:45:30.821Z|00216|connmgr|INFO|br-int<->unix#0: 365 flow_mods 10 s ago (365 adds)
2020-12-02T09:46:30.821Z|00217|connmgr|INFO|br-int<->unix#0: 6 flow_mods 59 s ago (6 adds)
2020-12-02T09:47:58.869Z|00218|bridge|INFO|bridge br-int: added interface tap28039bf1-30 on port 39
2020-12-02T09:48:05.040Z|00219|connmgr|INFO|br-int<->unix#0: 23 flow_mods in the 4 s starting 10 s ago (20 adds, 2 deletes, 1 modifications)
2020-12-02T09:53:55.995Z|00220|timeval|WARN|Unreasonably long 1485ms poll interval (1ms user, 195ms system)
2020-12-02T09:53:55.995Z|00221|timeval|WARN|context switches: 1 voluntary, 0 involuntary

ovs-vswitchd.log:
2020-12-02T02:43:01.816Z|00200|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-12-02T09:25:54.293Z|00201|bridge|INFO|bridge br-int: added interface tap428caa60-f0 on port 34
2020-12-02T09:25:54.305Z|00202|bridge|INFO|bridge br-int: added interface patch-br-int-to-provnet-ad580e33-e2a2-4f33-85f3-3b699b158e88 on port 35
2020-12-02T09:25:54.305Z|00203|bridge|INFO|bridge br-ex: added interface patch-provnet-ad580e33-e2a2-4f33-85f3-3b699b158e88-to-br-int on port 11
2020-12-02T09:25:55.100Z|00204|bridge|INFO|bridge br-int: added interface tap28039bf1-30 on port 36
2020-12-02T09:26:04.300Z|00205|connmgr|INFO|br-int<->unix#0: 387 flow_mods in the 1 s starting 10 s ago (386 adds, 1 modifications)
2020-12-02T09:31:29.732Z|00206|timeval|WARN|Unreasonably long 3136ms poll interval (0ms user, 124ms system)
2020-12-02T09:31:29.732Z|00207|timeval|WARN|context switches: 2 voluntary, 0 involuntary

nova-compute.log:
2020-12-02 10:47:59.619 8 DEBUG nova.network.neutronv2.api [req-f5910bd3-2063-464b-8c1d-c00f8d4e423c 967706eb9b764d738d231e6c018c6fc8 f131d5d7dd2b4bb78179fef91daf95f8 - default default] Deleted binding for port 428caa60-f0cc-4b3e-878f-f048b28173ba and host <hostname>.<domain>. delete_port_binding /usr/lib/python3.6/site-packages/nova/network/neutronv2/api.py:1341
2020-12-02 10:48:00.543 8 DEBUG nova.compute.manager [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] Received event network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba external_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:9272
2020-12-02 10:48:00.543 8 DEBUG oslo_concurrency.lockutils [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] Lock "cf6dc934-4cd6-4195-8a81-54201bb2c4b7-events" acquired by "nova.compute.manager.InstanceEvents.pop_instance_event.<locals>._pop_event" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2020-12-02 10:48:00.543 8 DEBUG oslo_concurrency.lockutils [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] Lock "cf6dc934-4cd6-4195-8a81-54201bb2c4b7-events" released by "nova.compute.manager.InstanceEvents.pop_instance_event.<locals>._pop_event" :: held 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2020-12-02 10:48:00.544 8 DEBUG nova.compute.manager [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] No waiting events found dispatching network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba pop_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:361
2020-12-02 10:48:00.544 8 WARNING nova.compute.manager [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] Received unexpected event network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba for instance with vm_state active and task_state None.
2020-12-02 10:48:02.601 8 DEBUG nova.compute.manager [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] Received event network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba external_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:9272
2020-12-02 10:48:02.602 8 DEBUG oslo_concurrency.lockutils [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] Lock "cf6dc934-4cd6-4195-8a81-54201bb2c4b7-events" acquired by "nova.compute.manager.InstanceEvents.pop_instance_event.<locals>._pop_event" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2020-12-02 10:48:02.602 8 DEBUG oslo_concurrency.lockutils [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] Lock "cf6dc934-4cd6-4195-8a81-54201bb2c4b7-events" released by "nova.compute.manager.InstanceEvents.pop_instance_event.<locals>._pop_event" :: held 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2020-12-02 10:48:02.602 8 DEBUG nova.compute.manager [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] No waiting events found dispatching network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba pop_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:361
2020-12-02 10:48:02.603 8 WARNING nova.compute.manager [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] Received unexpected event network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba for instance with vm_state active and task_state None.
2020-12-02 10:48:02.871 8 DEBUG oslo_service.periodic_task [req-af3534f6-3452-4cdf-91c4-4356e6c7b13c - - - - -] Running periodic task ComputeManager._poll_rebooting_instances run_periodic_tasks /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:217

/var/log/messages:
Dec 2 10:45:20 <hostname> ovs-vsctl[62116]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --if-exists del-port tap428caa60-f0 -- add-port br-int tap428caa60-f0 -- set Interface tap428caa60-f0 "external-ids:attached-mac=\"fa:16:3e:b8:6a:9f\"" -- set Interface tap428caa60-f0 "external-ids:iface-id=\"428caa60-f0cc-4b3e-878f-f048b28173ba\"" -- set Interface tap428caa60-f0 "external-ids:vm-id=\"cf6dc934-4cd6-4195-8a81-54201bb2c4b7\"" -- set Interface tap428caa60-f0 external-ids:iface-status=active
Dec 2 10:45:20 <hostname> NetworkManager[8361]: <info> [1606902320.8099] device (tap428caa60-f0): enslaved to non-master-type device ovs-system; ignoring
Dec 2 10:45:20 <hostname> kernel: device tap428caa60-f0 entered promiscuous mode
Dec 2 10:45:20 <hostname> NetworkManager[8361]: <info> [1606902320.8125] device (tap428caa60-f0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Dec 2 10:45:20 <hostname> NetworkManager[8361]: <info> [1606902320.8135] device (tap428caa60-f0): enslaved to non-master-type device ovs-system; ignoring
Dec 2 10:45:20 <hostname> NetworkManager[8361]: <info> [1606902320.8137] device (tap428caa60-f0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')

/var/log/messages:
Dec 2 10:25:54 <hostname> systemd[1]: Starting nova_libvirt healthcheck...
Dec 2 10:25:54 <hostname> systemd[1]: Starting ovn_metadata_agent healthcheck...
Dec 2 10:25:54 <hostname> NetworkManager[6738]: <info> [1606901154.2830] manager: (tap428caa60-f0): new Tun device (/org/freedesktop/NetworkManager/Devices/43)
Dec 2 10:25:54 <hostname> systemd-udevd[43585]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 2 10:25:54 <hostname> ovs-vsctl[43582]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --if-exists del-port tap428caa60-f0 -- add-port br-int tap428caa60-f0 -- set Interface tap428caa60-f0 "external-ids:attached-mac=\"fa:16:3e:b8:6a:9f\"" -- set Interface tap428caa60-f0 "external-ids:iface-id=\"428caa60-f0cc-4b3e-878f-f048b28173ba\"" -- set Interface tap428caa60-f0 "external-ids:vm-id=\"cf6dc934-4cd6-4195-8a81-54201bb2c4b7\"" -- set Interface tap428caa60-f0 external-ids:iface-status=active
Dec 2 10:25:54 <hostname> kernel: device tap428caa60-f0 entered promiscuous mode
Dec 2 10:25:54 <hostname> NetworkManager[6738]: <info> [1606901154.2937] device (tap428caa60-f0): enslaved to non-master-type device ovs-system; ignoring
Dec 2 10:25:54 <hostname> NetworkManager[6738]: <info> [1606901154.2981] device (tap428caa60-f0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Dec 2 10:25:54 <hostname> NetworkManager[6738]: <info> [1606901154.2990] device (tap428caa60-f0): enslaved to non-master-type device ovs-system; ignoring
Dec 2 10:25:54 <hostname> NetworkManager[6738]: <info> [1606901154.2991] device (tap428caa60-f0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')
How long is the downtime? As per the OSP documentation, there is no guarantee of zero packet loss, and some downtime is expected:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/instances_and_images_guide/migrating-virtual-machines-between-compute-nodes#migration-types

You can try tuning the following nova parameters:
live_migration_permit_post_copy=true
live_migration_timeout_action=force_complete
live_migration_permit_auto_converge=true

You will have to set them on all controllers in /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf and restart the nova services. Please let me know whether the provided steps helped in any way, what the customer's expectations are, and what downtime they observe.
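As a sketch, those settings would look like this in nova.conf (upstream these options live in the [libvirt] section; please verify the section and defaults against your release's configuration reference):

```ini
[libvirt]
# Allow switching to post-copy if pre-copy memory transfer fails to converge
live_migration_permit_post_copy = true
# Allow QEMU to throttle the guest CPU to force convergence
live_migration_permit_auto_converge = true
# On migration timeout, force-complete the migration instead of aborting it
live_migration_timeout_action = force_complete
```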
To summarise where we are at the moment: the change in behavior we're seeing here is due to the change in networking backend and default plugging behavior between OSP 13 and OSP 16.x.

In OSP 13, we used ML2/OVS with hybrid plug, meaning iptables was used for enforcing firewalling and security groups (a VM would be connected to the vSwitch through a Linux bridge). OSP 16 uses ML2/OVN, which does not use hybrid plug (the VM is connected directly to the vSwitch). With hybrid plug, libvirt did not create the ports that are actually attached to the OVS dataplane; instead, this was done by os-vif during pre-live-migration. In 16.x, the port that libvirt creates is directly attached to the OVS bridge, and OVN observes this and creates the necessary flow rules. However, this port creation only happens when the instance is created on the destination host, right before it is resumed. This leaves a much shorter interval in which to install the OpenFlow rules. That matters because, on resume, QEMU sends MAC learning frames (reverse ARP packets); since OVN has not yet installed the flow rules, these are dropped.

With the patch that has been referenced previously, we're changing the libvirt interface type from bridge (a TAP device wired up to an actual OVS or Linux bridge) to ethernet (a generic, free-floating TAP device). os-vif then creates the port during pre-live-migration (rather than during the migration), which in theory means OVN can create the OpenFlow rules sooner. However, the open question is whether creating the OVS port earlier is sufficient, or whether an actual port on the dataplane is needed. From talking to the engineers on the networking team, it would appear not to be sufficient, because the flow rules depend on having the real port. This patch will help slightly, but it won't resolve the issue fully.

We see four options:
1. Provide a way to revert to hybrid plugging with ML2/OVN. This will have performance implications.
2. Revert to using ML2/OVS. We're not sure if this will be maintainable long term.
3. See if os-vif can precreate the TAP device. We're not sure this will be backportable.
4. Modify OVN to not request the ofport in its ingress pipeline.

We will discuss these options as a team and continue working on the backport in parallel.
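For illustration, the two libvirt interface definitions being discussed look roughly like this (a simplified sketch; the device name, bridge, MAC, and interface ID are taken from the logs above, and the exact XML emitted by Nova may differ):

```xml
<!-- type='bridge': libvirt itself wires the TAP device into the OVS bridge,
     so the port only appears on br-int when the destination domain is
     created, right before resume -->
<interface type='bridge'>
  <mac address='fa:16:3e:b8:6a:9f'/>
  <source bridge='br-int'/>
  <virtualport type='openvswitch'>
    <parameters interfaceid='428caa60-f0cc-4b3e-878f-f048b28173ba'/>
  </virtualport>
  <target dev='tap428caa60-f0'/>
</interface>

<!-- type='ethernet': libvirt only creates a free-floating TAP device;
     os-vif plugs it into OVS during pre-live-migration instead -->
<interface type='ethernet'>
  <mac address='fa:16:3e:b8:6a:9f'/>
  <target dev='tap428caa60-f0'/>
</interface>
```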
This BZ is tracking Nova's QMP workaround. For QE: a simple smoke test that continuously pings the instance being live-migrated and compares ping loss before and after the fix.
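A minimal sketch of such a smoke test, assuming standard iputils ping output (the instance IP and the helper name are placeholders, not part of any existing test suite): run ping in the background while the live migration is in progress, then parse the loss percentage from ping's summary line and compare runs before and after the fix.

```python
import re


def ping_loss_percent(ping_output: str) -> float:
    """Extract the packet-loss percentage from an iputils ping summary."""
    match = re.search(r"([\d.]+)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


# In the real test, ping would run concurrently with the migration, e.g.:
#   ping -i 0.2 -w 120 <instance-ip> > ping.log &
#   openstack server migrate --live-migration --host <dest> <instance>
# and ping_loss_percent(ping.log contents) would be recorded for each run.
sample = (
    "--- 10.0.0.5 ping statistics ---\n"
    "600 packets transmitted, 588 received, 2% packet loss, time 119880ms\n"
)
print(ping_loss_percent(sample))
```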
Filed https://bugzilla.redhat.com/show_bug.cgi?id=2042162 for the Nova tracker.
Closing this as CANTFIX, as Neutron can't fix the lack of live migration support in OVN in 16.1. I have created 3 new bugs in Nova for the workaround backport:
https://bugzilla.redhat.com/show_bug.cgi?id=2042165 for OSP 17
https://bugzilla.redhat.com/show_bug.cgi?id=2042163 for OSP 16.2
https://bugzilla.redhat.com/show_bug.cgi?id=2042162 for OSP 16.1
We need to backport to all 3 releases, with separate bugs, in that order. I have already submitted the upstream backports and started the preemptive downstream backports for 17; I will do the 16.2 and 16.1 preemptive backports shortly. For regression reasons, we cannot release the fix on 16.1 until after it has merged in the 16.2 branch. As this is a latent issue, it does not qualify as a blocker for the next two OSP 16 z-stream releases; we can, however, potentially do a hotfix once the patches are merged across all 3 downstream branches.
*** Bug 1872937 has been marked as a duplicate of this bug. ***