Bug 1903653 - Instance live-migration observes ping lost OVN
Summary: Instance live-migration observes ping lost OVN
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Terry Wilson
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 2012179 2076356 2104522 2111956
Blocks: 1823988
 
Reported: 2020-12-02 15:10 UTC by Ivan Richart
Modified: 2022-08-04 13:30 UTC
CC List: 25 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-18 22:18:22 UTC
Target Upstream Version:
Embargoed:
twilson: needinfo-
twilson: needinfo-
twilson: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1933517 0 None None None 2021-06-24 15:45:44 UTC
OpenStack gerrit 602432 0 None MERGED libvirt: Delegate OVS plug to os-vif 2021-06-18 17:53:01 UTC
OpenStack gerrit 741529 0 None NEW [PoC] libvirt: Send announce-self post live migration 2021-11-17 18:41:02 UTC
OpenStack gerrit 797142 0 None NEW objects: Fix VIFMigrateData.supports_os_vif_delegation setter 2021-06-18 17:53:01 UTC
Red Hat Bugzilla 2024300 1 unspecified CLOSED [RFE] expose public interface to trigger qemu announce-self 2024-01-21 04:25:48 UTC
Red Hat Issue Tracker OSP-1421 0 None None None 2021-11-10 23:44:15 UTC

Internal Links: 2024300

Description Ivan Richart 2020-12-02 15:10:51 UTC
Description of problem:

The customer is live-migrating an instance with the command "openstack server migrate --live-migration --host <host>" and observes a burst of lost packets (measured with ping) during the migration process.
The migration itself ends successfully.
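
For context, a minimal way to observe the loss from outside the cloud (a sketch only; the IP address and destination host below are illustrative, not taken from the customer environment):

# terminal 1: continuous ping against the instance's address, with timestamps
ping -D -i 0.2 192.0.2.10 | tee /tmp/ping-during-migration.log

# terminal 2: trigger the live migration (same command shape the customer used)
openstack server migrate --live-migration --host <destination-compute> cf6dc934-4cd6-4195-8a81-54201bb2c4b7

# when the migration finishes, gaps in icmp_seq in terminal 1 show the outage window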

Version-Release number of selected component (if applicable):


  Distro:   [redhat-release] Red Hat Enterprise Linux release 8.2 (Ootpa)
            [rhosp-release] Red Hat OpenStack Platform release 16.1.0 GA (Train)
            [os-release] Red Hat Enterprise Linux 8.2 (Ootpa) 8.2 (Ootpa)


network-scripts-openvswitch2.13-2.13.0-39.el8fdp.x86_64     Wed Jul 22 11:38:29 2020
openvswitch2.13-2.13.0-39.el8fdp.x86_64                     Wed Jul 22 11:39:05 2020
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch       Wed Jul 22 11:38:33 2020
rhosp-openvswitch-2.13-8.el8ost.noarch                      Wed Jul 22 11:39:46 2020
puppet-nova-15.5.1-0.20200608173427.c15e37c.el8ost.noarch   Wed Jul 22 11:39:28 2020
python3-novaclient-15.1.0-0.20200310162625.cd396b8.el8ost.noarch Wed Jul 22 11:34:50 2020
puppet-neutron-15.5.1-0.20200514103419.0a45ec7.el8ost.noarch Wed Jul 22 11:39:29 2020
python3-neutronclient-6.14.0-0.20200310192910.115f60f.el8ost.noarch Wed Jul 22 11:35:09 2020



How reproducible:

Create a new instance and live-migrate it from one compute node to another


Steps to Reproduce:
1. Create instance
2. Migrate instance
3. Observe packet loss during the migration
4. Instance is migrated

Actual results:

Instance is migrated, but packets are lost during the migration

Expected results:

Instance is migrated without packet loss

Additional info:

instance ID : cf6dc934-4cd6-4195-8a81-54201bb2c4b7
request : req-f5910bd3-2063-464b-8c1d-c00f8d4e423c 

/ovn_controller.log

2020-12-02T10:47:55.031474209+01:00 stderr F 2020-12-02T09:47:55Z|00221|binding|INFO|Dropped 22 log messages in last 18 seconds (most recently, 18 seconds ago) due to excessive rate
2020-12-02T10:47:55.031554442+01:00 stderr F 2020-12-02T09:47:55Z|00222|binding|INFO|Not claiming lport 428caa60-f0cc-4b3e-878f-f048b28173ba, chassis xxxx requested-chassis <hostname>.<domain>

ovs-vswitchd.log

2020-12-02T02:22:01.692Z|00212|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-12-02T09:45:20.809Z|00213|bridge|INFO|bridge br-int: added interface tap428caa60-f0 on port 37
2020-12-02T09:45:20.818Z|00214|bridge|INFO|bridge br-int: added interface patch-br-int-to-provnet-ad580e33-e2a2-4f33-85f3-3b699b158e88 on port 38
2020-12-02T09:45:20.818Z|00215|bridge|INFO|bridge br-ex: added interface patch-provnet-ad580e33-e2a2-4f33-85f3-3b699b158e88-to-br-int on port 12
2020-12-02T09:45:30.821Z|00216|connmgr|INFO|br-int<->unix#0: 365 flow_mods 10 s ago (365 adds)
2020-12-02T09:46:30.821Z|00217|connmgr|INFO|br-int<->unix#0: 6 flow_mods 59 s ago (6 adds)
2020-12-02T09:47:58.869Z|00218|bridge|INFO|bridge br-int: added interface tap28039bf1-30 on port 39
2020-12-02T09:48:05.040Z|00219|connmgr|INFO|br-int<->unix#0: 23 flow_mods in the 4 s starting 10 s ago (20 adds, 2 deletes, 1 modifications)
2020-12-02T09:53:55.995Z|00220|timeval|WARN|Unreasonably long 1485ms poll interval (1ms user, 195ms system)
2020-12-02T09:53:55.995Z|00221|timeval|WARN|context switches: 1 voluntary, 0 involuntary

ovs-vswitchd.log

2020-12-02T02:43:01.816Z|00200|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2020-12-02T09:25:54.293Z|00201|bridge|INFO|bridge br-int: added interface tap428caa60-f0 on port 34
2020-12-02T09:25:54.305Z|00202|bridge|INFO|bridge br-int: added interface patch-br-int-to-provnet-ad580e33-e2a2-4f33-85f3-3b699b158e88 on port 35
2020-12-02T09:25:54.305Z|00203|bridge|INFO|bridge br-ex: added interface patch-provnet-ad580e33-e2a2-4f33-85f3-3b699b158e88-to-br-int on port 11
2020-12-02T09:25:55.100Z|00204|bridge|INFO|bridge br-int: added interface tap28039bf1-30 on port 36
2020-12-02T09:26:04.300Z|00205|connmgr|INFO|br-int<->unix#0: 387 flow_mods in the 1 s starting 10 s ago (386 adds, 1 modifications)
2020-12-02T09:31:29.732Z|00206|timeval|WARN|Unreasonably long 3136ms poll interval (0ms user, 124ms system)
2020-12-02T09:31:29.732Z|00207|timeval|WARN|context switches: 2 voluntary, 0 involuntary

nova-compute.log

2020-12-02 10:47:59.619 8 DEBUG nova.network.neutronv2.api [req-f5910bd3-2063-464b-8c1d-c00f8d4e423c 967706eb9b764d738d231e6c018c6fc8 f131d5d7dd2b4bb78179fef91daf95f8 - default default] Deleted binding for port 428caa60-f0cc-4b3e-878f-f048b28173ba and host <hostname>.<domain>. delete_port_binding /usr/lib/python3.6/site-packages/nova/network/neutronv2/api.py:1341
2020-12-02 10:48:00.543 8 DEBUG nova.compute.manager [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] Received event network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba external_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:9272
2020-12-02 10:48:00.543 8 DEBUG oslo_concurrency.lockutils [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] Lock "cf6dc934-4cd6-4195-8a81-54201bb2c4b7-events" acquired by "nova.compute.manager.InstanceEvents.pop_instance_event.<locals>._pop_event" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2020-12-02 10:48:00.543 8 DEBUG oslo_concurrency.lockutils [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] Lock "cf6dc934-4cd6-4195-8a81-54201bb2c4b7-events" released by "nova.compute.manager.InstanceEvents.pop_instance_event.<locals>._pop_event" :: held 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2020-12-02 10:48:00.544 8 DEBUG nova.compute.manager [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] No waiting events found dispatching network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba pop_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:361
2020-12-02 10:48:00.544 8 WARNING nova.compute.manager [req-53dc026d-dcaf-47ad-b967-e0169c83a340 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] Received unexpected event network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba for instance with vm_state active and task_state None.
2020-12-02 10:48:02.601 8 DEBUG nova.compute.manager [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] Received event network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba external_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:9272
2020-12-02 10:48:02.602 8 DEBUG oslo_concurrency.lockutils [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] Lock "cf6dc934-4cd6-4195-8a81-54201bb2c4b7-events" acquired by "nova.compute.manager.InstanceEvents.pop_instance_event.<locals>._pop_event" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2020-12-02 10:48:02.602 8 DEBUG oslo_concurrency.lockutils [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] Lock "cf6dc934-4cd6-4195-8a81-54201bb2c4b7-events" released by "nova.compute.manager.InstanceEvents.pop_instance_event.<locals>._pop_event" :: held 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2020-12-02 10:48:02.602 8 DEBUG nova.compute.manager [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] No waiting events found dispatching network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba pop_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:361
2020-12-02 10:48:02.603 8 WARNING nova.compute.manager [req-ce71e2ec-4625-4c7a-be27-5849e945b292 3d8b5ccbccc148a6a5ea6ccdd3b3f778 9d8322178b4d4d0093297a01615857a5 - default default] [instance: cf6dc934-4cd6-4195-8a81-54201bb2c4b7] Received unexpected event network-vif-plugged-428caa60-f0cc-4b3e-878f-f048b28173ba for instance with vm_state active and task_state None.
2020-12-02 10:48:02.871 8 DEBUG oslo_service.periodic_task [req-af3534f6-3452-4cdf-91c4-4356e6c7b13c - - - - -] Running periodic task ComputeManager._poll_rebooting_instances run_periodic_tasks /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:217

/var/log/messages

Dec  2 10:45:20 <hostname> ovs-vsctl[62116]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --if-exists del-port tap428caa60-f0 -- add-port br-int tap428caa60-f0 -- set Interface tap428caa60-f0 "external-ids:attached-mac=\"fa:16:3e:b8:6a:9f\"" -- set Interface tap428caa60-f0 "external-ids:iface-id=\"428caa60-f0cc-4b3e-878f-f048b28173ba\"" -- set Interface tap428caa60-f0 "external-ids:vm-id=\"cf6dc934-4cd6-4195-8a81-54201bb2c4b7\"" -- set Interface tap428caa60-f0 external-ids:iface-status=active
Dec  2 10:45:20 <hostname> NetworkManager[8361]: <info>  [1606902320.8099] device (tap428caa60-f0): enslaved to non-master-type device ovs-system; ignoring
Dec  2 10:45:20 <hostname> kernel: device tap428caa60-f0 entered promiscuous mode
Dec  2 10:45:20 <hostname> NetworkManager[8361]: <info>  [1606902320.8125] device (tap428caa60-f0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Dec  2 10:45:20 <hostname> NetworkManager[8361]: <info>  [1606902320.8135] device (tap428caa60-f0): enslaved to non-master-type device ovs-system; ignoring
Dec  2 10:45:20 <hostname> NetworkManager[8361]: <info>  [1606902320.8137] device (tap428caa60-f0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')


/var/log/messages


Dec  2 10:25:54 <hostname> systemd[1]: Starting nova_libvirt healthcheck...
Dec  2 10:25:54 <hostname> systemd[1]: Starting ovn_metadata_agent healthcheck...
Dec  2 10:25:54 <hostname> NetworkManager[6738]: <info>  [1606901154.2830] manager: (tap428caa60-f0): new Tun device (/org/freedesktop/NetworkManager/Devices/43)
Dec  2 10:25:54 <hostname> systemd-udevd[43585]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec  2 10:25:54 <hostname> ovs-vsctl[43582]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --if-exists del-port tap428caa60-f0 -- add-port br-int tap428caa60-f0 -- set Interface tap428caa60-f0 "external-ids:attached-mac=\"fa:16:3e:b8:6a:9f\"" -- set Interface tap428caa60-f0 "external-ids:iface-id=\"428caa60-f0cc-4b3e-878f-f048b28173ba\"" -- set Interface tap428caa60-f0 "external-ids:vm-id=\"cf6dc934-4cd6-4195-8a81-54201bb2c4b7\"" -- set Interface tap428caa60-f0 external-ids:iface-status=active
Dec  2 10:25:54 <hostname> kernel: device tap428caa60-f0 entered promiscuous mode
Dec  2 10:25:54 <hostname> NetworkManager[6738]: <info>  [1606901154.2937] device (tap428caa60-f0): enslaved to non-master-type device ovs-system; ignoring
Dec  2 10:25:54 <hostname> NetworkManager[6738]: <info>  [1606901154.2981] device (tap428caa60-f0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Dec  2 10:25:54 <hostname> NetworkManager[6738]: <info>  [1606901154.2990] device (tap428caa60-f0): enslaved to non-master-type device ovs-system; ignoring
Dec  2 10:25:54 <hostname> NetworkManager[6738]: <info>  [1606901154.2991] device (tap428caa60-f0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')

Comment 1 Jakub Libosvar 2020-12-08 16:40:48 UTC
How long is the downtime? As per the OSP documentation, there is no guarantee of zero packet loss, and some downtime is expected.
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/instances_and_images_guide/migrating-virtual-machines-between-compute-nodes#migration-types

You can try tuning the following nova parameters:

live_migration_permit_post_copy=true
live_migration_timeout_action=force_complete
live_migration_permit_auto_converge=true

You will have to set these on all controllers in /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf
and restart the nova services.
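
As a rough sketch (assuming a containerized 16.1 deployment; verify the config section and restart procedure against your environment), these options belong in the [libvirt] section of that file:

# /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf
[libvirt]
live_migration_permit_post_copy = true
live_migration_timeout_action = force_complete
live_migration_permit_auto_converge = true

# then restart the relevant nova container(s) on the node, for example:
sudo podman restart nova_compute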

Please let me know if the provided steps helped in any way, what the customer's expectations are, and what downtime they observe.

Comment 57 Stephen Finucane 2021-06-18 17:49:27 UTC
To summarise where we are at the moment, the change in behavior we're seeing here is due to the change in networking backend and default plugging behavior between OSP 13 and OSP 16.x. In OSP 13, we used ML2/OVS with hybrid plug, meaning iptables was used for enforcing firewalling and security groups (a VM would be connected to the vSwitch through a Linux bridge). OSP 16 uses ML2/OVN, which does not use hybrid plug (the VM is connected directly to the vSwitch). With hybrid plug, libvirt did not create the ports that are actually attached to the OVS dataplane; instead, this was done by os-vif in pre-live-migration. In 16.x, the port that libvirt creates is attached directly to the OVS bridge, and OVN observes this and creates the necessary flow rules. However, this port creation only happens when the instance is created on the destination host, right before it is resumed, which leaves a much shorter interval in which to install the OpenFlow rules. This matters because, on resume, QEMU sends MAC learning frames (reverse ARP packets). Because OVN has not yet installed the flow rules, these are dropped.

With the patch that has been referenced previously, we're changing the libvirt interface type from bridge (a TAP device wired up to an actual OVS or Linux bridge) to ethernet (a generic, free-floating TAP device). os-vif creates the port in pre-live-migrate (rather than during the migration), which means that, in theory, OVN can create the OpenFlow rules sooner. However, the open question is whether creating the OVS port earlier is sufficient, or whether we need an actual port on the dataplane. From talking to the engineers on the networking team, it appears it is not sufficient, because the flow rules depend on having the real port. This patch will help slightly, but it won't resolve the issue fully.
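
One rough, unofficial way to see that timing window in practice is to watch, on the destination host while the migration is in flight, for the moment OVN installs flows matching the instance's MAC (the MAC below is taken from the logs in this bug):

# on the destination compute node, during the migration
watch -n 1 "sudo ovs-ofctl dump-flows br-int | grep -c fa:16:3e:b8:6a:9f"

# the RARP frames QEMU sends on resume only reach the network once these flows exist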

We see four options:

1. Provide a way to revert to hybrid plugging with ML2/OVN. This will have performance implications.
2. Revert to using ML2/OVS. We're not sure if this will be maintainable long term.
3. See if os-vif can precreate the TAP device. We're not sure this will be backportable.
4. Modify OVN to not request the ofport in its ingress pipeline.

We will discuss these options as a team and continue working on the backport in parallel.

Comment 91 Artom Lifshitz 2021-11-24 16:22:37 UTC
This BZ is tracking Nova's QMP workaround. For QE, a simple smoke test that continuously pings the instance being live-migrated and compares the ping loss before and after the fix should be sufficient; a sketch follows below.
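
A minimal sketch of such a smoke test (the IP address and deadline are illustrative; run it once before and once after the fix and compare the reported loss):

# background ping with a deadline long enough to cover the migration
ping -i 0.2 -w 120 192.0.2.10 > /tmp/lm-ping.log &

# live-migrate the instance while the ping runs, then confirm it moved
openstack server migrate --live-migration --host <destination-compute> <instance-uuid>
openstack server show <instance-uuid> -f value -c OS-EXT-SRV-ATTR:host

# once the deadline expires, compare the summary line between the two runs
wait
grep 'packet loss' /tmp/lm-ping.log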

Comment 96 Artom Lifshitz 2022-01-18 21:52:06 UTC
Filed https://bugzilla.redhat.com/show_bug.cgi?id=2042162 for the Nova tracker.

Comment 97 smooney 2022-01-18 22:18:22 UTC
Closing this as CANTFIX, as Neutron cannot fix the lack of live-migration support in OVN in 16.1.

I have created 3 new bugs in nova for the workaround backports:

https://bugzilla.redhat.com/show_bug.cgi?id=2042165 is for osp 17
https://bugzilla.redhat.com/show_bug.cgi?id=2042163 is for osp 16.2
and
https://bugzilla.redhat.com/show_bug.cgi?id=2042162 is for osp 16.1

We need to backport to all 3 releases, with separate bugs, in that order.

I have already submitted the upstream backports and started preemptive downstream backports
for 17.

I will do the 16.2 and 16.1 preemptive backports shortly.
To avoid regressions, we cannot release the fix on 16.1 until after it is merged in the 16.2 branch.

This is a latent issue, so it does not qualify as a blocker for the next two OSP 16 z-stream releases.
We can, however, potentially do a hotfix once the patches are merged across all 3 downstream branches.
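
For reference, the workaround in question makes Nova's libvirt driver ask QEMU (over QMP) to re-announce the guest's MAC (announce-self) once the live migration completes. When the backports land, it is expected to be enabled via a nova.conf option on the compute nodes roughly as follows (the exact option name should be confirmed against the release notes of the fixed builds):

[workarounds]
# assumed option name; verify against the nova build that carries the backport
enable_qemu_monitor_announce_self = true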

Comment 101 Rodolfo Alonso 2022-08-04 13:30:29 UTC
*** Bug 1872937 has been marked as a duplicate of this bug. ***

