Bug 1440462 - Connectivity issues after migrating an instance
Summary: Connectivity issues after migrating an instance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: zstream
Target Release: 8.0 (Liberty)
Assignee: Sahid Ferdjaoui
QA Contact: Gabriel Szasz
URL:
Whiteboard:
Depends On:
Blocks: 1381612 1494026 1494030 1494031 1494983
 
Reported: 2017-04-09 04:59 UTC by VIKRANT
Modified: 2021-12-16 04:25 UTC
CC List: 37 users

Fixed In Version: openstack-nova-12.0.6-20.el7ost
Doc Type: Bug Fix
Doc Text:
The Linux bridge installed for the `ovs-hybrid` VIF type should be configured to persistently retain the MAC address learned from the RARP packets that QEMU sends after starting on the destination node. This avoids breaking the datapath during a live migration, since at some point during the process the source and destination can be on the same L2 network and the destination bridge could otherwise learn the MAC from the source.
Clone Of:
: 1494026 (view as bug list)
Environment:
Last Closed: 2017-10-25 17:10:24 UTC
Target Upstream Version:
Embargoed:


Links
  OpenStack gerrit 501132 [MERGED]: ovs-hybrid: should permanently keep MAC entries (last updated 2021-02-06 08:49:59 UTC)
  Red Hat Bugzilla 1465492 [private] (last updated 2021-01-20 06:05:38 UTC)
  Red Hat Issue Tracker OSP-8568 (last updated 2021-11-29 02:28:00 UTC)
  Red Hat Knowledge Base (Solution) 2993141 (last updated 2017-04-09 05:05:49 UTC)
  Red Hat Product Errata RHBA-2017:3068 [SHIPPED_LIVE]: openstack-nova bug fix advisory (last updated 2017-10-25 21:05:11 UTC)

Internal Links: 1465492

Description VIKRANT 2017-04-09 04:59:13 UTC
Description of problem:
Connectivity issues after migrating an instance

Version-Release number of selected component (if applicable):

RHEL OSP 8

# awk '/neutron/ {print $1}' installed-rpms 
neutron-ml2-driver-apic-2015.2.5-95.el7.noarch
neutron-opflex-agent-2015.2.3-35.el7.noarch
openstack-neutron-7.2.0-5.el7ost.noarch
openstack-neutron-common-7.2.0-5.el7ost.noarch
openstack-neutron-ml2-7.2.0-5.el7ost.noarch
openstack-neutron-openvswitch-7.2.0-5.el7ost.noarch
python-neutron-7.2.0-5.el7ost.noarch
python-neutronclient-3.1.0-2.el7ost.noarch


How reproducible:
Intermittently at the customer site (55 compute nodes).

Steps to Reproduce:
- Before the live migration, the instance was pingable.
- Traffic was lost during the migration.
- Even after the instance was spawned successfully on the destination compute node, it remained unreachable for 5 minutes, which matches the 300 s default timeout for a Linux bridge forwarding entry.
- Traffic was reaching the qvb interface of the Linux bridge but was not reaching the tap interface.
- Security group rules do not appear to be an issue here.
- The Linux bridge MAC entries captured at the time of the issue indicate that a wrong port-to-MAC mapping was learned (see the example commands after this list).
- The entry newly populated after the old entry timed out picked the correct port for the MAC address.
- After that the instance was reachable.
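
For reference, two standard bridge-utils commands can be used to inspect the hybrid Linux bridge on the compute node (the bridge name below is a placeholder; substitute the qbr bridge of the affected port):

  brctl showmacs qbrXXXXXXXX-XX                   # learned MAC-to-port mappings on the bridge
  brctl showstp qbrXXXXXXXX-XX | grep -i ageing   # ageing time of learned entries (300 s by default)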

Actual results:
The Linux bridge on the destination compute node was showing a wrong mapping between port and MAC address.

Expected results:
The bridge should show the correct mapping right after the migration instead of only after the ageing timeout expires.

Additional info:

More information coming in next internal comments.

Comment 3 Assaf Muller 2017-04-09 11:57:14 UTC
This sounds similar to:
https://bugzilla.redhat.com/show_bug.cgi?id=1372384#c24

Is the customer using Emulex NICs?

Comment 8 VIKRANT 2017-04-25 04:20:32 UTC
Thanks Assaf.

Comment 12 Brian Haley 2017-04-26 21:47:51 UTC
I looked into this further and remembered the 'qbr' bridges are managed by the nova libvirt code.  After searching through the nova and os-vif repos I could not find a change related to a bug like this, so I'd like to re-assign this to the nova team to get some help with further debugging.

Comment 15 Sahid Ferdjaoui 2017-04-27 14:19:29 UTC
This is a known issue between Neutron and Nova: the RARP packets sent by QEMU are dropped because Nova starts the migration while the ports have not yet been properly tagged by the Neutron agent.

There is an initiative to fix this on the Neutron side by sending an event when the agent discovers the new port and finishes setting it up (tagging). But that only works for OVS [0]; the other mechanisms use a tap device that is created only when the migration occurs. The patch on the Nova side to wait for that event has been abandoned [1].
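
For context, the event-waiting mechanism referenced above is the same one behind the existing nova.conf options shown below (their upstream defaults); at this point it is only used when booting an instance, not on the live-migration path discussed here:

  [DEFAULT]
  # Nova waits for Neutron's "network-vif-plugged" event during instance boot;
  # if vif_plugging_is_fatal is true and the event does not arrive within the
  # timeout, the boot fails instead of proceeding with a half-configured port.
  vif_plugging_is_fatal = true
  vif_plugging_timeout = 300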

Two related issues:

 - https://bugzilla.redhat.com/show_bug.cgi?id=1259749 (OVS)
 - https://bugzilla.redhat.com/show_bug.cgi?id=1420587 (linuxbridge)

In bug 1420587, David proposed a workaround in QEMU by increasing the number of RARP packets sent:

  https://bugzilla.redhat.com/show_bug.cgi?id=1420587#c22

And it has been proposed upstream:

  http://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05586.html

[0] https://review.openstack.org/#/c/246898/
[1] https://review.openstack.org/#/c/246910/

Comment 16 VIKRANT 2017-04-30 01:49:48 UTC
Thanks Sahid, 

I skimmed through the Red Hat bugs you mentioned.

Bug 1420587, which is about linuxbridge as the mechanism driver, doesn't seem to apply here, as the customer is not using the linuxbridge mechanism driver.

I started reading the following bug and stumbled upon this comment; again it talks about ARP entries at the OVS bridge level, whereas in this case the customer is hitting the issue on the Linux bridge used for security groups.

https://bugzilla.redhat.com/show_bug.cgi?id=1259749#c27


What should we do to make further progress in this bug?

Comment 17 VIKRANT 2017-05-02 04:47:31 UTC
Hi Sahid,

Can you please let me know what next plan of action we should share with the customer?

Comment 18 Sahid Ferdjaoui 2017-05-02 13:58:24 UTC
Hello Vikrant, I'm not sure I understand why you think this issue is not related to 1420587 and 1259749. My thinking is that the problem occurs because the network is not set up by the time the guest CPUs start on the destination host.

Basically, virtio-net has a feature (see VIRTIO_NET_F_GUEST_ANNOUNCE) that makes the guest self-announce at 1 ms, 50 ms, 150 ms, 250 ms and 350 ms after it starts, but at that point we are still setting up the network.

So the VM starts at 2017-04-26 11:30:49.643+0000

2017-04-26 11:30:49.643+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2016-11-10-04:43:57, x86-034.build.eng.bos.redhat.com), qemu version: 2.6.0 (qemu-kvm-rhev-2.6.0-27.el7), hostname: cfs1pnc50.infra.es.iaas.igrupobbva
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00034699,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-168-instance-00034699/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu Broadwell,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+dca,+osxsave,+f16c,+rdrand,+arat,+tsc_adjust,+xsaveopt,+pdpe1gb,+abm,+rtm,+hle -m 8192 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 0ad52b3b-bac8-4659-bf95-ed14e8cb3045 -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=12.0.5-9.el7ost,serial=c928e5a8-3f47-4fa1-934a-90b9c1a009cc,uuid=0ad52b3b-bac8-4659-bf95-ed14e8cb3045,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-168-instance-00034699/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/0ad52b3b-bac8-4659-bf95-ed14e8cb3045/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=39,id=hostnet0,vhost=on,vhostfd=40 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:b5:c4:a9,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/0ad52b3b-bac8-4659-bf95-ed14e8cb3045/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:10 -k es -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
char device redirected to /dev/pts/16 (label charserial1)

Looking at /var/log/messages we can see how the network gets configured on the host:

[sferdjao@collab-shell log]$ grep cdd7b6e7 messages
Apr 26 13:30:32 cfs1pnc50 kernel: qbrcdd7b6e7-a1: port 2(tapcdd7b6e7-a1) entered disabled state
Apr 26 13:30:32 cfs1pnc50 kernel: device tapcdd7b6e7-a1 left promiscuous mode
Apr 26 13:30:32 cfs1pnc50 kernel: qbrcdd7b6e7-a1: port 2(tapcdd7b6e7-a1) entered disabled state
Apr 26 13:30:32 cfs1pnc50 lldpd[39294]: error while receiving frame on tapcdd7b6e7-a1: Network is down
Apr 26 13:30:32 cfs1pnc50 lldpd[39294]: removal request for tapcdd7b6e7-a1, but no knowledge of it
Apr 26 13:30:32 cfs1pnc50 kernel: qbrcdd7b6e7-a1: port 1(qvbcdd7b6e7-a1) entered disabled state
Apr 26 13:30:33 cfs1pnc50 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /bin/ovs-vsctl --timeout=120 -- --if-exists del-port br-int qvocdd7b6e7-a1
Apr 26 13:30:33 cfs1pnc50 lldpd[39294]: error while receiving frame on qvocdd7b6e7-a1: Network is down
Apr 26 13:30:33 cfs1pnc50 agent-ovs[39404]: [src/FSEndpointSource.cpp:448:deleted] Removed endpoint cdd7b6e7-a15b-4b7f-bf84-9cc4987ce3aa|fa-16-3e-b5-c4-a9 at "/var/lib/opflex-agent-ovs/endpoints/cdd7b6e7-a15b-4b7f-bf84-9cc4987ce3aa_fa:16:3e:b5:c4:a9.ep"
Apr 26 13:30:34 cfs1pnc50 ntpd[17859]: Deleting interface #150 qvbcdd7b6e7-a1, fe80::309d:91ff:fe4d:f1fc#123, interface stats: received=0, sent=0, dropped=0, active_time=1816762 secs
Apr 26 13:30:34 cfs1pnc50 ntpd[17859]: Deleting interface #149 qvocdd7b6e7-a1, fe80::606c:bff:feed:e1e2#123, interface stats: received=0, sent=0, dropped=0, active_time=1816762 secs
Apr 26 13:30:34 cfs1pnc50 ntpd[17859]: Deleting interface #148 tapcdd7b6e7-a1, fe80::fc16:3eff:feb5:c4a9#123, interface stats: received=0, sent=0, dropped=0, active_time=1816762 secs
Apr 26 13:30:48 cfs1pnc50 kernel: IPv6: ADDRCONF(NETDEV_UP): qvbcdd7b6e7-a1: link is not ready

VM STARTING and QEMU self-announce...

Apr 26 13:30:49 cfs1pnc50 kernel: device qvbcdd7b6e7-a1 entered promiscuous mode
Apr 26 13:30:49 cfs1pnc50 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): qvbcdd7b6e7-a1: link becomes ready
Apr 26 13:30:49 cfs1pnc50 kernel: device qvocdd7b6e7-a1 entered promiscuous mode
Apr 26 13:30:49 cfs1pnc50 kernel: qbrcdd7b6e7-a1: port 1(qvbcdd7b6e7-a1) entered forwarding state
Apr 26 13:30:49 cfs1pnc50 kernel: qbrcdd7b6e7-a1: port 1(qvbcdd7b6e7-a1) entered forwarding state

Apr 26 13:30:49 cfs1pnc50 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /bin/ovs-vsctl --timeout=120 -- --if-exists del-port qvocdd7b6e7-a1 -- add-port br-int qvocdd7b6e7-a1 -- set Interface qvocdd7b6e7-a1 external-ids:iface-id=cdd7b6e7-a15b-4b7f-bf84-9cc4987ce3aa external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:b5:c4:a9 external-ids:vm-uuid=0ad52b3b-bac8-4659-bf95-ed14e8cb3045

FINISHING network configuration... At this point 850 ms could have elapsed, so all the RARP/ARP announcements meant to teach the bridge have been lost...

Apr 26 13:30:49 cfs1pnc50 kernel: device tapcdd7b6e7-a1 entered promiscuous mode
Apr 26 13:30:49 cfs1pnc50 kernel: qbrcdd7b6e7-a1: port 2(tapcdd7b6e7-a1) entered forwarding state
Apr 26 13:30:49 cfs1pnc50 kernel: qbrcdd7b6e7-a1: port 2(tapcdd7b6e7-a1) entered forwarding state
Apr 26 13:30:50 cfs1pnc50 agent-ovs[39404]: [src/FSEndpointSource.cpp:430:updated] Updated endpoint Endpoint[uuid=cdd7b6e7-a15b-4b7f-bf84-9cc4987ce3aa|fa-16-3e-b5-c4-a9,ips=[192.168.10.135],ipAddressMappings=[192.168.10.135->10.48.232.101],eg=/PolicyUniverse/PolicySpace/_IaaS_S1P_Compute/GbpEpGroup/IaaS_S1P%7clbaas/,mac=fa:16:3e:b5:c4:a9,iface=qvocdd7b6e7-a1,dhcpv4 from "/var/lib/opflex-agent-ovs/endpoints/cdd7b6e7-a15b-4b7f-bf84-9cc4987ce3aa_fa:16:3e:b5:c4:a9.ep"
Apr 26 13:30:52 cfs1pnc50 ntpd[17859]: Listen normally on 533 qvbcdd7b6e7-a1 fe80::1c63:1bff:fe30:8841 UDP 123
Apr 26 13:30:52 cfs1pnc50 ntpd[17859]: Listen normally on 534 tapcdd7b6e7-a1 fe80::fc16:3eff:feb5:c4a9 UDP 123
Apr 26 13:30:52 cfs1pnc50 ntpd[17859]: Listen normally on 535 qvocdd7b6e7-a1 fe80::9c4c:93ff:fe6c:2d01 UDP 123
Apr 26 13:30:54 cfs1pnc50 agent-ovs[39404]: [src/FSEndpointSource.cpp:430:updated] Updated endpoint Endpoint[uuid=cdd7b6e7-a15b-4b7f-bf84-9cc4987ce3aa|fa-16-3e-b5-c4-a9,ips=[192.168.10.135],ipAddressMappings=[192.168.10.135->10.48.232.101],eg=/PolicyUniverse/PolicySpace/_IaaS_S1P_Compute/GbpEpGroup/IaaS_S1P%7clbaas/,mac=fa:16:3e:b5:c4:a9,iface=qvocdd7b6e7-a1,dhcpv4 from "/var/lib/opflex-agent-ovs/endpoints/cdd7b6e7-a15b-4b7f-bf84-9cc4987ce3aa_fa:16:3e:b5:c4:a9.ep"

To conclude, I think 850 ms is too short a time to consider the network fully configured. Basically, all of that setup work should have been done before we start the migration.

One solution is the QEMU hack discussed in comment 15; there is also work in QEMU to provide a guest announce API [0], which we could try to trigger at post-live-migration time. I am continuing to investigate on the Neutron/Nova side to work around the issue without needing a QEMU change.

[0] http://lists.gnu.org/archive/html/qemu-devel/2017-05/msg00137.html
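
For illustration of the guest announce API mentioned in [0]: in much later upstream QEMU this was exposed as the 'announce-self' QMP command (it is not available in the qemu-kvm-rhev 2.6 build used here), and triggering a fresh round of self-announcements from the host would look roughly like:

  virsh qemu-monitor-command instance-00034699 --pretty '{"execute": "announce-self"}'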

Comment 19 Sahid Ferdjaoui 2017-05-02 16:41:38 UTC
I discussed this with David Gilbert, who suggests running tcpdump on the destination for ARPs/RARPs before the migration starts. That way we should be able to see the packets that QEMU emits, trace them along the network, and see where they disappear.
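
A capture along those lines, using the interface names and MAC address from the logs in comment 18 (substitute those of the instance being migrated), could look like:

  # start captures on the tap/qvb/qvo interfaces of the instance on the destination node
  for i in tapcdd7b6e7-a1 qvbcdd7b6e7-a1 qvocdd7b6e7-a1; do
      tcpdump -i "$i" '(rarp or arp) and ether host fa:16:3e:b5:c4:a9' -w "/tmp/$i.pcap" &
  done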

Comment 20 VIKRANT 2017-05-03 04:36:26 UTC
Many thanks Sahid for the detailed description in comment 18. I am just restating what you said to make sure I have understood it correctly.

- After the migration, the VM will self-announce the MAC and port mapping once the network is set up. The VM will try to announce it at different intervals: 1 ms, 50 ms, 150 ms, 250 ms and 350 ms.

- But in this case the network setup takes around 850 ms, so all the announcement intervals (1 ms, 50 ms, 150 ms, 250 ms and 350 ms) are missed and the Linux bridge never refreshes its port and MAC mapping. Only once the old Linux bridge entry expires (5 min) does the new MAC and port mapping appear.

- By increasing the announce time to 12 s as mentioned in [1], we should be able to circumvent this issue, because by then the VM's network will be configured.


Sure, we can take the tcpdump, but can you please let me know on which interfaces of the destination node I need to capture? I am just being a bit cautious because the issue is not easily reproducible. The customer earlier tried to capture the ICMP traffic on the tap, qvb and qvo interfaces of the migrated instance while they were unable to ping it.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1420587#c22

Comment 21 Sahid Ferdjaoui 2017-05-03 09:28:02 UTC
(In reply to VIKRANT from comment #20)
> Many thanks Sahid for the detailed description in comment 18. I am just
> restating what you said to make sure I have understood it correctly.
> 
> - After the migration, the VM will self-announce the MAC and port mapping
> once the network is set up. The VM will try to announce it at different
> intervals: 1 ms, 50 ms, 150 ms, 250 ms and 350 ms.
>
> - But in this case the network setup takes around 850 ms, so all the
> announcement intervals (1 ms, 50 ms, 150 ms, 250 ms and 350 ms) are missed
> and the Linux bridge never refreshes its port and MAC mapping. Only once
> the old Linux bridge entry expires (5 min) does the new MAC and port
> mapping appear.

Yes, except that the self-announce does not announce a port mapping. It is a set of RARP requests broadcast on the L2 segment; the bridge receives those packets on only one port and updates its forwarding (MAC learning) table accordingly.
 
> - By increasing the announce time to 12 s as mentioned in [1], we should be
> able to circumvent this issue, because by then the VM's network will be
> configured.
> 
> Sure, we can take the tcpdump, but can you please let me know on which
> interfaces of the destination node I need to capture?

Well, I'd say a tcpdump on any of those interfaces, filtering on the VM MAC address and ARP/RARP.

> I am just being a bit
> cautious because the issue is not easily reproducible. The customer earlier
> tried to capture the ICMP traffic on the tap, qvb and qvo interfaces of the
> migrated instance while they were unable to ping it.

Yes, and having the system under load will probably add delay, which should help reproduce the issue.

I captured the packets:

 4 packets sent at 0ms
 4 packets sent at 150ms
 4 packets sent at 400ms
 4 packets sent at 750ms

> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1420587#c22

Comment 98 Sahid Ferdjaoui 2017-09-12 07:34:05 UTC
OK, after further analysis it seems that at some point during the live migration the same MAC address is associated with two different Linux bridges (source and destination) on the same L2 segment. As a result, the learning table of the destination bridge can decide that the MAC is reachable through the uplink. Setting the ageing to zero makes the MAC entry in the table persistent and avoids any override of the learning table.

The fix pushed upstream seems to be in good shape to be accepted. It is in os-vif but can easily be backported to OSP versions that do not use that library.

  https://review.openstack.org/#/c/501132/

We will probably have to consider backporting it to OSP 6, 7, 9 and 10, and thus clone this issue.
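
In effect the change makes the hybrid bridge stop expiring learned entries. Done by hand on the compute node it would look like the commands below (bridge name taken from the logs above; the actual fix applies this from the VIF plugging code rather than manually):

  brctl setageing qbrcdd7b6e7-a1 0    # ageing of 0 makes learned MAC entries permanent
  brctl showmacs qbrcdd7b6e7-a1       # verify the entry for fa:16:3e:b5:c4:a9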

Comment 120 errata-xmlrpc 2017-10-25 17:10:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3068

