Bug 1976578 - IP connectivity is lost after migration (with multus)
Summary: IP connectivity is lost after migration (with multus)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 4.8.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: future
Assignee: Edward Haas
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-27 11:50 UTC by awax
Modified: 2023-06-29 12:29 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-29 12:29:03 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
migration_vmb_new.yaml (2.12 KB, text/plain)
2021-06-27 11:50 UTC, awax
no flags Details
migration_vma_new.yaml (2.12 KB, text/plain)
2021-06-27 11:54 UTC, awax
no flags Details
migration_ssh_service_for_vmb.yaml (250 bytes, text/plain)
2021-06-27 11:55 UTC, awax
no flags Details
migration_ssh_service_for_vma.yaml (250 bytes, text/plain)
2021-06-27 11:55 UTC, awax
no flags Details
migration_nad_new.yaml (656 bytes, text/plain)
2021-06-27 11:56 UTC, awax
no flags Details
migration_nncp_1.yaml (480 bytes, text/plain)
2021-06-27 11:56 UTC, awax
no flags Details
migration_virtualmachineinstancemigration.yaml (142 bytes, text/plain)
2021-06-27 11:58 UTC, awax
no flags Details
tcpdump_log.log (265.42 KB, text/plain)
2021-06-27 11:59 UTC, awax
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CNV-12654 0 None None None 2022-12-15 08:39:35 UTC
Red Hat Issue Tracker CNV-17884 0 None None None 2022-12-15 08:39:37 UTC

Description awax 2021-06-27 11:50:30 UTC
Created attachment 1795082 [details]
migration_vmb_new.yaml

Description of problem:
A migrated VM takes a long time (between 10 and 60 seconds) to regain connectivity.
When pinging over a secondary interface from the migrated VM (with multus) to another VM (with multus) in the same cluster, packets are lost (with 'Destination Host Unreachable') during this period.


Version-Release number of selected component (if applicable):
CNV v.4.8.0
OCP v.4.8.0-fc.5
Kubernetes Version: v1.21.0-rc.0+88a3e8c


How reproducible:
Not always. I could not find a pattern that explains when it reproduces.


Steps to Reproduce:
1. Create a dedicated namespace for the resources that will be created in the next steps. Name it "anat-test-migration" to match the namespace defined in the attached files.
2. Create the bridge (use the attached 'migration_nncp_1.yaml' and 'migration_nncp_2.yaml' files; make sure to change the node selector to match your cluster nodes).
3. Create the NAD (use the attached 'migration_nad_new.yaml' file).
4. Create vma and vmb (use the attached 'migration_vma_new.yaml' and 'migration_vmb_new.yaml' files).
5. Start both VMs:
$ virtctl start vma
$ virtctl start vmb
6. Expose services to allow SSH connections to both VMs (use the attached 'migration_ssh_service_for_vma.yaml' and 'migration_ssh_service_for_vmb.yaml' files).
7. Migrate vmb (use the attached 'migration_virtualmachineinstancemigration.yaml' file; illustrative sketches of the manifests follow this list).
8. Connect to vmb as soon as the migration finishes. To catch the exact moment, watch for the VMI to be assigned a new IP address:
$ oc get vmi -w
9. Ping from vmb to vma over the secondary interface (bridge):
 - Enter vmb through SSH (the IP is that of the node on which vmb is running; '-p' is the port of vmb's service, which can be found with 'oc get service'):
$ ssh fedora@192.168.2.83 -p 30401
 - Ping vma:
$ ping 10.200.0.1

* In order to reproduce, perform steps 8 and 9 as close to the end of the migration as possible.
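
For reference, minimal sketches of the bridge policy from step 2 and the migration object from step 7 are shown below. These are illustrative only and may differ from the attached files: the bridge name (br100test) and the policy name are taken from the data in comment 22, the uplink NIC name (ens8) and the migration object name are assumptions, and the apiVersion may vary with the cluster version.

apiVersion: nmstate.io/v1beta1          # may be nmstate.io/v1 on newer clusters
kind: NodeNetworkConfigurationPolicy
metadata:
  name: migration-worker-1
spec:
  nodeSelector:
    kubernetes.io/hostname: <worker node name>
  desiredState:
    interfaces:
    - name: br100test                   # linux bridge backing the secondary network
      type: linux-bridge
      state: up
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens8                    # assumed uplink NIC name

apiVersion: kubevirt.io/v1              # may be kubevirt.io/v1alpha3 on older clusters
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-vmb                   # assumed name
  namespace: anat-test-migration
spec:
  vmiName: vmb

Each manifest is applied with 'oc apply -f <file>' (the migration object in the "anat-test-migration" namespace).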



Actual results:
When the bug occurs:
[fedora@vmb ~]$ ping 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.22 icmp_seq=10 Destination Host Unreachable
From 10.200.0.22 icmp_seq=11 Destination Host Unreachable
From 10.200.0.22 icmp_seq=12 Destination Host Unreachable
From 10.200.0.22 icmp_seq=13 Destination Host Unreachable
From 10.200.0.22 icmp_seq=14 Destination Host Unreachable
From 10.200.0.22 icmp_seq=15 Destination Host Unreachable
From 10.200.0.22 icmp_seq=16 Destination Host Unreachable
From 10.200.0.22 icmp_seq=17 Destination Host Unreachable
From 10.200.0.22 icmp_seq=18 Destination Host Unreachable
From 10.200.0.22 icmp_seq=19 Destination Host Unreachable
From 10.200.0.22 icmp_seq=20 Destination Host Unreachable
From 10.200.0.22 icmp_seq=21 Destination Host Unreachable
From 10.200.0.22 icmp_seq=22 Destination Host Unreachable
From 10.200.0.22 icmp_seq=23 Destination Host Unreachable
From 10.200.0.22 icmp_seq=24 Destination Host Unreachable
From 10.200.0.22 icmp_seq=25 Destination Host Unreachable
From 10.200.0.22 icmp_seq=26 Destination Host Unreachable
From 10.200.0.22 icmp_seq=27 Destination Host Unreachable
From 10.200.0.22 icmp_seq=28 Destination Host Unreachable
From 10.200.0.22 icmp_seq=29 Destination Host Unreachable
From 10.200.0.22 icmp_seq=30 Destination Host Unreachable
From 10.200.0.22 icmp_seq=31 Destination Host Unreachable
From 10.200.0.22 icmp_seq=32 Destination Host Unreachable
From 10.200.0.22 icmp_seq=33 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=35 ttl=64 time=3.93 ms
64 bytes from 10.200.0.1: icmp_seq=34 ttl=64 time=1028 ms
64 bytes from 10.200.0.1: icmp_seq=36 ttl=64 time=1.36 ms
64 bytes from 10.200.0.1: icmp_seq=37 ttl=64 time=0.962 ms
64 bytes from 10.200.0.1: icmp_seq=38 ttl=64 time=1.30 ms
^C
--- 10.200.0.1 ping statistics ---
38 packets transmitted, 5 received, +24 errors, 86.8421% packet loss, time 37808ms
rtt min/avg/max/mdev = 0.962/207.169/1028.296/410.564 ms, pipe 4



Expected results:
No packet loss.



Additional info:
A tcpdump of the secondary interface of the migrated VM (vmb) is attached. Steps to produce it:
1. SSH to vmb:
$ ssh fedora@192.168.2.83 -p 30401
2. Run tcpdump:
$ sudo tcpdump -i eth1 -xx >~/tcpdump_log.log

Comment 1 awax 2021-06-27 11:54:05 UTC
Created attachment 1795083 [details]
migration_vma_new.yaml

Comment 2 awax 2021-06-27 11:55:08 UTC
Created attachment 1795093 [details]
migration_ssh_service_for_vmb.yaml

Comment 3 awax 2021-06-27 11:55:36 UTC
Created attachment 1795094 [details]
migration_ssh_service_for_vma.yaml

Comment 4 awax 2021-06-27 11:56:04 UTC
Created attachment 1795095 [details]
migration_nad_new.yaml

Comment 5 awax 2021-06-27 11:56:42 UTC
Created attachment 1795096 [details]
migration_nncp_1.yaml

Comment 7 awax 2021-06-27 11:58:25 UTC
Created attachment 1795098 [details]
migration_virtualmachineinstancemigration.yaml

Comment 8 awax 2021-06-27 11:59:23 UTC
Created attachment 1795099 [details]
tcpdump_log.log

Comment 9 Petr Horáček 2021-06-28 07:52:36 UTC
Anat, could you confirm whether this is a regression introduced in 4.8? We recommend live-migrating over the bridge network to ensure there is no connectivity drop. It is alarming that we are seeing this issue.

Comment 10 Edward Haas 2021-06-28 08:08:50 UTC
Who owns the address `10.200.0.22` ?

Comment 11 Edward Haas 2021-06-28 08:21:02 UTC
Could you also try to recreate this when KMP is disabled for these VMI/s?

Comment 12 awax 2021-06-29 11:11:42 UTC
(In reply to Edward Haas from comment #10)
> Who owns the address `10.200.0.22` ?

eth1 - see the recreation below (different cluster, so the IP is different as well):

[fedora@vmb ~]$ ping 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.2 icmp_seq=10 Destination Host Unreachable
From 10.200.0.2 icmp_seq=11 Destination Host Unreachable
From 10.200.0.2 icmp_seq=12 Destination Host Unreachable
From 10.200.0.2 icmp_seq=13 Destination Host Unreachable
From 10.200.0.2 icmp_seq=14 Destination Host Unreachable
From 10.200.0.2 icmp_seq=15 Destination Host Unreachable
From 10.200.0.2 icmp_seq=16 Destination Host Unreachable
From 10.200.0.2 icmp_seq=17 Destination Host Unreachable
From 10.200.0.2 icmp_seq=18 Destination Host Unreachable
From 10.200.0.2 icmp_seq=19 Destination Host Unreachable
From 10.200.0.2 icmp_seq=20 Destination Host Unreachable
From 10.200.0.2 icmp_seq=21 Destination Host Unreachable
From 10.200.0.2 icmp_seq=22 Destination Host Unreachable
From 10.200.0.2 icmp_seq=23 Destination Host Unreachable
From 10.200.0.2 icmp_seq=24 Destination Host Unreachable
From 10.200.0.2 icmp_seq=25 Destination Host Unreachable
From 10.200.0.2 icmp_seq=26 Destination Host Unreachable
From 10.200.0.2 icmp_seq=27 Destination Host Unreachable
From 10.200.0.2 icmp_seq=28 Destination Host Unreachable
From 10.200.0.2 icmp_seq=29 Destination Host Unreachable
From 10.200.0.2 icmp_seq=30 Destination Host Unreachable
From 10.200.0.2 icmp_seq=31 Destination Host Unreachable
From 10.200.0.2 icmp_seq=32 Destination Host Unreachable
From 10.200.0.2 icmp_seq=33 Destination Host Unreachable
From 10.200.0.2 icmp_seq=34 Destination Host Unreachable
From 10.200.0.2 icmp_seq=35 Destination Host Unreachable
From 10.200.0.2 icmp_seq=36 Destination Host Unreachable
From 10.200.0.2 icmp_seq=37 Destination Host Unreachable
From 10.200.0.2 icmp_seq=38 Destination Host Unreachable
From 10.200.0.2 icmp_seq=39 Destination Host Unreachable
From 10.200.0.2 icmp_seq=40 Destination Host Unreachable
From 10.200.0.2 icmp_seq=41 Destination Host Unreachable
From 10.200.0.2 icmp_seq=42 Destination Host Unreachable
From 10.200.0.2 icmp_seq=43 Destination Host Unreachable
From 10.200.0.2 icmp_seq=44 Destination Host Unreachable
From 10.200.0.2 icmp_seq=45 Destination Host Unreachable
From 10.200.0.2 icmp_seq=46 Destination Host Unreachable
From 10.200.0.2 icmp_seq=47 Destination Host Unreachable
From 10.200.0.2 icmp_seq=48 Destination Host Unreachable
From 10.200.0.2 icmp_seq=49 Destination Host Unreachable
From 10.200.0.2 icmp_seq=50 Destination Host Unreachable
From 10.200.0.2 icmp_seq=51 Destination Host Unreachable
From 10.200.0.2 icmp_seq=52 Destination Host Unreachable
From 10.200.0.2 icmp_seq=53 Destination Host Unreachable
From 10.200.0.2 icmp_seq=54 Destination Host Unreachable
From 10.200.0.2 icmp_seq=55 Destination Host Unreachable
From 10.200.0.2 icmp_seq=56 Destination Host Unreachable
From 10.200.0.2 icmp_seq=57 Destination Host Unreachable
From 10.200.0.2 icmp_seq=58 Destination Host Unreachable
From 10.200.0.2 icmp_seq=59 Destination Host Unreachable
From 10.200.0.2 icmp_seq=60 Destination Host Unreachable
From 10.200.0.2 icmp_seq=61 Destination Host Unreachable
From 10.200.0.2 icmp_seq=62 Destination Host Unreachable
From 10.200.0.2 icmp_seq=63 Destination Host Unreachable
From 10.200.0.2 icmp_seq=64 Destination Host Unreachable
From 10.200.0.2 icmp_seq=65 Destination Host Unreachable
From 10.200.0.2 icmp_seq=66 Destination Host Unreachable
From 10.200.0.2 icmp_seq=67 Destination Host Unreachable
From 10.200.0.2 icmp_seq=68 Destination Host Unreachable
From 10.200.0.2 icmp_seq=69 Destination Host Unreachable
From 10.200.0.2 icmp_seq=70 Destination Host Unreachable
From 10.200.0.2 icmp_seq=71 Destination Host Unreachable
From 10.200.0.2 icmp_seq=72 Destination Host Unreachable
From 10.200.0.2 icmp_seq=73 Destination Host Unreachable
From 10.200.0.2 icmp_seq=74 Destination Host Unreachable
From 10.200.0.2 icmp_seq=75 Destination Host Unreachable
From 10.200.0.2 icmp_seq=76 Destination Host Unreachable
From 10.200.0.2 icmp_seq=77 Destination Host Unreachable
From 10.200.0.2 icmp_seq=78 Destination Host Unreachable
From 10.200.0.2 icmp_seq=79 Destination Host Unreachable
From 10.200.0.2 icmp_seq=80 Destination Host Unreachable
From 10.200.0.2 icmp_seq=81 Destination Host Unreachable
From 10.200.0.2 icmp_seq=82 Destination Host Unreachable
From 10.200.0.2 icmp_seq=83 Destination Host Unreachable
From 10.200.0.2 icmp_seq=84 Destination Host Unreachable
From 10.200.0.2 icmp_seq=85 Destination Host Unreachable
From 10.200.0.2 icmp_seq=86 Destination Host Unreachable
From 10.200.0.2 icmp_seq=87 Destination Host Unreachable
From 10.200.0.2 icmp_seq=88 Destination Host Unreachable
From 10.200.0.2 icmp_seq=89 Destination Host Unreachable
From 10.200.0.2 icmp_seq=90 Destination Host Unreachable
From 10.200.0.2 icmp_seq=91 Destination Host Unreachable
From 10.200.0.2 icmp_seq=92 Destination Host Unreachable
From 10.200.0.2 icmp_seq=93 Destination Host Unreachable
From 10.200.0.2 icmp_seq=94 Destination Host Unreachable
From 10.200.0.2 icmp_seq=95 Destination Host Unreachable
From 10.200.0.2 icmp_seq=96 Destination Host Unreachable
From 10.200.0.2 icmp_seq=97 Destination Host Unreachable
From 10.200.0.2 icmp_seq=98 Destination Host Unreachable
From 10.200.0.2 icmp_seq=99 Destination Host Unreachable
From 10.200.0.2 icmp_seq=100 Destination Host Unreachable
From 10.200.0.2 icmp_seq=101 Destination Host Unreachable
From 10.200.0.2 icmp_seq=102 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=103 ttl=64 time=1026 ms
64 bytes from 10.200.0.1: icmp_seq=104 ttl=64 time=2.41 ms
64 bytes from 10.200.0.1: icmp_seq=105 ttl=64 time=1.22 ms
64 bytes from 10.200.0.1: icmp_seq=106 ttl=64 time=2.50 ms
64 bytes from 10.200.0.1: icmp_seq=107 ttl=64 time=0.832 ms
64 bytes from 10.200.0.1: icmp_seq=108 ttl=64 time=1.10 ms
64 bytes from 10.200.0.1: icmp_seq=109 ttl=64 time=0.710 ms
64 bytes from 10.200.0.1: icmp_seq=110 ttl=64 time=0.878 ms
64 bytes from 10.200.0.1: icmp_seq=111 ttl=64 time=0.891 ms
64 bytes from 10.200.0.1: icmp_seq=112 ttl=64 time=0.833 ms
^C
--- 10.200.0.1 ping statistics ---
112 packets transmitted, 10 received, +93 errors, 91.0714% packet loss, time 113473ms
rtt min/avg/max/mdev = 0.710/103.761/1026.243/307.494 ms, pipe 4
[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:32:a6:00:00:02 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86309468sec preferred_lft 86309468sec
    inet6 fe80::b00f:71c1:3fa2:140e/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:32:a6:00:00:03 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::32:a6ff:fe00:3/64 scope link 
       valid_lft forever preferred_lft forever

Comment 13 Edward Haas 2021-06-29 13:58:29 UTC
Thank you for the clarification on the IP.

We still need answers to comments 9 and 11.
Thanks.

Comment 14 awax 2021-06-30 07:47:15 UTC
(In reply to Petr Horáček from comment #9)
> Anat, could you confirm whether this is a regression introduced in 4.8? We
> recommend to live-migrate over the bridge network to assure no connectivity
> drop. It is alarming that we see this issue.

I managed to recreate the bug on a 2.6.6 cluster.

results:
[fedora@vmb ~]$ ping 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.2 icmp_seq=10 Destination Host Unreachable
From 10.200.0.2 icmp_seq=11 Destination Host Unreachable
From 10.200.0.2 icmp_seq=12 Destination Host Unreachable
From 10.200.0.2 icmp_seq=13 Destination Host Unreachable
From 10.200.0.2 icmp_seq=14 Destination Host Unreachable
From 10.200.0.2 icmp_seq=15 Destination Host Unreachable
From 10.200.0.2 icmp_seq=16 Destination Host Unreachable
From 10.200.0.2 icmp_seq=17 Destination Host Unreachable
From 10.200.0.2 icmp_seq=18 Destination Host Unreachable
From 10.200.0.2 icmp_seq=19 Destination Host Unreachable
From 10.200.0.2 icmp_seq=20 Destination Host Unreachable
From 10.200.0.2 icmp_seq=21 Destination Host Unreachable
From 10.200.0.2 icmp_seq=22 Destination Host Unreachable
From 10.200.0.2 icmp_seq=23 Destination Host Unreachable
From 10.200.0.2 icmp_seq=24 Destination Host Unreachable
From 10.200.0.2 icmp_seq=25 Destination Host Unreachable
From 10.200.0.2 icmp_seq=26 Destination Host Unreachable
From 10.200.0.2 icmp_seq=27 Destination Host Unreachable
From 10.200.0.2 icmp_seq=28 Destination Host Unreachable
From 10.200.0.2 icmp_seq=29 Destination Host Unreachable
From 10.200.0.2 icmp_seq=30 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=32 ttl=64 time=1027 ms
64 bytes from 10.200.0.1: icmp_seq=33 ttl=64 time=3.31 ms
64 bytes from 10.200.0.1: icmp_seq=31 ttl=64 time=2052 ms
64 bytes from 10.200.0.1: icmp_seq=34 ttl=64 time=1.03 ms
64 bytes from 10.200.0.1: icmp_seq=35 ttl=64 time=1.08 ms
^C
--- 10.200.0.1 ping statistics ---
35 packets transmitted, 5 received, +21 errors, 85.7143% packet loss, time 34810ms
rtt min/avg/max/mdev = 1.029/616.855/2051.523/819.955 ms, pipe 4
[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:31:f1:1c brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86313057sec preferred_lft 86313057sec
    inet6 fe80::4ddc:8a67:d50d:ddab/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 16:3f:fc:f1:00:f4 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::143f:fcff:fef1:f4/64 scope link 
       valid_lft forever preferred_lft forever

Comment 15 Petr Horáček 2021-06-30 09:30:58 UTC
Thanks, that helps with triaging this bug.

Adding back the needinfo for https://bugzilla.redhat.com/show_bug.cgi?id=1976578#c13.

Comment 16 awax 2021-06-30 11:40:28 UTC
(In reply to Edward Haas from comment #11)
> Could you also try to recreate this when KMP is disabled for these VMI/s?

This bug is flaky, so I'm saying this with caution - I wasn't able to recreate the bug when KMP was disabled on the namespace (on a 4.8 cluster).
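
For reference, KMP was disabled by opting the namespace out of KubeMacPool with a label (the same label is visible in the 'oc describe namespaces' output in comment 17); a sketch of the command used:

$ oc label namespace anat-test-migration mutatevirtualmachines.kubemacpool.io=ignore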

Comment 17 awax 2021-06-30 11:51:26 UTC
(In reply to awax from comment #16)
> (In reply to Edward Haas from comment #11)
> > Could you also try to recreate this when KMP is disabled for these VMI/s?
> 
> This bug is flaky, so I'm saying this with caution - I wasn't able to
> recreate the bug when KMP was disabled on the namespace (on a 4.8 cluster).

I did manage to reproduce it now:

[fedora@vmb ~]$ ping 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.2 icmp_seq=9 Destination Host Unreachable
From 10.200.0.2 icmp_seq=10 Destination Host Unreachable
From 10.200.0.2 icmp_seq=11 Destination Host Unreachable
From 10.200.0.2 icmp_seq=12 Destination Host Unreachable
From 10.200.0.2 icmp_seq=13 Destination Host Unreachable
From 10.200.0.2 icmp_seq=14 Destination Host Unreachable
From 10.200.0.2 icmp_seq=15 Destination Host Unreachable
From 10.200.0.2 icmp_seq=16 Destination Host Unreachable
From 10.200.0.2 icmp_seq=17 Destination Host Unreachable
From 10.200.0.2 icmp_seq=18 Destination Host Unreachable
From 10.200.0.2 icmp_seq=19 Destination Host Unreachable
From 10.200.0.2 icmp_seq=20 Destination Host Unreachable
From 10.200.0.2 icmp_seq=21 Destination Host Unreachable
From 10.200.0.2 icmp_seq=22 Destination Host Unreachable
From 10.200.0.2 icmp_seq=23 Destination Host Unreachable
From 10.200.0.2 icmp_seq=24 Destination Host Unreachable
From 10.200.0.2 icmp_seq=25 Destination Host Unreachable
From 10.200.0.2 icmp_seq=26 Destination Host Unreachable
From 10.200.0.2 icmp_seq=27 Destination Host Unreachable
From 10.200.0.2 icmp_seq=28 Destination Host Unreachable
From 10.200.0.2 icmp_seq=29 Destination Host Unreachable
From 10.200.0.2 icmp_seq=30 Destination Host Unreachable
From 10.200.0.2 icmp_seq=31 Destination Host Unreachable
From 10.200.0.2 icmp_seq=32 Destination Host Unreachable
From 10.200.0.2 icmp_seq=33 Destination Host Unreachable
From 10.200.0.2 icmp_seq=34 Destination Host Unreachable
From 10.200.0.2 icmp_seq=35 Destination Host Unreachable
From 10.200.0.2 icmp_seq=36 Destination Host Unreachable
From 10.200.0.2 icmp_seq=37 Destination Host Unreachable
From 10.200.0.2 icmp_seq=38 Destination Host Unreachable
From 10.200.0.2 icmp_seq=39 Destination Host Unreachable
From 10.200.0.2 icmp_seq=40 Destination Host Unreachable
From 10.200.0.2 icmp_seq=41 Destination Host Unreachable
From 10.200.0.2 icmp_seq=42 Destination Host Unreachable
From 10.200.0.2 icmp_seq=43 Destination Host Unreachable
From 10.200.0.2 icmp_seq=44 Destination Host Unreachable
From 10.200.0.2 icmp_seq=45 Destination Host Unreachable
From 10.200.0.2 icmp_seq=46 Destination Host Unreachable
From 10.200.0.2 icmp_seq=47 Destination Host Unreachable
From 10.200.0.2 icmp_seq=48 Destination Host Unreachable
From 10.200.0.2 icmp_seq=49 Destination Host Unreachable
From 10.200.0.2 icmp_seq=50 Destination Host Unreachable
From 10.200.0.2 icmp_seq=51 Destination Host Unreachable
From 10.200.0.2 icmp_seq=52 Destination Host Unreachable
From 10.200.0.2 icmp_seq=53 Destination Host Unreachable
From 10.200.0.2 icmp_seq=54 Destination Host Unreachable
From 10.200.0.2 icmp_seq=55 Destination Host Unreachable
From 10.200.0.2 icmp_seq=56 Destination Host Unreachable
From 10.200.0.2 icmp_seq=57 Destination Host Unreachable
From 10.200.0.2 icmp_seq=58 Destination Host Unreachable
From 10.200.0.2 icmp_seq=59 Destination Host Unreachable
From 10.200.0.2 icmp_seq=60 Destination Host Unreachable
From 10.200.0.2 icmp_seq=61 Destination Host Unreachable
From 10.200.0.2 icmp_seq=62 Destination Host Unreachable
From 10.200.0.2 icmp_seq=63 Destination Host Unreachable
From 10.200.0.2 icmp_seq=64 Destination Host Unreachable
From 10.200.0.2 icmp_seq=65 Destination Host Unreachable
From 10.200.0.2 icmp_seq=66 Destination Host Unreachable
From 10.200.0.2 icmp_seq=67 Destination Host Unreachable
From 10.200.0.2 icmp_seq=68 Destination Host Unreachable
From 10.200.0.2 icmp_seq=69 Destination Host Unreachable
From 10.200.0.2 icmp_seq=70 Destination Host Unreachable
From 10.200.0.2 icmp_seq=71 Destination Host Unreachable
From 10.200.0.2 icmp_seq=72 Destination Host Unreachable
From 10.200.0.2 icmp_seq=73 Destination Host Unreachable
From 10.200.0.2 icmp_seq=74 Destination Host Unreachable
From 10.200.0.2 icmp_seq=75 Destination Host Unreachable
From 10.200.0.2 icmp_seq=76 Destination Host Unreachable
From 10.200.0.2 icmp_seq=77 Destination Host Unreachable
From 10.200.0.2 icmp_seq=78 Destination Host Unreachable
From 10.200.0.2 icmp_seq=79 Destination Host Unreachable
From 10.200.0.2 icmp_seq=80 Destination Host Unreachable
From 10.200.0.2 icmp_seq=81 Destination Host Unreachable
From 10.200.0.2 icmp_seq=82 Destination Host Unreachable
From 10.200.0.2 icmp_seq=83 Destination Host Unreachable
From 10.200.0.2 icmp_seq=84 Destination Host Unreachable
From 10.200.0.2 icmp_seq=85 Destination Host Unreachable
From 10.200.0.2 icmp_seq=86 Destination Host Unreachable
From 10.200.0.2 icmp_seq=87 Destination Host Unreachable
From 10.200.0.2 icmp_seq=88 Destination Host Unreachable
From 10.200.0.2 icmp_seq=89 Destination Host Unreachable
From 10.200.0.2 icmp_seq=90 Destination Host Unreachable
From 10.200.0.2 icmp_seq=91 Destination Host Unreachable
From 10.200.0.2 icmp_seq=92 Destination Host Unreachable
From 10.200.0.2 icmp_seq=93 Destination Host Unreachable
From 10.200.0.2 icmp_seq=94 Destination Host Unreachable
From 10.200.0.2 icmp_seq=95 Destination Host Unreachable
From 10.200.0.2 icmp_seq=96 Destination Host Unreachable
From 10.200.0.2 icmp_seq=97 Destination Host Unreachable
From 10.200.0.2 icmp_seq=98 Destination Host Unreachable
From 10.200.0.2 icmp_seq=99 Destination Host Unreachable
From 10.200.0.2 icmp_seq=100 Destination Host Unreachable
From 10.200.0.2 icmp_seq=101 Destination Host Unreachable
From 10.200.0.2 icmp_seq=102 Destination Host Unreachable
From 10.200.0.2 icmp_seq=103 Destination Host Unreachable
From 10.200.0.2 icmp_seq=104 Destination Host Unreachable
From 10.200.0.2 icmp_seq=105 Destination Host Unreachable
From 10.200.0.2 icmp_seq=106 Destination Host Unreachable
From 10.200.0.2 icmp_seq=107 Destination Host Unreachable
From 10.200.0.2 icmp_seq=108 Destination Host Unreachable
From 10.200.0.2 icmp_seq=109 Destination Host Unreachable
From 10.200.0.2 icmp_seq=110 Destination Host Unreachable
From 10.200.0.2 icmp_seq=111 Destination Host Unreachable
From 10.200.0.2 icmp_seq=112 Destination Host Unreachable
From 10.200.0.2 icmp_seq=113 Destination Host Unreachable
From 10.200.0.2 icmp_seq=114 Destination Host Unreachable
From 10.200.0.2 icmp_seq=115 Destination Host Unreachable
From 10.200.0.2 icmp_seq=116 Destination Host Unreachable
From 10.200.0.2 icmp_seq=117 Destination Host Unreachable
From 10.200.0.2 icmp_seq=118 Destination Host Unreachable
From 10.200.0.2 icmp_seq=119 Destination Host Unreachable
From 10.200.0.2 icmp_seq=120 Destination Host Unreachable
From 10.200.0.2 icmp_seq=121 Destination Host Unreachable
From 10.200.0.2 icmp_seq=122 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=125 ttl=64 time=4.47 ms
64 bytes from 10.200.0.1: icmp_seq=124 ttl=64 time=1028 ms
64 bytes from 10.200.0.1: icmp_seq=123 ttl=64 time=2053 ms
64 bytes from 10.200.0.1: icmp_seq=126 ttl=64 time=1.31 ms
64 bytes from 10.200.0.1: icmp_seq=127 ttl=64 time=1.06 ms
64 bytes from 10.200.0.1: icmp_seq=128 ttl=64 time=0.817 ms
^C
--- 10.200.0.1 ping statistics ---
128 packets transmitted, 6 received, +114 errors, 95.3125% packet loss, time 129999ms
rtt min/avg/max/mdev = 0.817/514.774/2052.604/783.243 ms, pipe 4
[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:bd:12:03 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86311137sec preferred_lft 86311137sec
    inet6 fe80::a7b8:96eb:5aaf:4358/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether e2:80:2b:97:54:d5 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::e080:2bff:fe97:54d5/64 scope link 
       valid_lft forever preferred_lft forever
[fedora@vmb ~]$ exit
logout
-bash: ssh-dss: command not found
Connection to 192.168.2.83 closed.
[cnv-qe-jenkins@awax-48-3-jvl6x-executor anat_files]$ oc describe namespaces anat-test-migration 
Name:         anat-test-migration
Labels:       kubernetes.io/metadata.name=anat-test-migration
              mutatevirtualmachines.kubemacpool.io=ignore
Annotations:  openshift.io/description: 
              openshift.io/display-name: 
              openshift.io/requester: system:admin
              openshift.io/sa.scc.mcs: s0:c28,c2
              openshift.io/sa.scc.supplemental-groups: 1000760000/10000
              openshift.io/sa.scc.uid-range: 1000760000/10000
Status:       Active

No resource quota.

No LimitRange resource.

Comment 18 Petr Horáček 2021-09-21 12:43:38 UTC
Moving to z-stream, as this was already observed on 4.8, triage has not been finished yet, and we are already quite late in the release.

Comment 19 Edward Haas 2021-10-03 11:15:25 UTC
Based on the information collected so far, this does not seem to be solely related to KMP (i.e. enforcing the same MAC address on the source and target pod interfaces), although in theory this may be another layer of the problem.

We have two paths to troubleshoot this now:
- Collect more info when the problem occurs, trying to pinpoint what exactly is missing.
- Troubleshoot it live on the setup where this is recreated.

I will focus here on the specific info I think may be helpful, which should also be a good baseline for the live troubleshooting.
If the ping is not working, we need to check the two guests and the intermediate bridges.
When the ping is in a failed state, this info may be useful:
- Base info: the MAC addresses, IPs and routes from both guests (a table indexed by interface would be nice), collected before the migration is started.

After the migration finishes and in the period the ping fails:
- ARP table from both guests: `ip neigh`
- IP/s: `ip addr`
- Routes: `ip route`

If there is a bridge on the node which serves the network, dumping its MAC table will be very useful (the list of MAC addresses learned by the bridge on each of its ports).
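
Putting the above together, a minimal sketch of the collection commands (the bridge name br100test is the one used on the nodes in this setup; adjust names to your environment):

Inside each guest, before the migration and again while the ping is failing:
$ ip addr
$ ip route
$ ip neigh

On each node backing the bridge network:
$ bridge fdb show dev br100test
$ bridge link show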

Comment 22 awax 2021-12-13 12:08:31 UTC
Hi Edi,
Here is the requested info:

Before creating the migration instance:
VMA:
[fedora@vma ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
	link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	inet 127.0.0.1/8 scope host lo
	   valid_lft forever preferred_lft forever
	inet6 ::1/128 scope host 
	   valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
	link/ether 02:a0:84:00:00:00 brd ff:ff:ff:ff:ff:ff
	altname enp1s0
	inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
	   valid_lft 86312415sec preferred_lft 86312415sec
	inet6 fe80::72:a25e:37cc:c3ed/64 scope link noprefixroute 
	   valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
	link/ether 02:a0:84:00:00:01 brd ff:ff:ff:ff:ff:ff
	altname enp2s0
	inet 10.200.0.1/24 brd 10.200.0.255 scope global noprefixroute eth1
	   valid_lft forever preferred_lft forever
	inet6 fe80::a0:84ff:fe00:1/64 scope link 
	   valid_lft forever preferred_lft forever
[fedora@vma ~]$ ip route
default via 10.0.2.1 dev eth0 proto dhcp metric 100 
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.2 metric 100 
10.200.0.0/24 dev eth1 proto kernel scope link src 10.200.0.1 metric 101 
[fedora@vma ~]$ ip neigh
10.0.2.1 dev eth0 lladdr 16:e2:36:ad:90:1a REACHABLE
10.200.0.2 dev eth1 lladdr 02:a0:84:00:00:03 REACHABLE


VMB:
[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
	link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	inet 127.0.0.1/8 scope host lo
	   valid_lft forever preferred_lft forever
	inet6 ::1/128 scope host 
	   valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
	link/ether 02:a0:84:00:00:02 brd ff:ff:ff:ff:ff:ff
	altname enp1s0
	inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
	   valid_lft 86312427sec preferred_lft 86312427sec
	inet6 fe80::6444:43f8:f34d:892/64 scope link noprefixroute 
	   valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
	link/ether 02:a0:84:00:00:03 brd ff:ff:ff:ff:ff:ff
	altname enp2s0
	inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
	   valid_lft forever preferred_lft forever
	inet6 fe80::a0:84ff:fe00:3/64 scope link 
	   valid_lft forever preferred_lft forever
[fedora@vmb ~]$ ip route
default via 10.0.2.1 dev eth0 proto dhcp metric 100 
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.2 metric 100 
10.200.0.0/24 dev eth1 proto kernel scope link src 10.200.0.2 metric 101 
[fedora@vmb ~]$ ip neigh
10.0.2.1 dev eth0 lladdr d6:45:36:dc:ed:94 REACHABLE
10.200.0.1 dev eth1 lladdr 02:a0:84:00:00:01 REACHABLE




During ping failure:
VMA:
 [fedora@vma ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
	link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	inet 127.0.0.1/8 scope host lo
	   valid_lft forever preferred_lft forever
	inet6 ::1/128 scope host 
	   valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
	link/ether 02:a0:84:00:00:00 brd ff:ff:ff:ff:ff:ff
	altname enp1s0
	inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
	   valid_lft 86232786sec preferred_lft 86232786sec
	inet6 fe80::72:a25e:37cc:c3ed/64 scope link noprefixroute 
	   valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
	link/ether 02:a0:84:00:00:01 brd ff:ff:ff:ff:ff:ff
	altname enp2s0
	inet 10.200.0.1/24 brd 10.200.0.255 scope global noprefixroute eth1
	   valid_lft forever preferred_lft forever
	inet6 fe80::a0:84ff:fe00:1/64 scope link 
	   valid_lft forever preferred_lft forever

[fedora@vma ~]$ ip route
default via 10.0.2.1 dev eth0 proto dhcp metric 100 
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.2 metric 100 
10.200.0.0/24 dev eth1 proto kernel scope link src 10.200.0.1 metric 101 
[fedora@vma ~]$ ip neigh
10.0.2.1 dev eth0 lladdr 16:e2:36:ad:90:1a REACHABLE
10.200.0.2 dev eth1 lladdr 02:a0:84:00:00:03 STALE

VMB:
[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
	link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	inet 127.0.0.1/8 scope host lo
	   valid_lft forever preferred_lft forever
	inet6 ::1/128 scope host 
	   valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
	link/ether 02:a0:84:00:00:02 brd ff:ff:ff:ff:ff:ff
	altname enp1s0
	inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
	   valid_lft 86232803sec preferred_lft 86232803sec
	inet6 fe80::6444:43f8:f34d:892/64 scope link noprefixroute 
	   valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
	link/ether 02:a0:84:00:00:03 brd ff:ff:ff:ff:ff:ff
	altname enp2s0
	inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
	   valid_lft forever preferred_lft forever
	inet6 fe80::a0:84ff:fe00:3/64 scope link 
	   valid_lft forever preferred_lft forever
[fedora@vmb ~]$ ip route
default via 10.0.2.1 dev eth0 proto dhcp metric 100 
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.2 metric 100 
10.200.0.0/24 dev eth1 proto kernel scope link src 10.200.0.2 metric 101 
[fedora@vmb ~]$ ip neigh
10.0.2.1 dev eth0 lladdr 9e:67:b0:19:ce:3d REACHABLE
10.200.0.1 dev eth1  INCOMPLETE




Bridge info:
(2 bridges are created on 2 nodes)
[cnv-qe-jenkins@n-awax-48-2-7kdn4-executor ~]$ oc get nnce
NAME                                                  STATUS
n-awax-48-2-7kdn4-master-0.migration-worker-1         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-0.migration-worker-2         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-1.migration-worker-1         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-1.migration-worker-2         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-2.migration-worker-1         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-2.migration-worker-2         NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-4svh5.migration-worker-1   SuccessfullyConfigured
n-awax-48-2-7kdn4-worker-0-4svh5.migration-worker-2   NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-66v2c.migration-worker-1   NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-66v2c.migration-worker-2   NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-llcbl.migration-worker-1   NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-llcbl.migration-worker-2   SuccessfullyConfigured

[core@n-awax-48-2-7kdn4-worker-0-llcbl ~]$ bridge fdb show dev br100test
01:00:5e:00:00:01 self permanent
[core@n-awax-48-2-7kdn4-worker-0-4svh5 ~]$ bridge fdb show dev br100test
01:00:5e:00:00:01 self permanent


For the bridge fdb result on both nodes please see the attached files (worker-0-4svh5_bridge_fdb.txt and worker-0-llcbl_bridge_fdb.txt).

Comment 23 Edward Haas 2022-01-10 11:38:20 UTC
I'm sorry for the late response on this one.
Thank you for all the information.

I see that the ARP tables on both VM/s have not been refreshed/resolved:
`10.200.0.2 dev eth1 lladdr 02:a0:84:00:00:03 STALE`
`10.200.0.1 dev eth1  INCOMPLETE`

Unfortunately, I was not able to find any of these MAC addresses in the attached bridge MAC tables, so I am unable to demystify this.

It looks like the only way to move forward with this is to be connected to the setup online while the problem occurs and debug it.
If the test runs with automation, we can plan the steps to be taken in case of failure.

I'm unsure whether this is feasible/practical from your point of view; let me know what you think.

Comment 25 Petr Horáček 2022-01-25 10:46:35 UTC
Update from an offline discussion: QE will create a test for this specific case and will try to use it to collect as much information as possible the next time this happens.

Comment 26 Ruth Netser 2022-03-03 13:14:38 UTC
@awax Can you please create the required test to collect the needed data?

Comment 27 awax 2022-05-19 08:00:35 UTC
After discussing this with Edi, I will create automation to collect the data on v.4.11.

Comment 28 Petr Horáček 2022-05-19 12:48:08 UTC
Retargeting to "future", since this wait for more data to be gathered

Comment 29 awax 2022-06-09 12:13:58 UTC
Update - I couldn't recreate the bug (using automation) on versions 4.9, 4.10, and 4.11. Currently working on recreating it using automation on v4.8.

Comment 30 awax 2022-06-15 08:51:15 UTC
I have the automation working on 4.8 and the bug reproduces. The automation collects all the requested data before the migration and while the bug is occurring.
I'll sync with Edi on how to proceed from here.

Comment 31 Petr Horáček 2022-07-21 08:26:22 UTC
@awax there was still no sign of this issue on 4.9+, right? If that's the case, I think we can close this BZ.

Comment 32 awax 2022-07-25 09:20:56 UTC
@phoracek We didn't run the automation on 4.9 for the past month. We will re-run it on 4.9 after the v4.11 release.

Comment 34 Petr Horáček 2023-06-29 12:29:03 UTC
Please reopen this once the issue reappears.

