Created attachment 1795082 [details]
migration_vmb_new.yaml

Description of problem:
A migrated VM takes a long time (between 10 and 60 seconds) to regain connectivity. When pinging over a secondary interface from the migrated VM (with Multus) to another VM (with Multus) in the same cluster, there is packet loss (with 'Destination Host Unreachable') during this period.

Version-Release number of selected component (if applicable):
CNV v4.8.0
OCP v4.8.0-fc.5
Kubernetes Version: v1.21.0-rc.0+88a3e8c

How reproducible:
Not always. I couldn't find a correlation to understand why.

Steps to Reproduce:
1. Create a dedicated namespace for the resources created in the next steps. Name it "anat-test-migration" to match the namespace defined in the attached files.
2. Create the bridge (use the attached 'migration_nncp_1.yaml' and 'migration_nncp_2.yaml' files - make sure to change the node selector to match your cluster nodes).
3. Create the NAD (use the attached 'migration_nad_new.yaml' file).
4. Create vma and vmb (use the attached 'migration_vma_new.yaml' and 'migration_vmb_new.yaml' files).
5. Start both VMs:
$ virtctl start vma
$ virtctl start vmb
6. Expose services to allow an SSH connection to both VMs (use the attached 'migration_ssh_service_for_vma.yaml' and 'migration_ssh_service_for_vmb.yaml' files).
7. Migrate vmb (use the attached 'migration_virtualmachineinstancemigration.yaml' file).
8. Connect to vmb as soon as the migration finishes. To find the exact moment, watch for the VMI to be assigned a new IP address:
$ oc get vmi -w
9. Ping from vmb to vma over the secondary (bridge) interface:
- Enter vmb through SSH (the IP is that of the node on which vmb is running; '-p' takes the port of vmb's service, which can be found with 'oc get service'):
$ ssh fedora@192.168.2.83 -p 30401
- Ping vma:
$ ping 10.200.0.1
* In order to reproduce, steps 8 and 9 should be performed as close to the end of the migration as possible.

Actual results:
When the bug occurs:

[fedora@vmb ~]$ ping 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.22 icmp_seq=10 Destination Host Unreachable
From 10.200.0.22 icmp_seq=11 Destination Host Unreachable
From 10.200.0.22 icmp_seq=12 Destination Host Unreachable
From 10.200.0.22 icmp_seq=13 Destination Host Unreachable
From 10.200.0.22 icmp_seq=14 Destination Host Unreachable
From 10.200.0.22 icmp_seq=15 Destination Host Unreachable
From 10.200.0.22 icmp_seq=16 Destination Host Unreachable
From 10.200.0.22 icmp_seq=17 Destination Host Unreachable
From 10.200.0.22 icmp_seq=18 Destination Host Unreachable
From 10.200.0.22 icmp_seq=19 Destination Host Unreachable
From 10.200.0.22 icmp_seq=20 Destination Host Unreachable
From 10.200.0.22 icmp_seq=21 Destination Host Unreachable
From 10.200.0.22 icmp_seq=22 Destination Host Unreachable
From 10.200.0.22 icmp_seq=23 Destination Host Unreachable
From 10.200.0.22 icmp_seq=24 Destination Host Unreachable
From 10.200.0.22 icmp_seq=25 Destination Host Unreachable
From 10.200.0.22 icmp_seq=26 Destination Host Unreachable
From 10.200.0.22 icmp_seq=27 Destination Host Unreachable
From 10.200.0.22 icmp_seq=28 Destination Host Unreachable
From 10.200.0.22 icmp_seq=29 Destination Host Unreachable
From 10.200.0.22 icmp_seq=30 Destination Host Unreachable
From 10.200.0.22 icmp_seq=31 Destination Host Unreachable
From 10.200.0.22 icmp_seq=32 Destination Host Unreachable
From 10.200.0.22 icmp_seq=33 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=35 ttl=64 time=3.93 ms
64 bytes from 10.200.0.1: icmp_seq=34 ttl=64 time=1028 ms
64 bytes from 10.200.0.1: icmp_seq=36 ttl=64 time=1.36 ms
64 bytes from 10.200.0.1: icmp_seq=37 ttl=64 time=0.962 ms
64 bytes from 10.200.0.1: icmp_seq=38 ttl=64 time=1.30 ms
^C
--- 10.200.0.1 ping statistics ---
38 packets transmitted, 5 received, +24 errors, 86.8421% packet loss, time 37808ms
rtt min/avg/max/mdev = 0.962/207.169/1028.296/410.564 ms, pipe 4

Expected results:
No packet loss.

Additional info:
A tcpdump of the secondary interface of the migrated VM (vmb) is attached. Steps used to produce it:
1. SSH to vmb:
$ ssh fedora@192.168.2.83 -p 30401
2. Run tcpdump:
$ sudo tcpdump -i eth1 -xx >~/tcpdump_log.log
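To make the timing of steps 8 and 9 easier to hit, here is a minimal helper sketch. The `vmim` shortname and the `Succeeded` phase are standard KubeVirt, but the exact jsonpath, the object index, and the hardcoded node IP/service port are assumptions based on the values above:

# Sketch: start the ping the moment the migration completes.
oc create -f migration_virtualmachineinstancemigration.yaml

# Poll until the VirtualMachineInstanceMigration reports Succeeded
# (assumes it is the only migration object in the namespace).
while [ "$(oc get vmim -n anat-test-migration -o jsonpath='{.items[0].status.phase}')" != "Succeeded" ]; do
  sleep 1
done

# Immediately ping vma over the secondary (bridge) interface from inside vmb.
ssh -p 30401 fedora@192.168.2.83 ping -c 60 10.200.0.1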
Created attachment 1795083 [details] migration_vma_new.yaml
Created attachment 1795093 [details] migration_ssh_service_for_vmb.yaml
Created attachment 1795094 [details] migration_ssh_service_for_vma.yaml
Created attachment 1795095 [details] migration_nad_new.yaml
Created attachment 1795096 [details] migration_nncp_1.yaml
Created attachment 1795098 [details] migration_virtualmachineinstancemigration.yaml
Created attachment 1795099 [details] tcpdump_log.log
Anat, could you confirm whether this is a regression introduced in 4.8? We recommend live-migrating over the bridge network to ensure no connectivity drop, so it is alarming that we see this issue.
Who owns the address `10.200.0.22`?
Could you also try to recreate this when KMP is disabled for these VMI/s?
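For reference, a minimal sketch of opting a namespace out of KMP (KubeMacPool) - this is the same `mutatevirtualmachines.kubemacpool.io=ignore` label that shows up in the `oc describe namespace` output later in this thread; presumably the VMs need to be recreated afterwards for it to take effect:

# Exclude the test namespace from KubeMacPool's VM MAC management.
$ oc label namespace anat-test-migration mutatevirtualmachines.kubemacpool.io=ignore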
(In reply to Edward Haas from comment #10)
> Who owns the address `10.200.0.22`?

eth1 - see the recreation below (different cluster, so the IP is different as well):

[fedora@vmb ~]$ ping 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.2 icmp_seq=10 Destination Host Unreachable
From 10.200.0.2 icmp_seq=11 Destination Host Unreachable
From 10.200.0.2 icmp_seq=12 Destination Host Unreachable
From 10.200.0.2 icmp_seq=13 Destination Host Unreachable
From 10.200.0.2 icmp_seq=14 Destination Host Unreachable
From 10.200.0.2 icmp_seq=15 Destination Host Unreachable
From 10.200.0.2 icmp_seq=16 Destination Host Unreachable
From 10.200.0.2 icmp_seq=17 Destination Host Unreachable
From 10.200.0.2 icmp_seq=18 Destination Host Unreachable
From 10.200.0.2 icmp_seq=19 Destination Host Unreachable
From 10.200.0.2 icmp_seq=20 Destination Host Unreachable
From 10.200.0.2 icmp_seq=21 Destination Host Unreachable
From 10.200.0.2 icmp_seq=22 Destination Host Unreachable
From 10.200.0.2 icmp_seq=23 Destination Host Unreachable
From 10.200.0.2 icmp_seq=24 Destination Host Unreachable
From 10.200.0.2 icmp_seq=25 Destination Host Unreachable
From 10.200.0.2 icmp_seq=26 Destination Host Unreachable
From 10.200.0.2 icmp_seq=27 Destination Host Unreachable
From 10.200.0.2 icmp_seq=28 Destination Host Unreachable
From 10.200.0.2 icmp_seq=29 Destination Host Unreachable
From 10.200.0.2 icmp_seq=30 Destination Host Unreachable
From 10.200.0.2 icmp_seq=31 Destination Host Unreachable
From 10.200.0.2 icmp_seq=32 Destination Host Unreachable
From 10.200.0.2 icmp_seq=33 Destination Host Unreachable
From 10.200.0.2 icmp_seq=34 Destination Host Unreachable
From 10.200.0.2 icmp_seq=35 Destination Host Unreachable
From 10.200.0.2 icmp_seq=36 Destination Host Unreachable
From 10.200.0.2 icmp_seq=37 Destination Host Unreachable
From 10.200.0.2 icmp_seq=38 Destination Host Unreachable
From 10.200.0.2 icmp_seq=39 Destination Host Unreachable
From 10.200.0.2 icmp_seq=40 Destination Host Unreachable
From 10.200.0.2 icmp_seq=41 Destination Host Unreachable
From 10.200.0.2 icmp_seq=42 Destination Host Unreachable
From 10.200.0.2 icmp_seq=43 Destination Host Unreachable
From 10.200.0.2 icmp_seq=44 Destination Host Unreachable
From 10.200.0.2 icmp_seq=45 Destination Host Unreachable
From 10.200.0.2 icmp_seq=46 Destination Host Unreachable
From 10.200.0.2 icmp_seq=47 Destination Host Unreachable
From 10.200.0.2 icmp_seq=48 Destination Host Unreachable
From 10.200.0.2 icmp_seq=49 Destination Host Unreachable
From 10.200.0.2 icmp_seq=50 Destination Host Unreachable
From 10.200.0.2 icmp_seq=51 Destination Host Unreachable
From 10.200.0.2 icmp_seq=52 Destination Host Unreachable
From 10.200.0.2 icmp_seq=53 Destination Host Unreachable
From 10.200.0.2 icmp_seq=54 Destination Host Unreachable
From 10.200.0.2 icmp_seq=55 Destination Host Unreachable
From 10.200.0.2 icmp_seq=56 Destination Host Unreachable
From 10.200.0.2 icmp_seq=57 Destination Host Unreachable
From 10.200.0.2 icmp_seq=58 Destination Host Unreachable
From 10.200.0.2 icmp_seq=59 Destination Host Unreachable
From 10.200.0.2 icmp_seq=60 Destination Host Unreachable
From 10.200.0.2 icmp_seq=61 Destination Host Unreachable
From 10.200.0.2 icmp_seq=62 Destination Host Unreachable
From 10.200.0.2 icmp_seq=63 Destination Host Unreachable
From 10.200.0.2 icmp_seq=64 Destination Host Unreachable
From 10.200.0.2 icmp_seq=65 Destination Host Unreachable
From 10.200.0.2 icmp_seq=66 Destination Host Unreachable
From 10.200.0.2 icmp_seq=67 Destination Host Unreachable
From 10.200.0.2 icmp_seq=68 Destination Host Unreachable
From 10.200.0.2 icmp_seq=69 Destination Host Unreachable
From 10.200.0.2 icmp_seq=70 Destination Host Unreachable
From 10.200.0.2 icmp_seq=71 Destination Host Unreachable
From 10.200.0.2 icmp_seq=72 Destination Host Unreachable
From 10.200.0.2 icmp_seq=73 Destination Host Unreachable
From 10.200.0.2 icmp_seq=74 Destination Host Unreachable
From 10.200.0.2 icmp_seq=75 Destination Host Unreachable
From 10.200.0.2 icmp_seq=76 Destination Host Unreachable
From 10.200.0.2 icmp_seq=77 Destination Host Unreachable
From 10.200.0.2 icmp_seq=78 Destination Host Unreachable
From 10.200.0.2 icmp_seq=79 Destination Host Unreachable
From 10.200.0.2 icmp_seq=80 Destination Host Unreachable
From 10.200.0.2 icmp_seq=81 Destination Host Unreachable
From 10.200.0.2 icmp_seq=82 Destination Host Unreachable
From 10.200.0.2 icmp_seq=83 Destination Host Unreachable
From 10.200.0.2 icmp_seq=84 Destination Host Unreachable
From 10.200.0.2 icmp_seq=85 Destination Host Unreachable
From 10.200.0.2 icmp_seq=86 Destination Host Unreachable
From 10.200.0.2 icmp_seq=87 Destination Host Unreachable
From 10.200.0.2 icmp_seq=88 Destination Host Unreachable
From 10.200.0.2 icmp_seq=89 Destination Host Unreachable
From 10.200.0.2 icmp_seq=90 Destination Host Unreachable
From 10.200.0.2 icmp_seq=91 Destination Host Unreachable
From 10.200.0.2 icmp_seq=92 Destination Host Unreachable
From 10.200.0.2 icmp_seq=93 Destination Host Unreachable
From 10.200.0.2 icmp_seq=94 Destination Host Unreachable
From 10.200.0.2 icmp_seq=95 Destination Host Unreachable
From 10.200.0.2 icmp_seq=96 Destination Host Unreachable
From 10.200.0.2 icmp_seq=97 Destination Host Unreachable
From 10.200.0.2 icmp_seq=98 Destination Host Unreachable
From 10.200.0.2 icmp_seq=99 Destination Host Unreachable
From 10.200.0.2 icmp_seq=100 Destination Host Unreachable
From 10.200.0.2 icmp_seq=101 Destination Host Unreachable
From 10.200.0.2 icmp_seq=102 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=103 ttl=64 time=1026 ms
64 bytes from 10.200.0.1: icmp_seq=104 ttl=64 time=2.41 ms
64 bytes from 10.200.0.1: icmp_seq=105 ttl=64 time=1.22 ms
64 bytes from 10.200.0.1: icmp_seq=106 ttl=64 time=2.50 ms
64 bytes from 10.200.0.1: icmp_seq=107 ttl=64 time=0.832 ms
64 bytes from 10.200.0.1: icmp_seq=108 ttl=64 time=1.10 ms
64 bytes from 10.200.0.1: icmp_seq=109 ttl=64 time=0.710 ms
64 bytes from 10.200.0.1: icmp_seq=110 ttl=64 time=0.878 ms
64 bytes from 10.200.0.1: icmp_seq=111 ttl=64 time=0.891 ms
64 bytes from 10.200.0.1: icmp_seq=112 ttl=64 time=0.833 ms
^C
--- 10.200.0.1 ping statistics ---
112 packets transmitted, 10 received, +93 errors, 91.0714% packet loss, time 113473ms
rtt min/avg/max/mdev = 0.710/103.761/1026.243/307.494 ms, pipe 4

[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:32:a6:00:00:02 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86309468sec preferred_lft 86309468sec
    inet6 fe80::b00f:71c1:3fa2:140e/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:32:a6:00:00:03 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::32:a6ff:fe00:3/64 scope link
       valid_lft forever preferred_lft forever
Thank you for the clarification on the IP. We still need answers to comments 9 and 11. Thanks.
(In reply to Petr Horáček from comment #9)
> Anat, could you confirm whether this is a regression introduced in 4.8? We
> recommend live-migrating over the bridge network to ensure no connectivity
> drop, so it is alarming that we see this issue.

I managed to recreate the bug on a 2.6.6 cluster. Results:

[fedora@vmb ~]$ ping 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.2 icmp_seq=10 Destination Host Unreachable
From 10.200.0.2 icmp_seq=11 Destination Host Unreachable
From 10.200.0.2 icmp_seq=12 Destination Host Unreachable
From 10.200.0.2 icmp_seq=13 Destination Host Unreachable
From 10.200.0.2 icmp_seq=14 Destination Host Unreachable
From 10.200.0.2 icmp_seq=15 Destination Host Unreachable
From 10.200.0.2 icmp_seq=16 Destination Host Unreachable
From 10.200.0.2 icmp_seq=17 Destination Host Unreachable
From 10.200.0.2 icmp_seq=18 Destination Host Unreachable
From 10.200.0.2 icmp_seq=19 Destination Host Unreachable
From 10.200.0.2 icmp_seq=20 Destination Host Unreachable
From 10.200.0.2 icmp_seq=21 Destination Host Unreachable
From 10.200.0.2 icmp_seq=22 Destination Host Unreachable
From 10.200.0.2 icmp_seq=23 Destination Host Unreachable
From 10.200.0.2 icmp_seq=24 Destination Host Unreachable
From 10.200.0.2 icmp_seq=25 Destination Host Unreachable
From 10.200.0.2 icmp_seq=26 Destination Host Unreachable
From 10.200.0.2 icmp_seq=27 Destination Host Unreachable
From 10.200.0.2 icmp_seq=28 Destination Host Unreachable
From 10.200.0.2 icmp_seq=29 Destination Host Unreachable
From 10.200.0.2 icmp_seq=30 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=32 ttl=64 time=1027 ms
64 bytes from 10.200.0.1: icmp_seq=33 ttl=64 time=3.31 ms
64 bytes from 10.200.0.1: icmp_seq=31 ttl=64 time=2052 ms
64 bytes from 10.200.0.1: icmp_seq=34 ttl=64 time=1.03 ms
64 bytes from 10.200.0.1: icmp_seq=35 ttl=64 time=1.08 ms
^C
--- 10.200.0.1 ping statistics ---
35 packets transmitted, 5 received, +21 errors, 85.7143% packet loss, time 34810ms
rtt min/avg/max/mdev = 1.029/616.855/2051.523/819.955 ms, pipe 4

[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:31:f1:1c brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86313057sec preferred_lft 86313057sec
    inet6 fe80::4ddc:8a67:d50d:ddab/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 16:3f:fc:f1:00:f4 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::143f:fcff:fef1:f4/64 scope link
       valid_lft forever preferred_lft forever
Thanks, that helps with triaging this bug. Adding back the needinfo for https://bugzilla.redhat.com/show_bug.cgi?id=1976578#c13.
(In reply to Edward Haas from comment #11)
> Could you also try to recreate this when KMP is disabled for these VMI/s?

This bug is flaky, so I'm saying this with caution - I wasn't able to recreate the bug when KMP was disabled on the namespace (on a 4.8 cluster).
(In reply to awax from comment #16)
> (In reply to Edward Haas from comment #11)
> > Could you also try to recreate this when KMP is disabled for these VMI/s?
>
> This bug is flaky, so I'm saying this with caution - I wasn't able to
> recreate the bug when KMP was disabled on the namespace (on a 4.8 cluster).

I did manage to reproduce it now:

[fedora@vmb ~]$ ping 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.2 icmp_seq=9 Destination Host Unreachable
From 10.200.0.2 icmp_seq=10 Destination Host Unreachable
From 10.200.0.2 icmp_seq=11 Destination Host Unreachable
From 10.200.0.2 icmp_seq=12 Destination Host Unreachable
From 10.200.0.2 icmp_seq=13 Destination Host Unreachable
From 10.200.0.2 icmp_seq=14 Destination Host Unreachable
From 10.200.0.2 icmp_seq=15 Destination Host Unreachable
From 10.200.0.2 icmp_seq=16 Destination Host Unreachable
From 10.200.0.2 icmp_seq=17 Destination Host Unreachable
From 10.200.0.2 icmp_seq=18 Destination Host Unreachable
From 10.200.0.2 icmp_seq=19 Destination Host Unreachable
From 10.200.0.2 icmp_seq=20 Destination Host Unreachable
From 10.200.0.2 icmp_seq=21 Destination Host Unreachable
From 10.200.0.2 icmp_seq=22 Destination Host Unreachable
From 10.200.0.2 icmp_seq=23 Destination Host Unreachable
From 10.200.0.2 icmp_seq=24 Destination Host Unreachable
From 10.200.0.2 icmp_seq=25 Destination Host Unreachable
From 10.200.0.2 icmp_seq=26 Destination Host Unreachable
From 10.200.0.2 icmp_seq=27 Destination Host Unreachable
From 10.200.0.2 icmp_seq=28 Destination Host Unreachable
From 10.200.0.2 icmp_seq=29 Destination Host Unreachable
From 10.200.0.2 icmp_seq=30 Destination Host Unreachable
From 10.200.0.2 icmp_seq=31 Destination Host Unreachable
From 10.200.0.2 icmp_seq=32 Destination Host Unreachable
From 10.200.0.2 icmp_seq=33 Destination Host Unreachable
From 10.200.0.2 icmp_seq=34 Destination Host Unreachable
From 10.200.0.2 icmp_seq=35 Destination Host Unreachable
From 10.200.0.2 icmp_seq=36 Destination Host Unreachable
From 10.200.0.2 icmp_seq=37 Destination Host Unreachable
From 10.200.0.2 icmp_seq=38 Destination Host Unreachable
From 10.200.0.2 icmp_seq=39 Destination Host Unreachable
From 10.200.0.2 icmp_seq=40 Destination Host Unreachable
From 10.200.0.2 icmp_seq=41 Destination Host Unreachable
From 10.200.0.2 icmp_seq=42 Destination Host Unreachable
From 10.200.0.2 icmp_seq=43 Destination Host Unreachable
From 10.200.0.2 icmp_seq=44 Destination Host Unreachable
From 10.200.0.2 icmp_seq=45 Destination Host Unreachable
From 10.200.0.2 icmp_seq=46 Destination Host Unreachable
From 10.200.0.2 icmp_seq=47 Destination Host Unreachable
From 10.200.0.2 icmp_seq=48 Destination Host Unreachable
From 10.200.0.2 icmp_seq=49 Destination Host Unreachable
From 10.200.0.2 icmp_seq=50 Destination Host Unreachable
From 10.200.0.2 icmp_seq=51 Destination Host Unreachable
From 10.200.0.2 icmp_seq=52 Destination Host Unreachable
From 10.200.0.2 icmp_seq=53 Destination Host Unreachable
From 10.200.0.2 icmp_seq=54 Destination Host Unreachable
From 10.200.0.2 icmp_seq=55 Destination Host Unreachable
From 10.200.0.2 icmp_seq=56 Destination Host Unreachable
From 10.200.0.2 icmp_seq=57 Destination Host Unreachable
From 10.200.0.2 icmp_seq=58 Destination Host Unreachable
From 10.200.0.2 icmp_seq=59 Destination Host Unreachable
From 10.200.0.2 icmp_seq=60 Destination Host Unreachable
From 10.200.0.2 icmp_seq=61 Destination Host Unreachable
From 10.200.0.2 icmp_seq=62 Destination Host Unreachable
From 10.200.0.2 icmp_seq=63 Destination Host Unreachable
From 10.200.0.2 icmp_seq=64 Destination Host Unreachable
From 10.200.0.2 icmp_seq=65 Destination Host Unreachable
From 10.200.0.2 icmp_seq=66 Destination Host Unreachable
From 10.200.0.2 icmp_seq=67 Destination Host Unreachable
From 10.200.0.2 icmp_seq=68 Destination Host Unreachable
From 10.200.0.2 icmp_seq=69 Destination Host Unreachable
From 10.200.0.2 icmp_seq=70 Destination Host Unreachable
From 10.200.0.2 icmp_seq=71 Destination Host Unreachable
From 10.200.0.2 icmp_seq=72 Destination Host Unreachable
From 10.200.0.2 icmp_seq=73 Destination Host Unreachable
From 10.200.0.2 icmp_seq=74 Destination Host Unreachable
From 10.200.0.2 icmp_seq=75 Destination Host Unreachable
From 10.200.0.2 icmp_seq=76 Destination Host Unreachable
From 10.200.0.2 icmp_seq=77 Destination Host Unreachable
From 10.200.0.2 icmp_seq=78 Destination Host Unreachable
From 10.200.0.2 icmp_seq=79 Destination Host Unreachable
From 10.200.0.2 icmp_seq=80 Destination Host Unreachable
From 10.200.0.2 icmp_seq=81 Destination Host Unreachable
From 10.200.0.2 icmp_seq=82 Destination Host Unreachable
From 10.200.0.2 icmp_seq=83 Destination Host Unreachable
From 10.200.0.2 icmp_seq=84 Destination Host Unreachable
From 10.200.0.2 icmp_seq=85 Destination Host Unreachable
From 10.200.0.2 icmp_seq=86 Destination Host Unreachable
From 10.200.0.2 icmp_seq=87 Destination Host Unreachable
From 10.200.0.2 icmp_seq=88 Destination Host Unreachable
From 10.200.0.2 icmp_seq=89 Destination Host Unreachable
From 10.200.0.2 icmp_seq=90 Destination Host Unreachable
From 10.200.0.2 icmp_seq=91 Destination Host Unreachable
From 10.200.0.2 icmp_seq=92 Destination Host Unreachable
From 10.200.0.2 icmp_seq=93 Destination Host Unreachable
From 10.200.0.2 icmp_seq=94 Destination Host Unreachable
From 10.200.0.2 icmp_seq=95 Destination Host Unreachable
From 10.200.0.2 icmp_seq=96 Destination Host Unreachable
From 10.200.0.2 icmp_seq=97 Destination Host Unreachable
From 10.200.0.2 icmp_seq=98 Destination Host Unreachable
From 10.200.0.2 icmp_seq=99 Destination Host Unreachable
From 10.200.0.2 icmp_seq=100 Destination Host Unreachable
From 10.200.0.2 icmp_seq=101 Destination Host Unreachable
From 10.200.0.2 icmp_seq=102 Destination Host Unreachable
From 10.200.0.2 icmp_seq=103 Destination Host Unreachable
From 10.200.0.2 icmp_seq=104 Destination Host Unreachable
From 10.200.0.2 icmp_seq=105 Destination Host Unreachable
From 10.200.0.2 icmp_seq=106 Destination Host Unreachable
From 10.200.0.2 icmp_seq=107 Destination Host Unreachable
From 10.200.0.2 icmp_seq=108 Destination Host Unreachable
From 10.200.0.2 icmp_seq=109 Destination Host Unreachable
From 10.200.0.2 icmp_seq=110 Destination Host Unreachable
From 10.200.0.2 icmp_seq=111 Destination Host Unreachable
From 10.200.0.2 icmp_seq=112 Destination Host Unreachable
From 10.200.0.2 icmp_seq=113 Destination Host Unreachable
From 10.200.0.2 icmp_seq=114 Destination Host Unreachable
From 10.200.0.2 icmp_seq=115 Destination Host Unreachable
From 10.200.0.2 icmp_seq=116 Destination Host Unreachable
From 10.200.0.2 icmp_seq=117 Destination Host Unreachable
From 10.200.0.2 icmp_seq=118 Destination Host Unreachable
From 10.200.0.2 icmp_seq=119 Destination Host Unreachable
From 10.200.0.2 icmp_seq=120 Destination Host Unreachable
From 10.200.0.2 icmp_seq=121 Destination Host Unreachable
From 10.200.0.2 icmp_seq=122 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=125 ttl=64 time=4.47 ms
64 bytes from 10.200.0.1: icmp_seq=124 ttl=64 time=1028 ms
64 bytes from 10.200.0.1: icmp_seq=123 ttl=64 time=2053 ms
64 bytes from 10.200.0.1: icmp_seq=126 ttl=64 time=1.31 ms
64 bytes from 10.200.0.1: icmp_seq=127 ttl=64 time=1.06 ms
64 bytes from 10.200.0.1: icmp_seq=128 ttl=64 time=0.817 ms
^C
--- 10.200.0.1 ping statistics ---
128 packets transmitted, 6 received, +114 errors, 95.3125% packet loss, time 129999ms
rtt min/avg/max/mdev = 0.817/514.774/2052.604/783.243 ms, pipe 4

[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:bd:12:03 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86311137sec preferred_lft 86311137sec
    inet6 fe80::a7b8:96eb:5aaf:4358/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether e2:80:2b:97:54:d5 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::e080:2bff:fe97:54d5/64 scope link
       valid_lft forever preferred_lft forever
[fedora@vmb ~]$ exit
logout
-bash: ssh-dss: command not found
Connection to 192.168.2.83 closed.

[cnv-qe-jenkins@awax-48-3-jvl6x-executor anat_files]$ oc describe namespaces anat-test-migration
Name:         anat-test-migration
Labels:       kubernetes.io/metadata.name=anat-test-migration
              mutatevirtualmachines.kubemacpool.io=ignore
Annotations:  openshift.io/description:
              openshift.io/display-name:
              openshift.io/requester: system:admin
              openshift.io/sa.scc.mcs: s0:c28,c2
              openshift.io/sa.scc.supplemental-groups: 1000760000/10000
              openshift.io/sa.scc.uid-range: 1000760000/10000
Status:       Active

No resource quota.
No LimitRange resource.
Moving to z-stream: this was already observed on 4.8, triage has not finished yet, and we are quite late in the release.
Based on the information collected so far, this does not seem to be solely related to KMP (i.e. enforcing the same MAC address on the source and target pod interfaces), although in theory this may be another layer of the problem.

We have two paths to troubleshoot this now:
- Collect more info when the problem occurs, trying to pinpoint what exactly is missing.
- Troubleshoot it online, on a setup where it is recreated.

I will focus here on the specific info I think may be helpful, which should also be a good baseline for the online troubleshooting. If the ping is not working, we need to check the two guests and the intermediate bridges.

Base info, collected before the migration is started: the MAC addresses, IPs, and routes from both guests (a table with the interfaces as the index would be nice).

After the migration finishes, and during the period the ping fails:
- ARP table from both guests: `ip neigh`
- IPs: `ip addr`
- Routes: `ip route`

If there is a bridge on the node which serves the network, then dumping its MAC table (the list of MAC addresses learned by the bridge on each of its ports) will be very useful.
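A possible collection sketch for that window (everything here is an assumption about where the commands run, and the bridge name br100test is taken from a later comment in this thread):

# Run inside each guest (vma and vmb) while the ping is failing:
for cmd in "ip neigh" "ip addr" "ip route"; do
    echo "=== $cmd ==="
    $cmd
done > /tmp/guest_net_state.log

# Run on each node hosting the bridge, to dump the learned MAC table:
bridge fdb show dev br100test > /tmp/bridge_fdb.log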
Hi Edi,

Here is the requested info.

Before creating the migration instance:

VMA:
[fedora@vma ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a0:84:00:00:00 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86312415sec preferred_lft 86312415sec
    inet6 fe80::72:a25e:37cc:c3ed/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a0:84:00:00:01 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.1/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a0:84ff:fe00:1/64 scope link
       valid_lft forever preferred_lft forever
[fedora@vma ~]$ ip route
default via 10.0.2.1 dev eth0 proto dhcp metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.2 metric 100
10.200.0.0/24 dev eth1 proto kernel scope link src 10.200.0.1 metric 101
[fedora@vma ~]$ ip neigh
10.0.2.1 dev eth0 lladdr 16:e2:36:ad:90:1a REACHABLE
10.200.0.2 dev eth1 lladdr 02:a0:84:00:00:03 REACHABLE

VMB:
[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a0:84:00:00:02 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86312427sec preferred_lft 86312427sec
    inet6 fe80::6444:43f8:f34d:892/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a0:84:00:00:03 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a0:84ff:fe00:3/64 scope link
       valid_lft forever preferred_lft forever
[fedora@vmb ~]$ ip route
default via 10.0.2.1 dev eth0 proto dhcp metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.2 metric 100
10.200.0.0/24 dev eth1 proto kernel scope link src 10.200.0.2 metric 101
[fedora@vmb ~]$ ip neigh
10.0.2.1 dev eth0 lladdr d6:45:36:dc:ed:94 REACHABLE
10.200.0.1 dev eth1 lladdr 02:a0:84:00:00:01 REACHABLE

During ping failure:

VMA:
[fedora@vma ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a0:84:00:00:00 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86232786sec preferred_lft 86232786sec
    inet6 fe80::72:a25e:37cc:c3ed/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a0:84:00:00:01 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.1/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a0:84ff:fe00:1/64 scope link
       valid_lft forever preferred_lft forever
[fedora@vma ~]$ ip route
default via 10.0.2.1 dev eth0 proto dhcp metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.2 metric 100
10.200.0.0/24 dev eth1 proto kernel scope link src 10.200.0.1 metric 101
[fedora@vma ~]$ ip neigh
10.0.2.1 dev eth0 lladdr 16:e2:36:ad:90:1a REACHABLE
10.200.0.2 dev eth1 lladdr 02:a0:84:00:00:03 STALE

VMB:
[fedora@vmb ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a0:84:00:00:02 brd ff:ff:ff:ff:ff:ff
    altname enp1s0
    inet 10.0.2.2/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 86232803sec preferred_lft 86232803sec
    inet6 fe80::6444:43f8:f34d:892/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a0:84:00:00:03 brd ff:ff:ff:ff:ff:ff
    altname enp2s0
    inet 10.200.0.2/24 brd 10.200.0.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a0:84ff:fe00:3/64 scope link
       valid_lft forever preferred_lft forever
[fedora@vmb ~]$ ip route
default via 10.0.2.1 dev eth0 proto dhcp metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.2 metric 100
10.200.0.0/24 dev eth1 proto kernel scope link src 10.200.0.2 metric 101
[fedora@vmb ~]$ ip neigh
10.0.2.1 dev eth0 lladdr 9e:67:b0:19:ce:3d REACHABLE
10.200.0.1 dev eth1 INCOMPLETE

Bridge info (2 bridges are created on 2 nodes):

[cnv-qe-jenkins@n-awax-48-2-7kdn4-executor ~]$ oc get nnce
NAME                                                  STATUS
n-awax-48-2-7kdn4-master-0.migration-worker-1         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-0.migration-worker-2         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-1.migration-worker-1         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-1.migration-worker-2         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-2.migration-worker-1         NodeSelectorNotMatching
n-awax-48-2-7kdn4-master-2.migration-worker-2         NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-4svh5.migration-worker-1   SuccessfullyConfigured
n-awax-48-2-7kdn4-worker-0-4svh5.migration-worker-2   NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-66v2c.migration-worker-1   NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-66v2c.migration-worker-2   NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-llcbl.migration-worker-1   NodeSelectorNotMatching
n-awax-48-2-7kdn4-worker-0-llcbl.migration-worker-2   SuccessfullyConfigured

[core@n-awax-48-2-7kdn4-worker-0-llcbl ~]$ bridge fdb show dev br100test
01:00:5e:00:00:01 self permanent
[core@n-awax-48-2-7kdn4-worker-0-4svh5 ~]$ bridge fdb show dev br100test
01:00:5e:00:00:01 self permanent

For the full bridge fdb results on both nodes, please see the attached files (worker-0-4svh5_bridge_fdb.txt and worker-0-llcbl_bridge_fdb.txt).
I'm sorry for the late response on this one. Thank you for all the information.

I see that the ARP tables on both VMs have not been refreshed/resolved:
`10.200.0.2 dev eth1 lladdr 02:a0:84:00:00:03 STALE`
`10.200.0.1 dev eth1 INCOMPLETE`

Unfortunately, I was not able to find any of these MAC addresses in the attached bridge MAC tables, so I am unable to demystify this.

It looks like the only way forward is to be connected online to the setup while the problem occurs and debug it there. If the test runs with automation, we can plan the steps to be taken in case of failure. I'm unsure whether this is feasible/practical from your point of view; let me know what you think.
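One thing that could narrow this down live - a sketch, assuming the iputils arping tool is available in the guest - is to probe ARP directly from vmb while the ping fails:

# Send ARP requests for vma straight out of the secondary interface.
# No reply would suggest the request or reply is dropped along the bridge path.
$ sudo arping -I eth1 -c 5 10.200.0.1

# In a second session, watch the ARP traffic on eth1 at the same time:
$ sudo tcpdump -i eth1 -n arp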
Update from an offline discussion: QE will create a test for this specific case and will try to use it to collect as much information as possible the next time this happens.
@awax Can you please create the required test to collect the needed data?
After discussing this with Edi, I will create automation to collect the data on v.4.11.
Retargeting to "future", since this waits for more data to be gathered.
Update - I couldn't recreate the bug (using automation) on versions 4.9, 4.10, and 4.11. Currently working on recreating it using automation on v4.8.
I have the automation working on 4.8, and the bug is reproduced there. The automation collects all the requested data before the migration and while the bug is occurring. I'll sync with Edi on how to proceed from here.
@awax there was still no sign of this issue on 4.9+, right? If that's the case, I think we can close this BZ.
@phoracek We didn't run the automation on 4.9 for the past month. We will re-run it on 4.9 after the v4.11 release.
Please reopen this once the issue reappears.