This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2186372 - Packet drops during the initial phase of VM live migration [NEEDINFO]
Summary: Packet drops during the initial phase of VM live migration
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 4.12.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.14.2
Assignee: Alona Kaplan
QA Contact: Yossi Segev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-13 03:42 UTC by nijin ashok
Modified: 2024-03-04 02:14 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-14 16:11:58 UTC
Target Upstream Version:
Embargoed:
fdeutsch: needinfo? (phoracek)




Links
System ID Private Priority Status Summary Last Updated
Github containernetworking plugins issues 951 0 None open Add "activateInterface" option to bridge plugin CNI 2023-10-05 07:41:06 UTC
Red Hat Issue Tracker   CNV-28040 0 None None None 2024-01-30 04:44:04 UTC
Red Hat Issue Tracker CNV-32164 0 None None None 2023-09-14 10:41:57 UTC
Red Hat Knowledge Base (Solution) 7011767 0 None None None 2023-05-09 08:06:27 UTC

Description nijin ashok 2023-04-13 03:42:38 UTC
Description of problem:

When a virtual machine is live migrated, packet drops are observed on the inbound traffic to the VM immediately after the target virt-launcher starts. These packets are routed to the destination node while the migration is still running.

Test: Started a ping to the VM from an external machine during the migration, and collected tcpdump on the source worker node, the destination worker node, and the client machine.

IP address of the VM: 10.74.130.192
MAC: 02:6a:85:00:00:21


~~~
# ping -i 0.5 10.74.130.192

64 bytes from 10.74.130.192: icmp_seq=11 ttl=64 time=0.375 ms
64 bytes from 10.74.130.192: icmp_seq=12 ttl=64 time=0.624 ms
64 bytes from 10.74.130.192: icmp_seq=13 ttl=64 time=0.299 ms
64 bytes from 10.74.130.192: icmp_seq=14 ttl=64 time=63.5 ms

< --- drops -->

64 bytes from 10.74.130.192: icmp_seq=83 ttl=64 time=415 ms
64 bytes from 10.74.130.192: icmp_seq=84 ttl=64 time=11.9 ms  
~~~


The lost packets, as seen in the client packet capture:

~~~
# TZ=UTC tshark -nr client.pcap -t ad icmp
....
....
   49 2023-04-12 04:26:11.881169 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=13/3328, ttl=64 (request in 48)
   50 2023-04-12 04:26:12.380186 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=14/3584, ttl=64
   51 2023-04-12 04:26:12.443677 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=14/3584, ttl=64 (request in 50)
   54 2023-04-12 04:26:12.880854 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=15/3840, ttl=64
   55 2023-04-12 04:26:13.380357 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=16/4096, ttl=64
   58 2023-04-12 04:26:13.881358 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=17/4352, ttl=64
   59 2023-04-12 04:26:14.380506 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=18/4608, ttl=64
   61 2023-04-12 04:26:14.880871 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=19/4864, ttl=64
   62 2023-04-12 04:26:15.380386 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=20/5120, ttl=64
   63 2023-04-12 04:26:15.880623 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=21/5376, ttl=64
.......
.......
.......
  127 2023-04-12 04:26:46.402744 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=82/20992, ttl=64
  129 2023-04-12 04:26:47.316301 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=83/21248, ttl=64
  130 2023-04-12 04:26:47.318223 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=83/21248, ttl=64 (request in 129)
  131 2023-04-12 04:26:47.402238 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=84/21504, ttl=64
  132 2023-04-12 04:26:47.414150 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=84/21504, ttl=64 (request in 131)
~~~

These packets actually reached the destination node although the migration was still running:

~~~
The destination node, seq 15 - 82 reached here:

# TZ=UTC tshark -nr worker1_dst.pcap -t ad icmp
    3 2023-04-12 04:26:12.878671 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=15/3840, ttl=64
    4 2023-04-12 04:26:13.378150 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=16/4096, ttl=64
    7 2023-04-12 04:26:13.879223 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=17/4352, ttl=64
    8 2023-04-12 04:26:14.378311 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=18/4608, ttl=64
   10 2023-04-12 04:26:14.878612 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=19/4864, ttl=64
   11 2023-04-12 04:26:15.378144 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=20/5120, ttl=64
   12 2023-04-12 04:26:15.878365 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=21/5376, ttl=64
   13 2023-04-12 04:26:16.378102 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=22/5632, ttl=64
....
....
....


Source Node, seq 15 - 82 missing

   48 2023-04-12 04:26:11.884612 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=13/3328, ttl=64
   49 2023-04-12 04:26:11.884731 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=13/3328, ttl=64 (request in 48)
   50 2023-04-12 04:26:12.389793 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=14/3584, ttl=64
   51 2023-04-12 04:26:12.447208 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=14/3584, ttl=64 (request in 50)

< seq 15 - 82 missing >

   57 2023-04-12 04:26:47.320115 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=83/21248, ttl=64
   58 2023-04-12 04:26:47.320579 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=83/21248, ttl=64 (request in 57)
   59 2023-04-12 04:26:47.413168 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=84/21504, ttl=64
   60 2023-04-12 04:26:47.416380 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=84/21504, ttl=64 (request in 59)
   61 2023-04-12 04:26:47.907042 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=85/21760, ttl=64
   62 2023-04-12 04:26:47.907207 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=85/21760, ttl=64 (request in 61)
   63 2023-04-12 04:26:48.407203 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=86/22016, ttl=64

~~~

The domain was being migrated during this time, and hence the destination VM was in paused state:

~~~
oc logs virt-launcher-rhel8-d58yi5fym85626yq-h76wk |grep "kubevirt domain status"
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):StartingUp(11)","pos":"client.go:289","timestamp":"2023-04-12T04:26:15.582630Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):Migration(2)","pos":"client.go:289","timestamp":"2023-04-12T04:26:16.244198Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):Migration(2)","pos":"client.go:289","timestamp":"2023-04-12T04:28:30.799153Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):Migration(2)","pos":"client.go:289","timestamp":"2023-04-12T04:28:30.832917Z"}

< --- Migration completed -->

{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(2)","pos":"client.go:289","timestamp":"2023-04-12T04:28:30.883757Z"}
~~~

It looks like traffic gets redirected to the destination node while the migration is still going on because of the IPv6 multicast packets below, which originate from 02:6a:85:00:00:21 (the MAC address of the VM interface).

~~~
   48 2023-04-12 04:26:11.880892 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=13/3328, ttl=64
   49 2023-04-12 04:26:11.881169 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=13/3328, ttl=64 (request in 48)
   50 2023-04-12 04:26:12.380186 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=14/3584, ttl=64
   51 2023-04-12 04:26:12.443677 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=14/3584, ttl=64 (request in 50)
   52 2023-04-12 04:26:12.470232           :: → ff02::16     90 Multicast Listener Report Message v2                         <<<
   53 2023-04-12 04:26:12.782278           :: → ff02::1:ff00:21 86 Neighbor Solicitation for fe80::6a:85ff:fe00:21           <<<
   54 2023-04-12 04:26:12.880854 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=15/3840, ttl=64        <<< ping routed to dest node
   55 2023-04-12 04:26:13.380357 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=16/4096, ttl=64
   56 2023-04-12 04:26:13.798396 fe80::6a:85ff:fe00:21 → ff02::16     90 Multicast Listener Report Message v2
   57 2023-04-12 04:26:13.798452 fe80::6a:85ff:fe00:21 → ff02::2      70 Router Solicitation from 02:6a:85:00:00:21
   58 2023-04-12 04:26:13.881358 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=17/4352, ttl=64
   59 2023-04-12 04:26:14.380506 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=18/4608, ttl=64
   60 2023-04-12 04:26:14.390271           :: → ff02::1:ff00:21 86 Neighbor Solicitation for 2620:52:0:4a80:6a:85ff:fe00:21
   61 2023-04-12 04:26:14.880871 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=19/4864, ttl=64
   62 2023-04-12 04:26:15.380386 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=20/5120, ttl=64
   63 2023-04-12 04:26:15.880623 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=21/5376, ttl=64

Packet 52:

52	2023-04-12 09:56:12.470232	::	ff02::16	90	Multicast Listener Report Message v2
Frame 52: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Ethernet II, Src: 02:6a:85:00:00:21 (02:6a:85:00:00:21), Dst: IPv6mcast_16 (33:33:00:00:00:16)       <<<
Internet Protocol Version 6, Src: ::, Dst: ff02::16
Internet Control Message Protocol v6

Packet 53:

53	2023-04-12 09:56:12.782278	::	ff02::1:ff00:21	86	Neighbor Solicitation for fe80::6a:85ff:fe00:21
Frame 53: 86 bytes on wire (688 bits), 86 bytes captured (688 bits)
Ethernet II, Src: 02:6a:85:00:00:21 (02:6a:85:00:00:21), Dst: IPv6mcast_ff:00:00:21 (33:33:ff:00:00:21)
Internet Protocol Version 6, Src: ::, Dst: ff02::1:ff00:21
Internet Control Message Protocol v6
~~~ 

These IPv6 multicast packets originate from the destination virt-launcher pod, apparently when the pod performs IPv6 neighbor discovery. The virt-launcher pod carries the VM's MAC before it creates the bridge and passes the MAC on to the VM.

~~~
net1 has 02:6a:85:00:00:21 before the bridge is created.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if198: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 0a:58:0a:83:00:9a brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.154/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::d023:79ff:fe49:79d/64 scope link
       valid_lft forever preferred_lft forever
4: net1@if199: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default     <<<<
    link/ether 02:6a:85:00:00:21 brd ff:ff:ff:ff:ff:ff link-netnsid 0      <<<<
    inet6 fe80::6a:85ff:fe00:21/64 scope link tentative
       valid_lft forever preferred_lft forever
~~~

The packets are routed to the destination until the client does an ARP discovery again:

~~~
TZ=UTC tshark -nr client.pcap -t ad

  125 2023-04-12 04:26:45.611473 18:66:da:9f:b3:b9 → 02:6a:85:00:00:21 42 Who has 10.74.130.192? Tell 10.74.128.144
  126 2023-04-12 04:26:45.903198 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=81/20736, ttl=64
  127 2023-04-12 04:26:46.402744 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=82/20992, ttl=64
  128 2023-04-12 04:26:47.316291 02:6a:85:00:00:21 → 18:66:da:9f:b3:b9 60 10.74.130.192 is at 02:6a:85:00:00:21
  129 2023-04-12 04:26:47.316301 10.74.128.144 → 10.74.130.192 98 Echo (ping) request  id=0xa732, seq=83/21248, ttl=64
  130 2023-04-12 04:26:47.318223 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=83/21248, ttl=64 (request in 129)
~~~

Once the migration completes, we can see the expected RARP from the destination node, and the network then routes the packets to the destination node:

~~~
TZ=UTC tshark -nr client.pcap -t ad

   77 2023-04-12 04:28:30.835510 02:6a:85:00:00:21 → ff:ff:ff:ff:ff:ff 60 Who is 02:6a:85:00:00:21? Tell 02:6a:85:00:00:21
   78 2023-04-12 04:28:30.835539 02:6a:85:00:00:21 → ff:ff:ff:ff:ff:ff 60 Who is 02:6a:85:00:00:21? Tell 02:6a:85:00:00:21
   79 2023-04-12 04:28:30.835553 02:6a:85:00:00:21 → ff:ff:ff:ff:ff:ff 60 Who is 02:6a:85:00:00:21? Tell 02:6a:85:00:00:21
   80 2023-04-12 04:28:30.848542 02:6a:85:00:00:21 → 18:66:da:9f:b3:b9 42 Who has 10.74.128.144? Tell 10.74.130.192
   81 2023-04-12 04:28:30.848743 18:66:da:9f:b3:b9 → 02:6a:85:00:00:21 60 10.74.128.144 is at 18:66:da:9f:b3:b9
   83 2023-04-12 04:28:30.851173 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=21/5376, ttl=64 (request in 12)
   84 2023-04-12 04:28:30.851352 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=22/5632, ttl=64 (request in 13)
   85 2023-04-12 04:28:30.851454 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=23/5888, ttl=64 (request in 14)
   86 2023-04-12 04:28:30.851541 10.74.130.192 → 10.74.128.144 98 Echo (ping) reply    id=0xa732, seq=24/6144, ttl=64 (request in 15)
~~~


Version-Release number of selected component (if applicable):

OpenShift Virtualization   4.12.0

How reproducible:

100%

Steps to Reproduce:

1. Start a ping to the VM during the VM migration. Use -i in ping to shorten the interval between packets (see the sketch below).
2. Observe packet drops for a few seconds when the destination virt-launcher starts.
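
A rough sketch of the reproduction flow, using the VM name, IP and MAC from this report; the uplink interface name is a placeholder and virtctl must be available:

~~~
# 1. From an external client, ping the VM with a short interval
ping -i 0.2 10.74.130.192

# 2. In parallel, capture traffic on the client and on both worker nodes
#    (<uplink> is a placeholder for the relevant interface)
tcpdump -i <uplink> -w client.pcap 'ether host 02:6a:85:00:00:21 or icmp'

# 3. Trigger the live migration of the VM
virtctl migrate rhel8-d58yi5fym85626yq

# 4. Drops start as soon as the destination virt-launcher pod comes up,
#    well before the migration completes
~~~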

Actual results:

Packet drops during the initial phase of VM live migration. These drops are in addition to the drops during the final stage of migration, where the source QEMU is paused to move the last remaining memory pages to the destination QEMU. As a result, the user experiences more network downtime than on other platforms such as RHV.


Expected results:

No packet drops during the initial phase of live migration; network downtime should be limited to the final switchover of the migration.

Additional info:

Comment 3 sgott 2023-04-19 12:14:22 UTC
The basic issue here is that there is, by nature, a race condition between when the cluster is able to re-assign the service endpoint and when the migrating VM switches its seat of identity (I just coined that phrase for this conversation) between the source and destination node. I don't know if there's anything we could do to coordinate these two events better, but I'm re-assigning the component to Networking, as my guess is they might have better insight into the matter. Please feel free to re-assign if this isn't the best move; this one really is a gray area in terms of ownership.

Comment 4 nijin ashok 2023-04-24 05:22:22 UTC
(In reply to sgott from comment #3)
> The basic issue here is that there by nature exists a race condition between
> when the cluster is able to re-assign the service endpoint and when the
> migrating VM switches its seat of identity (I just coined that phrase for
> this conversation) between the source and destination node.

To clarify, this happens when using a Linux bridge network, where no Kubernetes Service is involved.

Comment 5 Petr Horáček 2023-06-02 11:30:27 UTC
This may be an issue caused by us having a NIC in the destination virt-launcher with a MAC identical to the NIC of the source guest VM for a brief moment after the virt-launcher is started, but before virt-handler reassigns the MAC.

The team will look into this. Thanks for providing such a detailed analysis of what's happening!

Comment 6 Germano Veit Michel 2023-08-09 01:35:38 UTC
(In reply to Petr Horáček from comment #5)
> This may be an issue caused by us having a NIC in the destination
> virt-launcher with a MAC identical to the NIC of the source guest VM for a
> brief moment after the virt-launcher is started, but before virt-handler
> reassigns the MAC.

Yup, it is caused by creating the network plumbing on the destination host: the Linux kernel automatically sends an IPv6 ND NS/NA on device creation.

This goes out on pod creation (not just virt-launchers; any pod does this):
00:59:17.488074 02:3b:54:00:00:01 > 33:33:00:00:00:02, ethertype IPv6 (0x86dd), length 70: fe80::3b:54ff:fe00:1 > ff02::2: ICMP6, router solicitation, length 16

Then all switches update their tables with the "new" 02:3b:54:00:00:01 location.

That is not a problem for most pods, but it is unfortunate for VM live migration: a frame that updates the network should only go out when the VM runs on the destination. This is supposed to be done by the QEMU GARP/RARP, but that only goes out much later, during announce_self(), when the VM really runs on the destination.

00:59:19.451322 02:3b:54:00:00:01 > ff:ff:ff:ff:ff:ff, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is 02:3b:54:00:00:01 tell 02:3b:54:00:00:01, length 46

However, the network was already updated by the IPv6 ND earlier, and traffic redirected.

The behaviour of the pod coming up can be reproduced like this:

# ip link add address 00:11:22:33:44:55 name test0 type veth peer name test1 address 00:11:22:33:44:56
# ip link set dev test0 master virt.home.arpa
# ip link set dev test0 up 
# ip link set dev test1 up 

Here the ND NS goes out, which updates the switches :(

01:25:55.150023 00:11:22:33:44:56 > 33:33:ff:33:44:56, ethertype IPv6 (0x86dd), length 86: :: > ff02::1:ff33:4456: ICMP6, neighbor solicitation, who has fe80::211:22ff:fe33:4456, length 32

You probably need to have multus/CNI set NOARP on the interfaces when creating the plumbing:

# ip link add address 00:11:22:33:44:55 name test0 type veth peer name test1 address 00:11:22:33:44:56
# ip link set dev test0 arp off
# ip link set dev test1 arp off
# ip link set dev test1 up 
# ip link set dev test0 up 

<NO IPv6 ND is seen>

And after this, the update to the network will be done by QEMU via the virtio announce, when the VM actually runs.
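
As a verification sketch (not part of the report itself), the difference between the two sequences above can be confirmed by watching the bridge while the veth pair is brought up, assuming the bridge name virt.home.arpa from the example:

~~~
# With the defaults, the ND NS/MLD frames from the new veth MACs show up here
# as soon as the links go up; with "arp off" set first, no ND NS is seen.
tcpdump -i virt.home.arpa -e -n 'icmp6 or arp'
~~~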

Comment 8 Edward Haas 2023-08-10 13:51:13 UTC
Thank you Germano for the detail investigation.

We had an initial discussion to evaluate this issue.
Here are the main ideas raised:
- Send IPv6 ND NS/NA frequently during migration from the source virt-launcher.
  - As a workaround before a formal fix, this could be tried using a sidecar.
- Use the tuning CNI meta plugin to set the NOARP option for "all" and "default" interfaces.
  It needs to be added to the plugin chain as the top plugin in the NetworkAttachmentDefinition.
- Change the specific CNI plugin (e.g. the bridge CNI) to accept the initial link state or the NOARP option.

Action Items:
- Test the tuning CNI meta plugin and see if it behaves as expected.

Comment 9 Germano Veit Michel 2023-08-10 21:05:27 UTC
(In reply to Edward Haas from comment #8)
> Thank you Germano for the detail investigation.
My pleasure.
> 
> We had an initial discussion to evaluated this issue.
> Here are the main ideas raised:
> - Send IPv6 ND NS/NA frequently during migration from the source
> virt-launcher.

Or a GARP, so the intent is clearer in case someone is looking at a tcpdump and wondering what that additional IPv6 ND spam, now coming from two different places, is :)

>   - A workaround before a formal version, can be tried to do using a sidecar.
> - Use the tuning CNI meta plugin to set the NOARP option for "all" and
> "default" interfaces.
>   It needs to be added to the plugin chain as the top plugin in the
> NetworkAttachementDefinition.
> - Change the specific CNI plugin (e.g. the bridge CNI) to accept the initial
> link state or the NOARP option.
> 
> Action Items:
> - Test the tuning CNI meta plugin and see if it behaves as expected.

Frankly, I don't think this is really a CNV-specific problem; it just affects CNV more than others.
It seems to be generally bad OCP behaviour: checking my network, I see dozens of random MACs spammed per day on a pretty much idle cluster, all coming from OCP nodes. It is just a matter of time and luck until it spams a MAC that already exists on the network, and in the CNV case the same spamming breaks traffic on live migration because it spams the VM MAC. Maybe OCP should be fixed.

Comment 10 Edward Haas 2023-08-13 12:44:01 UTC
(In reply to Germano Veit Michel from comment #9)
> > - Send IPv6 ND NS/NA frequently during migration from the source
> > virt-launcher.
> 
> Or a GARP, so its more clear about the intentions in case someone is looking
> at a tcpdump and wondering what is that additional IPv6 ND spam coming from
> 2 different places now :)

Yes, although unlike IPv6, that would have to be sent with a broadcast source address.
I am not even sure it is "legal" to send such packets (they may be dropped due to the non-unicast source).

> > Action Items:
> > - Test the tuning CNI meta plugin and see if it behaves as expected.
> 
> Frankly, I don't think this is really a CNV specific problem, it just
> affects CNV more than others. 
> It seems to be a general bad OCP behaviour, checking my network I have
> dozens of random MAC spammed per day on a pretty much idle cluster, all
> coming from OCP nodes. It just a matter of time and luck until it spams a
> MAC that already exists on the network, and in CNV case the same spamming
> breaks traffic on live migration as it spams the VM MAC. Maybe OCP should be
> fixed.

Each pod creates a network end-point device which announces itself.
This seems to be the default behavior of the kernel, unrelated to OCP.
For IP traffic this indeed seems wasteful, as address resolution is expected anyway, and with it comes the learning.

I am not really sure why this default was chosen; we should probably ask the kernel network devs about it.

We can pursue this in all directions; it all depends on the response time.

Comment 11 Germano Veit Michel 2023-08-13 21:15:51 UTC
(In reply to Edward Haas from comment #10)
> (In reply to Germano Veit Michel from comment #9)
> > > - Send IPv6 ND NS/NA frequently during migration from the source
> > > virt-launcher.
> > 
> > Or a GARP, so its more clear about the intentions in case someone is looking
> > at a tcpdump and wondering what is that additional IPv6 ND spam coming from
> > 2 different places now :)
> 
> Yes, although unlike IPv6, that would have to be sent with a broadcast
> source address.
> I am not even sure it is "legal" to send such packets (they may be dropped
> due to the
> non unicast source).

The source address (L2) is unicast and should be the MAC of the VM, otherwise it won't have the effect we want.

Note that QEMU sends these at the end of migration, and VMware and others do exactly the same, so it is unlikely they will be dropped.

> 
> > > Action Items:
> > > - Test the tuning CNI meta plugin and see if it behaves as expected.
> > 
> > Frankly, I don't think this is really a CNV specific problem, it just
> > affects CNV more than others. 
> > It seems to be a general bad OCP behaviour, checking my network I have
> > dozens of random MAC spammed per day on a pretty much idle cluster, all
> > coming from OCP nodes. It just a matter of time and luck until it spams a
> > MAC that already exists on the network, and in CNV case the same spamming
> > breaks traffic on live migration as it spams the VM MAC. Maybe OCP should be
> > fixed.
> 
> Each pod is creating a network end-point device which is announcing itself.
> This seems to be the default behavior of the kernel, unrelated to OCP.
> For IP traffic, this seems indeed a waste, as address resolution is expected
> anyway, with it comes the learning.

Right, but this behaviour has been there for years, for much longer than OCP has existed.
I'd say OCP should have set saner defaults when creating all the mesh of devices it uses for the network plumbing. Well, not really OCP, but the related subcomponents.

Still, it seems that this network spamming wasn't really a problem until we had VMs.

Also, it appears that when running on other platforms (e.g. vSphere) we don't get that spam; it is maybe dropped at the hypervisor by MAC filtering.

> I am not really sure why this default has been decided on, we should
> probably ask
> the kernel network devs about it.
> 
> We can pursue this in all direction, it all depends on the response time.

Just my 2c, but I think changing a decades-old default may break other things, so this should be fixed somewhere in OCP.

Comment 16 Edward Haas 2023-08-16 15:04:55 UTC
(In reply to Edward Haas from comment #8)
> We had an initial discussion to evaluated this issue.
> Here are the main ideas raised:
> - Send IPv6 ND NS/NA frequently during migration from the source
> virt-launcher.
>   - A workaround before a formal version, can be tried to do using a sidecar.
> - Use the tuning CNI meta plugin to set the NOARP option for "all" and
> "default" interfaces.
>   It needs to be added to the plugin chain as the top plugin in the
> NetworkAttachementDefinition.
> - Change the specific CNI plugin (e.g. the bridge CNI) to accept the initial
> link state or the NOARP option.

A possible workaround is to make sure the guest keeps generating egress traffic out of the relevant interface (a sketch follows below).

We likely do not see this problem widely because, during live migration, the guest usually keeps sending traffic out, causing the whole network to quickly re-learn the path to the MAC.
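
A minimal sketch of such a workaround from inside the guest, assuming iputils arping is installed; the interface name and IP are the ones from this report and are placeholders for a real deployment:

~~~
# Keep announcing the guest's own IP/MAC with gratuitous (unsolicited) ARP
# while the migration is running, so the switches keep learning the source path
while true; do
    arping -U -c 1 -I eth0 10.74.130.192
    sleep 1
done
~~~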

Comment 17 Germano Veit Michel 2023-08-16 21:39:42 UTC
(In reply to Edward Haas from comment #16)
> We likely do not see this problem widely because during live migration
> the guest is still sending traffic out, causing all the network to re-learn
> the path to the mac quickly.

In a simple network with a single switch, yes. But that's not the case in data centers.

Some common data-center network topologies, like spine-leaf/trees, will only update the MAC address port in the switch tables if the guest sends a broadcast frame.
It may take a while for the guest to send a broadcast frame, and if it only sends unicast to a particular system it will only update the switches on that particular end-to-end path, leaving many others with the wrong entry.
So even with the guest sending traffic, it can still cause a network outage for a subset of systems.
It would take a broadcast, or many unicast frames in different directions, to update the network.

Note that the IPv6 ND goes out with the group bit set in the destination MAC and floods the wrong information to all switches; it would take something similar to repair the entries until the VM is fully migrated.

Comment 18 Petr Horáček 2023-09-04 10:27:24 UTC
Nijin, could you help us assess the urgency of this? Is the workaround of the guest continuously emitting ARP an acceptable temporary solution?

Comment 25 Alona Kaplan 2023-10-02 15:55:44 UTC
(In reply to Edward Haas from comment #8)
> Thank you Germano for the detail investigation.
> 
> We had an initial discussion to evaluated this issue.
> Here are the main ideas raised:
> - Send IPv6 ND NS/NA frequently during migration from the source
> virt-launcher.
>   - A workaround before a formal version, can be tried to do using a sidecar.
> - Use the tuning CNI meta plugin to set the NOARP option for "all" and
> "default" interfaces.
>   It needs to be added to the plugin chain as the top plugin in the
> NetworkAttachementDefinition.
> - Change the specific CNI plugin (e.g. the bridge CNI) to accept the initial
> link state or the NOARP option.
> 
> Action Items:
> - Test the tuning CNI meta plugin and see if it behaves as expected.

The tuning CNI cannot be added to the plugin chain as the top plugin since it doesn't return any result (the error "Required prevResult missing" is returned when trying to put tuning at the top of the chain).
Even if it were possible, it wouldn't necessarily be a good idea.

I"m not sure why "ip link set dev ifaceName arp off" disables "neighbor solicitation" messages (ipv6 is not using ARP), but even with the "arp off" I still see other ipv6 traffic when setting the interface to "up"-

12:29:38.260546 test1 Out IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 2 group record(s), length 48
12:29:38.260552 test0 M   IP6 :: > ff02::16: HBH ICMP6, multicast listener report v2, 2 group record(s), length 48

I couldn't find any sysctl that disables "neighbor solicitation" the same way "arp off" does. The only option I could find is to disable IPv6 via the sysctl "net.ipv6.conf.all.disable_ipv6".
I don't think it is a good idea to disable IPv6 for all the interfaces in the pod; some may actually need and use it, especially the primary one.
And as mentioned before, tuning cannot be the first plugin in the chain, so even this option cannot be applied via tuning.

I think the best option is to ask the bridge CNI to create the interface in the down state and have KubeVirt move it to up only once QEMU is ready.
I will open an RFE in the CNI plugins repo and link it here.
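
To illustrate the proposal, a sketch based on the veth reproduction from comment 6 (bridge and MAC values reused from there): as long as the links stay down the kernel announces nothing, and the activation, and with it the ND/MLD burst, is deferred until QEMU is ready.

~~~
# Create the plumbing but leave both ends down - nothing is announced yet
ip link add address 00:11:22:33:44:55 name test0 type veth peer name test1 address 00:11:22:33:44:56
ip link set dev test0 master virt.home.arpa

# ... later, only once QEMU is ready, activate the links
# (this is the point where the ND/MLD frames finally go out)
ip link set dev test1 up
ip link set dev test0 up
~~~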

Comment 27 Alona Kaplan 2023-10-02 17:12:53 UTC
Bridge plugin CNI issue to introduce "activateInterface" option - https://github.com/containernetworking/plugins/issues/951
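
For illustration only, this is roughly how the proposed option might be consumed from a NetworkAttachmentDefinition if it is accepted upstream. The option name comes from the linked issue; its exact shape and semantics here are assumptions, and it is not available in released bridge CNI plugins.

~~~
# Hypothetical sketch: "activateInterface" is only *proposed* in
# containernetworking/plugins#951 and does not exist in released bridge CNI.
cat <<'EOF' | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vm-bridge
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vm-bridge",
      "type": "bridge",
      "bridge": "br1",
      "activateInterface": false
    }
EOF
~~~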

