OSP Tracker - Multichassis ports on localnet-attached logical switches should receive ICMP Path Discovery hints if their effective MTU is lower than localnet MTU
Description
Ihar Hrachyshka
2023-03-22 18:02:02 UTC
This bug was initially created as a copy of Bug #2180955
I am copying this bug because the underlying OVN bug affects the live migration scenario in OSP, where we use multichassis ports to reduce network downtime when a libvirt live migration switches to a new node. (This feature has been in use since 17.1.)
I expect this bug to become a blocker for 17.1.
This bug is to track the fix in OVN. We should test the fix in OSP for the live migration scenario.
===
Description of problem:
When a port is multichassis (requested-chassis is a comma-separated list) and the LS that the port belongs to has a localnet port, traffic to and from the multichassis port is tunneled anyway. (This is done to guarantee delivery of packets destined to the port MAC address to all its locations.)
This enforced tunneling may be a problem if the effective MTU for the port becomes different from the theoretical MTU of the physical network that underlies the LS (defined by the MTU of the localnet port in the same switch). In this case, the port should not communicate with the outside world using the max MTU.
The proposal here is for the OVN controller to set up ICMP Path MTU Discovery replies to oversized packets received from a multichassis port, so that the port owner is aware of the change in circumstances and can adjust their effective MTU accordingly.
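To illustrate the idea only (this is not the OVN patch, which implements the behaviour in the datapath), the decision and the reply can be sketched in Python. The sketch assumes IPv4, ignores the DF bit, and takes the tunnel overhead as a caller-supplied parameter; the function names and the `tunnel_overhead` value are hypothetical. The reply layout is the standard ICMP "Destination Unreachable / Fragmentation Needed" message (type 3, code 4) carrying the next-hop MTU.

```python
import struct

ICMP_DEST_UNREACH = 3   # ICMP type: Destination Unreachable
ICMP_FRAG_NEEDED = 4    # ICMP code: Fragmentation Needed and DF set


def inet_checksum(data: bytes) -> int:
    """Standard Internet checksum (one's complement sum of 16-bit words)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def effective_mtu(localnet_mtu: int, tunnel_overhead: int) -> int:
    """MTU left for the port once tunnel encapsulation is accounted for.

    tunnel_overhead is an assumption supplied by the caller; the real
    value depends on the encapsulation in use.
    """
    return localnet_mtu - tunnel_overhead


def frag_needed_reply(original_ip_packet: bytes, next_hop_mtu: int) -> bytes:
    """Build the ICMP message body (without the outer IP header)."""
    ihl = (original_ip_packet[0] & 0x0F) * 4
    # The reply quotes the original IP header plus the first 8 payload bytes.
    quoted = original_ip_packet[: ihl + 8]
    header = struct.pack("!BBHHH", ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                         0, 0, next_hop_mtu)
    csum = inet_checksum(header + quoted)
    return struct.pack("!BBHHH", ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                       csum, 0, next_hop_mtu) + quoted


def pmtud_hint(ip_packet: bytes, localnet_mtu: int, tunnel_overhead: int):
    """Return an ICMP hint for an oversized packet, or None if it fits."""
    mtu = effective_mtu(localnet_mtu, tunnel_overhead)
    if len(ip_packet) <= mtu:
        return None
    return frag_needed_reply(ip_packet, mtu)
```

A sender that receives such a reply is expected to lower its path MTU for that destination; for IPv6 the equivalent signal would be an ICMPv6 Packet Too Big message, which this sketch does not cover.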
The problem was originally discussed in ovs-dev ML: https://www.mail-archive.com/ovs-dev@openvswitch.org/msg68204.html
Implementation was proposed at: https://mail.openvswitch.org/pipermail/ovs-dev/2022-November/398981.html
This bug is to take the patch over and get it tested / merged in OVN.
The bug affects the OSP live migration scenario for VMs attached to physical networks (= switches with a localnet port).
I expect this bug to become a blocker for OSP 17.1 because of its effect on the live migration scenario.
Status:
I'm still waiting for the OVN core team to look into it upstream. I'm asking for reviews on every occasion (in upstream meetings and elsewhere). I've explicitly asked them to review the series as part of their next release process (soft-freeze was just announced). I hope we'll get some attention in the next few days...
There's no automation on the OSP side for this scenario.
I assume that verification of the issue on the OSP side would involve manual checks:
- start a VM on a VLAN network
- establish a TCP session with iperf
- start live migration
- make sure that the session doesn't get degraded / dropped during the process
(- confirm the migration is complete)
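The "session doesn't get degraded" check above could be made less subjective with a small helper like the following. This is only a sketch: it assumes the session was run with iperf3 and its JSON output was saved (the steps above only say "iperf"), and the `degraded_intervals` name and the 50% threshold are hypothetical choices, not an agreed pass/fail criterion.

```python
from statistics import median


def degraded_intervals(report: dict, min_fraction: float = 0.5) -> list:
    """Return indexes of reporting intervals whose throughput fell below
    min_fraction of the median throughput for the whole run.

    report is the parsed JSON document produced by an iperf3 run;
    min_fraction is an arbitrary threshold chosen for illustration.
    """
    rates = [iv["sum"]["bits_per_second"] for iv in report["intervals"]]
    if not rates:
        return []
    baseline = median(rates)
    return [i for i, rate in enumerate(rates) if rate < min_fraction * baseline]
```

An empty result for a run that spans the whole live migration would suggest the session was not noticeably degraded; intervals at or near zero would point at the window where traffic was dropped.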