Bug 1909930 - Flows are not updated for trunk ports on VM migration for the qrouters
Summary: Flows are not updated for trunk ports on VM migration for the qrouters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Assignee: Slawek Kaplonski
QA Contact: Candido Campos
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-22 05:00 UTC by Brendan Shephard
Modified: 2022-08-30 14:50 UTC
CC List: 7 users

Fixed In Version: openstack-neutron-12.1.1-39.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-16 10:58:54 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1914747 0 None None None 2021-02-05 14:34:10 UTC
OpenStack gerrit 774245 0 None MERGED Fix update of trunk subports during live migration 2021-02-16 08:14:40 UTC
Red Hat Issue Tracker OSP-1373 0 None None None 2022-08-30 14:50:25 UTC
Red Hat Product Errata RHBA-2021:2385 0 None None None 2021-06-16 10:59:26 UTC

Description Brendan Shephard 2020-12-22 05:00:05 UTC
Description of problem:
We have OpenShift running with the Kuryr CNI, which heavily utilises Neutron trunk ports. When we live migrate one of the OpenShift VMs, we can see that the OpenFlow rules are updated on the hypervisors, but on the node hosting the qrouter, the flow rules still send the traffic to the original hypervisor.

Running tcpdump on the VM's tap device, we can see the ARP request heading to the qrouter but no reply. Running tcpdump in the qrouter namespace on the VM's gateway interface, we can see the ARP request AND the reply. Additionally, if we add a static ARP entry in the VM namespace and ping the router, we can see the ICMP request and response on the qrouter, but the response is sent over the VXLAN tunnel back to the original hypervisor from before the live migration.
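For reference, a rough sketch of the commands used for that diagnosis (the interface names, router UUID, and addresses below are placeholders, not the real values from this environment):

  # On the destination hypervisor: watch ARP leaving the VM's tap device
  tcpdump -nei tapXXXXXXXX-XX arp

  # On the networker node: watch ARP/ICMP inside the qrouter namespace on the VM's gateway port
  ip netns exec qrouter-<router-uuid> tcpdump -nei qr-XXXXXXXX-XX 'arp or icmp'

  # Inside the VM: add a static ARP entry for the gateway and ping it
  ip neigh replace 192.0.2.1 lladdr fa:16:3e:xx:xx:xx dev eth0
  ping -c 3 192.0.2.1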

Version-Release number of selected component (if applicable):
registry.access.redhat.com/rhosp13/openstack-neutron-openvswitch-agent:13.0-125
registry.access.redhat.com/rhosp13/openstack-neutron-l3-agent:13.0-125

How reproducible:
I don't have adequate infrastructure to deploy OCP + Kuryr CNI. But in this environment, we seem to be reliably reproducing it by simply live migrating OCP Infra nodes.

Steps to Reproduce:
1. Deploy OCP3.11 + Kuryr
2. Live migrate an Infra node VM
3. Observe pods failing to start, complaining about "No route to host" (see the example commands below)
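A quick way to spot the failing pods and the error (the pod and namespace names are only examples):

  oc get pods --all-namespaces | grep -vE 'Running|Completed'
  oc describe pod <failing-pod> -n <namespace> | grep -i 'no route to host'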

Actual results:
Pods can't start because they are unable to reach the k8s API endpoint. Checking the OpenFlow rules, we can see that the path TO the qrouter is fine, but the path back to the VM goes via the VXLAN tunnel to the original hypervisor.

Expected results:
OpenFlow rules should be updated on both sides to reflect the migration.

Attachments will be provided, including all OpenFlow rules from both sides as well as the ovs-appctl ofproto/trace outputs.
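For anyone reproducing this, the same data can be gathered roughly like this (the bridge names and the match fields in the trace are illustrative; the actual values are in the attachments):

  # Dump the OpenFlow tables on the integration and tunnel bridges
  ovs-ofctl dump-flows br-int
  ovs-ofctl dump-flows br-tun

  # Trace how a reply from the router towards the VM's MAC is forwarded
  ovs-appctl ofproto/trace br-tun in_port=1,dl_src=fa:16:3e:aa:aa:aa,dl_dst=fa:16:3e:bb:bb:bb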

Additional info:

Comment 7 Mohammad 2021-01-12 05:52:26 UTC
Hello,

I have replicated this issue on one of the infra dev clusters (OpenShift running on OpenStack with Kuryr). I asked for one of the infra nodes to be live migrated after I drained it (which only relocated one of the logging-es-data-master pods).

After the live migration, I attempted to relocate that Elasticsearch pod; it attempted to start on the live-migrated node and failed.

Upon checking, I found the same symptoms as those described in this case. Any network tests (ping of the default route, or of the API) fail.

What would you like me to gather from this environment?

Mohammad

Comment 17 Slawek Kaplonski 2021-01-22 11:40:15 UTC
Thx Brendan and Mohammad for the info. That clarifies things for me. At least I know where to look for the issue (the networker node, or the neutron-server and the l2population mechanism).

I'm now deploying OSP-13 with DVR to try to reproduce this issue, but if you can provide debug logs from the neutron server and from the neutron agents running on the networker node where the router is hosted, that would also help me.
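(For reference, debug logging can be enabled roughly as below; the config path and container names are the usual ones for an OSP-13 containerized deployment and may differ in this environment.)

  # On the controller/networker node: enable debug in the neutron config used by the containers
  crudini --set /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf DEFAULT debug True

  # Restart the neutron containers so the setting takes effect (check the names with: docker ps | grep neutron)
  docker restart neutron_api neutron_l3_agent neutron_ovs_agent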

Ahh, one more question: you told me that the router which is not reachable is a centralized router, but in the sos report I see that DVR is enabled, at least on the compute node. Is that intended?

Comment 18 PURANDHAR SAIRAM MANNIDI 2021-01-26 03:52:26 UTC
@slawek yes, these are centralized routers. There were other issues with DVR for Kuryr, so they are manually setting the routers as centralized. One such issue I can remember is the VIP movement on VMs with allowed address pairs, I believe it's https://bugzilla.redhat.com/show_bug.cgi?id=1818741, so they were marking the routers as centralized rather than distributed.
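(For reference, switching an existing router from distributed to centralized is roughly the following; the router has to be disabled while the flag is changed, and the router name is just a placeholder.)

  openstack router set --disable my-router
  openstack router set --centralized my-router
  openstack router set --enable my-router
  openstack router show my-router -c distributed   # should now show False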

Comment 19 Mohammad 2021-02-01 01:38:53 UTC
Confirming this as well: the customer set the routers as centralised just to be able to install OpenShift 3.11 with Kuryr while on OSP13z11. I believe a few other BZs were opened to address these issues:

https://bugzilla.redhat.com/show_bug.cgi?id=1819055
https://bugzilla.redhat.com/show_bug.cgi?id=1818695

Comment 28 ldenny 2021-03-02 22:23:09 UTC
Hi Slawek,

Could you please advise if/when this will be merged downstream for OSP13: https://review.opendev.org/c/openstack/neutron/+/775104

Thank you.

Comment 29 Slawek Kaplonski 2021-03-08 08:20:09 UTC
Hi,

That patch was backported to OSP-13: https://code.engineering.redhat.com/gerrit/#/c/227439/ and is already merged. It is available in the openstack-neutron-12.1.1-39.el7ost build.
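As a quick check, the installed build can be verified inside the agent container with something like the following (the container name is an example and may differ per node):

  docker exec neutron_ovs_agent rpm -q openstack-neutron
  # expected: openstack-neutron-12.1.1-39.el7ost or newer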

Comment 43 errata-xmlrpc 2021-06-16 10:58:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2385

