Description of problem:
We have OpenShift running with the Kuryr CNI, which heavily utilises Neutron trunk ports. When we live migrate one of the OpenShift VMs, we can see that the OpenFlow rules are updated on the hypervisors, but on the qrouter node the flow rules are still sending the traffic to the original hypervisor.

A tcpdump on the VM tap device shows the ARP request heading to the qrouter, with no reply. A tcpdump in the qrouter namespace, on the gateway interface for the VM, shows both the ARP request and the reply.

Additionally, if we add a static ARP entry to the VM namespace and ping the router, we can see the ICMP request and response on the qrouter, but the response is sent via the VXLAN back to the original (pre-live-migration) hypervisor.

Version-Release number of selected component (if applicable):
registry.access.redhat.com/rhosp13/openstack-neutron-openvswitch-agent:13.0-125
registry.access.redhat.com/rhosp13/openstack-neutron-l3-agent:13.0-125

How reproducible:
I don't have adequate infrastructure to deploy OCP + Kuryr CNI, but in this environment we seem to reproduce it reliably by simply live migrating OCP Infra nodes.

Steps to Reproduce:
1. Deploy OCP 3.11 + Kuryr
2. Live migrate an Infra node VM
3. Observe pods failing to start, complaining about "No route to host"

Actual results:
Pods can't start because they are unable to reach the k8s API endpoint. Checking the OpenFlow rules, we can see that the path to the qrouter is fine, but the path back to the VM goes via the VXLAN tunnel to the original hypervisor.

Expected results:
OpenFlow rules should be updated on both sides to reflect the migration.

Attachments will be provided, including all OpenFlow rules from both sides and the ovs-appctl ofproto/trace outputs.

Additional info:
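For anyone comparing the flow dumps from both sides, the check described above can be sketched as a small script. This is only a rough illustration, not a supported tool: it assumes the unicast entries in the tunnel bridge dump carry a dl_dst match and end in an output action, the helper name output_ports_for_mac is hypothetical, and the sample flow line, MAC address and port number are invented for the example rather than taken from the attachments.

```python
import re
from typing import List

# Assumed layout of a unicast-to-tunnel entry (illustrative only), e.g.:
#   table=20, priority=2,dl_vlan=5,dl_dst=fa:16:3e:aa:bb:cc actions=strip_vlan,...,output:7
FLOW_RE = re.compile(
    r"dl_dst=(?P<mac>[0-9a-fA-F:]{17}).*?actions=(?P<actions>\S+)"
)
# Port may be printed as a number (output:7) or a quoted name (output:"vxlan-0a000005").
OUTPUT_RE = re.compile(r'output:"?([\w.-]+)"?')


def output_ports_for_mac(dump: str, mac: str) -> List[str]:
    """Return the output ports of every flow in `dump` whose dl_dst equals `mac`."""
    wanted = mac.lower()
    ports: List[str] = []
    for line in dump.splitlines():
        match = FLOW_RE.search(line)
        if match and match.group("mac").lower() == wanted:
            ports.extend(OUTPUT_RE.findall(match.group("actions")))
    return ports


# Invented sample text standing in for a saved flow dump from the qrouter node.
SAMPLE_DUMP = """\
 cookie=0x1, duration=10.0s, table=20, n_packets=3, n_bytes=126, priority=2,dl_vlan=5,dl_dst=fa:16:3e:aa:bb:cc actions=strip_vlan,load:0x2a->NXM_NX_TUN_ID[],output:7
 cookie=0x1, duration=10.0s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=5,dl_dst=fa:16:3e:11:22:33 actions=strip_vlan,load:0x2a->NXM_NX_TUN_ID[],output:"vxlan-0a000005"
"""

if __name__ == "__main__":
    # Compare the printed port with the tunnel port that leads to the new hypervisor.
    print(output_ports_for_mac(SAMPLE_DUMP, "FA:16:3E:AA:BB:CC"))
```

Running the same lookup for the VM's MAC against the dump taken before and after the live migration would show whether the entry on the qrouter node still points at the tunnel port of the original hypervisor.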
Hello, I have replicated this issue on one of the infra dev clusters (OpenShift running on OpenStack with Kuryr). I asked for one of the infra nodes to be live migrated after I drained it (which only relocated one of the logging-es-data-master pods). After the live migration, I tried to relocate that Elasticsearch pod; it attempted to start on the live-migrated node and failed. Upon checking, I found the same symptoms as in this case: all network tests (ping of the default route or of the API) fail. What would you like me to gather from this environment? Mohammad
Thx Brendan and Mohammad for the info. That clarifies things for me. At least I know where to look for the issue (the networker node, or neutron-server and the l2population mechanism). I'm now deploying OSP-13 with DVR to try to reproduce the issue, but if you can provide debug logs from the neutron server and from the neutron agents running on the networker node where the router is, that would also help me. Ah, one more question: you told me that the router which is not reachable is a centralized router, but in the sos report I see that DVR is enabled, at least on the compute node. Is that intended?
@slawek yes, these are centralized routers. There were other issues with DVR for Kuryr, so they are manually setting the routers as centralized rather than distributed. One such issue I can remember is the VIP movement on VMs with allowed address pairs; I believe it's https://bugzilla.redhat.com/show_bug.cgi?id=1818741
Confirming this as well: the customer set the routers as centralised just to be able to install OpenShift 3.11 with Kuryr while on OSP13z11. I believe a few other BZs were opened to address these issues: https://bugzilla.redhat.com/show_bug.cgi?id=1819055 https://bugzilla.redhat.com/show_bug.cgi?id=1818695
Hi Slawek, Could you please advise if/when this will be merged downstream for OSP13: https://review.opendev.org/c/openstack/neutron/+/775104 Thank you.
Hi, that patch was backported to OSP-13: https://code.engineering.redhat.com/gerrit/#/c/227439/ and is already merged. It is available in the openstack-neutron-12.1.1-39.el7ost build.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2385