Description of problem:
I started investigating after `openstack hypervisor list` showed two hypervisors as down. I couldn't ping or SSH to these hypervisors over their control plane IPs. After grabbing the console of these nodes and looking at the OVS flows, I saw a flow on the primary interface (which hosts the VLANs for the tenant, control plane, and internal API networks) with actions=drop, and all traffic going through this interface was being dropped because of it. As a result, there was no communication between this hypervisor and the rest of the cluster.

The OVS version is 2.6.1, and to clarify, no upgrade was performed on this node. It is an OSP 11 GA environment hosting some VMs for OpenShift testing. Two hypervisors were marked as down by Nova, and that's how the investigation began. Restarting neutron-openvswitch-agent restored connectivity.

OVS flows: https://gist.github.com/smalleni/a65cd07e019497a9ead3433e147243b7
ovs-vsctl show: https://gist.github.com/smalleni/e5a8374d4991d501c103b8b47c06f6f6
ip a: https://gist.github.com/smalleni/4ed2c19981b52300d7007341a85e7549

Version-Release number of selected component (if applicable):
OSP 11
OVS 2.6.1

How reproducible:
Happened only on two compute nodes in a 133-node setup.

Steps to Reproduce:
1.
2.
3.

Actual results:
Some compute nodes lost connectivity.

Expected results:
None of the nodes should lose connectivity.

Additional info:
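For reference, a minimal sketch of how a drop flow like the one described above can be spotted in a flow dump. The sample lines below are illustrative, not taken from the linked gists; on a live node the input would come from `ovs-ofctl dump-flows <bridge>`:

```shell
# Illustrative sample of `ovs-ofctl dump-flows` output saved to a file;
# on an affected node you would dump the real bridge instead.
cat > /tmp/flows.txt <<'EOF'
 cookie=0x0, duration=120.5s, table=0, n_packets=10, n_bytes=840, priority=2,in_port=1 actions=drop
 cookie=0x0, duration=120.5s, table=0, n_packets=55, n_bytes=4620, priority=0 actions=NORMAL
EOF

# List only the flows that drop traffic.
grep 'actions=drop' /tmp/flows.txt
```

A non-zero and growing n_packets counter on such a flow confirms that real traffic is hitting the drop rule.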
Can we pinpoint when it happened? Is there anything in the logs around that time? Were any actions executed external to OSP?
Ihar, I'm not entirely sure, as I found the hypervisors marked down while checking for some other errors. I do know when the state of the VMs on these hypervisors was set to SHUTOFF (presumably that happened when connectivity to the hypervisor was lost). Would some neutron-openvswitch-agent and OVS logs from around that time help?
Ihar, I confirmed that a reboot of the compute node causes this. Manually bouncing the neutron-openvswitch-agent process is the only way to get connectivity back.
We suspect it won't be of much help to get access to the node once it happens. Sai, is it easily reproducible? If so, may I ask you to enable debug logs for neutron-server and neutron-openvswitch-agent, reproduce it, and attach the logs? Also, Kuba pointed to an upstream bug suggesting that some flows were lost during operation, which resembles what you are experiencing: https://bugs.launchpad.net/neutron/+bug/1697243
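For the debug logs, the usual approach is to turn on debug logging in the service configuration and restart the services. A sketch, assuming the standard OSP 11 file locations (adjust paths to your deployment):

```ini
# In /etc/neutron/neutron.conf (neutron-server) and on the compute node
# in the neutron-openvswitch-agent configuration:
[DEFAULT]
debug = True
```

After restarting neutron-server and neutron-openvswitch-agent, the debug-level output should appear in the corresponding logs under /var/log/neutron/.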
Given that br-ex in the provided gist is in standalone mode, this might be the same issue as bug 1473763.
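The fail mode matters here because a bridge in standalone mode falls back to acting as a plain learning switch (or to whatever flows are left) when its controller connection is lost, whereas secure mode keeps the installed flows untouched. A sketch of how the fail mode can be read out of saved `ovs-vsctl show` output; the sample below is illustrative, not from the gist (note that a bridge with no fail_mode line is in the standalone default):

```shell
# Illustrative sample of `ovs-vsctl show` output saved to a file.
cat > /tmp/ovs-show.txt <<'EOF'
    Bridge br-ex
        fail_mode: standalone
        Port "eth2"
    Bridge br-int
        fail_mode: secure
        Port br-int
EOF

# Print each bridge together with its explicitly set fail mode.
awk '/Bridge/ {br=$2} /fail_mode/ {print br, $2}' /tmp/ovs-show.txt
```

On a live node the same information is available directly via `ovs-vsctl get-fail-mode br-ex`.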
*** This bug has been marked as a duplicate of bug 1473763 ***