Bug 1469751

Summary: OVS has a flow to drop all packets, leading to loss of connectivity between the compute node and controllers; a neutron-openvswitch-agent restart is required
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: openstack-neutron
Assignee: Assaf Muller <amuller>
Status: CLOSED DUPLICATE
QA Contact: Toni Freger <tfreger>
Severity: unspecified
Priority: unspecified
Version: 11.0 (Ocata)
CC: amuller, chrisw, ihrachys, jlibosva, nyechiel, smalleni, srevivo
Hardware: Unspecified
OS: Unspecified
Whiteboard: scale_lab, aos-scalability-36
Type: Bug
Last Closed: 2017-08-28 13:54:51 UTC

Description Sai Sindhur Malleni 2017-07-11 18:52:28 UTC
Description of problem: I started investigating when openstack hypervisor list showed two hypervisors as down. I couldn't ping/ssh to these hypervisors over their control plane IPs. On grabbing the console of these nodes and looking at the OVS flows, I saw a flow with actions: drop for the primary interface hosting the VLANs for the tenant, control plane, and internal API networks, and all traffic going through this interface was being dropped because of it. As a result, there was no communication between these hypervisors and the rest of the cluster. The OVS version is 2.6.1, and just to clarify, no upgrade was performed on these nodes. This is an OSP 11 GA environment hosting some VMs for OpenShift testing. Two hypervisors were marked as down by Nova, and that's how the investigation began.
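For context, a minimal sketch of the kind of inspection described here; the bridge name br-ex is an assumption, and the actual topology is in the gists below:

    ovs-ofctl dump-flows br-ex   # dump flows on the bridge; the offending rule ends in actions=drop
    ovs-vsctl show               # confirm which bridge holds the primary VLAN trunk interface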

Restarting neutron-openvswitch-agent restored connectivity.
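The restart was presumably along these lines (the unit name is the standard one on OSP 11):

    systemctl restart neutron-openvswitch-agent   # the agent re-programs the bridge flows on startup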

OVS flows: https://gist.github.com/smalleni/a65cd07e019497a9ead3433e147243b7
ovs-vsctl show: https://gist.github.com/smalleni/e5a8374d4991d501c103b8b47c06f6f6
ip a: https://gist.github.com/smalleni/4ed2c19981b52300d7007341a85e7549


Version-Release number of selected component (if applicable):
11
OVS- 2.6.1

How reproducible: Happened only on two compute nodes in a 133-node setup


Steps to Reproduce:
1. Reboot a compute node (confirmed in comment 3; a sketch follows that comment).

Actual results:
Some compute nodes lost connectivity.

Expected results:
None of the nodes should lose connectivity.

Additional info:

Comment 1 Ihar Hrachyshka 2017-07-13 15:26:33 UTC
Can we pinpoint in time when it happened? Anything in logs around that time? Any actions executed external to OSP?

Comment 2 Sai Sindhur Malleni 2017-07-14 18:44:33 UTC
Ihar,
Not entirely sure, as I only found the hypervisors marked down while checking for some other errors. I do know when the state of the VMs on these hypervisors was set to SHUTOFF (presumably when connectivity to the hypervisor was lost). Would some neutron-openvswitch-agent and OVS logs from around that time help?

Comment 3 Sai Sindhur Malleni 2017-07-14 20:03:26 UTC
Ihar,
I confirmed that a reboot of the compute node causes this. Manually bouncing the neutron-openvswitch-agent process is the only way to get connectivity back.
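For readers following along, the trigger and symptom from this comment sketch out roughly as follows (br-ex again assumed):

    reboot                                      # a plain reboot of the compute node triggers it
    ovs-ofctl dump-flows br-ex | grep -i drop   # after boot, the drop-all rule is present until the agent is bounced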

Comment 5 Ihar Hrachyshka 2017-07-19 13:52:20 UTC
We suspect it won't be of much help to get access to the node once it happens. Sai, is it easily reproducible? If so, may I ask you to enable debug logs for neutron-server and neutron-openvswitch-agent and then reproduce it, and attach logs?
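Should it help, a minimal sketch of enabling debug logs, assuming stock OSP 11 paths and that crudini is installed (editing the files by hand works just as well):

    crudini --set /etc/neutron/neutron.conf DEFAULT debug True
    systemctl restart neutron-server              # on controllers
    systemctl restart neutron-openvswitch-agent   # on the affected computes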

Also, Kuba pointed to an upstream bug that suggests some flows were lost during operation, which resembles what you experience. The bug is: https://bugs.launchpad.net/neutron/+bug/1697243

Comment 6 Jakub Libosvar 2017-08-21 15:41:00 UTC
Given that br-ex in the provided gist is in standalone mode, this might be the same issue as bug 1473763
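For anyone checking their own bridges, the fail mode can be read and (with due caution) changed via the standard ovs-vsctl subcommands:

    ovs-vsctl get-fail-mode br-ex         # prints "standalone" or "secure" (empty means the default, standalone)
    ovs-vsctl set-fail-mode br-ex secure  # secure leaves the flow table alone if the controller goes away

In standalone mode, ovs-vswitchd takes over flow setup and acts as an ordinary MAC-learning switch whenever it loses its OpenFlow controller; whether that interplay with the agent is the mechanism here is what bug 1473763 covers.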

Comment 7 Assaf Muller 2017-08-28 13:54:51 UTC

*** This bug has been marked as a duplicate of bug 1473763 ***