Bug 1469751 - OVS has a flow to drop all packets leading to loss of connectivity between the compute node and controllers- neutron-openvswitch-agent restart is required
Status: CLOSED DUPLICATE of bug 1473763
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Assaf Muller
QA Contact: Toni Freger
Keywords: scale_lab, aos-scalability-36
Depends On:
Reported: 2017-07-11 14:52 EDT by Sai Sindhur Malleni
Modified: 2017-10-02 08:42 EDT
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2017-08-28 09:54:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Sai Sindhur Malleni 2017-07-11 14:52:28 EDT
Description of problem: I started investigating when openstack hypervisor list showed two hypervisors as down. I couldn't ping or SSH to these hypervisors over their control plane IPs. After grabbing the console of these nodes and looking at the OVS flows, I saw a flow with actions: drop on the primary interface hosting the VLANs for the tenant, control plane, and internal API networks, so all traffic going through this interface was being dropped. As a result, there was no communication between these hypervisors and the rest of the cluster. The OVS version is 2.6.1, and to clarify, no upgrade was performed on these nodes. It is an OSP 11 GA environment hosting some VMs for OpenShift testing. Two hypervisors were marked as down by Nova, and that's how the investigation began.

Restarting neutron-openvswitch-agent restored connectivity.

OVS flows: https://gist.github.com/smalleni/a65cd07e019497a9ead3433e147243b7
ovs-vsctl show: https://gist.github.com/smalleni/e5a8374d4991d501c103b8b47c06f6f6
ip a: https://gist.github.com/smalleni/4ed2c19981b52300d7007341a85e7549
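For reference, the kind of rule described above shows up in a flow dump as an actions=drop entry on the uplink port. A minimal sketch of how to spot it (the flow line below is illustrative, not copied from the gists):

```shell
# Illustrative flow line of the kind described above; on a live node this
# would come from: ovs-ofctl dump-flows <bridge>
flow='cookie=0x0, duration=120.5s, table=0, n_packets=991, priority=2,in_port=1 actions=drop'

# A rule whose action is "drop" on the port carrying the tenant, control
# plane, and internal API VLANs discards all traffic on that interface.
printf '%s\n' "$flow" | grep -o 'actions=drop'
# -> actions=drop
```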

Version-Release number of selected component (if applicable):
OVS- 2.6.1

How reproducible: Happened only on two compute nodes in a 133-node setup

Steps to Reproduce:

Actual results:
Some compute nodes lost connectivity

Expected results:
None of the nodes should lose connectivity.

Additional info:
Comment 1 Ihar Hrachyshka 2017-07-13 11:26:33 EDT
Can we pinpoint in time when it happened? Anything in logs around that time? Any actions executed external to OSP?
Comment 2 Sai Sindhur Malleni 2017-07-14 14:44:33 EDT
Not entirely sure, as I found the hypervisors marked down when I was checking for some other errors. I know when the state of the VMs on these hypervisors was set to SHUTOFF (assuming it was done when connectivity was lost with the hypervisor). Would some neutron-openvswitch-agent and ovs logs around that time help?
Comment 3 Sai Sindhur Malleni 2017-07-14 16:03:26 EDT
I confirmed that a reboot of the compute node causes this. Manually bouncing the neutron-openvswitch-agent process is the only way to get connectivity back.
Comment 5 Ihar Hrachyshka 2017-07-19 09:52:20 EDT
We suspect it won't be of much help to get access to the node once it happens. Sai, is it easily reproducible? If so, may I ask you to enable debug logs for neutron-server and neutron-openvswitch-agent and then reproduce it, and attach logs?

Also, Kuba pointed to an upstream bug that suggests some flows were lost during operation, which resembles what you experience. The bug is: https://bugs.launchpad.net/neutron/+bug/1697243
Comment 6 Jakub Libosvar 2017-08-21 11:41:00 EDT
Given that br-ex in the provided gist is in standalone mode, this might be the same issue as bug 1473763.
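For context (my summary, not stated in the bug): an OVS bridge's fail mode controls what happens when no OpenFlow controller is connected. In "standalone" mode the bridge falls back to normal MAC-learning switching, while "secure" keeps only the explicitly installed flows. A quick way to check, sketched against a sample line like the one visible in the linked ovs-vsctl show gist:

```shell
# Sample fragment of "ovs-vsctl show" output (illustrative):
sample='    Bridge br-ex
        fail_mode: standalone'

# Extract the fail mode; on a live node the direct query would be:
#   ovs-vsctl get-fail-mode br-ex
printf '%s\n' "$sample" | awk '/fail_mode:/ {print $2}'
# -> standalone
```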
Comment 7 Assaf Muller 2017-08-28 09:54:51 EDT

*** This bug has been marked as a duplicate of bug 1473763 ***
