Bug 1469751 - OVS has a flow to drop all packets, leading to loss of connectivity between the compute node and controllers - neutron-openvswitch-agent restart is required
Keywords:
Status: CLOSED DUPLICATE of bug 1473763
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Assaf Muller
QA Contact: Toni Freger
URL:
Whiteboard: scale_lab, aos-scalability-36
Depends On:
Blocks:
 
Reported: 2017-07-11 18:52 UTC by Sai Sindhur Malleni
Modified: 2017-10-02 12:42 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-28 13:54:51 UTC
Target Upstream Version:
Embargoed:



Description Sai Sindhur Malleni 2017-07-11 18:52:28 UTC
Description of problem: I started investigating after openstack hypervisor list showed two hypervisors as down. I couldn't ping/ssh to these hypervisors over their control plane IPs. After grabbing the console of these nodes and looking at the OVS flows, I saw a flow for the primary interface hosting the VLANs for the tenant, control plane and internal API networks with actions: drop, and all traffic going through this interface was being dropped because of it. As a result, there was no communication between this hypervisor and the rest of the cluster. The OVS version is 2.6.1, and to clarify, no upgrade was performed on this node. It is an OSP 11 GA environment hosting some VMs for OpenShift testing. Two hypervisors were marked as down by Nova, which is how the investigation began.
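
For reference, the drop flow was found by dumping the OpenFlow tables on the bridge that carries these VLANs. A rough sketch of the inspection, assuming the bridge is br-ex as in the gists below (the exact flows are in the first gist; add -O OpenFlow13 if the bridge only negotiates OpenFlow 1.3):

# Bridge and port layout (captured in the ovs-vsctl gist)
ovs-vsctl show

# Dump the flow tables; a blanket drop on the uplink shows up roughly as
#   priority=2,in_port=<uplink port> actions=drop
ovs-ofctl dump-flows br-ex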

Restarting neutron-openvswitch-agent restored connectivity.
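
A minimal sketch of that workaround, assuming the stock systemd unit name on an OSP 11 compute node:

# On restart the agent resynchronizes and reprograms the flows on its bridges
systemctl restart neutron-openvswitch-agent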

OVS flows: https://gist.github.com/smalleni/a65cd07e019497a9ead3433e147243b7
ovs-vsctl show: https://gist.github.com/smalleni/e5a8374d4991d501c103b8b47c06f6f6
ip a: https://gist.github.com/smalleni/4ed2c19981b52300d7007341a85e7549


Version-Release number of selected component (if applicable):
OSP 11
OVS 2.6.1

How reproducible: Happened only on two compute nodes in a 133-node setup.


Steps to Reproduce:
1. 
2.
3.

Actual results:
Some compute nodes lost connectivity

Expected results:
None of the nodes should lose connectivity.

Additional info:

Comment 1 Ihar Hrachyshka 2017-07-13 15:26:33 UTC
Can we pinpoint in time when it happened? Anything in logs around that time? Any actions executed external to OSP?

Comment 2 Sai Sindhur Malleni 2017-07-14 18:44:33 UTC
Ihar,
Not entirely sure, as I found the hypervisors marked down while checking for some other errors. I do know when the state of the VMs on these hypervisors was set to SHUTOFF (presumably when connectivity to the hypervisor was lost). Would some neutron-openvswitch-agent and OVS logs from around that time help?

Comment 3 Sai Sindhur Malleni 2017-07-14 20:03:26 UTC
Ihar,
I confirmed that a reboot of the compute node causes this. Manually bouncing the neutron-openvswitch-agent process is the only way to get connectivity back.
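
A sketch of the post-reboot check, reusing br-ex from the gists (the flow text is illustrative):

# From the console, after the compute node comes back up
ovs-ofctl dump-flows br-ex | grep 'actions=drop'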

Comment 5 Ihar Hrachyshka 2017-07-19 13:52:20 UTC
We suspect it won't be of much help to get access to the node after it happens. Sai, is this easily reproducible? If so, may I ask you to enable debug logs for neutron-server and neutron-openvswitch-agent, reproduce the issue, and attach the logs?

Also, Kuba pointed to an upstream bug suggesting that some flows were lost during operation, which resembles what you are experiencing. The bug is: https://bugs.launchpad.net/neutron/+bug/1697243
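
A minimal sketch of enabling the requested debug logging, assuming the default OSP 11 config paths and that openstack-utils is installed (otherwise set debug = True under [DEFAULT] by hand):

# On the controller, for neutron-server
openstack-config --set /etc/neutron/neutron.conf DEFAULT debug True
systemctl restart neutron-server

# On the affected compute node, for neutron-openvswitch-agent
openstack-config --set /etc/neutron/neutron.conf DEFAULT debug True
systemctl restart neutron-openvswitch-agent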

Comment 6 Jakub Libosvar 2017-08-21 15:41:00 UTC
Given that br-ex in the provided gist is in standalone mode, this might be the same issue as bug 1473763.
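
For context, the fail-mode can be checked and changed with ovs-vsctl. In standalone mode the bridge falls back to acting as a plain learning switch when its OpenFlow controller (here, the agent) is disconnected; in secure mode it leaves the installed flows untouched. A quick sketch against br-ex from the gist (diagnostic only; the agent normally manages this setting):

# Empty output means the fail-mode is unset, which defaults to standalone
ovs-vsctl get-fail-mode br-ex

# Keep the existing flows when the agent is down
ovs-vsctl set-fail-mode br-ex secure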

Comment 7 Assaf Muller 2017-08-28 13:54:51 UTC

*** This bug has been marked as a duplicate of bug 1473763 ***

