Bug 1397557 - Complete loss of network connectivity after controller reboot introduced by BZ 1372370 - Set secure fail mode for physical bridges
Summary: Complete loss of network connectivity after controller reboot introduced by BZ 1372370 - Set secure fail mode for physical bridges
Keywords:
Status: CLOSED DUPLICATE of bug 1394894
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Assaf Muller
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-22 19:36 UTC by Andreas Karis
Modified: 2020-02-14 18:11 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-22 22:37:33 UTC
Target Upstream Version:



Description Andreas Karis 2016-11-22 19:36:55 UTC
Description of problem:
Complete loss of network connectivity after controller reboot introduced by BZ 1372370 (https://bugzilla.redhat.com/show_bug.cgi?id=1372370) - Set secure fail mode for physical bridges
From the BZ:
"Prior to this update, the failure mode on OVS physical bridges was not set, defaulting to `standalone`. Consequently, when the ofctl_interface was set to `native` and the interface became unavailable (due to heavy load, OVS agent shutdown, network disruption), the flows on physical bridges may have been cleared, with the physical bridge traffic being disrupted.
With this update, the OVS physical bridge fail mode is set to `secure`. As a result, flows are retained on physical bridges."
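
For reference, the fail mode of a bridge can be inspected and changed with ovs-vsctl; a minimal sketch of what the change described above amounts to on br-ex (commands shown for illustration only, not taken from the patch itself):
~~~
# Show the current fail mode (empty output means the default, standalone)
ovs-vsctl get-fail-mode br-ex

# What the fix from BZ 1372370 effectively does for physical bridges
ovs-vsctl set-fail-mode br-ex secure

# Clear the setting again, falling back to standalone behaviour
ovs-vsctl del-fail-mode br-ex
~~~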

The problem is that in Red Hat OpenStack director deployments with HA, pacemaker connectivity is usually established across VLANs that ride on br-ex. If the fail mode is set to secure, then upon a controller reboot all flows are removed from br-ex. Pacemaker therefore can never reach the rest of the cluster and can never bring up neutron-openvswitch-agent, yet neutron-openvswitch-agent is exactly what is needed to recreate the flows.
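
To check whether a deployment is exposed to this, verify that the VLAN device carrying cluster traffic is actually a port on br-ex; a rough sketch, assuming the device is named vlan901 as in the reproduction below:
~~~
# List the ports attached to the external bridge
ovs-vsctl list-ports br-ex

# Or start from the VLAN device that carries cluster traffic
# (vlan901 is taken from the reproduction below; adjust for your deployment)
ovs-vsctl port-to-br vlan901
~~~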

Version-Release number of selected component (if applicable):
Bug introduced with openstack-neutron-7.1.1-6.el7ost

How reproducible:

In a lab, before upgrading to openstack-neutron >= 7.1.1-6:
~~~
[root@overcloud-controller-0 log]# ovs-vsctl get-fail-mode br-ex
~~~

In a lab, after upgrading to openstack-neutron 7.1.1-7:
~~~
[root@overcloud-controller-0 log]# ovs-vsctl get-fail-mode br-ex
secure
~~~

I powered off and rebooted controller-1, and after the reboot I get this:
~~~
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
[root@overcloud-controller-1 ~]# 
~~~

And pcs status shows:
~~~
Online: [ overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-0 overcloud-controller-2 ]
~~~

Pacemaker connectivity is established across a VLAN that goes out via the OVS br-ex bridge:
~~~
[root@overcloud-controller-1 ~]# ping overcloud-controller-0
PING overcloud-controller-0.localdomain (172.16.2.9) 56(84) bytes of data.
^C
--- overcloud-controller-0.localdomain ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

[root@overcloud-controller-1 ~]# ip r g 172.16.2.9
172.16.2.9 dev vlan901  src 172.16.2.6 
    cache 
~~~

This leads to a chicken-and-egg problem:
- OVS removes all flows from br-ex because the bridge is in fail-mode secure. Because the flows are removed, a) pacemaker does not bring up the controller's own services since it cannot reach the rest of the cluster, and b) connectivity to the other two controllers for neutron itself does not work either.
- In order to populate OVS with the correct flows, neutron-openvswitch-agent needs to be started by pacemaker. This only happens if pacemaker can reach the rest of the cluster, which cannot work because OVS removed all flows due to fail-mode secure; a rough first-aid sketch is shown below.
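
As an emergency workaround (not part of this report, only a hedged sketch), basic forwarding on br-ex can be restored by hand so that cluster traffic passes again before the agent reprograms the bridge:
~~~
# Re-add a catch-all NORMAL flow so L2 forwarding on br-ex works again
ovs-ofctl add-flow br-ex "priority=0,actions=NORMAL"

# Once connectivity is back, start the agent so it installs the proper flows
systemctl start neutron-openvswitch-agent.service
~~~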

A manual start of neutron-openvswitch-agent recreates OVS flows and the controller can join the cluster again:
~~~
[root@overcloud-controller-1 ~]# systemctl start neutron-openvswitch-agent.service
(...)
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0xa9a7725d59dcff14, duration=43.037s, table=0, n_packets=0, n_bytes=0, idle_age=43, priority=2,in_port=7 actions=drop
 cookie=0xa9a7725d59dcff14, duration=43.085s, table=0, n_packets=45897, n_bytes=8302418, idle_age=0, priority=0 actions=NORMAL
~~~

Stopping pacemaker on all controllers:
~~~
[root@overcloud-controller-0 ~]# pcs cluster stop --all
~~~

Downgrading to 7.1.1-5 on all controllers:
~~~
yum downgrade openstack-neutron-7.1.1-5.el7ost.noarch openstack-neutron-common-7.1.1-5.el7ost.noarch openstack-neutron-ml2-7.1.1-5.el7ost.noarch openstack-neutron-openvswitch-7.1.1-5.el7ost.noarch python-neutron-7.1.1-5.el7ost.noarch openstack-neutron-metering-agent-7.1.1-5.el7ost.noarch -y
~~~

Starting all pacemaker services on all controllers:
~~~
[root@overcloud-controller-0 ~]# pcs cluster start --all
~~~

And a manual removal of the fail-mode setting is needed to get rid of fail-mode secure:
~~~
[root@overcloud-controller-0 log]# ovs-vsctl get-fail-mode br-ex
secure
[root@overcloud-controller-0 log]# ovs-vsctl help | grep fail
[root@overcloud-controller-0 log]# ovs-vsctl del-fail-mode br-ex
[root@overcloud-controller-0 log]# ovs-vsctl get-fail-mode br-ex
[root@overcloud-controller-0 log]# 
~~~

Stopping the cluster again on all machines just to be sure:
~~~
[root@overcloud-controller-0 log]# pcs cluster stop --all
~~~

(Soft) Rebooting all controllers:
~~~
reboot
~~~

After this, all controllers come back, with:
~~~
[root@overcloud-controller-1 ~]# ovs-vsctl get-fail-mode br-ex
[root@overcloud-controller-1 ~]# 
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0xb0ee7296820f37f2, duration=7.941s, table=0, n_packets=0, n_bytes=0, idle_age=7, priority=2,in_port=7 actions=drop
 cookie=0xb0ee7296820f37f2, duration=7.988s, table=0, n_packets=22863, n_bytes=4384453, idle_age=0, priority=0 actions=NORMAL
~~~

Killing controller-1 as in the initial test, waiting for its restart, and confirming flows and pcs status:
~~~
Right after reboot:
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=61.048s, table=0, n_packets=9460, n_bytes=1483143, idle_age=0, priority=0 actions=NORMAL
[root@overcloud-controller-1 ~]# pcs status | head
Cluster name: tripleo_cluster
Last updated: Tue Nov 22 19:29:18 2016		Last change: Tue Nov 22 19:27:06 2016 by hacluster via crmd on overcloud-controller-2
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 112 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

[root@overcloud-controller-1 ~]#
~~~

After PCS reconverges:
~~~
[root@overcloud-controller-1 ~]# pcs status | grep -i stop
[root@overcloud-controller-1 ~]# 
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0xaa204b7310e24e4a, duration=57.494s, table=0, n_packets=0, n_bytes=0, idle_age=57, priority=2,in_port=7 actions=drop
 cookie=0xaa204b7310e24e4a, duration=57.544s, table=0, n_packets=35025, n_bytes=5293871, idle_age=0, priority=0 actions=NORMAL
~~~

Comment 1 Assaf Muller 2016-11-22 22:37:33 UTC

*** This bug has been marked as a duplicate of bug 1394894 ***

