Description of problem:

Complete loss of network connectivity after controller reboot, introduced by BZ 1372370 (https://bugzilla.redhat.com/show_bug.cgi?id=1372370) - "Set secure fail mode for physical bridges".

From that BZ: "Prior to this update, the failure mode on OVS physical bridges was not set, defaulting to `standalone`. Consequently, when the ofctl_interface was set to `native` and the interface became unavailable (due to heavy load, OVS agent shutdown, network disruption), the flows on physical bridges may have been cleared, with the physical bridge traffic being disrupted. With this update, the OVS physical bridge fail mode is set to `secure`. As a result, flows are retained on physical bridges."

The problem is that in Red Hat OpenStack Director deployments with HA, pacemaker connectivity is more often than not established across VLANs that ride on br-ex. If the fail mode is set to secure, then upon a controller reboot all flows are deleted from br-ex. Pacemaker can therefore never reach the rest of the cluster and never brings up neutron-openvswitch-agent, yet neutron-openvswitch-agent is exactly what is needed to recreate the flows.

Version-Release number of selected component (if applicable):

Bug introduced with openstack-neutron-7.1.1-6.el7ost

How reproducible:

In a lab, before upgrading to a version >= 7.1.1-6:
~~~
[root@overcloud-controller-0 log]# ovs-vsctl get-fail-mode br-ex
~~~

In a lab, after upgrading to neutron 7.1.1-7:
~~~
[root@overcloud-controller-0 log]# ovs-vsctl get-fail-mode br-ex
secure
~~~

I powered off and rebooted controller 1, and after the reboot I get this:
~~~
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
[root@overcloud-controller-1 ~]#
~~~

And:
~~~
Online: [ overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-0 overcloud-controller-2 ]
~~~

Pacemaker connectivity is established across a VLAN which goes out via OVS' br-ex interface:
~~~
[root@overcloud-controller-1 ~]# ping overcloud-controller-0
PING overcloud-controller-0.localdomain (172.16.2.9) 56(84) bytes of data.
^C
--- overcloud-controller-0.localdomain ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
[root@overcloud-controller-1 ~]# ip r g 172.16.2.9
172.16.2.9 dev vlan901 src 172.16.2.6
    cache
~~~

This leads to a chicken-and-egg problem:
- OVS removes all flows from br-ex because the bridge is in fail-mode secure. With the flows gone, a) pacemaker does not bring up the controller's own services because it cannot reach the rest of the cluster, and b) connectivity to the other two controllers for neutron itself does not work either.
- To repopulate OVS with the correct flows, neutron-openvswitch-agent needs to be started by pacemaker, which only happens once pacemaker can reach the rest of the cluster. That cannot happen, because OVS removed all flows due to fail-mode secure.

A manual start of neutron-openvswitch-agent recreates the OVS flows and the controller can join the cluster again:
~~~
[root@overcloud-controller-1 ~]# systemctl start neutron-openvswitch-agent.service
(...)
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0xa9a7725d59dcff14, duration=43.037s, table=0, n_packets=0, n_bytes=0, idle_age=43, priority=2,in_port=7 actions=drop
 cookie=0xa9a7725d59dcff14, duration=43.085s, table=0, n_packets=45897, n_bytes=8302418, idle_age=0, priority=0 actions=NORMAL
~~~
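For reference, the manual recovery can also be scripted. The following is only a minimal sketch built from the commands shown above, not a supported fix; the bridge name br-ex and the empty-flow-table check are assumptions about this particular deployment:

~~~
# Hypothetical recovery helper (assumption, not part of the product): if br-ex
# came back up in fail-mode "secure" with an empty flow table, clear the fail
# mode and start the OVS agent by hand so the controller can rejoin the cluster.
BRIDGE=br-ex   # assumed name of the physical bridge carrying the cluster VLANs

if [ "$(ovs-vsctl get-fail-mode "$BRIDGE")" = "secure" ] && \
   ! ovs-ofctl dump-flows "$BRIDGE" | grep -q "actions="; then
    ovs-vsctl del-fail-mode "$BRIDGE"
    systemctl start neutron-openvswitch-agent.service
fi
~~~

This only papers over the boot-ordering problem; whether clearing the fail mode is acceptable long term is exactly the trade-off the referenced BZ changed.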
To work around the problem, I downgraded neutron and reverted the fail mode.

Stopping pacemaker on all controllers:
~~~
[root@overcloud-controller-0 ~]# pcs cluster stop --all
~~~

Downgrading to 7.1.1-5 on all controllers:
~~~
yum downgrade openstack-neutron-7.1.1-5.el7ost.noarch openstack-neutron-common-7.1.1-5.el7ost.noarch openstack-neutron-ml2-7.1.1-5.el7ost.noarch openstack-neutron-openvswitch-7.1.1-5.el7ost.noarch python-neutron-7.1.1-5.el7ost.noarch openstack-neutron-metering-agent-7.1.1-5.el7ost.noarch -y
~~~

Starting all pacemaker services on all controllers:
~~~
[root@overcloud-controller-0 ~]# pcs cluster start --all
~~~

A manual removal of the fail mode is then needed to get rid of fail-mode secure:
~~~
[root@overcloud-controller-0 log]# ovs-vsctl get-fail-mode br-ex
secure
[root@overcloud-controller-0 log]# ovs-vsctl help | grep fail
[root@overcloud-controller-0 log]# ovs-vsctl del-fail-mode br-ex
[root@overcloud-controller-0 log]# ovs-vsctl get-fail-mode br-ex
[root@overcloud-controller-0 log]#
~~~

Stopping the cluster again on all machines, just to be sure:
~~~
[root@overcloud-controller-0 log]# pcs cluster stop --all
~~~

(Soft) rebooting all controllers:
~~~
reboot
~~~

After this, all controllers come back with:
~~~
[root@overcloud-controller-1 ~]# ovs-vsctl get-fail-mode br-ex
[root@overcloud-controller-1 ~]#
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0xb0ee7296820f37f2, duration=7.941s, table=0, n_packets=0, n_bytes=0, idle_age=7, priority=2,in_port=7 actions=drop
 cookie=0xb0ee7296820f37f2, duration=7.988s, table=0, n_packets=22863, n_bytes=4384453, idle_age=0, priority=0 actions=NORMAL
~~~

Killing controller-1 as in the initial test, waiting for its restart, and confirming flows and pcs status.

Right after the reboot:
~~~
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=61.048s, table=0, n_packets=9460, n_bytes=1483143, idle_age=0, priority=0 actions=NORMAL
[root@overcloud-controller-1 ~]# pcs status | head
Cluster name: tripleo_cluster
Last updated: Tue Nov 22 19:29:18 2016    Last change: Tue Nov 22 19:27:06 2016 by hacluster via crmd on overcloud-controller-2
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 112 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

[root@overcloud-controller-1 ~]#
~~~

After PCS reconverges:
~~~
[root@overcloud-controller-1 ~]# pcs status | grep -i stop
[root@overcloud-controller-1 ~]#
[root@overcloud-controller-1 ~]# ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0xaa204b7310e24e4a, duration=57.494s, table=0, n_packets=0, n_bytes=0, idle_age=57, priority=2,in_port=7 actions=drop
 cookie=0xaa204b7310e24e4a, duration=57.544s, table=0, n_packets=35025, n_bytes=5293871, idle_age=0, priority=0 actions=NORMAL
~~~
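After the rollback and reboots, the same checks can be run across all three controllers in one pass. A small sketch, assuming passwordless root SSH between the controllers as used throughout this report:

~~~
# Verification sketch (assumes root SSH access between controllers, as in this
# lab): get-fail-mode should print nothing, dump-flows should show the NORMAL
# flow on br-ex, and pacemaker should report all three nodes online.
for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
    echo "== $node =="
    ssh "$node" 'ovs-vsctl get-fail-mode br-ex; ovs-ofctl dump-flows br-ex'
done
pcs status | grep -E 'Online|OFFLINE'
~~~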
*** This bug has been marked as a duplicate of bug 1394894 ***