Bug 1490281

Summary: ARP storm on controllers after all controllers ungracefully reset at once
Product: Red Hat OpenStack Reporter: Marian Krcmarik <mkrcmari>
Component: openstack-neutron Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED ERRATA QA Contact: Marian Krcmarik <mkrcmari>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 12.0 (Pike) CC: ahrechan, akaris, amuller, aschultz, chrisw, fdinitto, hbrock, ihrachys, jlibosva, jschluet, jslagle, mariel, mburns, mcornea, michele, mkrcmari, nyechiel, oblaut, ohochman, rhel-osp-director-maint, sasha, srevivo, tfreger, tvignaud, ushkalim
Target Milestone: rc Keywords: AutomationBlocker, Reopened, TestBlocker, Triaged
Target Release: 12.0 (Pike)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-neutron-11.0.2-0.20171020230402.el7ost Doc Type: Bug Fix
Doc Text:
Some deployments use Neutron provider bridges for internal traffic, such as AMQP traffic. On boot, these bridges behave like normal switches, and ARP broadcast packets cross the patch ports between the integration bridge and the provider bridges, so an ARP storm could occur when multiple controllers were turned off ungracefully and then booted up simultaneously. The new systemd service neutron-destroy-patch-ports now runs at boot to remove the patch ports and break the connection between the integration bridge and the provider bridges. This prevents ARP storms; the patch ports are recreated after the openvswitch agent is started.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-13 22:08:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1505773    
Bug Blocks:    

Description Marian Krcmarik 2017-09-11 08:44:46 UTC
Description of problem:
OSP12 environment with 3 controllers hosting all services (no composable roles). An ARP storm appears on the controllers after boot once all three controllers are taken down ungracefully during a recovery test. This is likely why the controllers cannot communicate with each other: there is huge packet loss between them, and the pacemaker cluster is not formed after all controllers boot.

I am not sure what the exact cause or component is, nor how much the network topology influences this; I am using Infrared for the OpenStack deployment.

Jakub Libosvar was taking a look at the environment and will comment more. In short, it seems that an ARP broadcast arrives on the br-isolated interface, passes through the patch port phy-br-isolated to br-int, then reaches int-br-ex (the patch port for br-ex), and from br-ex it goes out eth2, creating an ARP storm.
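
For illustration, one way to confirm this loop from a controller shell might be (a diagnostic sketch only; the interface and bridge names are the ones from this environment):

    tcpdump -nei eth2 arp                        # watch the ARP broadcast rate on the physical NIC
    ovs-vsctl show | grep -B1 -A2 "type: patch"  # list the patch ports linking br-int to br-ex and br-isolated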

Interfaces on controller node:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:0d:cb:3f brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.14/24 brd 192.168.24.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe0d:cb3f/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
    link/ether 52:54:00:58:5d:15 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe58:5d15/64 scope link 
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
    link/ether 52:54:00:76:19:6b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe76:196b/64 scope link 
       valid_lft forever preferred_lft forever
5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 3a:21:bc:14:c4:d5 brd ff:ff:ff:ff:ff:ff
6: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:76:19:6b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.106/24 brd 10.0.0.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe76:196b/64 scope link 
       valid_lft forever preferred_lft forever
7: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000
    link/ether 1a:e4:a0:38:8d:d0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::18e4:a0ff:fe38:8dd0/64 scope link 
       valid_lft forever preferred_lft forever
8: br-tun: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether b6:16:ca:54:a2:48 brd ff:ff:ff:ff:ff:ff
9: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 96:39:74:52:53:40 brd ff:ff:ff:ff:ff:ff
10: vlan40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 3a:5c:14:43:da:11 brd ff:ff:ff:ff:ff:ff
    inet 172.17.4.22/24 brd 172.17.4.255 scope global vlan40
       valid_lft forever preferred_lft forever
    inet6 fe80::385c:14ff:fe43:da11/64 scope link 
       valid_lft forever preferred_lft forever
11: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 9a:3b:99:88:1c:d9 brd ff:ff:ff:ff:ff:ff
    inet 172.17.1.15/24 brd 172.17.1.255 scope global vlan20
       valid_lft forever preferred_lft forever
    inet 172.17.1.19/32 brd 172.17.1.255 scope global vlan20
       valid_lft forever preferred_lft forever
    inet6 fe80::983b:99ff:fe88:1cd9/64 scope link 
       valid_lft forever preferred_lft forever
12: vlan30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 8e:d5:36:5e:74:28 brd ff:ff:ff:ff:ff:ff
    inet 172.17.3.12/24 brd 172.17.3.255 scope global vlan30
       valid_lft forever preferred_lft forever
    inet 172.17.3.19/32 brd 172.17.3.255 scope global vlan30
       valid_lft forever preferred_lft forever
    inet6 fe80::8cd5:36ff:fe5e:7428/64 scope link 
       valid_lft forever preferred_lft forever
13: vlan50: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether da:c4:d9:40:0d:41 brd ff:ff:ff:ff:ff:ff
    inet 172.17.2.19/24 brd 172.17.2.255 scope global vlan50
       valid_lft forever preferred_lft forever
    inet6 fe80::d8c4:d9ff:fe40:d41/64 scope link 
       valid_lft forever preferred_lft forever
14: br-isolated: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:58:5d:15 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe58:5d15/64 scope link 
       valid_lft forever preferred_lft forever
15: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN 
    link/ether 02:42:e2:ca:4c:3d brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

ovs bridges:
bf3712f4-894a-42a8-8971-722a50c47cc0
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-ex
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "eth2"
            Interface "eth2"
        Port phy-br-ex
            Interface phy-br-ex
                type: patch
                options: {peer=int-br-ex}
        Port br-ex
            Interface br-ex
                type: internal
    Bridge br-isolated
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "vlan30"
            tag: 30
            Interface "vlan30"
                type: internal
        Port phy-br-isolated
            Interface phy-br-isolated
                type: patch
                options: {peer=int-br-isolated}
        Port "eth1"
            Interface "eth1"
        Port "vlan40"
            tag: 40
            Interface "vlan40"
                type: internal
        Port "vlan20"
            tag: 20
            Interface "vlan20"
                type: internal
        Port br-isolated
            Interface br-isolated
                type: internal
        Port "vlan50"
            tag: 50
            Interface "vlan50"
                type: internal
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port int-br-isolated
            Interface int-br-isolated
                type: patch
                options: {peer=phy-br-isolated}
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port int-br-ex
            Interface int-br-ex
                type: patch
                options: {peer=phy-br-ex}
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-tun
            Interface br-tun
                type: internal
        Port "vxlan-ac11020f"
            Interface "vxlan-ac11020f"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.19", out_key=flow, remote_ip="172.17.2.15"}
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-ac11020a"
            Interface "vxlan-ac11020a"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.19", out_key=flow, remote_ip="172.17.2.10"}
        Port "vxlan-ac110212"
            Interface "vxlan-ac110212"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.19", out_key=flow, remote_ip="172.17.2.18"}
    ovs_version: "2.7.2"

Version-Release number of selected component (if applicable):
Not sure what components are applicable

How reproducible:
Always

Steps to Reproduce:
1. Take down all controllers nodes at once ungracefully
2. Wait for them to boot up

Actual results:
ARP storm appears on all the nodes, high packet loss between nodes, pacemaker cluster is not formed.

Expected results:
Cluster is formed successfully and no packet loss between nodes.

Additional info:

Comment 4 Jakub Libosvar 2017-09-18 14:58:59 UTC
The description looks pretty accurate :)

So the real issue here is that we shouldn't actually be using OVS bridges for management networks; they should be placed on a separate interface or bond [1]. Quote:

"The OVS bridge connects to the Neutron server in order to get configuration data. If the OpenStack control traffic (typically the Control Plane and Internal API networks) is placed on an OVS bridge, then connectivity to the Neutron server gets lost whenever OVS is upgraded or the OVS bridge is restarted by the admin user or process. This will cause some downtime. If downtime is not acceptable under these circumstances, then the Control group networks should be placed on a separate interface or bond rather than on an OVS bridge:

A minimal setting can be achieved, when you put the Internal API network on a VLAN on the provisioning interface and the OVS bridge on a second interface.
If you want bonding, you need at least two bonds (four network interfaces). The control group should be placed on a Linux bond (Linux bridge). If the switch does not support LACP fallback to a single interface for PXE boot, then this solution requires at least five NICs."

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/advanced_overcloud_customization/#sect-Isolating_Networks

Such a configuration is passed to tripleo, but it's not recommended per the documentation. I'm closing this as NOTABUG for now, as it seems like a bug in the deployment tool (infrared?) that generated the templates for tripleo.

Comment 5 Jakub Libosvar 2017-09-27 09:17:35 UTC
*** Bug 1495224 has been marked as a duplicate of this bug. ***

Comment 6 Marius Cornea 2017-09-27 11:23:14 UTC
(In reply to Jakub Libosvar from comment #4)
> Such configuration is passed to tripleo but it's not recommended as per
> documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> in deployment tool (infrared?) that generated templates for tripleo.

This recommendation was introduced after OSP11 GA. I haven't tested it yet, but I suspect upgraded environments (coming from <=OSP11) would be affected by this issue as well. Since it looks like this bug doesn't manifest in OSP11 on the same topology, I think we should find out where the regression comes from in OSP12.

Comment 7 Jakub Libosvar 2017-09-27 11:27:10 UTC
(In reply to Marius Cornea from comment #6)
> (In reply to Jakub Libosvar from comment #4)
> > Such configuration is passed to tripleo but it's not recommended as per
> > documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> > in deployment tool (infrared?) that generated templates for tripleo.
> 
> This recommendation was introduced after OSP11 GA. I haven't tested yet but
> I suspect upgraded environments(coming for <=OSP11) would be affected as
> well by this issue. Since it looks that in OSP11 this bug doesn't manifest
> on the same topology I think we should find out where the regression comes
> from in OSP12.

We can revert the fix for bug 1473763 to avoid the ARP storm. But then, after a node reboots, the ovs-agent won't have access to neutron-server, because it needs information about how to configure the bridge it uses for communication with neutron-server. So it creates a chicken-and-egg problem.

In my opinion, ideally we should provide a way to switch isolated networks from ovs bridge to linux devices as part of upgrade process.

Comment 8 Fabio Massimo Di Nitto 2017-09-27 12:27:14 UTC
(In reply to Jakub Libosvar from comment #7)
> (In reply to Marius Cornea from comment #6)
> > (In reply to Jakub Libosvar from comment #4)
> > > Such configuration is passed to tripleo but it's not recommended as per
> > > documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> > > in deployment tool (infrared?) that generated templates for tripleo.
> > 
> > This recommendation was introduced after OSP11 GA. I haven't tested yet but
> > I suspect upgraded environments(coming for <=OSP11) would be affected as
> > well by this issue. Since it looks that in OSP11 this bug doesn't manifest
> > on the same topology I think we should find out where the regression comes
> > from in OSP12.
> 
> We can revert fix for bug 1473763 to avoid the arp storm. Then rebooting
> node will cause that ovs-agent won't have access to neutron-server after
> boot because it needs information about how to configure bridge which it
> uses for communication with neutron-server. So it creates a chicken-egg
> problem.
> 
> In my opinion, ideally we should provide a way to switch isolated networks
> from ovs bridge to linux devices as part of upgrade process.

Shouldn't we keep this bug open to track the activity to migrate environments during upgrades? At least there is a visible reference to the issue in bugzilla, and we should have at least a KB article for customers upgrading (plus release notes).

Comment 9 Marius Cornea 2017-09-27 12:28:33 UTC
(In reply to Jakub Libosvar from comment #7)
> (In reply to Marius Cornea from comment #6)
> > (In reply to Jakub Libosvar from comment #4)
> > > Such configuration is passed to tripleo but it's not recommended as per
> > > documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> > > in deployment tool (infrared?) that generated templates for tripleo.
> > 
> 
> We can revert fix for bug 1473763 to avoid the arp storm. Then rebooting
> node will cause that ovs-agent won't have access to neutron-server after
> boot because it needs information about how to configure bridge which it
> uses for communication with neutron-server. So it creates a chicken-egg
> problem.

I see that the environment reported for bug 1473763 had the neutron ovs agent running inside a container (openstack-neutron-openvswitch-agent-docker). This is no longer the case, as we're now running the neutron-related services on the baremetal host and not inside a container. Do you think we should still expect to see the issues reported initially if we reverted the fix?

> In my opinion, ideally we should provide a way to switch isolated networks
> from ovs bridge to linux devices as part of upgrade process.

We should probably look into this and into what the implications are. AFAIK we're currently not doing any isolated-network configuration updates during upgrade. This kind of change depends heavily on the physical networking infrastructure, so we need to see whether the old recommended architecture is compatible with, and can be easily migrated to, the new recommended architecture, i.e. make sure our users won't have to install additional NICs to be able to perform an upgrade.

Comment 10 Udi Shkalim 2017-09-27 15:48:48 UTC
Reopening the bug for tracking and additional attention.

To summarize:

This bug was introduced only in OSP12, due to a fix related to https://bugzilla.redhat.com/show_bug.cgi?id=1473763
This issue was not observed in OSP11.

Forcing a network NIC architecture change on customers for upgrades or new installations is not a solution.
Problems are starting to pop up and we are not close to finishing our tests.

Need to look at comment #9 - maybe reverting the fix will be the solution here?

Comment 11 Omri Hochman 2017-09-27 17:30:04 UTC
Keywords: TestBlocker  

Our OSP12 Assure deployments have to survive the following test post-reboot:

Sequence:
Deployment --> Launch Instance --> Sanity --> Reboot --> Launch Instance --> Sanity


Due to this bug, the post-reboot sequence is currently blocked.

Comment 12 Assaf Muller 2017-09-27 19:55:49 UTC
(In reply to Udi Shkalim from comment #10)
> Reopening the bug for tracking and additional attention.
> 
> To summarize:
> 
> This bug is introduced only in OSP12 due to a fix related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1473763 
> This issue was not observed in OSP11.
> 
> Forcing a network NIC arch. change on customer for upgrade or new
> installation is not a solution.
> Problems are starting to pop up and we are not close to finish our tests.

I agree.

> 
> Need to look on comment #9 - maybe reverting the fix will be the solution
> here?

Jakub and I talked about this today. A straight-up revert is not an option because it would reintroduce the other blocker we resolved. We're trying to come up with a solution to both issues.

Comment 14 Jakub Libosvar 2017-10-02 11:30:16 UTC
I can propose a patch to remove patch ports between isolated bridge and integration bridge in the network script.
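
As a rough sketch (not the actual proposed patch; bridge and port names are taken from this environment), breaking the loop at boot would amount to something like:

    # run early at boot, before the openvswitch agent starts;
    # the agent recreates the patch ports once it comes up
    ovs-vsctl --if-exists del-port br-int int-br-isolated
    ovs-vsctl --if-exists del-port br-isolated phy-br-isolated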

Comment 15 Assaf Muller 2017-10-02 14:32:53 UTC
We spent about an hour talking about this issue today on a team call; Jakub will follow up with Sofer on the details. It looks like the way forward is to break the loop by deleting the patch ports, as Jakub commented in comment 14.

Comment 16 Marian Krcmarik 2017-10-04 09:34:18 UTC
(In reply to Jakub Libosvar from comment #14)
> I can propose a patch to remove patch ports between isolated bridge and
> integration bridge in the network script.

I've tested the proposed change on my deployment where the bug could be reproduced, and I did not hit the issue (I performed the test multiple times with successful results).

Comment 27 Jakub Libosvar 2017-10-25 09:14:55 UTC
Bug 1505773 tracks enabling the new systemd service via the puppet module. To verify this bug, you can run "systemctl enable neutron-destroy-patch-ports" after installing the new RPM, to avoid the dependency on bug 1505773.
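
A possible verification sequence (a sketch; bridge and port names are the ones from this environment) could be:

    systemctl enable neutron-destroy-patch-ports
    reboot
    # after boot, before the openvswitch agent has reconfigured the bridges,
    # the patch ports should be absent:
    ovs-vsctl list-ports br-isolated    # phy-br-isolated should not be listed
    ovs-vsctl list-ports br-int         # int-br-isolated should not be listed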

Comment 29 Alexander Chuzhoy 2017-10-27 16:40:14 UTC
FailedQA


Environment:
openstack-neutron-11.0.2-0.20171020230401.el7ost.noarch

Seems like the issue still reproduces.

What I see after reboot:

Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:38:09 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-0 overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped
   galera-bundle-1      (ocf::heartbeat:galera):        Stopped
   galera-bundle-2      (ocf::heartbeat:galera):        Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Stopped
   redis-bundle-1       (ocf::heartbeat:redis): Stopped
   redis-bundle-2       (ocf::heartbeat:redis): Stopped
 ip-192.168.24.6        (ocf::heartbeat:IPaddr2):       Stopped
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.15 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.3.14 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Stopped
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

############################################################################

Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:37:27 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-2 ]
OFFLINE: [ overcloud-controller-0 overcloud-controller-1 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped
   galera-bundle-1      (ocf::heartbeat:galera):        Stopped
   galera-bundle-2      (ocf::heartbeat:galera):        Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Stopped
   redis-bundle-1       (ocf::heartbeat:redis): Stopped
   redis-bundle-2       (ocf::heartbeat:redis): Stopped
 ip-192.168.24.6        (ocf::heartbeat:IPaddr2):       Stopped
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.15 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.3.14 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Stopped
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

################################################################


Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:38:50 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 ]
OFFLINE: [ overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped
   galera-bundle-1      (ocf::heartbeat:galera):        Stopped
   galera-bundle-2      (ocf::heartbeat:galera):        Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Stopped
   redis-bundle-1       (ocf::heartbeat:redis): Stopped
   redis-bundle-2       (ocf::heartbeat:redis): Stopped
 ip-192.168.24.6        (ocf::heartbeat:IPaddr2):       Stopped
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.15 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.3.14 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Stopped
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

#######################################################################
Obviously unable to work with the overcloud.

Comment 30 Jakub Libosvar 2017-10-30 09:01:04 UTC
Can you provide more information? Have you enabled the service as I described in comment 27? After the machine is booted, do you still see the patch ports between br-int and the provider bridges? Can you still see the ARP storm?

Comment 35 Jon Schlueter 2017-10-30 20:53:22 UTC
Looks like another patch was added to the build; moving back to MODIFIED.

Comment 37 Omri Hochman 2017-10-30 21:30:40 UTC
Thanks Jon.
Not sure why this bug was switched to ON_QA before the new RPM is included in the puddle.

Also, the final solution cannot require manually running 'systemctl enable neutron-destroy-patch-ports'; we need to make sure that with a clean deployment that includes the fix, manually enabling neutron-destroy-patch-ports won't be required.

Comment 39 Ihar Hrachyshka 2017-11-06 21:36:17 UTC
Sofer, we have https://bugzilla.redhat.com/show_bug.cgi?id=1505773 to track puppet integration.

Comment 40 Ihar Hrachyshka 2017-11-06 21:37:34 UTC
Sorry, the last comment was for Omri.

Comment 47 errata-xmlrpc 2017-12-13 22:08:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462