1490281 – ARP storm on controllers after all controllers ungracefully reset at once

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-neutron
Sub Component:
Version:	12.0 (Pike)
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	12.0 (Pike)
Assignee:	Jakub Libosvar
QA Contact:	Marian Krcmarik
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1495224
Depends On:	1505773
Blocks:

Reported:	2017-09-11 08:44 UTC by Marian Krcmarik
Modified:	2018-02-05 19:12 UTC
CC List:	25 users
Fixed In Version:	openstack-neutron-11.0.2-0.20171020230402.el7ost
Doc Type:	Bug Fix
Doc Text:	Some deployments use Neutron provider bridges for internal traffic, such as AMQP traffic, which causes the bridges to be set to behave like normal switches on boot. Because ARP broadcast packets use patch ports to travel between the integration bridge and the provider bridges, ARP storms could occur if several controllers were turned off ungracefully and then booted simultaneously. The new systemd service neutron-destroy-patch-ports now runs at boot to remove the patch ports and break the connection between the integration bridge and the provider bridges. This prevents ARP storms, and the patch ports are recreated after the openvswitch agent starts.
Clone Of:
Environment:
Last Closed:	2017-12-13 22:08:14 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1720766	None	None	None	2017-10-02 11:40:33 UTC
OpenStack gerrit	513460	None	MERGED	Enable neutron-destroy-port-patches service	2020-10-16 10:10:33 UTC
RDO	10145	None	None	None	2017-10-17 20:34:14 UTC
RDO	10336	None	None	None	2017-10-30 15:57:18 UTC
Red Hat Product Errata	RHEA-2017:3462	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 12.0 Enhancement Advisory	2018-02-16 01:43:25 UTC

Description Marian Krcmarik 2017-09-11 08:44:46 UTC

Description of problem:
OSP12 environment with 3 controllers hosting all services (no composable roles). An ARP storm appears on the controllers after boot once all three controllers are taken down ungracefully during a recovery test. This may be why the controllers cannot communicate with each other: there is heavy packet loss between them, and the pacemaker cluster is not formed after all controllers boot.

I am not sure what the exact cause or component is, nor how much the network topology influences it. I am using Infrared for the OpenStack deployment.

Jakub Libosvar was taking a look at the env and will comment more. In short, it seems that an ARP broadcast arrives on the br-isolated bridge, goes through the patch port phy-br-isolated to br-int, and then reaches int-br-ex, which is the patch port for br-ex. From br-ex it goes out through eth2, and the ARP storm is created.

Interfaces on controller node:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:0d:cb:3f brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.14/24 brd 192.168.24.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe0d:cb3f/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
    link/ether 52:54:00:58:5d:15 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe58:5d15/64 scope link 
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
    link/ether 52:54:00:76:19:6b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe76:196b/64 scope link 
       valid_lft forever preferred_lft forever
5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 3a:21:bc:14:c4:d5 brd ff:ff:ff:ff:ff:ff
6: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:76:19:6b brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.106/24 brd 10.0.0.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe76:196b/64 scope link 
       valid_lft forever preferred_lft forever
7: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000
    link/ether 1a:e4:a0:38:8d:d0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::18e4:a0ff:fe38:8dd0/64 scope link 
       valid_lft forever preferred_lft forever
8: br-tun: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether b6:16:ca:54:a2:48 brd ff:ff:ff:ff:ff:ff
9: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 96:39:74:52:53:40 brd ff:ff:ff:ff:ff:ff
10: vlan40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 3a:5c:14:43:da:11 brd ff:ff:ff:ff:ff:ff
    inet 172.17.4.22/24 brd 172.17.4.255 scope global vlan40
       valid_lft forever preferred_lft forever
    inet6 fe80::385c:14ff:fe43:da11/64 scope link 
       valid_lft forever preferred_lft forever
11: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 9a:3b:99:88:1c:d9 brd ff:ff:ff:ff:ff:ff
    inet 172.17.1.15/24 brd 172.17.1.255 scope global vlan20
       valid_lft forever preferred_lft forever
    inet 172.17.1.19/32 brd 172.17.1.255 scope global vlan20
       valid_lft forever preferred_lft forever
    inet6 fe80::983b:99ff:fe88:1cd9/64 scope link 
       valid_lft forever preferred_lft forever
12: vlan30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 8e:d5:36:5e:74:28 brd ff:ff:ff:ff:ff:ff
    inet 172.17.3.12/24 brd 172.17.3.255 scope global vlan30
       valid_lft forever preferred_lft forever
    inet 172.17.3.19/32 brd 172.17.3.255 scope global vlan30
       valid_lft forever preferred_lft forever
    inet6 fe80::8cd5:36ff:fe5e:7428/64 scope link 
       valid_lft forever preferred_lft forever
13: vlan50: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether da:c4:d9:40:0d:41 brd ff:ff:ff:ff:ff:ff
    inet 172.17.2.19/24 brd 172.17.2.255 scope global vlan50
       valid_lft forever preferred_lft forever
    inet6 fe80::d8c4:d9ff:fe40:d41/64 scope link 
       valid_lft forever preferred_lft forever
14: br-isolated: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:58:5d:15 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe58:5d15/64 scope link 
       valid_lft forever preferred_lft forever
15: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN 
    link/ether 02:42:e2:ca:4c:3d brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

ovs bridges:
bf3712f4-894a-42a8-8971-722a50c47cc0
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-ex
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "eth2"
            Interface "eth2"
        Port phy-br-ex
            Interface phy-br-ex
                type: patch
                options: {peer=int-br-ex}
        Port br-ex
            Interface br-ex
                type: internal
    Bridge br-isolated
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "vlan30"
            tag: 30
            Interface "vlan30"
                type: internal
        Port phy-br-isolated
            Interface phy-br-isolated
                type: patch
                options: {peer=int-br-isolated}
        Port "eth1"
            Interface "eth1"
        Port "vlan40"
            tag: 40
            Interface "vlan40"
                type: internal
        Port "vlan20"
            tag: 20
            Interface "vlan20"
                type: internal
        Port br-isolated
            Interface br-isolated
                type: internal
        Port "vlan50"
            tag: 50
            Interface "vlan50"
                type: internal
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port int-br-isolated
            Interface int-br-isolated
                type: patch
                options: {peer=phy-br-isolated}
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port int-br-ex
            Interface int-br-ex
                type: patch
                options: {peer=phy-br-ex}
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-tun
            Interface br-tun
                type: internal
        Port "vxlan-ac11020f"
            Interface "vxlan-ac11020f"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.19", out_key=flow, remote_ip="172.17.2.15"}
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-ac11020a"
            Interface "vxlan-ac11020a"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.19", out_key=flow, remote_ip="172.17.2.10"}
        Port "vxlan-ac110212"
            Interface "vxlan-ac110212"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="172.17.2.19", out_key=flow, remote_ip="172.17.2.18"}
    ovs_version: "2.7.2"
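
For illustration, the broadcast path described above can be checked against this bridge listing. The following is only a sketch: the port tables are transcribed by hand from the listing (internal and vxlan ports left out), `flood_path` is a hypothetical helper, and it assumes every bridge floods frames to all of its other ports, i.e. normal switching rather than the agent's flows.

```python
from collections import deque

# Ports per bridge, transcribed from the `ovs-vsctl show` listing above
# (internal and vxlan ports omitted for brevity).
BRIDGE_PORTS = {
    "br-ex": ["eth2", "phy-br-ex"],
    "br-isolated": ["eth1", "phy-br-isolated"],
    "br-int": ["int-br-isolated", "int-br-ex", "patch-tun"],
    "br-tun": ["patch-int"],
}

# Patch-port peers, from the `options: {peer=...}` lines.
PATCH_PEERS = {
    "phy-br-ex": "int-br-ex",
    "int-br-ex": "phy-br-ex",
    "phy-br-isolated": "int-br-isolated",
    "int-br-isolated": "phy-br-isolated",
    "patch-tun": "patch-int",
    "patch-int": "patch-tun",
}


def flood_path(src_port, dst_port, bridge_ports, patch_peers):
    """Return the ports a flooded frame crosses from src_port to dst_port.

    Assumption: each bridge forwards the frame to all of its other ports.
    Returns None when dst_port cannot be reached.
    """
    port_bridge = {p: b for b, ports in bridge_ports.items() for p in ports}
    queue = deque([[src_port]])
    seen = {src_port}
    while queue:
        path = queue.popleft()
        ingress = path[-1]
        for egress in bridge_ports[port_bridge[ingress]]:
            if egress == ingress or egress in seen:
                continue
            seen.add(egress)
            if egress == dst_port:
                return path + [egress]
            # A patch port hands the frame to its peer on another bridge.
            peer = patch_peers.get(egress)
            if peer is not None and peer not in seen:
                seen.add(peer)
                queue.append(path + [egress, peer])
    return None


if __name__ == "__main__":
    print(flood_path("eth1", "eth2", BRIDGE_PORTS, PATCH_PEERS))
    # With no patch peers the two physical networks are no longer bridged.
    print(flood_path("eth1", "eth2", BRIDGE_PORTS, {}))
```

With the patch ports in place, a frame entering on eth1 can leave on eth2; with no patch peers there is no such path, which is the idea behind removing the patch ports at boot.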

Version-Release number of selected component (if applicable):
Not sure what components are applicable

How reproducible:
Always

Steps to Reproduce:
1. Take down all controller nodes at once ungracefully
2. Wait for them to boot up

Actual results:
ARP storm appears on all the nodes, high packet loss between nodes, pacemaker cluster is not formed.

Expected results:
Cluster is formed successfully and no packet loss between nodes.

Additional info:

Comment 4 Jakub Libosvar 2017-09-18 14:58:59 UTC

The description looks pretty accurate :)

So the real issue here is that we shouldn't actually be using OVS bridges for management networks; they should be put on a separate interface or bond [1]. Quote:

"The OVS bridge connects to the Neutron server in order to get configuration data. If the OpenStack control traffic (typically the Control Plane and Internal API networks) is placed on an OVS bridge, then connectivity to the Neutron server gets lost whenever OVS is upgraded or the OVS bridge is restarted by the admin user or process. This will cause some downtime. If downtime is not acceptable under these circumstances, then the Control group networks should be placed on a separate interface or bond rather than on an OVS bridge:

A minimal setting can be achieved, when you put the Internal API network on a VLAN on the provisioning interface and the OVS bridge on a second interface.
If you want bonding, you need at least two bonds (four network interfaces). The control group should be placed on a Linux bond (Linux bridge). If the switch does not support LACP fallback to a single interface for PXE boot, then this solution requires at least five NICs."

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/advanced_overcloud_customization/#sect-Isolating_Networks

Such configuration is passed to tripleo but it's not recommended as per documentation. I'm closing this as a NOTABUG for now as it seems like a bug in deployment tool (infrared?) that generated templates for tripleo.

Comment 5 Jakub Libosvar 2017-09-27 09:17:35 UTC

*** Bug 1495224 has been marked as a duplicate of this bug. ***

Comment 6 Marius Cornea 2017-09-27 11:23:14 UTC

(In reply to Jakub Libosvar from comment #4)
> Such configuration is passed to tripleo but it's not recommended as per
> documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> in deployment tool (infrared?) that generated templates for tripleo.

This recommendation was introduced after OSP11 GA. I haven't tested yet, but I suspect upgraded environments (coming from <=OSP11) would be affected by this issue as well. Since it looks like this bug doesn't manifest in OSP11 on the same topology, I think we should find out where the regression in OSP12 comes from.

Comment 7 Jakub Libosvar 2017-09-27 11:27:10 UTC

(In reply to Marius Cornea from comment #6)
> (In reply to Jakub Libosvar from comment #4)
> > Such configuration is passed to tripleo but it's not recommended as per
> > documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> > in deployment tool (infrared?) that generated templates for tripleo.
> 
> This recommendation was introduced after OSP11 GA. I haven't tested yet but
> I suspect upgraded environments(coming for <=OSP11) would be affected as
> well by this issue. Since it looks that in OSP11 this bug doesn't manifest
> on the same topology I think we should find out where the regression comes
> from in OSP12.

We can revert the fix for bug 1473763 to avoid the ARP storm. However, after a node reboot the ovs-agent would then have no access to neutron-server, because it needs information about how to configure the very bridge it uses to communicate with neutron-server. That creates a chicken-and-egg problem.

In my opinion, ideally we should provide a way to switch isolated networks from ovs bridge to linux devices as part of upgrade process.

Comment 8 Fabio Massimo Di Nitto 2017-09-27 12:27:14 UTC

(In reply to Jakub Libosvar from comment #7)
> (In reply to Marius Cornea from comment #6)
> > (In reply to Jakub Libosvar from comment #4)
> > > Such configuration is passed to tripleo but it's not recommended as per
> > > documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> > > in deployment tool (infrared?) that generated templates for tripleo.
> > 
> > This recommendation was introduced after OSP11 GA. I haven't tested yet but
> > I suspect upgraded environments(coming for <=OSP11) would be affected as
> > well by this issue. Since it looks that in OSP11 this bug doesn't manifest
> > on the same topology I think we should find out where the regression comes
> > from in OSP12.
> 
> We can revert fix for bug 1473763 to avoid the arp storm. Then rebooting
> node will cause that ovs-agent won't have access to neutron-server after
> boot because it needs information about how to configure bridge which it
> uses for communication with neutron-server. So it creates a chicken-egg
> problem.
> 
> In my opinion, ideally we should provide a way to switch isolated networks
> from ovs bridge to linux devices as part of upgrade process.

Shouldn't we keep this bug open to track the activity to migrate environments during upgrades? At least there is a visible reference to the issue in bugzilla and we should have at least a KB for customers upgrading (+ release notes).

Comment 9 Marius Cornea 2017-09-27 12:28:33 UTC

(In reply to Jakub Libosvar from comment #7)
> (In reply to Marius Cornea from comment #6)
> > (In reply to Jakub Libosvar from comment #4)
> > > Such configuration is passed to tripleo but it's not recommended as per
> > > documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> > > in deployment tool (infrared?) that generated templates for tripleo.
> > 
> 
> We can revert fix for bug 1473763 to avoid the arp storm. Then rebooting
> node will cause that ovs-agent won't have access to neutron-server after
> boot because it needs information about how to configure bridge which it
> uses for communication with neutron-server. So it creates a chicken-egg
> problem.

I see that the environment reported for bug 1473763 had the neutron ovs agent running inside a container (openstack-neutron-openvswitch-agent-docker). This is not the case anymore, as we're now running the neutron-related services on the baremetal host and not inside a container. Do you think we should still expect to see the issues reported initially if we reverted the fix?

> In my opinion, ideally we should provide a way to switch isolated networks
> from ovs bridge to linux devices as part of upgrade process.

We should probably look into this and what the implications are. Afaik we're currently not doing any isolated networks configuration update during upgrade. This kind of change highly depends on the physical networking infrastructure, so we need to see if the old recommended architecture is compatible and can be easily migrated to the new recommended architecture, i.e. make sure our users won't have to install additional NICs to be able to perform an upgrade.

Comment 10 Udi Shkalim 2017-09-27 15:48:48 UTC

Reopening the bug for tracking and additional attention.

To summarize:

This bug is introduced only in OSP12 due to a fix related to https://bugzilla.redhat.com/show_bug.cgi?id=1473763 
This issue was not observed in OSP11.

Forcing a network NIC architecture change on customers for an upgrade or new installation is not a solution.
Problems are starting to pop up and we are not close to finishing our tests.

Need to look at comment #9 - maybe reverting the fix will be the solution here?

Comment 11 Omri Hochman 2017-09-27 17:30:04 UTC

Keywords: TestBlocker  

Our OSP12 Assure deployments have to survive the following test post reboot: 

Sequence:
Deployment --> Launch Instance --> Sanity --> Reboot --> Launch Instance --> Sanity

Due to this bug, the post-reboot part of the sequence is currently blocked.

Comment 12 Assaf Muller 2017-09-27 19:55:49 UTC

(In reply to Udi Shkalim from comment #10)
> Reopening the bug for tracking and additional attention.
> 
> To summarize:
> 
> This bug is introduced only in OSP12 due to a fix related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1473763 
> This issue was not observed in OSP11.
> 
> Forcing a network NIC arch. change on customer for upgrade or new
> installation is not a solution.
> Problems are starting to pop up and we are not close to finish our tests.

I agree.

> 
> Need to look on comment #9 - maybe reverting the fix will be the solution
> here?

Jakub and I talked about this today. A straight up revert is not an option because it will reintroduce the different blocker we resolved. We're trying to come up with a solution to both issues.

Comment 14 Jakub Libosvar 2017-10-02 11:30:16 UTC

I can propose a patch to remove patch ports between isolated bridge and integration bridge in the network script.
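
As a rough sketch of the idea only (not the patch that was proposed, and not the contents of the neutron-destroy-patch-ports service), the removal can be expressed as one `ovs-vsctl --if-exists del-port` call per patch port. The helper name and the port list are assumptions taken from the bridge listing in the description; leaving patch-tun/patch-int alone is also an assumption.

```python
def del_patch_port_commands(patch_ports):
    """Build `ovs-vsctl` argument lists that delete the given patch ports.

    patch_ports is an iterable of (bridge, port) pairs. `--if-exists` keeps
    the command from failing when a port is already gone.
    """
    return [
        ["ovs-vsctl", "--if-exists", "del-port", bridge, port]
        for bridge, port in patch_ports
    ]


# Patch ports linking br-int to the provider bridges in this report.
# patch-tun/patch-int (br-int <-> br-tun) are deliberately left alone.
PROVIDER_PATCH_PORTS = [
    ("br-int", "int-br-isolated"),
    ("br-isolated", "phy-br-isolated"),
    ("br-int", "int-br-ex"),
    ("br-ex", "phy-br-ex"),
]

if __name__ == "__main__":
    for cmd in del_patch_port_commands(PROVIDER_PATCH_PORTS):
        # Print only; to apply, run each one with subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```

The commands are only printed here so the sketch can be tried on a machine without Open vSwitch.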

Comment 15 Assaf Muller 2017-10-02 14:32:53 UTC

We spent about an hour talking about this issue today on a team call. Jakub will follow up with Sofer on the details. It looks like the way forward is to break the loop by deleting the patch ports, as Jakub suggested in comment 14.

Comment 16 Marian Krcmarik 2017-10-04 09:34:18 UTC

(In reply to Jakub Libosvar from comment #14)
> I can propose a patch to remove patch ports between isolated bridge and
> integration bridge in the network script.

I've tested the proposed change on my deployment where the bug could be reproduced, and I did not hit the issue (I performed the test multiple times with a successful result).

Comment 27 Jakub Libosvar 2017-10-25 09:14:55 UTC

Bug 1505773 enables the new systemd service through the puppet module. To verify this bug, you can run "systemctl enable neutron-destroy-patch-ports" after installing the new RPM to avoid a dependency on bug 1505773.

Comment 29 Alexander Chuzhoy 2017-10-27 16:40:14 UTC

FailedQA


Environment:
openstack-neutron-11.0.2-0.20171020230401.el7ost.noarch

Seems like the issue is reproducing.

What I see after reboot:




Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:38:09 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-0 overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped
   galera-bundle-1      (ocf::heartbeat:galera):        Stopped
   galera-bundle-2      (ocf::heartbeat:galera):        Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Stopped
   redis-bundle-1       (ocf::heartbeat:redis): Stopped
   redis-bundle-2       (ocf::heartbeat:redis): Stopped
 ip-192.168.24.6        (ocf::heartbeat:IPaddr2):       Stopped
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.15 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.3.14 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Stopped
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

############################################################################

Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:37:27 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-2 ]
OFFLINE: [ overcloud-controller-0 overcloud-controller-1 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped
   galera-bundle-1      (ocf::heartbeat:galera):        Stopped
   galera-bundle-2      (ocf::heartbeat:galera):        Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Stopped
   redis-bundle-1       (ocf::heartbeat:redis): Stopped
   redis-bundle-2       (ocf::heartbeat:redis): Stopped
 ip-192.168.24.6        (ocf::heartbeat:IPaddr2):       Stopped
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.15 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.3.14 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Stopped
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

################################################################


Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:38:50 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 ]
OFFLINE: [ overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped
   galera-bundle-1      (ocf::heartbeat:galera):        Stopped
   galera-bundle-2      (ocf::heartbeat:galera):        Stopped
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Stopped
   redis-bundle-1       (ocf::heartbeat:redis): Stopped
   redis-bundle-2       (ocf::heartbeat:redis): Stopped
 ip-192.168.24.6        (ocf::heartbeat:IPaddr2):       Stopped
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.1.15 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.3.14 (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.4.19 (ocf::heartbeat:IPaddr2):       Stopped
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

#######################################################################
Obviously unable to work with oc
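
The three status dumps above can be compared mechanically. This is a small sketch with a hypothetical helper, `cluster_view`, and it assumes the text has the same layout as the output pasted above (a "partition ... quorum" line plus "Online:" and "OFFLINE:" lists).

```python
import re


def cluster_view(status_text):
    """Summarize one node's view of the cluster from pacemaker status text."""

    def members(label):
        # Matches lines such as "Online: [ overcloud-controller-1 ]".
        match = re.search(rf"^{label}: \[([^\]]*)\]", status_text, re.MULTILINE)
        return match.group(1).split() if match else []

    return {
        # Case-sensitive on purpose: "partition WITHOUT quorum" must not match.
        "quorum": "partition with quorum" in status_text,
        "online": members("Online"),
        "offline": members("OFFLINE"),
    }


if __name__ == "__main__":
    sample = (
        "Current DC: overcloud-controller-1 (version 1.1.16-12.el7_4.4-94ff4df)"
        " - partition WITHOUT quorum\n"
        "Online: [ overcloud-controller-1 ]\n"
        "OFFLINE: [ overcloud-controller-0 overcloud-controller-2 ]\n"
    )
    print(cluster_view(sample))
```

Applied to each dump, every controller reports only itself as online and no quorum, which matches the nodes being unable to reach each other.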

Comment 30 Jakub Libosvar 2017-10-30 09:01:04 UTC

Can you provide more information? Have you enabled the service as I described in comment 27? After the machine has booted, do you still see the patch ports between br-int and the provider bridges? Can you still see the ARP storm?
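
One way to answer the patch-port question is to scan `ovs-vsctl show` output for interfaces of type patch. The parser below is a sketch under the assumption that the output has the same layout as the listing in the description (Interface, then type, then options); `patch_ports` is a hypothetical helper, not part of any OVS or neutron tooling.

```python
import re


def patch_ports(show_output):
    """Return (bridge, interface, peer) for each patch interface found."""
    result = []
    bridge = iface = None
    is_patch = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.startswith("Bridge "):
            bridge = line.split(None, 1)[1].strip('"')
        elif line.startswith("Interface "):
            iface = line.split(None, 1)[1].strip('"')
            is_patch = False
        elif line == "type: patch":
            is_patch = True
        elif is_patch and line.startswith("options:"):
            match = re.search(r"peer=([^,}\s]+)", line)
            if match:
                result.append((bridge, iface, match.group(1).strip('"')))
            is_patch = False
    return result


if __name__ == "__main__":
    import subprocess

    # Needs Open vSwitch installed and sufficient privileges.
    text = subprocess.run(
        ["ovs-vsctl", "show"], capture_output=True, text=True, check=True
    ).stdout
    for bridge, iface, peer in patch_ports(text):
        print(f"{bridge}: {iface} -> {peer}")
```

If the boot-time removal worked and the openvswitch agent has not started yet, no int-br-*/phy-br-* pairs should be listed.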

Comment 35 Jon Schlueter 2017-10-30 20:53:22 UTC

Looks like another patch was added to the build; moving back to MODIFIED.

Comment 37 Omri Hochman 2017-10-30 21:30:40 UTC

Thanks Jon 
Not sure why this bug was switched to ON_QA before the new RPM was included in the puddle.

Also, the final solution cannot require manually running 'systemctl enable neutron-destroy-patch-ports'. We need to make sure that, on a clean deployment with the fix, manually enabling neutron-destroy-patch-ports is not required.

Comment 39 Ihar Hrachyshka 2017-11-06 21:36:17 UTC

Sofer, we have https://bugzilla.redhat.com/show_bug.cgi?id=1505773 to track puppet integration.

Comment 40 Ihar Hrachyshka 2017-11-06 21:37:34 UTC

Sorry, the last comment was for Omri.

Comment 47 errata-xmlrpc 2017-12-13 22:08:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462
