Bug 1490281
Summary: | ARP storm on controllers after all controllers ungracefully reset at once | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marian Krcmarik <mkrcmari> |
Component: | openstack-neutron | Assignee: | Jakub Libosvar <jlibosva> |
Status: | CLOSED ERRATA | QA Contact: | Marian Krcmarik <mkrcmari> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 12.0 (Pike) | CC: | ahrechan, akaris, amuller, aschultz, chrisw, fdinitto, hbrock, ihrachys, jlibosva, jschluet, jslagle, mariel, mburns, mcornea, michele, mkrcmari, nyechiel, oblaut, ohochman, rhel-osp-director-maint, sasha, srevivo, tfreger, tvignaud, ushkalim |
Target Milestone: | rc | Keywords: | AutomationBlocker, Reopened, TestBlocker, Triaged |
Target Release: | 12.0 (Pike) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-neutron-11.0.2-0.20171020230402.el7ost | Doc Type: | Bug Fix |
Doc Text: |
Some deployments use Neutron provider bridges for internal traffic, such as AMQP traffic, which requires those bridges to behave like normal learning switches from boot. Because ARP broadcast packets travel between the integration bridge and the provider bridges over patch ports, ARP storms occurred when multiple controllers were turned off ungracefully and then booted up simultaneously.
The new systemd service, neutron-destroy-patch-ports, now runs at boot to remove the patch ports and break the connection between the integration bridge and the provider bridges. This prevents ARP storms; the patch ports are recreated after the Open vSwitch agent starts.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2017-12-13 22:08:14 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1505773 | ||
Bug Blocks: |
Description
Marian Krcmarik
2017-09-11 08:44:46 UTC
Jakub Libosvar (comment 4):

The description looks pretty accurate :) So the real issue here is that we shouldn't be using OVS bridges for management networks at all; those networks should go on a separate interface or bond [1]. Quote:

"The OVS bridge connects to the Neutron server in order to get configuration data. If the OpenStack control traffic (typically the Control Plane and Internal API networks) is placed on an OVS bridge, then connectivity to the Neutron server gets lost whenever OVS is upgraded or the OVS bridge is restarted by the admin user or process. This will cause some downtime. If downtime is not acceptable under these circumstances, then the Control group networks should be placed on a separate interface or bond rather than on an OVS bridge: A minimal setting can be achieved when you put the Internal API network on a VLAN on the provisioning interface and the OVS bridge on a second interface. If you want bonding, you need at least two bonds (four network interfaces). The control group should be placed on a Linux bond (Linux bridge). If the switch does not support LACP fallback to a single interface for PXE boot, then this solution requires at least five NICs."

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/advanced_overcloud_customization/#sect-Isolating_Networks

Such a configuration is passed to TripleO, but it is not recommended per the documentation. I'm closing this as NOTABUG for now, since it seems to be a bug in the deployment tool (infrared?) that generated the templates for TripleO.

*** Bug 1495224 has been marked as a duplicate of this bug. ***

Marius Cornea (comment 6):

(In reply to Jakub Libosvar from comment #4)
> Such configuration is passed to tripleo but it's not recommended as per
> documentation. I'm closing this as a NOTABUG for now as it seems like a bug
> in deployment tool (infrared?) that generated templates for tripleo.

This recommendation was introduced after OSP11 GA. I haven't tested it yet, but I suspect upgraded environments (coming from <=OSP11) would be affected by this issue as well. Since it looks like this bug doesn't manifest on the same topology in OSP11, I think we should find out where the regression comes from in OSP12.

Jakub Libosvar (comment 7):

We can revert the fix for bug 1473763 to avoid the ARP storm. But then a rebooted node's ovs-agent won't have access to neutron-server after boot, because it needs information from neutron-server about how to configure the very bridge it uses to communicate with neutron-server. So it creates a chicken-and-egg problem.

In my opinion, ideally we should provide a way to switch the isolated networks from an OVS bridge to Linux devices as part of the upgrade process.
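The loop described above can be inspected and broken by hand with ovs-vsctl. A minimal sketch, assuming a provider bridge named br-ex (the Neutron OVS agent names each patch-port pair int-<bridge> / phy-<bridge>):

    # List ports on the integration bridge; patch ports to provider
    # bridges appear as int-<bridge>, e.g. int-br-ex.
    ovs-vsctl list-ports br-int

    # Confirm the patch pair that connects the two bridges.
    ovs-vsctl get Interface int-br-ex options:peer

    # Deleting either end of the pair breaks the broadcast path between
    # br-int and br-ex; the OVS agent recreates the pair once it
    # reconnects to neutron-server.
    ovs-vsctl --if-exists del-port br-int int-br-ex
    ovs-vsctl --if-exists del-port br-ex phy-br-ex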
Comment 8:

Shouldn't we keep this bug open to track the work to migrate environments during upgrades? That way there is at least a visible reference to the issue in Bugzilla, and we should have at least a KB article for customers upgrading (plus release notes).

Comment 9:

(In reply to Jakub Libosvar from comment #7)
> We can revert fix for bug 1473763 to avoid the arp storm. Then rebooting
> node will cause that ovs-agent won't have access to neutron-server after
> boot because it needs information about how to configure bridge which it
> uses for communication with neutron-server. So it creates a chicken-egg
> problem.

I see that the environment reported for bug 1473763 had the Neutron OVS agent running inside a container (openstack-neutron-openvswitch-agent-docker). This is not the case anymore, as we're now running the Neutron-related services on the bare-metal host and not inside a container. Do you think we should still expect to see the issues reported initially if we reverted the fix?

> In my opinion, ideally we should provide a way to switch isolated networks
> from ovs bridge to linux devices as part of upgrade process.

We should probably look into this and into what the implications are. AFAIK we currently don't do any isolated-network configuration updates during upgrade. This kind of change depends heavily on the physical networking infrastructure, so we need to see whether the old recommended architecture is compatible with, and can easily be migrated to, the new recommended architecture, i.e. make sure our users won't have to install additional NICs to be able to perform an upgrade.
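For the migration question above, a quick way to check which layout a node actually has (a sketch only; bridge, bond, and VLAN names vary per deployment):

    # OVS view: if control-plane VLAN interfaces hang off an OVS bridge,
    # the node has the old layout this bug affects.
    ovs-vsctl show

    # Kernel view: on the recommended layout the control-group VLANs sit
    # on a Linux bond instead of an OVS bridge.
    ip -br addr show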
Udi Shkalim (comment 10):

Reopening the bug for tracking and additional attention. To summarize:

This bug was introduced only in OSP12, by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1473763. The issue was not observed in OSP11.
Forcing a NIC architecture change on customers for an upgrade or a new installation is not a solution.
Problems are starting to pop up and we are not close to finishing our tests.

We need to look at comment #9 - maybe reverting the fix will be the solution here?

Keywords: TestBlocker

Our OSP12 Assure deployments have to survive the following test after a reboot:

Sequence: Deployment -> Launch Instance -> Sanity -> Reboot -> Launch Instance -> Sanity

Due to this bug, the post-reboot part of the sequence is currently blocked.

(In reply to Udi Shkalim from comment #10)
> Forcing a network NIC arch. change on customer for upgrade or new
> installation is not a solution.

I agree.

> Need to look on comment #9 - maybe reverting the fix will be the solution
> here?

Jakub and I talked about this today. A straight-up revert is not an option because it would reintroduce the other blocker we resolved. We're trying to come up with a solution to both issues.

Jakub Libosvar (comment 14):

I can propose a patch to remove the patch ports between the isolated bridges and the integration bridge in the network script.

We spent about an hour talking about this issue today on a team call; Jakub will follow up with Sofer on the details. It looks like the way forward is to break the loop by deleting the patch ports, as Jakub commented in comment 14.

(In reply to Jakub Libosvar from comment #14)
> I can propose a patch to remove patch ports between isolated bridge and
> integration bridge in the network script.

I've tested the proposed change on a deployment where the bug could be reproduced, and I did not hit the issue (I performed the test multiple times with successful results).

Comment 27:

Bug 1505773 covers enabling the new systemd service via the puppet module. To verify this bug, you can run "systemctl enable neutron-destroy-patch-ports" after installing the new RPM, to avoid the dependency on bug 1505773.
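Following those instructions, verification would look roughly like this (a sketch; the unit name comes from the fixed RPM, the rest is generic systemd usage):

    # Enable the one-shot unit from the new RPM so it runs at boot,
    # before the OVS agent comes up, then reboot the node.
    systemctl enable neutron-destroy-patch-ports
    systemctl reboot

    # After boot, confirm the unit actually ran.
    systemctl status neutron-destroy-patch-ports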
FailedQA. Environment: openstack-neutron-11.0.2-0.20171020230401.el7ost.noarch

It seems the issue is reproducing. What I see after a reboot:

Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:38:09 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-0 overcloud-controller-2 ]

Full list of resources:

Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
  rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped
  rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped
  rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
  galera-bundle-0 (ocf::heartbeat:galera): Stopped
  galera-bundle-1 (ocf::heartbeat:galera): Stopped
  galera-bundle-2 (ocf::heartbeat:galera): Stopped
Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
  redis-bundle-0 (ocf::heartbeat:redis): Stopped
  redis-bundle-1 (ocf::heartbeat:redis): Stopped
  redis-bundle-2 (ocf::heartbeat:redis): Stopped
ip-192.168.24.6 (ocf::heartbeat:IPaddr2): Stopped
ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.1.11 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.1.15 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.3.14 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.4.19 (ocf::heartbeat:IPaddr2): Stopped
Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
  haproxy-bundle-docker-0 (ocf::heartbeat:docker): Stopped
  haproxy-bundle-docker-1 (ocf::heartbeat:docker): Stopped
  haproxy-bundle-docker-2 (ocf::heartbeat:docker): Stopped
openstack-cinder-volume (systemd:openstack-cinder-volume): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

############################################################################

Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:37:27 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-2 ]
OFFLINE: [ overcloud-controller-0 overcloud-controller-1 ]

Full list of resources:

Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
  rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped
  rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped
  rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
  galera-bundle-0 (ocf::heartbeat:galera): Stopped
  galera-bundle-1 (ocf::heartbeat:galera): Stopped
  galera-bundle-2 (ocf::heartbeat:galera): Stopped
Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
  redis-bundle-0 (ocf::heartbeat:redis): Stopped
  redis-bundle-1 (ocf::heartbeat:redis): Stopped
  redis-bundle-2 (ocf::heartbeat:redis): Stopped
ip-192.168.24.6 (ocf::heartbeat:IPaddr2): Stopped
ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.1.11 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.1.15 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.3.14 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.4.19 (ocf::heartbeat:IPaddr2): Stopped
Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
  haproxy-bundle-docker-0 (ocf::heartbeat:docker): Stopped
  haproxy-bundle-docker-1 (ocf::heartbeat:docker): Stopped
  haproxy-bundle-docker-2 (ocf::heartbeat:docker): Stopped
openstack-cinder-volume (systemd:openstack-cinder-volume): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

################################################################

Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.16-12.el7_4.4-94ff4df) - partition WITHOUT quorum
Last updated: Fri Oct 27 16:38:50 2017
Last change: Fri Oct 27 15:30:13 2017 by root via cibadmin on overcloud-controller-0

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 ]
OFFLINE: [ overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
  rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped
  rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped
  rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
  galera-bundle-0 (ocf::heartbeat:galera): Stopped
  galera-bundle-1 (ocf::heartbeat:galera): Stopped
  galera-bundle-2 (ocf::heartbeat:galera): Stopped
Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
  redis-bundle-0 (ocf::heartbeat:redis): Stopped
  redis-bundle-1 (ocf::heartbeat:redis): Stopped
  redis-bundle-2 (ocf::heartbeat:redis): Stopped
ip-192.168.24.6 (ocf::heartbeat:IPaddr2): Stopped
ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.1.11 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.1.15 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.3.14 (ocf::heartbeat:IPaddr2): Stopped
ip-172.17.4.19 (ocf::heartbeat:IPaddr2): Stopped
Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
  haproxy-bundle-docker-0 (ocf::heartbeat:docker): Stopped
  haproxy-bundle-docker-1 (ocf::heartbeat:docker): Stopped
  haproxy-bundle-docker-2 (ocf::heartbeat:docker): Stopped
openstack-cinder-volume (systemd:openstack-cinder-volume): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

#######################################################################

The overcloud is obviously unusable in this state.

Can you provide more information? Have you enabled the service like I described in comment 27? After the machine is booted, do you still see the patch ports between br-int and the provider bridges? Can you still see the ARP storm?
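One way to answer the last two questions on a booted controller (a sketch; bridge names are deployment-specific, br-ex assumed here):

    # Are patch ports to the provider bridges still present on br-int?
    ovs-vsctl list-ports br-int | grep '^int-'

    # Is an ARP storm visible? A healthy network shows a trickle of ARP
    # traffic; a storm fills the capture with broadcasts within seconds.
    tcpdump -i br-ex -nn -c 1000 arp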
Looks like another patch was added to the build; moving back to MODIFIED. Thanks, Jon.

Not sure why this bug was switched to ON_QA before the new RPM was included in the puddle. Also, the final solution cannot require manually running 'systemctl enable neutron-destroy-patch-ports'; we need to make sure that on a clean deployment with the fix, manually enabling neutron-destroy-patch-ports won't be required.

Sofer, we have https://bugzilla.redhat.com/show_bug.cgi?id=1505773 to track the puppet integration. (Sorry, the last comment was meant for Omri.)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462