Bug 1433726 - When a network node is rebooting, all HA routers will enter a transition flapping storm
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: zstream
Target Release: 7.0 (Kilo)
Assignee: anil venkata
QA Contact: GenadiC
URL:
Whiteboard:
Duplicates: 1461244 (view as bug list)
Depends On:
Blocks: 1461107 1461109 1461110 1461111 1461113
 
Reported: 2017-03-19 14:41 UTC by David Hill
Modified: 2021-03-11 15:03 UTC (History)

Fixed In Version: openstack-neutron-2015.1.4-16.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1461107 (view as bug list)
Environment:
Last Closed: 2017-07-12 13:15:18 UTC
Target Upstream Version:
Embargoed:




Links
System                  ID              Private  Priority  Status        Summary                             Last Updated
OpenStack gerrit        470905          0        None      None          None                                2017-06-06 13:21:56 UTC
Red Hat Product Errata  RHBA-2017:1747  0        normal    SHIPPED_LIVE  openstack-neutron bug fix advisory  2017-07-12 17:11:36 UTC

Description David Hill 2017-03-19 14:41:38 UTC
Description of problem:
We have been able to reproduce the VRRP transition flapping behavior in a scaled-down environment. I have also attached a dump of the list of packages installed in this environment. Here are the details:

Hardware Specs:
CPUs 2
Cores 16
Enabled Cores 16
Threads 32
Available Memory 393216
Total Memory 393216
Memory Speed 1600
Adapters 1
InterfacesTotal
NICs 8

Environment:
3 Network Nodes
1026 - HA Routers
1206 - Floating IPs
444 - Load Balancer Pools
1052 - Networks

How To Reproduce:
1. Start with all 3 network nodes in service
2. Add the resources to OpenStack as described above. At this point we observed the environment to be fairly stable.
3. Force a failover by stopping all Neutron services on 1 of the network nodes and stopping all keepalived processes (a reboot would suffice).
4. At this point you may begin to observe a router transition storm.
5. Restart Neutron services on the network node taken out of service in step 3. It is at this step that the environment begins to descend into a massive router transition storm.

Version-Release number of selected component (if applicable):


How reproducible:
Every time


Steps to Reproduce:
1. Have lots of HA routers
2. Reboot one of the servers
3.

Actual results:
A flapping storm occurs in which all HA routers begin flapping between MASTER and BACKUP.


Expected results:
Each HA router should transition to BACKUP or MASTER once.

Additional info:

Comment 9 anil venkata 2017-06-05 10:22:07 UTC
Proposed change https://review.openstack.org/#/c/470905/ to fix [1]

[1] https://bugs.launchpad.net/neutron/+bug/1597461

Comment 10 Andreas Karis 2017-06-09 19:56:40 UTC
Hi Anil,

The customer had another outage recently. In your opinion, how long will it take to deliver a hotfix?

- Andreas

Comment 11 anil venkata 2017-06-10 18:21:15 UTC
Hi Andreas

This bug needs 3 patches to be backported to OSP7 (one upstream patch is not yet merged; I hope it gets merged soon). In the newer branches these patches use RPC calls that are not in Kilo, and they also involve huge code changes, which makes backporting difficult. I am working on this with priority and trying to provide a build within this week.

Thanks
Anil

Comment 13 anil venkata 2017-06-20 05:26:17 UTC
The patches below are ready for review. I will ask my team members to review them with priority.

https://code.engineering.redhat.com/gerrit/#/c/101640/
https://code.engineering.redhat.com/gerrit/#/c/101642/
https://code.engineering.redhat.com/gerrit/#/c/109264/

Comment 17 Ihar Hrachyshka 2017-07-03 13:37:44 UTC
*** Bug 1461244 has been marked as a duplicate of this bug. ***

Comment 18 GenadiC 2017-07-09 10:49:12 UTC
When trying to do code verification, I couldn't find the get_routers_id function under neutron/api/rpc/handlers/l3_rpc.py in the code on the controller.
Is there any reason for that?

Comment 19 GenadiC 2017-07-09 11:35:49 UTC
Code verification: the changes from the Red Hat Engineering Gerrit reviews in this bug exist in the code for openstack-neutron-2015.1.4-16.el7ost.noarch

Comment 23 errata-xmlrpc 2017-07-12 13:15:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1747

Comment 24 Andreas Karis 2017-09-11 21:06:52 UTC
I know that this is closed, but just for future reference, when this issue occurred, the following could be observed:

- In the L3 agent log, a lot of router transition messages are seen:

~~~
# grep 'transitioned to' var/log/neutron/l3-agent.log | tail
2017-02-21 10:28:38.094 74155 INFO neutron.agent.l3.ha [-] Router f979011b-944b-4f49-887d-feace356f9f7 transitioned to master
2017-02-21 10:28:38.426 74155 INFO neutron.agent.l3.ha [-] Router 9392d069-e589-46a8-8979-294c98b03040 transitioned to master
2017-02-21 10:28:43.579 74155 INFO neutron.agent.l3.ha [-] Router 23fbce01-37da-4f1f-8af9-105510086966 transitioned to master
2017-02-21 10:28:43.616 74155 INFO neutron.agent.l3.ha [-] Router 1f8e167b-8987-47b7-b1ba-4de944c986a3 transitioned to master
2017-02-21 10:28:43.654 74155 INFO neutron.agent.l3.ha [-] Router 00476f59-ae81-4dc4-bce1-e5dc3b05f2c3 transitioned to master
2017-02-21 10:28:43.783 74155 INFO neutron.agent.l3.ha [-] Router 1fbbbc96-693a-4e94-bc86-801e27934412 transitioned to master
2017-02-21 10:28:43.803 74155 INFO neutron.agent.l3.ha [-] Router e83726b8-3197-4d34-b4bc-cbfc940399e1 transitioned to master
2017-02-21 10:28:57.864 74155 INFO neutron.agent.l3.ha [-] Router 035db19e-3826-41cf-b40a-d253f36ede84 transitioned to master
2017-02-21 10:28:57.969 74155 INFO neutron.agent.l3.ha [-] Router 34c89792-bb5f-481e-9ec2-fb71409b548e transitioned to master
2017-02-21 10:28:59.003 74155 INFO neutron.agent.l3.ha [-] Router 31be9964-23fe-4b56-a57b-3247c068d7c8 transitioned to master
~~~
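
To tell a storm apart from a single expected failover, the transition messages can be counted per router and per minute. The following is a minimal sketch, assuming the log line format shown in the excerpt above; the function names `summarize_transitions` and `flapping_routers` and the `threshold` parameter are hypothetical helpers for illustration, not part of Neutron.

```python
import re
from collections import Counter

# Matches l3-agent lines of the form shown above, e.g.:
# 2017-02-21 10:28:38.094 74155 INFO neutron.agent.l3.ha [-] Router <uuid> transitioned to master
TRANSITION_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+ .*"
    r"Router (?P<router>[0-9a-f-]{36}) transitioned to (?P<state>\w+)"
)


def summarize_transitions(lines):
    """Return (transitions per router, transitions per minute)."""
    per_router = Counter()
    per_minute = Counter()
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match is None:
            continue  # not a transition message
        per_router[match.group("router")] += 1
        per_minute[match.group("minute")] += 1
    return per_router, per_minute


def flapping_routers(per_router, threshold=1):
    """Routers with more than `threshold` transitions (assumed cutoff)."""
    return sorted(r for r, n in per_router.items() if n > threshold)
```

Feeding the lines of l3-agent.log into `summarize_transitions` gives a per-minute count that shows when the storm starts, while `flapping_routers` lists the routers that changed state more than once in the sample. The default threshold of 1 reflects the expected result in this bug (one transition per router) and may need adjusting for longer log windows.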

