Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1570136

Summary: All HA routers go into backup state - becoming unusable
Product: Red Hat OpenStack
Reporter: kforde
Component: openstack-neutron
Assignee: Assaf Muller <amuller>
Status: CLOSED EOL
QA Contact: Toni Freger <tfreger>
Severity: medium
Priority: high
Version: 11.0 (Ocata)
CC: alhernan, amuller, chrisw, jlibosva, kforde, nyechiel, sbaker, srevivo
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-06-22 12:35:58 UTC
Type: Bug

Description kforde 2018-04-20 17:18:14 UTC
Description of problem:

Over the past 2-3 weeks we have had a recurring issue where all 3 HA routers go into backup state. No master is negotiated and the router is unusable.

Looking at a tcpdump of the router namespaces, I see VRRP traffic arriving at all 3 routers, and so all 3 routers remain in backup.

On a working router (for comparison), I see traffic arriving at only 2 of the routers, as expected.

Interestingly, in the broken routers' case the VRRP traffic appears to be originating from the *other* router in the tenant!
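
A minimal sketch of one way to take per-namespace captures like the ones described above. This assumes the standard qrouter-<router_id> namespace naming and that keepalived's VRRP advertisements travel as IP protocol 112; the router IDs below are placeholders and must be replaced with the real ones:

#!/usr/bin/env python
# Sketch only: inspect VRRP advertisements inside each HA router namespace.
# Run as root on each controller/network node hosting the router replicas.
import subprocess

# Placeholder router IDs -- substitute the IDs of the affected routers.
ROUTER_IDS = [
    "11111111-1111-1111-1111-111111111111",
    "22222222-2222-2222-2222-222222222222",
]

for router_id in ROUTER_IDS:
    ns = "qrouter-%s" % router_id
    # VRRP advertisements are IP protocol 112; grab a handful and print
    # their source addresses to see which replica they originate from.
    cmd = ["ip", "netns", "exec", ns,
           "tcpdump", "-lnni", "any", "-c", "10", "ip", "proto", "112"]
    print("=== %s ===" % ns)
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        print(out.decode("utf-8", "replace"))
    except subprocess.CalledProcessError as exc:
        print("capture failed: %s" % exc.output.decode("utf-8", "replace"))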

Version-Release number of selected component (if applicable):

openstack-neutron-10.0.6-0.20180317014607.93330ac.el7.centos.noarch
openstack-neutron-sriov-nic-agent-10.0.6-0.20180317014607.93330ac.el7.centos.noarch
python-neutron-10.0.6-0.20180317014607.93330ac.el7.centos.noarch
puppet-neutron-10.4.1-0.20180216212754.249bdde.el7.centos.noarch
openstack-neutron-openvswitch-10.0.6-0.20180317014607.93330ac.el7.centos.noarch
python2-neutronclient-6.1.1-1.el7.noarch
python-neutron-lbaas-10.0.2-0.20180313085330.dfc0e24.el7.centos.noarch
openstack-neutron-metering-agent-10.0.6-0.20180317014607.93330ac.el7.centos.noarch
python-neutron-lib-1.1.0-1.el7.noarch
openstack-neutron-common-10.0.6-0.20180317014607.93330ac.el7.centos.noarch
openstack-neutron-ml2-10.0.6-0.20180317014607.93330ac.el7.centos.noarch
openstack-neutron-lbaas-10.0.2-0.20180313085330.dfc0e24.el7.centos.noarch


How reproducible:

Happens every few days. 
Not sure how to reproduce. What I've noticed is that the tenant has more than one router, and the only quick and dirty solution is to delete *all* routers (not just the affected one) and re-create them.

Steps to Reproduce:
1. Not sure
2.
3.

Actual results:

Router is created and, shortly after, all 3 routers become backup.

Expected results:

Router is created with one master and 2 backups.


Additional info:

Comment 2 Assaf Muller 2018-04-23 18:08:57 UTC
Hi Kieran,

The keepalived configuration pasted in comment 1 seems correct; router replicas are supposed to be set up with the same configuration.

Next time this reproduces, can you please attach sosreports from all controllers but, more importantly, grab someone from the Networking team such as Brian, Jakub, Slawek or Bernard?

Comment 3 kforde 2018-04-23 19:11:28 UTC
Hi Assaf,

The router config in comment 1 is for two different routers in the same tenant.

That doesn't appear correct to me ... is it? 

From what I can see, the two separate routers above are put in the same VRRP group, so we have a situation with 6 router instances made up of 1 master and 5 backups, instead of 2 separate routers comprised of 1 master + 2 backups each.

Maybe I'm totally on the wrong track?

Comment 4 Jakub Libosvar 2018-04-24 10:28:41 UTC
Kieran and I had a call this morning. We can confirm there is an issue in vr_id allocation, as the second tenant router has the same vr_id as the first router. There is only a single vr_id allocation for the two routers. We created a third router and it got a new allocation correctly. Fortunately, we do have debug logs from the time the second router was created, so after looking into those we shall hopefully find the cause.
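
As a quick way to spot this kind of collision on a node hosting the replicas, a minimal sketch that scans the rendered keepalived configs. It assumes the default L3 agent ha_confs location of /var/lib/neutron/ha_confs/<router_id>/keepalived.conf (the path is configurable, so adjust it for your deployment). Note that routers on different HA networks (different tenants) may legitimately share a vr_id, so only collisions between routers of the same tenant are suspect:

#!/usr/bin/env python
# Sketch only: group HA routers on this node by the virtual_router_id found
# in their keepalived configs and flag any value used by more than one router.
import collections
import glob
import os
import re

# Assumed default ha_confs path -- adjust if state_path/ha_confs_path differ.
HA_CONFS = "/var/lib/neutron/ha_confs"

vrid_to_routers = collections.defaultdict(list)
for conf in glob.glob(os.path.join(HA_CONFS, "*", "keepalived.conf")):
    router_id = os.path.basename(os.path.dirname(conf))
    with open(conf) as f:
        match = re.search(r"virtual_router_id\s+(\d+)", f.read())
    if match:
        vrid_to_routers[int(match.group(1))].append(router_id)

for vrid, routers in sorted(vrid_to_routers.items()):
    flag = "  <-- shared" if len(routers) > 1 else ""
    print("virtual_router_id %d: %s%s" % (vrid, ", ".join(routers), flag))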

Comment 6 kforde 2018-04-25 09:14:57 UTC
Turns out the logs were not in DEBUG mode, as we recently did a deploy that overwrote our DEBUG mode settings! These logs didn't show anything obvious causing this issue. We haven't seen a repeat of this for the past 4 days. I will enable debugging again and see if the problem happens again.

Comment 7 Scott Lewis 2018-06-22 12:35:58 UTC
OSP11 is now retired; see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828