Description of problem:
Over the past 2-3 weeks we have had a repeating issue where all 3 HA router replicas go into the backup state. No master is negotiated and the router is unusable.
Looking at a tcpdump in the router's namespace, I see VRRP traffic arriving at all 3 routers, so all 3 remain in backup.
On a working router (for comparison) I see traffic to only 2 routers, as expected.
Interestingly, in the broken routers' case the VRRP traffic appears to be originating from the *other* router in the tenant!
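For reference, a capture like the one described can be taken along these lines. The router UUID and HA interface name are placeholders, not the actual identifiers from this deployment (VRRP is IP protocol 112):

```shell
# Run on the network node hosting the router namespace.
# <router-uuid> and the ha-* interface name are placeholders.
ip netns exec qrouter-<router-uuid> tcpdump -lnni any 'ip proto 112'
```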
Version-Release number of selected component (if applicable):
How reproducible:
Happens every few days.
Not sure how to reproduce. What I've noticed is that the tenant has more than one router, and the only quick-and-dirty workaround is to delete *all* routers (not just the affected one) and re-create them.
Steps to Reproduce:
1. Not sure
Actual results:
Router is created and shortly afterwards all 3 routers become backup.

Expected results:
Router is created with one master and 2 backups.
The keepalived configuration pasted in comment 1 seems correct, router replicas are supposed to be set up with the same configuration.
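For readers without access to comment 1, an HA router's keepalived configuration looks roughly like the illustrative fragment below. The interface name and values here are made up, not the actual config from the attachment; the key point is that replicas of the *same* router share one `virtual_router_id`, while different routers on the same HA network must each get their own:

```
vrrp_instance VR_1 {
    state BACKUP
    interface ha-3b4e2f          # HA network port inside the router namespace (name is illustrative)
    virtual_router_id 1          # must be unique per logical router on the HA network
    priority 50
    ...
}
```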
Next time this reproduces, can you please attach sosreports from all controllers but more importantly grab someone from the Networking team such as Brian, Jakub, Slawek or Bernard?
The router config in comment 1 is for two different routers in the same tenant.
That doesn't appear correct to me ... is it?
From what I can see, the two separate routers above are put in the same VRRP group, so we have 6 replicas forming 1 master and 5 backups, instead of 2 separate routers of 1 master + 2 backups each.
Maybe I'm totally on the wrong track?
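The 1-master/5-backup outcome described above can be sketched as follows. This is not Neutron code, and it simplifies keepalived's election (equal priorities are really tie-broken by IP address); the replica names and priorities are made up purely to illustrate what happens when two logical routers end up sharing one virtual_router_id:

```python
# Sketch: VRRP election when replicas of two different routers
# accidentally share a single virtual_router_id.

def elect_masters(replicas):
    """Group replicas by virtual_router_id; in each group the
    highest-priority replica becomes MASTER, the rest stay BACKUP."""
    groups = {}
    for name, vr_id, priority in replicas:
        groups.setdefault(vr_id, []).append((name, priority))
    state = {}
    for members in groups.values():
        master = max(members, key=lambda m: m[1])[0]
        for name, _ in members:
            state[name] = "MASTER" if name == master else "BACKUP"
    return state

# Two routers that *should* each have their own vr_id, but both got vr_id 1:
replicas = [
    ("router1-a", 1, 50), ("router1-b", 1, 50), ("router1-c", 1, 50),
    ("router2-a", 1, 60), ("router2-b", 1, 50), ("router2-c", 1, 50),
]
state = elect_masters(replicas)
# All six replicas land in one election: 1 master, 5 backups.
```

With distinct vr_ids the same six replicas would instead form two groups of 1 master + 2 backups.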
Kieran and I had a call this morning. We can confirm there is an issue in vr_id allocation, as the second tenant router has the same vr_id as the first router. There is only a single vr_id allocation for the two routers. We created a third router and it got a new allocation correctly. Fortunately, we do have debug logs from the time the second router was created, so after looking into those we shall hopefully find the cause.
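The expected behaviour is roughly what the sketch below implements. This is a hypothetical, simplified allocator, not Neutron's actual implementation: each logical router on a given HA network should receive a vr_id (1-255) that no other router on that network holds, which is exactly the invariant that broke here:

```python
# Hypothetical sketch of per-network VRID allocation (not Neutron code).

class VridAllocator:
    MAX_VRID = 255  # VRRP virtual router IDs are 8-bit, 1-255

    def __init__(self):
        self.allocated = {}  # network_id -> set of vr_ids in use

    def allocate(self, network_id):
        """Hand out the lowest free vr_id on this HA network."""
        used = self.allocated.setdefault(network_id, set())
        for vr_id in range(1, self.MAX_VRID + 1):
            if vr_id not in used:
                used.add(vr_id)
                return vr_id
        raise RuntimeError("no free VRID on network %s" % network_id)

alloc = VridAllocator()
first = alloc.allocate("ha-net")   # -> 1
second = alloc.allocate("ha-net")  # -> 2; a second router must NOT reuse 1
```

The bug observed here is equivalent to `allocate` returning the same value twice for one network.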
Turns out the logs were not in DEBUG mode, as we recently did a deploy that overwrote our DEBUG mode settings! The logs didn't show anything obvious causing this issue. We haven't seen a repeat of this for the past 4 days. I will enable debugging again and see if the problem happens again.
OSP11 is now retired, see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828