Description of problem:
Over the past 2-3 weeks we have had a repeating issue where all 3 HA router replicas go into the backup state. No master is negotiated and the router is unusable.
Looking at a tcpdump in the router's namespace, I see VRRP traffic arriving at all 3 routers, so all 3 remain in backup.
On a working router (for comparison) I see traffic to only 2 routers, as expected.
Interestingly, in the broken routers' case the VRRP traffic appears to be originating from the *other* router in the tenant!
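For reference, a capture like the one described can be taken along these lines. The router UUID and HA interface name are placeholders, not the actual identifiers from this deployment (VRRP is IP protocol 112):

```shell
# Run on the network node hosting the router namespace.
# <router-uuid> and the ha-* interface name are placeholders.
ip netns exec qrouter-<router-uuid> tcpdump -lnni any 'ip proto 112'
```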
Version-Release number of selected component (if applicable):
How reproducible:
Happens every few days.
Not sure how to reproduce. What I've noticed is that the tenant has more than one router, and the only quick-and-dirty workaround is to delete *all* routers (not just the affected one) and re-create them.
Steps to Reproduce:
1. Not sure
Actual results:
Router is created and shortly afterwards all 3 routers become backup.

Expected results:
Router is created with one master and 2 backups.
The keepalived configuration pasted in comment 1 seems correct, router replicas are supposed to be set up with the same configuration.
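For readers without access to comment 1, an HA router's keepalived configuration looks roughly like the illustrative fragment below. The interface name and values here are made up, not the actual config from the attachment; the key point is that replicas of the *same* router share one `virtual_router_id`, while different routers on the same HA network must each get their own:

```
vrrp_instance VR_1 {
    state BACKUP
    interface ha-3b4e2f          # HA network port inside the router namespace (name is illustrative)
    virtual_router_id 1          # must be unique per logical router on the HA network
    priority 50
    ...
}
```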
Next time this reproduces, can you please attach sosreports from all controllers but more importantly grab someone from the Networking team such as Brian, Jakub, Slawek or Bernard?
The router config in comment 1 is for two different routers in the same tenant.
That doesn't appear correct to me ... is it?
From what I can see, the two separate routers above are put in the same VRRP group, so we have 6 replicas forming 1 master and 5 backups, instead of 2 separate routers of 1 master + 2 backups each.
Maybe I'm totally on the wrong track?
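The 1-master/5-backup outcome described above can be sketched as follows. This is not Neutron code, and it simplifies keepalived's election (equal priorities are really tie-broken by IP address); the replica names and priorities are made up purely to illustrate what happens when two logical routers end up sharing one virtual_router_id:

```python
# Sketch: VRRP election when replicas of two different routers
# accidentally share a single virtual_router_id.

def elect_masters(replicas):
    """Group replicas by virtual_router_id; in each group the
    highest-priority replica becomes MASTER, the rest stay BACKUP."""
    groups = {}
    for name, vr_id, priority in replicas:
        groups.setdefault(vr_id, []).append((name, priority))
    state = {}
    for members in groups.values():
        master = max(members, key=lambda m: m[1])[0]
        for name, _ in members:
            state[name] = "MASTER" if name == master else "BACKUP"
    return state

# Two routers that *should* each have their own vr_id, but both got vr_id 1:
replicas = [
    ("router1-a", 1, 50), ("router1-b", 1, 50), ("router1-c", 1, 50),
    ("router2-a", 1, 60), ("router2-b", 1, 50), ("router2-c", 1, 50),
]
state = elect_masters(replicas)
# All six replicas land in one election: 1 master, 5 backups.
```

With distinct vr_ids the same six replicas would instead form two groups of 1 master + 2 backups.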
Kieran and I had a call this morning. We can confirm there is an issue in vr_id allocation, as the second tenant router has the same vr_id as the first router. There is only a single vr_id allocation for the two routers. We created a third router and it got a new allocation correctly. Fortunately, we do have debug logs from the time the second router was created, so after looking into those we shall hopefully find the cause.
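The expected behaviour is roughly what the sketch below implements. This is a hypothetical, simplified allocator, not Neutron's actual implementation: each logical router on a given HA network should receive a vr_id (1-255) that no other router on that network holds, which is exactly the invariant that broke here:

```python
# Hypothetical sketch of per-network VRID allocation (not Neutron code).

class VridAllocator:
    MAX_VRID = 255  # VRRP virtual router IDs are 8-bit, 1-255

    def __init__(self):
        self.allocated = {}  # network_id -> set of vr_ids in use

    def allocate(self, network_id):
        """Hand out the lowest free vr_id on this HA network."""
        used = self.allocated.setdefault(network_id, set())
        for vr_id in range(1, self.MAX_VRID + 1):
            if vr_id not in used:
                used.add(vr_id)
                return vr_id
        raise RuntimeError("no free VRID on network %s" % network_id)

alloc = VridAllocator()
first = alloc.allocate("ha-net")   # -> 1
second = alloc.allocate("ha-net")  # -> 2; a second router must NOT reuse 1
```

The bug observed here is equivalent to `allocate` returning the same value twice for one network.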
Turns out the logs were not in DEBUG mode, as we recently did a deploy that overwrote our DEBUG mode settings! The logs didn't show anything obvious causing this issue. We haven't seen a repeat of this for the past 4 days. I will enable debugging again and see if the problem happens again.
OSP11 is now retired, see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828