As the original description comment can't be public, here is the description without the private data:

Description of problem:
Sometimes all L3 HA router instances are in standby mode and hence the router's external IP is unpingable.

In the customer's words:
-----------------------
After creating 5 routers with the Heat stack, I create an additional external router and try to ping its external IP. If the ping is OK, I delete the stack and the external router and try again. After several retries the problem occurs, and from that point every new external router on this tenant fails to respond to ping. After deleting the stack, the tenant recovers and external router creation works OK again.

After investigating the problem, we believe the root cause is a race condition when creating 2 routers with no delay. The scenario is:
1. Create a router on a new tenant:
   - Neutron builds an internal network with a subnet for HA management
   - Neutron creates 3 router instances, one on each controller
   - The 3 instances talk to each other and elect one active and 2 standby
2. The second router creation fires (with no delay):
   - The internal network of the tenant is still under creation
   - Neutron creates a new internal management network
   - This new network fails (no subnet, due to conflict with the first network)
   - Neutron builds 3 router instances and all of them stay in standby
3. From this point, every router created in this tenant tries to communicate on the new, broken network.

We believe that Neutron should protect against this situation, and it is only one example of HA resource creation.
-----------------------

Non working:
~~~
#neutron l3-agent-list-hosting-router rh-ext-r1
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 2bc16072-df38-42fc-b292-202ef79ad826 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| 861c0afa-17b2-494b-94af-77d24a10bcf6 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| a7c4816a-2512-4407-8563-e13f69cd4410 | overcloud-controller-0.localdomain | True           | :-)   | standby  | <<===============
+--------------------------------------+------------------------------------+----------------+-------+----------+
~~~

Working:
~~~
[root@overcloud-controller-2 (overcloudrc) ~]# neutron l3-agent-list-hosting-router rh2-ext-r2
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 2bc16072-df38-42fc-b292-202ef79ad826 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| 861c0afa-17b2-494b-94af-77d24a10bcf6 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| a7c4816a-2512-4407-8563-e13f69cd4410 | overcloud-controller-0.localdomain | True           | :-)   | active   | <<===============
+--------------------------------------+------------------------------------+----------------+-------+----------+
~~~

Version-Release number of selected component (if applicable):
# rpm -qa|grep neutron
python-neutronclient-6.0.1-1.el7ost.noarch
python-neutron-lbaas-9.2.2-1.el7ost.noarch
python-neutron-tests-9.4.1-12.el7ost.noarch
openstack-neutron-bigswitch-lldp-9.42.7-2.el7ost.noarch
python-neutron-9.4.1-12.el7ost.noarch
openstack-neutron-openvswitch-9.4.1-12.el7ost.noarch
openstack-neutron-metering-agent-9.4.1-12.el7ost.noarch
puppet-neutron-9.5.0-4.el7ost.noarch
python-neutron-lib-0.4.0-1.el7ost.noarch
openstack-neutron-bigswitch-agent-9.42.7-2.el7ost.noarch
openstack-neutron-common-9.4.1-12.el7ost.noarch
openstack-neutron-sriov-nic-agent-9.4.1-12.el7ost.noarch
openstack-neutron-lbaas-9.2.2-1.el7ost.noarch
openstack-neutron-9.4.1-12.el7ost.noarch
openstack-neutron-ml2-9.4.1-12.el7ost.noarch

How reproducible:
Happens randomly.

Steps to Reproduce:
1. See the scenario described above.

Actual results:
The new router is reported as standby on all three L3 agents (see the "Non working" output above) and its external IP is not pingable.

Expected results:
Exactly one L3 agent reports the router as active (see the "Working" output above).
Looking at the l3-agent logs on all three controllers, I noticed they were all logging that keepalived was constantly dying, for example:
~~~
keepalived for router with uuid d350fa7c-90de-400d-b375-320b7101e5a3 not found. The process should not have died
WARNING neutron.agent.linux.external_process [-] Respawning keepalived for uuid d350fa7c-90de-400d-b375-320b7101e5a3
~~~
Each controller had about 10MB of these messages. And in /var/log/messages, things like:
~~~
Keepalived_vrrp exited with permanent error CONFIG. Terminating
~~~
/var/lib/neutron/ha_confs/ looks readable and has config files that look valid for the routers.

Can the customer confirm this is still happening?
I was trying to reproduce this issue today and indeed I hit it once, but the root cause was slightly different from the one described here. It was indeed a race condition, but the race was in assigning virtual_router_id values to different routers. I created 2 routers one after another and both got virtual_router_id=1, so, since both were using the same HA network, only one keepalived instance set its router active and the second router was standby on all 3 nodes. But that doesn't explain why every router created later was also standby on all nodes; I will try to investigate it more.
When I spotted this issue I had 2 routers with vr_id=1 allocated:
~~~
2019-04-04 09:45:45.602 1028016 DEBUG neutron.db.l3_hamode_db [req-006ff5d4-8035-4bf6-b1d1-16b533d1784b 31b4ad3048174b3fa759f81767c89b7f 8447b19cd21c4187954cea49ce8de9b1 - - -] Router 1658ac48-deb3-4f27-99e1-850435357919 has been allocated a ha_vr_id 1. _ensure_vr_id /usr/lib/python2.7/site-packages/neutron/db/l3_hamode_db.py:245
~~~
And on another controller:
~~~
2019-04-04 09:45:44.769 1034924 DEBUG neutron.db.l3_hamode_db [req-a285b510-bce2-424f-a186-146b4ab26d98 31b4ad3048174b3fa759f81767c89b7f 8447b19cd21c4187954cea49ce8de9b1 - - -] Router d1d9ad1a-8561-49b5-b623-83106c3cd07f has been allocated a ha_vr_id 1. _ensure_vr_id /usr/lib/python2.7/site-packages/neutron/db/l3_hamode_db.py:245
~~~
But looking at https://github.com/openstack/neutron/blob/newton-eol/neutron/db/l3_hamode_db.py#L144, this can't happen if everything uses the same HA network, because vr_id is part of the primary key in the DB, so it is safe at the DB layer. I will continue investigating.
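For illustration, here is a small standalone sketch of that DB-layer guarantee. It is not the real neutron model, just a simplified SQLAlchemy table with the same composite primary key as ha_router_vrid_allocations (see the DB output in a later comment), backed by an in-memory SQLite database:
~~~
# Simplified sketch (not the real neutron model): the (network_id, vr_id) pair
# is the composite primary key, so the same pair cannot be allocated twice on
# one network; the second insert is rejected by the database itself.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class HARouterVRIdAllocation(Base):
    __tablename__ = "ha_router_vrid_allocations"
    network_id = Column(String(36), primary_key=True)
    vr_id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")          # in-memory DB for the demo
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(HARouterVRIdAllocation(network_id="net-a", vr_id=1))
    session.commit()                          # first allocation succeeds
    session.add(HARouterVRIdAllocation(network_id="net-a", vr_id=1))
    try:
        session.commit()                      # duplicate pair -> rejected
    except IntegrityError as exc:
        print("duplicate (net-a, 1) rejected:", exc.orig)
~~~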
I spotted the same issue again (after more than 700 attempts, so it is not a very frequent issue, at least in my case). What I got is:
~~~
[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-2
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | active   |
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-1
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+
~~~
And in the DB it looks like:
~~~
MariaDB [ovs_neutron]> select * from router_extra_attributes;
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
| router_id                            | distributed | service_router | ha | ha_vr_id | availability_zone_hints |
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
| 6ba430d7-2f9d-4e8e-a59f-4d4fb5644a8e |           0 |              0 |  1 |        1 | []                      |
| ace64e85-5f3b-4815-aeae-3b54c75ef5eb |           0 |              0 |  1 |        1 | []                      |
| cd6b61e1-60c9-47da-8866-169ca29ece20 |           1 |              0 |  0 |        0 | []                      |
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
3 rows in set (0.01 sec)

MariaDB [ovs_neutron]> select * from ha_router_vrid_allocations;
+--------------------------------------+-------+
| network_id                           | vr_id |
+--------------------------------------+-------+
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     1 |
+--------------------------------------+-------+
1 row in set (0.01 sec)
~~~
So indeed there is a possible race when 2 different routers are created within a very short time.
But when I then created another router, it was created properly with a new vr_id and everything worked fine for it:
~~~
[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-3
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | active   |
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

MariaDB [ovs_neutron]> select * from ha_router_vrid_allocations;
+--------------------------------------+-------+
| network_id                           | vr_id |
+--------------------------------------+-------+
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     1 |
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     2 |
+--------------------------------------+-------+
~~~
It looks like the code responsible for allocating this vr_id is here: https://github.com/openstack/neutron/blob/newton-eol/neutron/db/l3_hamode_db.py#L210
At first glance, changing it to allocate a "random" value from the available vr_id range could make this problem much less visible to the user. A proper fix, IMO, should be done in the DB by adding a unique constraint on the pair (router_id, ha_vr_id) in the "router_extra_attributes" table, but that would certainly not be backportable.
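As an illustration of the "random value" idea, here is a minimal sketch. It is not the actual neutron code or the proposed patch; the function name is made up, and 1-255 is assumed as the valid VRRP virtual_router_id range:
~~~
# Sketch only: pick a random free vr_id instead of always taking the lowest
# one, so two near-simultaneous allocations rarely race for the same value.
import random

VR_ID_RANGE = set(range(1, 256))  # assumed valid VRRP virtual_router_id values

def allocate_random_vr_id(allocated_vr_ids):
    """Return a free vr_id chosen at random from the remaining pool."""
    available = VR_ID_RANGE - set(allocated_vr_ids)
    if not available:
        raise RuntimeError("no free vr_id left on this HA network")
    return random.choice(sorted(available))

# Example: with vr_id=1 already taken, a concurrent second request is very
# unlikely to pick 1 again, unlike the "lowest free value" strategy.
print(allocate_random_vr_id({1}))
~~~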
I think I know what is going on here. It is a race condition between creating the HA network and assigning a new vr_id to the router. Let's assume we are creating 2 different routers (the first 2 HA routers for the tenant) and each request goes to a different controller:
1. Controller-1, as part of creating router-1, creates an HA network, let's call it HA-Net-A.
2. For some reason (I'm not sure what the reason was exactly), controller-1 starts to remove HA-Net-A, but
3. at the same time controller-2 finds HA-Net-A and router-2 tries to use it.
4. Controller-2 allocates vr_id=1 for router-2 on HA-Net-A.
5. HA-Net-A is finally removed on controller-1, so controller-2 also gets an error and retries configuring router-2.
6. Controller-2 creates a new network, HA-Net-B, but router-2 already has vr_id=1 allocated (see point 4); that value is stored in a different table in the DB and has nothing to do with the removed network.
7. Controller-1 tries to allocate a vr_id for router-1. As this now happens on HA-Net-B, vr_id=1 is free on this network, so it is allocated.
Finally, both routers have vr_id=1 allocated and only one of them is active on one L3 agent (a minimal simulation of this sequence is sketched below).
So my patch https://review.openstack.org/#/c/651495/ should at least mitigate this issue, and I will backport this change to OSP-10. Unfortunately, I think that this will require some changes at the DB layer and thus may not be possible to backport to old versions.
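To make the sequence concrete, here is a tiny in-memory simulation of steps 4-7. It is my own sketch, not neutron code; the two dictionaries only mimic the ha_router_vrid_allocations and router_extra_attributes tables:
~~~
# Sketch: the (network_id, vr_id) pair is only unique per network, so when the
# HA network is deleted and recreated between two allocations, both routers
# can end up holding ha_vr_id=1.

vrid_allocations = set()   # rows of (network_id, vr_id), mimicking the PK
router_vr_ids = {}         # router_id -> ha_vr_id, kept in a separate "table"

def allocate(network_id, router_id):
    """Allocate the lowest vr_id that is free on the given network."""
    vr_id = 1
    while (network_id, vr_id) in vrid_allocations:
        vr_id += 1
    vrid_allocations.add((network_id, vr_id))
    router_vr_ids[router_id] = vr_id

allocate("HA-Net-A", "router-2")                  # step 4: vr_id=1 on HA-Net-A

# step 5: HA-Net-A is removed and its allocation rows disappear with it, but
# router-2 keeps ha_vr_id=1 in its own table (step 6)
vrid_allocations = {row for row in vrid_allocations if row[0] != "HA-Net-A"}

allocate("HA-Net-B", "router-1")                  # step 7: vr_id=1 free on HA-Net-B

print(router_vr_ids)   # {'router-2': 1, 'router-1': 1} -> VRRP conflict
~~~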
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1721