Bug 1578459 - HA router sometimes goes into standby mode on all controllers
Summary: HA router sometimes goes into standby mode on all controllers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z12
Target Release: 10.0 (Newton)
Assignee: Slawek Kaplonski
QA Contact: Candido Campos
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-15 15:54 UTC by Shatadru Bandyopadhyay
Modified: 2019-07-10 09:18 UTC (History)
12 users

Fixed In Version: openstack-neutron-9.4.1-42.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-10 09:18:42 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1823314 0 None None None 2019-04-05 10:30:57 UTC
Red Hat Product Errata RHBA-2019:1721 0 None None None 2019-07-10 09:18:45 UTC

Comment 2 Aviv Guetta 2018-05-21 11:58:15 UTC
As the original description cannot be made public, here is the description with the private data removed:

Description of problem:

Sometimes all L3 HA router instances end up in standby mode, and hence the router's external IP is unpingable.


In the customer's words:
-----------------------
After creating 5 routers with the Heat stack, I create an additional external router and try to ping its external IP.

If the ping is OK, I delete the stack and the external router and try again. After several retries the problem occurs, and from that point on, every new external router on this tenant fails to answer ping.
After deleting the stack, the tenant recovers and external router creation works OK again.



After investigating the problem, we believe the root cause is a race condition when creating 2 routers with no delay between them.
The scenario is:
1. Create a router on a new tenant:
    - Neutron builds an internal network with a subnet for HA management.
    - Neutron creates the router instance on all 3 controllers.
    - The 3 instances talk and decide on one active and 2 standbys.
2. The creation of the second router fires (with no delay):
    - The internal network of the tenant is still under creation.
    - Neutron creates a new internal management network.
    - This new network is broken (no subnet, due to a conflict with the first network).
    - Neutron builds 3 router instances, and all of them stay in standby.
3. From this point, every router created in this tenant tries to communicate over the damaged network.

We believe Neutron should protect against this situation; this is only one example of a race in HA resource creation.
-----------------------
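Because the race is timing-dependent, a loop of this shape can help reproduce it. This is a hedged sketch, not the customer's Heat-based reproducer: the router names, iteration count, and sleep interval are arbitrary, and it assumes credentials for a fresh tenant have been sourced and l3_ha is enabled in neutron.conf.
~~~
#!/bin/bash
# Create two HA routers back-to-back and check that each one ends up
# with exactly one active instance across the L3 agents.
for i in $(seq 1 100); do
    neutron router-create "repro-r1-$i" >/dev/null &
    neutron router-create "repro-r2-$i" >/dev/null &
    wait
    sleep 30   # give keepalived time to elect a master
    for r in "repro-r1-$i" "repro-r2-$i"; do
        if ! neutron l3-agent-list-hosting-router "$r" | grep -q active; then
            echo "Router $r has no active instance (iteration $i)"
            exit 1
        fi
    done
    neutron router-delete "repro-r1-$i"
    neutron router-delete "repro-r2-$i"
done
~~~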


Not working:
~~~
#neutron l3-agent-list-hosting-router rh-ext-r1
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 2bc16072-df38-42fc-b292-202ef79ad826 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| 861c0afa-17b2-494b-94af-77d24a10bcf6 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| a7c4816a-2512-4407-8563-e13f69cd4410 | overcloud-controller-0.localdomain | True           | :-)   | standby  |  <<===============
+--------------------------------------+------------------------------------+----------------+-------+----------+
~~~

Working:
~~~

[root@overcloud-controller-2 (overcloudrc) ~]# neutron l3-agent-list-hosting-router rh2-ext-r2
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 2bc16072-df38-42fc-b292-202ef79ad826 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| 861c0afa-17b2-494b-94af-77d24a10bcf6 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| a7c4816a-2512-4407-8563-e13f69cd4410 | overcloud-controller-0.localdomain | True           | :-)   | active   | <<===============
+--------------------------------------+------------------------------------+----------------+-------+----------+
~~~




Version-Release number of selected component (if applicable):
# rpm -qa|grep neutron
python-neutronclient-6.0.1-1.el7ost.noarch
python-neutron-lbaas-9.2.2-1.el7ost.noarch
python-neutron-tests-9.4.1-12.el7ost.noarch
openstack-neutron-bigswitch-lldp-9.42.7-2.el7ost.noarch
python-neutron-9.4.1-12.el7ost.noarch
openstack-neutron-openvswitch-9.4.1-12.el7ost.noarch
openstack-neutron-metering-agent-9.4.1-12.el7ost.noarch
puppet-neutron-9.5.0-4.el7ost.noarch
python-neutron-lib-0.4.0-1.el7ost.noarch
openstack-neutron-bigswitch-agent-9.42.7-2.el7ost.noarch
openstack-neutron-common-9.4.1-12.el7ost.noarch
openstack-neutron-sriov-nic-agent-9.4.1-12.el7ost.noarch
openstack-neutron-lbaas-9.2.2-1.el7ost.noarch
openstack-neutron-9.4.1-12.el7ost.noarch
openstack-neutron-ml2-9.4.1-12.el7ost.noarch

How reproducible:
Happens randomly.

Steps to Reproduce:
1. As described above.

Actual results:
Not working:
~~~
#neutron l3-agent-list-hosting-router rh-ext-r1
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 2bc16072-df38-42fc-b292-202ef79ad826 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| 861c0afa-17b2-494b-94af-77d24a10bcf6 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| a7c4816a-2512-4407-8563-e13f69cd4410 | overcloud-controller-0.localdomain | True           | :-)   | standby  |  <<===============
+--------------------------------------+------------------------------------+----------------+-------+----------+
~~~
The router's external IP is not pingable.

Expected results:

Working:
~~~

[root@overcloud-controller-2 (overcloudrc) ~]# neutron l3-agent-list-hosting-router rh2-ext-r2
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 2bc16072-df38-42fc-b292-202ef79ad826 | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| 861c0afa-17b2-494b-94af-77d24a10bcf6 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| a7c4816a-2512-4407-8563-e13f69cd4410 | overcloud-controller-0.localdomain | True           | :-)   | active   | <<===============
+--------------------------------------+------------------------------------+----------------+-------+----------+
~~~

Comment 3 Brian Haley 2018-06-14 19:25:53 UTC
Looking at the l3-agent logs on all three controllers, I noticed they were all logging that keepalived was constantly dying, for example:

keepalived for router with uuid d350fa7c-90de-400d-b375-320b7101e5a3 not found. The process should not have died
WARNING neutron.agent.linux.external_process [-] Respawning keepalived for uuid d350fa7c-90de-400d-b375-320b7101e5a3

Each controller had about 10MB of these messages.

And in /var/log/messages, things like:

Keepalived_vrrp exited with permanent error CONFIG. Terminating

/var/lib/neutron/ha_confs/ looks readable and has config files that look valid for the routers.
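If useful, commands of roughly this shape surface the respawn loop on a controller (a sketch; the log locations are the usual OSP defaults and may differ per deployment):
~~~
# Count keepalived respawn messages in the l3-agent log:
grep -c 'Respawning keepalived' /var/log/neutron/l3-agent.log
# Look for the fatal VRRP config error in syslog:
grep 'Keepalived_vrrp exited' /var/log/messages
# Inspect the generated keepalived config for an affected router:
cat /var/lib/neutron/ha_confs/<router-id>/keepalived.conf
~~~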

Can the customer confirm this is still happening?

Comment 10 Slawek Kaplonski 2019-04-04 10:52:27 UTC
I was trying to reproduce this issue today and indeed I hit it once. But the root cause was slightly different from the one described here.
It was indeed a race condition, but the race was in assigning virtual_router_id values to different routers.
I created 2 routers one after another and both got virtual_router_id=1, so, as both were using the same HA network, only one keepalived instance set its router to active and the second router was standby on all 3 nodes.

But that doesn't explain why every router created later was also standby on all nodes; I will try to investigate it more.
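For anyone checking their own deployment, the collision is visible directly in the generated keepalived configs; a hedged sketch (router IDs are placeholders):
~~~
# On one controller, compare the VRIDs of the two routers:
for r in <router-1-id> <router-2-id>; do
    grep virtual_router_id "/var/lib/neutron/ha_confs/$r/keepalived.conf"
done
# If both print "virtual_router_id 1", the two VRRP instances are
# competing for the same VRID on the shared HA network.
~~~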

Comment 11 Slawek Kaplonski 2019-04-04 11:05:01 UTC
When I spotted this issue, I had 2 routers with vr_id=1 allocated:

2019-04-04 09:45:45.602 1028016 DEBUG neutron.db.l3_hamode_db [req-006ff5d4-8035-4bf6-b1d1-16b533d1784b 31b4ad3048174b3fa759f81767c89b7f 8447b19cd21c4187954cea49ce8de9b1 - - -] Router 1658ac48-deb3-4f27-99e1-850435357919 has been allocated a ha_vr_id 1. _ensure_vr_id /usr/lib/python2.7/site-packages/neutron/db/l3_hamode_db.py:245

And on another controller:
2019-04-04 09:45:44.769 1034924 DEBUG neutron.db.l3_hamode_db [req-a285b510-bce2-424f-a186-146b4ab26d98 31b4ad3048174b3fa759f81767c89b7f 8447b19cd21c4187954cea49ce8de9b1 - - -] Router d1d9ad1a-8561-49b5-b623-83106c3cd07f has been allocated a ha_vr_id 1. _ensure_vr_id /usr/lib/python2.7/site-packages/neutron/db/l3_hamode_db.py:245


But looking at https://github.com/openstack/neutron/blob/newton-eol/neutron/db/l3_hamode_db.py#L144, this can't happen if everything is on the same network, as the vr_id is part of the primary key in the DB, so it is definitely safe at the DB layer.
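That constraint can be confirmed straight from the schema; a sketch, assuming DB access from a controller:
~~~
# The table definition shows the composite primary key:
mysql ovs_neutron -e "SHOW CREATE TABLE ha_router_vrid_allocations\G"
# PRIMARY KEY (network_id, vr_id): duplicates on one network are rejected,
# but nothing prevents two routers from holding the same vr_id obtained
# via different (possibly since-deleted) networks.
~~~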

I will continue investigating it.

Comment 12 Slawek Kaplonski 2019-04-05 09:48:58 UTC
I hit the same issue again (after more than 700 attempts, so it is not a very frequent issue, at least in my case).

What I got is:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-2
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | active   |
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+
[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-1
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+


And in the DB it looks like:

MariaDB [ovs_neutron]> select * from router_extra_attributes;
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
| router_id                            | distributed | service_router | ha | ha_vr_id | availability_zone_hints |
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
| 6ba430d7-2f9d-4e8e-a59f-4d4fb5644a8e |           0 |              0 |  1 |        1 | []                      |
| ace64e85-5f3b-4815-aeae-3b54c75ef5eb |           0 |              0 |  1 |        1 | []                      |
| cd6b61e1-60c9-47da-8866-169ca29ece20 |           1 |              0 |  0 |        0 | []                      |
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
3 rows in set (0.01 sec)

MariaDB [ovs_neutron]> select * from ha_router_vrid_allocations;
+--------------------------------------+-------+
| network_id                           | vr_id |
+--------------------------------------+-------+
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     1 |
+--------------------------------------+-------+
1 row in set (0.01 sec)

So there is indeed a possible race when 2 different routers are created within a very short time.
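A query of this shape detects the duplicate allocation directly (a sketch; on a cloud with several tenants, routers on different HA networks may legitimately share a vr_id, so this only makes sense per tenant or on a single-tenant test system like this one):
~~~
mysql ovs_neutron -e "SELECT ha_vr_id, COUNT(*) AS routers
    FROM router_extra_attributes
    WHERE ha = 1
    GROUP BY ha_vr_id
    HAVING COUNT(*) > 1;"
~~~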

But when I then created another router, it was created properly with a new vr_id and everything worked fine for it:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-3
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | active   |
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

MariaDB [ovs_neutron]> select * from ha_router_vrid_allocations;
+--------------------------------------+-------+
| network_id                           | vr_id |
+--------------------------------------+-------+
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     1 |
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     2 |
+--------------------------------------+-------+

Comment 13 Slawek Kaplonski 2019-04-05 10:21:51 UTC
It looks like the code responsible for allocating this vr_id is here: https://github.com/openstack/neutron/blob/newton-eol/neutron/db/l3_hamode_db.py#L210

At first glance, changing it to allocate a "random" value from the available vr_id range could make this problem much less visible to the user.
And the proper fix should IMO be done in the DB, by adding a unique constraint on the pair (router_id, ha_vr_id) in the "router_extra_attributes" table, but that would certainly not be backportable.
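To illustrate the first idea, a minimal shell sketch of random selection (the real change belongs in Neutron's Python DB layer; the allocated list here is hypothetical, and VRRP limits the VRID to an 8-bit value):
~~~
allocated="1 2"                    # vr_ids already taken on the HA network
while :; do
    vr_id=$(shuf -i 1-255 -n 1)    # pick a random candidate VRID
    grep -qw "$vr_id" <<< "$allocated" || break
done
echo "allocated vr_id=$vr_id"
# Two concurrent allocators picking the lowest free value almost always
# collide; picking randomly from ~255 values makes a collision unlikely.
~~~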

Comment 15 Slawek Kaplonski 2019-04-17 08:44:29 UTC
I think I know what is going on there.

It is a race condition between creating the HA network and assigning a new vr_id to the router.

Let's assume we are creating 2 different routers (the first 2 HA routers for a tenant), and each request goes to a different controller:
1. Controller-1, as part of creating router-1, creates the HA network; let's call it HA-Net-A.
2. For some reason (I'm not sure exactly what), controller-1 starts to remove HA-Net-A, but
3. at the same time HA-Net-A is found on controller-2, and router-2 tries to use it;
4. controller-2 allocates vr_id=1 for router-2 on HA-Net-A;
5. HA-Net-A is finally removed on controller-1, so controller-2 also gets an error and retries configuring router-2;
6. controller-2 creates a new network, HA-Net-B, but it has already allocated vr_id=1 for router-2 (see step 4); the allocation is stored in a different DB table and has nothing to do with the removed network;
7. controller-1 tries to allocate a vr_id for router-1. As this is now on HA-Net-B, vr_id=1 is free on this network, so it is allocated.

And finally both routers have vr_id=1 allocated, and only one of them is active on one L3 agent.
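As a hedged cross-check (the network ID is the one from the output in comment 12): in a healthy state the tenant's HA network holds one vr_id allocation per HA router, while in the broken state above there are 2 HA routers but only 1 allocation:
~~~
mysql ovs_neutron -e "SELECT COUNT(*) FROM ha_router_vrid_allocations
    WHERE network_id = '45aaae94-ce16-412d-bd74-b3812b16ff6f';"
mysql ovs_neutron -e "SELECT COUNT(*) FROM router_extra_attributes WHERE ha = 1;"
~~~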

So my patch https://review.openstack.org/#/c/651495/ should at least mitigate this issue, and I will backport this change to OSP-10.
Unfortunately, I think that this will require some changes in the DB layer and thus may not be possible to backport to older versions.

Comment 25 errata-xmlrpc 2019-07-10 09:18:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1721

