Bug 1311864

Summary: Neutron L3 Agent shows duplicate ports
Product: Red Hat OpenStack Reporter: Pablo Iranzo Gómez <pablo.iranzo>
Component: openstack-neutron    Assignee: John Schwarz <jschwarz>
Status: CLOSED ERRATA QA Contact: Alexander Stafeyev <astafeye>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.0 (Kilo)    CC: amuller, astafeye, chrisw, jschluet, jschwarz, majopela, nyechiel, oblaut, pablo.iranzo, srevivo
Target Milestone: async    Keywords: ZStream
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: openstack-neutron-2015.1.4-1.el7ost Doc Type: Bug Fix
Last Closed: 2016-07-20 23:53:55 UTC Type: Bug
Bug Blocks: 1273812    

Description Pablo Iranzo Gómez 2016-02-25 08:57:29 UTC
Description of problem:

When executing `neutron l3-agent-list-hosting-router <router>`:

+--------------------------------------+-------------+----------------+-------+----------+
| id                                   | host        | admin_state_up | alive | ha_state |
+--------------------------------------+-------------+----------------+-------+----------+
| 04867f8c-5632-412a-8ce7-79bfccc2f620 | neutron-n-2 | True           | :-)   | active   |  ***
| 8d24b6e8-fd63-4f84-93cd-9361ee5b9e4a | neutron-n-1 | True           | :-)   | standby  |
| 04867f8c-5632-412a-8ce7-79bfccc2f620 | neutron-n-2 | True           | :-)   | active   |  ***
| fb833e11-ce79-4287-bd6b-6ef5aeed5814 | neutron-n-0 | True           | :-)   | standby  |
+--------------------------------------+-------------+----------------+-------+----------+

For a 3-controller deployment we see 3 different hosts but 4 rows: the active entry, with the same agent ID and the same host (marked *** above), is listed twice as alive/active.
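
A quick way to spot the duplicated row (a sketch; the router name is a placeholder) is to extract the id column from the listing and count repeated agent IDs:

    # placeholder router name; prints any agent ID that appears more than once
    neutron l3-agent-list-hosting-router <router> | \
        awk -F'|' 'NR > 3 && NF > 2 { gsub(/ /, "", $2); print $2 }' | sort | uniq -d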


Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.2-9.el7ost.noarch
openstack-neutron-common-2015.1.2-9.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-9.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-9.el7ost.noarch
openstack-neutron-vpnaas-2015.1.2-1.el7ost.noarch
python-neutron-2015.1.2-9.el7ost.noarch
python-neutron-lbaas-2015.1.2-1.el7ost.noarch
python-neutron-vpnaas-2015.1.2-1.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch



This was reported to also be happening on OSP6 before the deployment was upgraded to OSP7.

Comment 3 Miguel Angel Ajo 2016-02-25 09:10:35 UTC
This could be related to the in-flight patches @assaf and @jschwarz are working on upstream (U/S) to fix a few race conditions in the l3_ha code.

Comment 4 John Schwarz 2016-02-25 09:51:05 UTC
This is a different issue from the one @assaf and I are working on U/S - we're dealing with too few L3 HA ports, not too many.

I've looked at the attached logs but did not find anything about the port's UUID in question (04867f8c-5632-412a-8ce7-79bfccc2f620), so not a lot to go on. Pablo, can you perhaps try giving a rough outline of what scenario was running on the servers, so that we might be able to reproduce this?

Comment 5 Pablo Iranzo Gómez 2016-02-25 09:57:57 UTC
Hi John,
I'm asking my customer about this; the background at the moment is that they had this in OSP6 and it is still there after the upgrade.

Not sure from the comments whether this was cleaned up before the upgrade, or whether it's an issue that appeared in OSP6 and carried over to the OSP7 setup.

Their initial request was about how to properly clean this up and what the availability implications of the cleanup procedure are.

Thanks,
Pablo

Comment 6 Miguel Angel Ajo 2016-02-25 12:20:13 UTC
One possible option would be to delete the specific agents via the neutron client and wait for the heartbeat to come back, so they are re-registered. But that would probably also disassociate the routers from the agent.

@pablo, could we check this procedure/workaround on an OSP7 setup (sketched as commands after the list below):

1) Create a few routers, in HA
2) List l3-agent-list-hosting-router for one of the routers
3) Delete the agent holding the ACTIVE instance of the router
4) Wait for the heartbeat to come back so the agent appears in neutron agent-list again
5) List l3-agent-list-hosting-router for the same router as in (2)
6) If the agent is not there, we could do: 
      neutron l3-agent-router-add $agent-id $router
7) Repeat (5) and verify the list is OK (two standby agents, one active)

I believe such a procedure would be harmless even if (6) were needed; one of the standby routers would take the traffic until we do (7).

But it's better if we could verify this first.
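
As a rough sketch of the commands involved (IDs and names here are placeholders; note that a deleted agent may re-register under a new ID, so re-check agent-list before step 6):

    # (2) list the agents hosting one of the HA routers
    neutron l3-agent-list-hosting-router $router
    # (3) delete the agent that holds the ACTIVE instance
    neutron agent-delete $agent_id
    # (4) wait until the agent re-registers via its heartbeat
    neutron agent-list | grep $agent_host
    # (5) list the hosting agents again
    neutron l3-agent-list-hosting-router $router
    # (6) if the agent is not there, re-add the router to it
    #     (use the agent ID shown by the fresh agent-list)
    neutron l3-agent-router-add $agent_id $router
    # (7) verify: one active, two standby
    neutron l3-agent-list-hosting-router $router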

Comment 7 John Schwarz 2016-03-01 11:30:50 UTC
I've added to the tracker a patch that should be backported from upstream. Once the upstream patch has been merged, we can continue work on the backport.

Comment 16 Assaf Muller 2016-06-04 20:09:19 UTC
Please remember to flip this bug to MODIFIED with the appropriate 'Fixed in version' when you rebase OSP 7. Thank you.

Comment 17 Nir Magnezi 2016-06-30 13:04:45 UTC
The fix is incorporated in the rebase.
See bug 1350400

Comment 19 Alexander Stafeyev 2016-07-11 08:49:24 UTC
At this point there is no version to verify on. 

Thanks

Comment 21 Alexander Stafeyev 2016-07-11 12:44:49 UTC
openstack-neutron-2015.1.4-2.el7ost.noarch

neutron l3-agent-list-hosting-router Router_eNet
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| e0ad8091-ef57-4950-9e7d-7549cc529b1d | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| aa48e625-19a4-4a38-96d2-34e85fe7cf6c | overcloud-controller-2.localdomain | True           | :-)   | active   |
| caef1d9c-d65b-4ea3-bd05-6814efc5c934 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+


[root@overcloud-controller-2 ~]# neutron l3-agent-router-remove caef1d9c-d65b-4ea3-bd05-6814efc5c934 Router_eNet
Removed router Router_eNet from L3 agent
[root@overcloud-controller-2 ~]# neutron l3-agent-list-hosting-router Router_eNet
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| e0ad8091-ef57-4950-9e7d-7549cc529b1d | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| aa48e625-19a4-4a38-96d2-34e85fe7cf6c | overcloud-controller-2.localdomain | True           | :-)   | active   |
+--------------------------------------+------------------------------------+----------------+-------+----------+

Comment 23 errata-xmlrpc 2016-07-20 23:53:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1474