Description of problem:

When executing `neutron l3-agent-list-hosting-router`:

+--------------------------------------+-------------+----------------+-------+----------+
| id                                   | host        | admin_state_up | alive | ha_state |
+--------------------------------------+-------------+----------------+-------+----------+
| 04867f8c-5632-412a-8ce7-79bfccc2f620 | neutron-n-2 | True           | :-)   | active   | ***
| 8d24b6e8-fd63-4f84-93cd-9361ee5b9e4a | neutron-n-1 | True           | :-)   | standby  |
| 04867f8c-5632-412a-8ce7-79bfccc2f620 | neutron-n-2 | True           | :-)   | active   | ***
| fb833e11-ce79-4287-bd6b-6ef5aeed5814 | neutron-n-0 | True           | :-)   | standby  |
+--------------------------------------+-------------+----------------+-------+----------+

For a 3-controller deployment we see 3 different hosts but 4 entries: the same agent (same ID and same host, marked *** above) is listed twice as alive and active.

Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.2-9.el7ost.noarch
openstack-neutron-common-2015.1.2-9.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-9.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-9.el7ost.noarch
openstack-neutron-vpnaas-2015.1.2-1.el7ost.noarch
python-neutron-2015.1.2-9.el7ost.noarch
python-neutron-lbaas-2015.1.2-1.el7ost.noarch
python-neutron-vpnaas-2015.1.2-1.el7ost.noarch
python-neutronclient-2.4.0-2.el7ost.noarch

This was also reported to be happening on OSP6 before it was upgraded to OSP7.
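The duplication above can be spotted mechanically. A minimal sketch (not part of the report; the tuples below are just the table rows re-typed by hand) that flags a router whose agent bindings repeat an agent ID:

```python
from collections import Counter

def duplicate_agents(rows):
    """Return agent IDs that appear more than once in the
    l3-agent-list-hosting-router output for a single router.

    `rows` is a list of (agent_id, host, ha_state) tuples as one might
    parse from the CLI table; a healthy HA router lists every agent
    exactly once.
    """
    counts = Counter(agent_id for agent_id, _host, _state in rows)
    return sorted(aid for aid, n in counts.items() if n > 1)

# The broken state from the report: neutron-n-2's agent is listed twice.
rows = [
    ("04867f8c-5632-412a-8ce7-79bfccc2f620", "neutron-n-2", "active"),
    ("8d24b6e8-fd63-4f84-93cd-9361ee5b9e4a", "neutron-n-1", "standby"),
    ("04867f8c-5632-412a-8ce7-79bfccc2f620", "neutron-n-2", "active"),
    ("fb833e11-ce79-4287-bd6b-6ef5aeed5814", "neutron-n-0", "standby"),
]
print(duplicate_agents(rows))  # → ['04867f8c-5632-412a-8ce7-79bfccc2f620']
```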
This could be related to the in-flight patches @assaf and @jschwarz are working on U/S to fix a few race conditions existing in the l3_ha part.
This is a different issue from the one @assaf and I are working on U/S - we're dealing with too few l3 ha ports, not too many. I've looked at the attached logs but did not find anything about the port's UUID in question (04867f8c-5632-412a-8ce7-79bfccc2f620), so there's not a lot to go on. Pablo, can you perhaps give a rough outline of what scenario was running on the servers, so that we might be able to reproduce this?
Hi John, I'm asking my customer about this. The background at the moment is that they had this in OSP6 and it is still there after the upgrade. It's not clear from the comments whether this was cleaned up before the upgrade or whether it appeared in OSP6 and carried over to the OSP7 setup. Their initial request was for how to properly clean this up and the availability implications of the cleaning procedure. Thanks, Pablo
One possible option could be to delete the specific agents via the neutron client and wait for the heartbeat to come back, so they are re-registered. But that would probably also disassociate the routers from the agent. @pablo, could we check this procedure/workaround on an OSP7:

1) Create a few routers in HA
2) Run l3-agent-list-hosting-router for one of the routers
3) Delete the agent holding the ACTIVE instance of the router
4) Wait for the heartbeat to come back so the agent appears in neutron agent-list again
5) Run l3-agent-list-hosting-router for the same router as in (2)
6) If the agent is not there, we could do: neutron l3-agent-router-add $agent-id $router
7) Repeat (5) and verify the list is ok (2 agents standby, one active)

I believe such a procedure would be harmless even if (6) happened: one of the standby routers would take the traffic until we do (7). But it's better if we could verify this first.
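The check in step (7) could also be done mechanically. A sketch under the assumption that the CLI table has been parsed into (agent_id, host, ha_state) tuples (the helper and sample rows are hypothetical, not part of the proposed procedure):

```python
def ha_state_ok(rows, expected_agents=3):
    """Check the healthy layout expected in step (7): the expected
    number of distinct agents, exactly one 'active' instance, and the
    rest 'standby'. `rows` is a list of (agent_id, host, ha_state)
    tuples parsed from l3-agent-list-hosting-router output.
    """
    agent_ids = [aid for aid, _host, _state in rows]
    states = [state for _aid, _host, state in rows]
    return (
        len(rows) == expected_agents
        and len(set(agent_ids)) == expected_agents      # no duplicate agents
        and states.count("active") == 1                 # exactly one active
        and states.count("standby") == expected_agents - 1
    )

healthy = [
    ("a1", "ctrl-0", "active"),
    ("a2", "ctrl-1", "standby"),
    ("a3", "ctrl-2", "standby"),
]
broken = healthy + [("a1", "ctrl-0", "active")]  # duplicate active row, as in this bug
print(ha_state_ok(healthy), ha_state_ok(broken))  # → True False
```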
I've added a patch that should be backported from upstream to the tracker. Once the upstream patch has been merged we can continue work on this patch.
Please remember to flip this bug to MODIFIED with the appropriate 'Fixed in version' when you rebase OSP 7. Thank you.
The fix is incorporated in the rebase. See bug 1350400
At this point there is no version to verify on. Thanks.
openstack-neutron-2015.1.4-2.el7ost.noarch

[root@overcloud-controller-2 ~]# neutron l3-agent-list-hosting-router Router_eNet
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| e0ad8091-ef57-4950-9e7d-7549cc529b1d | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| aa48e625-19a4-4a38-96d2-34e85fe7cf6c | overcloud-controller-2.localdomain | True           | :-)   | active   |
| caef1d9c-d65b-4ea3-bd05-6814efc5c934 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

[root@overcloud-controller-2 ~]# neutron l3-agent-router-remove caef1d9c-d65b-4ea3-bd05-6814efc5c934 Router_eNet
Removed router Router_eNet from L3 agent

[root@overcloud-controller-2 ~]# neutron l3-agent-list-hosting-router Router_eNet
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| e0ad8091-ef57-4950-9e7d-7549cc529b1d | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| aa48e625-19a4-4a38-96d2-34e85fe7cf6c | overcloud-controller-2.localdomain | True           | :-)   | active   |
+--------------------------------------+------------------------------------+----------------+-------+----------+
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1474