Bug 1695073

Summary: networking-ovn does not clean up old stale agents entries in "openstack network agent list"
Product: Red Hat OpenStack
Reporter: Lucas Alvares Gomes <lmartins>
Component: python-networking-ovn
Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED DUPLICATE
QA Contact: Eran Kuris <ekuris>
Severity: high
Priority: high
Version: 14.0 (Rocky)
CC: apevec, jlibosva, lhh, majopela, scohen, twilson
Target Milestone: zstream
Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-01-15 22:55:12 UTC
Type: Bug

Description Lucas Alvares Gomes 2019-04-02 12:43:02 UTC
Description of problem:

In a conversation on IRC, slage showed a case where redeploying the overcloud does not clean up old entries in the output of the "openstack network agent list" command.

I believe this is because we do not have any periodic task in core OVN or networking-ovn to clean up old/dead entries from the OVN SBDB Chassis table.

Running "openstack network agent delete" does not solve the problem either, because at the moment that method returns 400 (Bad Request) if the Chassis entry exists [0]; it does not check whether the agent is alive or not.

We need to think about a mechanism that removes those old entries, or we should at least allow deleting agents that are already considered dead.

[0] https://github.com/openstack/networking-ovn/blob/41f34f819381b524a7881ed865bccb3317dbf43c/networking_ovn/ml2/mech_driver.py#L1045-L1048
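
To illustrate the second option above (allow deleting agents that are already considered dead), here is a minimal sketch. It is not the networking-ovn code: the Agent class, the AgentStillAlive exception and the delete_agent() helper are hypothetical names, and the agent store is just an in-memory dict. The only point is that the delete path checks liveness instead of rejecting every request for an existing Chassis entry.

```python
from dataclasses import dataclass


class AgentStillAlive(Exception):
    """Hypothetical error: a delete was attempted on a live agent."""


@dataclass
class Agent:
    # Hypothetical, simplified view of one row of the agent list.
    agent_id: str
    host: str
    alive: bool


def delete_agent(agents: dict, agent_id: str) -> Agent:
    """Remove an agent only if it is already considered dead.

    Raises KeyError if the agent is unknown and AgentStillAlive if the
    agent is alive. In a real API the latter would presumably map to an
    HTTP error such as 400 (Bad Request).
    """
    agent = agents[agent_id]
    if agent.alive:
        raise AgentStillAlive(f"agent {agent_id} on {agent.host} is alive")
    return agents.pop(agent_id)


if __name__ == "__main__":
    store = {
        "live": Agent("live", "compute-0.rdocloud", alive=True),
        "dead": Agent("dead", "compute-0.rdocloud", alive=False),
    }
    delete_agent(store, "dead")      # allowed: the agent is dead
    try:
        delete_agent(store, "live")  # refused: the agent is alive
    except AgentStillAlive as exc:
        print("refused:", exc)
    print(sorted(store))
```

How "alive" is determined is left out on purpose; that is exactly the part that still needs to be decided.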

-------

Here are the logs/outputs he provided:

Before, after a few overcloud redeploy cycles:

(control-plane) [centos@scale ~]$ openstack network agent list 
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type           | Host                 | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+
| 02cdbc19-0815-423e-a114-b508074f5ac3 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 49b175e2-7ee0-489a-93fb-cac50b0a9199 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 2ab10cd6-3133-4b81-931a-0cf95ec3d002 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| ab423212-f884-49e1-b9d5-48d5874770e0 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 0dfd7313-fca9-4724-81d0-afff948aaa3f | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 9f40ec85-eb9a-4021-8ef5-0bdfca2614c1 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| afb57c87-1602-4eef-8523-cb3465c131a4 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 3d4332f8-e89e-4e2d-a4f7-efc6d75a4df7 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| bb764ae2-e352-4368-bf96-0c7a73d9232e | OVN Metadata agent   | compute-1.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| b339edc8-4ad5-456f-b926-959a7480c715 | OVN Controller agent | compute-1.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| b5af470a-7fea-43bc-9b2d-43ce05329a51 | OVN Metadata agent   | compute-1.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 002f4a31-a701-4f99-b517-c6de9272bc4e | OVN Controller agent | compute-1.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| f74d888b-5960-48a8-82c5-59d259f57345 | OVN Controller agent | openstack-0.rdocloud | n/a               | XXX   | UP    | ovn-controller                |
| c37e9557-8e55-41d7-935f-7c0f8098f023 | OVN Metadata agent   | compute-3.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| cdeb9198-58f6-4e01-ba6b-e3af1398098d | OVN Controller agent | compute-3.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 9c3fd18c-041f-48e6-b8b0-ba299231a3a9 | OVN Controller agent | compute-0.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| c35ca64b-b43c-49a7-b913-0defc66fc486 | OVN Metadata agent   | compute-0.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 543794ad-34a0-4d04-8b18-9f3c8c84ddf2 | OVN Metadata agent   | compute-0.rdocloud   | n/a               | :-)   | UP    | networking-ovn-metadata-agent |
| 1378075f-15a9-4b0d-bc0a-a8000af757c4 | OVN Controller agent | compute-0.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 87149e6c-c9d1-45e3-bac5-83f01f948a9a | OVN Metadata agent   | compute-1.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 54f4964c-c1e3-4b7d-946f-57f0bb4de4d7 | OVN Controller agent | compute-1.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| f22756c4-2930-46f2-9a40-5e0638064059 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| b93a0df5-ccec-471f-8130-cb7256524116 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 269cfdc8-a204-4564-91e6-d708eaa7d650 | OVN Controller agent | compute-3.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 780be525-9cd0-4c4b-a474-ffdda7d72673 | OVN Metadata agent   | compute-3.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| eeb25250-459f-46a4-908c-31a6f63d1d22 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| d8ec54f0-d320-4087-bbf8-e877de79f0d2 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 78b19c23-db23-416e-8074-1ec9c6db0b8d | OVN Controller agent | compute-1.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 53ff0594-22cb-484f-b8bd-58f4d24ea82c | OVN Metadata agent   | compute-1.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+

Then on my controller node (openstack-0) I did:

$ pcs resource disable ovn-dbs-bundle
$ docker stop ovn_controller

And on compute-0 I did:

$ docker stop ovn_controller
$ docker stop ovn_metadata_agent

Then back on openstack-0 I moved all contents out of /var/lib/openvswitch/ovn

Then:

$ pcs resource enable ovn-dbs-bundle
$ docker start ovn_controller

and on compute-0:

$ docker start ovn_controller
$ docker start ovn_metadata_agent

Now the agent list looks ok:

(control-plane) [centos@scale ~]$ openstack network agent list 
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type           | Host                 | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+
| 05d88972-8182-4b7c-b2de-641169e5a69d | OVN Controller agent | openstack-0.rdocloud | n/a               | :-)   | UP    | ovn-controller                |
| d0c88257-953d-4c00-b6ae-183f97f21c0f | OVN Metadata agent   | compute-0.rdocloud   | n/a               | :-)   | UP    | networking-ovn-metadata-agent |
| fc39442b-be5c-4dfb-a4e7-dc6e34910d67 | OVN Controller agent | compute-0.rdocloud   | n/a               | :-)   | UP    | ovn-controller                |
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+

Version-Release number of selected component (if applicable):
Upstream master, but it should also be present in OSP 14.

Comment 1 Lucas Alvares Gomes 2019-04-02 21:54:58 UTC
*** Bug 1695071 has been marked as a duplicate of this bug. ***

Comment 4 Terry Wilson 2020-01-07 00:56:31 UTC
I think it is possible that part of the fix in https://review.opendev.org/#/c/696936/1/networking_ovn/ml2/mech_driver.py (line 1027) could be backported pre-Train to at least partially handle this issue. I think the issue is that the agents are cached by UUID and not by name. ovn-controller sets a unique chassis *name* that matches the system-id; it does not set the UUID of the chassis row to the system-id. So if ovn-controller is restarted, it creates a new row with a new UUID (the old row does go away, because Chassis.name is an indexed column). It's just that networking-ovn maintains an in-memory cache of the chassis by UUID and doesn't realize that the new row represents the old one.
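
A rough sketch of the difference between the two cache keys, under the assumptions stated in this comment. It is not taken from the review linked above: ChassisRow and AgentCacheByName are hypothetical names and the cache is a plain dict. Keying on the chassis name means a recreated row replaces the old entry instead of being added next to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChassisRow:
    # Hypothetical stand-in for a Chassis row: the UUID changes when the
    # row is recreated, the name (system-id) stays the same.
    uuid: str
    name: str


class AgentCacheByName:
    """In-memory cache keyed by chassis name rather than by row UUID."""

    def __init__(self):
        self._by_name = {}

    def update(self, row: ChassisRow) -> None:
        # A new row with a known name overwrites the stale entry.
        self._by_name[row.name] = row

    def delete(self, row: ChassisRow) -> None:
        # Only drop the entry if it still refers to the deleted row, so a
        # late delete of the old row cannot evict its replacement.
        current = self._by_name.get(row.name)
        if current is not None and current.uuid == row.uuid:
            del self._by_name[row.name]

    def agents(self) -> list:
        return list(self._by_name.values())


if __name__ == "__main__":
    cache = AgentCacheByName()
    cache.update(ChassisRow("uuid-before-restart", "compute-0.rdocloud"))
    cache.update(ChassisRow("uuid-after-restart", "compute-0.rdocloud"))
    # One entry for the host, pointing at the newest row.
    print(cache.agents())
```

A cache keyed by UUID would hold both rows after the restart, which matches the duplicated entries per host in the agent list above.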

Comment 5 Jakub Libosvar 2020-01-13 15:05:54 UTC
OSP 14 is EOL. I'm moving this to 16.1 because it is not yet clear what we need to fix or how.

Comment 8 Jakub Libosvar 2020-08-31 07:49:59 UTC

*** This bug has been marked as a duplicate of bug 1828889 ***

Comment 9 Terry Wilson 2021-01-15 22:55:12 UTC

*** This bug has been marked as a duplicate of bug 1828889 ***