Description of problem:

In a conversation with slage on IRC, he showed a case where running an overcloud redeploy does not clean up old entries in the output of the "openstack network agent list" command. I believe this is because there is no periodic task in core OVN or networking-ovn that cleans up old/dead entries from the OVN SBDB Chassis table.

Running "openstack network agent delete" does not solve the problem either, because at the moment that method returns 400 (Bad Request) if the Chassis entry exists [0] (it does not check whether the agent is alive or not).

We need to think about a mechanism that removes those old entries, or we should at least allow deleting agents that are already considered dead.

[0] https://github.com/openstack/networking-ovn/blob/41f34f819381b524a7881ed865bccb3317dbf43c/networking_ovn/ml2/mech_driver.py#L1045-L1048

-------

Here are the logs/outputs he provided:

Before, after a few overcloud redeploy cycles:

(control-plane) [centos@scale ~]$ openstack network agent list
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type           | Host                 | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+
| 02cdbc19-0815-423e-a114-b508074f5ac3 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 49b175e2-7ee0-489a-93fb-cac50b0a9199 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 2ab10cd6-3133-4b81-931a-0cf95ec3d002 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| ab423212-f884-49e1-b9d5-48d5874770e0 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 0dfd7313-fca9-4724-81d0-afff948aaa3f | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 9f40ec85-eb9a-4021-8ef5-0bdfca2614c1 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| afb57c87-1602-4eef-8523-cb3465c131a4 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 3d4332f8-e89e-4e2d-a4f7-efc6d75a4df7 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| bb764ae2-e352-4368-bf96-0c7a73d9232e | OVN Metadata agent   | compute-1.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| b339edc8-4ad5-456f-b926-959a7480c715 | OVN Controller agent | compute-1.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| b5af470a-7fea-43bc-9b2d-43ce05329a51 | OVN Metadata agent   | compute-1.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 002f4a31-a701-4f99-b517-c6de9272bc4e | OVN Controller agent | compute-1.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| f74d888b-5960-48a8-82c5-59d259f57345 | OVN Controller agent | openstack-0.rdocloud | n/a               | XXX   | UP    | ovn-controller                |
| c37e9557-8e55-41d7-935f-7c0f8098f023 | OVN Metadata agent   | compute-3.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| cdeb9198-58f6-4e01-ba6b-e3af1398098d | OVN Controller agent | compute-3.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 9c3fd18c-041f-48e6-b8b0-ba299231a3a9 | OVN Controller agent | compute-0.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| c35ca64b-b43c-49a7-b913-0defc66fc486 | OVN Metadata agent   | compute-0.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 543794ad-34a0-4d04-8b18-9f3c8c84ddf2 | OVN Metadata agent   | compute-0.rdocloud   | n/a               | :-)   | UP    | networking-ovn-metadata-agent |
| 1378075f-15a9-4b0d-bc0a-a8000af757c4 | OVN Controller agent | compute-0.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 87149e6c-c9d1-45e3-bac5-83f01f948a9a | OVN Metadata agent   | compute-1.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 54f4964c-c1e3-4b7d-946f-57f0bb4de4d7 | OVN Controller agent | compute-1.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| f22756c4-2930-46f2-9a40-5e0638064059 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| b93a0df5-ccec-471f-8130-cb7256524116 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 269cfdc8-a204-4564-91e6-d708eaa7d650 | OVN Controller agent | compute-3.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 780be525-9cd0-4c4b-a474-ffdda7d72673 | OVN Metadata agent   | compute-3.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| eeb25250-459f-46a4-908c-31a6f63d1d22 | OVN Metadata agent   | compute-2.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| d8ec54f0-d320-4087-bbf8-e877de79f0d2 | OVN Controller agent | compute-2.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 78b19c23-db23-416e-8074-1ec9c6db0b8d | OVN Controller agent | compute-1.rdocloud   | n/a               | XXX   | UP    | ovn-controller                |
| 53ff0594-22cb-484f-b8bd-58f4d24ea82c | OVN Metadata agent   | compute-1.rdocloud   | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+

Then on my controller node (openstack-0) I did:

$ pcs resource disable ovn-dbs-bundle
$ docker stop ovn_controller

And on compute-0 I did:

$ docker stop ovn_controller
$ docker stop ovn_metadata_agent

Then back on openstack-0 I moved out all contents from /var/lib/openvswitch/ovn

Then:

$ pcs resource enable ovn-dbs-bundle
$ docker start ovn_controller

and on compute-0:

$ docker start ovn_controller
$ docker start ovn_metadata_agent

Now the agent list looks ok:

(control-plane) [centos@scale ~]$ openstack network agent list
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type           | Host                 | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+
| 05d88972-8182-4b7c-b2de-641169e5a69d | OVN Controller agent | openstack-0.rdocloud | n/a               | :-)   | UP    | ovn-controller                |
| d0c88257-953d-4c00-b6ae-183f97f21c0f | OVN Metadata agent   | compute-0.rdocloud   | n/a               | :-)   | UP    | networking-ovn-metadata-agent |
| fc39442b-be5c-4dfb-a4e7-dc6e34910d67 | OVN Controller agent | compute-0.rdocloud   | n/a               | :-)   | UP    | ovn-controller                |
+--------------------------------------+----------------------+----------------------+-------------------+-------+-------+-------------------------------+

Version-Release number of selected component (if applicable):

Upstream master, but it should also be present in OSP 14.
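To make the "at least allow deleting agents that are considered dead already" option from the description concrete, here is a minimal Python sketch. It is not the networking-ovn code: the Agent record, the BadRequest exception and the deletable_agent_ids helper are hypothetical names used only for illustration, and the sketch assumes that both listed compute-0 metadata agents still have a Chassis row.

```python
from dataclasses import dataclass


class BadRequest(Exception):
    """Hypothetical stand-in for the 400 (Bad Request) response."""


@dataclass
class Agent:
    agent_id: str
    host: str
    alive: bool           # False corresponds to "XXX" in the Alive column
    chassis_exists: bool  # assumed: a Chassis row is still present


def check_delete_allowed(agent):
    # Behaviour described in the report: refuse whenever the Chassis row exists.
    # Relaxed rule sketched here: refuse only if the agent is also alive.
    if agent.chassis_exists and agent.alive:
        raise BadRequest("agent %s is still alive" % agent.agent_id)


def deletable_agent_ids(agents):
    """Return the IDs of agents the relaxed rule would allow deleting."""
    ids = []
    for agent in agents:
        try:
            check_delete_allowed(agent)
        except BadRequest:
            continue
        ids.append(agent.agent_id)
    return ids


# Two OVN Metadata agents for compute-0, taken from the "before" listing.
agents = [
    Agent("c35ca64b-b43c-49a7-b913-0defc66fc486", "compute-0.rdocloud", False, True),
    Agent("543794ad-34a0-4d04-8b18-9f3c8c84ddf2", "compute-0.rdocloud", True, True),
]
stale = deletable_agent_ids(agents)
print(stale)
```

With this rule only the entry marked XXX would be deletable; the live agent would still be rejected.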
*** Bug 1695071 has been marked as a duplicate of this bug. ***
I think part of the fix in https://review.opendev.org/#/c/696936/1/networking_ovn/ml2/mech_driver.py (line 1027) could be backported pre-Train to at least partially handle this issue. I think the issue is that the agents are cached by UUID and not by name. ovn-controller sets a unique chassis *name* that matches the system-id; it doesn't set the UUID of the chassis row to the system-id. So if ovn-controller is restarted, it creates a new row with a new UUID (the old row does go away, because Chassis.name is an indexed column). It's just that networking-ovn maintains an in-memory cache of the chassis by UUID and doesn't realize that the new row represents the old one.
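A small Python sketch of the caching effect described in this comment. It is a toy model, not networking-ovn code: ChassisRow, restart_ovn_controller and both cache dicts are hypothetical, and the only assumption taken from the comment is that a restart produces a row with the same name but a new UUID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ChassisRow:
    """Hypothetical stand-in for a Chassis row in the SB DB."""
    name: str  # matches the system-id; stable across restarts
    row_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def restart_ovn_controller(row):
    # Assumption from the comment: same chassis name, brand-new row UUID.
    return ChassisRow(name=row.name)


cache_by_uuid = {}  # keyed the way the comment says the cache is keyed
cache_by_name = {}  # keyed by the stable chassis name instead

row = ChassisRow(name="compute-0-system-id")
for _ in range(3):
    cache_by_uuid[row.row_uuid] = row
    cache_by_name[row.name] = row
    row = restart_ovn_controller(row)

# The UUID-keyed cache keeps one entry per restart; the name-keyed cache
# keeps a single entry that is overwritten each time.
print(len(cache_by_uuid), len(cache_by_name))
```

In this model the UUID-keyed cache grows with every restart, which matches the pile-up of dead agents in the listing, while the name-keyed cache stays at one entry per chassis.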
OSP 14 is EOL. I'm moving this to 16.1, since it is not yet clear what we need to fix or how.
*** This bug has been marked as a duplicate of bug 1828889 ***