Bug 1788336

Summary: ovn-controllers are listed as agents but cannot be removed
Product: Red Hat OpenStack
Reporter: Alex Schultz <aschultz>
Component: python-networking-ovn
Assignee: Terry Wilson <twilson>
Status: CLOSED CURRENTRELEASE
QA Contact: Eduardo Olivares <eolivare>
Severity: high
Priority: high
Version: 16.1 (Train)
CC: apevec, bcafarel, chopark, dhill, ebarrera, eolivare, jlibosva, ldenny, lhh, majopela, pmannidi, scohen, twilson
Target Milestone: z8
Keywords: Reopened, TestOnly, Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: python-networking-ovn-7.3.1-1.20210809163307.4e24f4c.el8ost
Clone Of:
: 2002099
Last Closed: 2022-05-05 10:37:28 UTC
Type: Bug
Bug Blocks: 1768678, 1841011, 1948472, 2002099, 2018013, 2298873, 2299409    

Description Alex Schultz 2020-01-06 23:51:19 UTC
Description of problem:

In previous versions of OSP, we tell the end user to clean up neutron agents when removing nodes (e.g. controllers). With the switch to OVN, you cannot actually remove an ovn-controller agent because the delete request fails with an error.

+--------------------------------------+----------------------+---------------------------+-------------------+-------+----------------+-------------------------------+
| id                                   | agent_type           | host                      | availability_zone | alive | admin_state_up | binary                        |
+--------------------------------------+----------------------+---------------------------+-------------------+-------+----------------+-------------------------------+
| 0c447e90-4aa9-42c1-8d2e-d86a5c6bbcbb | OVN Controller agent | controller-0.redhat.local | n/a               | xxx   | True           | ovn-controller                |
| 331cc8b5-8b83-4ffa-8efe-e50aaf43b2c7 | OVN Controller agent | compute-1.redhat.local    | n/a               | :-)   | True           | ovn-controller                |
| 4bbf6c90-e5cf-42c6-9f69-949d64484640 | OVN Metadata agent   | compute-1.redhat.local    | n/a               | :-)   | True           | networking-ovn-metadata-agent |
| e914fb7f-4134-4453-90f3-9d89f79647e1 | OVN Controller agent | controller-1.redhat.local | n/a               | :-)   | True           | ovn-controller                |
| 9ac0774e-9df0-4f83-b7ab-8f63e8877724 | OVN Controller agent | controller-2.redhat.local | n/a               | :-)   | True           | ovn-controller                |
| cec804db-8c02-47d7-b7cc-304f8aafc7b7 | OVN Controller agent | compute-0.redhat.local    | n/a               | :-)   | True           | ovn-controller                |
| 01471f00-85d6-41b5-93cf-78b2105c08dd | OVN Metadata agent   | compute-0.redhat.local    | n/a               | :-)   | True           | networking-ovn-metadata-agent |
| 9d728dc6-137b-4cb7-8d36-81d0fd63a847 | OVN Controller agent | compute-2.redhat.local    | n/a               | :-)   | True           | ovn-controller                |
| 11866321-a80b-4db4-a8a8-e1dd939910ba | OVN Metadata agent   | compute-2.redhat.local    | n/a               | :-)   | True           | networking-ovn-metadata-agent |
| bb8c82e6-2982-45da-a7a1-33b60a696dba | OVN Controller agent | controller-3.redhat.local | n/a               | :-)   | True           | ovn-controller                |
+--------------------------------------+----------------------+---------------------------+-------------------+-------+----------------+-------------------------------+

source /home/stack/overcloudrc
neutron agent-delete 0c447e90-4aa9-42c1-8d2e-d86a5c6bbcbb
Bad agent request: OVN agents cannot be deleted.
Neutron server returns request_ids: ['req-88d8234f-4dc9-4f09-8375-91cdba356e3c']
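
As a rough illustration of the cleanup step, a listing like the one above can be filtered for entries whose alive column shows something other than ":-)". This is only a sketch of how one might script that check, not a supported tool: parse_agent_table, dead_agents and SAMPLE are hypothetical names, and SAMPLE is a shortened copy of two rows from the listing above.

```python
def _split_row(line):
    """Split one '| a | b |' table line into a list of stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_agent_table(text):
    """Turn an ASCII agent table into a list of dicts keyed by column name."""
    # Border lines start with '+', so only '|' lines carry data.
    rows = [line for line in text.splitlines() if line.startswith("|")]
    if not rows:
        return []
    header = _split_row(rows[0])
    return [dict(zip(header, _split_row(line))) for line in rows[1:]]


def dead_agents(agents):
    """Return the agents whose 'alive' column is not the ':-)' marker."""
    return [agent for agent in agents if agent.get("alive") != ":-)"]


# Shortened sample: two rows and four columns taken from the listing above.
SAMPLE = """\
+--------------------------------------+----------------------+---------------------------+-------+
| id                                   | agent_type           | host                      | alive |
+--------------------------------------+----------------------+---------------------------+-------+
| 0c447e90-4aa9-42c1-8d2e-d86a5c6bbcbb | OVN Controller agent | controller-0.redhat.local | xxx   |
| 331cc8b5-8b83-4ffa-8efe-e50aaf43b2c7 | OVN Controller agent | compute-1.redhat.local    | :-)   |
+--------------------------------------+----------------------+---------------------------+-------+
"""

if __name__ == "__main__":
    for agent in dead_agents(parse_agent_table(SAMPLE)):
        print(agent["id"], agent["host"])
```

The IDs this prints are the ones a user would then try to pass to the agent delete command, which is the step that currently fails.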


Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
1. Deploy a cloud with OVN
2. Have a controller fail
3. Replace the controller with a new one
4. Attempt to clean up the old neutron 'agent'

Actual results:
You cannot remove an ovn-controller 'agent'

Expected results:
We should be able to remove resources for non-existent systems.


Additional info:
Related Bug 1695073

Comment 1 Terry Wilson 2020-01-07 00:37:58 UTC
This was by design, as deleting an agent was not well-defined for networking-ovn. We don't exactly have agents, so the implementation of the agent API was mapped as best we could. We specifically return NotImplemented when deleting an agent.

The "controller" agent is essentially an OVN_Southbound DB Chassis entry. When shut down cleanly, ovn-controller will remove this entry and it will disappear automatically from the agent list. If there is a network interruption or hardware failure, the agent will show up as dead after the configurable "alive timeout" has passed. If the server is never coming back, we could just delete the Chassis entry ourselves when the "agent delete" API request is received. We should probably only allow the delete request when the agent shows up as down; if it is up and we delete the entry, I think ovn-controller will just re-add it. I can't imagine that we'd actually kill agent processes, keep them from restarting, etc. on an agent delete call. It would solely be for cleaning up the agent list display when a server was truly already gone.

I'd have to check, but I think we could do something similar with the external_ids that store the metadata agent info.
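
To make the proposed rule concrete, here is a minimal sketch of "allow delete only when the agent shows up as down". It is not the networking-ovn implementation: ChassisAgent, AgentStillAlive, delete_agent and the plain dict used as a registry are all hypothetical stand-ins, and liveness is modelled simply as "last seen within alive_timeout seconds".

```python
import time
from dataclasses import dataclass


class AgentStillAlive(Exception):
    """Raised when a delete is requested for an agent that still looks alive."""


@dataclass
class ChassisAgent:
    # Hypothetical stand-in for a Chassis entry that is shown as an agent.
    agent_id: str
    host: str
    last_seen: float  # timestamp (seconds) of the last observed update

    def is_alive(self, alive_timeout, now=None):
        """An agent counts as alive if it was seen within alive_timeout seconds."""
        now = time.time() if now is None else now
        return (now - self.last_seen) <= alive_timeout


def delete_agent(registry, agent_id, alive_timeout, now=None):
    """Remove the entry from the registry, but only if the agent is down."""
    agent = registry[agent_id]
    if agent.is_alive(alive_timeout, now):
        # A live ovn-controller would presumably just re-add its entry,
        # so the request is refused instead of silently doing nothing.
        raise AgentStillAlive(f"agent {agent_id} on {agent.host} is still alive")
    del registry[agent_id]
    return agent
```

Nothing here stops or restarts any process; the sketch only removes the stale record, which matches the intent of cleaning up the agent list for a server that is already gone.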

Comment 2 Terry Wilson 2021-01-15 22:52:31 UTC
The upstream patches need to be backported to 16.1. In addition to adding support for deleting agents, they also improve performance by minimizing the number of DB writes (which are replicated to each connection on each server).

Comment 3 Rodolfo Alonso 2021-04-14 14:57:37 UTC
*** Bug 1946835 has been marked as a duplicate of this bug. ***

Comment 5 PURANDHAR SAIRAM MANNIDI 2021-08-03 13:48:19 UTC
*** Bug 1982130 has been marked as a duplicate of this bug. ***

Comment 7 Jakub Libosvar 2021-12-22 19:29:03 UTC
*** Bug 1887866 has been marked as a duplicate of this bug. ***

Comment 8 Jakub Libosvar 2021-12-22 19:55:22 UTC
I don't know why this BZ is still in MODIFIED. It was released in 16.1.7.

Comment 9 ldenny 2022-03-31 06:24:45 UTC
Hi Jakub, 

Looks like this was actually released in 16.1.8, judging by the package version in the container catalogue:

https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-neutron-server-ovn/5de6be20dd19c71643b78104?tag=16.1.7-12.1646286259&push_date=1646661535000

https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-neutron-server-ovn/5de6be20dd19c71643b78104?tag=16.1.8-7&push_date=1648122338000

Looks like there might have been some confusion with the Fixed In Version field:

"Fixed In Version: python-networking-ovn-7.3.1-1.20210714143306.el8ost → python-networking-ovn-7.3.1-1.20210809163307.4e24f4c.el8ost"

We just had some confusion with one of the customers and double checked it, please let me know if this is correct.

Comment 10 Jakub Libosvar 2022-03-31 14:12:53 UTC
(In reply to ldenny from comment #9)

You're right - looking at the changelog:
16.1.7: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1791046
    - Simplify OVN Agent API implementation (rhbz#1788336)
    - Avoid race condition when processing RowEvents (rhbz#1788336)

16.1.8: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1873883
    - Reset "AgentCache" singleton in functional tests (rhbz#1788336)
    - Don't update AgentCache when Chassis_Private.chassis == [] (rhbz#1788336)
    - Convert OvnDbNotifyHandler rows to frozen rows (rhbz#1788336)
    - Add support for deleting ml2/ovn agents (rhbz#1788336)
    - Simplify OVN Agent API implementation (rhbz#1788336)
    - Avoid race condition when processing RowEvents (rhbz#1788336)
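
The comparison behind that conclusion can be sketched as a simple list difference. The entries below are copied from the two changelogs above; CHANGELOG_16_1_7, CHANGELOG_16_1_8 and new_entries are hypothetical names used only for illustration.

```python
CHANGELOG_16_1_7 = [
    "Simplify OVN Agent API implementation",
    "Avoid race condition when processing RowEvents",
]

CHANGELOG_16_1_8 = [
    'Reset "AgentCache" singleton in functional tests',
    "Don't update AgentCache when Chassis_Private.chassis == []",
    "Convert OvnDbNotifyHandler rows to frozen rows",
    "Add support for deleting ml2/ovn agents",
    "Simplify OVN Agent API implementation",
    "Avoid race condition when processing RowEvents",
]


def new_entries(older, newer):
    """Return entries that appear in newer but not in older, keeping order."""
    seen = set(older)
    return [entry for entry in newer if entry not in seen]


added = new_entries(CHANGELOG_16_1_7, CHANGELOG_16_1_8)

if __name__ == "__main__":
    for entry in added:
        print(entry)
```

The "Add support for deleting ml2/ovn agents" entry is among those that appear only in the 16.1.8 list.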


Since this was not tested, I wonder if we should retarget it to z9 so it gets properly verified. I'm moving this to ON_QA. Thanks for pointing this out, Lewis!

Comment 11 ldenny 2022-04-01 01:38:26 UTC
Awesome Jakub, thanks for the follow up!

Comment 12 OSP Team 2022-04-01 10:35:21 UTC
According to our records, this should be resolved by python-networking-ovn-7.3.1-1.20220113183502.el8ost. This build is available now.

Comment 14 Jakub Libosvar 2022-11-14 20:10:46 UTC
*** Bug 2114723 has been marked as a duplicate of this bug. ***