Bug 1495692
Summary: | [RFE] [OVN] Implement the agents API | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> |
Component: | python-networking-ovn | Assignee: | Terry Wilson <twilson> |
Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 14.0 (Rocky) | CC: | abregman, amuller, apevec, bcafarel, dalvarez, ekuris, jlibosva, lhh, lmartins, majopela, nyechiel, sclewis, tfreger, twilson |
Target Milestone: | Upstream M3 | Keywords: | FutureFeature, Reopened, RFE, Triaged, UserExperience |
Target Release: | 14.0 (Rocky) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | python-networking-ovn-5.0.2-0.20181009120341.99b02f6.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-01-11 11:48:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1653274 | ||
Bug Blocks: | 1619537 |
Description
Eran Kuris
2017-09-26 11:38:34 UTC
This is something not supported yet, which we could implement. From my POV this is more of an RFE. You could argue that we shouldn't expose the agents API if we don't support it, but that would be a different bug. IMO this is something we may eventually implement, and it is interesting for L3 HA, because the OpenStack L3 HA router API is tightly coupled to agents, and we can consider ovn-controllers as agents. We could use the nb_cfg / sb_cfg counters to detect which chassis are more or less alive.

*** Bug 1530194 has been marked as a duplicate of this bug. ***

Only https://review.openstack.org/#/c/583959/ remaining to merge.

(In reply to Bernard Cafarelli from comment #20)
> Only https://review.openstack.org/#/c/583959/ remaining to merge

That's now merged, and I've proposed the backport to stable/rocky upstream at https://review.openstack.org/#/c/606904/

*** Bug 1483842 has been marked as a duplicate of this bug. ***

Some thoughts here:

* If the node is rebooted properly, it should clear itself from the Chassis table, so the agent on this node should look dead.
* When the node comes back up again, it should register itself as an OVN chassis, so it should show as alive.
* If this is not triggering the proper updates, shall we increment NB_Global every X seconds (whatever neutron configures for the report interval)?

(In reply to Daniel Alvarez Sanchez from comment #27)
> Some thoughts here:
>
> * If the node is rebooted properly, it should clear itself from the Chassis
> table, so the agent on this node should look dead.
> * When the node comes back up again, it should register itself as an OVN
> chassis, so it should show as alive.
> * If this is not triggering the proper updates, shall we increment NB_Global
> every X seconds (whatever neutron configures for the report interval)?

Took a quick look, and the "timeout" used to determine whether the agent is alive or not is the agent_down_time from neutron [0]. But as you said, we don't seem to be triggering any updates to NB_Global using that time [0], so perhaps this is not being triggered when we don't have a clean shutdown (clean because ovn-controller would remove its Chassis entry, and that would trigger the update, I assume).

[0] https://github.com/openstack/networking-ovn/blob/687889d92706c9120b3fdd0efcc22a72e0c64637/networking_ovn/ml2/mech_driver.py#L890

Daniel & Lucas, this is exactly the issue: when I reboot the node, it does not update that it is dead, and I still see it as "alive". With this bug, basically, I can't verify/confirm the RFE.

After running the test plan on OpenStack/14.0-RHEL-7/2018-11-22.2/, I found some issues that block me from verifying this RFE. All the info can be found in the Polarion test run report: https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/testrun?id=20181017-1402

Eran, when you say "I noticed that when I reboot one of the nodes the status did not change", do you mean that it doesn't change *ever*? Because it should change after Neutron's agent_down_time value has been reached (which I think defaults to 75 seconds).

The code should return False if we don't have a record of the agent or if the value in our record is older than agent_down_time seconds.

The config values that are important in neutron are:

[DEFAULT]
agent_down_time = 75  # If something hasn't been down this long, it isn't down

[AGENT]
report_interval = 30  # should be < 0.5 * agent_down_time

(In reply to Terry Wilson from comment #31)
> Eran, when you say "I noticed that when I reboot one of the nodes the status
> did not change", do you mean that it doesn't change *ever*? Because it should
> change after Neutron's agent_down_time value has been reached (which I think
> defaults to 75 seconds).
>
> The code should return False if we don't have a record of the agent or if
> the value in our record is older than agent_down_time seconds.
> The config values that are important in neutron are:
>
> [DEFAULT]
> agent_down_time = 75  # If something hasn't been down this long, it isn't down
>
> [AGENT]
> report_interval = 30  # should be < 0.5 * agent_down_time

When I reboot a compute node, for example, maybe it starts up faster than 75 seconds. I think in that case we should not rely on time, but on whether the agent is alive or not.

(In reply to Eran Kuris from comment #32)
> (In reply to Terry Wilson from comment #31)
> > Eran, when you say "I noticed that when I reboot one of the nodes the status
> > did not change", do you mean that it doesn't change *ever*? Because it should
> > change after Neutron's agent_down_time value has been reached (which I think
> > defaults to 75 seconds).
> >
> > The code should return False if we don't have a record of the agent or if
> > the value in our record is older than agent_down_time seconds.
> >
> > The config values that are important in neutron are:
> >
> > [DEFAULT]
> > agent_down_time = 75  # If something hasn't been down this long, it isn't down
> >
> > [AGENT]
> > report_interval = 30  # should be < 0.5 * agent_down_time
>
> When I reboot a compute node, for example, maybe it starts up faster than 75
> seconds. I think in that case we should not rely on time, but on whether the
> agent is alive or not.

Hi Eran,

In order to tell whether an agent is down, we need reports/checks, and there is always a tradeoff between the responsiveness of the health check and the overhead that sending and processing those reports/checks causes in the system. The more frequent they are, the more responsive the health-check system will be, but the more network traffic/CPU load we'll observe. Besides, imagine a monitoring system triggering alarms when all agents show as down just because of a glitch in the messaging system or some spurious error; perhaps in that situation we want to avoid those alarms until we confirm (after some time) that the agent is actually down.

This is the reason why the parameters that Terry mentioned exist: so that they can be tuned by operators depending on the actual needs of the environment, its criticality, etc. For the purpose of this BZ, if the reboot time is lower than agent_down_time, then you are not expected to see the agent status transition to 'dead'.

According to comment 33 and the tests I ran in test run https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/testrun?id=20181017-1402 on OpenStack/14.0-RHEL-7/2018-11-22.2/, I am verifying this RFE.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045
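The liveness logic Terry describes in comment 31 (return False when there is no record of the agent, or when the record is older than agent_down_time) can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual networking-ovn implementation; the in-memory `last_seen` store and the function names are hypothetical stand-ins:

```python
import time

# neutron's [DEFAULT] agent_down_time, in seconds (default 75)
AGENT_DOWN_TIME = 75

# Hypothetical in-memory record of when each agent (chassis) last
# reported in, e.g. by echoing the nb_cfg counter back to the SB DB.
last_seen = {}


def record_heartbeat(agent_id, now=None):
    """Record that the agent just reported in."""
    last_seen[agent_id] = time.time() if now is None else now


def is_agent_alive(agent_id, now=None):
    """Return False if we have no record of the agent, or if our
    record is older than agent_down_time seconds."""
    now = time.time() if now is None else now
    seen = last_seen.get(agent_id)
    if seen is None:
        return False
    return (now - seen) <= AGENT_DOWN_TIME
```

With the defaults above, an agent that last reported 30 seconds ago shows as alive, while one silent for 100 seconds shows as dead; a compute node that reboots in under 75 seconds therefore never transitions to 'dead', which matches the behaviour discussed in comments 32 and 33.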
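Daniel's third suggestion in comment 27 — incrementing NB_Global every report interval so that ovn-controllers keep echoing a fresh counter even when a node dies without a clean shutdown — could look roughly like this. `FakeNBGlobal` is a stand-in for the real OVSDB NB_Global row, not the networking-ovn API:

```python
class FakeNBGlobal:
    """Stand-in for the OVN NB_Global row; the real one lives in OVSDB.

    ovn-northd propagates nb_cfg to the southbound DB, and each
    ovn-controller copies it into its Chassis row; a chassis whose
    echoed value stops advancing can be considered dead.
    """

    def __init__(self):
        self.nb_cfg = 0

    def bump(self):
        self.nb_cfg += 1
        return self.nb_cfg


def periodic_bump(nb_global, ticks):
    """Bump NB_Global once per report interval.

    `ticks` bounds the loop so this sketch terminates; a real agent
    would run forever, sleeping report_interval seconds per tick.
    """
    for _ in range(ticks):
        nb_global.bump()
    return nb_global.nb_cfg
```

The design tradeoff Daniel and Terry discuss applies directly here: a shorter interval makes dead chassis visible sooner, at the cost of extra OVSDB traffic on every bump.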