Bug 1495692 - [RFE] [OVN] Implement the agents API
Summary: [RFE] [OVN] Implement the agents API
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: Upstream M3
Target Release: 14.0 (Rocky)
Assignee: Terry Wilson
QA Contact: Eran Kuris
URL:
Whiteboard:
Duplicates: 1483842 1530194
Depends On: 1653274
Blocks: 1619537
Reported: 2017-09-26 11:38 UTC by Eran Kuris
Modified: 2019-09-09 16:38 UTC (History)

Fixed In Version: python-networking-ovn-5.0.2-0.20181009120341.99b02f6.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-11 11:48:07 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1719574 0 None None None 2017-09-26 11:38:34 UTC
OpenStack gerrit 578225 0 None MERGED Add support for Neutron agent api 2021-02-17 20:20:28 UTC
OpenStack gerrit 583959 0 None MERGED Register OVN metadata as an agent 2021-02-17 20:20:28 UTC
OpenStack gerrit 585027 0 None MERGED Use neutron's agent_down_time for agent liveness 2021-02-17 20:20:28 UTC
OpenStack gerrit 606904 0 None MERGED Register OVN metadata as an agent 2021-02-17 20:20:28 UTC
Red Hat Product Errata RHEA-2019:0045 0 None None None 2019-01-11 11:48:38 UTC

Description Eran Kuris 2017-09-26 11:38:34 UTC
Description of problem:

The command 'openstack network agent list' doesn't display the status of the OVN services on each node.

In OVN there are no agents (as there are in ODL), but I expected to get some indication of the connection status between the OVN services on each node (controller and compute nodes).

Version-Release: Pike / OSP12 HA OVN 
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
puppet-ovn-11.3.1-0.20170825135756.c03c3ed.el7ost.noarch
python-networking-ovn-3.0.1-0.20170906223255.c663db6.el7ost.noarch
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64

How reproducible: 100%

Steps to Reproduce:
1. Deploy Director HA OSP12 with OVN
2. Run the command: 'openstack network agent list'

Actual results: no output!

Expected results:
Example taken from ODL:
# openstack network agent list
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| ID                                   | Agent Type     | Host                     | Availability Zone | Alive | State | Binary                       |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| 2f6fae0d-8cf4-4b99-8c6c-d371fe5827c8 | ODL L2         | controller-0.localdomain | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 4024d452-91cc-4b80-87b4-a771212744e1 | ODL L2         | compute-0.localdomain    | None              | :-)   | UP    | neutron-odlagent-portbinding |
| 69d9b928-54fd-42e2-ab4e-46744f98d1fc | Metadata agent | controller-0.localdomain | None              | :-)   | UP    | neutron-metadata-agent       |
| b0a28f38-e991-4768-87dc-cedf8a02e721 | DHCP agent     | controller-0.localdomain | nova              | :-)   | UP    | neutron-dhcp-agent           |
| eecb8ce1-8589-411e-b5c6-72251af100e2 | ODL L2         | compute-1.localdomain    | None              | :-)   | UP    | neutron-odlagent-portbinding |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+

Expected: the status of each OVN service on each node:
# openstack network agent list
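+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| ID                                   | Agent Type     | Host                     | Availability Zone | Alive | State | Binary                       |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+
| 2f6fae0d-8cf4-4b99-8c6c-d371fe5827c8 | OVN            | controller-0.localdomain | None              | :-)   | UP    | neutron-ovnagent-portbinding |
| 4024d452-91cc-4b80-87b4-a771212744e1 | OVN            | compute-0.localdomain    | None              | :-)   | UP    | neutron-ovnagent-portbinding |
+--------------------------------------+----------------+--------------------------+-------------------+-------+-------+------------------------------+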

Comment 1 Miguel Angel Ajo 2017-09-26 12:46:19 UTC
This is not supported yet, but we could implement it. From my POV this is more of an RFE. You could argue that we shouldn't expose the agent API if we don't support it, but that would be a different bug.

IMO this is something we may eventually implement, and it's interesting for L3 HA, because the OpenStack L3 HA router API is tightly coupled to agents, and we could treat the ovn-controller instances as agents.

We could use nb_cfg / sb_cfg counters to detect which chassis are more or less alive.
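
A minimal sketch of the counter idea, not the networking-ovn implementation. It assumes each chassis reports back the last global configuration sequence number it has processed, so a chassis whose number lags behind the global one has not caught up and may be down. The `Chassis` dataclass and the `stale_chassis` helper are hypothetical names used only for illustration; they are not the real OVN schema or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chassis:
    """Hypothetical stand-in for a chassis record (not the real OVN schema)."""
    name: str
    nb_cfg: int  # last global sequence number this chassis reported processing


def stale_chassis(global_nb_cfg: int, chassis: List[Chassis]) -> List[str]:
    """Return the names of chassis that have not caught up with global_nb_cfg."""
    return [c.name for c in chassis if c.nb_cfg < global_nb_cfg]


rows = [
    Chassis("controller-0.localdomain", nb_cfg=7),
    Chassis("compute-0.localdomain", nb_cfg=7),
    Chassis("compute-1.localdomain", nb_cfg=5),  # lagging, possibly down
]
print(stale_chassis(7, rows))  # ['compute-1.localdomain']
```

A lagging counter only says that a chassis has not acknowledged the latest change, so something still has to bump the global counter regularly for this to work as a liveness signal.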

Comment 7 Miguel Angel Ajo 2018-06-10 22:31:04 UTC
*** Bug 1530194 has been marked as a duplicate of this bug. ***

Comment 20 Bernard Cafarelli 2018-09-26 10:00:30 UTC
Only https://review.openstack.org/#/c/583959/ remaining to merge

Comment 21 Lucas Alvares Gomes 2018-10-01 08:52:30 UTC
(In reply to Bernard Cafarelli from comment #20)
> Only https://review.openstack.org/#/c/583959/ remaining to merge

That's now merged and I've proposed the backport to stable/rocky upstream at https://review.openstack.org/#/c/606904/

Comment 25 Ofer Blaut 2018-11-11 10:25:08 UTC
*** Bug 1483842 has been marked as a duplicate of this bug. ***

Comment 27 Daniel Alvarez Sanchez 2018-11-26 10:57:00 UTC
Some thoughts here:

* If the node is rebooted properly, it should remove itself from the chassis table, which should make the agent on this node appear dead.
* When the node comes back up, it should register itself as an OVN chassis again, so it should show as alive.
* If this is not triggering the proper updates, should we increment NB_Global every X seconds (whatever Neutron configures for the report interval)?

Comment 28 Lucas Alvares Gomes 2018-11-26 11:00:56 UTC
(In reply to Daniel Alvarez Sanchez from comment #27)
> Some thoughts here:
> 
> * If the node is rebooted properly, it should remove itself from the
> chassis table, which should make the agent on this node appear dead.
> * When the node comes back up, it should register itself as an OVN
> chassis again, so it should show as alive.
> * If this is not triggering the proper updates, should we increment
> NB_Global every X seconds (whatever Neutron configures for the report
> interval)?

I took a quick look, and the "timeout" used to determine whether the agent is alive or not is the agent_down_time from Neutron [0].

But as you said, we don't seem to be triggering any updates to NB_Global using that time [0], so perhaps this is not being triggered when we don't have a clean shutdown (clean because ovn-controller would remove its chassis entry, and I assume that would trigger the update).

[0] https://github.com/openstack/networking-ovn/blob/687889d92706c9120b3fdd0efcc22a72e0c64637/networking_ovn/ml2/mech_driver.py#L890

Comment 29 Eran Kuris 2018-11-26 12:10:18 UTC
Daniel and Lucas, this is exactly the issue: when I reboot the node, its status is not updated to dead and I still see it as "alive".
With this bug, I basically can't verify/confirm the RFE.

Comment 30 Eran Kuris 2018-11-26 13:12:49 UTC
After running the test plan on OpenStack/14.0-RHEL-7/2018-11-22.2/, I found some issues that block me from verifying this RFE.
All the details are in the Polarion test run report:

https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/testrun?id=20181017-1402

Comment 31 Terry Wilson 2018-11-26 19:06:59 UTC
Eran, when you say "I noticed that when I reboot one of the nodes the status did not change" do you mean that it doesn't change *ever*? Because it should change after Neutron's agent_down_time value has been reached (which I think defaults to 75 seconds).

The code should return False if we don't have a record of the agent or if the value in our record is older than agent_down_time seconds.

The config values that are important in neutron are

[DEFAULT]
agent_down_time = 75 # If something hasn't been down this long, it isn't down

[AGENT]
report_interval = 30 # should be < 0.5 * agent_down_time
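
A minimal sketch of the rule described above, assuming the driver keeps a last-seen timestamp per agent. The `is_alive` function and the `last_seen` mapping are hypothetical names for illustration; this is not the actual networking-ovn code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

AGENT_DOWN_TIME = 75  # seconds, mirrors the [DEFAULT] value above


def is_alive(last_seen: Dict[str, datetime], agent_id: str,
             now: Optional[datetime] = None,
             down_time: int = AGENT_DOWN_TIME) -> bool:
    """False if the agent is unknown or its record is older than down_time."""
    now = now or datetime.now(timezone.utc)
    seen = last_seen.get(agent_id)
    if seen is None:
        return False  # no record of this agent at all
    return now - seen <= timedelta(seconds=down_time)


# Fixed reference time so the example is deterministic.
now = datetime(2018, 11, 26, 19, 0, 0, tzinfo=timezone.utc)
last_seen = {
    "compute-0": now - timedelta(seconds=30),  # reported recently
    "compute-1": now - timedelta(seconds=80),  # older than 75 seconds
}
print(is_alive(last_seen, "compute-0", now=now))  # True
print(is_alive(last_seen, "compute-1", now=now))  # False
print(is_alive(last_seen, "compute-2", now=now))  # False
```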

Comment 32 Eran Kuris 2018-11-27 07:18:48 UTC
(In reply to Terry Wilson from comment #31)
> Eran, when you say "I noticed that when I reboot one of the nodes the status
> did not change" do you mean that it doesn't change *ever*? Because it should
> change after Neutron's agent_down_time value has been reached (which I think
> defaults to 75 seconds).
> 
> The code should return False if we don't have a record of the agent or if
> the value in our record is older than agent_down_time seconds.
> 
> The config values that are important in neutron are
> 
> [DEFAULT]
> agent_down_time = 75 # If something hasn't been down this long, it isn't down
> 
> [AGENT]
> report_interval = 30 # should be < 0.5 * agent_down_time

When I reboot a compute node, for example, it may come back up in less than 75 seconds.
I think in that case we should not rely on time, but on whether the agent is actually alive or not.

Comment 33 Daniel Alvarez Sanchez 2018-11-27 08:43:01 UTC
(In reply to Eran Kuris from comment #32)
> (In reply to Terry Wilson from comment #31)
> > Eran, when you say "I noticed that when I reboot one of the nodes the status
> > did not change" do you mean that it doesn't change *ever*? Because it should
> > change after Neutron's agent_down_time value has been reached (which I think
> > defaults to 75 seconds).
> > 
> > The code should return False if we don't have a record of the agent or if
> > the value in our record is older than agent_down_time seconds.
> > 
> > The config values that are important in neutron are
> > 
> > [DEFAULT]
> > agent_down_time = 75 # If something hasn't been down this long, it isn't down
> > 
> > [AGENT]
> > report_interval = 30 # should be < 0.5 * agent_down_time
> 
> When I reboot a compute node, for example, it may come back up in less
> than 75 seconds.
> I think in that case we should not rely on time, but on whether the agent
> is actually alive or not.

Hi Eran,

To be able to tell whether an agent is down, we need reports/checks, and there's always a tradeoff between the responsiveness of the healthcheck and the overhead that sending and processing those reports/checks causes in the system. The more frequent they are, the more responsive the healthcheck system will be, but the more network traffic and CPU load we'll observe. Besides, imagine a monitoring system triggering alarms when all agents show as down just because of a glitch in the messaging system or some spurious error; in that situation we may want to hold off on those alarms until we confirm (after some time) that the agent is actually down.

This is the reason why those parameters that Terry mentioned exist, so that they can be tuned by operators depending on the actual needs of the environment, criticality, etc. For the purpose of this BZ, if the reboot time is lower than the agent_down_time, then it's not expected that you see the agent status transitioning to 'dead'.
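
As a rough worked example of that last point, under a simplified model: the agent is shown as dead only while the gap between its last report before the reboot and its first report after it exceeds agent_down_time. The `dead_window` helper below is a hypothetical name used only for this illustration.

```python
def dead_window(report_gap: float, agent_down_time: float = 75.0) -> float:
    """Seconds during which the agent would be shown as dead.

    report_gap is the time between the last report before the reboot and
    the first report after it (an assumed, simplified model).
    """
    return max(0.0, report_gap - agent_down_time)


print(dead_window(60))   # 0.0: back within agent_down_time, never shown dead
print(dead_window(120))  # 45.0: shown dead for 45 seconds
```

So a 60-second reboot is invisible with the default of 75 seconds, and an operator who needs to see it would have to lower agent_down_time (and report_interval with it).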

Comment 34 Eran Kuris 2018-11-27 08:58:47 UTC
Based on comment 33 and the tests I ran in test run: https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/testrun?id=20181017-1402

on OpenStack/14.0-RHEL-7/2018-11-22.2/

I am verifying this RFE.

Comment 36 errata-xmlrpc 2019-01-11 11:48:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

