Bug 1495692
Summary: | [RFE] [OVN] Implement the agents API | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> |
Component: | python-networking-ovn | Assignee: | Terry Wilson <twilson> |
Status: | CLOSED ERRATA | QA Contact: | Eran Kuris <ekuris> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 14.0 (Rocky) | CC: | abregman, amuller, apevec, bcafarel, dalvarez, ekuris, jlibosva, lhh, lmartins, majopela, nyechiel, sclewis, tfreger, twilson |
Target Milestone: | Upstream M3 | Keywords: | FutureFeature, Reopened, RFE, Triaged, UserExperience |
Target Release: | 14.0 (Rocky) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | python-networking-ovn-5.0.2-0.20181009120341.99b02f6.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-01-11 11:48:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1653274 | ||
Bug Blocks: | 1619537 |
Description
Eran Kuris
2017-09-26 11:38:34 UTC
This is something not supported yet, which we could implement. From my POV this is more of an RFE. You could argue that we shouldn't expose the agents API if we don't support it, but that would be a different bug. IMO this is something we may eventually implement, and it is interesting for L3 HA, because the OpenStack L3 HA router API is tightly coupled to agents, and we can consider ovn-controllers as agents. We could use the nb_cfg / sb_cfg counters to detect which chassis are more or less alive.

*** Bug 1530194 has been marked as a duplicate of this bug. ***

Only https://review.openstack.org/#/c/583959/ remaining to merge.

(In reply to Bernard Cafarelli from comment #20)
> Only https://review.openstack.org/#/c/583959/ remaining to merge

That's now merged, and I've proposed the backport to stable/rocky upstream at https://review.openstack.org/#/c/606904/

*** Bug 1483842 has been marked as a duplicate of this bug. ***

Some thoughts here:

* If the node is rebooted properly, it should clear itself from the Chassis table, so the agent on this node should look dead.
* When the node comes back up again, it should register itself as an OVN chassis, so it should show as alive.
* If this is not triggering the proper updates, shall we increment NB_Global every X seconds (whatever neutron configures for the report interval)?

(In reply to Daniel Alvarez Sanchez from comment #27)
> Some thoughts here:
>
> * If the node is rebooted properly, it should clear itself from the Chassis
> table, so the agent on this node should look dead.
> * When the node comes back up again, it should register itself as an OVN
> chassis, so it should show as alive.
> * If this is not triggering the proper updates, shall we increment NB_Global
> every X seconds (whatever neutron configures for the report interval)?

Took a quick look, and the "timeout" used to determine whether the agent is alive or not is the agent_down_time from neutron [0]. But as you said, we don't seem to be triggering any updates to NB_Global using that time [0], so perhaps this is not being triggered when we don't have a clean shutdown (clean because ovn-controller would remove its Chassis entry, and that would trigger the update, I assume).

[0] https://github.com/openstack/networking-ovn/blob/687889d92706c9120b3fdd0efcc22a72e0c64637/networking_ovn/ml2/mech_driver.py#L890

Daniel & Lucas, this is exactly the issue: when I reboot the node, it does not update that it is dead, and I still see it as "alive". With this bug, basically, I can't verify/confirm the RFE.

After running the test plan on OpenStack/14.0-RHEL-7/2018-11-22.2/, I found some issues that block me from verifying this RFE. All the info can be found in the Polarion test run report: https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/testrun?id=20181017-1402

Eran, when you say "I noticed that when I reboot one of the nodes the status did not change", do you mean that it doesn't change *ever*? Because it should change after Neutron's agent_down_time value has been reached (which I think defaults to 75 seconds).

The code should return False if we don't have a record of the agent or if the value in our record is older than agent_down_time seconds.

The config values that are important in neutron are:

[DEFAULT]
agent_down_time = 75  # If something hasn't been down this long, it isn't down

[AGENT]
report_interval = 30  # should be < 0.5 * agent_down_time

(In reply to Terry Wilson from comment #31)
> Eran, when you say "I noticed that when I reboot one of the nodes the status
> did not change", do you mean that it doesn't change *ever*? Because it should
> change after Neutron's agent_down_time value has been reached (which I think
> defaults to 75 seconds).
>
> The code should return False if we don't have a record of the agent or if
> the value in our record is older than agent_down_time seconds.
> The config values that are important in neutron are:
>
> [DEFAULT]
> agent_down_time = 75  # If something hasn't been down this long, it isn't down
>
> [AGENT]
> report_interval = 30  # should be < 0.5 * agent_down_time

When I reboot a compute node, for example, maybe it starts up faster than 75 seconds. I think in that case we should not rely on time, but on whether the agent is alive or not.

(In reply to Eran Kuris from comment #32)
> (In reply to Terry Wilson from comment #31)
> > Eran, when you say "I noticed that when I reboot one of the nodes the status
> > did not change", do you mean that it doesn't change *ever*? Because it should
> > change after Neutron's agent_down_time value has been reached (which I think
> > defaults to 75 seconds).
> >
> > The code should return False if we don't have a record of the agent or if
> > the value in our record is older than agent_down_time seconds.
> >
> > The config values that are important in neutron are:
> >
> > [DEFAULT]
> > agent_down_time = 75  # If something hasn't been down this long, it isn't down
> >
> > [AGENT]
> > report_interval = 30  # should be < 0.5 * agent_down_time
>
> When I reboot a compute node, for example, maybe it starts up faster than 75
> seconds. I think in that case we should not rely on time, but on whether the
> agent is alive or not.

Hi Eran,

In order to tell whether an agent is down, we need reports/checks, and there is always a tradeoff between the responsiveness of the health check and the overhead that sending and processing those reports/checks causes in the system. The more frequent they are, the more responsive the health-check system will be, but the more network traffic/CPU load we'll observe. Besides, imagine a monitoring system triggering alarms when all agents show as down just because of a glitch in the messaging system or some spurious error; perhaps in that situation we want to avoid those alarms until we confirm (after some time) that the agent is actually down.

This is the reason why the parameters that Terry mentioned exist: so that they can be tuned by operators depending on the actual needs of the environment, its criticality, etc. For the purpose of this BZ, if the reboot time is lower than agent_down_time, then you are not expected to see the agent status transition to 'dead'.

According to comment 33 and the tests I ran in test run https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/testrun?id=20181017-1402 on OpenStack/14.0-RHEL-7/2018-11-22.2/, I am verifying this RFE.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045
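The liveness logic Terry describes in comment 31 (return False when there is no record of the agent, or when the record is older than agent_down_time) can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual networking-ovn implementation; the in-memory `last_seen` store and the function names are hypothetical stand-ins:

```python
import time

# neutron's [DEFAULT] agent_down_time, in seconds (default 75)
AGENT_DOWN_TIME = 75

# Hypothetical in-memory record of when each agent (chassis) last
# reported in, e.g. by echoing the nb_cfg counter back to the SB DB.
last_seen = {}


def record_heartbeat(agent_id, now=None):
    """Record that the agent just reported in."""
    last_seen[agent_id] = time.time() if now is None else now


def is_agent_alive(agent_id, now=None):
    """Return False if we have no record of the agent, or if our
    record is older than agent_down_time seconds."""
    now = time.time() if now is None else now
    seen = last_seen.get(agent_id)
    if seen is None:
        return False
    return (now - seen) <= AGENT_DOWN_TIME
```

With the defaults above, an agent that last reported 30 seconds ago shows as alive, while one silent for 100 seconds shows as dead; a compute node that reboots in under 75 seconds therefore never transitions to 'dead', which matches the behaviour discussed in comments 32 and 33.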
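Daniel's third suggestion in comment 27 — incrementing NB_Global every report interval so that ovn-controllers keep echoing a fresh counter even when a node dies without a clean shutdown — could look roughly like this. `FakeNBGlobal` is a stand-in for the real OVSDB NB_Global row, not the networking-ovn API:

```python
class FakeNBGlobal:
    """Stand-in for the OVN NB_Global row; the real one lives in OVSDB.

    ovn-northd propagates nb_cfg to the southbound DB, and each
    ovn-controller copies it into its Chassis row; a chassis whose
    echoed value stops advancing can be considered dead.
    """

    def __init__(self):
        self.nb_cfg = 0

    def bump(self):
        self.nb_cfg += 1
        return self.nb_cfg


def periodic_bump(nb_global, ticks):
    """Bump NB_Global once per report interval.

    `ticks` bounds the loop so this sketch terminates; a real agent
    would run forever, sleeping report_interval seconds per tick.
    """
    for _ in range(ticks):
        nb_global.bump()
    return nb_global.nb_cfg
```

The design tradeoff Daniel and Terry discuss applies directly here: a shorter interval makes dead chassis visible sooner, at the cost of extra OVSDB traffic on every bump.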