Bug 1793440 - OSP16 - Network agents often not alive
Summary: OSP16 - Network agents often not alive
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z1
Target Release: 16.0 (Train on RHEL 8.1)
Assignee: Jakub Libosvar
QA Contact: Eduardo Olivares
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-21 11:21 UTC by Eduardo Olivares
Modified: 2024-03-25 15:38 UTC
CC List: 13 users

Fixed In Version: python-networking-ovn-7.1.0-0.20200124055736.75fe49a.el8ost
Doc Type: Known Issue
Doc Text:
In Red Hat OpenStack 16.0, there is a known issue where the command `openstack network agent list` intermittently reports that the OVN agents are down, even though the agents are alive and the cloud is operational. The affected agents are the OVN Controller agent, the OVN Metadata agent, and the OVN Controller Gateway agent. There is currently no workaround for this issue; ignore the output of the `openstack network agent list` command.
Clone Of:
Environment:
Last Closed: 2020-03-03 09:41:29 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs and screenshots (379.50 KB, application/gzip)
2020-01-21 11:21 UTC, Eduardo Olivares


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 703612 0 None MERGED [ovn] Agent liveness - allow time to propagate checks 2021-02-16 12:14:10 UTC
OpenStack gerrit 704112 0 None MERGED Agent liveness - allow time to propagate checks 2021-02-16 12:14:11 UTC
Red Hat Product Errata RHBA-2020:0654 0 None None None 2020-03-03 09:42:08 UTC

Description Eduardo Olivares 2020-01-21 11:21:16 UTC
Created attachment 1654162 [details]
logs and screenshots

Description of problem:
Every few (~5) minutes, some network agents are reported as down. Frequently, I see in the neutron server logs that only the OVN Metadata agent from compute-0 is down, but sometimes it happens with more agents from compute-0 and compute-1. I have deployed OSP16 (puddle from Jan 13) with podman 1.6, with 3 controllers and 2 computes. I created a few VM instances, but I'm not doing anything with them, so I don't think the system is overloaded (the top and free commands confirm it is not).

The configured heartbeat timeout is 60 seconds:
()[neutron@controller-2 /]$ grep ^heart /etc/neutron/neutron.conf
# heartbeat's keep-alive fails (0 disables heartbeat). (integer value)
heartbeat_timeout_threshold=60

I have been filtering some logs and taking some screenshots; you can find them attached.
The agent updates the SB DB frequently enough (UpdateChassisExtIdsCommand). However, the timestamp values in the SB DB are not always updated (see the attached screenshots or server.log). In the logs you can see that at one point the timestamps were more than 2 minutes old:
07:54:15.817 'neutron:liveness_check_at': '2020-01-21T07:52:05.537559+00:00', 'neutron:metadata_liveness_check_at': '2020-01-21T07:52:05.547188+00:00', 'neutron:ovn-metadata-sb-cfg': '33989',
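
For reference, here is a minimal sketch of how a 60-second threshold turns a stale timestamp into a "dead" agent (illustrative Python; the function and constant names are assumptions, not the actual networking-ovn code):

from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = 60  # heartbeat_timeout_threshold from neutron.conf

def is_agent_alive(liveness_check_at):
    # Parse the ISO timestamp stored in the Chassis external_ids,
    # e.g. '2020-01-21T07:52:05.537559+00:00'.
    last_seen = datetime.fromisoformat(liveness_check_at)
    age = datetime.now(timezone.utc) - last_seen
    return age <= timedelta(seconds=HEARTBEAT_TIMEOUT)

# At 07:54:15 the timestamp above is ~2m10s old, so this returns
# False and the agent is reported as down even though it is running.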

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200113.n.0


How reproducible:
It happens at least once every 5 minutes


Steps to Reproduce:
1. Deploy osp16 with OVN
2. watch -d -n5 "openstack network agent list"

Actual results:
Agents are often set as not alive


Expected results:
Network agents should always be set as alive in normal circumstances.


Additional info:

Comment 1 Daniel Alvarez Sanchez 2020-01-21 11:44:07 UTC
After some digging, it looks like nb_cfg is always ahead of what we read from the Chassis rows:


1) neutron-server increments nb_cfg in the NB_Global table from X to X+1
2) neutron-server almost immediately checks all the Chassis rows to see if they have written X+1 [1]
3) neutron-server processes the updates from each agent from X to X+1

*Most* of the time, the condition in step 2 doesn't hold, so the timestamp is not updated. The result is that after 60 seconds (the default agent timeout), the agent is shown as dead. Sometimes step 3 happens before step 2, so the timestamp gets updated and everything is fine, but this is not the normal case:


1) Bump of nb_cfg
2020-01-21 11:35:59.534 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36915
2020-01-21 11:35:59.538 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36916


2) Check of each chassis ext_id against our new bumped nb_cfg: 
2020-01-21 11:35:59.539 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
2020-01-21 11:35:59.540 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
2020-01-21 11:35:59.541 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
2020-01-21 11:35:59.542 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
2020-01-21 11:35:59.543 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
2020-01-21 11:35:59.544 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
2020-01-21 11:35:59.546 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915


3) Processing updates [2] in the ChassisEvent (some are even older!)
2020-01-21 11:35:59.546 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.548 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.556 32 INFO networking_ovn.ovsdb.ovsdb_monitor [req-efa34cac-2296-4d30-b153-9630b0309fcd - - - - -] XXX chassis update:
2020-01-21 11:35:59.556 27 INFO networking_ovn.ovsdb.ovsdb_monitor [req-91f7d181-bfa3-4646-9814-bb680d011081 - - - - -] XXX chassis update:
2020-01-21 11:35:59.557 25 INFO networking_ovn.ovsdb.ovsdb_monitor [req-420e5a25-13e4-4da6-8277-8a3a1028c9e9 - - - - -] XXX chassis update:
2020-01-21 11:35:59.756 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36916
2020-01-21 11:35:59.778 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36916


[1] https://github.com/openstack/networking-ovn/blob/bf577e5a999f7db4cb9b790664ad596e1926d9a0/networking_ovn/ml2/mech_driver.py#L988
[2] https://github.com/openstack/networking-ovn/blob/6302298e9c4313f1200c543c89d92629daff9e89/networking_ovn/ovsdb/ovsdb_monitor.py#L74
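
In pseudocode, the racy sequence looks roughly like this (an illustrative Python sketch with simplified names, not the actual driver code):

def update_agent_timestamps(nb_global, chassis_rows, now):
    nb_global.nb_cfg += 1                        # step 1: bump
    for chassis in chassis_rows:                 # step 2: immediate check
        # The agents have usually not echoed the new value yet
        # (step 3 runs later), so this strict comparison fails and
        # the liveness timestamp is left stale.
        if chassis.nb_cfg == nb_global.nb_cfg:
            chassis.liveness_check_at = now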

Comment 2 Daniel Alvarez Sanchez 2020-01-21 11:49:17 UTC
IMO, we need to space the bump of nb_cfg [0] and the check [1] apart in time, as the NB_Global change needs to be propagated to the SB, processed by all agents, and then make it back to neutron-server, which needs to process the JSON and update the internal tables. So even if this round trip is fast, most of the time it is not fast enough.
 

[0] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1093
[1] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1098

Comment 4 Daniel Alvarez Sanchez 2020-01-21 14:37:47 UTC
One solution that we can implement is allowing a difference of 1.

1) Neutron server bumps nb_cfg from x to x+1
2) Neutron server reads the nb_cfg from every agent and, if (expected - actual) <= 1, updates the timestamps


If the agent is dead, the next time we check, (expected - actual) will be > 1, so the timestamps will not be updated and the agent will eventually be detected as dead.
If the agent is alive, the values will stay within the margin.

Tried it here:

https://review.opendev.org/#/c/703612/
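
The change amounts to relaxing the strict equality check from comment 1; a minimal sketch of the idea (illustrative names, see the review above for the actual patch):

AGENT_CFG_TOLERANCE = 1  # allow agents to lag one generation behind

def update_agent_timestamps(nb_global, chassis_rows, now):
    nb_global.nb_cfg += 1
    for chassis in chassis_rows:
        # A live agent may still be reporting the previous value, so
        # accept expected - actual <= 1. A dead agent falls further
        # behind on the next bump and is eventually marked down.
        if nb_global.nb_cfg - chassis.nb_cfg <= AGENT_CFG_TOLERANCE:
            chassis.liveness_check_at = now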

Comment 5 Eduardo Olivares 2020-01-22 08:39:07 UTC
(In reply to Daniel Alvarez Sanchez from comment #4)
> One solution that we can implement is allowing a difference of 1.
> 
> 1) Neutron server bumps nb_cfg from x to x+1
> 2) Neutron server reads the nb_cfg from every agent and, if
> (expected - actual) <= 1, updates the timestamps
> 
> 
> If the agent is dead, the next time we check, (expected - actual) will
> be > 1, so the timestamps will not be updated and the agent will
> eventually be detected as dead.
> If the agent is alive, the values will stay within the margin.
> 
> Tried it here:
> 
> https://review.opendev.org/#/c/703612/

This patch worked. For several hours, neither the neutron logs nor the output of 'openstack network agent list' has shown any dead agents.

Comment 6 Eran Kuris 2020-01-22 08:45:05 UTC
BTW, Candido checked whether this issue appears on an ML2/OVS setup, and it does not reproduce there.

Comment 15 Eduardo Olivares 2020-02-14 15:43:15 UTC
Verified on OSP16 z1:
RHOS_TRUNK-16.0-RHEL-8-20200213.n.1
python3-networking-ovn-7.1.0-0.20200204065607.57ac389.el8ost.noarch

The following command does not show any errors (I have been monitoring it for long enough): openstack network agent list
No errors are shown in the neutron server logs either.

Comment 17 errata-xmlrpc 2020-03-03 09:41:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0654

