Created attachment 1654162 [details]
logs and screenshots

Description of problem:

Every few (~5) minutes, some network agents are reported as down. Frequently, I see in the neutron server logs that only the OVN Metadata Agent from compute-0 is down, but sometimes it happens with more agents from compute-0 and compute-1.

I have deployed OSP16 (puddle from Jan 13) with podman 1.6, with 3 controllers and 2 computes. I created a few VM instances, but I'm not doing anything with them, so I don't think the system is overloaded (top and free confirm it is not).

The configured heartbeat timeout is 60 seconds:

()[neutron@controller-2 /]$ grep ^heart /etc/neutron/neutron.conf
# heartbeat's keep-alive fails (0 disables heartbeat). (integer value)
heartbeat_timeout_threshold=60

I have been filtering some logs and taking some snapshots; you can find them attached. The agent updates the SB DB frequently enough (UpdateChassisExtIdsCommand). However, the timestamp values in the SB DB are not always updated (see the attached screenshots or server.log). In the logs you can see that at some point the timestamps were more than 2 minutes old:

07:54:15.817
'neutron:liveness_check_at': '2020-01-21T07:52:05.537559+00:00',
'neutron:metadata_liveness_check_at': '2020-01-21T07:52:05.547188+00:00',
'neutron:ovn-metadata-sb-cfg': '33989',

Version-Release number of selected component (if applicable):
RHOS_TRUNK-16.0-RHEL-8-20200113.n.0

How reproducible:
It happens at least once every 5 minutes.

Steps to Reproduce:
1. Deploy OSP16 with OVN
2. watch -d -n5 "openstack network agent list"
3.

Actual results:
Agents are often shown as not alive.

Expected results:
Network agents should always be shown as alive under normal circumstances.

Additional info:
After some digging, it looks like nb_cfg is always ahead of what we read from the Chassis rows:

1) neutron-server increments nb_cfg in the NB_Global table from X to X+1
2) neutron-server almost immediately checks all the Chassis rows to see if they have written X+1 [1]
3) neutron-server processes the updates from each agent from X to X+1

*Most* of the time, the condition in step 2 doesn't hold, so the timestamp is not updated. The result is that after 60 seconds (the agent timeout default value), the agent is shown as dead. Sometimes 3) happens before 2), so the timestamp gets updated and all is fine, but this is not the normal case:

1) Bump of nb_cfg:

2020-01-21 11:35:59.534 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36915
2020-01-21 11:35:59.538 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36916

2) Check of each chassis ext_id against our newly bumped nb_cfg:

2020-01-21 11:35:59.539 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.540 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.541 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.542 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.543 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.544 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.546 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915

3) Processing updates [2] in the ChassisEvent (some are even older!):

2020-01-21 11:35:59.546 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.548 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.556 32 INFO networking_ovn.ovsdb.ovsdb_monitor [req-efa34cac-2296-4d30-b153-9630b0309fcd - - - - -] XXX chassis update:
2020-01-21 11:35:59.556 27 INFO networking_ovn.ovsdb.ovsdb_monitor [req-91f7d181-bfa3-4646-9814-bb680d011081 - - - - -] XXX chassis update:
2020-01-21 11:35:59.557 25 INFO networking_ovn.ovsdb.ovsdb_monitor [req-420e5a25-13e4-4da6-8277-8a3a1028c9e9 - - - - -] XXX chassis update:
2020-01-21 11:35:59.756 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36916
2020-01-21 11:35:59.778 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36916

[1] https://github.com/openstack/networking-ovn/blob/bf577e5a999f7db4cb9b790664ad596e1926d9a0/networking_ovn/ml2/mech_driver.py#L988
[2] https://github.com/openstack/networking-ovn/blob/6302298e9c4313f1200c543c89d92629daff9e89/networking_ovn/ovsdb/ovsdb_monitor.py#L74
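The racy bump-then-check sequence in steps 1 and 2 can be sketched roughly like this (a simplified illustration, not the driver code from [1]; the function and variable names, e.g. check_agents and chassis_cfgs, are made up for the example):

```python
def check_agents(nb_global, chassis_cfgs, timestamps, now):
    """Sketch of the racy liveness check: bump nb_cfg, then compare
    each chassis against the new value immediately.

    nb_global:    dict holding the NB_Global nb_cfg counter
    chassis_cfgs: last nb_cfg acked by each chassis (from its ext_ids)
    timestamps:   liveness timestamps keyed by chassis name
    """
    nb_global["nb_cfg"] += 1            # step 1: bump X -> X+1
    expected = nb_global["nb_cfg"]
    for name, acked in chassis_cfgs.items():
        # Step 2 runs before the bump has made the round trip
        # NB -> SB -> agent -> SB -> neutron-server, so the chassis
        # is still at X and this strict equality check fails.
        if acked == expected:
            timestamps[name] = now
    return timestamps


# The chassis has acked 36915, but the server already expects 36916,
# so the liveness timestamp is left stale:
nb = {"nb_cfg": 36915}
chassis = {"compute-0": 36915}
stamps = check_agents(nb, chassis, {"compute-0": "old"}, "new")
print(stamps["compute-0"])  # still "old"
```

Once the timestamp stays stale past the 60-second threshold, the agent shows up as dead in 'openstack network agent list' even though it is acking every bump, just one step late.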
IMO, we need to space out the bump of nb_cfg [0] and the check [1] in time, as the NB_Global change needs to be propagated to the SB, processed by all agents, and then travel back to neutron-server, which needs to process the JSON and update its internal tables. So even if it's fast, most of the time it is not fast enough.

[0] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1093
[1] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1098
One solution that we can implement is to allow a difference of 1:

1) Neutron server bumps nb_cfg from X to X+1
2) Neutron server reads the nb_cfg from every agent and, if (expected - actual) <= 1, updates the timestamps

If the agent is dead, the next time we check, (expected - actual) will be > 1, so the timestamps will not be updated and the agent will eventually be detected as dead. If the agent is alive, we'll keep the values within the margin.

Tried it here:

https://review.opendev.org/#/c/703612/
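The proposed tolerance can be sketched like this (illustrative names only; the actual change is in the review linked above):

```python
def check_agents_with_margin(nb_global, chassis_cfgs, timestamps, now):
    """Sketch of the proposed fix: tolerate a lag of at most one bump."""
    nb_global["nb_cfg"] += 1            # bump X -> X+1
    expected = nb_global["nb_cfg"]
    for name, acked in chassis_cfgs.items():
        # A live agent is at worst one bump behind (it acked X while we
        # already expect X+1). A dead agent stops acking, so on the next
        # check the gap grows beyond 1 and its timestamp goes stale.
        if expected - acked <= 1:
            timestamps[name] = now
    return timestamps


nb = {"nb_cfg": 36915}
chassis = {"compute-0": 36915,   # alive, one bump behind
           "compute-1": 36910}   # dead, stopped acking a while ago
stamps = check_agents_with_margin(
    nb, chassis, {"compute-0": "old", "compute-1": "old"}, "new")
print(stamps)  # compute-0 refreshed, compute-1 left stale
```

The margin only masks the one-bump propagation delay; a genuinely dead agent still times out after the 60-second threshold as before.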
(In reply to Daniel Alvarez Sanchez from comment #4)
> One solution that we can implement is allowing a difference of 1.
>
> 1) Neutron server bumps nb_cfg from x to x+1
> 2) Neutron server reads the nb_cfg from every agent and if (expected-actual)
> <= 1, then update timestamps
>
> If the agent is dead, next time that we check, (expected-actual) will be > 1
> so the timestamps will not be updated and the agent will be eventually
> detected as dead.
> If the agent is alive, we'll keep the values within the margin.
>
> Tried it here:
>
> https://review.opendev.org/#/c/703612/

This patch worked. For several hours, neither the neutron logs nor the output of 'openstack network agent list' have shown any dead agents.
BTW, Candido checked whether this issue occurs on an ML2/OVS setup, and it does not reproduce there.
Verified on OSP16 z1:

RHOS_TRUNK-16.0-RHEL-8-20200213.n.1
python3-networking-ovn-7.1.0-0.20200204065607.57ac389.el8ost.noarch

The following command does not show any errors (I have been monitoring it for long enough):

openstack network agent list

No errors are shown in the neutron server logs either.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0654