Description of problem:

After the tobiko faults test test_z99_hard_reboot_controllers_recovery reboots the three controller nodes using [1], many tests fail because the neutron agent list request is answered with 200 OK but with 0 agents:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/test_results/tobiko_gate_1/tobiko_gate_1_03_faults_faults.html

The reboots happen here:

2023-08-22 00:16:19.657 191140 INFO tobiko.shell.sh._reboot - Host '192.168.24.17' is rebooting (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'').
2023-08-22 00:16:19.671 191140 INFO tobiko.shell.sh._reboot - Host '192.168.24.28' is rebooting (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'').
2023-08-22 00:16:19.685 191140 INFO tobiko.shell.sh._reboot - Host '192.168.24.21' is rebooting (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'').

The first request returning 0 agents happens 3 minutes later:

2023-08-22 00:19:19.305 191140 WARNING tobiko.openstack.tests._neutron - neutron returned 0 agents, which is not valid... retrying...

And the last one happens 2 hours later, i.e. neutron never recovers from this issue:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/test_results/tobiko_gate_1/tobiko_gate_1_04_after_faults_sanity.html

2023-08-22 02:16:22.816 335216 WARNING tobiko.openstack.tests._neutron - neutron returned 0 agents, which is not valid... retrying...

haproxy logs confirm that those requests were received, distributed among the 3 controllers and answered with 200 OK during that period of time:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-1/var/log/containers/haproxy/haproxy.log.gz

Aug 22 00:19:19 controller-1 haproxy[7]: 10.0.0.41:33602 [22/Aug/2023:00:19:19.011] neutron neutron/controller-0.internalapi.redhat.local 0/0/0/291/291 200 187 - - ---- 2/1/0/0/0 0/0 "GET /v2.0/agents HTTP/1.1"
Aug 22 00:19:24 controller-1 haproxy[7]: 10.0.0.41:33602 [22/Aug/2023:00:19:24.014] neutron neutron/controller-2.internalapi.redhat.local 0/0/3/346/349 200 187 - - ---- 2/1/0/0/0 0/0 "GET /v2.0/agents HTTP/1.1"
...
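The check that fails here boils down to polling GET /v2.0/agents and retrying while the returned list is empty. A rough sketch of such a poll (not tobiko's actual code), assuming openstacksdk and a clouds.yaml entry named "overcloud":

```
import time

import openstack   # openstacksdk


def wait_for_agents(cloud_name="overcloud", timeout=600, interval=5):
    """Poll GET /v2.0/agents until it returns a non-empty list."""
    conn = openstack.connect(cloud=cloud_name)
    deadline = time.time() + timeout
    while time.time() < deadline:
        agents = list(conn.network.agents())
        if agents:
            return agents
        # Mirrors the tobiko warning seen in the failing runs.
        print("neutron returned 0 agents, which is not valid... retrying...")
        time.sleep(interval)
    raise TimeoutError("neutron agent list stayed empty for %ss" % timeout)


if __name__ == "__main__":
    for agent in wait_for_agents():
        print(agent.agent_type, agent.host, agent.is_alive)
```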
From the neutron server logs, it seems the problem comes from the following OVSDB error:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-0/var/log/containers/neutron/server.log.1.gz

2023-08-22 00:19:17.020 54 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: The transaction failed because the IDL has been configured to require a database lock but didn't get it yet or has already lost it
2023-08-22 00:19:17.022 54 ERROR ovsdbapp.backend.ovs_idl.transaction [req-a9c8db34-4526-4a79-afa2-dfd0659fadc8 - - - - -] Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 118, in run
    txn.results.put(txn.do_commit())
  File "/usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 123, in do_commit
    raise RuntimeError(msg)
RuntimeError: OVSDB Error: The transaction failed because the IDL has been configured to require a database lock but didn't get it yet or has already lost it

Before that error, there are a lot of log messages from ovsdbapp.backend.ovs_idl.transaction. After that error, there are none. It seems neutron is unable to request info from the OVN DBs. Shouldn't it answer those API requests with an error?

OVN NBDB logs (log level set to info) include some errors at 00:17 and then no logs for more than 1 hour:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-0/var/log/containers/openvswitch/ovsdb-server-nb.log.gz
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-1/var/log/containers/openvswitch/ovsdb-server-nb.log.gz
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-2/var/log/containers/openvswitch/ovsdb-server-nb.log.gz

[1] sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230802.n.1

How reproducible:
This is the first time I see it (that's the reason why I set severity to medium).

Steps to Reproduce:
1. Run the OSP 17.1 tobiko job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/
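For reference, the "database lock" the error message refers to is a named OVSDB lock that the Neutron OVN worker requests from ovsdb-server through the python-ovs IDL. Below is a minimal sketch of inspecting that lock state, not Neutron's actual code; the endpoint, schema path and lock name are assumptions:

```
from ovs.db import idl as ovs_idl

OVN_NB = "tcp:127.0.0.1:6641"               # placeholder NB endpoint
SCHEMA = "/usr/share/ovn/ovn-nb.ovsschema"  # assumed schema location

helper = ovs_idl.SchemaHelper(location=SCHEMA)
helper.register_all()
nb_idl = ovs_idl.Idl(OVN_NB, helper)

# Request a named lock: until ovsdb-server grants it (or after it is lost,
# e.g. across a reconnection), transactions that require the lock fail with
# the RuntimeError seen in the neutron server log above.
nb_idl.set_lock("neutron_ovn_db_lock")      # lock name is an assumption

nb_idl.run()                                 # process one IDL iteration
if not nb_idl.has_lock:
    print("lock not held (contended: %s); commits would be rejected"
          % nb_idl.is_lock_contended)
```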
Hello:

After several retries trying to reproduce the error with the corresponding logs, we have managed to find the problem.

We realized that the empty list returned by "openstack network agent list" was happening randomly across the Neutron API servers. The issue was not deterministic: the same Neutron API server could return either the correct result or an empty list. After adding extra logging, we found the issue specifically affects some API workers after some hard reboots.

In the OVN mech driver, the method "get_agents" is patched to call both:
* The OVN mech driver "get_agents" method, which retrieves the OVN agent list stored in a local cache (no DB access).
* The "agents" extension "get_agents" method, which retrieves the agent list from the Neutron DB.

The patched method returns the combined results. This patching is done in [1].

The problem we found in the testing environment is that, sometimes, the "_setup_hash_ring" method called just before it [2] fails [3]. The OVN mech driver is loaded, but a number of methods (for example, "get_agents") are not correctly patched.

ACTIONS (a minimal sketch is included after the references below):
* Properly handle any possible error in the "_setup_hash_ring" call.
* Log a message at the end of the "post_fork_initialize" method to confirm that this event handler finished properly.

Regards.

[1] https://github.com/openstack/neutron/blob/c33805919eb89f5ea6d8b54d4142d4c829d9a2a0/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L355
[2] https://github.com/openstack/neutron/blob/c33805919eb89f5ea6d8b54d4142d4c829d9a2a0/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L349
[3] https://paste.opendev.org/show/bqzDPR5TukLq9d1GIcnz/
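Sketch of the proposed hardening (not the actual upstream patch; the class and the "_patch_get_agents" helper below are simplified placeholders, only "_setup_hash_ring" and "post_fork_initialize" come from the code referenced above):

```
from oslo_log import log as logging

LOG = logging.getLogger(__name__)


class OVNMechanismDriverSketch(object):
    """Simplified stand-in for the real OVN mechanism driver."""

    def _setup_hash_ring(self):
        """Placeholder for the real hash-ring setup, which may raise."""

    def _patch_get_agents(self):
        """Hypothetical placeholder for the method patching done in [1]."""

    def post_fork_initialize(self, resource, event, trigger, payload=None):
        try:
            self._setup_hash_ring()
        except Exception:
            # Make the failure visible instead of silently aborting the
            # rest of the per-worker initialization.
            LOG.exception('Hash ring setup failed in this API worker')
        self._patch_get_agents()
        LOG.info('post_fork_initialize finished for this API worker')
```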
Raised priority to high since neutron CI testing is blocked by this issue.
Verified on:
RHOS-17.1-RHEL-9-20240415.n.1
openstack-neutron-18.6.1-17.1.20231025110810.el9ost.noarch
python3-neutron-lib-2.10.3-17.1.20230706110935.el9ost.noarch

After executing the test that reboots all three controllers 100 times, the output of `openstack network agent list` remained consistent. The following script was used:

```
$ cat run.sh
#!/bin/bash
set -ex

export OS_CLOUD=overcloud

for i in {0..99}; do
    tox -e faults -- tobiko/tests/faults/ha/test_cloud_recovery.py::DisruptTripleoNodesTest::test_z99_hard_reboot_controllers_recovery | tee report/test_z99_hard_reboot_controllers_recovery-$i.log
    # Re-check the agent list several times after each reboot test.
    for j in {0..9}; do
        AGENTS=`openstack network agent list -f value`
        # Count rows where both the Alive and State columns are True.
        NAGENTS=`echo $AGENTS | grep -o " True True " | wc -l`
        if [ "$NAGENTS" != "7" ]; then
            echo "Wrong number of agents! $NAGENTS"
            echo "$AGENTS"
            exit 1
        fi
    done
done
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 17.1.3 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:2741