Description of problem:

After the tobiko faults test test_z99_hard_reboot_controllers_recovery reboots the three controller nodes using [1], many tests fail because the neutron agent list request is answered with 200 OK but with 0 agents:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/test_results/tobiko_gate_1/tobiko_gate_1_03_faults_faults.html

The reboots happen here:

2023-08-22 00:16:19.657 191140 INFO tobiko.shell.sh._reboot - Host '192.168.24.17' is rebooting (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'').
2023-08-22 00:16:19.671 191140 INFO tobiko.shell.sh._reboot - Host '192.168.24.28' is rebooting (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'').
2023-08-22 00:16:19.685 191140 INFO tobiko.shell.sh._reboot - Host '192.168.24.21' is rebooting (command='sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'').

The first request returning 0 agents happens 3 minutes later:

2023-08-22 00:19:19.305 191140 WARNING tobiko.openstack.tests._neutron - neutron returned 0 agents, which is not valid... retrying...

And the last one happens 2 hours later, i.e. neutron never recovers from this issue:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/test_results/tobiko_gate_1/tobiko_gate_1_04_after_faults_sanity.html

2023-08-22 02:16:22.816 335216 WARNING tobiko.openstack.tests._neutron - neutron returned 0 agents, which is not valid... retrying...

haproxy logs confirm that those requests were received, distributed among the 3 controllers and answered with 200 OK during that period of time:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-1/var/log/containers/haproxy/haproxy.log.gz

Aug 22 00:19:19 controller-1 haproxy[7]: 10.0.0.41:33602 [22/Aug/2023:00:19:19.011] neutron neutron/controller-0.internalapi.redhat.local 0/0/0/291/291 200 187 - - ---- 2/1/0/0/0 0/0 "GET /v2.0/agents HTTP/1.1"
Aug 22 00:19:24 controller-1 haproxy[7]: 10.0.0.41:33602 [22/Aug/2023:00:19:24.014] neutron neutron/controller-2.internalapi.redhat.local 0/0/3/346/349 200 187 - - ---- 2/1/0/0/0 0/0 "GET /v2.0/agents HTTP/1.1"
...
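The check that fails here boils down to polling GET /v2.0/agents and retrying while the returned list is empty. A rough sketch of such a poll (not tobiko's actual code), assuming openstacksdk and a clouds.yaml entry named "overcloud":

```
import time

import openstack   # openstacksdk


def wait_for_agents(cloud_name="overcloud", timeout=600, interval=5):
    """Poll GET /v2.0/agents until it returns a non-empty list."""
    conn = openstack.connect(cloud=cloud_name)
    deadline = time.time() + timeout
    while time.time() < deadline:
        agents = list(conn.network.agents())
        if agents:
            return agents
        # Mirrors the tobiko warning seen in the failing runs.
        print("neutron returned 0 agents, which is not valid... retrying...")
        time.sleep(interval)
    raise TimeoutError("neutron agent list stayed empty for %ss" % timeout)


if __name__ == "__main__":
    for agent in wait_for_agents():
        print(agent.agent_type, agent.host, agent.is_alive)
```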
From the neutron server logs, it seems the problem comes from the following OVSDB error:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-0/var/log/containers/neutron/server.log.1.gz

2023-08-22 00:19:17.020 54 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: The transaction failed because the IDL has been configured to require a database lock but didn't get it yet or has already lost it
2023-08-22 00:19:17.022 54 ERROR ovsdbapp.backend.ovs_idl.transaction [req-a9c8db34-4526-4a79-afa2-dfd0659fadc8 - - - - -] Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 118, in run
    txn.results.put(txn.do_commit())
  File "/usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 123, in do_commit
    raise RuntimeError(msg)
RuntimeError: OVSDB Error: The transaction failed because the IDL has been configured to require a database lock but didn't get it yet or has already lost it

Before that error, there are a lot of log messages from ovsdbapp.backend.ovs_idl.transaction. After that error, there are none. It seems neutron is unable to request info from the OVN DBs. Shouldn't it answer those API requests with an error?

OVN NBDB logs (log level set to info) include some errors at 00:17 and then no logs for more than 1 hour:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-0/var/log/containers/openvswitch/ovsdb-server-nb.log.gz
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-1/var/log/containers/openvswitch/ovsdb-server-nb.log.gz
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/234/controller-2/var/log/containers/openvswitch/ovsdb-server-nb.log.gz

[1] sudo /bin/sh -c 'echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger'

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230802.n.1

How reproducible:
This is the first time I see it (that's the reason why I set severity to medium).

Steps to Reproduce:
1. Run the OSP 17.1 tobiko job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/gate-sanity-17.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko_faults/
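For reference, the "database lock" the error message refers to is a named OVSDB lock that the Neutron OVN worker requests from ovsdb-server through the python-ovs IDL. Below is a minimal sketch of inspecting that lock state, not Neutron's actual code; the endpoint, schema path and lock name are assumptions:

```
from ovs.db import idl as ovs_idl

OVN_NB = "tcp:127.0.0.1:6641"               # placeholder NB endpoint
SCHEMA = "/usr/share/ovn/ovn-nb.ovsschema"  # assumed schema location

helper = ovs_idl.SchemaHelper(location=SCHEMA)
helper.register_all()
nb_idl = ovs_idl.Idl(OVN_NB, helper)

# Request a named lock: until ovsdb-server grants it (or after it is lost,
# e.g. across a reconnection), transactions that require the lock fail with
# the RuntimeError seen in the neutron server log above.
nb_idl.set_lock("neutron_ovn_db_lock")      # lock name is an assumption

nb_idl.run()                                 # process one IDL iteration
if not nb_idl.has_lock:
    print("lock not held (contended: %s); commits would be rejected"
          % nb_idl.is_lock_contended)
```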
Hello:

After several retries trying to reproduce the error with the corresponding logs, we have managed to find the problem.

We realized that the empty list returned by "openstack network agent list" was happening randomly across the Neutron API servers. The issue was not deterministic: the same Neutron API server could return either the correct result or an empty list. After adding extra logging, we found the issue specifically affects some API workers after some hard reboots.

In the OVN mech driver, the method "get_agents" is patched to call both:
* The OVN mech driver "get_agents" method, which retrieves the OVN agent list stored in a local cache (no DB access).
* The "agents" extension "get_agents" method, which retrieves the agent list from the Neutron DB.

The patched method returns the combined results. This patching is done in [1].

The problem we found in the testing environment is that, sometimes, the "_setup_hash_ring" method called just before it [2] fails [3]. The OVN mech driver is loaded, but a number of methods (for example, "get_agents") are not correctly patched.

ACTIONS (a minimal sketch is included after the references below):
* Properly handle any possible error in the "_setup_hash_ring" call.
* Log a message at the end of the "post_fork_initialize" method to confirm that this event handler finished properly.

Regards.

[1] https://github.com/openstack/neutron/blob/c33805919eb89f5ea6d8b54d4142d4c829d9a2a0/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L355
[2] https://github.com/openstack/neutron/blob/c33805919eb89f5ea6d8b54d4142d4c829d9a2a0/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L349
[3] https://paste.opendev.org/show/bqzDPR5TukLq9d1GIcnz/
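Sketch of the proposed hardening (not the actual upstream patch; the class and the "_patch_get_agents" helper below are simplified placeholders, only "_setup_hash_ring" and "post_fork_initialize" come from the code referenced above):

```
from oslo_log import log as logging

LOG = logging.getLogger(__name__)


class OVNMechanismDriverSketch(object):
    """Simplified stand-in for the real OVN mechanism driver."""

    def _setup_hash_ring(self):
        """Placeholder for the real hash-ring setup, which may raise."""

    def _patch_get_agents(self):
        """Hypothetical placeholder for the method patching done in [1]."""

    def post_fork_initialize(self, resource, event, trigger, payload=None):
        try:
            self._setup_hash_ring()
        except Exception:
            # Make the failure visible instead of silently aborting the
            # rest of the per-worker initialization.
            LOG.exception('Hash ring setup failed in this API worker')
        self._patch_get_agents()
        LOG.info('post_fork_initialize finished for this API worker')
```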
Raised priority to high since neutron CI testing is blocked by this issue.
Verified on:
RHOS-17.1-RHEL-9-20240415.n.1
openstack-neutron-18.6.1-17.1.20231025110810.el9ost.noarch
python3-neutron-lib-2.10.3-17.1.20230706110935.el9ost.noarch

After executing the test that reboots all three controllers 100 times, the output of `openstack network agent list` remained consistent. The following script was used:

```
$ cat run.sh
#!/bin/bash
set -ex

export OS_CLOUD=overcloud

for i in {0..99}; do
    tox -e faults -- tobiko/tests/faults/ha/test_cloud_recovery.py::DisruptTripleoNodesTest::test_z99_hard_reboot_controllers_recovery | tee report/test_z99_hard_reboot_controllers_recovery-$i.log
    # Re-check the agent list several times after each reboot test.
    for j in {0..9}; do
        AGENTS=`openstack network agent list -f value`
        # Count rows where both the Alive and State columns are True.
        NAGENTS=`echo $AGENTS | grep -o " True True " | wc -l`
        if [ "$NAGENTS" != "7" ]; then
            echo "Wrong number of agents! $NAGENTS"
            echo "$AGENTS"
            exit 1
        fi
    done
done
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 17.1.3 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:2741