Description of problem:

There are some failures that occurred 3/3 times with RHOS-16.2-RHEL-8-20210804.n.0. They didn't occur with previous puddles, as far as I can see in the job history. These are basically the steps:

1. Migrate to OVN.
2. Run the neutron tests -> green, no failures. These are just API and scenario tests, with no disruptive actions on the nodes.
3. Run the neutron_single_threaded tests -> all tests fail. The first test in this suite reboots a controller node, and this seems to trigger the rest of the failures.

The fact that it didn't happen with previous puddles and happens 3/3 with the latest one is concerning (we don't have results for the new puddle RHOS-16.2-RHEL-8-20210811.n.1 yet); it looks like a regression. On the other hand, the reproduction procedure (rebooting a controller) and the errors seen (HTTP 500 server errors) look like the known issue BZ1986341. The difference from BZ1986341 seems to be that, after the ovs2ovn migration, rebooting a node always reproduces this issue.

Results with RHOS-16.2-RHEL-8-20210722.n.0 (no failures):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration_nodvr-to-dvr/37/infrared/.workspaces/workspace_2021-08-01_14-22-14/tempest_results/tempest-results-neutron_single_threaded.1.html

Results with RHOS-16.2-RHEL-8-20210804.n.0:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration_nodvr-to-dvr/41/infrared/.workspaces/workspace_2021-08-11_16-34-48/tempest_results/tempest-results-neutron_single_threaded.1.html

The first failure after the controller reboot is that a VM cannot be deleted because the deletion of its port timed out.
When you search for the deletion of its port (12c7957e-5cbd-40f0-bd0d-47ddf9ee51f6), there seem to be some issues with the connection to the OVN DBs:

[root@controller-1 ~]# zgrep -c "OVSDB transaction returned TRY_AGAIN" /var/log/containers/neutron/server.log.2.gz
140049

This issue may be related to BZ1986341. Terry also mentioned BZ1980269. The following happens during the tests too, but apparently later:

[root@controller-2 ~]# zgrep -c "KeyError: UUID" /var/log/containers/neutron/server.log.2.gz
15

The neutron_single_threaded tests (the suite that includes the overcloud reboot) ran in other 16.2 CI jobs (non-migration) on RHOS-16.2-RHEL-8-20210804.n.0 and passed. Hence, this issue only happens after migration. I only see the failures in the ovs2ovn nodvr-to-dvr and nodvr-to-nodvr jobs. The dvr-to-dvr jobs (normal and composable) didn't seem to reproduce this issue, but that may be a coincidence. The job I ran on my server, seal05, is nodvr to dvr and it failed, as mentioned above:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration_nodvr-to-dvr/41

Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20210804.n.0

How reproducible:
3/3 times with RHOS-16.2-RHEL-8-20210804.n.0

Steps to Reproduce:
1. Run an ovs2ovn nodvr-to-dvr or nodvr-to-nodvr job.
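The log-counting approach above can be sketched as a small script. This is only an illustration of the pattern-counting technique, not a tool from the report: the sample log content below is fabricated, and on a real controller you would point `zgrep` at /var/log/containers/neutron/server.log*.gz as shown in the commands above.

```shell
#!/bin/sh
# Sketch: count the two symptom patterns seen in neutron server logs
# after the controller reboot. The sample log lines are made up here
# so the script is self-contained; replace with real rotated logs.
log=$(mktemp)
cat > "$log" <<'EOF'
ERROR ovsdbapp OVSDB transaction returned TRY_AGAIN, retrying
ERROR ovsdbapp OVSDB transaction returned TRY_AGAIN, retrying
ERROR neutron KeyError: UUID('12c7957e-5cbd-40f0-bd0d-47ddf9ee51f6')
EOF

# grep -c counts matching lines; use zgrep instead for .gz logs.
try_again=$(grep -c "OVSDB transaction returned TRY_AGAIN" "$log")
key_error=$(grep -c "KeyError: UUID" "$log")
echo "TRY_AGAIN=$try_again KeyError=$key_error"

rm -f "$log"
```

A very high TRY_AGAIN count (140049 above) suggests the Neutron workers kept retrying OVSDB transactions against an unreachable or re-electing OVN DB after the reboot.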
*** This bug has been marked as a duplicate of bug 1986341 ***