Description of problem:

There are some failures that occurred 3/3 times with RHOS-16.2-RHEL-8-20210804.n.0. They didn't occur with previous puddles, as far as I can see in the job history. These are basically the steps:

1. Migrate to OVN.
2. Run the neutron tests -> green, no failures. These are just API and scenario tests, with no disruptive actions on the nodes.
3. Run the neutron_single_threaded tests -> all tests fail. The first test in this suite reboots a controller node, and this seems to trigger the rest of the failures.

The fact that it didn't happen with previous puddles and happens 3/3 with the latest one is concerning (we don't have results for the new puddle RHOS-16.2-RHEL-8-20210811.n.1 yet); it looks like a regression. On the other hand, the reproduction procedure (rebooting a controller) and the errors seen (HTTP 500 server errors) look like the known issue BZ1986341. The difference from BZ1986341 seems to be that, after the ovs2ovn migration, rebooting a node always reproduces this issue.

Results with RHOS-16.2-RHEL-8-20210722.n.0 (no failures):
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration_nodvr-to-dvr/37/infrared/.workspaces/workspace_2021-08-01_14-22-14/tempest_results/tempest-results-neutron_single_threaded.1.html

Results with RHOS-16.2-RHEL-8-20210804.n.0:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration_nodvr-to-dvr/41/infrared/.workspaces/workspace_2021-08-11_16-34-48/tempest_results/tempest-results-neutron_single_threaded.1.html

The first failure after the controller reboot is that a VM cannot be deleted because the deletion of its port timed out.
When you search for the deletion of its port (12c7957e-5cbd-40f0-bd0d-47ddf9ee51f6), there seem to be some issues with the connection to the OVN DBs:

[root@controller-1 ~]# zgrep -c "OVSDB transaction returned TRY_AGAIN" /var/log/containers/neutron/server.log.2.gz
140049

This issue may be related to BZ1986341. Terry also mentioned BZ1980269. The following happens during the tests too, but apparently later:

[root@controller-2 ~]# zgrep -c "KeyError: UUID" /var/log/containers/neutron/server.log.2.gz
15

The neutron_single_threaded tests (the suite that includes the overcloud reboot) ran in other 16.2 CI jobs (non-migration) on RHOS-16.2-RHEL-8-20210804.n.0 and passed. Hence, this issue only happens after migration. I only see the failures in the ovs2ovn nodvr-to-dvr and nodvr-to-nodvr jobs. The dvr-to-dvr jobs (normal and composable) didn't seem to reproduce this issue, but that may be a coincidence. The job I ran on my server, seal05, is nodvr to dvr and it failed, as mentioned above:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration_nodvr-to-dvr/41

Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20210804.n.0

How reproducible:
3/3 times with RHOS-16.2-RHEL-8-20210804.n.0

Steps to Reproduce:
1. Run an ovs2ovn nodvr-to-dvr or nodvr-to-nodvr job.
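The log-counting approach above can be sketched as a small script. This is only an illustration of the pattern-counting technique, not a tool from the report: the sample log content below is fabricated, and on a real controller you would point `zgrep` at /var/log/containers/neutron/server.log*.gz as shown in the commands above.

```shell
#!/bin/sh
# Sketch: count the two symptom patterns seen in neutron server logs
# after the controller reboot. The sample log lines are made up here
# so the script is self-contained; replace with real rotated logs.
log=$(mktemp)
cat > "$log" <<'EOF'
ERROR ovsdbapp OVSDB transaction returned TRY_AGAIN, retrying
ERROR ovsdbapp OVSDB transaction returned TRY_AGAIN, retrying
ERROR neutron KeyError: UUID('12c7957e-5cbd-40f0-bd0d-47ddf9ee51f6')
EOF

# grep -c counts matching lines; use zgrep instead for .gz logs.
try_again=$(grep -c "OVSDB transaction returned TRY_AGAIN" "$log")
key_error=$(grep -c "KeyError: UUID" "$log")
echo "TRY_AGAIN=$try_again KeyError=$key_error"

rm -f "$log"
```

A very high TRY_AGAIN count (140049 above) suggests the Neutron workers kept retrying OVSDB transactions against an unreachable or re-electing OVN DB after the reboot.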
*** This bug has been marked as a duplicate of bug 1986341 ***