Bug 2068065

Summary: tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_live_migration failing
Product: Red Hat OpenStack Reporter: Eran Kuris <ekuris>
Component: python-networking-ovnAssignee: Slawek Kaplonski <skaplons>
Status: CLOSED ERRATA QA Contact: Bharath M V <bmv>
Severity: high Docs Contact:
Priority: high    
Version: 16.1 (Train)CC: apevec, dceara, egarciar, i.maximets, lhh, majopela, mtomaska, scohen
Target Milestone: z9Keywords: AutomationBlocker, Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-networking-ovn-7.3.1-1.20220502142031.4e24f4c.el8ost Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2078544 (view as bug list) Environment:
Last Closed: 2022-12-07 20:26:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2078544    

Description Eran Kuris 2022-03-24 11:38:38 UTC
Description of problem:
After live migration, the VM is not reachable via external connectivity.



Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_network_advanced_server_ops.py", line 263, in test_server_connectivity_live_migration
    server, keypair, floating_ip)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_network_advanced_server_ops.py", line 103, in _wait_server_status_and_check_network_connectivity
    self._check_network_connectivity(server, keypair, floating_ip)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_network_advanced_server_ops.py", line 96, in _check_network_connectivity
    server)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 723, in check_vm_connectivity
    msg=msg)
  File "/usr/lib/python3.6/site-packages/unittest2/case.py", line 705, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true : Public network connectivity check failed
Timed out waiting for 10.0.0.245 to become reachable

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20220315.n.1

How reproducible:
15% 
Steps to Reproduce:
1. run ci job:https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv6-geneve-dvr-rally/20/testReport/tempest.scenario.test_network_advanced_server_ops/TestNetworkAdvancedServerOps/Finally_Steps___test_server_connectivity_live_migration_compute_id_03fd1562_faad_11e7_9ea0_fa163e65f5ce_network_slow_/
2.
3.

Actual results:
VM migration is failing 

Expected results:
pass 

Additional info:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv6-geneve-dvr-rally/20/compute-1/var/log/containers/nova/nova-compute.log.gz

Comment 1 Slawek Kaplonski 2022-03-29 10:08:55 UTC
I took a look at this issue today with Rodolfo and we think that the issue may be in the core OVN. Here's what we found:

- Issue happened with instance c7d85029-d347-4661-8dc5-ca2cb451490a which has port "0dea4060-6ef4-454c-ba1f-7e72caf4ff50" (neutron port uuid).
- instance was first started on compute-1 and that was done without problems. Here's log from the compute-1:

ovs-vswitchd.log
2022-03-16T17:24:04.864Z|00145|bridge|INFO|bridge br-int: added interface tap0dea4060-6e on port 41

ovn-controller.log
2022-03-16T17:24:04.867Z|00101|binding|INFO|Claiming lport 0dea4060-6ef4-454c-ba1f-7e72caf4ff50 for this chassis.
2022-03-16T17:24:04.867Z|00102|binding|INFO|0dea4060-6ef4-454c-ba1f-7e72caf4ff50: Claiming fa:16:3e:79:44:48 10.100.0.3

- next there was live migration of that vm to the compute-0
Port was added by nova to the br-int:
2022-03-16T17:25:28.021Z|00173|bridge|INFO|bridge br-int: added interface tap0dea4060-6e on port 48

but it was not claimed by that chassis:
2022-03-16T17:25:28.024Z|00264|binding|INFO|Not claiming lport 0dea4060-6ef4-454c-ba1f-7e72caf4ff50, chassis 536843b0-df62-441d-9ab4-e11743969074 requested-chassis compute-1.redhat.local

This seems to be fine because it wasn't yet released by the compute-1 chassis but that happened 2 seconds later:
2022-03-16T17:25:30.666Z|00252|binding|INFO|Releasing lport 0dea4060-6ef4-454c-ba1f-7e72caf4ff50 from this chassis.

And even then, ovn-controller on compute-0 couldn't claim lport:
2022-03-16T17:26:23.068Z|00270|binding|INFO|Not claiming lport 0dea4060-6ef4-454c-ba1f-7e72caf4ff50, chassis 536843b0-df62-441d-9ab4-e11743969074 requested-chassis compute-1.redhat.local

That's all regarding that port what we found in the ovn-controllers on both nodes.
In Neutron we can see LogicalSwitchPortDown event after port was deleted from compute-1:

2022-03-16 17:25:30.725 28 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: LogicalSwitchPortUpdateDownEvent(events=('update',), table='Logical_Switch_Port', conditions=(('up', '=', False),), old_conditions=(('up', '=', True),)) to row=Logical_Switch_Port(name=0dea4060-6ef4-454c-ba1f-7e72caf4ff50, up=[False], external_ids={'neutron:cidrs': '10.100.0.3/26', 'neutron:device_id': 'c7d85029-d347-4661-8dc5-ca2cb451490a', 'neutron:device_owner': 'compute:nova', 'neutron:network_name': 'neutron-df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'neutron:port_name': '', 'neutron:project_id': '428d480e704b4412aa47a08b0fc86265', 'neutron:revision_number': '8', 'neutron:security_group_ids': '4669e182-ec03-43ba-8eae-f61c47b1d788', 'neutron:port_fip': '10.0.0.245'}, ha_chassis_group=[], options={'mcast_flood_reports': 'true', 'requested-chassis': 'compute-1.redhat.local'}, parent_name=[], enabled=[True], type=, dhcpv6_options=[], addresses=['fa:16:3e:79:44:48 10.100.0.3'], port_security=['fa:16:3e:79:44:48 10.100.0.3'], tag_request=[], tag=[], dhcpv4_options=[<ovs.db.idl.Row object at 0x7f627c7925c0>], dynamic_addresses=[]) old=Logical_Switch_Port(up=[True]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44

Comment 4 Slawek Kaplonski 2022-03-29 12:24:18 UTC
After Dumitru's advice I took a look once again in the neutron-server logs and here's what I found:

When port was first created, neutron server set LSwitchPort like below:

2022-03-16 17:25:27.362 26 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): CheckRevisionNumberCommand(name=0dea4060-6ef4-454c-ba1f-7e72caf4ff50, resource={'id': '0dea4060-6ef4-454c-ba1f-7e72caf4ff50', 'name': '', 'network_id': 'df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'tenant_id': '428d480e704b4412aa47a08b0fc86265', 'mac_address': 'fa:16:3e:79:44:48', 'admin_state_up': True, 'status': 'ACTIVE'
, 'device_id': 'c7d85029-d347-4661-8dc5-ca2cb451490a', 'device_owner': 'compute:nova', 'fixed_ips': [{'subnet_id': '6d9c998d-7f88-4d9b-bad5-91d1fca80aaf', 'ip_address': '10.100.0.3'}], 'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'security_groups': ['4669e182-ec03-43ba-8eae-f61c47b1d788'], 'description': '', 'binding:vnic_type': 'normal', 'binding:profile': {'migrating_to': 'compute-0.redhat.local'}, 'binding:host_
id': 'compute-1.redhat.local', 'binding:vif_type': 'ovs', 'binding:vif_details': {'port_filter': True}, 'qos_policy_id': None, 'port_security_enabled': True, 'dns_name': 'tempest-testnetworkadvancedserverops-server-309627669', 'dns_assignment': [{'ip_address': '10.100.0.3', 'hostname': 'tempest-testnetworkadvancedserverops-server-309627669', 'fqdn': 'tempest-testnetworkadvancedserverops-server-309627669.openstackgate.loca
l.'}], 'resource_request': None, 'ip_allocation': 'immediate', 'tags': [], 'created_at': '2022-03-16T17:24:00Z', 'updated_at': '2022-03-16T17:25:27Z', 'revision_number': 8, 'project_id': '428d480e704b4412aa47a08b0fc86265', 'network': {'id': 'df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'name': 'tempest-network-smoke--387686606', 'tenant_id': '428d480e704b4412aa47a08b0fc86265', 'admin_state_up': True, 'mtu': 1442, 'status': 'ACT
IVE', 'subnets': ['6d9c998d-7f88-4d9b-bad5-91d1fca80aaf'], 'shared': False, 'availability_zone_hints': [], 'availability_zones': [], 'ipv4_address_scope': None, 'ipv6_address_scope': None, 'router:external': False, 'vlan_transparent': False, 'description': '', 'qos_policy_id': None, 'port_security_enabled': True, 'dns_domain': '', 'l2_adjacency': True, 'tags': [], 'created_at': '2022-03-16T17:23:44Z', 'updated_at': '2022-
03-16T17:23:50Z', 'revision_number': 2, 'project_id': '428d480e704b4412aa47a08b0fc86265', 'provider:network_type': 'geneve', 'provider:physical_network': None, 'provider:segmentation_id': 3}}, resource_type=ports, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2022-03-16 17:25:27.363 26 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): SetLSwitchPortCommand(lport=0dea4060-6ef4-454c-ba1f-7e72caf4ff50, columns={'external_ids': {'neutron:port_name': '', 'neutron:device_id': 'c7d85029-d347-4661-8dc5-ca2cb451490a', 'neutron:project_id': '428d480e704b4412aa47a08b0fc86265', 'neutron:cidrs': '10.100.0.3/26', 'neutron:device_owner': 'compute:nova', 'neutron:
network_name': 'neutron-df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'neutron:security_group_ids': '4669e182-ec03-43ba-8eae-f61c47b1d788', 'neutron:revision_number': '8', 'neutron:port_fip': '10.0.0.245'}, 'parent_name': [], 'tag': [], 'options': {'requested-chassis': 'compute-1.redhat.local', 'mcast_flood_reports': 'true'}, 'enabled': True, 'port_security': ['fa:16:3e:79:44:48 10.100.0.3'], 'dhcpv4_options': [UUID('9e6b3339-17
e8-441b-b472-114da2dc9765')], 'dhcpv6_options': [], 'type': '', 'addresses': ['fa:16:3e:79:44:48 10.100.0.3'], 'ha_chassis_group': []}, if_exists=False) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2022-03-16 17:25:27.365 26 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): PgAddPortCommand(port_group=neutron_pg_drop, lsp=['0dea4060-6ef4-454c-ba1f-7e72caf4ff50'], if_exists=False) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84



But later, after migration was done it was:
2022-03-16 17:25:33.281 28 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): CheckRevisionNumberCommand(name=0dea4060-6ef4-454c-ba1f-7e72caf4ff50, resource={'id': '0dea4060-6ef4-454c-ba1f-7e72caf4ff50', 'name': '', 'network_id': 'df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'tenant_id': '428d480e704b4412aa47a08b0fc86265', 'mac_address': 'fa:16:3e:79:44:48', 'admin_state_up': True, 'status': 'DOWN', 'device_id': 'c7d85029-d347-4661-8dc5-ca2cb451490a', 'device_owner': 'compute:nova', 'fixed_ips': [{'subnet_id': '6d9c998d-7f88-4d9b-bad5-91d1fca80aaf', 'ip_address': '10.100.0.3'}], 'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'security_groups': ['4669e182-ec03-43ba-8eae-f61c47b1d788'], 'description': '', 'binding:vnic_type': 'normal', 'binding:profile': {}, 'binding:host_id': 'compute-0.redhat.local', 'binding:vif_type': 'ovs', 'binding:vif_details': {'port_filter': True}, 'qos_policy_id': None, 'port_security_enabled': True, 'dns_name': 'tempest-testnetworkadvancedserverops-server-309627669', 'dns_assignment': [{'ip_address': '10.100.0.3', 'hostname': 'tempest-testnetworkadvancedserverops-server-309627669', 'fqdn': 'tempest-testnetworkadvancedserverops-server-309627669.openstackgate.local.'}], 'resource_request': None, 'ip_allocation': 'immediate', 'tags': [], 'created_at': '2022-03-16T17:24:00Z', 'updated_at': '2022-03-16T17:25:32Z', 'revision_number': 10, 'project_id': '428d480e704b4412aa47a08b0fc86265', 'network': {'id': 'df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'name': 'tempest-network-smoke--387686606', 'tenant_id': '428d480e704b4412aa47a08b0fc86265', 'admin_state_up': True, 'mtu': 1442, 'status': 'ACTIVE', 'subnets': ['6d9c998d-7f88-4d9b-bad5-91d1fca80aaf'], 'shared': False, 'availability_zone_hints': [], 'availability_zones': [], 'ipv4_address_scope': None, 'ipv6_address_scope': None, 'router:external': False, 'vlan_transparent': False, 'description': '', 'qos_policy_id': None, 'port_security_enabled': True, 'dns_domain': '', 'l2_adjacency': True, 'tags': [], 'created_at': '2022-03-16T17:23:44Z', 'updated_at': '2022-03-16T17:23:50Z', 'revision_number': 2, 'project_id': '428d480e704b4412aa47a08b0fc86265', 'provider:network_type': 'geneve', 'provider:physical_network': None, 'provider:segmentation_id': 3}}, resource_type=ports, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2022-03-16 17:25:33.282 28 ERROR ovsdbapp.backend.ovs_idl.transaction [req-bad09089-e54d-4b5d-9da8-6e34f036b72e 81b2cd2b1d4a4b6f96bb9a6c26f0d3f1 2a80b1acbc7440da80533d75337a7d69 - default default] Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 128, in run
    txn.results.put(txn.do_commit())
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 86, in do_commit
    command.run_idl(txn)
  File "/usr/lib/python3.6/site-packages/networking_ovn/ovsdb/commands.py", line 953, in run_idl
    resource_id=self.name, resource_type=self.resource_type)
networking_ovn.common.exceptions.RevisionConflict: OVN revision number for 0dea4060-6ef4-454c-ba1f-7e72caf4ff50 (type: ports) is equal or higher than the given resource. Skipping update

2022-03-16 17:25:33.283 28 INFO networking_ovn.ovsdb.impl_idl_ovn [req-bad09089-e54d-4b5d-9da8-6e34f036b72e 81b2cd2b1d4a4b6f96bb9a6c26f0d3f1 2a80b1acbc7440da80533d75337a7d69 - default default] Transaction aborted. Reason: OVN revision number for 0dea4060-6ef4-454c-ba1f-7e72caf4ff50 (type: ports) is equal or higher than the given resource. Skipping update

So I guess that because that revision number mismatch, requested-chassis for that LSwitchPort wasn't updated thus port wasn't claimed on the compute-0. Now we need to find out why the revision number didn't match.

Comment 15 errata-xmlrpc 2022-12-07 20:26:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795