Bug 2068065
| Summary: | tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_live_migration failing | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> | |
| Component: | python-networking-ovn | Assignee: | Slawek Kaplonski <skaplons> | |
| Status: | CLOSED ERRATA | QA Contact: | Bharath M V <bmv> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 16.1 (Train) | CC: | apevec, dceara, egarciar, i.maximets, lhh, majopela, mtomaska, scohen | |
| Target Milestone: | z9 | Keywords: | AutomationBlocker, Triaged | |
| Target Release: | 16.1 (Train on RHEL 8.2) | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | python-networking-ovn-7.3.1-1.20220502142031.4e24f4c.el8ost | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2078544 (view as bug list) | Environment: | ||
| Last Closed: | 2022-12-07 20:26:20 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2078544 | |||
I took a look at this issue today with Rodolfo and we think that the issue may be in core OVN. Here's what we found:
- The issue happened with instance c7d85029-d347-4661-8dc5-ca2cb451490a, which has port "0dea4060-6ef4-454c-ba1f-7e72caf4ff50" (Neutron port UUID).
- The instance was first started on compute-1, and that went without problems. Here are the logs from compute-1:
ovs-vswitchd.log
2022-03-16T17:24:04.864Z|00145|bridge|INFO|bridge br-int: added interface tap0dea4060-6e on port 41
ovn-controller.log
2022-03-16T17:24:04.867Z|00101|binding|INFO|Claiming lport 0dea4060-6ef4-454c-ba1f-7e72caf4ff50 for this chassis.
2022-03-16T17:24:04.867Z|00102|binding|INFO|0dea4060-6ef4-454c-ba1f-7e72caf4ff50: Claiming fa:16:3e:79:44:48 10.100.0.3
- Next, there was a live migration of that VM to compute-0.
The port was added by Nova to br-int:
2022-03-16T17:25:28.021Z|00173|bridge|INFO|bridge br-int: added interface tap0dea4060-6e on port 48
but it was not claimed by that chassis:
2022-03-16T17:25:28.024Z|00264|binding|INFO|Not claiming lport 0dea4060-6ef4-454c-ba1f-7e72caf4ff50, chassis 536843b0-df62-441d-9ab4-e11743969074 requested-chassis compute-1.redhat.local
This seems fine, because the port hadn't yet been released by the compute-1 chassis; that release happened 2 seconds later:
2022-03-16T17:25:30.666Z|00252|binding|INFO|Releasing lport 0dea4060-6ef4-454c-ba1f-7e72caf4ff50 from this chassis.
Even after that, the ovn-controller on compute-0 still couldn't claim the lport:
2022-03-16T17:26:23.068Z|00270|binding|INFO|Not claiming lport 0dea4060-6ef4-454c-ba1f-7e72caf4ff50, chassis 536843b0-df62-441d-9ab4-e11743969074 requested-chassis compute-1.redhat.local
That's all we found regarding that port in the ovn-controller logs on both nodes.
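For reference, the requested-chassis value that ovn-controller compares against lives in the options column of the Logical_Switch_Port row in the OVN Northbound DB. Below is a minimal sketch, using ovsdbapp (the same library networking-ovn uses), of how one could inspect that option and of the claim decision the logs above describe. The NB endpoint address and the would_claim() helper are assumptions for illustration; the real check is implemented inside ovn-controller itself.

```python
# Illustration only: inspect the Logical_Switch_Port with ovsdbapp (the
# library networking-ovn itself uses).  The NB endpoint address and the
# would_claim() helper are assumptions, not ovn-controller's actual code.
from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.schema.ovn_northbound import impl_idl

NB_ENDPOINT = 'tcp:127.0.0.1:6641'                 # assumed NB DB address
LSP_NAME = '0dea4060-6ef4-454c-ba1f-7e72caf4ff50'  # port from the logs

idl = connection.OvsdbIdl.from_server(NB_ENDPOINT, 'OVN_Northbound')
api = impl_idl.OvnNbApiIdlImpl(connection.Connection(idl, timeout=60))

options = api.db_get('Logical_Switch_Port', LSP_NAME,
                     'options').execute(check_error=True)
up = api.lsp_get_up(LSP_NAME).execute(check_error=True)
print('requested-chassis:', options.get('requested-chassis'), 'up:', up)


def would_claim(lsp_options, local_hostname):
    """Restate the decision visible in the logs: a chassis only claims the
    lport if requested-chassis is unset or matches its own hostname."""
    requested = lsp_options.get('requested-chassis')
    return not requested or requested == local_hostname


# With requested-chassis still set to 'compute-1.redhat.local', compute-0
# refuses the port, which matches the "Not claiming lport" lines above.
print(would_claim(options, 'compute-0.redhat.local'))   # -> False
```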
In Neutron we can see the LogicalSwitchPortUpdateDownEvent after the port was deleted from compute-1:
2022-03-16 17:25:30.725 28 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: LogicalSwitchPortUpdateDownEvent(events=('update',), table='Logical_Switch_Port', conditions=(('up', '=', False),), old_conditions=(('up', '=', True),)) to row=Logical_Switch_Port(name=0dea4060-6ef4-454c-ba1f-7e72caf4ff50, up=[False], external_ids={'neutron:cidrs': '10.100.0.3/26', 'neutron:device_id': 'c7d85029-d347-4661-8dc5-ca2cb451490a', 'neutron:device_owner': 'compute:nova', 'neutron:network_name': 'neutron-df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'neutron:port_name': '', 'neutron:project_id': '428d480e704b4412aa47a08b0fc86265', 'neutron:revision_number': '8', 'neutron:security_group_ids': '4669e182-ec03-43ba-8eae-f61c47b1d788', 'neutron:port_fip': '10.0.0.245'}, ha_chassis_group=[], options={'mcast_flood_reports': 'true', 'requested-chassis': 'compute-1.redhat.local'}, parent_name=[], enabled=[True], type=, dhcpv6_options=[], addresses=['fa:16:3e:79:44:48 10.100.0.3'], port_security=['fa:16:3e:79:44:48 10.100.0.3'], tag_request=[], tag=[], dhcpv4_options=[<ovs.db.idl.Row object at 0x7f627c7925c0>], dynamic_addresses=[]) old=Logical_Switch_Port(up=[True]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
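The event matched above is an ovsdbapp RowEvent watching the up column of Logical_Switch_Port. Here is a minimal sketch of how such an event class is typically declared, assuming ovsdbapp's row-event API; only the match conditions come from the log line, the handler body is an illustrative placeholder.

```python
# Sketch of a row event like the one matched above.  The conditions mirror
# the log line; the run() body is a placeholder, not the actual
# networking-ovn handler.
from ovsdbapp.backend.ovs_idl import event as row_event


class LogicalSwitchPortUpdateDownEvent(row_event.RowEvent):
    """Fires when a Logical_Switch_Port goes from up=True to up=False."""

    def __init__(self):
        super(LogicalSwitchPortUpdateDownEvent, self).__init__(
            (self.ROW_UPDATE,),                   # events=('update',)
            'Logical_Switch_Port',                # table
            (('up', '=', False),),                # new value must be down...
            old_conditions=(('up', '=', True),))  # ...and it was up before
        self.event_name = 'LogicalSwitchPortUpdateDownEvent'

    def run(self, event, row, old):
        # The real driver marks the Neutron port DOWN here; we only log it.
        print('Logical switch port %s went down' % row.name)
```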
Following Dumitru's advice, I took another look at the neutron-server logs, and here's what I found:
When the port was first created, the Neutron server set the LSwitchPort as shown below:
2022-03-16 17:25:27.362 26 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): CheckRevisionNumberCommand(name=0dea4060-6ef4-454c-ba1f-7e72caf4ff50, resource={'id': '0dea4060-6ef4-454c-ba1f-7e72caf4ff50', 'name': '', 'network_id': 'df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'tenant_id': '428d480e704b4412aa47a08b0fc86265', 'mac_address': 'fa:16:3e:79:44:48', 'admin_state_up': True, 'status': 'ACTIVE'
, 'device_id': 'c7d85029-d347-4661-8dc5-ca2cb451490a', 'device_owner': 'compute:nova', 'fixed_ips': [{'subnet_id': '6d9c998d-7f88-4d9b-bad5-91d1fca80aaf', 'ip_address': '10.100.0.3'}], 'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'security_groups': ['4669e182-ec03-43ba-8eae-f61c47b1d788'], 'description': '', 'binding:vnic_type': 'normal', 'binding:profile': {'migrating_to': 'compute-0.redhat.local'}, 'binding:host_
id': 'compute-1.redhat.local', 'binding:vif_type': 'ovs', 'binding:vif_details': {'port_filter': True}, 'qos_policy_id': None, 'port_security_enabled': True, 'dns_name': 'tempest-testnetworkadvancedserverops-server-309627669', 'dns_assignment': [{'ip_address': '10.100.0.3', 'hostname': 'tempest-testnetworkadvancedserverops-server-309627669', 'fqdn': 'tempest-testnetworkadvancedserverops-server-309627669.openstackgate.loca
l.'}], 'resource_request': None, 'ip_allocation': 'immediate', 'tags': [], 'created_at': '2022-03-16T17:24:00Z', 'updated_at': '2022-03-16T17:25:27Z', 'revision_number': 8, 'project_id': '428d480e704b4412aa47a08b0fc86265', 'network': {'id': 'df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'name': 'tempest-network-smoke--387686606', 'tenant_id': '428d480e704b4412aa47a08b0fc86265', 'admin_state_up': True, 'mtu': 1442, 'status': 'ACT
IVE', 'subnets': ['6d9c998d-7f88-4d9b-bad5-91d1fca80aaf'], 'shared': False, 'availability_zone_hints': [], 'availability_zones': [], 'ipv4_address_scope': None, 'ipv6_address_scope': None, 'router:external': False, 'vlan_transparent': False, 'description': '', 'qos_policy_id': None, 'port_security_enabled': True, 'dns_domain': '', 'l2_adjacency': True, 'tags': [], 'created_at': '2022-03-16T17:23:44Z', 'updated_at': '2022-
03-16T17:23:50Z', 'revision_number': 2, 'project_id': '428d480e704b4412aa47a08b0fc86265', 'provider:network_type': 'geneve', 'provider:physical_network': None, 'provider:segmentation_id': 3}}, resource_type=ports, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2022-03-16 17:25:27.363 26 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): SetLSwitchPortCommand(lport=0dea4060-6ef4-454c-ba1f-7e72caf4ff50, columns={'external_ids': {'neutron:port_name': '', 'neutron:device_id': 'c7d85029-d347-4661-8dc5-ca2cb451490a', 'neutron:project_id': '428d480e704b4412aa47a08b0fc86265', 'neutron:cidrs': '10.100.0.3/26', 'neutron:device_owner': 'compute:nova', 'neutron:
network_name': 'neutron-df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'neutron:security_group_ids': '4669e182-ec03-43ba-8eae-f61c47b1d788', 'neutron:revision_number': '8', 'neutron:port_fip': '10.0.0.245'}, 'parent_name': [], 'tag': [], 'options': {'requested-chassis': 'compute-1.redhat.local', 'mcast_flood_reports': 'true'}, 'enabled': True, 'port_security': ['fa:16:3e:79:44:48 10.100.0.3'], 'dhcpv4_options': [UUID('9e6b3339-17
e8-441b-b472-114da2dc9765')], 'dhcpv6_options': [], 'type': '', 'addresses': ['fa:16:3e:79:44:48 10.100.0.3'], 'ha_chassis_group': []}, if_exists=False) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2022-03-16 17:25:27.365 26 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): PgAddPortCommand(port_group=neutron_pg_drop, lsp=['0dea4060-6ef4-454c-ba1f-7e72caf4ff50'], if_exists=False) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
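The commands above run as a single ovsdbapp transaction, so either all of them apply or none do. A minimal sketch of that pattern follows; the endpoint is an assumption and only the requested-chassis and mcast_flood_reports values are copied from the log.

```python
# Sketch of the multi-command ovsdbapp transaction pattern shown above;
# the endpoint is an assumption, the option values come from the log.
from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.schema.ovn_northbound import impl_idl

idl = connection.OvsdbIdl.from_server('tcp:127.0.0.1:6641', 'OVN_Northbound')
api = impl_idl.OvnNbApiIdlImpl(connection.Connection(idl, timeout=60))
LSP_NAME = '0dea4060-6ef4-454c-ba1f-7e72caf4ff50'

with api.transaction(check_error=True) as txn:
    # Update the options map on the Logical_Switch_Port.
    txn.add(api.db_set(
        'Logical_Switch_Port', LSP_NAME,
        ('options', {'requested-chassis': 'compute-1.redhat.local',
                     'mcast_flood_reports': 'true'})))
    # Add the port to the default drop port group (what PgAddPortCommand does).
    txn.add(api.pg_add_ports('neutron_pg_drop', LSP_NAME))
```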
But later, after the migration was done, it was:
2022-03-16 17:25:33.281 28 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): CheckRevisionNumberCommand(name=0dea4060-6ef4-454c-ba1f-7e72caf4ff50, resource={'id': '0dea4060-6ef4-454c-ba1f-7e72caf4ff50', 'name': '', 'network_id': 'df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'tenant_id': '428d480e704b4412aa47a08b0fc86265', 'mac_address': 'fa:16:3e:79:44:48', 'admin_state_up': True, 'status': 'DOWN', 'device_id': 'c7d85029-d347-4661-8dc5-ca2cb451490a', 'device_owner': 'compute:nova', 'fixed_ips': [{'subnet_id': '6d9c998d-7f88-4d9b-bad5-91d1fca80aaf', 'ip_address': '10.100.0.3'}], 'allowed_address_pairs': [], 'extra_dhcp_opts': [], 'security_groups': ['4669e182-ec03-43ba-8eae-f61c47b1d788'], 'description': '', 'binding:vnic_type': 'normal', 'binding:profile': {}, 'binding:host_id': 'compute-0.redhat.local', 'binding:vif_type': 'ovs', 'binding:vif_details': {'port_filter': True}, 'qos_policy_id': None, 'port_security_enabled': True, 'dns_name': 'tempest-testnetworkadvancedserverops-server-309627669', 'dns_assignment': [{'ip_address': '10.100.0.3', 'hostname': 'tempest-testnetworkadvancedserverops-server-309627669', 'fqdn': 'tempest-testnetworkadvancedserverops-server-309627669.openstackgate.local.'}], 'resource_request': None, 'ip_allocation': 'immediate', 'tags': [], 'created_at': '2022-03-16T17:24:00Z', 'updated_at': '2022-03-16T17:25:32Z', 'revision_number': 10, 'project_id': '428d480e704b4412aa47a08b0fc86265', 'network': {'id': 'df2a78e9-94cb-4d56-8874-1a997a2ab3d5', 'name': 'tempest-network-smoke--387686606', 'tenant_id': '428d480e704b4412aa47a08b0fc86265', 'admin_state_up': True, 'mtu': 1442, 'status': 'ACTIVE', 'subnets': ['6d9c998d-7f88-4d9b-bad5-91d1fca80aaf'], 'shared': False, 'availability_zone_hints': [], 'availability_zones': [], 'ipv4_address_scope': None, 'ipv6_address_scope': None, 'router:external': False, 'vlan_transparent': False, 'description': '', 'qos_policy_id': None, 'port_security_enabled': True, 'dns_domain': '', 'l2_adjacency': True, 'tags': [], 'created_at': '2022-03-16T17:23:44Z', 'updated_at': '2022-03-16T17:23:50Z', 'revision_number': 2, 'project_id': '428d480e704b4412aa47a08b0fc86265', 'provider:network_type': 'geneve', 'provider:physical_network': None, 'provider:segmentation_id': 3}}, resource_type=ports, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2022-03-16 17:25:33.282 28 ERROR ovsdbapp.backend.ovs_idl.transaction [req-bad09089-e54d-4b5d-9da8-6e34f036b72e 81b2cd2b1d4a4b6f96bb9a6c26f0d3f1 2a80b1acbc7440da80533d75337a7d69 - default default] Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 128, in run
txn.results.put(txn.do_commit())
File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 86, in do_commit
command.run_idl(txn)
File "/usr/lib/python3.6/site-packages/networking_ovn/ovsdb/commands.py", line 953, in run_idl
resource_id=self.name, resource_type=self.resource_type)
networking_ovn.common.exceptions.RevisionConflict: OVN revision number for 0dea4060-6ef4-454c-ba1f-7e72caf4ff50 (type: ports) is equal or higher than the given resource. Skipping update
2022-03-16 17:25:33.283 28 INFO networking_ovn.ovsdb.impl_idl_ovn [req-bad09089-e54d-4b5d-9da8-6e34f036b72e 81b2cd2b1d4a4b6f96bb9a6c26f0d3f1 2a80b1acbc7440da80533d75337a7d69 - default default] Transaction aborted. Reason: OVN revision number for 0dea4060-6ef4-454c-ba1f-7e72caf4ff50 (type: ports) is equal or higher than the given resource. Skipping update
So I guess that, because of that revision number mismatch, the requested-chassis for that LSwitchPort wasn't updated, and thus the port wasn't claimed on compute-0. Now we need to find out why the revision numbers didn't match.
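For context, the guard that aborted the transaction compares the revision_number carried by the Neutron resource against the neutron:revision_number already stored in the OVN row's external_ids, and skips the whole update when OVN is already at an equal or higher revision. A rough restatement of that check follows; the real logic lives in networking_ovn.ovsdb.commands.CheckRevisionNumberCommand, and the function below is illustrative only.

```python
# Illustrative restatement of the revision guard that aborted the update;
# this is not the actual CheckRevisionNumberCommand code.
class RevisionConflict(Exception):
    pass


def check_revision(neutron_resource, ovn_external_ids):
    """Skip the update if OVN already holds an equal or newer revision."""
    new_rev = int(neutron_resource['revision_number'])
    ovn_rev = int(ovn_external_ids.get('neutron:revision_number', -1))
    if new_rev <= ovn_rev:
        raise RevisionConflict(
            'OVN revision number for %s (type: ports) is equal or higher '
            'than the given resource. Skipping update'
            % neutron_resource['id'])


# In the failing run the post-migration update carried revision_number=10,
# so the OVN row apparently already held revision 10 or higher, and the
# whole transaction (including any requested-chassis change) was skipped.
```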
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795
Description of problem:
After live migration, the VM is not reachable via external connectivity.

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_network_advanced_server_ops.py", line 263, in test_server_connectivity_live_migration
    server, keypair, floating_ip)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_network_advanced_server_ops.py", line 103, in _wait_server_status_and_check_network_connectivity
    self._check_network_connectivity(server, keypair, floating_ip)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_network_advanced_server_ops.py", line 96, in _check_network_connectivity
    server)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 723, in check_vm_connectivity
    msg=msg)
  File "/usr/lib/python3.6/site-packages/unittest2/case.py", line 705, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true : Public network connectivity check failed
Timed out waiting for 10.0.0.245 to become reachable

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20220315.n.1

How reproducible:
15%

Steps to Reproduce:
1. Run the CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv6-geneve-dvr-rally/20/testReport/tempest.scenario.test_network_advanced_server_ops/TestNetworkAdvancedServerOps/Finally_Steps___test_server_connectivity_live_migration_compute_id_03fd1562_faad_11e7_9ea0_fa163e65f5ce_network_slow_/

Actual results:
VM migration is failing

Expected results:
Pass

Additional info:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv6-geneve-dvr-rally/20/compute-1/var/log/containers/nova/nova-compute.log.gz