Bug 2081631 - VM creation times out due to inconsistent port
Summary: VM creation times out due to inconsistent port
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: z3
: 16.2 (Train on RHEL 8.4)
Assignee: Terry Wilson
QA Contact: Fiorella Yanac
URL:
Whiteboard:
Depends On: 1974898 2189267 2196286
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-05-04 09:10 UTC by Eduardo Olivares
Modified: 2023-05-08 15:54 UTC (History)
15 users (show)

Fixed In Version: python-networking-ovn-7.4.2-2.20220409154849.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-22 16:07:20 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1964940 0 None None None 2022-05-26 07:29:54 UTC
OpenStack gerrit 843391 0 None MERGED Revert "Use Port_Binding up column to set Neutron port status" 2022-06-01 10:56:09 UTC
Red Hat Issue Tracker OSP-15034 0 None None None 2022-05-04 09:25:39 UTC
Red Hat Product Errata RHBA-2022:4793 0 None None None 2022-06-22 16:07:54 UTC

Description Eduardo Olivares 2022-05-04 09:10:08 UTC
Description of problem:
Issue reproduced with two tempest tests executed on the same job:
test_attach_qos_port_to_vm_with_another_port [1]
test_snat_external_ip [2]

In both cases, the test fails because a created VM doesn't reach the status ACTIVE after 5 minutes:
Details: (QosTestOvn:test_attach_qos_port_to_vm_with_another_port) Server c0b49322-8682-4573-9f7d-af2330fe7c01 failed to reach ACTIVE status and task state "None" within the required time (300 s). Current status: BUILD. Current task state: spawning.


Relevant info from nova-compute logs at [3]:
A) Server creation starts at 21:27:14
2022-05-03 21:27:14.824 7 INFO nova.virt.libvirt.driver [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] Creating image
2022-05-03 21:27:16.005 7 WARNING nova.compute.manager [req-ac7cfe69-f1af-4d25-9f43-c1eb3aaf15c6 - - - - -] While synchronizing instance power states, found 1 instances in the database and 0 instances on the hypervisor.
2022-05-03 21:27:17.754 7 INFO os_vif [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:d0:cd:38,bridge_name='br-int',has_traffic_filtering=True,id=fbd5a5ac-ac23-41a5-95fd-c2432de210c9,network=Network(411ed6bb-78ef-4177-9f6f-da0fbb5c85b0),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapfbd5a5ac-ac')
2022-05-03 21:27:18.256 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] VM Started (Lifecycle Event)
2022-05-03 21:27:18.302 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] VM Paused (Lifecycle Event)
2022-05-03 21:27:18.386 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] During sync_power_state the instance has a pending task (spawning). Skip.

B) Server creation times out 5 minutes later
2022-05-03 21:32:18.260 7 WARNING nova.virt.libvirt.driver [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] Timeout waiting for [('network-vif-plugged', 'fbd5a5ac-ac23-41a5-95fd-c2432de210c9')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds



Relevant info from neutron server logs at [4]:
A) port creation at 21:27:26
2022-05-03 21:27:16.316 17 DEBUG neutron.notifiers.nova [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Ignoring state change previous_port_status: DOWN current_port_status: DOWN port_id fbd5a5ac-ac23-41a5-95fd-c2432de210c9 record_port_status_changed /usr/lib/python3.6/site-packages/neutron/notifiers/nova.py:229
2022-05-03 21:27:16.323 17 DEBUG neutron_lib.callbacks.manager [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Notify callbacks ['neutron.services.qos.qos_plugin.QoSPlugin._validate_update_port_callback-317832'] for port, precommit_update _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
2022-05-03 21:27:16.332 17 DEBUG neutron.db.provisioning_blocks [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Transition to ACTIVE for port object fbd5a5ac-ac23-41a5-95fd-c2432de210c9 will not be triggered until provisioned by entity L2. add_provisioning_component /usr/lib/python3.6/site-packages/neutron/db/provisioning_blocks.py:75
...
2022-05-03 21:27:16.780 17 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): PgAddPortCommand(port_group=neutron_pg_drop, lsp=['fbd5a5ac-ac23-41a5-95fd-c2432de210c9'], if_exists=False) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:89
...
2022-05-03 21:27:16.881 17 DEBUG neutron.plugins.ml2.managers [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Bound port: fbd5a5ac-ac23-41a5-95fd-c2432de210c9, host: computesriov-1.localdomain, vif_type: ovs, vif_details: {"port_filter": true}, binding_levels: [{'bound_driver': 'ovn', 'bound_segment': {'id': '99a3800a-4aea-4760-ba2b-0c4794c8def7', 'network_type': 'vlan', 'physical_network': 'datacentre', 'segmentation_id': 323, 'network_id': '411ed6bb-78ef-4177-9f6f-da0fbb5c85b0'}}] _bind_port_level /usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py:937


B) Maintenance task finds port inconsistency ~3 min later
2022-05-03 21:30:24.929 22 DEBUG networking_ovn.common.maintenance [req-1db0a71d-20d3-48fa-a6c5-7cd216ef05e5 - - - - -] Maintenance task: Fixing resource fbd5a5ac-ac23-41a5-95fd-c2432de210c9 (type: ports) at create/update check_for_inconsistencies /usr/lib/python3.6/site-packages/networking_ovn/common/maintenance.py:354
...
2022-05-03 21:30:25.069 22 INFO networking_ovn.db.revision [req-1db0a71d-20d3-48fa-a6c5-7cd216ef05e5 - - - - -] Successfully bumped revision number for resource fbd5a5ac-ac23-41a5-95fd-c2432de210c9 (type: ports) to 3



[1] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/testReport/junit/neutron_plugin.tests.scenario.test_qos/QosTestOvn/test_attach_qos_port_to_vm_with_another_port_id_8dc4f365_d226_4bf5_a2cc_60f4655de504_/
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/testReport/neutron_tempest_plugin.scenario.test_floatingip/DefaultSnatToExternal/test_snat_external_ip_id_3d73ea1a_27c6_45a9_b0f8_04a283d9d764_/
[3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/computesriov-1/var/log/containers/nova/nova-compute.log.1.gz
[4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/controller-1/var/log/containers/neutron/server.log.2.gz



Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20220427.n.3 / python3-networking-ovn-7.4.2-2.20220409154847

How reproducible:
2 failures out of 1032 tests executed


Steps to Reproduce:
we don't have a reproducer yet, working on it

Comment 5 spower 2022-05-16 16:52:55 UTC
This bug needs to be triaged and it needs its acks to be included into the 16.2 zstream

Comment 6 Yatin Karel 2022-05-25 14:49:22 UTC
So we further debugged this and below are the findings:-

When the issue reproduces for a server/port:-
- PortBindingUpdateUpEvent is received and put into queue, at this point self.notifications Queue size is large, seen 250+
- The Queue is filled with PortBindingChassisEvent for chassisredirect port in just 2-3 seconds
- All the PortBindingChassisEvent is for same port just switching chassis[1], this is just snippet there were total 274 enteries for this particular case, for some cases seen 350+ too.
- Same can be seen in ovn-controller log[2], added just snippet and there were in total 134 enteries on one controller and 135 on other.

And this resulted into a known old unfixed OVN bug https://bugzilla.redhat.com/show_bug.cgi?id=1974898. So until that is fixed seems we need to revert https://review.opendev.org/c/openstack/networking-ovn/+/823279 which likely causing the issue more often as that switched monitoring to SB DB instead of NB DB, and NB and SB queues are different and NB events will not be impacted with large SB event queue.



[1] 2022-05-25 09:11:04.511 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.513 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[], up=[False]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.554 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.557 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.560 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.563 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.567 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.569 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44

[2] Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from 2dd2e070-e65d-47b0-a458-49fb7eb3e0eb to 30a04401-a973-4ddd-a087-fd45b12116b7 on one controller
Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from 30a04401-a973-4ddd-a087-fd45b12116b7 to 2dd2e070-e65d-47b0-a458-49fb7eb3e0eb on other controller

Comment 18 errata-xmlrpc 2022-06-22 16:07:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.3 (Train)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4793


Note You need to log in before you can comment on or make changes to this bug.