Description of problem:

Issue reproduced with two tempest tests executed in the same job:
- test_attach_qos_port_to_vm_with_another_port [1]
- test_snat_external_ip [2]

In both cases the test fails because a created VM does not reach the ACTIVE status within 5 minutes:

Details: (QosTestOvn:test_attach_qos_port_to_vm_with_another_port) Server c0b49322-8682-4573-9f7d-af2330fe7c01 failed to reach ACTIVE status and task state "None" within the required time (300 s). Current status: BUILD. Current task state: spawning.

Relevant info from nova-compute logs at [3]:

A) Server creation starts at 21:27:14

2022-05-03 21:27:14.824 7 INFO nova.virt.libvirt.driver [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] Creating image
2022-05-03 21:27:16.005 7 WARNING nova.compute.manager [req-ac7cfe69-f1af-4d25-9f43-c1eb3aaf15c6 - - - - -] While synchronizing instance power states, found 1 instances in the database and 0 instances on the hypervisor.
2022-05-03 21:27:17.754 7 INFO os_vif [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:d0:cd:38,bridge_name='br-int',has_traffic_filtering=True,id=fbd5a5ac-ac23-41a5-95fd-c2432de210c9,network=Network(411ed6bb-78ef-4177-9f6f-da0fbb5c85b0),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapfbd5a5ac-ac')
2022-05-03 21:27:18.256 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] VM Started (Lifecycle Event)
2022-05-03 21:27:18.302 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] VM Paused (Lifecycle Event)
2022-05-03 21:27:18.386 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] During sync_power_state the instance has a pending task (spawning). Skip.
B) Server creation times out 5 minutes later

2022-05-03 21:32:18.260 7 WARNING nova.virt.libvirt.driver [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] Timeout waiting for [('network-vif-plugged', 'fbd5a5ac-ac23-41a5-95fd-c2432de210c9')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds

Relevant info from neutron server logs at [4]:

A) Port creation at 21:27:16

2022-05-03 21:27:16.316 17 DEBUG neutron.notifiers.nova [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Ignoring state change previous_port_status: DOWN current_port_status: DOWN port_id fbd5a5ac-ac23-41a5-95fd-c2432de210c9 record_port_status_changed /usr/lib/python3.6/site-packages/neutron/notifiers/nova.py:229
2022-05-03 21:27:16.323 17 DEBUG neutron_lib.callbacks.manager [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Notify callbacks ['neutron.services.qos.qos_plugin.QoSPlugin._validate_update_port_callback-317832'] for port, precommit_update _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
2022-05-03 21:27:16.332 17 DEBUG neutron.db.provisioning_blocks [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Transition to ACTIVE for port object fbd5a5ac-ac23-41a5-95fd-c2432de210c9 will not be triggered until provisioned by entity L2. add_provisioning_component /usr/lib/python3.6/site-packages/neutron/db/provisioning_blocks.py:75
...
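For context on where the 300-second wait in the traceback above comes from: Nova's VIF plugging wait is governed by two config options whose defaults match the observed behavior. A minimal nova.conf sketch on the compute node (default values shown; these are illustrative, not taken from the failing environment):

```ini
# nova.conf (values illustrative; defaults shown)
[DEFAULT]
# Seconds nova-compute waits for Neutron's network-vif-plugged
# external event before timing out while spawning.
vif_plugging_timeout = 300
# If true, a missing event fails the instance boot instead of
# proceeding without the notification.
vif_plugging_is_fatal = true
```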
2022-05-03 21:27:16.780 17 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): PgAddPortCommand(port_group=neutron_pg_drop, lsp=['fbd5a5ac-ac23-41a5-95fd-c2432de210c9'], if_exists=False) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:89
...
2022-05-03 21:27:16.881 17 DEBUG neutron.plugins.ml2.managers [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Bound port: fbd5a5ac-ac23-41a5-95fd-c2432de210c9, host: computesriov-1.localdomain, vif_type: ovs, vif_details: {"port_filter": true}, binding_levels: [{'bound_driver': 'ovn', 'bound_segment': {'id': '99a3800a-4aea-4760-ba2b-0c4794c8def7', 'network_type': 'vlan', 'physical_network': 'datacentre', 'segmentation_id': 323, 'network_id': '411ed6bb-78ef-4177-9f6f-da0fbb5c85b0'}}] _bind_port_level /usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py:937

B) Maintenance task finds port inconsistency ~3 minutes later

2022-05-03 21:30:24.929 22 DEBUG networking_ovn.common.maintenance [req-1db0a71d-20d3-48fa-a6c5-7cd216ef05e5 - - - - -] Maintenance task: Fixing resource fbd5a5ac-ac23-41a5-95fd-c2432de210c9 (type: ports) at create/update check_for_inconsistencies /usr/lib/python3.6/site-packages/networking_ovn/common/maintenance.py:354
...
2022-05-03 21:30:25.069 22 INFO networking_ovn.db.revision [req-1db0a71d-20d3-48fa-a6c5-7cd216ef05e5 - - - - -] Successfully bumped revision number for resource fbd5a5ac-ac23-41a5-95fd-c2432de210c9 (type: ports) to 3

[1] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/testReport/junit/neutron_plugin.tests.scenario.test_qos/QosTestOvn/test_attach_qos_port_to_vm_with_another_port_id_8dc4f365_d226_4bf5_a2cc_60f4655de504_/
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/testReport/neutron_tempest_plugin.scenario.test_floatingip/DefaultSnatToExternal/test_snat_external_ip_id_3d73ea1a_27c6_45a9_b0f8_04a283d9d764_/
[3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/computesriov-1/var/log/containers/nova/nova-compute.log.1.gz
[4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/controller-1/var/log/containers/neutron/server.log.2.gz

Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20220427.n.3 / python3-networking-ovn-7.4.2-2.20220409154847

How reproducible:
2 failures out of 1032 tests executed

Steps to Reproduce:
We don't have a reproducer yet; we are working on it.
This bug needs to be triaged, and it needs its acks in order to be included in the 16.2 z-stream.
We debugged this further; the findings are below. When the issue reproduces for a server/port:

- A PortBindingUpdateUpEvent is received and put into the queue; at this point the self.notifications queue size is large (250+ entries observed).
- The queue gets filled with PortBindingChassisEvent entries for the chassisredirect port within just 2-3 seconds.
- All the PortBindingChassisEvent entries are for the same port, which just keeps switching chassis [1]. This is only a snippet; there were 274 entries in total for this particular case, and 350+ have been seen in other cases.
- The same can be seen in the ovn-controller log [2] (again only a snippet); there were 134 entries in total on one controller and 135 on the other.

This behavior results from a known, old, unfixed OVN bug: https://bugzilla.redhat.com/show_bug.cgi?id=1974898. Until that is fixed, it seems we need to revert https://review.opendev.org/c/openstack/networking-ovn/+/823279, which is likely making the issue occur more often: it switched monitoring from the NB DB to the SB DB, and since the NB and SB queues are separate, NB events would not be affected by a large SB event queue.
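The delay mechanism described above can be sketched with a plain FIFO queue. This is not networking_ovn code, just a minimal model of its single-consumer notifications queue: a burst of chassis-flapping events lands ahead of the "port up" event, so the event that would trigger the network-vif-plugged notification to Nova is only handled after the entire backlog drains (the event names below mirror the real classes; the payloads are hypothetical).

```python
import queue

# Model of the single-consumer notification queue: FIFO order means a
# later event waits behind everything queued before it.
notifications = queue.Queue()

# ~250 chassisredirect flapping events arrive first (illustrative payloads).
for i in range(250):
    notifications.put(("PortBindingChassisEvent", f"flap-{i}"))

# The event Nova is waiting on arrives after the burst.
up_event = ("PortBindingUpdateUpEvent",
            "fbd5a5ac-ac23-41a5-95fd-c2432de210c9")
notifications.put(up_event)

# A single consumer drains the queue in arrival order.
processed = []
while not notifications.empty():
    processed.append(notifications.get())

# The "up" event is handled only after all 250 flapping events, so any
# per-event handling cost delays the vif-plugged notification by the
# whole backlog.
print(processed.index(up_event))  # 250
```

With hundreds of flapping events arriving every few seconds, the backlog can keep growing faster than it drains, which is consistent with the 5-minute vif-plugged timeout observed in the job.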
[1]
2022-05-25 09:11:04.511 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.513 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[], up=[False]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.554 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.557 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.560 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.563 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.567 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.569 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44

[2]
Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from 2dd2e070-e65d-47b0-a458-49fb7eb3e0eb to 30a04401-a973-4ddd-a087-fd45b12116b7 (on one controller)
Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from 30a04401-a973-4ddd-a087-fd45b12116b7 to 2dd2e070-e65d-47b0-a458-49fb7eb3e0eb (on the other controller)
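A quick way to quantify this flapping on a deployment is to count the "Changing chassis for lport" lines in each controller's ovn-controller log. A sketch against an inline sample (the sample lines are abbreviated and hypothetical; in practice you would grep the real ovn-controller log file on each controller):

```shell
# Abbreviated sample of ovn-controller log lines like the ones in [2].
sample='Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from A to B
Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from B to A
Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from A to B'

# Count how often the gateway port changed chassis; a high count in a
# short window indicates the flapping described above.
printf '%s\n' "$sample" | grep -c 'Changing chassis for lport'  # prints 3
```

The counts reported above (134 on one controller, 135 on the other) were obtained this way from the job's logs.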
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.3 (Train)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:4793