Bug 2081631

Summary: VM creation times out due to inconsistent port

Product: Red Hat OpenStack
Component: python-networking-ovn
Version: 16.2 (Train)
Reporter: Eduardo Olivares <eolivare>
Assignee: Terry Wilson <twilson>
QA Contact: Fiorella Yanac <fyanac>
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Triaged
Target Milestone: z3
Target Release: 16.2 (Train on RHEL 8.4)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-networking-ovn-7.4.2-2.20220409154849.el8ost
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-06-22 16:07:20 UTC
CC: apevec, froyo, lhh, majopela, mblue, mtomaska, pgrist, ralonsoh, scohen, shrjoshi, skaplons, spower, stchen, tvignaud, ykarel
Bug Depends On: 1974898, 2189267, 2196286

Description Eduardo Olivares 2022-05-04 09:10:08 UTC
Description of problem:
Issue reproduced with two tempest tests executed on the same job:
test_attach_qos_port_to_vm_with_another_port [1]
test_snat_external_ip [2]

In both cases, the test fails because the newly created VM does not reach ACTIVE status within 5 minutes:
Details: (QosTestOvn:test_attach_qos_port_to_vm_with_another_port) Server c0b49322-8682-4573-9f7d-af2330fe7c01 failed to reach ACTIVE status and task state "None" within the required time (300 s). Current status: BUILD. Current task state: spawning.
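
For reference, the failing tempest check amounts to polling the server until it reaches ACTIVE within a 300 s budget. A minimal sketch of that wait with openstacksdk (illustration only, not the tempest waiter itself; the cloud name "overcloud" is an assumption):

import openstack

# Assumption: a clouds.yaml entry named "overcloud" pointing at this cloud.
conn = openstack.connect(cloud='overcloud')
server = conn.compute.get_server('c0b49322-8682-4573-9f7d-af2330fe7c01')

# Raises openstack.exceptions.ResourceTimeout if the server is still in
# BUILD/spawning after 300 seconds, matching the tempest failure above.
conn.compute.wait_for_server(server, status='ACTIVE', wait=300)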


Relevant info from nova-compute logs at [3]:
A) Server creation starts at 21:27:14
2022-05-03 21:27:14.824 7 INFO nova.virt.libvirt.driver [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] Creating image
2022-05-03 21:27:16.005 7 WARNING nova.compute.manager [req-ac7cfe69-f1af-4d25-9f43-c1eb3aaf15c6 - - - - -] While synchronizing instance power states, found 1 instances in the database and 0 instances on the hypervisor.
2022-05-03 21:27:17.754 7 INFO os_vif [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:d0:cd:38,bridge_name='br-int',has_traffic_filtering=True,id=fbd5a5ac-ac23-41a5-95fd-c2432de210c9,network=Network(411ed6bb-78ef-4177-9f6f-da0fbb5c85b0),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapfbd5a5ac-ac')
2022-05-03 21:27:18.256 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] VM Started (Lifecycle Event)
2022-05-03 21:27:18.302 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] VM Paused (Lifecycle Event)
2022-05-03 21:27:18.386 7 INFO nova.compute.manager [req-8b78597a-bbb2-4263-a25a-df003091fcad - - - - -] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] During sync_power_state the instance has a pending task (spawning). Skip.

B) Server creation times out 5 minutes later
2022-05-03 21:32:18.260 7 WARNING nova.virt.libvirt.driver [req-f30dfece-6bee-4d6a-898c-dd8b3ae2db43 461766fdecdf4b1aa65f758ea0acb338 660b1312dba44dafbca5201e828a06a4 - default default] [instance: c0b49322-8682-4573-9f7d-af2330fe7c01] Timeout waiting for [('network-vif-plugged', 'fbd5a5ac-ac23-41a5-95fd-c2432de210c9')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds
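
The eventlet.timeout.Timeout above comes from nova waiting for the "network-vif-plugged" external event that neutron sends once the port becomes ACTIVE; the 300 s budget is nova's vif_plugging_timeout default. A minimal sketch of that wait pattern (illustration only, not nova's code; vif_plugged() is a hypothetical stub):

import eventlet

def vif_plugged(port_id):
    # Hypothetical stub: in the real flow, neutron reports the event through
    # nova's os-server-external-events API when the port goes ACTIVE.
    return False

def wait_for_vif(port_id, timeout=300):
    try:
        # Timeout raises eventlet.timeout.Timeout after `timeout` seconds,
        # producing the "Timeout: 300 seconds" seen in the log above.
        with eventlet.timeout.Timeout(timeout):
            while not vif_plugged(port_id):
                eventlet.sleep(1)
    except eventlet.timeout.Timeout:
        print('Timeout waiting for network-vif-plugged for %s' % port_id)

wait_for_vif('fbd5a5ac-ac23-41a5-95fd-c2432de210c9')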



Relevant info from neutron server logs at [4]:
A) Port creation at 21:27:16
2022-05-03 21:27:16.316 17 DEBUG neutron.notifiers.nova [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Ignoring state change previous_port_status: DOWN current_port_status: DOWN port_id fbd5a5ac-ac23-41a5-95fd-c2432de210c9 record_port_status_changed /usr/lib/python3.6/site-packages/neutron/notifiers/nova.py:229
2022-05-03 21:27:16.323 17 DEBUG neutron_lib.callbacks.manager [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Notify callbacks ['neutron.services.qos.qos_plugin.QoSPlugin._validate_update_port_callback-317832'] for port, precommit_update _notify_loop /usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
2022-05-03 21:27:16.332 17 DEBUG neutron.db.provisioning_blocks [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Transition to ACTIVE for port object fbd5a5ac-ac23-41a5-95fd-c2432de210c9 will not be triggered until provisioned by entity L2. add_provisioning_component /usr/lib/python3.6/site-packages/neutron/db/provisioning_blocks.py:75
...
2022-05-03 21:27:16.780 17 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): PgAddPortCommand(port_group=neutron_pg_drop, lsp=['fbd5a5ac-ac23-41a5-95fd-c2432de210c9'], if_exists=False) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:89
...
2022-05-03 21:27:16.881 17 DEBUG neutron.plugins.ml2.managers [req-2a12aab0-50f3-4fcd-82d7-9c090d57616b 4169a006324e45a687c58c574cd6dd5e 79722d19630341d89a337bbfbda75d57 - default default] Bound port: fbd5a5ac-ac23-41a5-95fd-c2432de210c9, host: computesriov-1.localdomain, vif_type: ovs, vif_details: {"port_filter": true}, binding_levels: [{'bound_driver': 'ovn', 'bound_segment': {'id': '99a3800a-4aea-4760-ba2b-0c4794c8def7', 'network_type': 'vlan', 'physical_network': 'datacentre', 'segmentation_id': 323, 'network_id': '411ed6bb-78ef-4177-9f6f-da0fbb5c85b0'}}] _bind_port_level /usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py:937
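
The "will not be triggered until provisioned by entity L2" message refers to ML2's provisioning_blocks mechanism: the port only transitions to ACTIVE (and the network-vif-plugged notification to nova only fires) once the L2 block is cleared, which for this driver happens when it sees the port binding go up in OVN. A minimal sketch of the two calls involved (assumes code running inside neutron-server with its database configured; this is not the exact driver code):

from neutron.db import provisioning_blocks
from neutron_lib import context
from neutron_lib.callbacks import resources

PORT_ID = 'fbd5a5ac-ac23-41a5-95fd-c2432de210c9'
ctx = context.get_admin_context()

# At port create/update time, ML2 adds an L2 provisioning block (the
# add_provisioning_component line in the log above):
provisioning_blocks.add_provisioning_component(
    ctx, PORT_ID, resources.PORT, provisioning_blocks.L2_AGENT_ENTITY)

# Later, when the driver sees the port binding go up, it clears the block;
# only then does the port go ACTIVE and nova get network-vif-plugged:
provisioning_blocks.provisioning_complete(
    ctx, PORT_ID, resources.PORT, provisioning_blocks.L2_AGENT_ENTITY)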


B) Maintenance task finds port inconsistency ~3 min later
2022-05-03 21:30:24.929 22 DEBUG networking_ovn.common.maintenance [req-1db0a71d-20d3-48fa-a6c5-7cd216ef05e5 - - - - -] Maintenance task: Fixing resource fbd5a5ac-ac23-41a5-95fd-c2432de210c9 (type: ports) at create/update check_for_inconsistencies /usr/lib/python3.6/site-packages/networking_ovn/common/maintenance.py:354
...
2022-05-03 21:30:25.069 22 INFO networking_ovn.db.revision [req-1db0a71d-20d3-48fa-a6c5-7cd216ef05e5 - - - - -] Successfully bumped revision number for resource fbd5a5ac-ac23-41a5-95fd-c2432de210c9 (type: ports) to 3
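
The maintenance task detects the inconsistency by comparing the revision number stored in the Neutron database with the one networking-ovn last wrote to the OVN NB record (the "neutron:revision_number" key in the Logical_Switch_Port external_ids). A minimal sketch of reading that value with ovsdbapp (the NB endpoint below is an assumption; adjust to the deployment):

from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.schema.ovn_northbound import impl_idl

NB_CONNECTION = 'tcp:127.0.0.1:6641'  # assumption: NB DB endpoint
PORT_ID = 'fbd5a5ac-ac23-41a5-95fd-c2432de210c9'

idl = connection.OvsdbIdl.from_server(NB_CONNECTION, 'OVN_Northbound')
nb = impl_idl.OvnNbApiIdlImpl(connection.Connection(idl, timeout=60))

# The LSP is named after the neutron port UUID; its external_ids carry the
# revision number the driver last synced.
ext_ids = nb.db_get('Logical_Switch_Port', PORT_ID,
                    'external_ids').execute(check_error=True)
print('OVN NB revision:', ext_ids.get('neutron:revision_number'))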



[1] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/testReport/junit/neutron_plugin.tests.scenario.test_qos/QosTestOvn/test_attach_qos_port_to_vm_with_another_port_id_8dc4f365_d226_4bf5_a2cc_60f4655de504_/
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/testReport/neutron_tempest_plugin.scenario.test_floatingip/DefaultSnatToExternal/test_snat_external_ip_id_3d73ea1a_27c6_45a9_b0f8_04a283d9d764_/
[3] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/computesriov-1/var/log/containers/nova/nova-compute.log.1.gz
[4] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/53/controller-1/var/log/containers/neutron/server.log.2.gz



Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20220427.n.3 / python3-networking-ovn-7.4.2-2.20220409154847

How reproducible:
2 failures out of 1032 tests executed


Steps to Reproduce:
We don't have a reproducer yet; we are working on one.

Comment 5 spower 2022-05-16 16:52:55 UTC
This bug needs to be triaged, and it needs its acks in order to be included in the 16.2 zstream.

Comment 6 Yatin Karel 2022-05-25 14:49:22 UTC
We debugged this further; the findings are below.

When the issue reproduces for a server/port:
- A PortBindingUpdateUpEvent is received and put into the queue; at that point the self.notifications queue is large (250+ entries observed).
- The queue fills up with PortBindingChassisEvent entries for the chassisredirect port within just 2-3 seconds.
- All of the PortBindingChassisEvent entries are for the same port, which keeps switching chassis [1]. The snippet is only an excerpt: there were 274 entries in total for this particular case, and 350+ were seen in some other cases.
- The same can be seen in the ovn-controller log [2] (again only a snippet); in total there were 134 entries on one controller and 135 on the other.

This runs into a known, old, unfixed OVN bug: https://bugzilla.redhat.com/show_bug.cgi?id=1974898. Until that is fixed, it seems we need to revert https://review.opendev.org/c/openstack/networking-ovn/+/823279, which likely makes the issue occur more often: it switched monitoring from the NB DB to the SB DB, and since the NB and SB notification queues are separate, NB events would not be affected by a large SB event queue. (A minimal sketch of the SB event matcher involved follows the log snippets below.)



[1] 2022-05-25 09:11:04.511 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.513 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[], up=[False]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.554 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.557 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.560 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.563 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2022-05-25 09:11:04.567 15 DEBUG networking_ovn.ovsdb.ovsdb_monitor [-] Hash Ring: Node a3570719-1079-4d61-a0c8-f3171fb07f85 (host: controller-2.redhat.local) handling event "update" for row 3831cbcf-fc7c-4b55-8af4-12e3a3dc21c2 (table: Port_Binding) notify /usr/lib/python3.6/site-packages/networking_ovn/ovsdb/ovsdb_monitor.py:742
2022-05-25 09:11:04.569 15 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=(('type', '=', 'chassisredirect'),), old_conditions=None) to row=Port_Binding(parent_port=[], chassis=[<ovs.db.idl.Row object at 0x7fb4a75b2198>], mac=['fa:16:3e:70:a1:12 10.0.0.220/24 2620:52:0:13b8::1000:21/64'], options={'always-redirect': 'true', 'distributed-port': 'lrp-b0858034-b5e1-475e-a59e-f19ce3191155'}, ha_chassis_group=[], type=chassisredirect, tag=[], requested_chassis=[], tunnel_key=2, up=[True], logical_port=cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155, gateway_chassis=[], encap=[], external_ids={}, virtual_parent=[], nat_addresses=[], datapath=75657e9e-7e7d-4cb5-95bc-97f0e3a37d9a) old=Port_Binding(chassis=[<ovs.db.idl.Row object at 0x7fb4a760e710>]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44

[2] Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from 2dd2e070-e65d-47b0-a458-49fb7eb3e0eb to 30a04401-a973-4ddd-a087-fd45b12116b7 on one controller
Changing chassis for lport cr-lrp-b0858034-b5e1-475e-a59e-f19ce3191155 from 30a04401-a973-4ddd-a087-fd45b12116b7 to 2dd2e070-e65d-47b0-a458-49fb7eb3e0eb on the other controller
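
For reference, the match printed in [1] corresponds to an ovsdbapp RowEvent watching SB Port_Binding rows of type "chassisredirect". A minimal sketch of such a matcher (illustration only, not the networking-ovn class; the handler body is hypothetical):

from ovsdbapp.backend.ovs_idl import event as row_event

class ChassisRedirectBindingEvent(row_event.RowEvent):
    def __init__(self):
        # Same match as in [1]: any 'update' of a chassisredirect
        # Port_Binding row. Every gateway-chassis flap of the cr-lrp port
        # produces one such event, which is what floods the SB notification
        # queue and delays PortBindingUpdateUpEvent for the VM port.
        super().__init__(events=('update',),
                         table='Port_Binding',
                         conditions=(('type', '=', 'chassisredirect'),))
        self.event_name = 'ChassisRedirectBindingEvent'

    def run(self, event, row, old):
        # Hypothetical handler: just report the move; networking-ovn instead
        # queues the event for the hash-ring node that owns the row.
        print('%s moved to chassis %s' % (row.logical_port, row.chassis))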

Comment 18 errata-xmlrpc 2022-06-22 16:07:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.3 (Train)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4793