Bug 2168667 - KeyError exception due to a race condition executing withdraw_subnet
Summary: KeyError exception due to a race condition executing withdraw_subnet
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ovn-bgp-agent
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: beta
Target Release: 17.1
Assignee: Luis Tomas Bolivar
QA Contact: Eduardo Olivares
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-09 17:30 UTC by Eduardo Olivares
Modified: 2023-08-16 01:14 UTC (History)
CC List: 3 users

Fixed In Version: ovn-bgp-agent-0.3.1-1.20230213131027.a4c1fde.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-16 01:13:48 UTC
Target Upstream Version:
Embargoed:


Attachments
script to reproduce this bug (653 bytes, text/plain)
2023-02-09 17:30 UTC, Eduardo Olivares


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 873080 0 None MERGED Avoid race condition at _process_lrp_port 2023-02-10 13:38:38 UTC
OpenStack gerrit 873081 0 None MERGED Minor code fix to reuse cr_lrp_info 2023-02-10 13:38:39 UTC
OpenStack gerrit 873082 0 None MERGED Avoid race between withdraw_subnet and cr-lrp association 2023-02-10 13:38:39 UTC
Red Hat Issue Tracker OSP-22227 0 None None None 2023-02-09 17:31:27 UTC
Red Hat Product Errata RHEA-2023:4577 0 None None None 2023-08-16 01:14:11 UTC

Description Eduardo Olivares 2023-02-09 17:30:06 UTC
Created attachment 1943121 [details]
script to reproduce this bug

Description of problem:
This bug has been reproduced by running the attached script on an OSP 17.1 environment with BGP configured.
The bug does not always reproduce; it happens in roughly 10% of runs.

The ovn-bgp-agent immediately exposes the IPs of the ports added to the VMs, which is correct.
Then, the script performs two sets of actions almost simultaneously:
1) it detaches the subnets from the router -> this makes the agent run the withdraw_subnet function, because those internal ports no longer have external (north-south) access
2) it deletes the ports

While the script was running, we checked on an external router whether the BGP routes to those ports had been removed. All neutron ports had been deleted, but a few routes still existed:
# vtysh -c "show ip route" | grep 10.100                                                                                                                                
B>* 10.100.6.114/32 [200/0] via 5.1.1.2, eth0, weight 1, 00:03:04
B>* 10.100.6.152/32 [200/0] via 5.1.1.2, eth0, weight 1, 00:03:13
B>* 10.100.9.71/32 [200/0] via 5.1.1.2, eth0, weight 1, 00:02:05
B>* 10.100.9.103/32 [200/0] via 5.1.1.2, eth0, weight 1, 00:01:56

A few minutes later, those routes were removed as well (the agent's periodic sync process removed them).


During this test, some exceptions were found in the ovn bgp agent logs.

1) The following exception is captured by oslo_service, meaning it is not actually raised to the caller; it is only printed via log.exception:
2023-02-09T16:46:04.051727434+00:00 stdout F 2023-02-09 16:46:04.051 100169 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" acquired by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.withdraw_remote_ip" :: waited 5.204s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355^[[00m
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task [-] Error during BGPAgent.sync: KeyError: 'peer'
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task Traceback (most recent call last):
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task     task(self, context)
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/agent.py", line 61, in sync
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task     self.agent_driver.sync()
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task     return f(*args, **kwargs)
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 209, in sync
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task     self._ensure_port_exposed(port, exposed_ips, ovn_ip_rules)
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 281, in _ensure_port_exposed
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task     ips_adv = self._expose_ip(port_ips, port)
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 525, in _expose_ip
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task     self._expose_cr_lrp_port(ips, mac, bridge_device, bridge_vlan,
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 750, in _expose_cr_lrp_port
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task     self._process_lrp_port(lrp, cr_lrp_port)
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 701, in _process_lrp_port
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task     lrp.options['peer'])
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task KeyError: 'peer'
2023-02-09T16:46:04.054119230+00:00 stdout F 2023-02-09 16:46:04.051 100169 ERROR oslo_service.periodic_task ^[[00m
2023-02-09T16:46:04.054856166+00:00 stdout F 2023-02-09 16:46:04.053 100169 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" released by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.withdraw_remote_ip" :: held 0.002s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:367^[[00m
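The KeyError: 'peer' suggests the sync loop read a Logical_Router_Port row whose options column had already been cleared by the concurrent port deletion (note the LrpDelCommand in the neutron server log at almost the same timestamp). A minimal, hypothetical sketch (class and function names invented, not the agent's real code) of the defensive access pattern that avoids indexing a key that can vanish mid-iteration:

```python
# Hypothetical sketch: 'options' can be emptied by a concurrent port
# deletion, so direct indexing (lrp.options['peer']) can raise KeyError.

class Lrp:
    """Stand-in for the OVSDB Logical_Router_Port row used by the agent."""
    def __init__(self, options):
        self.options = options

def process_lrp_port(lrp):
    # Defensive variant: skip ports whose peer is already gone and let
    # the periodic sync reconcile the state later.
    peer = lrp.options.get('peer')
    if peer is None:
        return None
    return peer

assert process_lrp_port(Lrp({'peer': 'lrp-abc'})) == 'lrp-abc'
assert process_lrp_port(Lrp({})) is None  # options={} as in the deleted row
```

This matches the intent of the upstream change "Avoid race condition at _process_lrp_port" linked above, though the actual patch should be consulted for the real fix.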

Not sure if this is related, but we saw the following log almost at the same time in the neutron server logs on the same controller:
2023-02-09 16:46:04.194 15 DEBUG ovsdbapp.backend.ovs_idl.transaction [req-3debf90d-b158-4f41-bd50-83c3774f2b03 - - - - -] Running txn n=1 command(idx=0): DeleteNATRuleInLRouterCommand(lrouter=neutron-0de20dac-ffd2-4214-81ff-6f6db8f42d43, type=snat, logical_ip=10.100.3.0/24, external_ip=172.24.100.115, if_exists=True) do_commit /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:89
2023-02-09 16:46:04.194 15 DEBUG ovsdbapp.backend.ovs_idl.transaction [req-3debf90d-b158-4f41-bd50-83c3774f2b03 - - - - -] Running txn n=1 command(idx=1): LrpDelCommand(table=Logical_Router_Port, port=lrp-250aaa2f-a86a-48a8-9ccf-c9d5ae3b155e, parent_table=Logical_Router, parent=neutron-0de20dac-ffd2-4214-81ff-6f6db8f42d43, if_exists=True) do_commit /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:89





2) Exception raised from withdraw_subnet

2023-02-09T16:46:06.334783896+00:00 stdout F 2023-02-09 16:46:06.334 100169 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched DELETE: TenantPortDeletedEvent(events=('update', 'delete'), table='Port_Binding', conditions=None, old_conditions=None) to row=Port_Binding(mac=['fa:16:3e:da:98:cd 10.100.9.1/24'], port_security=[], type=patch, nat_addresses=[], virtual_parent=[], up=[False], options={}, parent_port=[], requested_additional_chassis=[], ha_chassis_group=[], external_ids={'neutron:network_name': 'neutron-c03d4924-7008-45bb-beee-8cec4a70f631', 'neutron:revision_number': '3', 'neutron:router_name': 'baf78364-fad4-41a5-af9d-5748b950318a', 'neutron:subnet_ids': '0b0f4e40-d7a0-4173-a74f-498a2cbecbe3'}, additional_chassis=[], tag=[], additional_encap=[], encap=[], mirror_rules=[], datapath=7f75966b-40c6-4437-918a-6635478b5b95, chassis=[], tunnel_key=3, gateway_chassis=[], requested_chassis=[], logical_port=lrp-5ca6f9c8-965e-48c9-b86f-311670bd4b7f) old= matches /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/event.py:43^[[00m
2023-02-09T16:46:06.335491835+00:00 stdout F 2023-02-09 16:46:06.335 100169 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched DELETE: SubnetRouterDetachedEvent(events=('delete',), table='Port_Binding', conditions=None, old_conditions=None) to row=Port_Binding(mac=['fa:16:3e:da:98:cd 10.100.9.1/24'], port_security=[], type=patch, nat_addresses=[], virtual_parent=[], up=[False], options={}, parent_port=[], requested_additional_chassis=[], ha_chassis_group=[], external_ids={'neutron:network_name': 'neutron-c03d4924-7008-45bb-beee-8cec4a70f631', 'neutron:revision_number': '3', 'neutron:router_name': 'baf78364-fad4-41a5-af9d-5748b950318a', 'neutron:subnet_ids': '0b0f4e40-d7a0-4173-a74f-498a2cbecbe3'}, additional_chassis=[], tag=[], additional_encap=[], encap=[], mirror_rules=[], datapath=7f75966b-40c6-4437-918a-6635478b5b95, chassis=[], tunnel_key=3, gateway_chassis=[], requested_chassis=[], logical_port=lrp-5ca6f9c8-965e-48c9-b86f-311670bd4b7f) old= matches /usr/lib/python3.9/site-packages/ovsdbapp/backend/ovs_idl/event.py:43^[[00m
...

2023-02-09T16:46:06.876918303+00:00 stdout F 2023-02-09 16:46:06.876 100710 DEBUG oslo.privsep.daemon [-] privsep: reply[140081499762176]: (4, None) _call_back /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:510^[[00m
2023-02-09T16:46:06.877184206+00:00 stdout F 2023-02-09 16:46:06.877 100169 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" released by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.withdraw_subnet" :: held 0.747s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:367^[[00m
2023-02-09T16:46:06.877624743+00:00 stdout F 2023-02-09 16:46:06.877 100169 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" acquired by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.withdraw_subnet" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355^[[00m
2023-02-09T16:46:06.878761954+00:00 stdout F 2023-02-09 16:46:06.878 100169 DEBUG ovn_bgp_agent.drivers.openstack.ovn_bgp_driver [-] Deleting IP Rules for network 10.100.6.1/24 on chassis 103d1574-e992-4f46-860e-da6768fa7771 _withdraw_lrp_port /usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py:876^[[00m
2023-02-09T16:46:06.878860116+00:00 stdout F 2023-02-09 16:46:06.878 100169 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" released by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.withdraw_subnet" :: held 0.001s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:367^[[00m
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event [-] Unexpected exception in notify_loop: KeyError: 'cr-lrp-fcd48680-6aab-444c-ae0a-a2c3c186f389'
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event Traceback (most recent call last):
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/ovsdbapp/event.py", line 159, in notify_loop
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event     match.run(event, row, updates)
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/watchers/bgp_watcher.py", line 234, in run
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event     self.agent.withdraw_subnet(ip_address, row)
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event     return f(*args, **kwargs)
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 973, in withdraw_subnet
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event     self._withdraw_lrp_port(ip, row.logical_port, cr_lrp)
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 886, in _withdraw_lrp_port
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event     self.ovn_local_cr_lrps[associated_cr_lrp]['subnets_datapath'].pop(
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event KeyError: 'cr-lrp-fcd48680-6aab-444c-ae0a-a2c3c186f389'
2023-02-09T16:46:06.879618681+00:00 stdout F 2023-02-09 16:46:06.878 100169 ERROR ovsdbapp.event ^[[00m
2023-02-09T16:46:06.879944250+00:00 stdout F 2023-02-09 16:46:06.879 100169 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" acquired by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.withdraw_subnet" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355^[[00m
2023-02-09T16:46:06.880736075+00:00 stdout F 2023-02-09 16:46:06.880 100169 DEBUG ovn_bgp_agent.drivers.openstack.ovn_bgp_driver [-] Deleting IP Rules for network 10.100.9.1/24 on chassis 103d1574-e992-4f46-860e-da6768fa7771 _withdraw_lrp_port /usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py:876^[[00m
2023-02-09T16:46:06.880823162+00:00 stdout F 2023-02-09 16:46:06.880 100169 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" released by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.withdraw_subnet" :: held 0.001s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:367^[[00m
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event [-] Unexpected exception in notify_loop: KeyError: 'cr-lrp-4ef49716-7188-4732-bc00-4d3f0e97c9e2'
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event Traceback (most recent call last):
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/ovsdbapp/event.py", line 159, in notify_loop
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event     match.run(event, row, updates)
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/watchers/bgp_watcher.py", line 234, in run
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event     self.agent.withdraw_subnet(ip_address, row)
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event     return f(*args, **kwargs)
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 973, in withdraw_subnet
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event     self._withdraw_lrp_port(ip, row.logical_port, cr_lrp)
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event   File "/usr/lib/python3.9/site-packages/ovn_bgp_agent/drivers/openstack/ovn_bgp_driver.py", line 886, in _withdraw_lrp_port
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event     self.ovn_local_cr_lrps[associated_cr_lrp]['subnets_datapath'].pop(
2023-02-09T16:46:06.881074193+00:00 stdout F 2023-02-09 16:46:06.880 100169 ERROR ovsdbapp.event KeyError: 'cr-lrp-4ef49716-7188-4732-bc00-4d3f0e97c9e2'
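Here the cr-lrp entry has already been removed from the agent's local cache (ovn_local_cr_lrps) while a queued SubnetRouterDetachedEvent still references it. A minimal, hypothetical sketch of the race (identifiers invented, not the agent's real code) contrasting the failing direct-index pattern with a defensive variant:

```python
# Hypothetical sketch of the race behind KeyError: 'cr-lrp-...': the
# cr-lrp entry is dropped from the agent's local cache while a queued
# SubnetRouterDetachedEvent still references it.

ovn_local_cr_lrps = {
    'cr-lrp-example': {'subnets_datapath': {'subnet-a': 'dp-1'}},
}

def withdraw_lrp_port_unsafe(cr_lrp, subnet):
    # Mirrors the failing line: direct indexing raises KeyError when the
    # entry was already removed by a concurrent withdraw.
    ovn_local_cr_lrps[cr_lrp]['subnets_datapath'].pop(subnet, None)

def withdraw_lrp_port_safe(cr_lrp, subnet):
    # Defensive variant: tolerate an entry a concurrent event removed.
    info = ovn_local_cr_lrps.get(cr_lrp)
    if info is None:
        return  # already cleaned up; nothing left to withdraw
    info.get('subnets_datapath', {}).pop(subnet, None)

# Simulate the race: the cr-lrp vanishes before the event is processed.
ovn_local_cr_lrps.pop('cr-lrp-example')
withdraw_lrp_port_safe('cr-lrp-example', 'subnet-a')  # no exception
try:
    withdraw_lrp_port_unsafe('cr-lrp-example', 'subnet-a')
    raced = False
except KeyError:
    raced = True
assert raced
```

The upstream change "Avoid race between withdraw_subnet and cr-lrp association" linked above addresses this path; the sketch only illustrates the failure mode, not the merged fix.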







Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230131.n.2
ovn-bgp-agent-0.3.1-1.20230120160941.62a04d4.el9ost


How reproducible:
~10%

Steps to Reproduce:
1. use the attached script

Actual results:
the routes are not removed immediately after the ports are deleted - the periodic sync process removes them after some time

Expected results:
the routes should be removed immediately after the ports are deleted

Additional info:

Comment 16 errata-xmlrpc 2023-08-16 01:13:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577

