Bug 2168329 - bgp agent doesn't process new events after pyroute2 crashed
Summary: bgp agent doesn't process new events after pyroute2 crashed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ovn-bgp-agent
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: beta
Target Release: 17.1
Assignee: Luis Tomas Bolivar
QA Contact: Eduardo Olivares
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-02-08 17:45 UTC by Eduardo Olivares
Modified: 2023-08-16 01:14 UTC (History)

Fixed In Version: ovn-bgp-agent-0.3.1-1.20230422171003.2553998.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-16 01:13:48 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 873077 | 0 | None | MERGED | Add retry mechanism for some pyroute actions | 2023-02-09 15:20:26 UTC
OpenStack gerrit | 874638 | 0 | None | MERGED | Ensure exceptions on the sync function don't kill the agent | 2023-02-27 06:45:43 UTC
OpenStack gerrit | 879849 | 0 | None | MERGED | Add protection from pyroute crashed | 2023-04-21 05:20:58 UTC
Red Hat Issue Tracker | OSP-22180 | 0 | None | None | None | 2023-02-08 17:49:53 UTC
Red Hat Product Errata | RHEA-2023:4577 | 0 | None | None | None | 2023-08-16 01:14:11 UTC

Description Eduardo Olivares 2023-02-08 17:45:57 UTC
Description of problem:
This bug was reproduced running the tempest neutron regression on a BGP environment, but it is not clear exactly which test reproduced it.
It is probably due to a race condition and we have not identified the scenario that triggers it yet.



As the following logs show (copied from the ovn_bgp_agent.log file on a compute node), process 31903 acquired the "bgp" lock in OVNBGPDriver.expose_ip. The next log entries from that process show pyroute2 raising the exception pr2modules.netlink.exceptions.NetlinkDumpInterrupted:
2023-02-07T12:59:00.883906257+00:00 stdout F 2023-02-07 12:59:00.883 31903 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" acquired by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.expose_ip" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355
2023-02-07T12:59:00.905342378+00:00 stdout F 2023-02-07 12:59:00.892 32246 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ovs-vsctl list-ports br-ex execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:384
2023-02-07T12:59:00.950229280+00:00 stdout F 2023-02-07 12:59:00.948 32246 DEBUG oslo_concurrency.processutils [-] CMD "ovs-vsctl list-ports br-ex" returned: 0 in 0.057s execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
2023-02-07T12:59:00.950471287+00:00 stdout F 2023-02-07 12:59:00.949 32246 DEBUG oslo.privsep.daemon [-] privsep: reply[139848771384240]: (4, ('patch-provnet-5e55949f-c84f-49dd-b4ff-9a1c546bed77-to-br-int\n', '')) _call_back /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:510
2023-02-07T12:59:00.958454302+00:00 stdout F 2023-02-07 12:59:00.951 32246 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ovs-vsctl get Interface patch-provnet-5e55949f-c84f-49dd-b4ff-9a1c546bed77-to-br-int ofport execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:384
2023-02-07T12:59:00.975144889+00:00 stdout F 2023-02-07 12:59:00.974 32246 DEBUG oslo_concurrency.processutils [-] CMD "ovs-vsctl get Interface patch-provnet-5e55949f-c84f-49dd-b4ff-9a1c546bed77-to-br-int ofport" returned: 0 in 0.023s execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
2023-02-07T12:59:00.975270217+00:00 stdout F 2023-02-07 12:59:00.974 32246 DEBUG oslo.privsep.daemon [-] privsep: reply[139848771384240]: (4, ('13\n', '')) _call_back /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:510
2023-02-07T12:59:00.976788890+00:00 stdout F 2023-02-07 12:59:00.975 32246 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ovs-ofctl dump-flows br-ex cookie=999/-1,in_port=13 execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:384
2023-02-07T12:59:01.000432653+00:00 stdout F 2023-02-07 12:59:00.995 32246 DEBUG oslo_concurrency.processutils [-] CMD "ovs-ofctl dump-flows br-ex cookie=999/-1,in_port=13" returned: 0 in 0.020s execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
2023-02-07T12:59:01.000432653+00:00 stdout F 2023-02-07 12:59:00.996 32246 DEBUG oslo.privsep.daemon [-] privsep: reply[139848771384240]: (4, ('NXST_FLOW reply (xid=0x8):\n', '')) _call_back /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:510
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.017 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost [-] init set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.018 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost [-] starting the source start /usr/lib/python3.9/site-packages/pr2modules/ndb/source.py:409
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.019 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost/nsmanager [-] init set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.019 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost/nsmanager [-] starting the source start /usr/lib/python3.9/site-packages/pr2modules/ndb/source.py:409
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.019 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost [-] connecting set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.019 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost [-] loading set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.030439110+00:00 stdout F 2023-02-07 12:59:01.029 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost/nsmanager [-] connecting set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.045458698+00:00 stdout F 2023-02-07 12:59:01.042 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost/nsmanager [-] loading set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.061256564+00:00 stdout F 2023-02-07 12:59:01.059 31903 ERROR pyroute2.ndb.139848769356944.main [-] exception <(-1, 'dump interrupted')> in source localhost: pr2modules.netlink.exceptions.NetlinkDumpInterrupted: (-1, 'dump interrupted')


After this, the bgp agent running on this compute node didn't process any more actions. A FIP had been exposed from this compute node and it was never removed, even though the VM using that FIP was deleted. In addition, FIPs from new VMs created on this compute node were not exposed (so they were unreachable).
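
One general way to keep an agent from stalling like this is to catch exceptions around each event handler, so a single failure is logged and the remaining events are still processed. The following is only a minimal sketch of that idea and is not the ovn-bgp-agent code: process_events, flaky_handler and the event strings are hypothetical names made up for illustration.

```python
import logging
import queue

LOG = logging.getLogger(__name__)


def process_events(events, handler):
    """Run handler on every queued event; log failures instead of aborting."""
    processed, failed = 0, 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break  # nothing left to process
        try:
            handler(event)
            processed += 1
        except Exception:
            # Log the failure and keep going so later events are not lost.
            LOG.exception("Handler failed for event %r; continuing", event)
            failed += 1
    return processed, failed


def flaky_handler(event):
    """Hypothetical handler that fails for one specific event."""
    if event == "expose 192.0.2.10":
        raise RuntimeError("dump interrupted")


events = queue.Queue()
for item in ("expose 192.0.2.10", "withdraw 192.0.2.10", "expose 192.0.2.11"):
    events.put(item)

result = process_events(events, flaky_handler)
```

In this sketch the first event fails, but the two events queued after it are still handled.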


This is related to the following pyroute2 issue: https://github.com/svinota/pyroute2/issues/874#issuecomment-1063139555
This issue affected neutron some time ago and the following fix was implemented: https://review.opendev.org/c/openstack/neutron/+/844366
A similar fix can be implemented in ovn-bgp-agent.
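
A retry around the interrupted netlink call could look roughly like the sketch below. It is an illustration under assumptions and not the merged patch: the NetlinkDumpInterrupted class here is a local stand-in for the pyroute2 exception seen in the logs (defined so the example runs without pyroute2 installed), and retry_on_dump_interrupted and list_routes are hypothetical names.

```python
import functools
import time


class NetlinkDumpInterrupted(Exception):
    """Local stand-in for the pyroute2 exception of the same name."""


def retry_on_dump_interrupted(max_attempts=3, delay=0.0):
    """Retry the wrapped call when a netlink dump is interrupted."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except NetlinkDumpInterrupted:
                    if attempt == max_attempts:
                        raise  # give up after the last attempt
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@retry_on_dump_interrupted(max_attempts=3)
def list_routes():
    """Hypothetical netlink dump that is interrupted on the first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise NetlinkDumpInterrupted(-1, "dump interrupted")
    return ["192.0.2.10/32"]


routes = list_routes()
```

Only the dump-interrupted error is retried here; any other exception propagates immediately, and the error is re-raised once the attempts are exhausted.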




Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230131.n.2
ovn-bgp-agent-0.3.1-1.20230120160941.62a04d4.el9ost

How reproducible:
It has happened only once

Steps to Reproduce:
We don't have a clear reproducer. It was reproduced while running the tempest neutron regression on a BGP d/s job.

Comment 30 errata-xmlrpc 2023-08-16 01:13:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577

