Description of problem:
This bug was reproduced running the tempest neutron regression in a BGP environment, but it is not clear exactly which test triggered it. It is probably due to a race condition and we have not yet identified the scenario that triggers it. As the following logs show (copied from the ovn_bgp_agent.log file on a compute node), process 31903 acquired a lock. The next log entries from that process show that pyroute2 raised the exception pr2modules.netlink.exceptions.NetlinkDumpInterrupted:

2023-02-07T12:59:00.883906257+00:00 stdout F 2023-02-07 12:59:00.883 31903 DEBUG oslo_concurrency.lockutils [-] Lock "bgp" acquired by "ovn_bgp_agent.drivers.openstack.ovn_bgp_driver.OVNBGPDriver.expose_ip" :: waited 0.000s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:355
2023-02-07T12:59:00.905342378+00:00 stdout F 2023-02-07 12:59:00.892 32246 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ovs-vsctl list-ports br-ex execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:384
2023-02-07T12:59:00.950229280+00:00 stdout F 2023-02-07 12:59:00.948 32246 DEBUG oslo_concurrency.processutils [-] CMD "ovs-vsctl list-ports br-ex" returned: 0 in 0.057s execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
2023-02-07T12:59:00.950471287+00:00 stdout F 2023-02-07 12:59:00.949 32246 DEBUG oslo.privsep.daemon [-] privsep: reply[139848771384240]: (4, ('patch-provnet-5e55949f-c84f-49dd-b4ff-9a1c546bed77-to-br-int\n', '')) _call_back /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:510
2023-02-07T12:59:00.958454302+00:00 stdout F 2023-02-07 12:59:00.951 32246 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ovs-vsctl get Interface patch-provnet-5e55949f-c84f-49dd-b4ff-9a1c546bed77-to-br-int ofport execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:384
2023-02-07T12:59:00.975144889+00:00 stdout F 2023-02-07 12:59:00.974 32246 DEBUG oslo_concurrency.processutils [-] CMD "ovs-vsctl get Interface patch-provnet-5e55949f-c84f-49dd-b4ff-9a1c546bed77-to-br-int ofport" returned: 0 in 0.023s execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
2023-02-07T12:59:00.975270217+00:00 stdout F 2023-02-07 12:59:00.974 32246 DEBUG oslo.privsep.daemon [-] privsep: reply[139848771384240]: (4, ('13\n', '')) _call_back /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:510
2023-02-07T12:59:00.976788890+00:00 stdout F 2023-02-07 12:59:00.975 32246 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ovs-ofctl dump-flows br-ex cookie=999/-1,in_port=13 execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:384
2023-02-07T12:59:01.000432653+00:00 stdout F 2023-02-07 12:59:00.995 32246 DEBUG oslo_concurrency.processutils [-] CMD "ovs-ofctl dump-flows br-ex cookie=999/-1,in_port=13" returned: 0 in 0.020s execute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
2023-02-07T12:59:01.000432653+00:00 stdout F 2023-02-07 12:59:00.996 32246 DEBUG oslo.privsep.daemon [-] privsep: reply[139848771384240]: (4, ('NXST_FLOW reply (xid=0x8):\n', '')) _call_back /usr/lib/python3.9/site-packages/oslo_privsep/daemon.py:510
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.017 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost [-] init set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.018 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost [-] starting the source start /usr/lib/python3.9/site-packages/pr2modules/ndb/source.py:409
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.019 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost/nsmanager [-] init set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.019 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost/nsmanager [-] starting the source start /usr/lib/python3.9/site-packages/pr2modules/ndb/source.py:409
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.019 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost [-] connecting set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.023440781+00:00 stdout F 2023-02-07 12:59:01.019 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost [-] loading set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.030439110+00:00 stdout F 2023-02-07 12:59:01.029 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost/nsmanager [-] connecting set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.045458698+00:00 stdout F 2023-02-07 12:59:01.042 31903 DEBUG pyroute2.ndb.139848769356944.sources.localhost/nsmanager [-] loading set /usr/lib/python3.9/site-packages/pr2modules/ndb/events.py:74
2023-02-07T12:59:01.061256564+00:00 stdout F 2023-02-07 12:59:01.059 31903 ERROR pyroute2.ndb.139848769356944.main [-] exception <(-1, 'dump interrupted')> in source localhost: pr2modules.netlink.exceptions.NetlinkDumpInterrupted: (-1, 'dump interrupted')

After this, the BGP agent running on this compute node did not process any more actions. A FIP exposed from this compute node was never withdrawn, even though the VM using that FIP had been removed. In addition, FIPs of new VMs created on this compute node were not exposed, so they were unreachable.

This is related to the following pyroute2 issue:
https://github.com/svinota/pyroute2/issues/874#issuecomment-1063139555

This issue affected neutron some time ago, and the following fix was implemented:
https://review.opendev.org/c/openstack/neutron/+/844366

A similar fix can be implemented in the ovn-bgp-agent.
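The idea behind the neutron fix is to re-issue the netlink dump when the kernel interrupts it, since NetlinkDumpInterrupted means the returned data may be inconsistent rather than that the operation failed permanently. A minimal sketch of that approach follows; the decorator name and the locally defined exception class are hypothetical stand-ins (real code would catch pr2modules.netlink.exceptions.NetlinkDumpInterrupted), not the actual ovn-bgp-agent patch:

```python
import functools


class NetlinkDumpInterrupted(Exception):
    # Stand-in for pr2modules.netlink.exceptions.NetlinkDumpInterrupted,
    # so this sketch runs without pyroute2 installed.
    pass


def retry_on_dump_interrupted(max_attempts=3):
    """Re-issue an interrupted netlink dump instead of letting the
    exception escape and wedge the agent's processing loop."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except NetlinkDumpInterrupted:
                    # Give up only after the last allowed attempt.
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator
```

Any pyroute2-backed lookup the agent performs (e.g. listing routes or addresses) could then be wrapped with this decorator, so a transient interruption triggers a clean retry instead of leaving the lock-holding thread dead.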
Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230131.n.2
ovn-bgp-agent-0.3.1-1.20230120160941.62a04d4.el9ost

How reproducible:
It has happened only once.

Steps to Reproduce:
We do not have a clear reproducer; the bug appeared while running the tempest neutron regression on a BGP downstream job.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577