
Bug 2203811

Summary: [OVN] Spam of "openvswitch: ovs-system: deferred action limit reached, drop recirc action" messages in controller logs
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Alex Stupnikov <astupnik>
Component: ovn-2021
Assignee: Mark Michelson <mmichels>
Status: CLOSED ERRATA
QA Contact: Jianlin Shi <jishi>
Severity: medium
Docs Contact:
Priority: high
Version: FDP 21.A
CC: alink, apevec, ctrautma, cylopez, dalvarez, echaudro, i.maximets, jiji, jlibosva, lhh, majopela, mlavalle, mmichels, ovnteam, scohen
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version: ovn-2021-21.12.0-137
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-11-30 00:16:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alex Stupnikov 2023-05-15 10:14:26 UTC
Description of problem:
While investigating a performance problem for one of our VIP customers, we found that openvswitch writes numerous messages like [1] to /var/log/messages on controller nodes. Since the controller nodes do not forward a significant amount of traffic in the customer's case (ML2/OVN with DVR), these messages do not seem to be the root cause of the customer's issues.

At the same time, OVN discussions like https://mail.openvswitch.org/pipermail/ovs-discuss/2021-August/051345.html make me think that something is wrong and that there may be a routing loop.

I am filing this bug to ask for your opinion about messages [1] and for advice on next steps: the logged messages look too generic to me, and I am not sure how to identify the OVN entities they may affect.

[1]
May  2 04:18:05 HOSTNAME kernel: openvswitch: ovs-system: deferred action limit reached, drop recirc action
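
To gauge how often the message in [1] appears, a sketch like the following could group occurrences by hour. It assumes the traditional syslog timestamp layout shown above (month, day, HH:MM:SS); the function name count_recirc_drops_per_hour is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "deferred action limit reached" kernel messages
# per hour. Reads syslog-formatted lines on stdin and prints "<count> <month>
# <day> <hour>:00" lines, sorted lexically (not chronologically across months).
count_recirc_drops_per_hour() {
    grep -F 'deferred action limit reached, drop recirc action' |
        awk '{ split($3, t, ":"); print $1, $2, t[1] ":00" }' |
        sort | uniq -c
}

# Usage (log path as mentioned in the description):
#   count_recirc_drops_per_hour < /var/log/messages
```

A steady count per hour would point at periodic traffic, while bursts would point at specific events; either way the counts only describe the symptom, not the OVN entity involved.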


Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.2.4

Additional info:
sosreports from controller nodes are attached to support case.

Comment 1 Alex Stupnikov 2023-05-15 12:54:11 UTC
The /var/lib/openvswitch/ovn/ folder from a controller node is also attached to the case.

Comment 14 Mark Michelson 2023-09-12 20:35:27 UTC
Hi everyone. I've been doing some research on this. I think that this issue may already be fixed by the following OVN commit: https://github.com/ovn-org/ovn/commit/8c341b9d704cdf002126699527308203319954f0 .

To quote the commit:

"To reproduce the problem, simply configure SNAT on a LR with the SNAT IP
being the DGP's IP, and then send a packet from external (DGP's LS) to
the SNAT IP. Kernel logs like below will be seen:

openvswitch: ovs-system: deferred action limit reached, drop recirc action"

"LR" is "logical router" and "DGP" is "distributed gateway port". With OSP, I believe that it is common to have SNAT rules that will transform VM IP addresses into the gateway router port's IP address. If you send an unsolicited packet from outside the cluster to the gateway port's IP address, then the issue should be triggered. Packets sent to this address that match existing conntrack entries will not trigger the bug since the unSNAT stage will alter the destination IP address properly. If it's possible to test this on a customer's system, then it likely proves that SNAT is the culprit and the linked commit should fix the issue.

The linked commit is available in OVN 22.12 and later. If the customers can confirm that this is the issue, then the proper fix will be to backport the commit to older OVN versions.

Comment 15 Alex Stupnikov 2023-09-13 08:08:32 UTC
Thank you for investing your time, Mark. It does look like a solid match for the reported problem. But I am not sure how we can test this commit in RHOSP 16.2 environments. Can we create a reproducer that emulates the triggering traffic in any deployment (such as our labs)? Most customers reported these messages from production deployments, so I don't think it is reasonable to ask them to make this kind of change there.

Comment 16 Mark Michelson 2023-09-13 14:33:53 UTC
Hi Alex.

The bug can be reproduced by sending an unsolicited packet to the gateway router's public IP address. It should be as easy as:

nc -w 1 <ip_address> 80

The only tricky bit is finding an appropriate IP address to send the packet to. If you run the following ovn-nbctl command:

ovn-nbctl --columns=external_ip find nat type=snat

Then that will show you some possible IP addresses that you can attempt to send packets to. Try sending a packet to the IP address, then check dmesg to see if you see "openvswitch: ovs-system: deferred action limit reached, drop recirc action" . If you see that message, then the commit I linked should fix the problem.

BTW, if this is the problem, then it's not a very severe issue. The packets that trigger that message should be dropped anyway. The commit I linked will just drop them quicker instead of letting the TTL drop down to 0 and triggering that OVS message.
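
The steps above could be scripted roughly as follows. This is only a sketch: the function names recirc_drop_count and probe_snat_ips are hypothetical, port 80 is taken from the nc example, and it assumes ovn-nbctl, nc and dmesg are usable from the same host. A single probe may not be conclusive, so repeating it a few times is reasonable.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of OVN or OVS.

# Count occurrences of the kernel message in text read from stdin.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
recirc_drop_count() {
    grep -c -F 'deferred action limit reached, drop recirc action' || true
}

# Send one unsolicited TCP connection attempt to each SNAT external IP and
# report whether the number of matching dmesg lines grew afterwards.
probe_snat_ips() {
    local ip before after
    for ip in $(ovn-nbctl --bare --columns=external_ip find nat type=snat | sort -u); do
        before=$(dmesg | recirc_drop_count)
        nc -w 1 "$ip" 80 </dev/null >/dev/null 2>&1 || true
        sleep 1
        after=$(dmesg | recirc_drop_count)
        if [ "$after" -gt "$before" ]; then
            echo "$ip: message logged"
        else
            echo "$ip: no new message"
        fi
    done
}

# Usage: probe_snat_ips
```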

Comment 17 Alex Stupnikov 2023-09-13 15:43:42 UTC
Mark, I can confirm that the specified reproducer works for me. For some reason the message is not logged every time I run the "nc -w 1 10.0.0.187 80" command; roughly one message is logged per 2-3 executions. My lab and the customer's deployments run RHEL 8 and OVN 21.12 [1], while the OVN 22.12 RPMs are built for RHEL 9 [2]. Can I run OVN RPMs built for RHEL 9 inside podman on RHEL 8? And can the specified patch be backported to OVN 21.12?

[1]
# podman exec -it ovn-dbs-bundle-podman-0 rpm -qa | grep ovn | grep fdp
ovn-2021-central-21.12.0-116.el8fdp.x86_64
ovn-2021-21.12.0-116.el8fdp.x86_64

[2]
https://access.redhat.com/downloads/content/ovn22.12/22.12.0-108.el9fdp/x86_64/fd431d51/package
https://access.redhat.com/downloads/content/ovn22.12-central/22.12.0-108.el9fdp/x86_64/fd431d51/package

Comment 18 Alex Stupnikov 2023-09-13 15:45:26 UTC
To avoid a long back-and-forth, I also want to ask whether it is possible to build test OVN 21.12 RPMs for my lab (the customer will not get them; I will use them only in my lab).

Comment 19 Mark Michelson 2023-09-13 17:31:04 UTC
From what I understand, OSP regularly uses RHEL 9 containers on RHEL 8 hosts, so I think it should work to use the ovn22.12 RHEL 9 RPMs in your scenario. I can also kick off a custom ovn-2021 build that has the patch backported. I'll ping this issue when the build is ready.

Comment 20 Mark Michelson 2023-09-13 18:37:32 UTC
An ovn-2021 build with the backported patch is in progress here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=55313804 . Hopefully by the time you're seeing this, the build is complete.

Comment 21 Alex Stupnikov 2023-09-14 19:40:02 UTC
Thank you so much for generating the RPMs. I installed them inside the ovn-northd container image in my lab and confirmed that the problem no longer reproduces for new SNAT IP addresses. Unfortunately I had to redeploy the lab, so I cannot say whether pre-existing SNAT IP addresses behave the same way.

Looking forward to a fix in the OVN RPMs shipped with RHOSP 16.2.

Comment 22 Mark Michelson 2023-09-26 21:05:23 UTC
I backported this to old OVN versions, including branch-21.12. I've updated the state of this issue to MODIFIED and set the fixed-in version as appropriate.
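
To check whether a given host already carries the fixed ovn-2021 build, a comparison along these lines may help. It is a sketch that relies on GNU sort -V; the helper name version_at_least is hypothetical, and the 21.12.0-137.el8fdp threshold assumes the el8fdp build of the fixed-in version recorded on this bug. Other OVN package streams have their own fixed versions and are not covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version-release string $1 is >= $2,
# using GNU sort -V (version sort) to order the two strings.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a host (or inside a container) with ovn-2021 installed:
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' ovn-2021)
#   if version_at_least "$installed" 21.12.0-137.el8fdp; then
#       echo "fixed build installed"
#   else
#       echo "build predates the fix"
#   fi
```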

Comment 23 Alex Stupnikov 2023-09-27 07:24:35 UTC
Thank you for solving this, Mark. Looking forward to getting the fix in RHOSP 16.2.

Comment 26 Jianlin Shi 2023-10-27 06:39:08 UTC
reproducer:

systemctl start openvswitch                                                                           
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641                                         
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:127.0.0.1:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=127.0.0.1
systemctl restart ovn-controller

ovn-nbctl lr-add r1 # Gateway router or LR with DGP on the ext side                                   
ovn-nbctl ls-add ext # simulate external LS                                                           
ovn-nbctl ls-add s2 # simulate internal LS

ovn-nbctl lrp-add r1 lrp-r1-ext 00:00:00:00:01:01 10.0.1.1/24
ovn-nbctl lrp-set-gateway-chassis lrp-r1-ext hv1 1

ovn-nbctl lsp-add ext lsp-ext-r1 -- set Logical_Switch_Port lsp-ext-r1 type=router \
            options:router-port=lrp-r1-ext addresses=router                                           

ovn-nbctl lrp-add r1 lrp-r1-s2 00:00:00:00:02:01 10.0.2.1/24
ovn-nbctl lsp-add s2 lsp-s2-r1 -- set Logical_Switch_Port lsp-s2-r1 type=router \
            options:router-port=lrp-r1-s2 addresses=router

ovn-nbctl lsp-add ext p1 \
        -- lsp-set-addresses p1 "f0:00:00:00:01:02 10.0.1.2" \
        -- lsp-set-port-security p1 "f0:00:00:00:01:02 10.0.1.2"

ovn-nbctl lsp-add s2 p2 \
        -- lsp-set-addresses p2 "f0:00:00:00:02:02 10.0.2.2"

ovn-nbctl lr-nat-add r1 snat 10.0.1.1 10.8.8.0/24

ovs-vsctl add-port br-int p1 -- set interface p1 type=internal external_ids:iface-id=p1
ip netns add p1
ip link set p1 netns p1
ip netns exec p1 ip link set address f0:00:00:00:01:02 dev p1
ip netns exec p1 ip link set p1 up
ip netns exec p1 ip addr add 10.0.1.2/24 dev p1

ovs-vsctl add-port br-int p2 -- set interface p2 type=internal external_ids:iface-id=p2
ip netns add p2
ip link set p2 netns p2
ip netns exec p2 ip link set p2 address f0:00:00:00:02:02
ip netns exec p2 ip link set p2 up
ip netns exec p2 ip addr add 10.0.2.2/24 dev p2

ovn-nbctl --wait=hv sync
ip netns exec p1 ping 10.0.1.1 -c 1
dmesg -C
ip netns exec p1 nc 10.0.1.1 10010 <<< h
dmesg
ovs-appctl dpctl/dump-flows

reproduced on ovn-2021-21.12.0-134:

[root@wsfd-advnetlab18 bz2203811]# rpm -qa | grep -E "openvswitch2.15|ovn-2021"
openvswitch2.15-2.15.0-139.el8fdp.x86_64
ovn-2021-central-21.12.0-134.el8fdp.x86_64
ovn-2021-21.12.0-134.el8fdp.x86_64
ovn-2021-host-21.12.0-134.el8fdp.x86_64

+ dmesg
[14678.601302] openvswitch: ovs-system: deferred action limit reached, drop recirc action
[14680.649116] openvswitch: ovs-system: deferred action limit reached, drop recirc action
[14684.680837] openvswitch: ovs-system: deferred action limit reached, drop recirc action
+ ovs-appctl dpctl/dump-flows
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=32,frag=no), packets:0, bytes:0, used:never, actions:ct_clear,set(ipv4(ttl=31)),ct(zone=2,nat),recirc(0x2)
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=43,frag=no), packets:3, bytes:222, used:2.936s, flags:S, actions:ct_clear,set(ipv4(ttl=42)),ct(zone=2,nat),recirc(0x2)
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=13,frag=no), packets:0, bytes:0, used:never, actions:ct_clear,set(ipv4(ttl=12)),ct(zone=2,nat),recirc(0x2)
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02),eth_type(0x86dd),ipv6(frag=no), packets:2, bytes:160, used:5.816s, actions:drop
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=30,frag=no), packets:0, bytes:0, used:never, actions:ct_clear,set(ipv4(ttl=29)),ct(zone=2,nat),recirc(0x2)
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=17,frag=no), packets:0, bytes:0, used:never, actions:ct_clear,set(ipv4(ttl=16)),ct(zone=2,nat),recirc(0x2)

verified on ovn-2021-21.12.0-137:

[root@wsfd-advnetlab18 bz2203811]# rpm -qa | grep -E "openvswitch2.15|ovn-2021"
openvswitch2.15-2.15.0-139.el8fdp.x86_64
ovn-2021-host-21.12.0-137.el8fdp.x86_64
ovn-2021-central-21.12.0-137.el8fdp.x86_64
ovn-2021-21.12.0-137.el8fdp.x86_64

+ dmesg
+ ovs-appctl dpctl/dump-flows
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2,dst=10.0.1.1,proto=6,ttl=64,frag=no), packets:3, bytes:222, used:2.928s, flags:S, actions:ct(zone=6,nat),recirc(0x1)
recirc_id(0x1),in_port(2),eth(src=f0:00:00:00:01:02,dst=00:00:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(dst=10.0.1.1,proto=6,ttl=64,frag=no), packets:3, bytes:222, used:2.928s, flags:S, actions:drop
recirc_id(0),in_port(3),eth(src=f0:00:00:00:02:02,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=fe80::/ffc0::,dst=ff02::2,proto=58,hlimit=255,frag=no),icmpv6(type=133,code=0), packets:1, bytes:70, used:4.528s, actions:drop
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02),eth_type(0x86dd),ipv6(frag=no), packets:5, bytes:406, used:4.912s, actions:drop
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=10.0.1.2,tip=10.0.1.1,op=1/0xff,sha=f0:00:00:00:01:02,tha=00:00:00:00:00:00), packets:0, bytes:0, used:never, actions:userspace(pid=2947802789,slow_path(action))                                               
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2,dst=10.0.1.1,proto=1,ttl=64,frag=no),icmp(type=8,code=0), packets:0, bytes:0, used:never, actions:userspace(pid=2947802789,slow_path(action))
recirc_id(0),in_port(3),eth(src=f0:00:00:00:02:02,dst=33:33:ff:00:02:02),eth_type(0x86dd),ipv6(src=::,dst=ff02::1:ff00:202,proto=58,hlimit=255,frag=no),icmpv6(type=135,code=0), packets:0, bytes:0, used:never, actions:drop
recirc_id(0),in_port(3),eth(src=f0:00:00:00:02:02,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=fe80::/ffc0::,dst=ff02::16,proto=58,hlimit=1,frag=no),icmpv6(type=143), packets:1, bytes:90, used:8.800s, actions:drop
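
In the 21.12.0-134 dump, the looping flows match recirc_id(0x2) and also end with recirc(0x2), decrementing the TTL on each pass; the 21.12.0-137 dump has no such flow. A rough filter for that pattern in dpctl/dump-flows output might look like the sketch below. The function name find_self_recirc_flows is hypothetical, and the check is only a heuristic for this particular signature, not a general loop detector.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print datapath flows (read from stdin) whose actions
# recirculate back into the same recirc_id that the flow itself matches on.
find_self_recirc_flows() {
    awk '
        match($0, /^recirc_id\([^)]*\)/) {
            # "recirc_id(" is 10 characters; strip it and the closing ")".
            id = substr($0, RSTART + 10, RLENGTH - 11)
            n = index($0, "actions:")
            if (n && index(substr($0, n), "recirc(" id ")")) print
        }'
}

# Usage:
#   ovs-appctl dpctl/dump-flows | find_self_recirc_flows
```

An empty result after generating the test traffic is consistent with the fixed behaviour shown for 21.12.0-137.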

Comment 28 errata-xmlrpc 2023-11-30 00:16:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7591