Bug 2203811
| Summary: | [OVN] Spam of "openvswitch: ovs-system: deferred action limit reached, drop recirc action" messages in controller logs | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Alex Stupnikov <astupnik> |
| Component: | ovn-2021 | Assignee: | Mark Michelson <mmichels> |
| Status: | CLOSED ERRATA | QA Contact: | Jianlin Shi <jishi> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | FDP 21.A | CC: | alink, apevec, ctrautma, cylopez, dalvarez, echaudro, i.maximets, jiji, jlibosva, lhh, majopela, mlavalle, mmichels, ovnteam, scohen |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | ovn-2021-21.12.0-137 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-11-30 00:16:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Alex Stupnikov
2023-05-15 10:14:26 UTC
We also have /var/lib/openvswitch/ovn/ folder from controller node attached to case.

Hi everyone. I've been doing some research on this. I think that this issue may already be fixed by the following OVN commit: https://github.com/ovn-org/ovn/commit/8c341b9d704cdf002126699527308203319954f0 . To quote the commit:

"To reproduce the problem, simply configure SNAT on a LR with the SNAT IP being the DGP's IP, and then send a packet from external (DGP's LS) to the SNAT IP. Kernel logs like below will be seen: openvswitch: ovs-system: deferred action limit reached, drop recirc action"

"LR" is "logical router" and "DGP" is "distributed gateway port". With OSP, I believe that it is common to have SNAT rules that will transform VM IP addresses into the gateway router port's IP address. If you send an unsolicited packet from outside the cluster to the gateway port's IP address, then the issue should be triggered. Packets sent to this address that match existing conntrack entries will not trigger the bug since the unSNAT stage will alter the destination IP address properly.

If it's possible to test this on a customer's system, then it likely proves that SNAT is the culprit and the linked commit should fix the issue. The linked commit is available in OVN 22.12 and later. If the customers can confirm that this is the issue, then the proper fix will be to backport the commit to older OVN versions.

Thank you for investing your time Mark. Indeed it looks like a solid match for the reported problem. But I am not sure how we can test this commit in RHOSP 16.2 environments. Can we create a reproducer that emulates the traffic that triggers this in any deployment (like our labs)? Most customers reported these messages from prod deployments, so I don't think that it is reasonable to ask them to implement this kind of change there...

Hi Alex. The bug can be reproduced by sending an unsolicited packet to the gateway router's public IP address. It should be as easy as:

    nc -w 1 <ip_address> 80

The only tricky bit is finding an appropriate IP address to send the packet to. If you run the following ovn-nbctl command:

    ovn-nbctl --columns=external_ip find nat type=snat

then that will show you some possible IP addresses that you can attempt to send packets to. Try sending a packet to the IP address, then check dmesg to see if you see "openvswitch: ovs-system: deferred action limit reached, drop recirc action". If you see that message, then the commit I linked should fix the problem.

BTW, if this is the problem, then it's not a very severe issue. The packets that trigger that message should be dropped anyway. The commit I linked will just drop them quicker instead of letting the TTL drop down to 0 and triggering that OVS message.

Mark, I want to confirm that the specified reproducer works for me. For some reason, the message is not logged every time I run the "nc -w 1 10.0.0.187 80" command; roughly one message is logged per 2-3 executions.

In my lab and in customer's deployments we have RHEL 8 and OVN 21.12 [1], while OVN 22.12 RPMs are built for RHEL 9 [2]. I am wondering if I can run OVN RPMs built for RHEL 9 inside podman on RHEL 8? I am also wondering if the specified patch is back-portable to OVN 21.12?

[1]
# podman exec -it ovn-dbs-bundle-podman-0 rpm -qa | grep ovn | grep fdp
ovn-2021-central-21.12.0-116.el8fdp.x86_64
ovn-2021-21.12.0-116.el8fdp.x86_64

[2]
https://access.redhat.com/downloads/content/ovn22.12/22.12.0-108.el9fdp/x86_64/fd431d51/package
https://access.redhat.com/downloads/content/ovn22.12-central/22.12.0-108.el9fdp/x86_64/fd431d51/package

To avoid having long ping-pongs, I want to also ask if it is possible to create test OVN 21.12 RPMs for my lab (customer is not going to get them, I will use them in my lab)?

From what I understand, OSP regularly uses RHEL 9 containers on RHEL 8 hosts, so I think it should work to use the ovn22.12 RHEL 9 RPMs in your scenario. I can also kick off a custom ovn-2021 build that has the patch backported. I'll ping this issue when the build is ready.

An ovn-2021 build with the backported patch is in progress here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=55313804 . Hopefully by the time you're seeing this, the build is complete.

Thank you so much for generating RPMs. I was able to install them inside the ovn-northd container image in my lab and confirm that the problem is no longer reproduced for new SNAT IP addresses. Unfortunately I had to re-deploy the lab, so I am not sure about pre-existing ones. Looking forward to a fix in OVN RPMs shipped with RHOSP 16.2.

I backported this to old OVN versions, including branch-21.12. I've updated the state of this issue to MODIFIED and set the fixed-in version as appropriate.

Thank you for solving this Mark. Looking forward to getting the fix in RHOSP 16.2.

reproducer:
systemctl start openvswitch
systemctl start ovn-northd
ovn-nbctl set-connection ptcp:6641
ovn-sbctl set-connection ptcp:6642
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:127.0.0.1:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=127.0.0.1
systemctl restart ovn-controller
ovn-nbctl lr-add r1 # Gateway router or LR with DGP on the ext side
ovn-nbctl ls-add ext # simulate external LS
ovn-nbctl ls-add s2 # simulate internal LS
ovn-nbctl lrp-add r1 lrp-r1-ext 00:00:00:00:01:01 10.0.1.1/24
ovn-nbctl lrp-set-gateway-chassis lrp-r1-ext hv1 1
ovn-nbctl lsp-add ext lsp-ext-r1 -- set Logical_Switch_Port lsp-ext-r1 type=router \
options:router-port=lrp-r1-ext addresses=router
ovn-nbctl lrp-add r1 lrp-r1-s2 00:00:00:00:02:01 10.0.2.1/24
ovn-nbctl lsp-add s2 lsp-s2-r1 -- set Logical_Switch_Port lsp-s2-r1 type=router \
options:router-port=lrp-r1-s2 addresses=router
ovn-nbctl lsp-add ext p1 \
-- lsp-set-addresses p1 "f0:00:00:00:01:02 10.0.1.2" \
-- lsp-set-port-security p1 "f0:00:00:00:01:02 10.0.1.2"
ovn-nbctl lsp-add s2 p2 \
-- lsp-set-addresses p2 "f0:00:00:00:02:02 10.0.2.2"
ovn-nbctl lr-nat-add r1 snat 10.0.1.1 10.8.8.0/24
ovs-vsctl add-port br-int p1 -- set interface p1 type=internal external_ids:iface-id=p1
ip netns add p1
ip link set p1 netns p1
ip netns exec p1 ip link set address f0:00:00:00:01:02 dev p1
ip netns exec p1 ip link set p1 up
ip netns exec p1 ip addr add 10.0.1.2/24 dev p1
ovs-vsctl add-port br-int p2 -- set interface p2 type=internal external_ids:iface-id=p2
ip netns add p2
ip link set p2 netns p2
ip netns exec p2 ip link set p2 address f0:00:00:00:02:02
ip netns exec p2 ip link set p2 up
ip netns exec p2 ip addr add 10.0.2.2/24 dev p2
ovn-nbctl --wait=hv sync
ip netns exec p1 ping 10.0.1.1 -c 1
dmesg -C
ip netns exec p1 nc 10.0.1.1 10010 <<< h
dmesg
ovs-appctl dpctl/dump-flows
reproduced on ovn-2021-21.12.0-134:
[root@wsfd-advnetlab18 bz2203811]# rpm -qa | grep -E "openvswitch2.15|ovn-2021"
openvswitch2.15-2.15.0-139.el8fdp.x86_64
ovn-2021-central-21.12.0-134.el8fdp.x86_64
ovn-2021-21.12.0-134.el8fdp.x86_64
ovn-2021-host-21.12.0-134.el8fdp.x86_64
+ dmesg
[14678.601302] openvswitch: ovs-system: deferred action limit reached, drop recirc action
[14680.649116] openvswitch: ovs-system: deferred action limit reached, drop recirc action
[14684.680837] openvswitch: ovs-system: deferred action limit reached, drop recirc action
+ ovs-appctl dpctl/dump-flows
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=32,frag=no), packets:0, bytes:0, used:never, actions:ct_clear,set(ipv4(ttl=31)),ct(zone=2,nat),recirc(0x2)
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=43,frag=no), packets:3, bytes:222, used:2.936s, flags:S, actions:ct_clear,set(ipv4(ttl=42)),ct(zone=2,nat),recirc(0x2)
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=13,frag=no), packets:0, bytes:0, used:never, actions:ct_clear,set(ipv4(ttl=12)),ct(zone=2,nat),recirc(0x2)
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02),eth_type(0x86dd),ipv6(frag=no), packets:2, bytes:160, used:5.816s, actions:drop
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=30,frag=no), packets:0, bytes:0, used:never, actions:ct_clear,set(ipv4(ttl=29)),ct(zone=2,nat),recirc(0x2)
recirc_id(0x2),in_port(2),eth(src=00:00:00:00:01:01,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2/255.255.255.254,dst=10.0.1.1,proto=6,ttl=17,frag=no), packets:0, bytes:0, used:never, actions:ct_clear,set(ipv4(ttl=16)),ct(zone=2,nat),recirc(0x2)
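The looping flows in the dump above share one pattern: each matches recirc_id(0x2), decrements the TTL, and ends its actions with recirc(0x2), so the packet is fed back into the same recirculation ID. As a rough aid for spotting that pattern in a longer dump, here is a minimal shell sketch. The function name find_recirc_loops is hypothetical (it is not part of OVS or OVN), and the check is only a textual heuristic that assumes the dpctl/dump-flows output format shown in this report.

```shell
#!/bin/sh
# find_recirc_loops (hypothetical helper): read `ovs-appctl dpctl/dump-flows`
# output on stdin and print every flow whose actions recirculate into the
# same recirc_id that the flow itself matches on.
find_recirc_loops() {
    awk '
    {
        # The match part starts with e.g. "recirc_id(0x2)"; skip other lines.
        if (!match($0, /^recirc_id\([^)]*\)/)) next
        # "recirc_id(" is 10 characters long; strip it and the closing ")".
        id = substr($0, RSTART + 10, RLENGTH - 11)

        # Only look for the recirc() target inside the action list.
        pos = index($0, "actions:")
        if (pos == 0) next
        acts = substr($0, pos)

        if (index(acts, "recirc(" id ")") > 0) print
    }'
}

# Example (on the chassis hosting the gateway port):
#   ovs-appctl dpctl/dump-flows | find_recirc_loops
```

An empty result does not prove the absence of the bug, since datapath flows expire when idle; it only means no self-recirculating flow was installed at the time of the dump.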
verified on ovn-2021-21.12.0-137:
[root@wsfd-advnetlab18 bz2203811]# rpm -qa | grep -E "openvswitch2.15|ovn-2021"
openvswitch2.15-2.15.0-139.el8fdp.x86_64
ovn-2021-host-21.12.0-137.el8fdp.x86_64
ovn-2021-central-21.12.0-137.el8fdp.x86_64
ovn-2021-21.12.0-137.el8fdp.x86_64
+ dmesg
+ ovs-appctl dpctl/dump-flows
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2,dst=10.0.1.1,proto=6,ttl=64,frag=no), packets:3, bytes:222, used:2.928s, flags:S, actions:ct(zone=6,nat),recirc(0x1)
recirc_id(0x1),in_port(2),eth(src=f0:00:00:00:01:02,dst=00:00:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(dst=10.0.1.1,proto=6,ttl=64,frag=no), packets:3, bytes:222, used:2.928s, flags:S, actions:drop
recirc_id(0),in_port(3),eth(src=f0:00:00:00:02:02,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=fe80::/ffc0::,dst=ff02::2,proto=58,hlimit=255,frag=no),icmpv6(type=133,code=0), packets:1, bytes:70, used:4.528s, actions:drop
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02),eth_type(0x86dd),ipv6(frag=no), packets:5, bytes:406, used:4.912s, actions:drop
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=10.0.1.2,tip=10.0.1.1,op=1/0xff,sha=f0:00:00:00:01:02,tha=00:00:00:00:00:00), packets:0, bytes:0, used:never, actions:userspace(pid=2947802789,slow_path(action))
recirc_id(0),in_port(2),eth(src=f0:00:00:00:01:02,dst=00:00:00:00:01:01),eth_type(0x0800),ipv4(src=10.0.1.2,dst=10.0.1.1,proto=1,ttl=64,frag=no),icmp(type=8,code=0), packets:0, bytes:0, used:never, actions:userspace(pid=2947802789,slow_path(action))
recirc_id(0),in_port(3),eth(src=f0:00:00:00:02:02,dst=33:33:ff:00:02:02),eth_type(0x86dd),ipv6(src=::,dst=ff02::1:ff00:202,proto=58,hlimit=255,frag=no),icmpv6(type=135,code=0), packets:0, bytes:0, used:never, actions:drop
recirc_id(0),in_port(3),eth(src=f0:00:00:00:02:02,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=fe80::/ffc0::,dst=ff02::16,proto=58,hlimit=1,frag=no),icmpv6(type=143), packets:1, bytes:90, used:8.800s, actions:drop
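For repeating the manual check discussed earlier in this report (list the SNAT external IPs, send an unsolicited packet to one of them, then look at dmesg), a small shell sketch might look like the following. Both function names are hypothetical helpers, not part of OVN or OVS. The sketch assumes ovn-nbctl can reach the northbound database and uses its --bare table formatting option to print plain values. Because the thread notes that the message was logged only once per 2-3 attempts, a single probe without a new message is not conclusive.

```shell
#!/bin/sh
# snat_external_ips (hypothetical helper): list the external IPs of all SNAT
# rules in the OVN northbound database, one per line, without duplicates.
snat_external_ips() {
    ovn-nbctl --bare --columns=external_ip find nat type=snat | awk 'NF' | sort -u
}

# count_deferred_drops (hypothetical helper): count the kernel log lines on
# stdin that carry the message from this bug report. Always prints a number.
count_deferred_drops() {
    awk '/openvswitch: .*deferred action limit reached, drop recirc action/ { n++ }
         END { print n + 0 }'
}

# Example workflow, following the steps described in the thread:
#   snat_external_ips                # on a node with NB database access
#   dmesg | count_deferred_drops     # on the gateway chassis, before probing
#   nc -w 1 <ip_address> 80          # from outside the cluster, a few times
#   dmesg | count_deferred_drops     # on the gateway chassis, after probing
```

A count that grows between the two dmesg reads suggests the SNAT loop described above is being hit; an unchanged count after several probes matches the behaviour seen with the fixed build.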
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn-2021 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:7591 |