Bug 2081069 - [sig-network] pods should successfully create sandboxes by adding pod to network
Summary: [sig-network] pods should successfully create sandboxes by adding pod to network
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ---
: 4.11.0
Assignee: Tim Rozet
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On: 2087194
Blocks: 2088582
 
Reported: 2022-05-02 16:59 UTC by Devan Goodwin
Modified: 2022-08-10 11:10 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2088582 (view as bug list)
Environment:
Last Closed: 2022-08-10 11:09:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
pcap (4.34 MB, application/vnd.tcpdump.pcap), 2022-05-13 19:41 UTC, Tim Rozet, no flags
2nd pcap with ports 48804 and 65349 (4.35 MB, application/vnd.tcpdump.pcap), 2022-05-13 22:33 UTC, Tim Rozet, no flags


Links
Github openshift/ovn-kubernetes pull 1100 (Merged): Bug 2081069: Bumps OVN to 22.03.0-37.el8fdp (last updated 2022-05-24 02:10:02 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 11:10:20 UTC)

Internal Links: 2081062

Description Devan Goodwin 2022-05-02 16:59:53 UTC
[sig-network] pods should successfully create sandboxes by adding pod to network

has begun failing in CI on OVN jobs; see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bsig-network%5D%20pods%20should%20successfully%20create%20sandboxes%20by%20adding%20pod%20to%20network


error adding container to network "ovn-kubernetes": CNI request failed with status 400: failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:80:02:06 [10.128.2.6/23]

Possibly from https://github.com/openshift/ovn-kubernetes/pull/1064#issuecomment-1114212169, possibly from RHCOS 8.6.


We do not yet know why.
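
For anyone triaging a cluster that hits this, a hedged way to find affected pods and the CNI-side errors (the grep strings come from the error above; the namespace and container name are assumptions):

# events carry the sandbox error verbatim
oc get events -A | grep 'timed out waiting for OVS port binding'
# the CNI request failures also appear in the ovnkube-node container logs
oc -n openshift-ovn-kubernetes logs ds/ovnkube-node -c ovnkube-node | grep 'ovn-installed'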

Comment 1 Tim Rozet 2022-05-10 19:44:33 UTC
The cause of the "timed out waiting for OVS port binding" errors is that ovn-controller receives no packets from the OVN SBDB for a period of time after it starts. ovn-controller has a default inactivity timeout of 180 seconds, so for 3 minutes after the node starts (while kubelet tries to launch new containers) all of those launches fail. We can see ovn-controller complaining here that it is getting nothing from the SBDB:

[trozet@fedora must-gather.local.7812632480284007845]$ omg logs ovnkube-node-h8vzv -c ovn-controller | grep 10.0.170.147
2022-05-09T22:22:57.520089247Z 2022-05-09T22:22:57.520Z|00006|reconnect|INFO|ssl:10.0.170.147:9642: connecting...
2022-05-09T22:22:57.527961633Z 2022-05-09T22:22:57.527Z|00008|reconnect|INFO|ssl:10.0.170.147:9642: connected
2022-05-09T22:25:16.528231692Z 2022-05-09T22:25:16.528Z|00031|reconnect|ERR|ssl:10.0.170.147:9642: no response to inactivity probe after 90 seconds, disconnecting
2022-05-09T22:25:16.528231692Z 2022-05-09T22:25:16.528Z|00032|reconnect|INFO|ssl:10.0.170.147:9642: connection dropped

^ 3 minutes of outage

During the same time period, the SBDB also thinks it's getting nothing from ovn-controller:
[trozet@fedora must-gather.local.7812632480284007845]$ omg logs  ovnkube-master-l85dd -c sbdb | grep 10.0.247.77
2022-05-09T22:23:57.696598680Z 2022-05-09T22:23:57.696Z|00125|reconnect|ERR|ssl:10.0.247.77:47710: no response to inactivity probe after 30 seconds, disconnecting

We then reproduced this issue while running tcpdump on the br-ex kernel interfaces to see what traffic is actually being sent between the nodes. From the packet captures I can see that the SBDB is continuously retransmitting packets that make it to ovn-controller's node, but the node never sends back an ACK. The final packets we see from ovn-controller are a TCP window size increase and a few more packets, and then it stops responding altogether.
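
For reference, a hedged sketch of the kind of capture used here (the interface name and SBDB port 9642 are from this bug; the exact tcpdump flags are an assumption):

# capture ovn-controller <-> SBDB traffic on the node's external bridge
tcpdump -i br-ex -w /tmp/sbdb-br-ex.pcap 'tcp port 9642'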

The question is why the kernel stops ACKing packets that are arriving at the node. The only explanation I can come up with is that the receive queue is full and the kernel is refusing to ACK any new packets that come in. The increase in window size (receive buffer) seems to support this theory, but I really need help from a kernel network stack expert to debug further.

With all that said, this problem was only seen after moving to RHCOS 8.6 and not in earlier versions; reverting the bump to RHCOS 8.6 appears to have fixed the issue. This is either a bug in the 8.6 kernel or a compatibility problem between the same OVN version and the new kernel. Going to open a bugzilla with the kernel team to help investigate.

Comment 2 Tim Rozet 2022-05-11 14:22:38 UTC
8.5 kernel - 4.18.0-348.23.1.el8_5.x86_64

8.6 kernel - 4.18.0-369.el8.x86_64

Florian is looking into this from the kernel side.

Comment 3 David Eads 2022-05-11 15:13:18 UTC
Linking to an example failure for future me: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1521037775749517312 . The CI search results will clear in another five days or so.

Comment 6 Tim Rozet 2022-05-12 17:13:19 UTC
Florian used tcpreplay on the capture and didn't find any irregularities with the affected kernel. I think the next step is to reproduce with net.netfilter.nf_conntrack_log_invalid=6 set so we can see if conntrack is dropping the packet as invalid, as well as to enable some iptables tracing to see if the packet is making it into/through iptables.
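
A minimal sketch of that debug setup (the raw-table TRACE rule matches what was applied in comment 11; treating these as one-off node-level debug commands is an assumption):

# log packets that conntrack classifies as invalid
sysctl -w net.netfilter.nf_conntrack_log_invalid=6
# trace packets arriving from the SBDB (source port 9642) through iptables
iptables -t raw -A PREROUTING -p tcp --sport 9642 -j TRACE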

Comment 7 Tim Rozet 2022-05-13 19:27:00 UTC
After some further analysis I think we know what's happening here. Somehow, mid TCP conversation (following a TCP window update), the client starts sending TCP packets with a *different* TCP source port that are part of the *same* TCP conversation:

 1337 2022-05-13 12:54:54.629149   10.0.201.4 → 10.0.172.7   TCP 66 54728 → 9642 [ACK] Seq=5592 Ack=1212160 Win=162176 Len=0 TSval=1001755650 TSecr=4035653492
 1338 2022-05-13 12:54:54.629436   10.0.172.7 → 10.0.201.4   TLSv1.3 9015 Application Data [TCP segment of a reassembled PDU]
 1339 2022-05-13 12:54:54.629805   10.0.172.7 → 10.0.201.4   TLSv1.3 26913 Application Data [TCP segment of a reassembled PDU]
 1340 2022-05-13 12:54:54.629852   10.0.201.4 → 10.0.172.7   TCP 66 54728 → 9642 [ACK] Seq=5592 Ack=1247956 Win=174592 Len=0 TSval=1001755651 TSecr=4035653493
 1341 2022-05-13 12:54:54.629882   10.0.172.7 → 10.0.201.4   TLSv1.3 62709 Application Data, Application Data, Application Data, Application Data
 1342 2022-05-13 12:54:54.629892   10.0.172.7 → 10.0.201.4   TLSv1.3 26288 Application Data, Application Data, Application Data, Application Data
 1343 2022-05-13 12:54:54.630144   10.0.201.4 → 10.0.172.7   TCP 66 54728 → 9642 [ACK] Seq=5592 Ack=1336821 Win=112768 Len=0 TSval=1001755651 TSecr=4035653493
 1344 2022-05-13 12:54:54.633146   10.0.201.4 → 10.0.172.7   TCP 66 [TCP Window Update] 54728 → 9642 [ACK] Seq=5592 Ack=1336821 Win=248576 Len=0 TSval=1001755654 TSecr=4035653493
 1345 2022-05-13 12:54:55.077203   10.0.172.7 → 10.0.201.4   TLSv1.3 229 Application Data
 1346 2022-05-13 12:54:55.077285   10.0.172.7 → 10.0.201.4   TLSv1.3 1168 Application Data
 1347 2022-05-13 12:54:55.083290   10.0.172.7 → 10.0.201.4   TCP 1168 [TCP Retransmission] 9642 → 54728 [PSH, ACK] Seq=1336984 Ack=5592 Win=54400 Len=1102 TSval=4035653947 TSecr=1001755654
 1348 2022-05-13 12:54:55.083378   10.0.201.4 → 10.0.172.7   TCP 78 50464 → 9642 [ACK] Seq=1 Ack=1 Win=4144 Len=0 TSval=1001756104 TSecr=4035653940 SLE=4294966195 SRE=1
 1349 2022-05-13 12:54:55.124728   10.0.172.7 → 10.0.201.4   TLSv1.3 229 Application Data
 1350 2022-05-13 12:54:55.124766   10.0.201.4 → 10.0.172.7   TCP 66 50464 → 9642 [ACK] Seq=1 Ack=164 Win=4284 Len=0 TSval=1001756146 TSecr=4035653988
 1351 2022-05-13 12:54:55.124780   10.0.172.7 → 10.0.201.4   TLSv1.3 1117 Application Data
 1352 2022-05-13 12:54:55.124803   10.0.201.4 → 10.0.172.7   TCP 66 50464 → 9642 [ACK] Seq=1 Ack=1215 Win=4424 Len=0 TSval=1001756146 TSecr=4035653988
 1353 2022-05-13 12:54:55.284294   10.0.172.7 → 10.0.201.4   TCP 2545 [TCP Retransmission] 9642 → 54728 [PSH, ACK] Seq=1336821 Ack=5592 Win=54400 Len=2479 TSval=4035654148 TSecr=1001755654
 1354 2022-05-13 12:54:55.284328   10.0.201.4 → 10.0.172.7   TCP 78 [TCP Dup ACK 1352#1] 50464 → 9642 [ACK] Seq=1 Ack=1215 Win=4424 Len=0 TSval=1001756305 TSecr=4035654148 SLE=4294966032 SRE=1215
 1355 2022-05-13 12:54:55.692288   10.0.172.7 → 10.0.201.4   TCP 2545 [TCP Retransmission] 9642 → 54728 [PSH, ACK] Seq=1336821 Ack=5592 Win=54400 Len=2479 TSval=4035654556 TSecr=1001755654
 1356 2022-05-13 12:54:55.692326   10.0.201.4 → 10.0.172.7   TCP 78 [TCP Dup ACK 1352#2] 50464 → 9642 [ACK] Seq=1 Ack=1215 Win=4424 Len=0 TSval=1001756713 TSecr=4035654556 SLE=4294966032 SRE=1215
 1357 2022-05-13 12:54:56.525308   10.0.172.7 → 10.0.201.4   TCP 2545 [TCP Retransmission] 9642 → 54728 [PSH, ACK] Seq=1336821 Ack=5592 Win=54400 Len=2479 TSval=4035655389 TSecr=1001755654
 1358 2022-05-13 12:54:56.525365   10.0.201.4 → 10.0.172.7   TCP 78 [TCP Dup ACK 1352#3] 50464 → 9642 [ACK] Seq=1 Ack=1215 Win=4424 Len=0 TSval=1001757546 TSecr=4035655389 SLE=4294966032 SRE=1215

In the above output, 10.0.201.4 is the ovn-controller (client) and 10.0.172.7 is the SBDB (server). The original connection started with the client connecting from port 54728. After sending a window update in packet 1344, the client starts ACKing further packets using a different TCP port (50464)! Examining the sequence/ACK numbers in these packets, they match perfectly with the packets being sent on the original 54728 connection.

This is definitely a bug in the kernel. It's hard to say whether it is rooted in conntrack somehow changing the source port, or whether something gets mangled down in the TCP data structure referencing this socket. I'm going to try to reproduce again and capture conntrack events and/or try to bypass conntrack to rule that out, but ultimately we will need help from the kernel team to identify what is happening with this kernel. Will attach the pcap to this bz.
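
A hedged sketch of how the conntrack events can be captured for this (conntrack-tools must be installed on the node; the port filter mirrors the reproduction in comment 10):

# record all conntrack events with timestamps for later filtering
conntrack -E -o timestamp > conntrack.log &
# afterwards, filter for the original and rewritten client ports
egrep 'port=48804|port=65349' conntrack.log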

Comment 8 Tim Rozet 2022-05-13 19:41:38 UTC
Created attachment 1879529 [details]
pcap

Comment 9 Tim Rozet 2022-05-13 19:52:34 UTC
Just to clarify: the sequence numbers shown in the above output are relative. Since tshark thinks there are 2 different conversations happening, it starts the second stream at 1. However, if you look at the raw sequence numbers you can see they match:

[trozet@fedora ovnkube-node-q7pq5]$ tshark -r tcpdump_capture_br-ex-05-13-16-20-22-1652458822.pcap -Y "frame.number==1343 || frame.number==1348" -T fields -E separator=, -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport  -e frame.len -e tcp.flags  -e tcp.seq_raw -e _ws.col.Info -E header=y -E quote=d -E occurrence=f
ip.src,ip.dst,tcp.srcport,tcp.dstport,frame.len,tcp.flags,tcp.seq_raw,_ws.col.Info
"10.0.201.4","10.0.172.7","54728","9642","66","0x0010","3698430526","54728 → 9642 [ACK] Seq=5592 Ack=1336821 Win=112768 Len=0 TSval=1001755651 TSecr=4035653493"
"10.0.201.4","10.0.172.7","50464","9642","78","0x0010","3698430526","50464 → 9642 [ACK] Seq=1 Ack=1 Win=4144 Len=0 TSval=1001756104 TSecr=4035653940 SLE=4294966195 SRE=1"


^ sequence number on both streams: 3698430526

Comment 10 Tim Rozet 2022-05-13 22:30:56 UTC
Managed to reproduce while monitoring conntrack events. It looks like this is caused by conntrack. In this example the original connection is:
10.0.173.53:9642 (SBDB server) <----> 10.0.171.17:48804 (client ovn-controller)

When the problem happens, the client changes its port to 65349, and in conntrack I see:
[trozet@fedora 2nd]$ cat conntrack.log | egrep 'port=65349|port=48804'
May 13 22:04:25 ip-10-0-171-17 conntrack[3178]:     [NEW] tcp      6 120 SYN_SENT src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804
May 13 22:04:25 ip-10-0-171-17 conntrack[3178]:     [NEW] tcp      6 120 SYN_SENT src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 mark=2 zone=64000
May 13 22:04:25 ip-10-0-171-17 conntrack[3178]:  [UPDATE] tcp      6 60 SYN_RECV src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804
May 13 22:04:25 ip-10-0-171-17 conntrack[3178]:  [UPDATE] tcp      6 432000 ESTABLISHED src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [ASSURED]
May 13 22:04:25 ip-10-0-171-17 conntrack[3178]: [DESTROY] tcp      6 431999 ESTABLISHED src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [ASSURED]
May 13 22:04:51 ip-10-0-171-17 conntrack[3178]:     [NEW] tcp      6 300 ESTABLISHED src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [UNREPLIED] src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=61029
May 13 22:04:51 ip-10-0-171-17 conntrack[3178]:     [NEW] tcp      6 300 ESTABLISHED src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=65349
^^^^^^^^^^^^^^^^^

May 13 22:04:51 ip-10-0-171-17 conntrack[3178]:     [NEW] tcp      6 300 ESTABLISHED src=10.0.171.17 dst=10.0.173.53 sport=65349 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=65349 mark=2 zone=64000
May 13 22:05:25 ip-10-0-171-17 conntrack[3178]:  [UPDATE] tcp      6 120 FIN_WAIT src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [UNREPLIED] src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=61029
May 13 22:05:25 ip-10-0-171-17 conntrack[3178]:  [UPDATE] tcp      6 120 FIN_WAIT src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=65349
May 13 22:05:25 ip-10-0-171-17 conntrack[3178]:  [UPDATE] tcp      6 30 LAST_ACK src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=65349
May 13 22:05:43 ip-10-0-171-17 conntrack[3178]:  [UPDATE] tcp      6 30 LAST_ACK src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [UNREPLIED] src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=61029
May 13 22:05:43 ip-10-0-171-17 conntrack[3178]:  [UPDATE] tcp      6 120 TIME_WAIT src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=65349
May 13 22:06:24 ip-10-0-171-17 conntrack[3178]: [DESTROY] tcp      6 LAST_ACK src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [ASSURED] mark=2 zone=64000
May 13 22:06:27 ip-10-0-171-17 conntrack[3178]: [DESTROY] tcp      6 LAST_ACK src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [UNREPLIED] src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=61029

For some reason conntrack thinks the connection is destroyed, even though it isn't, and then creates 2 new entries, one of which changes the source port. Will attach this pcap as well.

Comment 11 Tim Rozet 2022-05-13 22:31:38 UTC
iptables output:
sh-4.4# iptables-save
# Generated by iptables-save v1.8.4 on Fri May 13 22:24:12 2022
*raw
:PREROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A PREROUTING -p tcp -m tcp --sport 9642 -j TRACE
-A PREROUTING -p udp -m udp --dport 6081 -j NOTRACK
-A OUTPUT -p udp -m udp --dport 6081 -j NOTRACK
COMMIT
# Completed on Fri May 13 22:24:12 2022
# Generated by iptables-save v1.8.4 on Fri May 13 22:24:12 2022
*mangle
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:OVN-KUBE-ITP - [0:0]
-A PREROUTING -p tcp -m tcp --sport 9642 -j TRACE
-A OUTPUT -j OVN-KUBE-ITP
COMMIT
# Completed on Fri May 13 22:24:12 2022
# Generated by iptables-save v1.8.4 on Fri May 13 22:24:12 2022
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
-A INPUT -j KUBE-FIREWALL
-A INPUT -p tcp -m tcp --sport 9642 -j TRACE
-A FORWARD -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22624 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -p tcp -m tcp --dport 22623 --tcp-flags FIN,SYN,RST,ACK SYN -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -j KUBE-FIREWALL
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FIREWALL ! -s 127.0.0.0/8 -d 127.0.0.0/8 -m comment --comment "block incoming localnet connections" -m conntrack ! --ctstate RELATED,ESTABLISHED,DNAT -j DROP
COMMIT
# Completed on Fri May 13 22:24:12 2022
# Generated by iptables-save v1.8.4 on Fri May 13 22:24:12 2022
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:OVN-KUBE-SNAT-MGMTPORT - [0:0]
:OVN-KUBE-ITP - [0:0]
:OVN-KUBE-NODEPORT - [0:0]
:OVN-KUBE-EXTERNALIP - [0:0]
:OVN-KUBE-ETP - [0:0]
-A PREROUTING -j OVN-KUBE-ETP
-A PREROUTING -j OVN-KUBE-EXTERNALIP
-A PREROUTING -j OVN-KUBE-NODEPORT
-A POSTROUTING -o ovn-k8s-mp0 -j OVN-KUBE-SNAT-MGMTPORT
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A OUTPUT -j OVN-KUBE-EXTERNALIP
-A OUTPUT -j OVN-KUBE-NODEPORT
-A OUTPUT -j OVN-KUBE-ITP
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
-A OVN-KUBE-SNAT-MGMTPORT -o ovn-k8s-mp0 -m comment --comment "OVN SNAT to Management Port" -j SNAT --to-source 10.131.0.2
-A OVN-KUBE-SNAT-MGMTPORT -p tcp -m tcp --dport 30858 -j RETURN
-A OVN-KUBE-SNAT-MGMTPORT -p tcp -m tcp --dport 31285 -j RETURN
-A OVN-KUBE-NODEPORT -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 30858 -j DNAT --to-destination 172.30.37.190:443
-A OVN-KUBE-NODEPORT -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 31285 -j DNAT --to-destination 172.30.37.190:80
-A OVN-KUBE-ETP -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 30858 -j DNAT --to-destination 169.254.169.3:30858
-A OVN-KUBE-ETP -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 31285 -j DNAT --to-destination 169.254.169.3:31285
COMMIT
# Completed on Fri May 13 22:24:12 2022

Comment 12 Tim Rozet 2022-05-13 22:33:18 UTC
Created attachment 1879578 [details]
2nd pcap with ports 48804 and 65349

Comment 13 Tim Rozet 2022-05-13 23:34:46 UTC
The problem no longer occurs if conntrack is skipped for these connections:
sh-4.4# iptables -t raw -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
CT         tcp  --  anywhere             anywhere             tcp spt:9642 NOTRACK
TRACE      tcp  --  anywhere             anywhere             tcp spt:9642
CT         udp  --  anywhere             anywhere             udp dpt:geneve NOTRACK

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
CT         tcp  --  anywhere             anywhere             tcp dpt:9642 NOTRACK
CT         udp  --  anywhere             anywhere             udp dpt:geneve NOTRACK
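
A hedged reconstruction of the commands behind the listing above (rule position and the OUTPUT direction are assumptions; the 9642 NOTRACK rules are the new part relative to comment 11):

# skip conntrack for SBDB traffic in both directions
iptables -t raw -I PREROUTING -p tcp --sport 9642 -j NOTRACK
iptables -t raw -I OUTPUT -p tcp --dport 9642 -j NOTRACK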

Comment 14 Florian Westphal 2022-05-16 07:14:48 UTC
(In reply to Tim Rozet from comment #10)
> Managed to reproduce while monitoring conntrack events. It looks like this
> is caused by conntrack. In this example the original connection is:
> 10.0.173.53:9642 (SBDB server) <----> 10.0.171.17:48804 (client
> ovn-controller)
> 
> When the problem happens, client changes its port to be 65349, and in
> conntrack I see:
> [trozet@fedora 2nd]$ cat conntrack.log | egrep 'port=65349|port=48804'
    [NEW] tcp    120 SYN_SENT src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804
    [NEW] tcp    120 SYN_SENT src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 mark=2 zone=64000
 [UPDATE] tcp     60  SYN_RECV src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804
 [UPDATE] tcp 432000 ESTABLISHED src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [ASSURED]
[DESTROY] tcp 431999 ESTABLISHED src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [ASSURED]
    [NEW] tcp    300 ESTABLISHED src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=48804 [UNREPLIED] src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=61029


   [NEW] tcp     300 ESTABLISHED src=10.0.171.17 dst=10.0.173.53 sport=48804 dport=9642 [UNREPLIED] src=10.0.173.53 dst=10.0.171.17 sport=9642 dport=65349

Ok, thanks, this is a great catch. So the culprit is bug #2006167.
As a workaround you could disable mid-stream pickup with net.netfilter.nf_conntrack_tcp_loose=0.
NOTRACK works as well, of course.

What's happening here is that (for an as-yet-unknown reason) the conntrack entry is removed, and then the next packet is picked up as new in the *wrong* direction: sport is seen as 9642 and dport as 48804. That then triggers the automatic port rewrite that was supposed to work around port shadowing. This will be reverted in the 8.6 zstream via bug #2065266.
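
For anyone who needs the workaround before the zstream fix lands, a minimal sketch (the persistence path is an assumption):

# disable conntrack mid-stream pickup so a lost entry is not re-created in the wrong direction
sysctl -w net.netfilter.nf_conntrack_tcp_loose=0
# persist across reboots
echo 'net.netfilter.nf_conntrack_tcp_loose = 0' > /etc/sysctl.d/99-ct-tcp-loose.conf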

Comment 15 Dan Williams 2022-05-16 17:06:11 UTC
Talking with Florian, it sounds like the conntrack deletion itself should be root-caused, since preventing it would avoid the port rewrite in the first place.

Comment 16 Tim Rozet 2022-05-16 20:27:47 UTC
So on my F35 with kernel 5.16.18-200.fc35.x86_64 I see similarly odd behavior:

 [UPDATE] tcp      6 120 FIN_WAIT src=172.18.0.5 dst=172.18.0.3 sport=41002 dport=6642 src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41002 [ASSURED]
 [UPDATE] tcp      6 30 LAST_ACK src=172.18.0.5 dst=172.18.0.3 sport=41002 dport=6642 src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41002 [ASSURED]
 [UPDATE] tcp      6 120 TIME_WAIT src=172.18.0.5 dst=172.18.0.3 sport=41002 dport=6642 src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41002 [ASSURED]

###ovn-controller container restarted, 41002 is old port, 41004 is new port
    [NEW] tcp      6 120 SYN_SENT src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=6642 [UNREPLIED] src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41004
    [NEW] tcp      6 120 SYN_SENT src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=6642 [UNREPLIED] src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41004 mark=2 zone=64000
 [UPDATE] tcp      6 60 SYN_RECV src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=6642 src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41004
 [UPDATE] tcp      6 432000 ESTABLISHED src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=6642 src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41004 [ASSURED]
[DESTROY] tcp      6 431999 ESTABLISHED src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=6642 src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41004 [ASSURED]   <-------------Destroyed establish session??
[DESTROY] tcp      6 107 TIME_WAIT src=172.18.0.5 dst=172.18.0.3 sport=41002 dport=6642 src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41002 [ASSURED]
    [NEW] tcp      6 300 ESTABLISHED src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41004 [UNREPLIED] src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=53254   <-------------Random new source port
    [NEW] tcp      6 300 ESTABLISHED src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=6642 [UNREPLIED] src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=31963    <-------------Random new source port
    [NEW] tcp      6 300 ESTABLISHED src=172.18.0.5 dst=172.18.0.3 sport=31963 dport=6642 [UNREPLIED] src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=31963 mark=2 zone=64000
[DESTROY] tcp      6 300 CLOSE src=172.18.0.5 dst=172.18.0.3 sport=31963 dport=6642 [UNREPLIED] src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=31963 mark=2 zone=64000
[DESTROY] tcp      6 300 CLOSE src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=6642 [UNREPLIED] src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=31963
[DESTROY] tcp      6 300 CLOSE src=172.18.0.3 dst=172.18.0.5 sport=6642 dport=41004 [UNREPLIED] src=172.18.0.5 dst=172.18.0.3 sport=41004 dport=53254
    [NEW] tcp      6 120 SYN_SENT src=172.18.0.5 dst=172.18.0.2 sport=49500 dport=6642 [UNREPLIED] src=172.18.0.2 dst=172.18.0.5 sport=6642 dport=49500
    [NEW] tcp      6 120 SYN_SENT src=172.18.0.5 dst=172.18.0.2 sport=49500 dport=6642 [UNREPLIED] src=172.18.0.2 dst=172.18.0.5 sport=6642 dport=49500 mark=2 zone=64000
 [UPDATE] tcp      6 60 SYN_RECV src=172.18.0.5 dst=172.18.0.2 sport=49500 dport=6642 src=172.18.0.2 dst=172.18.0.5 sport=6642 dport=49500
 [UPDATE] tcp      6 432000 ESTABLISHED src=172.18.0.5 dst=172.18.0.2 sport=49500 dport=6642 src=172.18.0.2 dst=172.18.0.5 sport=6642 dport=49500 [ASSURED]



Pcap output:
15:27:55.592623 02:42:ac:12:00:05 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 41575, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.0.5.41004 > 172.18.0.3.6642: Flags [S], cksum 0x585b (incorrect -> 0x3c34), seq 1643667136, win 64240, options [mss 1460,sackOK,TS val 2757007746 ecr 0,nop,wscale 7], length 0
15:27:55.592695 02:42:ac:12:00:03 > 02:42:ac:12:00:05, ethertype IPv4 (0x0800), length 74: (tos 0xc0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.0.3.6642 > 172.18.0.5.41004: Flags [S.], cksum 0x585b (incorrect -> 0x724e), seq 3988055409, ack 1643667137, win 65160, options [mss 1460,sackOK,TS val 2292607599 ecr 2757007746,nop,wscale 7], length 0
15:27:55.592715 02:42:ac:12:00:05 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 41576, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.5.41004 > 172.18.0.3.6642: Flags [.], cksum 0x5853 (incorrect -> 0x9dac), seq 1, ack 1, win 502, options [nop,nop,TS val 2757007747 ecr 2292607599], length 0

...rest of normal push/acks omitted for brevity

15:28:05.077279 02:42:ac:12:00:03 > 02:42:ac:12:00:05, ethertype IPv4 (0x0800), length 206: (tos 0xc0, ttl 64, id 55243, offset 0, flags [DF], proto TCP (6), length 192)
    172.18.0.3.6642 > 172.18.0.5.41004: Flags [P.], cksum 0x58df (incorrect -> 0x3239), seq 269440:269580, ack 3044, win 501, options [nop,nop,TS val 2292617083 ecr 2757007756], length 140
15:28:05.077338 02:42:ac:12:00:05 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 41596, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.5.31963 > 172.18.0.3.6642: Flags [.], cksum 0x5853 (incorrect -> 0x4d74), seq 1643670180, ack 3988324989, win 629, options [nop,nop,TS val 2757017231 ecr 2292617083], length 0

^^^wrong port in ACK!!!

15:28:05.077398 02:42:ac:12:00:03 > 02:42:ac:12:00:05, ethertype IPv4 (0x0800), length 54: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.18.0.3.6642 > 172.18.0.5.31963: Flags [R], cksum 0xd0af (correct), seq 3988324989, win 0, length 0
15:28:05.077493 02:42:ac:12:00:03 > 02:42:ac:12:00:05, ethertype IPv4 (0x0800), length 292: (tos 0xc0, ttl 64, id 55244, offset 0, flags [DF], proto TCP (6), length 278)
    172.18.0.3.6642 > 172.18.0.5.41004: Flags [P.], cksum 0x5935 (incorrect -> 0x23e2), seq 269580:269806, ack 3044, win 501, options [nop,nop,TS val 2292617083 ecr 2757007756], length 226
15:28:05.077510 02:42:ac:12:00:05 > 02:42:ac:12:00:03, ethertype IPv4 (0x0800), length 54: (tos 0xc0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    172.18.0.5.41004 > 172.18.0.3.6642: Flags [R], cksum 0xd0f8 (correct), seq 1643670180, win 0, length 0
15:28:05.077820 02:42:ac:12:00:05 > 02:42:ac:12:00:02, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 44267, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.0.5.49500 > 172.18.0.2.6642: Flags [S], cksum 0x585a (incorrect -> 0x2718), seq 3594600488, win 64240, options [mss 1460,sackOK,TS val 508935613 ecr 0,nop,wscale 7], length 0
15:28:05.078033 02:42:ac:12:00:02 > 02:42:ac:12:00:05, ethertype IPv4 (0x0800), length 74: (tos 0xc0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.18.0.2.6642 > 172.18.0.5.49500: Flags [S.], cksum 0x585a (incorrect -> 0x3221), seq 4128814769, ack 3594600489, win 65160, options [mss 1460,sackOK,TS val 3827130469 ecr 508935613,nop,wscale 7], length 0
15:28:05.078068 02:42:ac:12:00:05 > 02:42:ac:12:00:02, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 44268, offset 0, flags [DF], proto TCP (6), length 52)
    172.18.0.5.49500 > 172.18.0.2.6642: Flags [.], cksum 0x5852 (incorrect -> 0x5d80), seq 1, ack 1, win 502, options [nop,nop,TS val 508935613 ecr 3827130469], length 0


Florian, does my kernel version contain the fix? Anything else we can do to help debug this? Do we need to clone this to a kernel bz?

Comment 17 Dan Williams 2022-05-16 22:16:40 UTC
(In reply to Tim Rozet from comment #16)
> Florian, does my kernel version contain the fix? Anything else we can do to
> help debug this? Do we need to clone this to a kernel bz?

It's already cloned to a kernel BZ: bug 2065266

But my understanding is that this won't stop whatever is deleting the conntrack entry from deleting it; so (I think?) we may still get a connection reset, but at least it won't wait for the full idle timeout before reconnecting. Could be wrong; Florian can correct me.

Comment 18 Dan Williams 2022-05-17 01:59:03 UTC
Tim, is it possible it's ovn-controller reconnecting to the OVS bridge when it restarts?

<kill ovn-controller>
May 17 01:48:38 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22533|stream_fd|DBG|recv: Connection reset by peer
May 17 01:48:38 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22534|rconn|WARN|br-int<->unix#119078: connection dropped (Connection reset by peer)
May 17 01:48:38 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22535|rconn|DBG|br-int<->unix#119078: entering DISCONNECTED
May 17 01:48:38 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22536|rconn|DBG|br-int<->unix#119079: connection closed by peer
May 17 01:48:38 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22537|rconn|DBG|br-int<->unix#119079: entering DISCONNECTED
May 17 01:48:38 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22538|rconn|DBG|br-int<->unix#119080: connection closed by peer
May 17 01:48:38 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22539|rconn|DBG|br-int<->unix#119080: entering DISCONNECTED

<ovn-controller reconnects>
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22819|poll_loop|DBG|wakeup due to [POLLIN] on fd 49 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:274 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22820|rconn|DBG|br-int<->unix#119114: entering CONNECTING
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22821|poll_loop|DBG|wakeup due to [POLLOUT] on fd 10 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:153 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22822|vconn|DBG|unix#119114: sent (Success): OFPT_HELLO (OF1.5) (xid=0x1a88e):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22823|poll_loop|DBG|wakeup due to [POLLIN] on fd 10 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22824|vconn|DBG|unix#119114: received: OFPT_HELLO (OF1.5) (xid=0x1):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  version bitmap: 0x06
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22825|vconn|DBG|unix#119114: negotiated OpenFlow version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22826|rconn|DBG|br-int<->unix#119114: connected
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22827|rconn|DBG|br-int<->unix#119114: entering ACTIVE
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22828|vconn|DBG|unix#119114: received: OFPST_METER_FEATURES request (OF1.5) (xid=0x2):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22829|vconn|DBG|unix#119114: sent (Success): OFPST_METER_FEATURES reply (OF1.5) (xid=0x2):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: max_meter:200000 max_bands:1 max_color:0
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: band_types: drop
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: capabilities: kbps pktps burst stats
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22830|rconn|DBG|br-int<->unix#119115: entering CONNECTING
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22831|poll_loop|DBG|wakeup due to [POLLOUT] on fd 60 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:153 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22832|vconn|DBG|unix#119115: sent (Success): OFPT_HELLO (OF1.5) (xid=0x1a88f):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22833|vconn|DBG|unix#119115: received: OFPT_HELLO (OF1.5) (xid=0x3):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  version bitmap: 0x06
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22834|vconn|DBG|unix#119115: negotiated OpenFlow version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22835|rconn|DBG|br-int<->unix#119115: connected
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22836|rconn|DBG|br-int<->unix#119115: entering ACTIVE
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22837|poll_loop|DBG|wakeup due to [POLLIN] on fd 10 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22838|vconn|DBG|unix#119114: received: OFPST_METER_FEATURES request (OF1.5) (xid=0x4):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22839|vconn|DBG|unix#119114: sent (Success): OFPST_METER_FEATURES reply (OF1.5) (xid=0x4):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: max_meter:200000 max_bands:1 max_color:0
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: band_types: drop
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: capabilities: kbps pktps burst stats
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22840|poll_loop|DBG|wakeup due to [POLLIN] on fd 60 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22841|vconn|DBG|unix#119115: received: NXT_TLV_TABLE_REQUEST (OF1.5) (xid=0x5):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22842|vconn|DBG|unix#119115: sent (Success): NXT_TLV_TABLE_REPLY (OF1.5) (xid=0x5):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  max option space=256 max fields=64
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  allocated option space=4
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  mapping table:
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:   class  type  length  match field
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  ------  ----  ------  --------------
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:   0x102  0x80       4  tun_metadata0
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22843|poll_loop|DBG|wakeup due to [POLLIN] on fd 60 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22844|vconn|DBG|unix#119115: received: OFPT_FLOW_MOD (OF1.5) (xid=0x6): DEL table:255 priority=0 actions=drop
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22845|vconn|DBG|unix#119115: received: OFPT_GROUP_MOD (OF1.5) (xid=0x7):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  DEL group_id=4294967292,type=all
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22846|vconn|DBG|unix#119115: received: OFPT_METER_MOD (OF1.5) (xid=0x8): DEL meter=all bands=
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22847|poll_loop|DBG|wakeup due to 0-ms timeout at ../ofproto/ofproto-dpif.c:1963 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|01858|poll_loop(urcu5)|DBG|wakeup due to [POLLIN] on fd 40 (FIFO pipe:[18354]) at ../lib/ovs-rcu.c:363 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22848|hmap|DBG|../ofproto/ofproto-dpif-xlate.c:889: 1 bucket with 6+ nodes, including 1 bucket with 6 nodes (16 nodes total across 16 buckets)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|01859|poll_loop(urcu5)|DBG|wakeup due to [POLLIN] on fd 40 (FIFO pipe:[18354]) at ../lib/ovs-rcu.c:236 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|03015|poll_loop(revalidator55)|DBG|wakeup due to [POLLIN] on fd 46 (FIFO pipe:[42171]) at ../ofproto/ofproto-dpif-upcall.c:1021 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|01860|dpif(urcu5)|DBG|Dropped 37 log messages in last 0 seconds (most recently, 0 seconds ago) due to excessive rate
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|03016|dpif(revalidator55)|DBG|system@ovs-system: get_stats success
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|01861|dpif(urcu5)|DBG|system@ovs-system: DPIF meter 1 deleted
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|00961|poll_loop(revalidator54)|DBG|wakeup due to [POLLIN] on fd 44 (FIFO pipe:[44105]) at ../lib/ovs-thread.c:341 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|00996|poll_loop(revalidator53)|DBG|wakeup due to [POLLIN] on fd 42 (FIFO pipe:[41383]) at ../lib/ovs-thread.c:341 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|00962|poll_loop(revalidator54)|DBG|wakeup due to [POLLIN] on fd 44 (FIFO pipe:[44105]) at ../lib/ovs-thread.c:341 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|00997|poll_loop(revalidator53)|DBG|wakeup due to [POLLIN] on fd 42 (FIFO pipe:[41383]) at ../lib/ovs-thread.c:341 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|00998|poll_loop(revalidator53)|DBG|wakeup due to 0-ms timeout at ../lib/ovs-thread.c:341 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|03017|poll_loop(revalidator55)|DBG|wakeup due to [POLLIN] on fd 46 (FIFO pipe:[42171]) at ../lib/ovs-thread.c:341 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22849|poll_loop|DBG|wakeup due to [POLLIN] on fd 54 (FIFO pipe:[28824]) at ../vswitchd/bridge.c:423 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|01862|poll_loop(urcu5)|DBG|wakeup due to [POLLIN] on fd 40 (FIFO pipe:[18354]) at ../lib/ovs-rcu.c:236 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22850|poll_loop|DBG|wakeup due to [POLLIN] on fd 49 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:274 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22851|rconn|DBG|br-int<->unix#119116: entering CONNECTING
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22852|poll_loop|DBG|wakeup due to [POLLOUT] on fd 62 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:153 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22853|vconn|DBG|unix#119116: sent (Success): OFPT_HELLO (OF1.5) (xid=0x1a890):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  version bitmap: 0x01, 0x02, 0x03, 0x04, 0x05, 0x06
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22854|vconn|DBG|unix#119116: received: OFPT_HELLO (OF1.5) (xid=0xb):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]:  version bitmap: 0x06
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22855|vconn|DBG|unix#119116: negotiated OpenFlow version 0x06 (we support version 0x06 and earlier, peer supports version 0x06)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22856|rconn|DBG|br-int<->unix#119116: connected
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22857|rconn|DBG|br-int<->unix#119116: entering ACTIVE
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22858|vconn|DBG|unix#119116: received: OFPT_GET_CONFIG_REQUEST (OF1.5) (xid=0x36):
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22859|vconn|DBG|unix#119116: sent (Success): OFPT_GET_CONFIG_REPLY (OF1.5) (xid=0x36): frags=normal miss_send_len=0
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22860|vconn|DBG|unix#119116: received: NXT_SET_PACKET_IN_FORMAT (OF1.5) (xid=0x37): format=nxt_packet_in2
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22861|vconn|DBG|unix#119116: received: OFPT_PACKET_OUT (OF1.5) (xid=0x39): in_port=CONTROLLER actions=set_field:0xc->metadata,set_field:0x8000->reg15,set_field:0x10000/0x10000->reg10,resubmit(CONTROLLER,38) data_len=46
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: igmp,vlan_tci=0x0000,dl_src=0a:58:0a:83:00:01,dl_dst=01:00:5e:00:00:01,nw_src=10.131.0.1,nw_dst=224.0.0.1,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=17,igmp_code=10
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22862|vconn|DBG|unix#119116: received: OFPT_SET_CONFIG (OF1.5) (xid=0x43): frags=normal miss_send_len=65535
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22863|poll_loop|DBG|wakeup due to [POLLIN] on fd 60 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (0% CPU usage)
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22864|vconn|DBG|unix#119115: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x9): zone_id=0
May 17 01:49:03 ip-10-0-171-17 ovs-vswitchd[1361]: ovs|22865|ct_dpif|DBG|system@ovs-system: ct_flush: zone 0

note the ct_flush for zone 0...

Comment 19 Surya Seetharaman 2022-05-17 12:03:00 UTC
Yes, it seems to be the ovn-controller reconnect that causes this; confirmed on KIND:

[DESTROY] tcp      6 96 TIME_WAIT src=172.18.0.4 dst=172.18.0.3 sport=6642 dport=55346 src=172.18.0.3 dst=172.18.0.4 sport=55346 dport=6642 delta-time=48 [start=Tue May 17 11:50:50 2022] [stop=Tue May 17 11:51:38 2022]
[DESTROY] tcp      6 431999 ESTABLISHED src=172.18.0.3 dst=172.18.0.4 sport=55350 dport=6642 src=172.18.0.4 dst=172.18.0.3 sport=6642 dport=55350 [ASSURED] delta-time=0 [start=Tue May 17 11:51:38 2022] [stop=Tue May 17 11:51:38 2022]
[NEW] tcp      6 300 ESTABLISHED src=172.18.0.4 dst=172.18.0.3 sport=6642 dport=55350 [UNREPLIED] src=172.18.0.3 dst=172.18.0.4 sport=55350 dport=6642

(I seem to have the fix Florian mentioned so my new entry gets the same port ^ and things work fine)

but the point is that the secondary entry gets destroyed at 11:51:38, and indeed at the same time in the OVS logs we see:

2022-05-17T11:51:38.895Z|06747|vconn|DBG|unix#5018: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x8): zone_id=0

Comment 20 Tim Rozet 2022-05-17 14:33:16 UTC
OVN bug: https://bugzilla.redhat.com/show_bug.cgi?id=2087194

Comment 21 Dan Williams 2022-05-17 14:46:39 UTC
Follow-up... ovn-controller flushes specific conntrack zones when it restarts to ensure a known state. It doesn't know whether it was quickly restarted (like on an OCP upgrade) or whether it's a fresh start; only the thing doing the restarting really knows that. It will flush the conntrack zones chosen for each LSP and those configured by the controller. ovn-kube sets LSP.options:snat-ct-zone=0 for the GW router port (since ovnkube shares the host IP with OVN), which is why it flushes zone 0 on restart.

We're discussing preventing flushing all zones by CMS/ovnkube choice, or never flushing zone 0, or other options.

This zone flush has always happened in ovn-controller, but the bad behavior is only surfaced because of the conntrack issue in RHEL 8.6 that is solved by the revert in https://bugzilla.redhat.com/show_bug.cgi?id=2065266
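
A hedged way to confirm which OVN port carries the zone-0 setting on a given cluster (the exact table to query is an assumption; run it wherever ovn-nbctl can reach the NB DB, e.g. inside an ovnkube-master pod):

# list ports whose options explicitly pin snat-ct-zone to 0
ovn-nbctl find Logical_Router_Port options:snat-ct-zone=0
ovn-nbctl find Logical_Switch_Port options:snat-ct-zone=0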

Comment 22 Tim Rozet 2022-05-18 14:47:07 UTC
Dan, I'm not sure this flush has always happened in ovn-controller. I am not able to reproduce in an older version of OVN. See https://bugzilla.redhat.com/show_bug.cgi?id=2087194#c3

Comment 23 Dan Williams 2022-05-18 16:41:17 UTC
(In reply to Tim Rozet from comment #22)
> Dan, I'm not sure this flush has always happened in ovn-controller. I am not
> able to reproduce in an older version of OVN. See
> https://bugzilla.redhat.com/show_bug.cgi?id=2087194#c3

It does look like it's been there for zone 0 forever: a misuse of atoi(), checking whether the result is 0 (which of course is the case for zone 0).

http://patchwork.ozlabs.org/project/ovn/patch/20220518160952.9998-1-dceara@redhat.com/

Comment 24 Tim Rozet 2022-05-18 16:53:54 UTC
Yeah, it does actually happen in the older version of OVN; it's just that our KIND deployment back then used local gateway mode by default instead of shared. In local gateway mode this isn't an issue. It affects shared gateway mode, which has been supported in 4.8 and later.
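
A hedged way to check which gateway mode a cluster is running (the config map name and section are assumptions based on how ovn-kubernetes is configured in OCP):

# shared vs. local gateway mode is recorded in the ovnkube config
oc -n openshift-ovn-kubernetes get cm ovnkube-config -o yaml | grep -A2 '\[gateway\]'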

Comment 25 W. Trevor King 2022-05-19 05:07:58 UTC
As we get close to cutting 4.11 feature candidates, I'm wondering if there is a risk here for 4.10 -> 4.11 updates. So I'm asking the following questions to evaluate whether or not this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?
* reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
* example: Customers upgrading from 4.10 to 4.11 running OVN.  Check your vulnerability with 'oc get -o jsonpath='{.status.networkType}{"\n"}' network cluster' or the following PromQL ...

Wondering about PromQL: is there an RFE somewhere about adding a metric that exposes the cluster Network's status.networkType? Because we're still dragging around [2] as a recording rule for ..., and that seems pretty awkward to use in-cluster, when the network type string seems like a great way to distinguish "is this cluster vulnerable to bug X?".

What is the impact? Is it serious enough to warrant removing update recommendations?
* reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
* example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
* example: Up to 90 seconds of API downtime. Check with curl ....
* example: etcd loses quorum and you have to restore from backup. Check with ssh ....

How involved is remediation?
* reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
* example: Issue resolves itself after five minutes.
* example: Admin can run a single: oc ....
* example: Admin must SSH to hosts, restore from backups, or other non standard admin activities.

Is this a regression?
* reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
* example: No, it has always been like this we just never noticed.
* example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1.

[1]: https://github.com/openshift/enhancements/tree/master/enhancements/update/update-blocker-lifecycle
[2]: https://github.com/openshift/telemeter/blob/987128a17c198270ff22cf36ff7b7df396f51826/jsonnet/telemeter/rules.libsonnet#L90-L107

Comment 26 Tim Rozet 2022-05-19 18:05:24 UTC
Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?

There is no increase in vulnerability. The primary bug, conntrack entries being removed during ovn-controller restart, has existed in shared gateway mode since 4.8. It was just compounded by a kernel bug found in RHCOS 8.6, which is not currently present in any OCP version.


What is the impact? Is it serious enough to warrant removing update recommendations?

The impact of the primary bug is that during ovn-controller restart (which happens on every node during an upgrade) there can be a temporary traffic outage/disruption. Since this has always existed, it doesn't warrant any new update recommendation other than moving to a version with the fix once it is backported across previous z-streams.

How involved is remediation?

The issue should resolve itself. Some TCP connections may become stale, but applications will typically reconnect. Other conntrack entries will be rebuilt. It mainly affects traffic that is NAT'ed, like service traffic or cluster egress traffic from pods.


Is this a regression?
No, it has existed since 4.8.

Comment 27 W. Trevor King 2022-05-19 21:37:21 UTC
> This was just compounded by a kernel bug found in RHCOS 8.6 which is not currently present in any OCP version.

But we'll shortly have 4.11 FCs built off RHCOS 8.6, right?  Anyhow, "temporary traffic outage/disruption" and "The issue should resolve itself" are sufficient for "not worth doing any graph-data work" for me, so I'm removing UpgradeBlocker.  But if folks feel like I'm misunderstanding, feel free to explain why and add UpgradeBlocker back on, and we'll do a fresh round of assessment.

Comment 34 Mike Fiedler 2022-05-26 14:07:48 UTC
Step 2 should be ovs-appctl -t /var/run/openvswitch/ovs-vswitchd.<pid>.ctl vlog/set dbg

Reproduced CT_FLUSH for zone_id=0 on 4.11.0-0.nightly-2022-05-20-213928. Next step is to test the fixed version.

May 26 14:05:43 ip-10-0-184-88 ovs-vswitchd[1238]: ovs|01069|vconn|DBG|unix#146: received: NXT_CT_FLUSH_ZONE (OF1.5) (xid=0x9): zone_id=0
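
For reference, the check here amounts to watching the vswitchd journal after enabling debug logging as in step 2 (the systemd unit name is an assumption for RHCOS):

# after killing ovn-controller, look for a zone-0 flush request reaching vswitchd
journalctl -u ovs-vswitchd | grep NXT_CT_FLUSH_ZONE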

Comment 35 Mike Fiedler 2022-05-26 14:26:02 UTC
Verified on 4.11.0-0.nightly-2022-05-25-123329. After following the test procedure in comment 32 and comment 34, no CT_FLUSH entries were seen after the kill/restart of ovn-controller.

Comment 37 errata-xmlrpc 2022-08-10 11:09:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

