Bug 1844576 - OVS packet re-ordering due to upcalls
Summary: OVS packet re-ordering due to upcalls
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.13
Version: RHEL 8.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Flavio Leitner
QA Contact: qding
URL:
Whiteboard:
Depends On: 1992773
Blocks:
 
Reported: 2020-06-05 17:12 UTC by mcambria@redhat.com
Modified: 2023-01-20 10:04 UTC
CC: 23 users

Fixed In Version: openvswitch2.16-2.16.0-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-03 14:11:31 UTC
Target Upstream Version:
Embargoed:


Attachments
GCP veth capture (488.78 KB, application/vnd.tcpdump.pcap)
2020-06-19 19:38 UTC, mcambria@redhat.com
GCP genev_sys_6081 capture (804.97 KB, application/vnd.tcpdump.pcap)
2020-06-19 19:38 UTC, mcambria@redhat.com
AWS veth capture (994.26 KB, application/vnd.tcpdump.pcap)
2020-06-19 19:50 UTC, mcambria@redhat.com
AWS genev_sys_6081 capture (1.05 MB, application/vnd.tcpdump.pcap)
2020-06-19 19:50 UTC, mcambria@redhat.com


Links
Red Hat Issue Tracker FD-703 (last updated 2021-08-26 08:52:08 UTC)

Comment 16 mcambria@redhat.com 2020-06-19 19:37:00 UTC
The same test has been done on GCP and packet loss has been seen as well.  GCP does not use jumbo frames.

Capture done on the host side of the veth pair looks fine: packets are in order, and TCP segments are larger than the MSS, with offload expected to happen later.
Capture done on genev_sys_6081 shows packets truncated and out of order.

veth capture, first 2 packets after the 3-way handshake:

18:46:59.202655 IP (tos 0x0, ttl 64, id 19045, offset 0, flags [DF], proto TCP (6), length 2100)
    10.128.10.10.57700 > 10.128.8.16.5222: Flags [P.], cksum 0x2f40 (incorrect -> 0x2b17), seq 1:2049, ack 1, win 207, options [nop,nop,TS val 3946345783 ecr 2379807047], length 2048
18:46:59.202719 IP (tos 0x0, ttl 64, id 19047, offset 0, flags [DF], proto TCP (6), length 1360)
    10.128.10.10.57700 > 10.128.8.16.5222: Flags [.], cksum 0x2c5c (incorrect -> 0xeea1), seq 2049:3357, ack 1, win 207, options [nop,nop,TS val 3946345783 ecr 2379807047], length 1308 


geneve capture, all packets after the 3-way handshake up to the packet containing seq #1:

18:46:59.203394 IP (tos 0x0, ttl 63, id 19047, offset 0, flags [DF], proto TCP (6), length 1360)
    10.128.10.10.57700 > 10.128.8.16.5222: Flags [.], cksum 0xeea1 (correct), seq 2049:3357, ack 1, win 207, options [nop,nop,TS val 3946345783 ecr 2379807047], length 1308
18:46:59.203397 IP (tos 0x0, ttl 63, id 19054, offset 0, flags [DF], proto TCP (6), length 1360)
    10.128.10.10.57700 > 10.128.8.16.5222: Flags [.], cksum 0x677a (correct), seq 11205:12513, ack 1, win 207, options [nop,nop,TS val 3946345783 ecr 2379807047], length 1308
18:46:59.203409 IP (tos 0x0, ttl 63, id 19050, offset 0, flags [DF], proto TCP (6), length 1360)
    10.128.10.10.57700 > 10.128.8.16.5222: Flags [.], cksum 0x6220 (correct), seq 5973:7281, ack 1, win 207, options [nop,nop,TS val 3946345783 ecr 2379807047], length 1308
18:46:59.203420 IP (tos 0x0, ttl 63, id 19052, offset 0, flags [DF], proto TCP (6), length 1360)
    10.128.10.10.57700 > 10.128.8.16.5222: Flags [.], cksum 0xea52 (correct), seq 8589:9897, ack 1, win 207, options [nop,nop,TS val 3946345783 ecr 2379807047], length 1308
18:46:59.203424 IP (tos 0x0, ttl 63, id 19045, offset 0, flags [DF], proto TCP (6), length 1360)
    10.128.10.10.57700 > 10.128.8.16.5222: Flags [.], cksum 0xf4c9 (correct), seq 1:1309, ack 1, win 207, options [nop,nop,TS val 3946345783 ecr 2379807047], length 1308 

4 packets are received (according to "tcpdump -n -i genev_sys_6081") before seq 1:1309.  What veth sent was seq 1:2049, so the packet got a "haircut".
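
For anyone re-checking these captures, a quick way to surface the reordering and retransmissions is tshark's TCP analysis flags. This is just a sketch; the filename is a placeholder for the GCP genev_sys_6081 attachment below:

$ tshark -r gcp_genev_sys_6081.pcap -Y 'tcp.analysis.out_of_order || tcp.analysis.retransmission' \
      -T fields -e frame.time_relative -e ip.id -e tcp.seq -e tcp.len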

Comment 17 mcambria@redhat.com 2020-06-19 19:38:20 UTC
Created attachment 1698154 [details]
GCP veth capture

Comment 18 mcambria@redhat.com 2020-06-19 19:38:50 UTC
Created attachment 1698155 [details]
GCP genev_sys_6081 capture

Comment 19 mcambria@redhat.com 2020-06-19 19:49:40 UTC
Moving to AWS, problems are seen as well, but they differ from BM and GCP.

Note: 

AWS uses jumbo frames.

Using the same test, the veth capture never sees TCP segments larger than the MSS.  I don't know why.  So no attempt at TSO/GSO is needed.

Capture done on the host side of the veth pair looks fine; packets are in order.
Capture done on genev_sys_6081 shows packets received in the wrong order.

The captured returning ACKs and SACK blocks are consistent with the order in which tcpdump shows packets received via "tcpdump -i genev_sys_6081".
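
Since the IP ID is preserved across the veth -> geneve path (note id 36600/36601 swapping places in the captures below), diffing the ip.id sequence of the sending direction in the two pcaps shows the reordering directly. A sketch, using the capture files read below:

$ tshark -r veth0.pcap   -Y 'tcp.srcport == 50774' -T fields -e ip.id > veth.ids
$ tshark -r geneve0.pcap -Y 'tcp.srcport == 50774' -T fields -e ip.id > geneve.ids
$ diff veth.ids geneve.ids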


veth capture:

$ tcpdump -n -r veth0.pcap -n -vvv port 5222            
reading from file veth0.pcap, link-type EN10MB (Ethernet)
11:34:59.220355 IP (tos 0x0, ttl 64, id 36597, offset 0, flags [DF], proto TCP (6), length 60)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [S], cksum 0x255b (incorrect -> 0xbb69), seq 2497118578, win 26583, options [mss 8861,sackOK,TS val 1381965636 ecr 0,nop,wscale 7], length 0
11:34:59.222774 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.128.6.27.5222 > 10.128.10.18.50774: Flags [S.], cksum 0x5fd9 (correct), seq 488040960, ack 2497118579, win 26547, options [mss 8861,sackOK,TS val 3805835699 ecr 1381965636,nop,wscale 7], length 0
11:34:59.222802 IP (tos 0x0, ttl 64, id 36598, offset 0, flags [DF], proto TCP (6), length 52)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x2553 (incorrect -> 0x1270), seq 1, ack 1, win 208, options [nop,nop,TS val 1381965638 ecr 3805835699], length 0
11:34:59.222864 IP (tos 0x0, ttl 64, id 36599, offset 0, flags [DF], proto TCP (6), length 2100)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [P.], cksum 0x2d53 (incorrect -> 0x1370), seq 1:2049, ack 1, win 208, options [nop,nop,TS val 1381965639 ecr 3805835699], length 2048
11:34:59.222937 IP (tos 0x0, ttl 64, id 36600, offset 0, flags [DF], proto TCP (6), length 8901)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x47e4 (incorrect -> 0x1644), seq 2049:10898, ack 1, win 208, options [nop,nop,TS val 1381965639 ecr 3805835699], length 8849
11:34:59.222996 IP (tos 0x0, ttl 64, id 36601, offset 0, flags [DF], proto TCP (6), length 8901)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x47e4 (incorrect -> 0x5110), seq 10898:19747, ack 1, win 208, options [nop,nop,TS val 1381965639 ecr 3805835699], length 8849
11:34:59.224879 IP (tos 0x0, ttl 63, id 47101, offset 0, flags [DF], proto TCP (6), length 64)
    10.128.6.27.5222 > 10.128.10.18.50774: Flags [.], cksum 0x3786 (correct), seq 1, ack 1, win 346, options [nop,nop,TS val 3805835701 ecr 1381965636,nop,nop,sack 1 {10898:19747}], length 0
11:34:59.224896 IP (tos 0x0, ttl 64, id 36602, offset 0, flags [DF], proto TCP (6), length 8901)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x47e4 (incorrect -> 0x4362), seq 19747:28596, ack 1, win 208, options [nop,nop,TS val 1381965641 ecr 3805835701], length 8849


geneve capture:

$ tcpdump -n -r geneve0.pcap -n -vvv port 5222
reading from file geneve0.pcap, link-type EN10MB (Ethernet)
11:34:59.220833 IP (tos 0x0, ttl 63, id 36597, offset 0, flags [DF], proto TCP (6), length 60)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [S], cksum 0xbb69 (correct), seq 2497118578, win 26583, options [mss 8861,sackOK,TS val 1381965636 ecr 0,nop,wscale 7], length 0
11:34:59.222481 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.128.6.27.5222 > 10.128.10.18.50774: Flags [S.], cksum 0x5fd9 (correct), seq 488040960, ack 2497118579, win 26547, options [mss 8861,sackOK,TS val 3805835699 ecr 1381965636,nop,wscale 7], length 0
11:34:59.223321 IP (tos 0x0, ttl 63, id 36601, offset 0, flags [DF], proto TCP (6), length 8901)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x5110 (correct), seq 10898:19747, ack 1, win 208, options [nop,nop,TS val 1381965639 ecr 3805835699], length 8849
11:34:59.223327 IP (tos 0x0, ttl 63, id 36600, offset 0, flags [DF], proto TCP (6), length 8901)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x1644 (correct), seq 2049:10898, ack 1, win 208, options [nop,nop,TS val 1381965639 ecr 3805835699], length 8849
11:34:59.223363 IP (tos 0x0, ttl 63, id 36598, offset 0, flags [DF], proto TCP (6), length 52)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x1270 (correct), seq 1, ack 1, win 208, options [nop,nop,TS val 1381965638 ecr 3805835699], length 0
11:34:59.223386 IP (tos 0x0, ttl 63, id 36599, offset 0, flags [DF], proto TCP (6), length 2100)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [P.], cksum 0x1370 (correct), seq 1:2049, ack 1, win 208, options [nop,nop,TS val 1381965639 ecr 3805835699], length 2048
11:34:59.224849 IP (tos 0x0, ttl 63, id 47101, offset 0, flags [DF], proto TCP (6), length 64)
    10.128.6.27.5222 > 10.128.10.18.50774: Flags [.], cksum 0x3786 (correct), seq 1, ack 1, win 346, options [nop,nop,TS val 3805835701 ecr 1381965636,nop,nop,sack 1 {10898:19747}], length 0
11:34:59.224923 IP (tos 0x0, ttl 63, id 36602, offset 0, flags [DF], proto TCP (6), length 8901)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x47e4 (incorrect -> 0x4362), seq 19747:28596, ack 1, win 208, options [nop,nop,TS val 1381965641 ecr 3805835701], length 8849
11:34:59.224947 IP (tos 0x0, ttl 63, id 36603, offset 0, flags [DF], proto TCP (6), length 8901)
    10.128.10.18.50774 > 10.128.6.27.5222: Flags [.], cksum 0x47e4 (incorrect -> 0xd1a8), seq 28596:37445, ack 1, win 208, options [nop,nop,TS val 1381965641 ecr 3805835701], length 8849
11:34:59.224955 IP (tos 0x0, ttl 63, id 47102, offset 0, flags [DF], proto TCP (6), length 64)
    10.128.6.27.5222 > 10.128.10.18.50774: Flags [.], cksum 0x598c (correct), seq 1, ack 1, win 484, options [nop,nop,TS val 3805835702 ecr 1381965636,nop,nop,sack 1 {2049:19747}], length 0


As with GCP, I'll attach the raw AWS captures next.

Comment 20 mcambria@redhat.com 2020-06-19 19:50:09 UTC
Created attachment 1698161 [details]
AWS veth capture

Comment 21 mcambria@redhat.com 2020-06-19 19:50:37 UTC
Created attachment 1698162 [details]
AWS genev_sys_6081 capture

Comment 22 Guillaume Nault 2020-06-24 17:00:57 UTC
There's a lot of packet reordering that happens between the veth and the geneve captures.
However, I can't see any packet loss in any of these captures.
The retransmissions in the bare metal and GCP cases don't happen because of packet loss: the timestamp echoed in the ACK is the timestamp of the original segment.

The only thing I can see that might affect TCP is the packet reordering that happens at the beginning of the connections. This affects all cases including AWS.
I suspect openvswitch has an impact here, as it's in the path between the veth and the geneve devices.

Comment 23 Guillaume Nault 2020-06-24 17:03:03 UTC
(In reply to mcambria from comment #16)
> 4 packets are received (according to "tcpdump -n -i genev_sys_6081") before
> seq 1:1309.  What veth sent was seq 1:2049, so the packet got a "haircut".

The original packet was segmented between the veth and the vxlan devices. The next segment is visible in the pcap, so no data was lost.
That shouldn't be a problem for TCP.

Comment 24 Dan Williams 2020-06-24 21:19:11 UTC
(In reply to Guillaume Nault from comment #23)
> (In reply to mcambria from comment #16)
> > 4 packets are received (according to "tcpdump -n -i genev_sys_6081") before
> > seq 1:1309.  What veth sent was seq 1:2049, so the packet got a "haircut".
> 
> The original packet was segmented between the veth and the vxlan devices.
> The next segment is visible in the pcap, so no data was lost.
> That shouldn't be a problem for TCP.

@gnault I assume you mean Geneve?

But looking at https://bugzilla.redhat.com/show_bug.cgi?id=1844576#c16 I still have some questions:

1) why are all the segments so badly out of order?
2) Why do we never see an "initial" segment for 1309:2049?

I would assume the process goes like this (which clearly isn't the case in these dumps):

a) veth sends a 20K packet
b) something (what?) segments the packet to send it out Geneve tunnel
c) 1st packet out geneve tunnel is 1:1309
d) 2nd packet is 1309:2049
e) 3rd packet is 2049:3357
f) etc

even if the first part takes a slow path/upcall, I'd expect:

a) veth sends a 20K packet
b) something (what?) segments the packet to send it out Geneve tunnel
c) 1st packet is 1309:2049
d) 2nd packet is 2049:3357
e) nth packet out geneve tunnel is 1:1309
f) etc

but we don't ever see the *initial* 1309:2049 bytes show up on geneve, do we? If so I can't find that.

Comment 25 Dan Williams 2020-06-24 21:22:59 UTC
(In reply to Dan Williams from comment #24)
> but we don't ever see the *initial* 1309:2049 bytes show up on geneve, do
> we? If so I can't find that.

I lied, we do see it.  But still way later...  The first question is still valid though: why are the segments so badly OoO?  That's going to kill performance, even if all the data eventually makes it to the other side.

Comment 26 Guillaume Nault 2020-06-24 23:05:48 UTC
(In reply to Dan Williams from comment #24)
> (In reply to Guillaume Nault from comment #23)
> > (In reply to mcambria from comment #16)
> > > 4 packets are received (according to "tcpdump -n -i genev_sys_6081") before
> > > seq 1:1309.  What veth sent was seq 1:2049, so the packet got a "haircut".
> > 
> > The original packet was segmented between the veth and the vxlan devices.
> > The next segment is visible in the pcap, so no data was lost.
> > That shouldn't be a problem for TCP.
> 
> @gnault I assume you mean Geneve?
> 
Yes, I meant Geneve, sorry.

> But looking at https://bugzilla.redhat.com/show_bug.cgi?id=1844576#c16 I
> still have some questions:
> 
> 1) why are all the segments so badly out of order?
> 2) Why do we never see an "initial" segment for 1309:2049?
> 
> I would assume the process goes like this (which clearly isn't the case in
> these dumps):
> 
> a) veth sends a 20K packet
> b) something (what?) segments the packet to send it out Geneve tunnel
> c) 1st packet out geneve tunnel is 1:1309
> d) 2nd packet is 1309:2049
> e) 3rd packet is 2049:3357
> f) etc
> 
GSO is compatible with UDP tunnels, so I'd expect non-segmented packets to be sent as is to the geneve device and to see segmentation occurring only on the physical device.
This is what happens later in the TCP connection: non-segmented packets are visible in the geneve capture.

> even if the first part takes a slow path/upcall, I'd expect:
> 
> a) veth sends a 20K packet
> b) something (what?) segments the packet to send it out Geneve tunnel
> c) 1st packet is 1309:2049
> d) 2nd packet is 2049:3357
> e) nth packet out geneve tunnel is 1:1309
> f) etc
> 
My guess is that openvswitch has to segment packets when it needs to move them to user space, because metadata like the GSO size is going to be lost.
Looking at net/openvswitch/datapath.c, it seems that ovs_dp_upcall() segments GSO-ed skbs and pushes all resulting segments to user space (with queue_gso_packets()).
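
If that is what happens, the captures should show it: packets that took the upcall path arrive on the geneve side already segmented down to MSS size (the 1308-byte pieces in comment 16), while packets that hit an installed kernel flow keep their GSO size (the non-segmented packets visible later in the geneve capture, as noted above). A sketch, with a placeholder filename for the GCP genev_sys_6081 attachment:

$ tshark -r gcp_genev_sys_6081.pcap -Y 'tcp.port == 5222 && tcp.len > 0' \
      -T fields -e frame.time_relative -e tcp.seq -e tcp.len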

Comment 27 Guillaume Nault 2020-06-24 23:28:40 UTC
(In reply to Dan Williams from comment #25)
> (In reply to Dan Williams from comment #24)
> > but we don't ever see the *initial* 1309:2049 bytes show up on geneve, do
> > we? If so I can't find that.
> 
> I lie we do see it. But still way later... First question is still valid
> though, why are the segments to badly OoO? That's going to kill performance,
> even if all the data eventually makes it to the other side.

I can't figure out why segments are out of order. But,
 * this only happens at the beginning of the connection, when the flow possibly isn't yet known by the kernel datapath,
 * and it happens on the path between the veth and the geneve interfaces.
That makes me think that openvswitch plays a role in this phenomenon.

My wild guess is that, since the initial window is 10 MSS, ovs probably doesn't have time to insert the new flow before all 10 MSS worth of data are sent, so all the resulting packets are pushed to user space.
Then, somewhere in the "ovs-kernel -> ovs-userspace -> ovs-kernel" path, packets are reordered.
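
One way to test that guess on a node (a sketch; the port number is taken from the capture in comment 19): poll the kernel datapath flow table while the connection starts, and see whether the flow only shows up after the initial window has already been sent:

# while true; do ovs-appctl dpctl/dump-flows | grep -c 'src=50774'; sleep 0.1; done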

Comment 28 mcambria@redhat.com 2020-06-25 16:19:44 UTC
We can reproduce packet ordering issues with OVN-Kubernetes (Geneve) on BM, GCP and AWS.  The same test, with the same version of OpenShift and OVS but using SDN (VXLAN), does not show the problem on GCP or AWS.  Using SDN, tcpdump on vxlan_sys_4789 shows every packet sent by veth showing up in the exact order it was sent.

I expected the "ovs-kernel -> ovs-userspace -> ovs-kernel" path to be identical.  This is one reason I focused on geneve initially.  That's the first difference.


sh-4.2# tcpdump -n -i vxlan_sys_4789 -n -vvv port 5222
tcpdump: listening on vxlan_sys_4789, link-type EN10MB (Ethernet), capture size 262144 bytes
14:09:04.510770 IP (tos 0x0, ttl 64, id 14821, offset 0, flags [DF], proto TCP (6), length 60)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [S], cksum 0x0511 (correct), seq 1337726548, win 27400, options [mss 1370,sackOK,TS val 454916709 ecr 0,nop,wscale 7], length 0
14:09:04.511481 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.128.2.9.5222 > 10.131.0.18.45354: Flags [S.], cksum 0x53f1 (correct), seq 3538339324, ack 1337726549, win 27160, options [mss 1370,sackOK,TS val 3401469532 ecr 454916709,nop,wscale 7], length 0
14:09:04.511890 IP (tos 0x0, ttl 64, id 14822, offset 0, flags [DF], proto TCP (6), length 52)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x1744 (incorrect -> 0xeba1), seq 1, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469532], length 0
14:09:04.512038 IP (tos 0x0, ttl 64, id 14823, offset 0, flags [DF], proto TCP (6), length 2100)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [P.], cksum 0x1f44 (incorrect -> 0xeca2), seq 1:2049, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469532], length 2048
14:09:04.512082 IP (tos 0x0, ttl 64, id 14825, offset 0, flags [DF], proto TCP (6), length 1410)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x1c92 (incorrect -> 0x96e2), seq 2049:3407, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469532], length 1358
14:09:04.512115 IP (tos 0x0, ttl 64, id 14826, offset 0, flags [DF], proto TCP (6), length 2768)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x21e0 (incorrect -> 0x4a58), seq 3407:6123, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469532], length 2716
14:09:04.512148 IP (tos 0x0, ttl 64, id 14828, offset 0, flags [DF], proto TCP (6), length 1410)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x1c92 (incorrect -> 0x50ec), seq 6123:7481, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469532], length 1358
14:09:04.512176 IP (tos 0x0, ttl 64, id 14829, offset 0, flags [DF], proto TCP (6), length 2768)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x21e0 (incorrect -> 0x977a), seq 7481:10197, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469532], length 2716
14:09:04.512205 IP (tos 0x0, ttl 64, id 14831, offset 0, flags [DF], proto TCP (6), length 1410)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x1c92 (incorrect -> 0x00be), seq 10197:11555, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469532], length 1358
14:09:04.512230 IP (tos 0x0, ttl 64, id 14832, offset 0, flags [DF], proto TCP (6), length 1410)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x1c92 (incorrect -> 0x8225), seq 11555:12913, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469532], length 1358
14:09:04.512253 IP (tos 0x0, ttl 64, id 24253, offset 0, flags [DF], proto TCP (6), length 52)
    10.128.2.9.5222 > 10.131.0.18.45354: Flags [.], cksum 0xe63f (correct), seq 1, ack 1359, win 234, options [nop,nop,TS val 3401469533 ecr 454916712], length 0
14:09:04.512282 IP (tos 0x0, ttl 64, id 24254, offset 0, flags [DF], proto TCP (6), length 52)
    10.128.2.9.5222 > 10.131.0.18.45354: Flags [.], cksum 0xe378 (correct), seq 1, ack 2049, win 255, options [nop,nop,TS val 3401469533 ecr 454916712], length 0
14:09:04.512346 IP (tos 0x0, ttl 64, id 14833, offset 0, flags [DF], proto TCP (6), length 2768)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x21e0 (incorrect -> 0xd08b), seq 12913:15629, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469533], length 2716
14:09:04.512366 IP (tos 0x0, ttl 64, id 14835, offset 0, flags [DF], proto TCP (6), length 2768)
    10.131.0.18.45354 > 10.128.2.9.5222: Flags [.], cksum 0x21e0 (incorrect -> 0xb635), seq 15629:18345, ack 1, win 215, options [nop,nop,TS val 454916712 ecr 3401469533], length 2716
14:09:04.512612 IP (tos 0x0, ttl 64, id 24255, offset 0, flags [DF], proto TCP (6), length 52)
    10.128.2.9.5222 > 10.131.0.18.45354: Flags [.], cksum 0xde15 (correct), seq 1, ack 3407, win 276, options [nop,nop,TS val 3401469533 ecr 454916712], length 0

Comment 29 mcambria@redhat.com 2020-06-26 20:21:53 UTC
Aaron looked at OVS.  Packets without a flow need an upcall.  Multiple packets for the same TCP connection were each taking a different path to the slowpath, resulting in their being injected into the fastpath out of order.  There is already an OVS fix for this.

A cluster was built with this fixed OVS and tcpdump capture looks a lot better.  There are still things that can be done to improve performance re how OVN sets up OVS.

Comment 30 Dan Williams 2020-06-26 21:11:43 UTC
(In reply to mcambria from comment #29)
> Aaron looked at OVS.  Packets without flow need upcall.  Multiple packets
> for the same TCP connection were each taking a different path to the
> slowpath, resulting in their being injected into fastpath out of order.  
> There already is a OVS fix for this.
> 
> A cluster was built with this fixed OVS and tcpdump capture looks a lot
> better.  There are still things that can be done to improve performance re
> how OVN sets up OVS.

To be clear, Aaron posted an RFC patch here http://patchwork.ozlabs.org/project/openvswitch/patch/20200624134800.2905038-1-aconole@redhat.com/ but it's not accepted yet upstream. It makes the packet ordering better by far, but not perfect, and just papers over the issues with OVN and upcalls.

Comment 31 Flavio Leitner 2020-06-29 21:18:55 UTC
Hi,

Upcalls use netlink, so one can use Netlink Monitor (nlmon) to see what is going on and perhaps identify the ordering issue there.
Unfortunately wireshark most probably can't decode the messages out-of-the-box.
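
For reference, a minimal nlmon capture setup looks like this (a sketch; nlmon0 is an arbitrary interface name):

# modprobe nlmon
# ip link add nlmon0 type nlmon
# ip link set dev nlmon0 up
# tcpdump -i nlmon0 -w netlink.pcap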

Another alternative is to set the number of upcall handler threads to one, which will force all packets to be processed by a single thread, and in order as a consequence.
# ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=1

You should see only one handler thread after that.  You might want to do that on both sides.
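
To check (a sketch), list the ovs-vswitchd threads after applying the setting; the upcall handlers show up as threads named "handler<N>":

# ps -T -p $(pidof ovs-vswitchd) | grep handler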

Another interesting thing to do is check why there is an upcall. It's not very usual to have flow rules applied to encapsulated egress traffic.
Perhaps there is an ACL or something else that can be disabled, and then the problem disappears.

HTH,
fbl

Comment 32 mcambria@redhat.com 2020-07-22 21:08:51 UTC
Latest version of the patch has been submitted:

https://bugzilla.redhat.com/show_bug.cgi?id=1834444#c2

Once accepted, the OoO packets should be addressed.

Comment 33 Dan Williams 2020-08-05 20:39:12 UTC
Aaron, have you been able to address Matteo's comments?

Comment 34 Aaron Conole 2020-08-24 11:59:17 UTC
Mark Gray will take up v3 of the patch, addressing the outstanding comment by Flavio.

Comment 36 Mark Gray 2020-11-05 08:43:37 UTC
This patch has been going through multiple iterations on the OVS mailing list.  The latest patch is in review:

https://patchwork.ozlabs.org/project/openvswitch/patch/20201028181706.16673-1-mark.d.gray@redhat.com/

We can successfully resolve this issue with this patch, as it limits upcalls to a single handler thread per port.  However, there are some concerns that this may limit scalability for some use cases because, in the current upstream code, multiple handler threads can handle upcalls for a given port.

Another approach is being explored in which we dispatch upcalls to handler threads (either in kernel space or user space).

Comment 39 Mark Gray 2021-04-30 15:42:08 UTC
Previous patch was basically NAKed. I have posted an alternative approach as an RFC:

https://marc.info/?l=linux-netdev&m=161979680725977&w=2
https://mail.openvswitch.org/pipermail/ovs-dev/2021-April/382618.html

Comment 40 Mark Gray 2021-07-08 16:14:39 UTC
This is now at v5 for user space patches: https://patchwork.ozlabs.org/project/openvswitch/list/?series=252282
v1 for kernel space patches: <https://marc.info/?l=linux-netdev&m=162504684016825&w=2>

Comment 46 Dan Williams 2021-12-06 20:27:06 UTC
This actually landed in OVS 2.16:

commit b1e517bd2f818fc7c0cd43ee0b67db4274e6b972
Author:     Mark Gray <mark.d.gray>
CommitDate: Fri Jul 16 20:05:03 2021 +0200

and should be present in all downstream openvswitch2.16 builds.

Comment 47 Clement Verna 2022-01-20 10:28:25 UTC
Hi Dan, will openvswitch2.16 be available in RHEL 8.4?  If this version is needed to fix https://bugzilla.redhat.com/show_bug.cgi?id=2018930, then we need it in RHEL 8.4, which is used by OCP 4.8.

Comment 48 Clement Verna 2022-01-20 10:33:36 UTC
(In reply to Clement Verna from comment #47)
> Hi Dan, will openvswitch2.16 be available in RHEL 8.4?  If this version is
> needed to fix https://bugzilla.redhat.com/show_bug.cgi?id=2018930, then we
> need it in RHEL 8.4, which is used by OCP 4.8.

Or will this patch be backported to openvswitch2.15 and released in the FDP repo (which seems to be the one we use in RHCOS)? :-)

Comment 61 Aaron Conole 2022-06-21 13:13:29 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1844576#c52 says which kernel has the fix - 8.5 kernel should include these bits.

NOTE: not every cause of reorder will be addressed.  There are some cases that can never be resolved (due to the nature of the upcall mechanism).

