Bug 1833012
| Summary: | Lower OVNKubernetes HTTP E/W performance compared with OpenShiftSDN | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Raul Sevilla <rsevilla> |
| Component: | Networking | Assignee: | Mark Gray <mark.d.gray> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Raul Sevilla <rsevilla> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | anbhat, anusaxen, bbennett, dcbw, dceara, fleitner, jeder, jtaleric, mark.d.gray, mifiedle, mwoodson, rsevilla, wking |
| Version: | 4.4 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:12:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
First pass on the level of re-circulations:
OCPSDN:
-> recirc_id(0),tunnel(tun_id=0x5ff4a9,src=10.0.158.227,dst=
-> recirc_id(0x8f8),tunnel(tun_id=0x5ff4a9,src=10.0.158.22
-> recirc_id(0x8f9),tunnel(tun_id=0x5ff4a9,src=10.0.158.
-> recirc_id(0x8f8),tunnel(tun_id=0x5ff4a9,src=10.0.158.22
-> recirc_id(0),tunnel(tun_id=0x8f5c1f,src=10.0.140.38,dst=1
-> recirc_id(0x993),tunnel(tun_id=0x8f5c1f,src=10.0.140.38
-> recirc_id(0x99f),tunnel(tun_id=0x8f5c1f,src=10.0.140.
-> recirc_id(0x99f),tunnel(tun_id=0x8f5c1f,src=10.0.140.
-> recirc_id(0x993),tunnel(tun_id=0x8f5c1f,src=10.0.140.38
-> recirc_id(0x99f),tunnel(tun_id=0x8f5c1f,src=10.0.140.
-> recirc_id(0x99f),tunnel(tun_id=0x8f5c1f,src=10.0.140.
-> recirc_id(0),in_port(10),ct_state(-trk),eth(),eth_type(0x
-> recirc_id(0x15d),in_port(10),ct_state(+trk),eth(),eth_t
-> recirc_id(0x161),in_port(10),ct_state(-rpl),eth(),eth
-> recirc_id(0x161),in_port(10),ct_state(-rpl),eth(),eth
-> recirc_id(0x161),in_port(10),ct_state(-rpl),eth(),eth
-> recirc_id(0),tunnel(tun_id=0x5ff4a9,src=10.0.161.173,dst=
-> recirc_id(0x978),tunnel(tun_id=0x5ff4a9,src=10.0.161.17
-> recirc_id(0x979),tunnel(tun_id=0x5ff4a9,src=10.0.161.
-> recirc_id(0x978),tunnel(tun_id=0x5ff4a9,src=10.0.161.17
[...]
There are some re-circulations, but not deep ones (mostly 3 levels).
OVN SDN:
-> recirc_id(0),in_port(5),eth(src=00:00:00:00:00:00/01:00:0
-> recirc_id(0x20),in_port(5),ct_state(+new-est-rel-rpl-in
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rpl-
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c:0
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rpl-
-> recirc_id(0x20),in_port(5),ct_state(-new+est-rel-rpl-in
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x1e947),in_port(5),ct_state(-new+est-rel
-> recirc_id(0x1e948),in_port(5),ct_state(-new+est-r
-> recirc_id(0x1e949),in_port(5),ct_state(+new-est
-> recirc_id(0x1e94a),in_port(5),ct_state(-new+e
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x20),in_port(5),ct_state(-new+est-rel-rpl-in
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x1e947),in_port(5),ct_state(-new+est-rel
-> recirc_id(0x1e948),in_port(5),ct_state(-new+est-r
-> recirc_id(0x1e949),in_port(5),ct_state(+new-est
-> recirc_id(0x1e94a),in_port(5),ct_state(-new+e
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
The above is just one example of re-circulation.
I will try to identify the flows with the highest packet counts, which most likely represent the test traffic, and see if I can find the correct sequence.
to be continued.
fbl
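As a quick way to follow up on that, the busiest kernel datapath flows can be pulled out of the dump and sorted by packet count. This is only a sketch, assuming the "packets:N" field shown in the ovs-dpctl dump-flows output above:

----8<----
#!/bin/bash
# List the 20 datapath flows with the highest packet counts.
ovs-dpctl dump-flows \
  | sed -n 's/.*packets:\([0-9]\+\).*/\1 &/p' \
  | sort -rn \
  | head -20
----8<----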
Hi,
Could you provide a perf record from all CPUs while reproducing the issue, on both the RX and TX sides?
I suggest leaving the test running for a moment until it stabilizes and then starting 'perf record -a sleep 5'.
I hope that 5 seconds will not generate a huge data file.
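For reference, a minimal capture sequence on each host might look like the sketch below; the output file name is just an assumption, and the perf report step is only there for a quick local sanity check:

----8<----
#!/bin/bash
# Run on both the RX and TX hosts once the test traffic is steady.
perf record -a -o perf-$(hostname).data -- sleep 5    # system-wide, 5 seconds
perf report -i perf-$(hostname).data --stdio | head -50    # quick look at the hottest symbols
----8<----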
I am asking this because the OVN flow does the following:
port 9: 29f0f80135845ea
→recirc_id(0),in_port(9),eth(src=e6:85:2 |s:P., actions:ct(zone=3),recirc(0x1e8e8)
↳recirc_id(0x1e8e8),in_port(9),ct_stat |, actions:ct(zone=3,nat),recirc(0x1e8e9)
↳recirc_id(0x1e8e9),in_port(9),eth(s |s:P., actions:ct(zone=3),recirc(0x1e8eb)
↳recirc_id(0x1e8eb),in_port(9),ct_ |, actions:ct(zone=3,nat),recirc(0x1e8ec)
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8e9),in_port(9),eth(s |gs:., actions:ct(zone=3),recirc(0x1e8eb)
↳recirc_id(0x1e8eb),in_port(9),ct_ |, actions:ct(zone=3,nat),recirc(0x1e8ec)
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
[...]
It calls the ct() action 4 times for the same flow, while OpenShift SDN calls ct() only once.
port 5: vethd9111e25
→recirc_id(0),in_port(5),ct_state(-trk), |72s, flags:SFPR., actions:ct,recirc(0x1)
↳recirc_id(0x1),in_port(5),ct_state(+t |ags:SFP., actions:ct(commit),recirc(0xb)
↳recirc_id(0xb),in_port(5),eth(),eth |5682, used:0.571s, flags:SFP., actions:3
↳recirc_id(0xb),in_port(5),ct_state( |.38,ttl=64,tp_dst=4789,flags(df|key))),2
↳recirc_id(0xb),in_port(5),ct_state( |227,ttl=64,tp_dst=4789,flags(df|key))),2
↳recirc_id(0xb),in_port(5),ct_state( |153,ttl=64,tp_dst=4789,flags(df|key))),2
↳recirc_id(0x1),in_port(5),ct_state(-r |ytes:66, used:6.290s, flags:., actions:3
We should be able to see OVN using a lot more conntrack than OCPSDN.
In the meantime I will try to come up with a flow table that mimics the above and see if I can get the same performance impact.
BTW, are the systems identical? Can we confirm that the performance difference is not related to server hardware?
Thanks,
fbl
Hi,
One more test: could you please run a new test and provide a tcpdump from both interfaces at the bare metal level connecting the hosts? I will need full packet sizes (-s0), and you can limit the capture to 10000 packets (-c 10000) from the beginning of the flow to keep the pcap size down.
Thanks,
fbl
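A capture along the lines requested above might look like this; the interface name and output file name are assumptions and need to be adjusted per host:

----8<----
#!/bin/bash
# Capture 10000 full-size packets on the host interface carrying the test traffic.
tcpdump -i eth0 -s0 -c 10000 -w capture-$(hostname).pcap
----8<----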
Hi,
I built this flow table to simulate the problem of making several calls to conntrack:
→recirc_id(0),in_port(2),eth(),eth_type(0x0800),ipv4(frag=no), packets:64684, bytes:5541294, used:0.000s, flags:SFP., actions:ct(commit,zone=1,nat(src=10.0.0.10)),recirc(0x7e)
↳recirc_id(0x7e),in_port(2),ct_state(+trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:64684, bytes:5541294, used:0.000s, flags:SFP., actions:ct(commit,zone=2),recirc(0x7f)
↳recirc_id(0x7f),in_port(2),ct_state(+trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:64685, bytes:5541368, used:0.000s, flags:SFP., actions:ct(commit,zone=3),recirc(0x80)
↳recirc_id(0x80),in_port(2),ct_state(+trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:64685, bytes:5541368, used:0.000s, flags:SFP., actions:3
→recirc_id(0),in_port(3),ct_state(-trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:43125, bytes:6403972, used:0.000s, flags:SFP., actions:ct(zone=1,nat),recirc(0x81)
↳recirc_id(0x81),in_port(3),ct_state(+est),ct_zone(0x1),eth(),eth_type(0x0800),ipv4(frag=no), packets:43126, bytes:6404046, used:0.000s, flags:SFP., actions:2
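For anyone who wants to reproduce this locally, an OpenFlow rule set that should produce a kernel datapath cache similar to the one above is sketched below. The bridge name br0, the port numbers 2/3 and the NAT address are assumptions, and the recirculations come from the ct(...,table=N) clauses:

----8<----
#!/bin/bash
# Client-to-server direction: three chained conntrack calls (zones 1-3), then output.
ovs-ofctl add-flow br0 "table=0,priority=100,in_port=2,ip,actions=ct(commit,zone=1,nat(src=10.0.0.10),table=1)"
ovs-ofctl add-flow br0 "table=1,priority=100,in_port=2,ip,ct_state=+trk,actions=ct(commit,zone=2,table=2)"
ovs-ofctl add-flow br0 "table=2,priority=100,in_port=2,ip,ct_state=+trk,actions=ct(commit,zone=3,table=3)"
ovs-ofctl add-flow br0 "table=3,priority=100,in_port=2,ip,ct_state=+trk,actions=output:3"
# Return direction: un-NAT through zone 1 and deliver established traffic.
ovs-ofctl add-flow br0 "table=0,priority=100,in_port=3,ip,ct_state=-trk,actions=ct(zone=1,nat,table=1)"
ovs-ofctl add-flow br0 "table=1,priority=100,in_port=3,ip,ct_state=+trk+est,actions=output:2"
----8<----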
./mb/mb -d 10 -i requests.json
info: threads (28) > connections (1): lowering the number of threads to 1
Time: 10.01s
Sent: 2.02MiB, 206.12kiB/s
Recv: 5.90MiB, 603.38kiB/s
Hits: 19212, 1918.81/s
> Now with a single call to conntrack:
→recirc_id(0),in_port(2),eth(),eth_type(0x0800),ipv4(frag=no), packets:50096, bytes:4291628, used:0.000s, flags:SFP., actions:ct(commit,zone=1,nat(src=10.0.0.10)),3
→recirc_id(0),in_port(3),ct_state(-trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:33398, bytes:4959438, used:0.000s, flags:SFP., actions:ct(zone=1,nat),recirc(0x92)
↳recirc_id(0x92),in_port(3),ct_state(+est+trk),ct_zone(0x1),eth(),eth_type(0x0800),ipv4(frag=no), packets:33398, bytes:4959438, used:0.000s, flags:SFP., actions:2
# ./mb/mb -d 10 -i requests.json
info: threads (28) > connections (1): lowering the number of threads to 1
Time: 10.02s
Sent: 2.07MiB, 212.03kiB/s
Recv: 6.07MiB, 620.67kiB/s
Hits: 19770, 1973.82/s
> Now with direct flows:
→recirc_id(0),in_port(2),eth(),eth_type(0x0800),ipv4(frag=no), packets:120315, bytes:10306926, used:7.284s, flags:SFP., actions:3
→recirc_id(0),in_port(3),eth(),eth_type(0x0800),ipv4(frag=no), packets:80212, bytes:11911152, used:7.284s, flags:SFP., actions:2
# ./mb/mb -d 10 -i requests.json
info: threads (28) > connections (1): lowering the number of threads to 1
Time: 10.01s
Sent: 2.10MiB, 215.22kiB/s
Recv: 6.16MiB, 630.00kiB/s
Hits: 20052, 2003.49/s
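A quick way to double-check the roughly -4% figure stated next, using the Hits/s numbers from the first (3x conntrack) and last (direct flows) runs above, assuming bc is available:

echo "scale=4; (1918.81 - 2003.49) / 2003.49 * 100" | bc    # roughly -4.2%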
Therefore, the multiple calls to conntrack would account for about a -4% impact, not -44%.
Something else is going on (maybe I didn't reproduce the correct conntrack situation), but I will focus more on the traffic itself.
Please provide the tcpdumps as requested in comment#7.
Thanks
fbl
Attaching perf: captured on the client for both the OpenShift SDN and OVN sides with "perf record -F99 -a -- mb -i wc.xml -d 120".
And yes, both environments are identical: m5.4xlarge instances, with the client deployed in zone us-west-2a and the server in zone us-west-2b.
Regarding the results you obtained with the flow table you built, can you try again with a higher number of clients? The configuration from the description was incorrect; the correct one is described in comment #3. I have seen that the issue is more noticeable in multi-threaded configurations:
OVN, e.g.
With: "keep-alive-requests": 100, "clients": 50
sh-4.2$ mb -i wc.xml -d 10
Time: 10.10s
Sent: 21.69MiB, 2.15MiB/s
Recv: 293.92MiB, 29.10MiB/s
Hits: 244075, 24166.64/s
With: "keep-alive-requests": 100, "clients": 5
sh-4.2$ mb -i wc.xml -d 10
info: threads (8) > connections (5): lowering the number of threads to 5
Time: 10.10s
Sent: 6.96MiB, 705.65kiB/s
Recv: 94.32MiB, 9.34MiB/s
Hits: 78317, 7753.98/s
-----------------------------
OpenShift-SDN, e.g.
With: "keep-alive-requests": 100, "clients": 50
sh-4.2$ mb -i wk.xml -d 10
Time: 10.00s
Sent: 26.12MiB, 2.61MiB/s
Recv: 353.95MiB, 35.38MiB/s
Hits: 293913, 29377.76/s
With: "keep-alive-requests": 100, "clients": 5
sh-4.2$ mb -i wk.xml -d 10
info: threads (8) > connections (5): lowering the number of threads to 5
Time: 10.10s
Sent: 6.79MiB, 688.28kiB/s
Recv: 92.00MiB, 9.11MiB/s
Hits: 76384, 7563.02/s
We are tracking further OVN performance improvements for 4.6 here: https://issues.redhat.com/browse/SDN-694
Hi,
Based on comment #6 and a further look into OCP-OVN, we identified the reason for the calls to conntrack. One of the reasons is the stateful ACL including 'allow_related', which relies on conntrack. We could also see NAT being done twice for the same traffic (SNAT+DNAT). I worked on a flow table that simulates what is happening in OCP-OVN, but without OVN, so it can be changed as needed.
* baseline: straightforward connection, so no ACLs, SNAT or DNAT.
* stateful ACL: flow table using conntrack to allow new connections only from one side, and related packets on both sides.
              baseline         acl     Gain/Loss
Sent (MiB/s)     13.86        13.11      -5.43%
Recv (MiB/s)  1,120.00     1,060.00      -5.36%
Hits (h/s)  143,645.67   135,830.20      -5.44%
The above shows the cost of using a stateful ACL, which represents a 5.4% performance impact.
Now checking the cost of having a stateful ACL and dnat in each direction:
              baseline    dnat+acl     Gain/Loss
Sent (MiB/s)     13.86        12.00     -13.39%
Recv (MiB/s)  1,120.00       995.80     -11.09%
Hits (h/s)  143,645.67   124,405.40     -13.39%
The above number is roughly in line with the previous one, because dnat and the ACL each call conntrack, so the cost is about 2x.
Checking the cost of the full pipeline using stateful snat, dnat and acl:
              baseline  dnat+snat+acl  Gain/Loss
Sent (MiB/s)     13.86        11.48     -17.19%
Recv (MiB/s)  1,120.00       952.20     -14.98%
Hits (h/s)  143,645.67   118,961.80     -17.18%
The numbers are pretty much 3x the impact of using conntrack. Since snat, dnat and the acl each use conntrack, the cost is about 3x. So, given that OCP-SDN is only calling conntrack once (comment #6), I think all of the performance impact reported originally has been identified as the extra calls to conntrack.
fbl
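For context, the kind of stateful ACL flow table used for the baseline-vs-acl comparison above can be approximated with plain OpenFlow rules. This is only a sketch under assumed names (bridge br0, client on port 2, server on port 3), not the exact table used in the measurements:

----8<----
#!/bin/bash
# Send every untracked IP packet through conntrack first.
ovs-ofctl add-flow br0 "table=0,priority=100,ip,ct_state=-trk,actions=ct(table=1)"
# New connections are only accepted from the client side and get committed.
ovs-ofctl add-flow br0 "table=1,priority=100,in_port=2,ip,ct_state=+trk+new,actions=ct(commit),output:3"
# Established/related traffic is allowed in both directions.
ovs-ofctl add-flow br0 "table=1,priority=100,in_port=2,ip,ct_state=+trk+est,actions=output:3"
ovs-ofctl add-flow br0 "table=1,priority=90,in_port=3,ip,ct_state=+trk+est,actions=output:2"
ovs-ofctl add-flow br0 "table=1,priority=90,in_port=3,ip,ct_state=+trk+rel,actions=output:2"
# Everything else is dropped.
ovs-ofctl add-flow br0 "table=1,priority=0,actions=drop"
----8<----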
Hi,
As a potential optimization for use-cases where performance is required, OVN could use stateless ACLs and/or stateless DNAT/SNAT that provide the same functionality.
Here are the results of the two potential optimizations:
#1 This is an intermediate solution where the flow table uses a stateful ACL but stateless SNAT/DNAT.
* baseline: straight forward connection, so no ACLs or SNAT or DNAT.
* stateless snat/dnat: flow table using stateful ACL but stateless SNAT and DNAT
* Gain/Loss: percentage when comparing to the baseline
* Recover: percentage when comparing to current performance
              baseline  stateless dnat/snat  Gain/Loss  Recover
Sent (MiB/s)     13.86        12.77            -7.85%   +11.27%
Recv (MiB/s)  1,120.00     1,036.00            -7.50%    +8.80%
Hits (h/s)  143,645.67   132,338.00            -7.87%   +11.24%
The above shows that stateless dnat and snat are not free, because they still modify packets, but they are much cheaper than calling conntrack. The result is a performance impact of -7.8% compared to the straightforward path, or a performance gain of +11.27% compared to the current stateful OVN solution.
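To illustrate what stateless NAT means at the flow level, the address rewrite can be done directly with modify actions instead of ct(nat), so no conntrack entry is created. The bridge, ports and addresses below are assumptions for the sketch:

----8<----
#!/bin/bash
# Stateless DNAT towards the backend, and the matching reverse rewrite on the way back.
ovs-ofctl add-flow br0 "table=0,priority=100,in_port=2,ip,nw_dst=10.0.0.10,actions=mod_nw_dst:172.16.0.2,output:3"
ovs-ofctl add-flow br0 "table=0,priority=100,in_port=3,ip,nw_src=172.16.0.2,actions=mod_nw_src:10.0.0.10,output:2"
----8<----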
#2 This is a more radical solution where everything is stateless:
* baseline: straight forward connection, so no ACLs or SNAT or DNAT
* stateless: flow table using stateless ACL, SNAT and DNAT
* Gain/Loss: percentage when comparing to the baseline
* Recover: percentage when comparing to current performance
              baseline   stateless   Gain/Loss  Recover
Sent (MiB/s)     13.86      13.674     -1.34%   +19.13%
Recv (MiB/s)  1,120.00    1,110.00     -0.89%   +16.57%
Hits (h/s)  143,645.67     1416679     -1.37%   +19.10%
Using the above solution has a performance impact of -1.3% compared to the straightforward flow table, which is the simplest one, and a performance gain of +19.10% compared to the current stateful OVN solution.
The stateless ACL carries some risk, since it ignores the dynamic part of the protocol and cannot validate sequence/acknowledgement numbers or window sizes in the case of TCP, for example.
fbl
Further upstream improvements in mask lookup from Eelco: http://patchwork.ozlabs.org/project/openvswitch/list/?series=193095
Also a proposal to OVN by Numan to reduce the number of recirculations: http://patchwork.ozlabs.org/project/openvswitch/list/?series=191630
We have Numan's conntrack reductions in OpenShift 4.6 nightlies as of August 31st.
@dcbw -- Do you think this is resolved on 4.6? Should we move this to on_qe?
(In reply to Ben Bennett from comment #20)
> @dcbw -- Do you think this is resolved on 4.6? Should we move this to on_qe?
Yes, I believe it is. The perf/scale team and the OVN team have extensive testing that shows considerable improvement with ovn-kubernetes over openshift-sdn, now that the OVS kernel patches and some of the OVN improvements have landed.
Mark, I believe the necessary kernel is already (or very soon will be) part of the 4.6 nightlies via RHCOS. Can you confirm, and if so set the but to POST?
(In reply to Dan Williams from comment #24)
> Mark, I believe the necessary kernel is already (or very soon will be) part of the 4.6 nightlies via RHCOS. Can you confirm, and if so set the but to POST?
That should read "set this *bug* to POST"...
This kernel is in the nightlies for 4.5 and 4.6.
This hit 4.5.11 last week.
@rsevilla Are you able to verify this on 4.7?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633
Hi,
Could you please provide the output of the script below while reproducing the issue, from both the RX and TX hosts?
----8<----
#!/bin/bash
ovs-dpctl show > dpctl.txt
while :; do
    echo $( date ) >> dpctl.txt
    ovs-dpctl dump-flows >> dpctl.txt
    sleep 0.5
done
----8<----
Stop it with CTRL+C after you have reproduced the issue for at least a few seconds. More time is not a problem.
Note that it needs to run at the bare metal/host level and not inside a container, otherwise the command will return an error like:
ovs-dpctl: no datapaths exist
ovs-dpctl: datapath not found (Invalid argument)
Then could you repeat the same when reproducing with openshift-sdn? The outputs will tell us how the flows are being cached in the kernel. That will likely point us to sources of latency and throughput differences.
Thank you!
fbl