Bug 1833012
| Summary: | Lower OVNKubernetes HTTP E/W performance compared with OpenShiftSDN | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Raul Sevilla <rsevilla> |
| Component: | Networking | Assignee: | Mark Gray <mark.d.gray> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Raul Sevilla <rsevilla> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | anbhat, anusaxen, bbennett, dcbw, dceara, fleitner, jeder, jtaleric, mark.d.gray, mifiedle, mwoodson, rsevilla, wking |
| Version: | 4.4 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:12:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
First pass on the level of re-circulations:
OCPSDN:
-> recirc_id(0),tunnel(tun_id=0x5ff4a9,src=10.0.158.227,dst=
-> recirc_id(0x8f8),tunnel(tun_id=0x5ff4a9,src=10.0.158.22
-> recirc_id(0x8f9),tunnel(tun_id=0x5ff4a9,src=10.0.158.
-> recirc_id(0x8f8),tunnel(tun_id=0x5ff4a9,src=10.0.158.22
-> recirc_id(0),tunnel(tun_id=0x8f5c1f,src=10.0.140.38,dst=1
-> recirc_id(0x993),tunnel(tun_id=0x8f5c1f,src=10.0.140.38
-> recirc_id(0x99f),tunnel(tun_id=0x8f5c1f,src=10.0.140.
-> recirc_id(0x99f),tunnel(tun_id=0x8f5c1f,src=10.0.140.
-> recirc_id(0x993),tunnel(tun_id=0x8f5c1f,src=10.0.140.38
-> recirc_id(0x99f),tunnel(tun_id=0x8f5c1f,src=10.0.140.
-> recirc_id(0x99f),tunnel(tun_id=0x8f5c1f,src=10.0.140.
-> recirc_id(0),in_port(10),ct_state(-trk),eth(),eth_type(0x
-> recirc_id(0x15d),in_port(10),ct_state(+trk),eth(),eth_t
-> recirc_id(0x161),in_port(10),ct_state(-rpl),eth(),eth
-> recirc_id(0x161),in_port(10),ct_state(-rpl),eth(),eth
-> recirc_id(0x161),in_port(10),ct_state(-rpl),eth(),eth
-> recirc_id(0),tunnel(tun_id=0x5ff4a9,src=10.0.161.173,dst=
-> recirc_id(0x978),tunnel(tun_id=0x5ff4a9,src=10.0.161.17
-> recirc_id(0x979),tunnel(tun_id=0x5ff4a9,src=10.0.161.
-> recirc_id(0x978),tunnel(tun_id=0x5ff4a9,src=10.0.161.17
[...]
There are some re-circulations, but not deep ones (mostly 3 levels).
OVN SDN:
-> recirc_id(0),in_port(5),eth(src=00:00:00:00:00:00/01:00:0
-> recirc_id(0x20),in_port(5),ct_state(+new-est-rel-rpl-in
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rpl-
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c:0
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rpl-
-> recirc_id(0x20),in_port(5),ct_state(-new+est-rel-rpl-in
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x1e947),in_port(5),ct_state(-new+est-rel
-> recirc_id(0x1e948),in_port(5),ct_state(-new+est-r
-> recirc_id(0x1e949),in_port(5),ct_state(+new-est
-> recirc_id(0x1e94a),in_port(5),ct_state(-new+e
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x20),in_port(5),ct_state(-new+est-rel-rpl-in
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x1e947),in_port(5),ct_state(-new+est-rel
-> recirc_id(0x1e948),in_port(5),ct_state(-new+est-r
-> recirc_id(0x1e949),in_port(5),ct_state(+new-est
-> recirc_id(0x1e94a),in_port(5),ct_state(-new+e
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
-> recirc_id(0x2e),in_port(5),eth(src=66:18:43:e0:4d:94,
-> recirc_id(0x49),in_port(5),ct_state(-new+est-rel-rp
-> recirc_id(0x4c),in_port(5),eth(dst=e6:85:2f:80:0c
-> recirc_id(0x49),in_port(5),ct_state(+new-est-rel-rp
The above is just one example of re-circulation.
I will try to identify the flows with the highest packet counts, which most likely represent the test traffic, and see if I can find the correct sequence.
to be continued.
fbl
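As a quick way to follow up on that, the busiest kernel datapath flows can be pulled out of the dump and sorted by packet count. This is only a sketch, assuming the "packets:N" field shown in the ovs-dpctl dump-flows output above:

----8<----
#!/bin/bash
# List the 20 datapath flows with the highest packet counts.
ovs-dpctl dump-flows \
  | sed -n 's/.*packets:\([0-9]\+\).*/\1 &/p' \
  | sort -rn \
  | head -20
----8<----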
Hi,
Could you provide a perf record from all CPUs while reproducing the issue, on both the RX and TX sides?
I suggest leaving the test running for a moment until it stabilizes and then starting 'perf record -a sleep 5'.
I hope that 5 seconds will not generate a huge data file.
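For reference, a minimal capture sequence on each host might look like the sketch below; the output file name is just an assumption, and the perf report step is only there for a quick local sanity check:

----8<----
#!/bin/bash
# Run on both the RX and TX hosts once the test traffic is steady.
perf record -a -o perf-$(hostname).data -- sleep 5    # system-wide, 5 seconds
perf report -i perf-$(hostname).data --stdio | head -50    # quick look at the hottest symbols
----8<----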
I am asking this because the OVN flow does the following:
port 9: 29f0f80135845ea
→recirc_id(0),in_port(9),eth(src=e6:85:2 |s:P., actions:ct(zone=3),recirc(0x1e8e8)
↳recirc_id(0x1e8e8),in_port(9),ct_stat |, actions:ct(zone=3,nat),recirc(0x1e8e9)
↳recirc_id(0x1e8e9),in_port(9),eth(s |s:P., actions:ct(zone=3),recirc(0x1e8eb)
↳recirc_id(0x1e8eb),in_port(9),ct_ |, actions:ct(zone=3,nat),recirc(0x1e8ec)
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8e9),in_port(9),eth(s |gs:., actions:ct(zone=3),recirc(0x1e8eb)
↳recirc_id(0x1e8eb),in_port(9),ct_ |, actions:ct(zone=3,nat),recirc(0x1e8ec)
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
↳recirc_id(0x1e8ec),in_port(9),c |=e6:85:2f:81:12:09)),set(ipv4(ttl=63)),1
[...]
It calls the ct() action 4 times for the same flow, while OpenShift SDN calls ct() only once.
port 5: vethd9111e25
→recirc_id(0),in_port(5),ct_state(-trk), |72s, flags:SFPR., actions:ct,recirc(0x1)
↳recirc_id(0x1),in_port(5),ct_state(+t |ags:SFP., actions:ct(commit),recirc(0xb)
↳recirc_id(0xb),in_port(5),eth(),eth |5682, used:0.571s, flags:SFP., actions:3
↳recirc_id(0xb),in_port(5),ct_state( |.38,ttl=64,tp_dst=4789,flags(df|key))),2
↳recirc_id(0xb),in_port(5),ct_state( |227,ttl=64,tp_dst=4789,flags(df|key))),2
↳recirc_id(0xb),in_port(5),ct_state( |153,ttl=64,tp_dst=4789,flags(df|key))),2
↳recirc_id(0x1),in_port(5),ct_state(-r |ytes:66, used:6.290s, flags:., actions:3
We should be able to see OVN using a lot more conntrack than OCPSDN.
In the meantime I will try to come up with a flow table that mimics the above and see if I can get the same performance impact.
BTW, are the systems identical? Can we confirm that the performance difference is not related to server hardware?
Thanks,
fbl
Hi,
One more test: could you please run a new test and provide a tcpdump from both interfaces at the bare metal level connecting the hosts? I will need full packet sizes (-s0), and you can limit the capture to 10000 packets (-c 10000) from the beginning of the flow to keep the pcap size down.
Thanks,
fbl
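A capture along the lines requested above might look like this; the interface name and output file name are assumptions and need to be adjusted per host:

----8<----
#!/bin/bash
# Capture 10000 full-size packets on the host interface carrying the test traffic.
tcpdump -i eth0 -s0 -c 10000 -w capture-$(hostname).pcap
----8<----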
Hi,
I built this flow table to simulate the problem of making several calls to conntrack:
→recirc_id(0),in_port(2),eth(),eth_type(0x0800),ipv4(frag=no), packets:64684, bytes:5541294, used:0.000s, flags:SFP., actions:ct(commit,zone=1,nat(src=10.0.0.10)),recirc(0x7e)
↳recirc_id(0x7e),in_port(2),ct_state(+trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:64684, bytes:5541294, used:0.000s, flags:SFP., actions:ct(commit,zone=2),recirc(0x7f)
↳recirc_id(0x7f),in_port(2),ct_state(+trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:64685, bytes:5541368, used:0.000s, flags:SFP., actions:ct(commit,zone=3),recirc(0x80)
↳recirc_id(0x80),in_port(2),ct_state(+trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:64685, bytes:5541368, used:0.000s, flags:SFP., actions:3
→recirc_id(0),in_port(3),ct_state(-trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:43125, bytes:6403972, used:0.000s, flags:SFP., actions:ct(zone=1,nat),recirc(0x81)
↳recirc_id(0x81),in_port(3),ct_state(+est),ct_zone(0x1),eth(),eth_type(0x0800),ipv4(frag=no), packets:43126, bytes:6404046, used:0.000s, flags:SFP., actions:2
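For anyone who wants to reproduce this locally, an OpenFlow rule set that should produce a kernel datapath cache similar to the one above is sketched below. The bridge name br0, the port numbers 2/3 and the NAT address are assumptions, and the recirculations come from the ct(...,table=N) clauses:

----8<----
#!/bin/bash
# Client-to-server direction: three chained conntrack calls (zones 1-3), then output.
ovs-ofctl add-flow br0 "table=0,priority=100,in_port=2,ip,actions=ct(commit,zone=1,nat(src=10.0.0.10),table=1)"
ovs-ofctl add-flow br0 "table=1,priority=100,in_port=2,ip,ct_state=+trk,actions=ct(commit,zone=2,table=2)"
ovs-ofctl add-flow br0 "table=2,priority=100,in_port=2,ip,ct_state=+trk,actions=ct(commit,zone=3,table=3)"
ovs-ofctl add-flow br0 "table=3,priority=100,in_port=2,ip,ct_state=+trk,actions=output:3"
# Return direction: un-NAT through zone 1 and deliver established traffic.
ovs-ofctl add-flow br0 "table=0,priority=100,in_port=3,ip,ct_state=-trk,actions=ct(zone=1,nat,table=1)"
ovs-ofctl add-flow br0 "table=1,priority=100,in_port=3,ip,ct_state=+trk+est,actions=output:2"
----8<----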
./mb/mb -d 10 -i requests.json
info: threads (28) > connections (1): lowering the number of threads to 1
Time: 10.01s
Sent: 2.02MiB, 206.12kiB/s
Recv: 5.90MiB, 603.38kiB/s
Hits: 19212, 1918.81/s
> Now with a single call to conntrack:
→recirc_id(0),in_port(2),eth(),eth_type(0x0800),ipv4(frag=no), packets:50096, bytes:4291628, used:0.000s, flags:SFP., actions:ct(commit,zone=1,nat(src=10.0.0.10)),3
→recirc_id(0),in_port(3),ct_state(-trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:33398, bytes:4959438, used:0.000s, flags:SFP., actions:ct(zone=1,nat),recirc(0x92)
↳recirc_id(0x92),in_port(3),ct_state(+est+trk),ct_zone(0x1),eth(),eth_type(0x0800),ipv4(frag=no), packets:33398, bytes:4959438, used:0.000s, flags:SFP., actions:2
# ./mb/mb -d 10 -i requests.json
info: threads (28) > connections (1): lowering the number of threads to 1
Time: 10.02s
Sent: 2.07MiB, 212.03kiB/s
Recv: 6.07MiB, 620.67kiB/s
Hits: 19770, 1973.82/s
> Now with direct flows:
→recirc_id(0),in_port(2),eth(),eth_type(0x0800),ipv4(frag=no), packets:120315, bytes:10306926, used:7.284s, flags:SFP., actions:3
→recirc_id(0),in_port(3),eth(),eth_type(0x0800),ipv4(frag=no), packets:80212, bytes:11911152, used:7.284s, flags:SFP., actions:2
# ./mb/mb -d 10 -i requests.json
info: threads (28) > connections (1): lowering the number of threads to 1
Time: 10.01s
Sent: 2.10MiB, 215.22kiB/s
Recv: 6.16MiB, 630.00kiB/s
Hits: 20052, 2003.49/s
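A quick way to double-check the roughly -4% figure stated next, using the Hits/s numbers from the first (3x conntrack) and last (direct flows) runs above, assuming bc is available:

echo "scale=4; (1918.81 - 2003.49) / 2003.49 * 100" | bc    # roughly -4.2%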
Therefore, the multiple calls to conntrack would account for about a -4% impact, not -44%.
Something else is going on (maybe I didn't reproduce the correct conntrack situation), but I will focus more on the traffic itself.
Please provide the tcpdumps as requested in comment#7.
Thanks
fbl
Attaching perf: captured on the client for both the OpenShift SDN and OVN sides with "perf record -F99 -a -- mb -i wc.xml -d 120".
And yes, both environments are identical: m5.4xlarge instances, with the client deployed in zone us-west-2a and the server in zone us-west-2b.
Regarding the results you obtained with the flow table you built, can you try again with a higher number of clients? The configuration from the description was incorrect; the correct one is described in comment #3. I have seen that the issue is more noticeable in multi-threaded configurations:
OVN, e.g.
With: "keep-alive-requests": 100, "clients": 50
sh-4.2$ mb -i wc.xml -d 10
Time: 10.10s
Sent: 21.69MiB, 2.15MiB/s
Recv: 293.92MiB, 29.10MiB/s
Hits: 244075, 24166.64/s
With: "keep-alive-requests": 100, "clients": 5
sh-4.2$ mb -i wc.xml -d 10
info: threads (8) > connections (5): lowering the number of threads to 5
Time: 10.10s
Sent: 6.96MiB, 705.65kiB/s
Recv: 94.32MiB, 9.34MiB/s
Hits: 78317, 7753.98/s
-----------------------------
OpenShift-SDN, e.g.
With: "keep-alive-requests": 100, "clients": 50
sh-4.2$ mb -i wk.xml -d 10
Time: 10.00s
Sent: 26.12MiB, 2.61MiB/s
Recv: 353.95MiB, 35.38MiB/s
Hits: 293913, 29377.76/s
With: "keep-alive-requests": 100, "clients": 5
sh-4.2$ mb -i wk.xml -d 10
info: threads (8) > connections (5): lowering the number of threads to 5
Time: 10.10s
Sent: 6.79MiB, 688.28kiB/s
Recv: 92.00MiB, 9.11MiB/s
Hits: 76384, 7563.02/s
We are tracking further OVN performance improvements for 4.6 here: https://issues.redhat.com/browse/SDN-694
Hi,
Based on comment #6 and a further look into OCP-OVN, we identified the reason for the calls to conntrack. One of the reasons is the stateful ACL including 'allow_related', which relies on conntrack. We could also see NAT being done twice for the same traffic (SNAT+DNAT). I worked on a flow table that simulates what is happening in OCP-OVN, but without OVN, so it can be changed as needed.
* baseline: straightforward connection, so no ACLs, SNAT or DNAT.
* stateful ACL: flow table using conntrack to allow new connections only from one side, and related packets on both sides.
              baseline         acl     Gain/Loss
Sent (MiB/s)     13.86        13.11      -5.43%
Recv (MiB/s)  1,120.00     1,060.00      -5.36%
Hits (h/s)  143,645.67   135,830.20      -5.44%
The above shows the cost of using a stateful ACL, which represents a 5.4% performance impact.
Now checking the cost of having a stateful ACL and dnat in each direction:
              baseline    dnat+acl     Gain/Loss
Sent (MiB/s)     13.86        12.00     -13.39%
Recv (MiB/s)  1,120.00       995.80     -11.09%
Hits (h/s)  143,645.67   124,405.40     -13.39%
The above number is roughly in line with the previous one, because dnat and the ACL each call conntrack, so the cost is about 2x.
Checking the cost of the full pipeline using stateful snat, dnat and acl:
              baseline  dnat+snat+acl  Gain/Loss
Sent (MiB/s)     13.86        11.48     -17.19%
Recv (MiB/s)  1,120.00       952.20     -14.98%
Hits (h/s)  143,645.67   118,961.80     -17.18%
The numbers are pretty much 3x the impact of using conntrack. Since snat, dnat and the acl each use conntrack, the cost is about 3x. So, given that OCP-SDN is only calling conntrack once (comment #6), I think all of the performance impact reported originally has been identified as the extra calls to conntrack.
fbl
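For context, the kind of stateful ACL flow table used for the baseline-vs-acl comparison above can be approximated with plain OpenFlow rules. This is only a sketch under assumed names (bridge br0, client on port 2, server on port 3), not the exact table used in the measurements:

----8<----
#!/bin/bash
# Send every untracked IP packet through conntrack first.
ovs-ofctl add-flow br0 "table=0,priority=100,ip,ct_state=-trk,actions=ct(table=1)"
# New connections are only accepted from the client side and get committed.
ovs-ofctl add-flow br0 "table=1,priority=100,in_port=2,ip,ct_state=+trk+new,actions=ct(commit),output:3"
# Established/related traffic is allowed in both directions.
ovs-ofctl add-flow br0 "table=1,priority=100,in_port=2,ip,ct_state=+trk+est,actions=output:3"
ovs-ofctl add-flow br0 "table=1,priority=90,in_port=3,ip,ct_state=+trk+est,actions=output:2"
ovs-ofctl add-flow br0 "table=1,priority=90,in_port=3,ip,ct_state=+trk+rel,actions=output:2"
# Everything else is dropped.
ovs-ofctl add-flow br0 "table=1,priority=0,actions=drop"
----8<----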
Hi,
As a potential optimization for use-cases where performance is required, OVN could use stateless ACLs and/or stateless DNAT/SNAT that provide the same functionality.
Here are the results of the two potential optimizations:
#1 This is an intermediate solution where the flow table uses a stateful ACL but stateless SNAT/DNAT.
* baseline: straight forward connection, so no ACLs or SNAT or DNAT.
* stateless snat/dnat: flow table using stateful ACL but stateless SNAT and DNAT
* Gain/Loss: percentage when comparing to the baseline
* Recover: percentage when comparing to current performance
              baseline  stateless dnat/snat  Gain/Loss  Recover
Sent (MiB/s)     13.86        12.77            -7.85%   +11.27%
Recv (MiB/s)  1,120.00     1,036.00            -7.50%    +8.80%
Hits (h/s)  143,645.67   132,338.00            -7.87%   +11.24%
The above shows that stateless dnat and snat are not free, because they still modify packets, but they are much cheaper than calling conntrack. The result is a performance impact of -7.8% compared to the straightforward path, or a performance gain of +11.27% compared to the current stateful OVN solution.
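To illustrate what stateless NAT means at the flow level, the address rewrite can be done directly with modify actions instead of ct(nat), so no conntrack entry is created. The bridge, ports and addresses below are assumptions for the sketch:

----8<----
#!/bin/bash
# Stateless DNAT towards the backend, and the matching reverse rewrite on the way back.
ovs-ofctl add-flow br0 "table=0,priority=100,in_port=2,ip,nw_dst=10.0.0.10,actions=mod_nw_dst:172.16.0.2,output:3"
ovs-ofctl add-flow br0 "table=0,priority=100,in_port=3,ip,nw_src=172.16.0.2,actions=mod_nw_src:10.0.0.10,output:2"
----8<----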
#2 This is a more radical solution where everything is stateless:
* baseline: straight forward connection, so no ACLs or SNAT or DNAT
* stateless: flow table using stateless ACL, SNAT and DNAT
* Gain/Loss: percentage when comparing to the baseline
* Recover: percentage when comparing to current performance
              baseline   stateless   Gain/Loss  Recover
Sent (MiB/s)     13.86      13.674     -1.34%   +19.13%
Recv (MiB/s)  1,120.00    1,110.00     -0.89%   +16.57%
Hits (h/s)  143,645.67     1416679     -1.37%   +19.10%
Using the above solution has a performance impact of -1.3% compared to the straightforward flow table, which is the simplest one, and a performance gain of +19.10% compared to the current stateful OVN solution.
The stateless ACL carries some risk, since it ignores the dynamic part of the protocol and cannot validate sequence/acknowledgement numbers or window sizes in the case of TCP, for example.
fbl
Further upstream improvements in mask lookup from Eelco: http://patchwork.ozlabs.org/project/openvswitch/list/?series=193095
Also a proposal to OVN by Numan to reduce the number of recirculations: http://patchwork.ozlabs.org/project/openvswitch/list/?series=191630
We have Numan's conntrack reductions in OpenShift 4.6 nightlies as of August 31st.
@dcbw -- Do you think this is resolved on 4.6? Should we move this to on_qe?
(In reply to Ben Bennett from comment #20)
> @dcbw -- Do you think this is resolved on 4.6? Should we move this to on_qe?
Yes, I believe it is. The perf/scale team and the OVN team have extensive testing that shows considerable improvement with ovn-kubernetes over openshift-sdn, now that the OVS kernel patches and some of the OVN improvements have landed.
Mark, I believe the necessary kernel is already (or very soon will be) part of the 4.6 nightlies via RHCOS. Can you confirm, and if so set the but to POST?
(In reply to Dan Williams from comment #24)
> Mark, I believe the necessary kernel is already (or very soon will be) part of the 4.6 nightlies via RHCOS. Can you confirm, and if so set the but to POST?
That should read "set this *bug* to POST"...
This kernel is in the nightlies for 4.5 and 4.6.
This hit 4.5.11 last week.
@rsevilla Are you able to verify this on 4.7?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633
Hi,
Could you please provide the output of the script below while reproducing the issue, from both the RX and TX hosts?
----8<----
#!/bin/bash
ovs-dpctl show > dpctl.txt
while :; do
    echo $( date ) >> dpctl.txt
    ovs-dpctl dump-flows >> dpctl.txt
    sleep 0.5
done
----8<----
Stop it with CTRL+C after you have reproduced the issue for at least a few seconds. More time is not a problem.
Note that it needs to run at the bare metal/host level and not inside a container, otherwise the command will return an error like:
ovs-dpctl: no datapaths exist
ovs-dpctl: datapath not found (Invalid argument)
Then could you repeat the same when reproducing with openshift-sdn? The outputs will tell us how the flows are being cached in the kernel. That will likely point us to sources of latency and throughput differences.
Thank you!
fbl