This is one issue found while working on https://bugzilla.redhat.com/show_bug.cgi?id=2087604#16

Note: the iperf3 test ran between pods on the same node.

Some analysis:

The iperf3 test started at Fri, 13 May 2022 06:22:17 GMT and ran for 600 seconds. I can't see anything out of the ordinary in the OVS logs at this time to give a clue. For example, correlating:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet

I don't see a large increase of flow_mods (wouldn't we expect that if it were due to the classifier issue that AaronC fixed?):

ovs-vswitchd.log-20220414:
2022-04-13T06:22:10.318Z|98838|connmgr|INFO|br-int<->unix#231439: 1 flow_mods 10 s ago (1 adds)
2022-04-13T06:22:16.889Z|98839|connmgr|INFO|br-ex<->unix#243278: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:31.922Z|98840|connmgr|INFO|br-ex<->unix#243281: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:46.955Z|98841|connmgr|INFO|br-ex<->unix#243284: 2 flow_mods in the last 0 s (2 adds)

ovsdb-server.log-20220414:
2022-04-13T06:22:01.821Z|23014|jsonrpc|WARN|unix#263088: receive error: Connection reset by peer
2022-04-13T06:22:01.821Z|23015|reconnect|WARN|unix#263088: connection dropped (Connection reset by peer)
2022-04-13T06:23:01.961Z|23016|jsonrpc|WARN|unix#263098: receive error: Connection reset by peer
2022-04-13T06:23:01.961Z|23017|reconnect|WARN|unix#263098: connection dropped (Connection reset by peer)
2022-04-13T06:24:47.208Z|23018|jsonrpc|WARN|unix#263115: receive error: Connection reset by peer

But the dropwatch2.out file is full of upcall drops:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet
Fri May 13 06:22:30 2022
3230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:24:15 2022
1847 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:25:20 2022
1821 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:25 2022
1620 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:30 2022
2142 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:35 2022
259 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:25 2022
783 packets dropped at ovs_dp_process_packet
Fri May 13 06:26:30 2022
230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:55 2022
5052 packets dropped at ovs_dp_process_packet
Fri May 13 06:27:00 2022
82 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:27:45 2022
1077 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:00 2022
86 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:20 2022
1760 packets dropped at ovs_dp_process_packet
Fri May 13 06:28:25 2022
611 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:45 2022
1306 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:00 2022
996 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:35 2022
3511 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:45 2022
417 packets dropped at ovs_dp_process_packet

Maybe running the test with OVS debug enabled would give better clues.

===========================

The upcall buffer was identified as too small upstream and was bumped in the following commit:
https://github.com/openvswitch/ovs/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90
The downstream fix is available in OVS 2.16:
https://gitlab.cee.redhat.com/nst/openvswitch/openvswitch2.16/-/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90
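For reference, a quick way to tell whether drops like these are upcall-queue overruns (the case the buffer bump addresses) rather than flow-table churn is the "lost" counter in the datapath statistics. A minimal sketch, assuming shell access to the node running ovs-vswitchd:

$ ovs-appctl dpctl/show | grep lookups
# "lost" counts packets destined for userspace (upcalls) that were dropped before
# ovs-vswitchd could receive them, e.g. because the upcall socket buffer overflowed.
# A "lost" value that grows while the iperf3 test is running points at the upcall path.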
This fix has been part of OCP 4.11 since mid-February 2022 via https://github.com/openshift/os/pull/715
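A minimal sketch of how one might double-check that a node already carries an OVS build new enough to contain the fix (the node name is a placeholder, and the downstream package name can vary between releases):

$ oc debug node/<node-name> -- chroot /host ovs-vswitchd --version
$ oc debug node/<node-name> -- chroot /host rpm -qa | grep -i openvswitch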
Tested 4.11.0-0.nightly-2022-06-25-132614 on both the Packets baremetal dualstack cluster and a cluster-bot dualstack cluster. Results: 0 packets lost on the Packets dualstack cluster after kernel tuning, and packet loss reduced from 1.7% to 0.068% on the cluster-bot dualstack cluster after kernel tuning.

#### Packets baremetal cluster without kernel tuning

/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::f -u -i l -t 600 -p 6666
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 13:46:53 UTC
Connecting to host fd01:0:0:5::f, port 6666
      Cookie: lp5azkzqcwak7ujjl55i3mxwqh4ljlrpv44x
      Target Bitrate: 200000000
[ 5] local fd01:0:0:5::10 port 59553 connected to fd01:0:0:5::f port 6666
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603706
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603706 (0%)  sender
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  15840/56603706 (0.028%)  receiver
CPU Utilization: local/sender 16.9% (1.1%u/15.8%s), remote/receiver 12.4% (1.6%u/10.8%s)

iperf Done.
/ $

#### Packets baremetal cluster with kernel tuning

/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::f -u -i l -t 600 -p 6666
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 14:05:45 UTC
Connecting to host fd01:0:0:5::f, port 6666
      Cookie: tnmi6xfsq2qk2dmydnxhlg5q63idz4y44rqp
      Target Bitrate: 200000000
[ 5] local fd01:0:0:5::10 port 34628 connected to fd01:0:0:5::f port 6666
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603701
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603701 (0%)  sender
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  0/56603701 (0%)  receiver
CPU Utilization: local/sender 20.1% (2.8%u/17.3%s), remote/receiver 13.3% (2.4%u/10.9%s)

iperf Done.
/ $

#### cluster-bot baremetal cluster without kernel tuning

/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::10 -u -i l -t 600 -p 59553
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 13:50:59 UTC
Connecting to host fd01:0:0:5::10, port 59553
      Cookie: 5oipxhgympitohjzg6fabedtbqydaoef6lqb
      Target Bitrate: 200000000
[ 5] local fd01:0:0:5::11 port 41243 connected to fd01:0:0:5::10 port 59553
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603702
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603702 (0%)  sender
[ 5]   0.00-600.00 sec  13.7 GBytes   197 Mbits/sec  0.005 ms  966313/56603702 (1.7%)  receiver
CPU Utilization: local/sender 38.0% (5.2%u/32.7%s), remote/receiver 22.6% (3.8%u/18.8%s)

iperf Done.
/ $

#### cluster-bot baremetal cluster with kernel tuning

/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::10 -u -i l -t 600 -p 59553
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 14:05:23 UTC
Connecting to host fd01:0:0:5::10, port 59553
      Cookie: uvrhzzmb24iz3rok6xzapgdny7qxaarmbktk
      Target Bitrate: 200000000
[ 5] local fd01:0:0:5::11 port 51660 connected to fd01:0:0:5::10 port 59553
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603753
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603753 (0%)  sender
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.006 ms  38621/56603753 (0.068%)  receiver
CPU Utilization: local/sender 38.2% (5.5%u/32.7%s), remote/receiver 23.0% (4.0%u/19.0%s)

iperf Done.
/ $

fleitner, based on the above test results, do you think QE can verify this bug? Thanks!
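Side note: the exact "kernel tuning" applied in the runs above is not recorded in this comment. Purely as an illustration of the kind of tuning usually meant for a UDP test like this, one would raise the socket receive-buffer limits, e.g.:

# Illustrative values only, not the settings used in the runs above.
sysctl -w net.core.rmem_max=26214400
sysctl -w net.core.rmem_default=26214400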
#### Packets baremetal cluster with kernel tuning

sh-4.4# chroot /host
sh-4.4# ovs-appctl dpctl/show
system@ovs-system:
  lookups: hit:11614148 missed:199183 lost:0
  flows: 540
  masks: hit:43790003 total:63 hit/pkt:3.71
  cache: hit:10035165 hit-rate:84.95%
  caches:
    masks-cache: size:256
  port 0: ovs-system (internal)
  port 1: bond0
  port 2: br-ex (internal)
  port 3: br-int (internal)
  port 4: genev_sys_6081 (geneve: packet_type=ptap)
  port 5: ovn-k8s-mp0 (internal)
  port 6: dd02eff4b1e74b8
  port 7: 903f8301d63476b
  port 8: 389a796c3e5b235
  port 9: 1e3140f1676b7f9
  port 10: 629e2bc98da5ca3
  port 11: 0cd33fefb7b42fa
  port 12: c37e94e0db04962
  port 13: 671fe511889ca98
  port 14: 98321308ab1b19f
  port 15: 02f42b06a0c6297
  port 16: 24735518d1bf969

#### cluster-bot baremetal cluster with kernel tuning

sh-4.4# chroot /host
sh-4.4# ovs-appctl dpctl/show
system@ovs-system:
  lookups: hit:12457282 missed:162303 lost:0
  flows: 513
  masks: hit:34936722 total:42 hit/pkt:2.77
  cache: hit:11185592 hit-rate:88.64%
  caches:
    masks-cache: size:256
  port 0: ovs-system (internal)
  port 1: enp2s0
  port 2: br-ex (internal)
  port 3: br-int (internal)
  port 4: genev_sys_6081 (geneve: packet_type=ptap)
  port 5: ovn-k8s-mp0 (internal)
  port 6: 087daf35aba40c9
  port 7: cc3ba85ac25c0ab
  port 8: c66f6e768f1c664
  port 9: be4ebadea1ff66f
  port 10: ad0b5d7affbda32
  port 11: 397a290983455fe
  port 12: e9a7b69f81f18c6
  port 13: bb1f5a2bf3d1d0a
  port 14: 8e2654d28a35212
  port 15: 7792cd2dfde1bdd
  port 16: 17719f3b5b7dd71
  port 17: dc3097a7e562f6d
sh-4.4#
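For context, the upcall (miss) rate can be read straight off the "lookups" lines above. A small sketch using awk, plus the arithmetic for the Packets cluster counters:

ovs-appctl dpctl/show | awk '/lookups:/ {
    split($2, h, ":"); split($3, m, ":"); split($4, l, ":")
    printf "miss rate: %.2f%%  lost upcalls: %s\n", 100 * m[2] / (h[2] + m[2]), l[2]
}'
# Packets cluster: 199183 / (11614148 + 199183) ≈ 1.69% of packets took the
# upcall path, and lost:0 means none of those upcalls were dropped.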
Tested on a cluster-bot dualstack cluster; the iperf3 results improve in 4.11 compared to 4.9.12 (receiver-side loss drops from ~1.5-1.7% to ~0.55-0.61%).

#### cluster-bot dualstack cluster 4.11

Tested three times; client command: iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:6::8 -u -i l -t 600 -p 59554

[ 5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.005 ms  343838/56603689 (0.61%)  receiver
[ 5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.004 ms  308788/56603746 (0.55%)  receiver
[ 5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.005 ms  310015/56603745 (0.55%)  receiver

#### cluster-bot dualstack cluster 4.9.12

Tested three times; client command: iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:6::8 -u -i l -t 600 -p 59554

[ 5]   0.00-600.00 sec  13.7 GBytes   197 Mbits/sec  0.007 ms  948533/56591795 (1.7%)  receiver
[ 5]   0.00-600.00 sec  13.8 GBytes   197 Mbits/sec  0.007 ms  835952/56603726 (1.5%)  receiver
[ 5]   0.00-600.00 sec  13.8 GBytes   197 Mbits/sec  0.013 ms  822906/56603660 (1.5%)  receiver
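As an aside, iperf3 can also emit JSON, which avoids scraping the text summary; the receiver-side loss can then be pulled out with jq (field names as emitted by iperf 3.10 for UDP tests, and jq is assumed to be available in the test pod):

iperf3 -J -f m -l 265 -b 200m -c fd01:0:0:6::8 -u -t 600 -p 59554 \
  | jq '{mbps: (.end.sum.bits_per_second / 1e6), lost_percent: .end.sum.lost_percent}'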
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.