Bug 2099811

Summary: UDP Packet loss in OpenShift using IPv6 [upcall]
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.10
Target Release: 4.11.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Flavio Leitner <fleitner>
Assignee: Flavio Leitner <fleitner>
QA Contact: Weibin Liang <weliang>
CC: bbennett, dan, dcbw, eglottma, jmaxwell, weliang, zzhao
Clones: 2099812
Bug Blocks: 2099812
Type: Bug
Last Closed: 2022-08-10 11:18:57 UTC

Description Flavio Leitner 2022-06-21 18:22:48 UTC
This is one issue found while working on
https://bugzilla.redhat.com/show_bug.cgi?id=2087604#16



Note:

The iperf3 test was run between pods on the same node.
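
For reference, a setup like this is usually an iperf3 server in one pod and the client in another pod on the same node. The pod names and the oc invocation below are only an illustrative sketch; the actual client commands are quoted in the comments further down.

# Server side, in the receiving pod (pod name is illustrative):
$ oc exec -it test-pod1 -- iperf3 -s -p 6666
# Client side, in the sending pod on the same node (same flags as the runs below):
$ oc exec -it test-pod2 -- iperf3 -u -b 200m -l 265 -t 600 -p 6666 -c <receiving pod IPv6>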

Some analysis:

The iperf3 test started at:

Time: Fri, 13 May 2022 06:22:17 GMT

and ran for 600 seconds.

I can't see anything out of the ordinary in the OVS logs around this time to give a clue.

For example, correlating the first burst of drops in the dropwatch capture:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet

I don't see a large increase in flow_mods (wouldn't we expect that if this were due to the classifier issue that AaronC fixed?):
 
ovs-vswitchd.log-20220414:

2022-04-13T06:22:10.318Z|98838|connmgr|INFO|br-int<->unix#231439: 1 flow_mods 10 s ago (1 adds)
2022-04-13T06:22:16.889Z|98839|connmgr|INFO|br-ex<->unix#243278: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:31.922Z|98840|connmgr|INFO|br-ex<->unix#243281: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:46.955Z|98841|connmgr|INFO|br-ex<->unix#243284: 2 flow_mods in the last 0 s (2 adds)

ovsdb-server.log-20220414:

2022-04-13T06:22:01.821Z|23014|jsonrpc|WARN|unix#263088: receive error: Connection reset by peer
2022-04-13T06:22:01.821Z|23015|reconnect|WARN|unix#263088: connection dropped (Connection reset by peer)
2022-04-13T06:23:01.961Z|23016|jsonrpc|WARN|unix#263098: receive error: Connection reset by peer
2022-04-13T06:23:01.961Z|23017|reconnect|WARN|unix#263098: connection dropped (Connection reset by peer)
2022-04-13T06:24:47.208Z|23018|jsonrpc|WARN|unix#263115: receive error: Connection reset by peer

But the dropwatch2.out file is full of upcall drops:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet
Fri May 13 06:22:30 2022
3230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:24:15 2022
1847 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:25:20 2022
1821 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:25 2022
1620 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:30 2022
2142 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:35 2022
259 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:25 2022
783 packets dropped at ovs_dp_process_packet
Fri May 13 06:26:30 2022
230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:55 2022
5052 packets dropped at ovs_dp_process_packet
Fri May 13 06:27:00 2022
82 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:27:45 2022
1077 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:00 2022
86 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:20 2022
1760 packets dropped at ovs_dp_process_packet
Fri May 13 06:28:25 2022
611 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:45 2022
1306 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:00 2022
996 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:35 2022
3511 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:45 2022
417 packets dropped at ovs_dp_process_packet
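
For reference, a capture like dropwatch2.out is produced with the dropwatch tool; the exact invocation used for this bug is not recorded, so the following is only a sketch:

# Run on the node (needs the dropwatch package); -l kas resolves drop points
# to kernel symbols such as ovs_dp_process_packet.
$ dropwatch -l kas
dropwatch> start
# ... leave it running for the duration of the iperf3 test ...
dropwatch> stop
dropwatch> exit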

Running the test with OVS debug logging enabled may give better clues.
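
A sketch of how that could be done with ovs-appctl; the module name below is an assumption, and vlog/list shows the full set:

# List logging modules and their current levels:
$ ovs-appctl vlog/list
# Raise the upcall-handling module to debug in the log file (assumed module name):
$ ovs-appctl vlog/set ofproto_dpif_upcall:file:dbg
# Restore the default level after the test:
$ ovs-appctl vlog/set ofproto_dpif_upcall:file:info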

===========================

The upcall buffer was identified upstream as too small and was increased in the following commit:
https://github.com/openvswitch/ovs/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90


The downstream fix is available in OVS 2.16:
https://gitlab.cee.redhat.com/nst/openvswitch/openvswitch2.16/-/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90
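
A quick way to check whether a node already runs an OVS build carrying this change; the package name follows the downstream openvswitch2.16 convention and is an assumption here:

sh-4.4# rpm -qa | grep openvswitch2.16
sh-4.4# ovs-vswitchd --version
# or query the running daemon:
sh-4.4# ovs-vsctl get Open_vSwitch . ovs_version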

Comment 1 Dan Williams 2022-06-21 18:29:48 UTC
This fix has been part of OCP 4.11 since mid-February 2022 via https://github.com/openshift/os/pull/715

Comment 5 Weibin Liang 2022-06-28 14:26:41 UTC
Tested 4.11.0-0.nightly-2022-06-25-132614 on both the Packets baremetal dualstack cluster and the cluster-bot dualstack cluster.

The results show 0 packets lost on the Packets dualstack cluster after kernel tuning, and packet loss dropping from 1.7% to 0.068% on the cluster-bot dualstack cluster after kernel tuning.
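
The specific kernel tuning applied is not recorded in this comment. UDP receive-buffer sysctls are the usual knobs for this kind of loss, so the following is only an illustrative sketch with assumed values:

# On the node; the values here are illustrative, not the ones used in this test:
sh-4.4# sysctl -w net.core.rmem_max=16777216
sh-4.4# sysctl -w net.core.rmem_default=16777216
# Verify:
sh-4.4# sysctl net.core.rmem_max net.core.rmem_default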


#### Packets baremetal cluster without kernel tuning
/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::f -u -i l -t 600 -p 6666
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 13:46:53 UTC
Connecting to host fd01:0:0:5::f, port 6666
      Cookie: lp5azkzqcwak7ujjl55i3mxwqh4ljlrpv44x
      Target Bitrate: 200000000
[  5] local fd01:0:0:5::10 port 59553 connected to fd01:0:0:5::f port 6666
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603706  
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603706 (0%)  sender
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  15840/56603706 (0.028%)  receiver
CPU Utilization: local/sender 16.9% (1.1%u/15.8%s), remote/receiver 12.4% (1.6%u/10.8%s)

iperf Done.
/ $ 


#### Packets baremetal cluster with kernel tuning
/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::f -u -i l -t 600 -p 6666
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 14:05:45 UTC
Connecting to host fd01:0:0:5::f, port 6666
      Cookie: tnmi6xfsq2qk2dmydnxhlg5q63idz4y44rqp
      Target Bitrate: 200000000
[  5] local fd01:0:0:5::10 port 34628 connected to fd01:0:0:5::f port 6666
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603701  
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603701 (0%)  sender
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  0/56603701 (0%)  receiver
CPU Utilization: local/sender 20.1% (2.8%u/17.3%s), remote/receiver 13.3% (2.4%u/10.9%s)

iperf Done.
/ $ 



#### cluster-bot baremetal cluster without kernel tuning
/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::10 -u -i l -t 600 -p 59553
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 13:50:59 UTC
Connecting to host fd01:0:0:5::10, port 59553
      Cookie: 5oipxhgympitohjzg6fabedtbqydaoef6lqb
      Target Bitrate: 200000000
[  5] local fd01:0:0:5::11 port 41243 connected to fd01:0:0:5::10 port 59553
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603702  
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603702 (0%)  sender
[  5]   0.00-600.00 sec  13.7 GBytes   197 Mbits/sec  0.005 ms  966313/56603702 (1.7%)  receiver
CPU Utilization: local/sender 38.0% (5.2%u/32.7%s), remote/receiver 22.6% (3.8%u/18.8%s)

iperf Done.
/ $ 


#### cluster-bot baremetal cluster with kernel tuning
/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::10 -u -i l -t 600 -p 59553
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 14:05:23 UTC
Connecting to host fd01:0:0:5::10, port 59553
      Cookie: uvrhzzmb24iz3rok6xzapgdny7qxaarmbktk
      Target Bitrate: 200000000
[  5] local fd01:0:0:5::11 port 51660 connected to fd01:0:0:5::10 port 59553
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603753  
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603753 (0%)  sender
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.006 ms  38621/56603753 (0.068%)  receiver
CPU Utilization: local/sender 38.2% (5.5%u/32.7%s), remote/receiver 23.0% (4.0%u/19.0%s)

iperf Done.
/ $ 


@fleitner, based on the above test results, do you think QE can verify this bug? Thanks!

Comment 6 Weibin Liang 2022-06-28 14:45:20 UTC
#### Packets baremetal cluster with kernel tuning
sh-4.4# chroot /host
sh-4.4# ovs-appctl dpctl/show 
system@ovs-system:
  lookups: hit:11614148 missed:199183 lost:0
  flows: 540
  masks: hit:43790003 total:63 hit/pkt:3.71
  cache: hit:10035165 hit-rate:84.95%
  caches:
    masks-cache: size:256
  port 0: ovs-system (internal)
  port 1: bond0
  port 2: br-ex (internal)
  port 3: br-int (internal)
  port 4: genev_sys_6081 (geneve: packet_type=ptap)
  port 5: ovn-k8s-mp0 (internal)
  port 6: dd02eff4b1e74b8
  port 7: 903f8301d63476b
  port 8: 389a796c3e5b235
  port 9: 1e3140f1676b7f9
  port 10: 629e2bc98da5ca3
  port 11: 0cd33fefb7b42fa
  port 12: c37e94e0db04962
  port 13: 671fe511889ca98
  port 14: 98321308ab1b19f
  port 15: 02f42b06a0c6297
  port 16: 24735518d1bf969

#### cluster-bot baremetal cluster with kernel tuning
sh-4.4# chroot /host
sh-4.4# ovs-appctl dpctl/show 
system@ovs-system:
  lookups: hit:12457282 missed:162303 lost:0
  flows: 513
  masks: hit:34936722 total:42 hit/pkt:2.77
  cache: hit:11185592 hit-rate:88.64%
  caches:
    masks-cache: size:256
  port 0: ovs-system (internal)
  port 1: enp2s0
  port 2: br-ex (internal)
  port 3: br-int (internal)
  port 4: genev_sys_6081 (geneve: packet_type=ptap)
  port 5: ovn-k8s-mp0 (internal)
  port 6: 087daf35aba40c9
  port 7: cc3ba85ac25c0ab
  port 8: c66f6e768f1c664
  port 9: be4ebadea1ff66f
  port 10: ad0b5d7affbda32
  port 11: 397a290983455fe
  port 12: e9a7b69f81f18c6
  port 13: bb1f5a2bf3d1d0a
  port 14: 8e2654d28a35212
  port 15: 7792cd2dfde1bdd
  port 16: 17719f3b5b7dd71
  port 17: dc3097a7e562f6d
sh-4.4#
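
In both outputs the datapath reports lost:0, i.e. no upcalls were dropped during these runs. A simple way to keep an eye on that counter while a test is running (sketch):

sh-4.4# watch -n 5 "ovs-appctl dpctl/show | grep lookups"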

Comment 8 Weibin Liang 2022-06-29 17:19:54 UTC
Tested on the cluster-bot dualstack cluster; see the iperf3 throughput improvement in 4.11 compared to 4.9.12.

#### cluster-bot dualstack cluster 4.11
Tested three times:
client: iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:6::8 -u -i l -t 600 -p 59554
[  5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.005 ms  343838/56603689 (0.61%)  receiver
[  5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.004 ms  308788/56603746 (0.55%)  receiver
[  5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.005 ms  310015/56603745 (0.55%)  receiver


#### cluster-bot dualstack cluster 4.9.12
Tested three times:
client: iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:6::8 -u -i l -t 600 -p 59554
[  5]   0.00-600.00 sec  13.7 GBytes   197 Mbits/sec  0.007 ms  948533/56591795 (1.7%)  receiver
[  5]   0.00-600.00 sec  13.8 GBytes   197 Mbits/sec  0.007 ms  835952/56603726 (1.5%)  receiver
[  5]   0.00-600.00 sec  13.8 GBytes   197 Mbits/sec  0.013 ms  822906/56603660 (1.5%)  receiver

Comment 19 errata-xmlrpc 2022-08-10 11:18:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 21 Red Hat Bugzilla 2023-09-18 04:39:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days