This is one issue found while working on https://bugzilla.redhat.com/show_bug.cgi?id=2087604#16

Note: the iperf3 test ran between pods on the same node.

Some analysis:

The iperf3 test started at Fri, 13 May 2022 06:22:17 GMT and ran for 600 seconds. I can't see anything out of the ordinary in the OVS logs at this time to give a clue. For example, correlating:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet

I don't see a large increase of flow_mods (wouldn't we expect that if it were due to the classifier issue that AaronC fixed?):

ovs-vswitchd.log-20220414:
2022-04-13T06:22:10.318Z|98838|connmgr|INFO|br-int<->unix#231439: 1 flow_mods 10 s ago (1 adds)
2022-04-13T06:22:16.889Z|98839|connmgr|INFO|br-ex<->unix#243278: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:31.922Z|98840|connmgr|INFO|br-ex<->unix#243281: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:46.955Z|98841|connmgr|INFO|br-ex<->unix#243284: 2 flow_mods in the last 0 s (2 adds)

ovsdb-server.log-20220414:
2022-04-13T06:22:01.821Z|23014|jsonrpc|WARN|unix#263088: receive error: Connection reset by peer
2022-04-13T06:22:01.821Z|23015|reconnect|WARN|unix#263088: connection dropped (Connection reset by peer)
2022-04-13T06:23:01.961Z|23016|jsonrpc|WARN|unix#263098: receive error: Connection reset by peer
2022-04-13T06:23:01.961Z|23017|reconnect|WARN|unix#263098: connection dropped (Connection reset by peer)
2022-04-13T06:24:47.208Z|23018|jsonrpc|WARN|unix#263115: receive error: Connection reset by peer

But the dropwatch2.out file is full of upcall drops:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet
Fri May 13 06:22:30 2022
3230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:24:15 2022
1847 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:25:20 2022
1821 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:25 2022
1620 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:30 2022
2142 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:35 2022
259 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:25 2022
783 packets dropped at ovs_dp_process_packet
Fri May 13 06:26:30 2022
230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:55 2022
5052 packets dropped at ovs_dp_process_packet
Fri May 13 06:27:00 2022
82 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:27:45 2022
1077 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:00 2022
86 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:20 2022
1760 packets dropped at ovs_dp_process_packet
Fri May 13 06:28:25 2022
611 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:45 2022
1306 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:00 2022
996 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:35 2022
3511 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:45 2022
417 packets dropped at ovs_dp_process_packet

Maybe running the test with OVS debug enabled would give better clues.

===========================

The upcall buffer was identified as too small upstream and was bumped in the following commit:
https://github.com/openvswitch/ovs/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90
The downstream fix is available in OVS 2.16:
https://gitlab.cee.redhat.com/nst/openvswitch/openvswitch2.16/-/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90
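For reference, a quick way to tell whether drops like these are upcall-queue overruns (the case the buffer bump addresses) rather than flow-table churn is the "lost" counter in the datapath statistics. A minimal sketch, assuming shell access to the node running ovs-vswitchd:

$ ovs-appctl dpctl/show | grep lookups
# "lost" counts packets destined for userspace (upcalls) that were dropped before
# ovs-vswitchd could receive them, e.g. because the upcall socket buffer overflowed.
# A "lost" value that grows while the iperf3 test is running points at the upcall path.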
This fix has been part of OCP 4.11 since mid-February 2022 via https://github.com/openshift/os/pull/715
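A minimal sketch of how one might double-check that a node already carries an OVS build new enough to contain the fix (the node name is a placeholder, and the downstream package name can vary between releases):

$ oc debug node/<node-name> -- chroot /host ovs-vswitchd --version
$ oc debug node/<node-name> -- chroot /host rpm -qa | grep -i openvswitch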
Tested 4.11.0-0.nightly-2022-06-25-132614 on both the Packets baremetal dualstack cluster and a cluster-bot dualstack cluster. Results: 0 packets lost on the Packets dualstack cluster after kernel tuning, and packet loss reduced from 1.7% to 0.068% on the cluster-bot dualstack cluster after kernel tuning.

#### Packets baremetal cluster without kernel tuning

/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::f -u -i l -t 600 -p 6666
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 13:46:53 UTC
Connecting to host fd01:0:0:5::f, port 6666
      Cookie: lp5azkzqcwak7ujjl55i3mxwqh4ljlrpv44x
      Target Bitrate: 200000000
[ 5] local fd01:0:0:5::10 port 59553 connected to fd01:0:0:5::f port 6666
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603706
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603706 (0%)  sender
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  15840/56603706 (0.028%)  receiver
CPU Utilization: local/sender 16.9% (1.1%u/15.8%s), remote/receiver 12.4% (1.6%u/10.8%s)

iperf Done.
/ $

#### Packets baremetal cluster with kernel tuning

/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::f -u -i l -t 600 -p 6666
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 14:05:45 UTC
Connecting to host fd01:0:0:5::f, port 6666
      Cookie: tnmi6xfsq2qk2dmydnxhlg5q63idz4y44rqp
      Target Bitrate: 200000000
[ 5] local fd01:0:0:5::10 port 34628 connected to fd01:0:0:5::f port 6666
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603701
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603701 (0%)  sender
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.001 ms  0/56603701 (0%)  receiver
CPU Utilization: local/sender 20.1% (2.8%u/17.3%s), remote/receiver 13.3% (2.4%u/10.9%s)

iperf Done.
/ $

#### cluster-bot baremetal cluster without kernel tuning

/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::10 -u -i l -t 600 -p 59553
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 13:50:59 UTC
Connecting to host fd01:0:0:5::10, port 59553
      Cookie: 5oipxhgympitohjzg6fabedtbqydaoef6lqb
      Target Bitrate: 200000000
[ 5] local fd01:0:0:5::11 port 41243 connected to fd01:0:0:5::10 port 59553
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603702
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603702 (0%)  sender
[ 5]   0.00-600.00 sec  13.7 GBytes   197 Mbits/sec  0.005 ms  966313/56603702 (1.7%)  receiver
CPU Utilization: local/sender 38.0% (5.2%u/32.7%s), remote/receiver 22.6% (3.8%u/18.8%s)

iperf Done.
/ $

#### cluster-bot baremetal cluster with kernel tuning

/ $ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::10 -u -i l -t 600 -p 59553
iperf 3.10.1
Linux test-pod2 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64
Control connection MSS 1328
Time: Tue, 28 Jun 2022 14:05:23 UTC
Connecting to host fd01:0:0:5::10, port 59553
      Cookie: uvrhzzmb24iz3rok6xzapgdny7qxaarmbktk
      Target Bitrate: 200000000
[ 5] local fd01:0:0:5::11 port 51660 connected to fd01:0:0:5::10 port 59553
Starting Test: protocol: UDP, 1 streams, 265 byte blocks, omitting 0 seconds, 600 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  56603753
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.000 ms  0/56603753 (0%)  sender
[ 5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.006 ms  38621/56603753 (0.068%)  receiver
CPU Utilization: local/sender 38.2% (5.5%u/32.7%s), remote/receiver 23.0% (4.0%u/19.0%s)

iperf Done.
/ $

fleitner, based on the above test results, do you think QE can verify this bug? Thanks!
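Side note: the exact "kernel tuning" applied in the runs above is not recorded in this comment. Purely as an illustration of the kind of tuning usually meant for a UDP test like this, one would raise the socket receive-buffer limits, e.g.:

# Illustrative values only, not the settings used in the runs above.
sysctl -w net.core.rmem_max=26214400
sysctl -w net.core.rmem_default=26214400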
#### Packets baremetal cluster with kernel tuning

sh-4.4# chroot /host
sh-4.4# ovs-appctl dpctl/show
system@ovs-system:
  lookups: hit:11614148 missed:199183 lost:0
  flows: 540
  masks: hit:43790003 total:63 hit/pkt:3.71
  cache: hit:10035165 hit-rate:84.95%
  caches:
    masks-cache: size:256
  port 0: ovs-system (internal)
  port 1: bond0
  port 2: br-ex (internal)
  port 3: br-int (internal)
  port 4: genev_sys_6081 (geneve: packet_type=ptap)
  port 5: ovn-k8s-mp0 (internal)
  port 6: dd02eff4b1e74b8
  port 7: 903f8301d63476b
  port 8: 389a796c3e5b235
  port 9: 1e3140f1676b7f9
  port 10: 629e2bc98da5ca3
  port 11: 0cd33fefb7b42fa
  port 12: c37e94e0db04962
  port 13: 671fe511889ca98
  port 14: 98321308ab1b19f
  port 15: 02f42b06a0c6297
  port 16: 24735518d1bf969

#### cluster-bot baremetal cluster with kernel tuning

sh-4.4# chroot /host
sh-4.4# ovs-appctl dpctl/show
system@ovs-system:
  lookups: hit:12457282 missed:162303 lost:0
  flows: 513
  masks: hit:34936722 total:42 hit/pkt:2.77
  cache: hit:11185592 hit-rate:88.64%
  caches:
    masks-cache: size:256
  port 0: ovs-system (internal)
  port 1: enp2s0
  port 2: br-ex (internal)
  port 3: br-int (internal)
  port 4: genev_sys_6081 (geneve: packet_type=ptap)
  port 5: ovn-k8s-mp0 (internal)
  port 6: 087daf35aba40c9
  port 7: cc3ba85ac25c0ab
  port 8: c66f6e768f1c664
  port 9: be4ebadea1ff66f
  port 10: ad0b5d7affbda32
  port 11: 397a290983455fe
  port 12: e9a7b69f81f18c6
  port 13: bb1f5a2bf3d1d0a
  port 14: 8e2654d28a35212
  port 15: 7792cd2dfde1bdd
  port 16: 17719f3b5b7dd71
  port 17: dc3097a7e562f6d
sh-4.4#
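For context, the upcall (miss) rate can be read straight off the "lookups" lines above. A small sketch using awk, plus the arithmetic for the Packets cluster counters:

ovs-appctl dpctl/show | awk '/lookups:/ {
    split($2, h, ":"); split($3, m, ":"); split($4, l, ":")
    printf "miss rate: %.2f%%  lost upcalls: %s\n", 100 * m[2] / (h[2] + m[2]), l[2]
}'
# Packets cluster: 199183 / (11614148 + 199183) ≈ 1.69% of packets took the
# upcall path, and lost:0 means none of those upcalls were dropped.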
Tested on a cluster-bot dualstack cluster; the iperf3 results improve in 4.11 compared to 4.9.12 (receiver-side loss drops from ~1.5-1.7% to ~0.55-0.61%).

#### cluster-bot dualstack cluster 4.11

Tested three times; client command: iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:6::8 -u -i l -t 600 -p 59554

[ 5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.005 ms  343838/56603689 (0.61%)  receiver
[ 5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.004 ms  308788/56603746 (0.55%)  receiver
[ 5]   0.00-600.00 sec  13.9 GBytes   199 Mbits/sec  0.005 ms  310015/56603745 (0.55%)  receiver

#### cluster-bot dualstack cluster 4.9.12

Tested three times; client command: iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:6::8 -u -i l -t 600 -p 59554

[ 5]   0.00-600.00 sec  13.7 GBytes   197 Mbits/sec  0.007 ms  948533/56591795 (1.7%)  receiver
[ 5]   0.00-600.00 sec  13.8 GBytes   197 Mbits/sec  0.007 ms  835952/56603726 (1.5%)  receiver
[ 5]   0.00-600.00 sec  13.8 GBytes   197 Mbits/sec  0.013 ms  822906/56603660 (1.5%)  receiver
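As an aside, iperf3 can also emit JSON, which avoids scraping the text summary; the receiver-side loss can then be pulled out with jq (field names as emitted by iperf 3.10 for UDP tests, and jq is assumed to be available in the test pod):

iperf3 -J -f m -l 265 -b 200m -c fd01:0:0:6::8 -u -t 600 -p 59554 \
  | jq '{mbps: (.end.sum.bits_per_second / 1e6), lost_percent: .end.sum.lost_percent}'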
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.