Bug 2099812

Summary: [release-4.10] UDP Packet loss in OpenShift using IPv6 [upcall]
Product: OpenShift Container Platform
Reporter: Dan Williams <dcbw>
Component: Networking
Assignee: Dan Williams <dcbw>
Networking sub component: ovn-kubernetes
QA Contact: Weibin Liang <weliang>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: anusaxen, bbennett, dcbw, fleitner, rravaiol, travier, weliang
Version: 4.10
Target Milestone: ---
Target Release: 4.10.z
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2099811
Environment:
Last Closed: 2022-07-20 07:46:10 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2099811    
Bug Blocks:    

Description Dan Williams 2022-06-21 18:31:07 UTC
+++ This bug was initially created as a clone of Bug #2099811 +++

This is one issue found while working on
https://bugzilla.redhat.com/show_bug.cgi?id=2087604#16



Note: 

The iperf3 test was run between pods on the same node.

Some analysis:

The iperf3 test started at:

Time: Fri, 13 May 2022 06:22:17 GMT

and ran for 600 seconds.

I can't see anything out of the ordinary in the OVS logs around this time that gives a clue.

For example, correlating the dropwatch timestamps with the OVS logs:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet

I don't see a large increase in flow_mods (wouldn't we expect that if it were due to the classifier issue that AaronC fixed?):
 
ovs-vswitchd.log-20220414:

2022-04-13T06:22:10.318Z|98838|connmgr|INFO|br-int<->unix#231439: 1 flow_mods 10 s ago (1 adds)
2022-04-13T06:22:16.889Z|98839|connmgr|INFO|br-ex<->unix#243278: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:31.922Z|98840|connmgr|INFO|br-ex<->unix#243281: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:46.955Z|98841|connmgr|INFO|br-ex<->unix#243284: 2 flow_mods in the last 0 s (2 adds)

ovsdb-server.log-20220414:

2022-04-13T06:22:01.821Z|23014|jsonrpc|WARN|unix#263088: receive error: Connection reset by peer
2022-04-13T06:22:01.821Z|23015|reconnect|WARN|unix#263088: connection dropped (Connection reset by peer)
2022-04-13T06:23:01.961Z|23016|jsonrpc|WARN|unix#263098: receive error: Connection reset by peer
2022-04-13T06:23:01.961Z|23017|reconnect|WARN|unix#263098: connection dropped (Connection reset by peer)
2022-04-13T06:24:47.208Z|23018|jsonrpc|WARN|unix#263115: receive error: Connection reset by peer

But the dropwatch2.out file is full of upcall drops:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet
Fri May 13 06:22:30 2022
3230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:24:15 2022
1847 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:25:20 2022
1821 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:25 2022
1620 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:30 2022
2142 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:35 2022
259 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:25 2022
783 packets dropped at ovs_dp_process_packet
Fri May 13 06:26:30 2022
230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:55 2022
5052 packets dropped at ovs_dp_process_packet
Fri May 13 06:27:00 2022
82 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:27:45 2022
1077 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:00 2022
86 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:20 2022
1760 packets dropped at ovs_dp_process_packet
Fri May 13 06:28:25 2022
611 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:45 2022
1306 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:00 2022
996 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:35 2022
3511 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:45 2022
417 packets dropped at ovs_dp_process_packet
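A rough lower bound on the upcall drops during the run can be obtained by summing the intervals shown above. Note the `--` separators mean grep elided lines between matches, so the true total is higher. A minimal sketch using only the counts printed above:

```python
# Sum the ovs_dp_process_packet drop counts from the grep output above.
# Only the printed intervals are included; the "--" separators mark
# elided lines, so this is a lower bound on the total drops.
drop_counts = [
    2338, 3230, 1847, 1821, 1620, 2142, 259, 783, 230,
    5052, 82, 1077, 86, 1760, 611, 1306, 996, 3511, 417,
]
total_drops = sum(drop_counts)
print(total_drops)  # 29168 drops across the shown 5-second samples
```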

Running the test with OVS debug logging enabled may give better clues.

===========================

The upcall buffer was identified upstream as too small and was enlarged
in the following commit:
https://github.com/openvswitch/ovs/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90


The downstream fix is available in OVS 2.16:
https://gitlab.cee.redhat.com/nst/openvswitch/openvswitch2.16/-/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90

--- Additional comment from Dan Williams on 2022-06-21 13:29:48 CDT ---

This fix has been part of OCP 4.11 since mid-February 2022 via https://github.com/openshift/os/pull/715

Comment 6 Weibin Liang 2022-07-14 19:37:45 UTC
Tested and verified in 4.10.0-0.nightly-2022-07-13-131411

#### Test log from a dual-stack cluster on Packet using 4.10.0-0.nightly-2022-07-13-131411
$ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::14 -u -i l -t 600 -p 59554
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.002 ms  36334/56603715 (0.064%)  receiver
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.002 ms  38169/56603715 (0.067%)  receiver
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.002 ms  41949/56603718 (0.074%)  receiver
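As a sanity check on the verified run, the loss percentages iperf3 reports can be recomputed from the lost/total datagram counts, and the total datagram count is consistent with the test parameters (265-byte payloads at 200 Mbit/s for 600 s). A minimal sketch using only figures from the log above:

```python
# Recompute the receiver loss percentages from the lost/total counts above.
runs = [(36334, 56603715), (38169, 56603715), (41949, 56603718)]
for lost, total in runs:
    print(f"{100 * lost / total:.3f}%")  # 0.064%, 0.067%, 0.074%

# The datagram count matches the test parameters: 200 Mbit/s of 265-byte
# payloads for 600 s is 200e6 / (265 * 8) * 600, about 56.6 million datagrams.
expected = 200e6 / (265 * 8) * 600
print(round(expected))  # 56603774, close to the ~56.6M iperf3 reports
```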

Comment 8 errata-xmlrpc 2022-07-20 07:46:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.23 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5568