Bug 2099812 - [release-4.10] UDP Packet loss in OpenShift using IPv6 [upcall]
Summary: [release-4.10] UDP Packet loss in OpenShift using IPv6 [upcall]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.z
Assignee: Dan Williams
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On: 2099811
Blocks:
 
Reported: 2022-06-21 18:31 UTC by Dan Williams
Modified: 2022-07-20 07:46 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2099811
Environment:
Last Closed: 2022-07-20 07:46:10 UTC
Target Upstream Version:
Embargoed:




Links
System: GitHub  ID: openshift/os pull 860  Status: Merged  Summary: Bug 2099812: manifest: bump to openvswitch2.16  Last Updated: 2022-07-05 14:16:23 UTC
System: Red Hat Product Errata  ID: RHBA-2022:5568  Last Updated: 2022-07-20 07:46:27 UTC

Description Dan Williams 2022-06-21 18:31:07 UTC
+++ This bug was initially created as a clone of Bug #2099811 +++

This is one issue found while working on
https://bugzilla.redhat.com/show_bug.cgi?id=2087604#16



Note: 

The iperf3 test was run between pods on the same node.
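
(For context, a test of this shape can be reproduced with iperf3 between two pods scheduled on the same node; the namespace and pod names below are placeholders, not taken from this report, while the payload size, bandwidth, and port match the verification run in comment 6:)

# Sketch only -- namespace and pod names are hypothetical.
# Server pod:
$ oc exec -n <test-ns> iperf-server -- iperf3 -s -p 59554
# Client pod on the same node; UDP, 265-byte payloads at 200 Mbit/s for 600 s:
$ oc exec -n <test-ns> iperf-client -- iperf3 -u -c <server-pod-ipv6> \
    -l 265 -b 200m -t 600 -p 59554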

Some analysis:

The iperf3 test started at:

Time: Fri, 13 May 2022 06:22:17 GMT

and ran for 600 seconds.

I can't see anything out of the ordinary in the OVS logs around this time that would give a clue.

For example, correlating the dropwatch timestamps:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet

I don't see a large increase in flow_mods (wouldn't we expect one if this were due to the classifier issue that AaronC fixed?):
 
ovs-vswitchd.log-20220414:

2022-04-13T06:22:10.318Z|98838|connmgr|INFO|br-int<->unix#231439: 1 flow_mods 10 s ago (1 adds)
2022-04-13T06:22:16.889Z|98839|connmgr|INFO|br-ex<->unix#243278: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:31.922Z|98840|connmgr|INFO|br-ex<->unix#243281: 2 flow_mods in the last 0 s (2 adds)
2022-04-13T06:22:46.955Z|98841|connmgr|INFO|br-ex<->unix#243284: 2 flow_mods in the last 0 s (2 adds)
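
(A quick way to sanity-check this, assuming the log naming above, is to count connmgr flow_mod entries in the test window; a burst here would have pointed at the classifier issue:)

$ grep connmgr ovs-vswitchd.log-20220414 | grep -c '2022-04-13T06:2'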

ovsdb-server.log-20220414:

2022-04-13T06:22:01.821Z|23014|jsonrpc|WARN|unix#263088: receive error: Connection reset by peer
2022-04-13T06:22:01.821Z|23015|reconnect|WARN|unix#263088: connection dropped (Connection reset by peer)
2022-04-13T06:23:01.961Z|23016|jsonrpc|WARN|unix#263098: receive error: Connection reset by peer
2022-04-13T06:23:01.961Z|23017|reconnect|WARN|unix#263098: connection dropped (Connection reset by peer)
2022-04-13T06:24:47.208Z|23018|jsonrpc|WARN|unix#263115: receive error: Connection reset by peer

But the dropwatch2.out file is full of upcall drops:

$ grep -e Fri -e ovs_dp_process_packet dropwatch2.out|grep ovs_dp_process_packet -B 1
Fri May 13 06:22:25 2022
2338 packets dropped at ovs_dp_process_packet
Fri May 13 06:22:30 2022
3230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:24:15 2022
1847 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:25:20 2022
1821 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:25 2022
1620 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:30 2022
2142 packets dropped at ovs_dp_process_packet
Fri May 13 06:25:35 2022
259 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:25 2022
783 packets dropped at ovs_dp_process_packet
Fri May 13 06:26:30 2022
230 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:26:55 2022
5052 packets dropped at ovs_dp_process_packet
Fri May 13 06:27:00 2022
82 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:27:45 2022
1077 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:00 2022
86 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:20 2022
1760 packets dropped at ovs_dp_process_packet
Fri May 13 06:28:25 2022
611 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:28:45 2022
1306 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:00 2022
996 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:35 2022
3511 packets dropped at ovs_dp_process_packet
--
Fri May 13 06:31:45 2022
417 packets dropped at ovs_dp_process_packet
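
(For reference: the format of dropwatch2.out matches the dropwatch tool with kernel-symbol lookup enabled. The exact invocation used for this capture is not recorded in the bug, but it would be along these lines:)

# Sketch only -- the original capture command is not recorded here.
$ dropwatch -l kas        # resolve drop points to kernel symbol names
dropwatch> start          # emits "N packets dropped at <symbol>" as drops occur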

Running the test with OVS debug logging enabled may give better clues.
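
(For example, raising the log level on the upcall-related modules via ovs-appctl; which modules would be most informative here is a guess on my part:)

# Bump upcall-path logging to debug; revert with ...:file:info afterwards.
$ ovs-appctl vlog/set ofproto_dpif_upcall:file:dbg
$ ovs-appctl vlog/set dpif_netlink:file:dbg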

===========================

The upcall buffer was identified upstream as too small and was
bumped in the following commit:
https://github.com/openvswitch/ovs/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90


The downstream fix is available in OVS 2.16:
https://gitlab.cee.redhat.com/nst/openvswitch/openvswitch2.16/-/commit/b4a9c9cd848b56d538f17f94cde78d5a139c7d90
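
(On an affected node, the kernel datapath's "lost" counter, i.e. upcalls that could not be delivered to userspace, should track the dropwatch numbers above; it can be read with:)

# "lost" in the lookups line counts upcalls dropped before reaching userspace.
$ ovs-appctl dpctl/show | grep lookups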

--- Additional comment from Dan Williams on 2022-06-21 13:29:48 CDT ---

This fix has been part of OCP 4.11 since mid-February 2022 via https://github.com/openshift/os/pull/715

Comment 6 Weibin Liang 2022-07-14 19:37:45 UTC
Tested and verified in 4.10.0-0.nightly-2022-07-13-131411

#### Test log from packet dualstack cluster using 4.10.0-0.nightly-2022-07-13-131411
$ iperf3 -V -f m -l 265 -b 200m -c fd01:0:0:5::14 -u -i 1 -t 600 -p 59554
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.002 ms  36334/56603715 (0.064%)  receiver
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.002 ms  38169/56603715 (0.067%)  receiver
[  5]   0.00-600.00 sec  14.0 GBytes   200 Mbits/sec  0.002 ms  41949/56603718 (0.074%)  receiver

Comment 8 errata-xmlrpc 2022-07-20 07:46:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.23 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5568

