Bug 2215362 - [Azure][RHEL-9][CVM][Network] Very low TCP throughput between 2 CVMs
Summary: [Azure][RHEL-9][CVM][Network] Very low TCP throughput between 2 CVMs
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Vitaly Kuznetsov
QA Contact: Li Tian
URL:
Whiteboard:
Duplicates: 2227799
Depends On:
Blocks:
 
Reported: 2023-06-15 17:03 UTC by Li Tian
Modified: 2023-08-11 12:23 UTC
CC List: 8 users

Fixed In Version: kernel-5.14.0-351.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Links
Gitlab redhat/centos-stream/src/kernel centos-stream-9 merge_requests 2880: Draft: x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case (last updated 2023-08-03 08:34:20 UTC)
Red Hat Issue Tracker RHELPLAN-160006 (last updated 2023-06-15 17:03:46 UTC)

Description Li Tian 2023-06-15 17:03:12 UTC
Description of problem:
TCP throughput is very low between 2 CVMs (Confidential VMs), e.g. Standard_DC96as_v5. Tested with the latest RHEL 9.3.

# iperf3 -c 10.0.0.4 -b 0 -f g -i 10 -l 4096 -t 30 -p 750 -P 1 -4
Connecting to host 10.0.0.4, port 750
[  5] local 10.0.0.5 port 49690 connected to 10.0.0.4 port 750
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec   253 MBytes  0.21 Gbits/sec    0    261 KBytes       
[  5]  10.00-20.00  sec   260 MBytes  0.22 Gbits/sec    0    261 KBytes       
ntttcp with 32 connections:
Throughput in Gbps: Tx: 0.43 , Rx: 0.43

Version-Release number of selected component (if applicable):
5.14.0-316.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Set up 2 CVMs. On VM#1 (server), run:
iperf3 -s -1 -i10 -f g -p 750
2. On VM#2 (client), run:
iperf3 -c 10.0.0.4 -b 0 -f g -i10 -l 4096 -t 300 -p 750 -P 1 -4
3. Alternatively, test with ntttcp over multiple connections.
A scripted version of the client side is sketched below.
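A minimal client-side wrapper for these steps (a sketch only; it assumes iperf3 is installed on both VMs, the server from step 1 is already listening on 10.0.0.4, and the log path is arbitrary):

#!/bin/bash
# Sketch: run the client-side reproduction and record the kernel version being tested.
# Assumes the iperf3 server on 10.0.0.4 is already listening on port 750 (step 1).
SERVER=10.0.0.4
PORT=750
LOG=/tmp/iperf3-cvm-$(uname -r).log

echo "Client kernel: $(uname -r)" | tee "$LOG"
# Same client invocation as in the report: 4 KiB buffer, 300 s, single stream, IPv4.
iperf3 -c "$SERVER" -b 0 -f g -i10 -l 4096 -t 300 -p "$PORT" -P 1 -4 | tee -a "$LOG"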

Actual results:
Less than 1 Gbps throughput.

Expected results:
Reach, or come close to, the advertised throughput of Standard_DC96as_v5.

Additional info:
1. The issue has been bisected to kernel 5.14.0-195.el9.x86_64. In other words, 5.14.0-194.el9.x86_64 was still good:
# iperf3 -c 10.0.0.4 -b 0 -f g -i10 -l 4096 -t 30 -p 750 -P 1 -4
Connecting to host 10.0.0.4, port 750
[  5] local 10.0.0.5 port 47966 connected to 10.0.0.4 port 750
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  4.02 GBytes  3.45 Gbits/sec   40   2.01 MBytes       
[  5]  10.00-20.00  sec  3.97 GBytes  3.41 Gbits/sec    0   2.64 MBytes
ntttcp with 32 connections:
Thu Jun 15 04:35:33 2023 : Throughput in Gbps: Tx: 17.68 , Rx: 17.68

2. No such issue on RHEL 8.8 (4.18.0-477.el8.x86_64)
# iperf3 -c 10.0.0.4 -b 0 -f g -i30 -l 4096 -t 300 -p 750 -P 1 -4
Connecting to host 10.0.0.4, port 750
[  5] local 10.0.0.5 port 58400 connected to 10.0.0.4 port 750
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-30.00  sec  12.1 GBytes  3.47 Gbits/sec   76   2.63 MBytes      
[  5]  30.00-60.00  sec  12.1 GBytes  3.45 Gbits/sec   49   2.23 MBytes

Comment 1 Vitaly Kuznetsov 2023-06-20 12:51:25 UTC
(In reply to Li Tian from comment #0)

> 1. The issue has been bisected to kernel 5.14.0-195.el9.x86_64. In other
> words, 5.14.0-194.el9.x86_64 was still good:

Did you figure out whether the issue is on the receiver side or on the sender's? I.e.,
did you try downgrading the kernel to -194 on one side only? Alternatively, you
could try using only one CVM with a 'normal' VM on the other side.

I'm a bit surprised the issue appears between 5.14.0-194.el9 and 5.14.0-195.el9,
as I don't see much in that range besides https://bugzilla.redhat.com/show_bug.cgi?id=2136491,
but maybe that's the one? In case it is, the question is why only CVMs are affected...
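One way to script the one-sided downgrade (a sketch only; it assumes the kernel-5.14.0-194.el9 build is reachable via dnf and that grubby is available, as it is by default on RHEL 9):

#!/bin/bash
# Sketch: pin one VM to the known-good -194 kernel to isolate the sender vs. receiver side.
# Assumes kernel-5.14.0-194.el9 can be installed via dnf; adjust the source if it is only
# available as a local RPM.
GOOD_KERNEL=5.14.0-194.el9.x86_64
sudo dnf install -y kernel-5.14.0-194.el9
sudo grubby --set-default /boot/vmlinuz-${GOOD_KERNEL}
sudo reboot
# After reboot, confirm with 'uname -r' and rerun the iperf3 test in the same direction.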

Comment 2 Li Tian 2023-06-25 09:12:52 UTC
So the issue is on the client side. When I have -194 on the client and -195 on the server, this issue is gone.

Another observation is that this issue is only reproducible with a large buffer length. With 'iperf3 -l 32' the throughput is always ~0.2Gbps regardless of kernel version, while with 'iperf3 -l 4096' the regression becomes visible when going from -194 (~3Gbps) to -195 (~0.2Gbps).
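To check that buffer-length dependence quickly on any client kernel, a small sweep sketch (assumptions: the iperf3 server on 10.0.0.4 is started without '-1' so it keeps listening between runs; the buffer lengths chosen here are illustrative):

#!/bin/bash
# Sketch: sweep iperf3 buffer lengths to show the size-dependent throughput drop.
# Assumes the server on 10.0.0.4 port 750 stays up between runs (start it without -1).
for LEN in 32 256 1024 4096 16384; do
    echo "=== buffer length ${LEN} bytes ==="
    iperf3 -c 10.0.0.4 -b 0 -f g -l "${LEN}" -t 30 -p 750 -P 1 -4 | tail -n 4
done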

5.14.0-325.el9.x86_64 (client) on Standard_D64s_v4 does not have this issue:
# iperf3 -c 10.0.0.4 -b 0 -f g -i10 -l 4096 -t 30 -p 750 -P 1 -4
Connecting to host 10.0.0.4, port 750
[  5] local 10.0.0.5 port 39402 connected to 10.0.0.4 port 750
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  3.55 GBytes  3.05 Gbits/sec   87   1.57 MBytes       
[  5]  10.00-20.00  sec  3.71 GBytes  3.18 Gbits/sec    2   2.14 MBytes
(same server: -195 on the CVM)

Comment 5 Li Tian 2023-08-03 09:00:03 UTC
*** Bug 2227799 has been marked as a duplicate of this bug. ***

Comment 11 Li Tian 2023-08-04 12:11:02 UTC
Tested on 5.14.0-349.2880_954641462.el9.x86_64:

TCP throughput is good:
[  5]   0.00-10.00  sec  3.83 GBytes  3.29 Gbits/sec   15   2.26 MBytes       
[  5]  10.00-20.00  sec  3.79 GBytes  3.25 Gbits/sec    4   2.63 MBytes

Disk IOPS is good:
   bw (  KiB/s): min=26712, max=55272, per=100.00%, avg=49412.88, stdev=5239.23, samples=59
   iops        : min= 6678, max=13818, avg=12353.25, stdev=1309.81, samples=59

Comment 15 Li Tian 2023-08-10 06:23:51 UTC
Tested with good results on 5.14.0-351.el9.x86_64:

TCP throughput is good:
[  5]   0.00-10.00  sec  3.84 GBytes  3.30 Gbits/sec   31   2.25 MBytes       
[  5]  10.00-20.00  sec  3.87 GBytes  3.33 Gbits/sec    9   1.96 MBytes  

Disk IOPS is good:
   bw (  KiB/s): min=26360, max=55272, per=100.00%, avg=50867.22, stdev=5781.01, samples=59
   iops        : min= 6590, max=13818, avg=12716.76, stdev=1445.23, samples=59

