Bug 485292 - High numbers of retransmitted segments, TCPFastRetrans and TCPSlowStartRetrans with HTTP traffic.
Summary: High numbers of retransmitted segments, TCPFastRetrans and TCPSlowStartRetrans with HTTP traffic.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Neil Horman
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-02-12 18:36 UTC by Veaceslav Falico
Modified: 2018-10-20 01:33 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-02-24 16:51:52 UTC
Target Upstream Version:
Embargoed:


Attachments
strace from a sample httpd process (8.78 KB, application/octet-stream)
2009-02-18 02:47 UTC, Fabio Olive Leite
summary & analysis of problem (10.09 KB, text/plain)
2009-02-20 20:44 UTC, Neil Horman

Description Veaceslav Falico 2009-02-12 18:36:27 UTC
Description of problem:
There is a high number of TCPSlowStartRetrans and TCPFastRetrans events during heavy load on a Squid server. The box is running RHEL AS 4.5 with the 2.6.9-55.ELsmp kernel. Artificial load is generated towards the server.

Steps to Reproduce:
This is from a test issuing HTTP requests for 1 MB files.  The test environment
is one Dell PowerEdge 2950 acting as a cache server, one 2950 as the origin server,
which serves the requested file if the cache server does not have it, one Foundry 850
ServerIron, and one Avalanche HTTP load generator.  The results listed are from a
high-load test (5000 requests/second) to show the retransmits, but we see them
even under normal HTTP traffic, just not as many.

Actual results:
The output of netstat -s -t | grep -i retrans shows the following during a test:
Initial values:
 0 segments retransmited
 TCPLostRetransmit: 0
 TCPFastRetrans: 0
 TCPForwardRetrans: 0
 TCPSlowStartRetrans: 0

** TEST RUN **   

Final values:
 21111 segments retransmited
 TCPLostRetransmit: 0
 TCPFastRetrans: 6212
 TCPForwardRetrans: 0
 TCPSlowStartRetrans: 56
 364 failed txns

There are no errors on the Ethernet interface, TSO is enabled, and sysctl.conf contains:
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
fs.file-max = 65535
net.core.rmem_max = 262143
net.core.rmem_default = 262143
net.core.wmem_max = 262144
net.core.wmem_default = 262144
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_keepalive_time = 2400 
net.ipv4.tcp_fin_timeout = 7 
net.ipv4.tcp_window_scaling = 0 
net.ipv4.tcp_sack = 0 
net.ipv4.tcp_orphan_retries = 3 
net.ipv4.tcp_tw_recycle = 1 
kernel.panic=5

The bnx2 driver is used:
filename:       /lib/modules/2.6.9-55.ELsmp/kernel/drivers/net/bnx2.ko
parm:           disable_msi:Disable Message Signaled Interrupt (MSI)
version:        1.4.43-rh 49524544DE69B4B5647379F
license:        GPL
description:    Broadcom NetXtreme II BCM5706/5708 Driver
author:         Michael Chan <mchan>
alias:          pci:v000014E4d000016ACsv*sd*bc*sc*i*
alias:          pci:v000014E4d000016AAsv*sd*bc*sc*i*
alias:          pci:v000014E4d000016AAsv0000103Csd00003102bc*sc*i*
alias:          pci:v000014E4d0000164Csv*sd*bc*sc*i*
alias:          pci:v000014E4d0000164Asv*sd*bc*sc*i*
alias:          pci:v000014E4d0000164Asv0000103Csd00003106bc*sc*i*
alias:          pci:v000014E4d0000164Asv0000103Csd00003101bc*sc*i*
depends: 

TCP SACKs are disabled.
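
For reference, SACK can be toggled at runtime without editing sysctl.conf (standard sysctl usage, shown for completeness):

  sysctl -w net.ipv4.tcp_sack=1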

Comment 1 Andy Gospodarek 2009-02-12 20:20:11 UTC
Another interesting data point, if we don't have it already, would be the output from:

ethtool -S ethX

for all interfaces involved.  Luckily, the driver used here added support for a previously unavailable stat counter called "rx_fw_discards."  This is useful for determining whether frames are being dropped by the firmware.

The firmware will typically drop frames if there is not adequate space to store incoming frames.  This can be alleviated by modifying the ring buffer size or by changing the interrupt coalesce settings.
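
For example (illustrative values only, not tuned recommendations; the supported maximums vary by device and can be checked with 'ethtool -g'), the ring and coalesce settings could be adjusted with something like:

  ethtool -G eth0 rx 1020
  ethtool -C eth0 rx-usecs 50 rx-frames 25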

Please gather output from:

ethtool -g ethX 

and

ethtool -c ethX

as well so we can take a look at those values and possibly make recommendations if the 'rx_fw_discards' stat is incrementing during the run.

Comment 2 Veaceslav Falico 2009-02-12 20:29:38 UTC
[root@pe2950 ~]# ethtool -S eth0
NIC statistics:
     rx_bytes: 656535
     rx_error_bytes: 0
     tx_bytes: 20290266
     tx_error_bytes: 0
     rx_ucast_packets: 6796
     rx_mcast_packets: 569
     rx_bcast_packets: 649
     tx_ucast_packets: 13724
     tx_mcast_packets: 5
     tx_bcast_packets: 7
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 433
     rx_65_to_127_byte_packets: 6904
     rx_128_to_255_byte_packets: 618
     rx_256_to_511_byte_packets: 47
     rx_512_to_1023_byte_packets: 5
     rx_1024_to_1522_byte_packets: 7
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 13
     tx_65_to_127_byte_packets: 288
     tx_128_to_255_byte_packets: 74
     tx_256_to_511_byte_packets: 25
     tx_512_to_1023_byte_packets: 3
     tx_1024_to_1522_byte_packets: 13333
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 0
     tx_xoff_frames: 0
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 1298
     rx_discards: 0
     rx_fw_discards: 0
[root@pe2950 ~]# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 18
rx-frames: 6
rx-usecs-irq: 18
rx-frames-irq: 6

tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 80
tx-frames-irq: 20

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

[root@pe2950 ~]#

Comment 3 Andy Gospodarek 2009-02-12 21:50:32 UTC
Based on what I've seen and tested, this does not initially appear to be related to the network device in use.  I would not rule that out, but analysis of packet captures (here or from the customer) would probably be helpful.
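
If anyone collects captures, a full-packet trace on the server during a test run would be the most useful form (standard tcpdump invocation; the interface name is an assumption):

  tcpdump -i eth0 -s 0 -w test-run.pcap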

Since we can reproduce it in-house it might also be wise to try with some newer kernels.

Comment 4 Veaceslav Falico 2009-02-13 14:05:46 UTC
From tests:
We were able to eliminate the reported server delay by:
1. Disabling TCP Segmentation Offload on the Broadcom NIC using bnx2
2. RH5.3

We were NOT able to eliminate the reported server delay by:
1. sack=1
2. net.ipv4.tcp_tw_recycle=0 (not even sure how this was related, but we tried it anyway per RH's advice)
3. tcp_tso_win_divisor=0 (also 1 and 3) per Dell's advice
4. RH4.5/4.6 with the most recent Dell-supported bnx2 (version 1.7.6b)
5. RH4.6 with Intel Pro card (e1000 version: 7.3.20-k2-NAPI) with TSO on

Last night we did 2 more tests:
1. The problem was eliminated by disabling TSO on the Intel card (e1000)
2. The problem still persisted with the RH4 kernel 2.6.9-78.0.13.ELsmp, which we believe to be the latest kernel update for RH4

Comment 5 Andy Gospodarek 2009-02-13 14:38:36 UTC
(In reply to comment #4)
> From tests:
> We were able to eliminate the reported server delay by:
> 1. Disabling TCP Segmentation Offload on the Broadcom NIC using bnx2

Interesting.

> 2. RH5.3

Do you mean that the problem does not exist on RHEL5.3?  What about 5.2?

> 
> We were NOT able to eliminate the reported server delay by:
> 1. sack=1
> 2. net.ipv4.tcp_tw_recycle=0 (not even sure how this was related, but we tried
> it anyway per RH's advice)
> 3. tcp_tso_win_divisor=0 (also 1 and 3) per Dell's advice
> 4. RH4.5/4.6 with the most recent Dell-supported bnx2 (version 1.7.6b)
> 5. RH4.6 with Intel Pro card (e1000 version: 7.3.20-k2-NAPI) with TSO on
> 

This would indicate the problem is in the stack's TSO handling, since both Intel and Broadcom NICs show it.

> Last night we did 2 more tests:
> 1. The problem was eliminated by disabling TSO on the Intel card (e1000)
> 2. The problem still persisted with the RH4 kernel 2.6.9-78.0.13.ELsmp, which we
> believe to be the latest kernel update for RH4

Having a workaround to disable TSO is nice, but it would be good if we could figure out whether the bug still exists in the RHEL4.8 Beta and fix it in the stack there.

Did you try the RHEL4.8 Beta kernel?

You can quickly get it from:

http://people.redhat.com/vgoyal/rhel4/

Comment 23 David Miller 2009-02-17 07:14:11 UTC
Could we get "tc -s qdisc" output before and after the test run
for both the "RHEL4 failure" case and the "RHEL5 works" case?

I suspect that the system is generating too much traffic for
the network device to keep up with; we hit the queue limits,
drop packets, and this creates a chain reaction of events that
leads to the bad delays.

If we see fewer drops occurring in the "RHEL5 works" case, it would
confirm the theory.

There are some things that can be experimented with to alleviate
the symptoms:

1) "ifconfig eth0 txqueuelen XXX" where XXX is some larger value
   than the default which is 1000

2) Play around with 'ethtool -G eth0 tx XXX'.  This will increase the
   internal TX queue size used by the driver.  By running
   'ethtool -g eth0' you can see the current settings and also the
   maximum values allowed by the specific device.

All of the above is to be done on the server.
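
As a concrete example of the two experiments above (values are illustrative assumptions, not tuned recommendations):

  ifconfig eth0 txqueuelen 5000
  ethtool -g eth0               # check current and maximum ring sizes first
  ethtool -G eth0 tx 1024       # then raise the TX ring toward its maximum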

Comment 42 Fabio Olive Leite 2009-02-18 02:47:30 UTC
Created attachment 332330 [details]
strace from a sample httpd process

I'm not sure how relevant this is, but I straced one of the apache processes during a transfer to see how it sets up the socket and the sends.

It sets the socket to non-blocking, and uses poll and sendfile to try to keep the send queue filled at all times. I wonder if we are not also seeing some process scheduling latency at the server here.

Perhaps with the massive dup ACKs and retransmits, the send queue takes longer to clear, poll takes longer to return, and the process loses a bit of priority? Or maybe it has to fetch data from disk for the next sendfile? I imagine not, since the test always fetches the same file, so it will be in memory after the first time.

Maybe we should reduce the number of Apache processes (it wouldn't hurt, since we're not doing parallel transfers) to make them easier to strace, and collect the strace with timestamps to correlate with the capture later.
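
Something like the following would do for the timestamped collection (standard strace flags; the PID is a placeholder):

  strace -tt -T -p <httpd-pid> -o httpd-trace.txt

-tt records microsecond timestamps and -T shows the time spent in each syscall, which can then be lined up against the packet capture.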

Then again, this has nothing to do with TSO. Unless disabling TSO causes the system to remove less data at a time from the send queue, thus allowing smaller non-blocking sendfile chunks and altering the user-level timing.

Or maybe it's just too late and I should be sleeping. Just felt like sharing this wild thought. :)

Cheers,
Fabio Olive

Comment 43 David Miller 2009-02-18 06:51:32 UTC
I know what the problem is, and it has to do with TSO.

In RHEL4 with TSO, if we get packet loss we just turn off TSO
and forget all of the loss information we have accumulated.
Eventually what will happen is that we'll resend the whole
send queue after the 2 * RTT retransmit timeout.

In fact, if you disable SACK, it's going to be even worse.  The
kernel will retransmit entire TSO segments and only notice and
disable TSO at the moment that the retransmit timeout fires.
(For reference, see the RHEL4 code in
net/ipv4/tcp_input.c:tcp_sacktag_write_queue() that checks NETIF_F_TSO,
and also the code in net/ipv4/tcp_output.c:tcp_retransmit_skb()
that makes the same exact check.)

RHEL5, on the other hand, has special code that carefully resegments
the TSO frames in the send queue at the boundaries of the losses
we detect.  This allows the code to remember all of the loss information
and therefore emit exactly the minimal necessary retransmits to fill
the lost segment holes properly.  It also therefore does not need to
disable TSO on detection of packet loss.

Backporting these RHEL5 changes would be a huge deal.  It would be
a complete rewrite of how the TCP sendmsg/sendfile path builds TSO
frames and how the whole retransmit engine maintains packet count
state in the TCP socket structure.

The number of potential regressions in backporting these changes is
close to infinite.  When initially written upstream, this code took
more than 6 months to fully debug.

To be honest, the safest thing to do is to simply run with TSO disabled
on systems that have to stick with RHEL4 and cannot afford to run into
these extra retransmits.  TSO gets disabled on a socket when
loss happens on its connection anyway.
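
For anyone applying that workaround, TSO can be turned off per interface with ethtool (standard options; the interface name is an assumption, and making it persistent across reboots, e.g. via /etc/rc.local, is left to local policy):

  ethtool -K eth0 tso off
  ethtool -k eth0    # verify: 'tcp segmentation offload: off'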

Comment 59 Neil Horman 2009-02-20 20:44:20 UTC
Created attachment 332753 [details]
summary & analysis of problem

As promised, this is the summary write-up and root-cause analysis that AT&T asked for.  This document should answer all the remaining questions the customer had on this ticket.  Please let me know if anything else is required.

Comment 62 Andy Gospodarek 2009-02-24 16:51:52 UTC
There is nothing that can really be fixed here, but the attachment in comment #59 sums up the issue and what others can do if they experience this type of issue (hint: disable TSO on RHEL4).

