Description of problem:
There is a high number of TCPSlowStartRetrans and TCPFastRetrans events during
heavy load on a squid server. The box is running RHEL AS 4.5 with the
2.6.9-55.ELsmp kernel. Artificial load is generated towards the server.

Steps to Reproduce:
This is from a test that issues HTTP requests for 1 MB files. The test
environment is one PowerEdge 2950 acting as a cache server, one 2950 as the
origin server (which serves the requested file if the cache server does not
have it), one Foundry 850 ServerIron, and one Avalanche HTTP load generator.
The results listed are from a high-load test (5000 requests/second) to show
the retransmits, but we see them even under normal HTTP traffic, just not as
many.

Actual results:
The output of "netstat -s -t | grep -i retrans" shows the following during a
test.

Initial values:
    0 segments retransmited
    TCPLostRetransmit: 0
    TCPFastRetrans: 0
    TCPForwardRetrans: 0
    TCPSlowStartRetrans: 0

** TEST RUN **

Final values:
    21111 segments retransmited
    TCPLostRetransmit: 0
    TCPFastRetrans: 6212
    TCPForwardRetrans: 0
    TCPSlowStartRetrans: 56

364 failed txns

There are no errors on the ethernet interface, and TSO is enabled.

sysctl.conf:
    net.ipv4.ip_forward = 0
    net.ipv4.conf.default.rp_filter = 1
    net.ipv4.conf.default.accept_source_route = 0
    kernel.sysrq = 0
    kernel.core_uses_pid = 1
    fs.file-max = 65535
    net.core.rmem_max = 262143
    net.core.rmem_default = 262143
    net.core.wmem_max = 262144
    net.core.wmem_default = 262144
    net.ipv4.ip_local_port_range = 1024 65000
    net.ipv4.tcp_keepalive_time = 2400
    net.ipv4.tcp_fin_timeout = 7
    net.ipv4.tcp_window_scaling = 0
    net.ipv4.tcp_sack = 0
    net.ipv4.tcp_orphan_retries = 3
    net.ipv4.tcp_tw_recycle = 1
    kernel.panic = 5

The bnx2 driver is used:
    filename:     /lib/modules/2.6.9-55.ELsmp/kernel/drivers/net/bnx2.ko
    parm:         disable_msi:Disable Message Signaled Interrupt (MSI)
    version:      1.4.43-rh 49524544DE69B4B5647379F
    license:      GPL
    description:  Broadcom NetXtreme II BCM5706/5708 Driver
    author:       Michael Chan <mchan>
    alias:        pci:v000014E4d000016ACsv*sd*bc*sc*i*
    alias:        pci:v000014E4d000016AAsv*sd*bc*sc*i*
    alias:        pci:v000014E4d000016AAsv0000103Csd00003102bc*sc*i*
    alias:        pci:v000014E4d0000164Csv*sd*bc*sc*i*
    alias:        pci:v000014E4d0000164Asv*sd*bc*sc*i*
    alias:        pci:v000014E4d0000164Asv0000103Csd00003106bc*sc*i*
    alias:        pci:v000014E4d0000164Asv0000103Csd00003101bc*sc*i*
    depends:

TCP SACKs are disabled (net.ipv4.tcp_sack = 0).
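Rather than eyeballing the before/after counters above, the deltas can be computed from two saved snapshots of "netstat -s -t" output. A minimal sketch (the function name and /tmp paths are invented for illustration):

```shell
# retrans_delta: print the change in each TCP retransmit counter between a
# "before" and an "after" snapshot of `netstat -s -t` output.
retrans_delta() {
    before=$1 after=$2
    for key in TCPLostRetransmit TCPFastRetrans TCPForwardRetrans TCPSlowStartRetrans; do
        # counters appear as lines like "    TCPFastRetrans: 6212"
        b=$(awk -v k="$key:" '$1 == k {print $2}' "$before")
        a=$(awk -v k="$key:" '$1 == k {print $2}' "$after")
        echo "$key: $((${a:-0} - ${b:-0}))"
    done
}

# usage during a run:
#   netstat -s -t > /tmp/before.txt
#   ... run the load test ...
#   netstat -s -t > /tmp/after.txt
#   retrans_delta /tmp/before.txt /tmp/after.txt
```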
Another interesting data point, if we don't have it already, would be the output of "ethtool -S ethX" for all interfaces involved. Luckily, the driver used here added support for a new stat counter that was previously unavailable, called "rx_fw_discards". This would be useful to help determine whether frames are being dropped by the firmware. The firmware will typically drop frames if there is not adequate space to store incoming frames. This can be alleviated by modifying the ring buffer size or by changing the interrupt coalesce settings. Please gather output from "ethtool -g ethX" and "ethtool -c ethX" as well, so we can take a look at those values and possibly make recommendations if the rx_fw_discards stat is incrementing during the run.
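To see whether rx_fw_discards moves during a run, the counter can be sampled and extracted with a small helper. A sketch; the function name is invented, and the field layout assumes bnx2's "ethtool -S" output as pasted in this ticket:

```shell
# fw_discards: extract the rx_fw_discards value from `ethtool -S ethX`
# output supplied on stdin.
fw_discards() {
    # lines look like "     rx_fw_discards: 0"
    awk -F': ' '/rx_fw_discards/ {print $2}'
}

# sample it every 5 seconds during the test run (root required):
#   while sleep 5; do ethtool -S eth0 | fw_discards; done
```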
[root@pe2950 ~]# ethtool -S eth0
NIC statistics:
     rx_bytes: 656535
     rx_error_bytes: 0
     tx_bytes: 20290266
     tx_error_bytes: 0
     rx_ucast_packets: 6796
     rx_mcast_packets: 569
     rx_bcast_packets: 649
     tx_ucast_packets: 13724
     tx_mcast_packets: 5
     tx_bcast_packets: 7
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 433
     rx_65_to_127_byte_packets: 6904
     rx_128_to_255_byte_packets: 618
     rx_256_to_511_byte_packets: 47
     rx_512_to_1023_byte_packets: 5
     rx_1024_to_1522_byte_packets: 7
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 13
     tx_65_to_127_byte_packets: 288
     tx_128_to_255_byte_packets: 74
     tx_256_to_511_byte_packets: 25
     tx_512_to_1023_byte_packets: 3
     tx_1024_to_1522_byte_packets: 13333
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 0
     tx_xoff_frames: 0
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 1298
     rx_discards: 0
     rx_fw_discards: 0

[root@pe2950 ~]# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 999936
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 18
rx-frames: 6
rx-usecs-irq: 18
rx-frames-irq: 6

tx-usecs: 80
tx-frames: 20
tx-usecs-irq: 80
tx-frames-irq: 20

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

[root@pe2950 ~]#
Based on what I've seen and tested, this does not initially appear to be related to the network device in use. I would not rule it out, but analysis of packet captures (taken here or from the customer) would probably be helpful. Since we can reproduce it in-house, it might also be wise to try some newer kernels.
From tests:

We were able to eliminate the reported server delay by:
1. Disabling TCP Segmentation Offload on the Broadcom NIC using bnx2
2. RH5.3

We were NOT able to eliminate the reported server delay by:
1. sack=1
2. net.ipv4.tcp_tw_recycle=0 (not even sure how this was related, but we tried it anyway as RH advised)
3. tcp_tso_win_divisor=0 (also 1 and 3) as Dell advised
4. RH4.5/4.6 with the most recent Dell-supported bnx2 (version 1.7.6b)
5. RH4.6 with an Intel Pro card (e1000 version 7.3.20-k2-NAPI) with TSO on

Last night we did 2 more tests:
1. The problem was eliminated by disabling TSO on the Intel card (e1000)
2. The problem still persisted with RH4 kernel 2.6.9-78.0.13.ELsmp, which we believe to be the latest kernel update for RH4
to try.

(In reply to comment #4)
> From tests:
> We were able to eliminate the reported server delay by:
> 1. Disable the TCP Segmentation Offload on the Broadcom NIC using bnx2

Interesting.

> 2. RH5.3

Do you mean that the problem does not exist on RHEL5.3? What about 5.2?

> We were NOT able to eliminate the reported server delay by:
> 1. sack=1
> 2. net.ipv4.tcp_tw_recycle=0 (not even sure how this was related but we tried
> it anyway per RH advised)
> 3. tcp_tso_win_divisor=0 (also 1 and 3) per Dell advised
> 4. RH4.5/4.6 with most recent Dell supported bnx2 (version 1.7.6b)
> 5. RH4.6 with Intel Pro card (e1000 version: 7.3.20-k2-NAPI) with TSO on

This would indicate something in the stack and TSO, if both an Intel and a
Broadcom NIC have problems.

> Last night we did 2 more tests:
> 1. Problem was eliminated by disable TSO on Intel card (e1000)
> 2. Problem still persisted with RH4 kernel 2.6.9-78.0.13.Elsmp which we believe
> it to be the latest kernel update for RH4

Having a workaround of disabling TSO is nice, but it would be good to figure
out whether the bug still exists in the RHEL4.8 Beta and, if so, fix it in the
stack there. Did you try the RHEL4.8 Beta kernel? You can quickly get it from:
http://people.redhat.com/vgoyal/rhel4/
Could we get "tc -s qdisc" output before and after the test run for both the "RHEL4 failure" case and the "RHEL5 works" case? I suspect that the system is generating more traffic than the network device can keep up with; we hit the queue limits, drop packets, and this creates a chain reaction of events that leads to the bad delays. If we see fewer drops occurring in the "RHEL5 works" case, it would confirm the theory.

There are some things that can be experimented with to alleviate the symptoms:

1) "ifconfig eth0 txqueuelen XXX", where XXX is some value larger than the default of 1000.
2) Play around with "ethtool -G eth0 tx XXX". This will increase the internal TX queue size used by the driver. By running "ethtool -g eth0" you can see the current settings and also the maximum values allowed by the specific device.

All of the above is to be done on the server.
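For comparing the two cases, the cumulative drop counter can be pulled out of saved "tc -s qdisc" output. A rough sketch; the helper name is invented, and the field layout assumes iproute2's usual "Sent ... (dropped N, overlimits ...)" statistics line:

```shell
# qdisc_drops: print the "dropped" count(s) from `tc -s qdisc show dev ethX`
# output supplied on stdin (one number per qdisc).
qdisc_drops() {
    sed -n 's/.*(dropped \([0-9]*\),.*/\1/p'
}

# example: capture before and after a run, then compare the numbers:
#   tc -s qdisc show dev eth0 > /tmp/qdisc.before
#   ... run the load test ...
#   tc -s qdisc show dev eth0 > /tmp/qdisc.after
```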
Created attachment 332330 [details]
strace from a sample httpd process

I'm not sure how relevant this is, but I straced one of the Apache processes during a transfer to see how it sets up the socket and the sends. It sets the socket to non-blocking, and uses poll and sendfile to try to keep the send queue filled at all times.

I wonder if we are not also seeing process scheduling latency at the server here. Perhaps with the massive dup ACKs and retransmits, the send queue takes longer to clear, poll takes longer to return, and the process loses a bit of priority? Or maybe it has to fetch data from disk for the next sendfile? I imagine not, since the test always fetches the same file, so it will be in memory after the first time.

Maybe we should reduce the number of Apache processes (it wouldn't hurt, since we're not doing parallel transfers) to make them easier to strace, and collect the strace with timestamps to correlate with the capture later.

Then again, this may have nothing to do with TSO. Unless disabling TSO causes the system to remove less data at a time from the send queue, thus allowing for smaller non-blocking sendfile chunks and altering the user-level timing. Or maybe it's just too late and I should be sleeping. Just felt like sharing this wild thought. :)

Cheers, Fabio Olive
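Collecting the strace with timestamps and per-call durations could look like the following sketch; the 10 ms threshold and the helper name are arbitrary illustrations, not values from this ticket:

```shell
# capture one httpd worker with wall-clock timestamps (-tt) and per-syscall
# durations (-T), e.g.:
#   strace -tt -T -p <httpd-pid> -o /tmp/httpd.strace

# slow_sendfiles: from `strace -tt -T` output on stdin, print the sendfile()
# calls that took longer than 10 ms (the duration is the trailing <secs>
# field strace appends to each line).
slow_sendfiles() {
    awk '/sendfile\(/ { d = $NF; gsub(/[<>]/, "", d); if (d + 0 > 0.01) print }'
}
```

The printed timestamps can then be lined up against the packet capture to see whether the slow sendfile calls coincide with the retransmit bursts.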
I know what the problem is, and it has to do with TSO.

In RHEL4 with TSO, if we get packet loss we simply turn off TSO and forget all of the loss information we have accumulated. Eventually what will happen is that we resend the whole send queue after the 2 * RTT retransmit timeout. In fact, if you disable SACK, it's even worse: the kernel will retransmit entire TSO segments and only notice the loss and disable TSO at the moment the retransmit timeout fires. (For reference, see the RHEL4 code in net/ipv4/tcp_input.c:tcp_sacktag_write_queue() that checks NETIF_F_TSO, and the code in net/ipv4/tcp_output.c:tcp_retransmit_skb() that makes the exact same check.)

RHEL5, on the other hand, has special code that carefully resegments the TSO frames in the send queue at the boundaries of the losses we detect. This allows the code to remember all of the loss information and therefore emit exactly the minimal retransmits necessary to fill the lost-segment holes. It also therefore does not need to disable TSO on detection of packet loss.

Backporting these RHEL5 changes would be a huge undertaking. It would be a complete rewrite of how the TCP sendmsg/sendfile path builds TSO frames and how the whole retransmit engine maintains packet-count state in the TCP socket structure. The number of potential regressions in backporting these changes is close to infinite. When initially written upstream, this code took more than 6 months to fully debug.

To be honest, the safest thing to do is to simply run with TSO disabled on systems that have to stick with RHEL4 and cannot afford to run into these extra retransmits. TSO gets disabled on a socket when loss happens on its connection anyway.
Created attachment 332753 [details]
summary & analysis of problem

As promised, this is the summary writeup and root-cause analysis that AT&T asked for. This document should answer all remaining questions the customer had on this ticket. Please let me know if anything else is required.
There is nothing that can really be fixed here, but the attachment in comment #59 sums up the issue and what others can do if they experience these types of issues (hint: disable TSO on RHEL4).
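For reference, the workaround boils down to one ethtool command plus a check. A sketch; eth0 and the exact "tcp segmentation offload" wording are assumptions based on RHEL4-era ethtool output:

```shell
# disable TSO on the serving interface (run as root on the RHEL4 box):
#   ethtool -K eth0 tso off

# tso_enabled: succeed if `ethtool -k ethX` output on stdin reports
# TCP segmentation offload as on.
tso_enabled() {
    grep -qi 'segmentation offload: on'
}

# verify after disabling:
#   ethtool -k eth0 | tso_enabled && echo "TSO still on" || echo "TSO off"
```

Note that this must be reapplied after reboot, e.g. from an init script, since ethtool settings are not persistent.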