Bug 1203742

Summary:

TCP bug creates additional RTO in very specific condition

Product:

Red Hat Enterprise Linux 6

Reporter:

Matias Bonaventura <matiasb>

Component:

kernel

Assignee:

Thadeu Lima de Souza Cascardo <cascardo>

kernel sub component:

tcp

QA Contact:

xmu

Status:

CLOSED INSUFFICIENT_DATA

Docs Contact:

Severity:

high

Priority:

medium

CC:

cascardo, hsowa, jaroslaw.polok, kzhang, matiasb, michele, mleitner, network-qe, rkhan, thomas.oulevey

Version:

6.6

Flags:

matiasb: needinfo-

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2015-11-06 14:42:43 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1172231

Attachments:

Description	Flags
tcpdump from the server showing the bug	none
ftrace for tcp	none

Description Matias Bonaventura 2015-03-19 15:02:29 UTC

Created attachment 1003867 [details]
tcpdump from the server showing the bug

To be reported at: https://bugzilla.kernel.org/

Hi all,

We found an unexpected behavior in the applications which we are using that appears to be a bug in the TCP algorithm. Following tcpdumps we detected that a second unnecessary retransmission timeout (RTO) occurs after a first valid RTO. This occurs only in a very particular situation: when there are 2 packets pending to be sent at the moment the first retransmission occurs (and also in the context of our application which operates in a very low latency network, and which its only traffic is receiving a single request message and sending a ~12KB response). This situation occurs very often in our context and because the application operates at very low latency a RTO impacts severely the performance.
More details of the application communication and tcpdumps with the bug explanation is copied below.

Is this a proper bug or is there something we are missing expected TCP behavoir? Also, is there is this a known way to avoid this unexpected behaviour?

Thank you very much,
Regards,
Matias

Environment
OS: SLC 6
 $ uname -a
Linux <pcName> 2.6.32-504.3.3.el6.x86_64 #1 SMP Wed Dec 17 09:22:39 CET 2014 x86_64 x86_64 x86_64 GNU/Linux

TCP configuration: full configuration from  /proc/sys/net/ipv4/* pasted at the bottom.
tcp_base_mss=512                           
tcp_congestion_control=cubic   
tcp_sack=1


Application communication description:  We are using 2 applications which communicate using TCP: a client and around 200 server application.  The client sends a request message of 188B (including headers) to all servers and waits for a response of all of them. The client does not send any other message until the response of all servers is received. Each server upon receiving the request, it sends a 12KB response (which is obviously splitted into several TCP packets).  Because there are 200 servers responding at almost the same moment (with a total of ~2.4MB) some buffers in the network may overflow generating drops and retransmissions.
When there are no drops (thanks to control application that limits the requests sent) the latency to receive all messages from all servers is ~20ms. If there is a drop of one or more TCP fragments then the latency goes to near ~200ms (this is because of the minimum RTO of 200ms hardcoded in the kernel). Even if this is 10 times higher it is more or less under acceptable for the application. The bug creates a second consecutive retransmission so the latency when this occurs goes to 600ms (200ms of the first RTO + 400ms of the second unexpected RTO), which is out of the limits that the application can handle (60 times higher).

Bug detailed description:
The unexpected behavior appears in the server applications when TCP needs to retransmit drops packets. It appears in all server applications at a quite a high frequency.
The bug appears only when the server detected a drop (by a RTO after 200ms) and at that moment it is still pending to receive the ACK for 2 packets. In that case, after 200ms of sending all packets, the RTO triggers the retransmission of the first packet, then the ACK for that packet is received but the second packet is not retransmitted at that moment. After another 400ms another RTO is triggered and that second packet is retransmitted and ACKed. To our understanding this second retransmission should not occur. The expected behaviour is that the second packet is retransmitted right after receiving the ACK for the first retransmitted packet.
Also this unexpected second RTO occurs only if there are 2 pending packets at the moment of the first RTO.  If  there is one packet to retransmit for more than 2, the behaviour is as expected, all packets are retransmitted and ACKed after the first RTO (there is no second RTO). 

Below the explanation and a section of a tcpdump recorded in one of the server applications showing the unexpected behaviour. 

Frame #170:  request is received (at T 0)
Frames #171-#173: response is sent splitted into several TCP packets. From seq=204273 to seq=216289.
Frame #171 and #172 are recorded by tcpdump as a single packet but is probably several real packets as the MSS is 1460 bytes and it shows a lenght higher than that (this is probably caused because NICs support segmentation offloads, which means, that the NIC joins the segments together and pushes it to the host’s TCP stack a single segment. This is why tcpdump sees it as a segment of higher length).
Frames #173-#177: ACKs for some of the sent packets is received. Last seq acknowledged is seq=442797 (there is still 1796 bytes to be sent, which is 2 TCP packets).
Frame #178: At T 207ms a packet is retransmitted. This is the first retransmission, which makes total sense as the ACKs for 2 packets were not received after 200ms. Because of the RTO the TCP internal state should be updated to duplicate the RTO (so it should be 400ms now). Also the CWND should be reduced to 1.
Frame #179: ACK for the retransmitted packet is received. 
The internal state of TCP should be update to duplicate the CWND because of slow start (so should be set now to 2). RTO is not updated because calculation of RTO is based only in packets which were not retransmitted.
At this point we would expect that the pending packet should be retransmitted, but this does not occur. After receiving an ACK the CWDN should allow more packets to be sent, but there is no data sent by the server (and consequently it receives nothing).
Frame #180: at T 613ms (aprox ~400ms after the last received ACK) the last packet is retransmitted.
This is what creates a 600ms latency which is 60 times the expected and 6 times higher if the bug would not be present.
Frame #181: ACK for the last packet is received.
Frame #182: a new request is received..


No.  Time  	Source	Destination   Protocol RTO    	Length Info
170 *REF*   	DCM      	ROS     	TCP        	118	47997 > 41418 [PSH, ACK] Seq=1089 Ack=204273 Win=10757 Len=64
171 0.000073	ROS     	DCM      	TCP        	5894   41418 > 47997 [ACK] Seq=204273 Ack=1153 Win=58 Len=5840
172 0.000080	ROS     	DCM      	TCP        	5894   41418 > 47997 [ACK] Seq=210113 Ack=1153 Win=58 Len=5840
173 0.000083	ROS     	DCM      	TCP        	390	41418 > 47997 [PSH, ACK] Seq=215953 Ack=1153 Win=58 Len=336[Packet size limited during capture]
174 0.003901	DCM      	ROS     	TCP       	60 	47997 > 41418 [ACK] Seq=1153 Ack=207193 Win=10757 Len=0
175 0.004270	DCM      	ROS     	TCP        	60 	47997 > 41418 [ACK] Seq=1153 Ack=211573 Win=10768 Len=0
176 0.004649	DCM      	ROS     	TCP        	60 	47997 > 41418 [ACK] Seq=1153 Ack=213033 Win=10768 Len=0
177 0.004835	DCM      	ROS     	TCP        	66 	[TCP Dup ACK 176#1] 47997 > 41418 [ACK] Seq=1153 Ack=213033 Win=10768 Len=0 SLE=214493 SRE=215953
178 0.207472	ROS     	DCM      	TCP	0.207389000  1514   [TCP Retransmission] 41418 > 47997 [ACK] Seq=213033 Ack=1153 Win=58 Len=1460
179 0.207609	DCM      	ROS     	TCP        	60 	47997 > 41418 [ACK] Seq=1153 Ack=215953 Win=10768 Len=0
180 0.613472	ROS     	DCM      	TCP	0.613389000  390	[TCP Retransmission] 41418 > 47997 [PSH, ACK] Seq=215953 Ack=1153 Win=58 Len=336[Packet size limited during capture]
181 0.613622	DCM      	ROS     	TCP        	60 	47997 > 41418 [ACK] Seq=1153 Ack=216289 Win=10768 Len=0
182 0.615189	DCM      	ROS     	TCP        	118	47997 > 41418 [PSH, ACK] Seq=1153 Ack=216289 Win=10768 Len=64


 



Full TCP configuration: for f in /proc/sys/net/ipv4/* ;do confName=$(basename "$f") ; echo -n "$confName="  >> /logs/tpu_TCP_config.txt ; cat "$f" >> /logs/tpu_TCP_config.txt ;done

cipso_cache_bucket_size=10
cipso_cache_enable=1      
cipso_rbm_optfmt=0        
cipso_rbm_strictvalid=1   
icmp_echo_ignore_all=0
icmp_echo_ignore_broadcasts=1
icmp_errors_use_inbound_ifaddr=0
icmp_ignore_bogus_error_responses=1
icmp_ratelimit=1000                
icmp_ratemask=6168                 
igmp_max_memberships=20            
igmp_max_msf=10                    
inet_peer_gc_maxtime=120           
inet_peer_gc_mintime=10            
inet_peer_maxttl=600               
inet_peer_minttl=120               
inet_peer_threshold=65664          
ip_default_ttl=64                  
ip_dynaddr=0                       
ip_forward=0                       
ipfrag_high_thresh=262144          
ipfrag_low_thresh=196608           
ipfrag_max_dist=64                 
ipfrag_secret_interval=600         
ipfrag_time=30                     
ip_local_port_range=32768       61000
ip_local_reserved_ports=             
ip_nonlocal_bind=0                   
ip_no_pmtu_disc=0                    
ping_group_range=1        0    
rt_cache_rebuild_count=4       
tcp_abc=0                            
tcp_abort_on_overflow=0              
tcp_adv_win_scale=2                  
tcp_allowed_congestion_control=cubic reno
tcp_app_win=31                           
tcp_available_congestion_control=cubic reno
tcp_base_mss=512                           
tcp_challenge_ack_limit=100                
tcp_congestion_control=cubic               
tcp_dma_copybreak=262144                   
tcp_dsack=1                                
tcp_ecn=2                                  
tcp_fack=1                                 
tcp_fin_timeout=60                         
tcp_frto=2                                 
tcp_frto_response=0                        
tcp_keepalive_intvl=75                     
tcp_keepalive_probes=9                     
tcp_keepalive_time=7200                    
tcp_limit_output_bytes=131072              
tcp_low_latency=0                          
tcp_max_orphans=262144                     
tcp_max_ssthresh=0                         
tcp_max_syn_backlog=2048                   
tcp_max_tw_buckets=262144                  
tcp_mem=2316864 3089152 4633728            
tcp_min_tso_segs=2                         
tcp_moderate_rcvbuf=1                      
tcp_mtu_probing=0                          
tcp_no_metrics_save=0                      
tcp_orphan_retries=0                       
tcp_reordering=3                           
tcp_retrans_collapse=1                     
tcp_retries1=3                             
tcp_retries2=15                            
tcp_rfc1337=0                              
tcp_rmem=4096   87380   4194304
tcp_sack=1
tcp_slow_start_after_idle=0
tcp_stdurg=0
tcp_synack_retries=5
tcp_syncookies=1
tcp_syn_retries=5
tcp_thin_dupack=0
tcp_thin_linear_timeouts=0
tcp_timestamps=0
tcp_tso_win_divisor=3
tcp_tw_recycle=0
tcp_tw_reuse=0
tcp_window_scaling=1
tcp_wmem=4096   65536   4194304
tcp_workaround_signed_windows=0
udp_mem=2316864 3089152 4633728
udp_rmem_min=4096
udp_wmem_min=4096
xfrm4_gc_thresh=4194304

Comment 2 Matias Bonaventura 2015-03-29 09:31:00 UTC

Would it possible to mark this bug as public (viewable by public)? 
I would need technicians to be able to see the description and the progress of the bug report. Also I need to add references to it in some documents.

Thanks a lot!

Comment 3 Matias Bonaventura 2015-04-26 16:59:24 UTC

Hi, is there anyone trying to follow this bug? any updated?

Also, would it possible to mark this bug as public (viewable by public)? 
Apparently I have no permissions for this.

Thanks

Comment 4 Marcelo Ricardo Leitner 2015-04-27 12:27:49 UTC

Hi Matias, sorry the delay.

Yes, it's possible to open it. I just did it. But I'm afraid, as you've noticed, that we weren't able to evaluate your report so far.

Comment 6 Thadeu Lima de Souza Cascardo 2015-05-12 14:49:25 UTC

It looks like you are using RHEL 6.2. RHEL 6.3 received some TCP fixes, can you try it and see if it fixes the problem for you?

Cascardo.

Comment 9 Thadeu Lima de Souza Cascardo 2015-05-13 12:08:30 UTC

Hi, Matias.

Can you confirm you are in fact using RHEL6.6 and not RHEL6.2?

By the way, are there any socket options that your application uses either client or server side?

Have you considered changing the rto_min of the route in order to workaround the minimal of 200ms? You can do that by issuing ip route replace address/mask dev ethX rto_min 50ms, for example. If it's a local network, you may need to add the option table local.

Cascardo.

Comment 10 Matias Bonaventura 2015-05-22 13:12:47 UTC

Hi all,

The tcpdump attached was using RHEL 6.5 (kernel 2.6.32-431.3.3.el6.x86_64) so if the changes to TCP where introduced in 6.3 they should have been in. We are now using RHEL 6.6 (kernel 2.6.32-504.3.3.el6.x86_64). I could repeat tests with the new kernel, but it will take me some time to find availability in the network to perform the tests.

We didn't thought of changing the rto_min of the routes using "ip route", but it sounds like a good option. But before doing that change we would need to test it first (setting a rto to low can be dangerous) so hopefully next week we might give it a try.

The server application does use one socket option: it uses the NODELAY socket option to disable Nagle's algorithm.

Hopefully this helps. Let me know if you need more info.

Regards,
Matias

Comment 11 Thadeu Lima de Souza Cascardo 2015-05-28 12:17:34 UTC

I tried to reproduce this issue, forcing packets to be dropped.

Specifically, this happens when a hole is lost, and the last packet is lost as well. Notice the SACK in the pcap file. I tried to reproduce this exact scenario, but still, the server sent the last packet right after receiving the ack for the first retransmission.

I went through the code trying to come up with some specific scenario that could apply to this case, but couldn't find anything obvious. One possibility was previous packet losses, but I forced that as well, and still that didn't work out.

Other possibilities are the sender dropping the packets itself, but that didn't seem likely giving the other data. I am not sure that could cause the second RTO, but it could be worth testing.

Some data that could be helpful is MIB before the issue and after the issue. Maybe it could show some points that we hit that I am not considering, or that I disregarded as not the case.

The best possible data I can think of would be ftrace for tcp functions. That, however, may be challenging, depending on how long it takes to reproduce the issue, the presence of any other TCP flows in the system, etc.

In case you think you can do it, here are the instructions:

# mount -t debugfs none /sys/kernel/debug/
# cd /sys/kernel/debug/tracing/
# echo '*tcp*' > set_ftrace_filter
# echo function > current_tracer
# echo 1 > tracing_on
# cat trace_pipe > /var/tmp/tcp_tracer.log

If possible, do it with only that single TCP flow. That means using a serial console is preferable than using SSH.

Thanks.
Cascardo

Comment 12 Thadeu Lima de Souza Cascardo 2015-05-28 12:35:09 UTC

Created attachment 1031267 [details]
ftrace for tcp

Here is the trace I did for the case I created.

The relevant parts here are:

          server-2714  [000] 71742.919845: tcp_sacktag_write_queue <-tcp_ack

Where we receive the SACK, but do not retransmit immediately.

          <idle>-0     [000] 71743.120434: tcp_enter_frto <-tcp_retransmit_timer
          <idle>-0     [000] 71743.120434: tcp_retransmit_skb <-tcp_retransmit_timer

Where the RTO happens, about 200ms later, notice that we enter FRTO.

          server-2714  [000] 71743.122755: tcp_enter_frto_loss <-tcp_ack
          server-2714  [000] 71743.122755: tcp_fastretrans_alert <-tcp_ack
          server-2714  [000] 71743.122756: tcp_xmit_retransmit_queue <-tcp_fastretrans_alert

Where we get the following ACK, and retransmit the last piece, no second RTO.

Cascardo.

Comment 13 Matias Bonaventura 2015-06-03 07:10:53 UTC

Hi Cascardo, thank you very much for looking at the issue.

I will try to prepare the ftraces and send them to you, but it will take while to prepare them. 

Regarding: "Specifically, this happens when a hole is lost, and the last packet is lost as well."
My only comment to narrow your tests is that in my tcpdumps it always happened when exactly 3 packets (including the last one) where lost. If 4 or more packets were lost the problem was not there. If 2 or 1 packet were lost the problem was not there.


Cheers!

Comment 14 Thadeu Lima de Souza Cascardo 2015-08-07 00:24:50 UTC

Hi, Matias.

Were you able to prepare the ftraces?

Thanks.
Cascardo.

Comment 15 Matias Bonaventura 2015-08-07 12:00:20 UTC

Hi all, 

The lab is currently in production so it is really hard to prepare the setup for the ftraces. On the other hand, I am not working any more at CERN, so I personally don't have access to the lab or the testing environment.
I have the information of the tests we did on the past, but I will not be able to continue performing new tests.

Sorry for that. 
Regards, 
Matias

Comment 17 Thadeu Lima de Souza Cascardo 2015-09-02 22:55:30 UTC

So, I managed to reproduce this with the following packetdrill:

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0 bind(3, ..., ...) = 0
0 listen(3, 1) = 0

0 < S 0:0(0) win 32792 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
+0.020 < . 1:1(0) ack 1 win 257

+0 accept(3, ..., ...) = 4

+0.5 < . 1:65(64) ack 1 win 257
+0 > . 1:1(0) ack 65

+0 write(4, ..., 10016) = 10016

+0 > . 1:2921(2920) ack 65
+0 > . 2921:5841(2920) ack 65
+0 > . 5841:8761(2920) ack 65
+0 > P. 8761:10017(1256) ack 65

+0.020 < . 65:65(0) ack 2921 win 257
+0 < . 65:65(0) ack 2921 win 257 <sack 4381:5841,nop,nop>

+0.22 > . 2921:4381(1460) ack 65
+0 < . 65:65(0) ack 8761 win 257
+0.44 > P. 8761:10017(1256) ack 65


However, I reproduced that on CentOS 7.1. When trying to reproduce on RHEL7.1, the last delta is around 230ms, not 440ms. And when running on RHEL6.6, it's 0ms.

I am going to create ftraces so I can look more into this.

Cascardo.

Comment 19 Thadeu Lima de Souza Cascardo 2015-11-06 14:42:43 UTC

Since we can't reproduce the issue and we don't have a contact anymore with the bug reporter in order to help us test any fixes or collect any other debug data, we will close this since the data is insufficient.

If there is any new data, please reopen the bug.

Thanks.
Cascardo.