Bug 167572 - Two established TCP/IP connections aborted on the peer's side lead to DUP/ACK storms (deadlock?)
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: Steve Dickson
QA Contact: Cluster QE
Depends On: 167571
Blocks: RHEL4NFSFailover
Reported: 2005-09-05 14:16 EDT by Axel Thimm
Modified: 2010-06-07 00:50 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-06-07 00:50:43 EDT


Attachments: None
Description Axel Thimm 2005-09-05 14:16:16 EDT
Description of problem:
This is related to bug #167571, in the sense that bug #167571 triggers this one.
The problem occurs when the relocation moves the service to a cluster node that
still holds an old established TCP/IP connection to the client, while the client
has already aborted that connection and holds an active TCP/IP connection to the
previous service owner. Both sides then storm each other with packets from their
stale connections.

This is a tethereal snippet of such a situation (first traffic after the
relocation). homes-nfs is the service IP, zs06 the client. Note how both sides
only resend the same packet without responding to the peer's packet. The client
only starts to transmit once the server tries to FIN/ACK the old established
connection.

 14  20.225738 homes-nfs -> zs06 TCP 2049 > 799 [FIN, ACK] Seq=0 Ack=0 Win=1448
Len=0 TSV=124899289 TSER=14101667
 15  20.225753 zs06 -> homes-nfs TCP [TCP ACKed lost segment] [TCP Previous
segment lost] 799 > 2049 [ACK] Seq=1019243553 Ack=1051926772 Win=8512 Len=0
TSV=14154378 TSER=124904147 SLE=0 SRE=1
 16  20.225900 homes-nfs -> zs06 TCP 2049 > 799 [ACK] Seq=1 Ack=0 Win=1448 Len=0
TSV=124899289 TSER=14101667
 17  20.225910 zs06 -> homes-nfs TCP [TCP Dup ACK 15#1] 799 > 2049 [ACK]
Seq=1019243553 Ack=1051926772 Win=8512 Len=0 TSV=14154378 TSER=124904147
 18  20.226063 homes-nfs -> zs06 TCP [TCP Dup ACK 16#1] 2049 > 799 [ACK] Seq=1
Ack=0 Win=1448 Len=0 TSV=124899289 TSER=14101667
 19  20.226072 zs06 -> homes-nfs TCP [TCP Dup ACK 15#2] 799 > 2049 [ACK]
Seq=1019243553 Ack=1051926772 Win=8512 Len=0 TSV=14154378 TSER=124904147
 20  20.226225 homes-nfs -> zs06 TCP [TCP Dup ACK 16#2] 2049 > 799 [ACK] Seq=1
Ack=0 Win=1448 Len=0 TSV=124899289 TSER=14101667
 21  20.226234 zs06 -> homes-nfs TCP [TCP Dup ACK 15#3] 799 > 2049 [ACK]
Seq=1019243553 Ack=1051926772 Win=8512 Len=0 TSV=14154378 TSER=124904147
 22  20.226387 homes-nfs -> zs06 TCP [TCP Dup ACK 16#3] 2049 > 799 [ACK] Seq=1
Ack=0 Win=1448 Len=0 TSV=124899289 TSER=14101667
 23  20.226396 zs06 -> homes-nfs TCP [TCP Dup ACK 15#4] 799 > 2049 [ACK]
Seq=1019243553 Ack=1051926772 Win=8512 Len=0 TSV=14154378 TSER=124904147
 24  20.226550 homes-nfs -> zs06 TCP [TCP Dup ACK 16#4] 2049 > 799 [ACK] Seq=1
Ack=0 Win=1448 Len=0 TSV=124899290 TSER=14101667
[...]
32715  21.660202 zs06 -> homes-nfs TCP [TCP Dup ACK 15#16350] 799 > 2049 [ACK]
Seq=1019243553 Ack=1051926772 Win=8512 Len=0 TSV=14154521 TSER=124904147
32716  21.660189 homes-nfs -> zs06 TCP [TCP Dup ACK 16#16348] 2049 > 799 [ACK]
Seq=1 Ack=0 Win=1448 Len=0 TSV=124900723 TSER=14101667
32717  21.660215 zs06 -> homes-nfs TCP [TCP Dup ACK 15#16351] 799 > 2049 [ACK]
Seq=1019243553 Ack=1051926772 Win=8512 Len=0 TSV=14154521 TSER=124904147
32718  21.660348 homes-nfs -> zs06 TCP [TCP Dup ACK 16#16349] 2049 > 799 [ACK]
Seq=1 Ack=0 Win=1448 Len=0 TSV=124900723 TSER=14101667
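
To gauge the intensity of the storm, the frame numbers and timestamps in the
excerpt above can be turned into a rough packet rate. This is only a sketch: it
assumes that every frame between 15 and 32718, including the elided part,
belongs to this exchange, which the Dup ACK counters (about 16350 per side)
suggest but do not prove. The function name storm_rate is made up for this
illustration.

```python
def storm_rate(first_frame, first_time, last_frame, last_time):
    """Return (packets, seconds, packets per second) between two capture frames.

    Assumes all frames in the range are part of the storm (an assumption,
    since the middle of the trace is elided in the report).
    """
    packets = last_frame - first_frame + 1
    elapsed = last_time - first_time
    return packets, elapsed, packets / elapsed


# Frame 15 is the client's first ACK, frame 32718 the last frame shown.
packets, elapsed, rate = storm_rate(15, 20.225753, 32718, 21.660348)
print(f"{packets} packets in {elapsed:.3f} s, about {rate:.0f} packets/s")
```

Under that assumption the two hosts exchange roughly 23,000 packets per second
for as long as the storm lasts.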


Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. See bug #167571

Actual results:
Long timeouts caused by DUP/ACK storms in both directions. Some clients do not
recover.

Expected results:
Since each of the two TCP/IP connections has been aborted on one side, the
TCP/IP stack should notice this and RST both.

Additional info:
It looks like a deadlock. Each side holds a one-sided TCP/IP connection that
should be reset (RST) by the other side. Instead, the TCP/IP stack only
retransmits its ACK and waits for an RST from the other side, which never comes.
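
The deadlock described above can be reproduced in a toy model. The sketch below
is not kernel code. It assumes that both hosts hold ESTABLISHED state for the
same address/port pair but with unrelated sequence numbers, and it applies only
the RFC 793 rule that a synchronized connection answers an out-of-window
segment with an empty ACK rather than an RST. Sequence wraparound and the check
of the acknowledgment number are ignored. The client values are taken from the
trace; the server values are hypothetical, because the trace shows only
relative numbers for that side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    seq: int
    ack: int


@dataclass
class Endpoint:
    name: str
    snd_nxt: int  # next sequence number this side would send
    rcv_nxt: int  # next sequence number this side expects to receive
    rcv_wnd: int  # receive window size

    def receive(self, seg: Segment) -> Optional[Segment]:
        # Simplified RFC 793 acceptability test for a zero-length segment.
        in_window = self.rcv_nxt <= seg.seq < self.rcv_nxt + self.rcv_wnd
        if in_window:
            return None  # an acceptable pure ACK needs no reply
        # Out of window in a synchronized state: reply with an empty ACK.
        return Segment(seq=self.snd_nxt, ack=self.rcv_nxt)


def count_exchanged(sender: Endpoint, receiver: Endpoint, max_packets: int) -> int:
    """Count packets sent until one side stays silent or the cap is reached."""
    seg: Optional[Segment] = Segment(seq=sender.snd_nxt, ack=sender.rcv_nxt)
    sent = 0
    while seg is not None and sent < max_packets:
        sent += 1
        seg = receiver.receive(seg)
        sender, receiver = receiver, sender
    return sent


# Client numbers from the trace; server numbers are made up for illustration.
client = Endpoint("zs06", snd_nxt=1019243553, rcv_nxt=1051926772, rcv_wnd=8512)
server = Endpoint("homes-nfs", snd_nxt=5000, rcv_nxt=9000, rcv_wnd=1448)
print(count_exchanged(server, client, max_packets=1000))
```

In this model the exchange only stops at the packet cap, because neither side
ever sees a segment it considers acceptable and neither side sends an RST. With
matching sequence numbers, the same function returns after a single packet.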

This is most probably not directly a cluster suite bug, and probably not even an
NFS bug, but perhaps a TCP/IP stack bug in the kernel. However, NFS relocation,
by way of bug #167571, triggers this bug and breaks the relocation.
