Bug 1187347 - RPC ping does not retransmit
Summary: RPC ping does not retransmit
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Milind Changire
QA Contact:
URL:
Whiteboard: rpc-3.4.0?
Depends On:
Blocks:
Reported: 2015-01-29 19:57 UTC by Scott Merrill
Modified: 2019-05-07 14:40 UTC
5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-05-07 14:40:37 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
tcpdumps of gluster clients, servers, and firewall (3.27 MB, application/zip)
2015-01-29 19:57 UTC, Scott Merrill

Description Scott Merrill 2015-01-29 19:57:08 UTC
Created attachment 985748 [details]
tcpdumps of gluster clients, servers, and firewall

Description of problem:
The Gluster FUSE client appears to send only one RPC ping packet, and it marks the remote server down if that single packet goes unanswered.


Version-Release number of selected component (if applicable): 3.6.1


How reproducible:
We have been unable to reliably reproduce this.



Steps to Reproduce:

We have a replica 2 Gluster configuration, with two physical Gluster servers hosting bricks.  All clients are VMware virtual machines, and all clients use FUSE to make glusterfs mounts.

The servers are in a different subnet than the clients.  There is a SonicWall firewall between the subnets.

Randomly throughout the day, Gluster clients claim a ping timeout from a brick server. In every case, the client reports a ping timeout first to one server and then, almost immediately, to the other server. The client re-establishes a connection to both servers within a few seconds (often within the same second the disconnect is reported).

Clients do not fail together. That is, client1 will report a disconnect while client2 and client3 are happily using the Gluster volumes.


Actual results:

After much tcpdump and Wireshark analysis, it appears to us that the clients send an RPC ping packet to the server. This packet is getting lost somewhere, so the servers never ACK it. The client TCP stack retransmits the packet, and we see that the servers do ACK these retransmitted packets.

The ACK for the retransmitted packet seems not to be accepted by the client, causing the client to drop the connection to the server.
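For reference, the timing here should not be the problem: with a typical initial TCP retransmission timeout of about one second doubling on each attempt (the exact values are assumptions, not taken from our captures), every kernel-level retransmission fires long before GlusterFS's default 42-second network.ping-timeout expires. A quick back-of-the-envelope check:

```python
# Rough sanity check (not GlusterFS code): cumulative times at which
# classic exponential-backoff TCP retransmissions fire, compared against
# GlusterFS's default 42-second network.ping-timeout.

def retransmit_times(initial_rto=1.0, attempts=5):
    """Seconds after the original send at which each TCP retransmission
    fires, assuming the RTO doubles on every failed attempt."""
    times, elapsed, rto = [], 0.0, initial_rto
    for _ in range(attempts):
        elapsed += rto
        times.append(elapsed)
        rto *= 2
    return times

schedule = retransmit_times()
print(schedule)                         # [1.0, 3.0, 7.0, 15.0, 31.0]
print(all(t < 42.0 for t in schedule))  # True: every retry beats the timeout
```

So a single lost packet that TCP successfully retransmits should be invisible to the application-level ping timer, which is why the dropped connection looks like a client-side bug rather than a timing issue.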


Expected results:


We would expect the client to be a little more resilient. A single packet retransmission should not tear down the entire Gluster universe. No other application in our network exhibits anything remotely similar.
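The resilience we are asking for can be sketched roughly as follows (a hypothetical helper, not GlusterFS's actual rpc code): retry the application-level ping a few times before declaring the peer dead, instead of giving up after one unanswered ping.

```python
import time

def ping_with_retries(send_ping, retries=3, interval=0.0):
    """send_ping() returns True on a pong, False on silence.
    The peer is declared down only after every attempt fails."""
    for _ in range(retries):
        if send_ping():
            return True          # peer answered; connection stays up
        time.sleep(interval)     # back off before the next attempt
    return False                 # all pings unanswered: peer is down

# Example: the first ping is "lost", the second succeeds -- the peer
# should still be considered alive.
replies = iter([False, True])
print(ping_with_retries(lambda: next(replies)))  # True
```

With logic like this, a single dropped ping (or a momentarily confused firewall) would cost one retry interval rather than a full disconnect/reconnect cycle.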

Additional info:

Attached are several tcpdumps, from the Gluster clients, servers, and our firewall.

gluster error at Tue Jan 13 07:06:43 EST 2015 (UTC Tue Jan 13 12:06:43
UTC 2015)

GLUSTER CLIENT: 192.168.135.61, GLUSTER SERVER: 192.168.30.115
SRC PORT: 1014, DEST PORT: 49162

1. See t11.pcap7.
 - packet number 85187
 - This is the initiation of a Gluster Dump RPC call on the gluster client side

2. See t11.pcap7, packet number 85196.
 - This is a retransmission of the Gluster Dump RPC call in the previous packet.

3. Now, see dump firewall.cap
 - Missing: the initiation of the Gluster Dump RPC call (from packet 85187 above)
 - However, the retransmission is in packet number 39789

4. Finally, see dump gluster-t2.pcap3
 - Again missing: the initiation of the Gluster Dump RPC call
 - And this time the retransmission is also missing on the server side.
 - We're assuming this is because the firewall dropped it, not knowing it belonged to an active TCP conversation.


Further down in the t11.pcap6 capture you can see the client gives up and sends TCP Resets for the failed RPC initiations. There are several RPC calls missing from the client to the firewall in these captures. The details above show one specific example, but notice that we have failed initiations to both gluster servers in these captures.

Comment 1 Kaushal 2016-08-30 12:40:23 UTC
GlusterFS-3.6 is nearing its End-Of-Life; only important security bugs still have a chance of getting fixed. Moving this to the mainline 'version'. If this needs to be fixed in 3.7 or 3.8, this bug should be cloned.

Comment 2 Amar Tumballi 2019-05-07 14:40:37 UTC
Hi Scott, thanks for your detailed report. We regret keeping it open for such a long time. We currently recommend upgrading to glusterfs-6.x to see if the behavior is fine for you. With the current scope of things, we can't pick this bug up to work on (as there are options such as backup-volfile-server, etc.).

We will keep this bug under DEFERRED status and revisit it after a couple of releases. We are also looking at implementing a different network-layer solution (ref: https://github.com/gluster/glusterfs/issues/391 & https://github.com/gluster/glusterfs/issues/505). Feel free to follow those issues to keep track of the progress.

