Created attachment 985748 [details]
tcpdumps of gluster clients, servers, and firewall

Description of problem:
The Gluster FUSE client seems to send only one RPC ping packet, and marks the remote server down if that single packet is not received.

Version-Release number of selected component (if applicable):
3.6.1

How reproducible:
We have been unable to reliably reproduce this.

Steps to Reproduce:
We have a replica 2 Gluster configuration, with two physical Gluster servers hosting bricks. All clients are VMware virtual machines, and all clients use FUSE to make glusterfs mounts. The servers are in a different subnet than the clients. There is a SonicWall firewall between the subnets.

Randomly throughout the day, Gluster clients claim a ping timeout from a brick server. In every case, the client reports a ping timeout to first one server and then almost immediately to the other server. The client re-establishes a connection to both servers within a few seconds (often within the same second as the disconnect is reported).

Clients do not fail together. That is, client1 will report a disconnect while client2 and client3 are happily using the Gluster volumes.

Actual results:
After much work with tcpdump and Wireshark, it appears to us that the client sends an RPC ping packet to the server. This packet gets lost somewhere, so the server never ACKs it. The client TCP stack appears to retransmit the packet, and we see that the servers do ACK these retransmitted packets. The ACK for the retransmitted packet seems not to be accepted by the client, causing the client to drop the connection to the server.

Expected results:
We would expect the client to be a little more resilient. A single packet retransmission should not tear down the entire Gluster universe. No other application in our network produces anything remotely similar.

Additional info:
Attached are several tcpdumps, from the Gluster clients, servers, and our firewall.
gluster error at Tue Jan 13 07:06:43 EST 2015 (UTC Tue Jan 13 12:06:43 UTC 2015)

GLUSTER CLIENT: 192.168.135.61, GLUSTER SERVER: 192.168.30.115
SRC PORT: 1014, DEST PORT: 49162

1. See t11.pcap7, packet number 85187.
   - This is the initiation of a Gluster Dump RPC call on the gluster client side.
2. See t11.pcap7, packet number 85196.
   - This is a retransmission of the Gluster Dump RPC call in the previous packet.
3. Now, see dump firewall.cap.
   - Missing: the initiation of the Gluster Dump RPC call (from packet 85187 above).
   - However, the retransmission is in packet number 39789.
4. Finally, see dump gluster-t2.pcap3.
   - Again missing: the initiation of the Gluster Dump RPC call.
   - And this time the retransmission is also missing on the server side.
   - We're assuming this is because the firewall dropped it, not knowing it belonged to an active TCP conversation.

Further down in the t11.pcap6 capture you can see the client give up and send TCP Resets for the failed RPC initiations.

There are several RPC calls missing between the client and the firewall in these captures. The details here show one specific example, but notice that we have failed initiations to both gluster servers in these captures.
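The capture comparison described above can be sketched in a few lines of Python. This is only an illustration of the matching logic, not a tool that was used for the analysis: the `Segment` record, the `key` and `missing_copies` helpers, and the sequence number and payload length are all hypothetical. Only the addresses, ports, and packet numbers are taken from the captures. Real captures would first have to be exported from Wireshark or tcpdump into records like these.

```python
from collections import Counter
from typing import NamedTuple


class Segment(NamedTuple):
    number: int   # packet number inside its own capture file
    src: str
    sport: int
    dst: str
    dport: int
    seq: int      # TCP sequence number (hypothetical value below)
    length: int   # TCP payload length (hypothetical value below)


def key(seg):
    """Identify a TCP segment independently of the capture it came from."""
    return (seg.src, seg.sport, seg.dst, seg.dport, seg.seq, seg.length)


def missing_copies(upstream, downstream):
    """Return {segment key: copies seen upstream but not seen downstream}."""
    sent = Counter(key(s) for s in upstream)
    seen = Counter(key(s) for s in downstream)
    return {k: n - seen[k] for k, n in sent.items() if n > seen[k]}


CLIENT, SERVER = "192.168.135.61", "192.168.30.115"
# seq and length are made-up placeholders; the same values mark a retransmission.
call = dict(src=CLIENT, sport=1014, dst=SERVER, dport=49162, seq=1000, length=100)

client_cap = [Segment(85187, **call), Segment(85196, **call)]  # original + retransmission
firewall_cap = [Segment(39789, **call)]                        # only the retransmission
server_cap = []                                                # neither copy arrived

print(missing_copies(client_cap, firewall_cap))  # one copy lost before the firewall
print(missing_copies(client_cap, server_cap))    # both copies lost before the server
```

Matching on the TCP 4-tuple plus sequence number and length is what lets the original call and its retransmission be counted as two copies of the same segment, even though each capture file numbers its packets differently.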
GlusterFS-3.6 is nearing its End-Of-Life; only important security bugs still stand a chance of getting fixed. Moving this to the mainline 'version'. If this needs to get fixed in 3.7 or 3.8, this bug should get cloned.
Hi Scott, thanks for your detailed report. We regret keeping it open for such a long time. Currently we recommend that you upgrade to glusterfs-6.x and see if the behavior is fine for you. With the current scope of things, we can't pick this bug to work on (as there are options such as backup-volfile-server etc.). We will keep this bug under DEFERRED status and revisit it after a couple of releases. We are also looking at implementing a different network-layer-based solution (ref: https://github.com/gluster/glusterfs/issues/391 & https://github.com/gluster/glusterfs/issues/505). Feel free to follow those issues to keep track of the progress.