+++ This bug was initially created as a clone of Bug #1097309 +++

Description of problem:

While using glusterfs-fuse or NFS as a client for accessing a RHS volume:
if we do a hard poweroff ('poweroff -f', to simulate a power outage) on a
fuse or NFS client that holds a lock on a file and is writing to it, the
other clients waiting on the same lock only get the lock after 15 minutes,
which is *VERY* long. Because of the hard poweroff, the client is not able
to send any signal to the RHS server that it is down.

netstat output on the RHS server:

  tcp  0  240  rhs2.server:49152  client1.rhs:exp1  ESTABLISHED  2592/glusterfsd

glusterfsd has 240 bytes pending in its Send-Q because client-1 is down,
and this Send-Q is only getting flushed after 15 minutes. We should have an
option to set the timeout value for the client connection when the client
is not available, just like the options we have for the RHS server:

  Option: network.frame-timeout
  Default Value: 1800
  Description: Time frame after which the (file) operation would be
      declared as dead, if the *SERVER* does not respond for a
      particular (file) operation.

  Option: network.ping-timeout
  Default Value: 42
  Description: Time duration for which the client waits to check if the
      *SERVER* is responsive.

Steps to Reproduce:
1. Mount a RHS volume with glusterfs-fuse or NFS on two clients.
2. Take a lock on a file, and write to it, from client-1 with some test
   program.
3. Run the same program on client-2, which will wait for client-1 to
   release the lock.
4. Run 'poweroff -f' on client-1.

Actual results:
The RHS server keeps the lock assigned to client-1 for 15 minutes, even
though client-1 is no longer available because of the power outage.

Expected results:
The RHS server should not keep the lock for 15 minutes; it should reset
the connection to the powered-off client after some graceful amount of
time.

After doing more tests (thanks Vikhyat!), it seems that reproducing the
problem does not have a 100% hit rate. There are two scenarios:

- If a WRITE has finished (the brick returned a WRITE Reply), the lock is
  available for the 2nd client pretty quickly (< 1 minute).
- When a WRITE Reply is sent to the client and the client does not accept
  it, there are tcp-retransmissions sent (by the kernel, not directly by
  glusterfsd).

There is a small window in the test-case where a 'poweroff -f' can be
executed at a time when there are no outstanding WRITE Replies. In this
case, it seems that the lock is released correctly. We have simulated this
by adding a sleep before the write() in the loop, and doing the
'poweroff -f' while sleeping; see the sketch below.

I think we'll have to look at how failures are caught when the brick sends
a reply to a client that is offline.
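A minimal sketch of such a test program (the actual reproducer is not
attached to this bug, so the details here are assumptions): it takes a
whole-file fcntl() write lock and then writes in a loop, with the sleep()
before the write() mentioned above.

  /* Hypothetical reproducer; run on client-1, then on client-2
   * against the same file on the mounted volume. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
          struct flock fl;
          int fd;

          if (argc < 2) {
                  fprintf(stderr, "usage: %s <file-on-mounted-volume>\n",
                          argv[0]);
                  return 1;
          }

          fd = open(argv[1], O_RDWR | O_CREAT, 0644);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          memset(&fl, 0, sizeof(fl));
          fl.l_type = F_WRLCK;   /* exclusive write lock */
          fl.l_whence = SEEK_SET;
          fl.l_start = 0;
          fl.l_len = 0;          /* lock the whole file */

          /* client-2 blocks here until client-1 releases the lock */
          if (fcntl(fd, F_SETLKW, &fl) < 0) {
                  perror("fcntl");
                  return 1;
          }

          for (;;) {
                  /* the sleep widens the window in which there are no
                   * outstanding WRITE Replies, see above */
                  sleep(1);
                  if (write(fd, "x", 1) < 0) {
                          perror("write");
                          break;
                  }
          }

          return 0;
  }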
--- Additional comment from Niels de Vos on 2014-05-15 19:24:17 CEST ---

This did not improve things; just documenting the suggestions from some of
the network guys:

> 1. Trigger a keepalive after 10 seconds of socket idle;
> 2. Re-probe after every 10 seconds;
> 3. Discard the connection after 3 failed probes.
>
> Which yields 30 seconds of failure in order to drop the socket.
>
> In order to enable it:
>
> 1. Edit /etc/sysctl.conf
> 2. Append the following snippet at the end of the file:
>
>    ## Aggressive keepalive tracking
>    net.ipv4.tcp_keepalive_time = 10
>    net.ipv4.tcp_keepalive_probes = 3
>    net.ipv4.tcp_keepalive_intvl = 10
>
> 3. And then, load the values:
>
>    # sysctl -p

In a similar way, it is possible to change the retransmission tries and
timeouts (default values given):

  net.ipv4.tcp_retries1 = 3
  net.ipv4.tcp_retries2 = 15

Maybe this can be combined with a check for the tcp-socket option
TCP_USER_TIMEOUT that was added to the Linux kernel with
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7

rpc/rpc-transport/socket/src/socket.c seems to be the right place to set
this option; the KEEPALIVE is configured in the __socket_keepalive()
function.

--- Additional comment from Niels de Vos on 2014-05-16 14:01:06 CEST ---

Tested with changing net.ipv4.tcp_retries2 to some lower values (on the
storage server). Tuning this option may be a (temporary?) solution for
reducing the fail-over time. On each test, I've captured a tcpdump to
verify that there are tcp-retransmissions. After applying the sysctl
change, the glusterfsd processes have been restarted.

Results:

These metrics are gathered from the captured tcpdumps. The starting time
for the failover is set at the last WRITE Reply that fails and results in
the retransmissions. The end of the fail-over time is set at the LK Reply
that makes it possible for the 2nd client to continue.

  net.ipv4.tcp_retries2 | number of retrans | failover time (s)
  ----------------------+-------------------+-------------------
                      2 |                 3 |              3.0
                      3 |                 4 |              6.3
                      4 |                 5 |             12.6
                      5 |                 6 |             25.5
                      6 |                 7 |             51.3
                      7 |                 7 |             51.5*
                      8 |                 7 |            102.7

  * = looks rather strange, needs to be checked again

--- Additional comment from on 2014-05-16 17:06:30 CEST ---

I have tested the above test case with net.ipv4.tcp_retries2 = 7; below
are the test results:

  net.ipv4.tcp_retries2 | number of retrans | failover time (s)
  ----------------------+-------------------+-------------------
                      7 |                 7 |             77.0
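As a sanity check on the numbers above: the measured fail-over times are
consistent with TCP's exponential back-off, assuming the Linux minimum
retransmission timeout of 200 ms (which the RTO sits at on a low-latency
LAN) doubling on every retry. With r retransmissions the total wait is
roughly

  0.2 s * (2^(r+1) - 1)

e.g. r = 3 gives 0.2 * 15 = 3.0 s, and r = 5 gives 0.2 * 63 = 12.6 s,
matching the tcp_retries2 = 2 and = 4 rows. This is back-of-the-envelope
reasoning from the measurements, not something taken from the tcpdumps.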
REVIEW: http://review.gluster.org/7814 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#1) for review on release-3.5 by Niels de Vos (ndevos)
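A minimal sketch of what such a change boils down to (the helper name and
timeout value are made up for illustration; the real change is in the
review above, next to the KEEPALIVE handling in __socket_keepalive()):

  /* Sketch only: setting TCP_USER_TIMEOUT on a brick's client socket.
   * Linux >= 2.6.37 is needed for TCP_USER_TIMEOUT. */
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  static int
  set_tcp_user_timeout(int sock, unsigned int timeout_ms)
  {
          /* drop the connection when transmitted data stays
           * unacknowledged for timeout_ms milliseconds, instead of
           * waiting out the full tcp_retries2 back-off */
          if (setsockopt(sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                         &timeout_ms, sizeof(timeout_ms)) != 0) {
                  fprintf(stderr, "TCP_USER_TIMEOUT: %s\n",
                          strerror(errno));
                  return -1;
          }
          return 0;
  }

  /* e.g. set_tcp_user_timeout(sock, 42 * 1000) would line up with the
   * 42 second network.ping-timeout default mentioned above */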
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#1) for review on master by Justin Clift (justin)
This change needs some substantial work:
- http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040755.html
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#2) for review on master by Niels de Vos (ndevos)
http://review.gluster.org/8065 for the master branch has been merged. http://review.gluster.org/7814 can now get updated with a backport. There is no 3.6 bug for this yet; it would need to get included there so that we do not introduce 'regressions' between 3.5 -> 3.6.
This bug is getting closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.