Bug 1099460 - file locks are not released within an acceptable time when a fuse-client uncleanly disconnects
Summary: file locks are not released within an acceptable time when a fuse-client uncleanly disconnects
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: transport
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Niels de Vos
QA Contact:
URL: http://supercolony.gluster.org/piperm...
Whiteboard:
Depends On: 1129787
Blocks: 1115915 1207556
 
Reported: 2014-05-20 11:25 UTC by Niels de Vos
Modified: 2016-06-17 15:57 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1097309
Environment:
Last Closed: 2016-06-17 15:57:47 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Niels de Vos 2014-05-20 11:25:51 UTC
+++ This bug was initially created as a clone of Bug #1097309 +++

Description of problem:

While using glusterfs-fuse or NFS as the client for accessing an RHS volume: if we do a hard poweroff ('poweroff -f') to simulate a power outage on one of the fuse or NFS clients that holds a lock on a file and is writing to it, the other clients waiting on the same lock only get it after 15 minutes, which is *VERY* long.

Because of the hard poweroff, the client cannot signal to the RHS server that it is down.

After checking netstat output in RHS server:

tcp        0    240 rhs2.server:49152       client1.rhs:exp1     ESTABLISHED 2592/glusterfsd   

glusterfsd has 240 bytes pending in its Send-Q because client-1 is down.

This Send-Q only gets flushed after 15 minutes. We should have an option to set the timeout value for the client connection when the client is not available.

Just like the options we already have for the RHS server:

Option: network.frame-timeout
Default Value: 1800
Description: Time frame after which the (file) operation would be declared as dead, if the *SERVER* does not respond for a particular (file) operation.

Option: network.ping-timeout
Default Value: 42
Description: Time duration for which the client waits to check if the *SERVER* is responsive.
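
For reference, both of the existing options above can be tuned per volume with the gluster CLI; 'myvol' here is a placeholder volume name:

# gluster volume set myvol network.frame-timeout 1800
# gluster volume set myvol network.ping-timeout 42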


Steps to Reproduce:
1. Mount the RHS volume with glusterfs-fuse or NFS on two clients.
2. Take a lock on a file and write to it from client-1 with a small test program (a minimal sketch is given below).
3. Run the same program from client-2, which will wait for client-1 to release the lock.
4. Run the 'poweroff -f' command on client-1.
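
For clarity, here is a minimal sketch of such a test program in C, using a blocking fcntl() write lock. The file path and the commented-out sleep are illustrative assumptions, not the exact program used:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
        /* NOTE: the path is an assumption; any file on the mounted
         * volume will do */
        int fd = open ("/mnt/rhs/lockfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror ("open");
                return 1;
        }

        struct flock fl = { 0 };
        fl.l_type   = F_WRLCK;    /* exclusive write lock      */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = 0;          /* 0 = lock the whole file   */

        /* F_SETLKW blocks until the lock is granted; this is where
         * client-2 waits for client-1 to release the lock */
        if (fcntl (fd, F_SETLKW, &fl) != 0) {
                perror ("fcntl(F_SETLKW)");
                return 1;
        }

        for (;;) {
                /* sleep (5);  <- uncomment to widen the window with no
                 * outstanding WRITE replies (see the comments below) */
                if (write (fd, "x", 1) != 1) {
                        perror ("write");
                        return 1;
                }
        }
}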

Actual results:
The RHS server keeps the lock assigned to client-1 for 15 minutes, even though client-1 is no longer available because of the power outage.

Expected results:

The RHS server should not keep the lock for 15 minutes; it should reset the connection from the client that suffered the power outage after some graceful amount of time.


After doing more tests (thanks Vikhyat!), it seems that reproducing the problem does not have a 100% hit rate. There are two scenarios:

- If a WRITE has finished (the brick returned a WRITE Reply), the lock is
  available for the 2nd client pretty quickly (< 1 minute).

- When a WRITE Reply is sent to the client and the client does not
  acknowledge it, tcp-retransmissions are sent (by the kernel, not directly
  by glusterfsd).

There is a small window in the test-case where a 'poweroff -f' can be executed at the time when there are no outstanding WRITE replies. In this case, it seems that the lock is released correctly. We have simulated this by adding a sleep before the write() in the loop, and doing the 'poweroff -f' while sleeping.

I think we'll have to look at how failures are caught when the brick sends a reply to a client that is offline.


--- Additional comment from Niels de Vos on 2014-05-15 19:24:17 CEST ---

This did not improve things, just documenting the suggestions from some of the network guys:

> 1. Trigger a keepalive after 10 seconds of socket idle;
> 2. Re-probe after every 10 seconds;
> 3. Discard the connection after 3 failed probes.
>  
> Which yields 30 seconds of failure in order to drop the socket.
>  
> In order to enable it:
>  
> 1. Edit /etc/sysctl.conf
> 2. Append the following snippet at the end of the file:
>  
> ## Aggressive keepalive tracking
> net.ipv4.tcp_keepalive_time = 10
> net.ipv4.tcp_keepalive_probes = 3
> net.ipv4.tcp_keepalive_intvl = 10
>  
> 3. And then load the values:
>  
> # sysctl -p
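
To double-check that the new values took effect after 'sysctl -p', they can be read back:

# sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl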


In a similar way, it is possible to change the retransmission tries and
timeouts (default values given):

net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15

Maybe this can be combined with a check for the tcp-socket option
TCP_USER_TIMEOUT that was added to the Linux kernel with
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7
rpc/rpc-transport/socket/src/socket.c seems to be the right place to set this option; the KEEPALIVE options are configured in the __socket_keepalive() function there.
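
As a rough sketch only (not the actual patch, which is posted for review in the comments below), applying TCP_USER_TIMEOUT to a socket could look like this; the socket setup and the 30000 ms value are assumptions for illustration:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18     /* from linux/tcp.h, kernels >= 2.6.37 */
#endif

/* Drop the connection if transmitted data stays unacknowledged for
 * timeout_ms milliseconds, instead of waiting out the full tcp_retries2
 * retransmission cycle (~15 minutes with the defaults). */
static int
set_user_timeout (int sockfd, unsigned int timeout_ms)
{
        if (setsockopt (sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                        &timeout_ms, sizeof (timeout_ms)) != 0) {
                fprintf (stderr, "setsockopt(TCP_USER_TIMEOUT): %s\n",
                         strerror (errno));
                return -1;
        }
        return 0;
}

int
main (void)
{
        int sockfd = socket (AF_INET, SOCK_STREAM, 0);
        if (sockfd < 0 || set_user_timeout (sockfd, 30000) != 0)
                return 1;
        /* ... connect/accept and use the socket as usual ... */
        return 0;
}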

--- Additional comment from Niels de Vos on 2014-05-16 14:01:06 CEST ---

Tested by changing net.ipv4.tcp_retries2 to some lower values (on the
storage server). Tuning this option may be a (temporary?) solution for
reducing the fail-over time.
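
For example, one of the tested values can be applied at runtime like this (persist it via /etc/sysctl.conf if needed):

# sysctl -w net.ipv4.tcp_retries2=5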

On each test, I've captured a tcpdump to verify that there are tcp-retransmissions. After applying the sysctl change, the glusterfsd processes have been restarted.

Results:
These metrics are gathered from the captured tcpdumps. The starting time for
the failover is set at the last WRITE Reply that fails and results in the
retransmissions. The end of the fail-over time is set at the LK Reply that
makes it possible for the 2nd client to continue.

 net.ipv4.tcp_retries2 | number of retrans | failover time (s)
-----------------------+-------------------+-------------------
            2          |         3         |       3.0
            3          |         4         |       6.3
            4          |         5         |      12.6
            5          |         6         |      25.5
            6          |         7         |      51.3
            7          |         7         |      51.5*
            8          |         7         |     102.7

* = looks rather strange, needs to be checked again
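
A possible explanation for the (roughly) doubling failover times, consistent with the numbers above but not separately verified: TCP doubles the retransmission timeout (RTO) on each retry, so with an initial RTO of about 0.4 seconds the failover time for n retransmissions is approximately 0.4 * (2^n - 1) seconds. For example, 0.4 * (2^3 - 1) = 2.8 for tcp_retries2=2 (measured 3.0), and 0.4 * (2^7 - 1) = 50.8 for tcp_retries2=6 (measured 51.3).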

--- Additional comment from  on 2014-05-16 17:06:30 CEST ---

I have tested the above test case with net.ipv4.tcp_retries2 = 7.

Below are the test results:

net.ipv4.tcp_retries2 | number of retrans | failover time (s)
----------------------+-------------------+-------------------
            7         |         7         |      77.0

Comment 2 Anand Avati 2014-05-20 16:25:08 UTC
REVIEW: http://review.gluster.org/7814 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#1) for review on release-3.5 by Niels de Vos (ndevos)

Comment 3 Anand Avati 2014-06-13 21:19:25 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#1) for review on master by Justin Clift (justin)

Comment 4 Niels de Vos 2014-07-03 11:20:39 UTC
This change needs some substantial work:
- http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040755.html

Comment 5 Anand Avati 2014-08-13 16:32:44 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#2) for review on master by Niels de Vos (ndevos)

Comment 6 Niels de Vos 2015-03-30 07:49:22 UTC
http://review.gluster.org/8065 for the master branch has been merged. http://review.gluster.org/7814 can now get updated with a backport. There is no 3.6 bug for this yet; it would need to get included there so that we do not introduce 'regressions' between 3.5 -> 3.6.

Comment 7 Niels de Vos 2016-06-17 15:57:47 UTC
This bug is being closed because GlusterFS 3.5 is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.

