Bug 1129787 - file locks are not released within an acceptable time when a fuse-client uncleanly disconnects
Summary: file locks are not released within an acceptable time when a fuse-client uncleanly disconnects
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: transport
Version: mainline
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Niels de Vos
QA Contact:
URL: http://supercolony.gluster.org/piperm...
Whiteboard:
Depends On:
Blocks: 1099460
 
Reported: 2014-08-13 16:36 UTC by Niels de Vos
Modified: 2015-05-14 17:43 UTC
CC List: 1 user

Fixed In Version: glusterfs-3.7.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-05-14 17:27:07 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Niels de Vos 2014-08-13 16:36:30 UTC
+++ This bug was initially created as a clone of Bug #1099460 +++

Description of problem:

While accessing an RHS volume through a glusterfs-fuse or NFS client:
If we do a hard poweroff (poweroff -f), to simulate a power outage, on a fuse or NFS client that holds a lock on a file and is writing to it, other clients waiting on the same lock only get the lock after 15 minutes, which is *VERY* long.

Because of the hard poweroff, the client cannot signal to the RHS server that it has gone down.

After checking the netstat output on the RHS server:

tcp        0    240 rhs2.server:49152       client1.rhs:exp1     ESTABLISHED 2592/glusterfsd   

glusterfsd has 240 bytes pending in its Send-Q because client-1 is down.

This Send-Q is only flushed after 15 minutes. We should have an option to set a timeout for the client connection when the client is not available.

Just like the options we already have for the RHS server:

Option: network.frame-timeout
Default Value: 1800
Description: Time frame after which the (file) operation would be declared as dead, if the *SERVER* does not respond for a particular (file) operation.

Option: network.ping-timeout
Default Value: 42
Description: Time duration for which the client waits to check if the *SERVER* is responsive.


Steps to Reproduce:
1. Mount the RHS volume with glusterfs-fuse or NFS on two clients.
2. Take a lock on a file and write to it from client-1 with a small test program (a sketch of such a program follows these steps).
3. Run the same program from client-2, which will wait for client-1 to release the lock.
4. Run `poweroff -f` on client-1.
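
For illustration, here is a minimal sketch of the kind of locking test program the steps above refer to (an assumption, not the actual tool that was used): it takes an exclusive fcntl() write lock on a file on the mounted volume and then keeps writing to it, so a second instance on client-2 blocks in F_SETLKW until the lock is released.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main (int argc, char *argv[])
{
        struct flock fl;
        const char buf[] = "still holding the lock\n";
        int fd;

        if (argc != 2) {
                fprintf (stderr, "usage: %s <file-on-mounted-volume>\n", argv[0]);
                return EXIT_FAILURE;
        }

        fd = open (argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror ("open");
                return EXIT_FAILURE;
        }

        memset (&fl, 0, sizeof (fl));
        fl.l_type = F_WRLCK;    /* exclusive (write) lock */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;           /* 0 = lock the whole file */

        /* blocks until the lock is granted; on client-2 this is where the
         * program waits for client-1 to release the lock */
        if (fcntl (fd, F_SETLKW, &fl) < 0) {
                perror ("fcntl(F_SETLKW)");
                return EXIT_FAILURE;
        }
        printf ("lock acquired, writing...\n");

        /* keep issuing WRITEs while holding the lock; power off client-1
         * somewhere in this loop to reproduce the problem */
        for (;;) {
                if (write (fd, buf, sizeof (buf) - 1) < 0) {
                        perror ("write");
                        break;
                }
                sleep (1);
        }

        return EXIT_SUCCESS;
}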

Actual results:
The RHS server keeps holding the lock for client-1 for 15 minutes, even though client-1 is no longer available because of the power outage.

Expected results:

The RHS server should not hold the lock for 15 minutes; it should reset the connection to a client that suffered a power outage after some graceful amount of time.


After doing more tests (thanks Vikhyat!), it seems that reproducing the problem does not have a 100% hit rate. There are two scenarios:

- If a WRITE has finished (the brick returned a WRITE Reply), the lock is
  available for the 2nd client pretty quickly (< 1 minute).

- When a WRITE Reply is sent to the client and the client does not acknowledge
  it, TCP retransmissions are sent (by the kernel, not directly by glusterfsd).

There is a small window in the test-case where a 'poweroff -f' can be executed at the time when there are no outstanding WRITE replies. In this case, it seems that the lock is released correctly. We have simulated this by adding a sleep before the write() in the loop, and doing the 'poweroff -f' while sleeping.

I think we'll have to look at how failures are caught when the brick sends a reply to a client that is offline.


--- Additional comment from Niels de Vos on 2014-05-15 19:24:17 CEST ---

This did not improve things; just documenting the suggestions from some of the network guys:

> 1. Trigger a keepalive after 10 seconds of socket idle;
> 2. Re-probe after every 10 seconds;
> 3. Discard the connection after 3 failed probes.
>  
> Which yields 30 seconds of failure in order to drop the socket.
>  
> In order to enable it:
>  
> 1. Edit /etc/sysctl.conf
> 2. Append at the end of file the following snippet:
>  
> ## Aggressive keepalive tracking
> net.ipv4.tcp_keepalive_time = 10
> net.ipv4.tcp_keepalive_probes = 3
> net.ipv4.tcp_keepalive_intvl = 10
>  
> 3. And then; load the values:
>  
> # sysctl -p


In a similar way, it is possible to change the number of retransmission
attempts and timeouts (default values shown):

net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15

Maybe this can be combined with a check for the TCP socket option
TCP_USER_TIMEOUT that was added to the Linux kernel with
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7
rpc/rpc-transport/socket/src/socket.c seems to be the right place to set this option; the KEEPALIVE is configured in the __socket_keepalive() function.
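
As a rough illustration (an assumption about how it could be wired in, not the actual patch): TCP_USER_TIMEOUT is a single setsockopt() call on the brick-side socket, taking the timeout in milliseconds. A hypothetical helper next to __socket_keepalive() could look like this:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the GlusterFS sources): drop the connection
 * when transmitted data stays unacknowledged for longer than timeout_sec,
 * instead of waiting for the full TCP retransmission cycle (~15 minutes). */
static int
socket_set_user_timeout (int fd, unsigned int timeout_sec)
{
        unsigned int timeout_ms = timeout_sec * 1000;

        /* TCP_USER_TIMEOUT expects milliseconds, see 'man 7 tcp' */
        return setsockopt (fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                           &timeout_ms, sizeof (timeout_ms));
}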

--- Additional comment from Niels de Vos on 2014-05-16 14:01:06 CEST ---

Tested with changing the net.ipv4.tcp_retries2 to some lower values (on the
storage server). Tuning this option may be a (temporary?) solution for
reducing the fail-over time.

For each test I captured a tcpdump to verify that there are TCP retransmissions. After applying the sysctl change, the glusterfsd processes were restarted.

Results:
These metrics are gathered from the captured tcpdumps. The starting time for
the failover is set at the last WRITE Reply that fails and results in the
retransmissions. The end of the fail-over time is set at the LK Reply that
makes it possible for the 2nd client to continue.

 net.ipv4.tcp_retries2 | number of retrans | failover time (s)
-----------------------+-------------------+-------------------
            2          |          3        |      3.0
            3          |          4        |      6.3
            4          |          5        |     12.6
            5          |          6        |     25.5
            6          |          7        |     51.3
            7          |          7        |     51.5*
            8          |          7        |    102.7

* = looks rather strange, needs to be checked again
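
For context (inferred from the table above and standard TCP behaviour, not an additional measurement): the failover time roughly doubles for every extra retransmission, which matches TCP's exponential backoff. With an effective initial retransmission timeout of about 0.4 seconds, the retransmissions are spaced roughly 0.4, 0.8, 1.6, 3.2, ... seconds apart, so 8 retransmissions add up to about 102 seconds, matching the last row. The default net.ipv4.tcp_retries2 = 15 therefore corresponds to roughly 15-30 minutes, consistent with the ~15 minute delay in the original report.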

--- Additional comment from  on 2014-05-16 17:06:30 CEST ---

I have tested the above test case with net.ipv4.tcp_retries2 = 7.

Below are the test results:

net.ipv4.tcp_retries2 | number of retrans | failover time (s)
----------------------+-------------------+-------------------
            7         |           7       |      77.0

--- Additional comment from santosh pradhan on 2014-05-20 17:59:12 CEST ---


I would prefer the option to be named network.tcp-user-timeout (after the TCP option TCP_USER_TIMEOUT), and 5 minutes sounds reasonable to me.

--- Additional comment from Anand Avati on 2014-05-20 18:25:08 CEST ---

REVIEW: http://review.gluster.org/7814 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#1) for review on release-3.5 by Niels de Vos (ndevos)

--- Additional comment from Anand Avati on 2014-06-13 23:19:25 CEST ---

REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#1) for review on master by Justin Clift (justin)

--- Additional comment from Niels de Vos on 2014-07-03 13:20:39 CEST ---

This change needs some substantial work:
- http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040755.html

--- Additional comment from Anand Avati on 2014-08-13 18:32:44 CEST ---

REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#2) for review on master by Niels de Vos (ndevos)

Comment 1 Anand Avati 2014-08-13 16:37:10 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#3) for review on master by Niels de Vos (ndevos)

Comment 2 Anand Avati 2014-08-13 17:46:17 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#4) for review on master by Harshavardhana (harsha)

Comment 3 Anand Avati 2014-09-04 22:18:22 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#5) for review on master by Harshavardhana (harsha)

Comment 4 Anand Avati 2015-02-19 13:36:39 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#6) for review on master by Niels de Vos (ndevos)

Comment 5 Anand Avati 2015-02-19 15:23:35 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#7) for review on master by Niels de Vos (ndevos)

Comment 6 Anand Avati 2015-02-19 17:40:43 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#8) for review on master by Niels de Vos (ndevos)

Comment 7 Anand Avati 2015-02-21 11:21:31 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#9) for review on master by Niels de Vos (ndevos)

Comment 8 Anand Avati 2015-02-21 14:57:07 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#10) for review on master by Niels de Vos (ndevos)

Comment 9 Anand Avati 2015-02-23 10:25:01 UTC
REVIEW: http://review.gluster.org/8065 (socket: use TCP_USER_TIMEOUT to detect client failures quicker) posted (#11) for review on master by Niels de Vos (ndevos)

Comment 10 Anand Avati 2015-03-17 12:10:21 UTC
COMMIT: http://review.gluster.org/8065 committed in master by Kaleb KEITHLEY (kkeithle) 
------
commit 6b3704990257643da54100d8581856a7d2c72f86
Author: Niels de Vos <ndevos>
Date:   Tue Feb 17 12:12:11 2015 +0100

    socket: use TCP_USER_TIMEOUT to detect client failures quicker
    
    Use the network.ping-timeout to set the TCP_USER_TIMEOUT socket option
    (see 'man 7 tcp'). The option sets the transport.tcp-user-timeout option
    that is handled in the rpc/socket layer on the protocol/server side.
    This socket option makes detecting unclean disconnected clients more
    reliable.
    
    When the socket gets closed, any locks that the client held are
    released. This makes it possible to reduce the fail-over time for
    applications that run on systems that became unreachable due to
    a network partition or general system error client-side (kernel panic,
    hang, ...).
    
    It is not trivial to create a test-case for this at the moment. We need
    a client that disconnects uncleanly and another client that tries to take
    over the lock from the disconnected client.
    
    URL: http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040755.html
    Change-Id: I5e5f540a49abfb5f398291f1818583a63a5f4bb4
    BUG: 1129787
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: http://review.gluster.org/8065
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: soumya k <skoduri>
    Reviewed-by: Santosh Pradhan <santosh.pradhan>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
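
Usage note (not part of the commit message above): with this change the existing network.ping-timeout volume option also drives TCP_USER_TIMEOUT on the brick side, so the failover behaviour for uncleanly disconnected clients can be tuned with the usual volume option, for example:

# gluster volume set <VOLNAME> network.ping-timeout 42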

Comment 11 Niels de Vos 2015-05-14 17:27:07 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


