Bug 1144672

Summary: file locks are not released on frequent disconnects after applying the BUG #1129787 patch
Product: [Community] GlusterFS
Reporter: Jaden Liang <jaden1q84>
Component: rpc
Assignee: bugs <bugs>
Status: CLOSED NEXTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.4.5
CC: bugs, jaden1q84, rgowdapp
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
URL: http://supercolony.gluster.org/pipermail/gluster-devel/2014-September/042315.html
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-29 17:31:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jaden Liang 2014-09-20 08:01:43 UTC
Description of problem:

First of all, this issue happens after applying http://review.gluster.org/8065 and setting network.tcp-timeout to 30s.
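
For reference, a 30s timeout of this kind would typically be applied through the usual volume-option interface, assuming the 8065 patch exposes network.tcp-timeout as a regular volume option (the volume name "testvol" below is only a placeholder):

gluster volume set testvol network.tcp-timeout 30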

The setup is a replicated gluster volume with 2 server nodes, accessed from the client side with mount.glusterfs. A test program running on one of the nodes opens a file in the volume and flocks it (only once), then reads and writes the file frequently. On one of the nodes we simulate a network disconnect for 15s followed by a reconnect; note that 15s is less than the network.tcp-timeout of 30s. After keeping this disconnect/reconnect cycle going for some time and then exiting the test program, the FD on the server side is not closed. If the test program is restarted, it fails on flock with a Resource Temporarily Unavailable error.

Network failure timeline:
---(15s connected)---|---(15s disconnected)---|---(15s connected)---|...repeat...

Version-Release number of selected component (if applicable):


How reproducible:
This issue is reproducible by simulating the network failures described below.

Steps to Reproduce:
1. Set up a replicated volume with 2 server nodes (A and B). On node A, use the fuse client to access the volume.
Run a simple test program that creates a file and flocks it, then does some random reading in an endless loop, without ever unlocking the file or exiting (a minimal sketch of such a program is shown after these steps).

2. On node B, add 2 iptables rules to block the connection between the fuse client on A and glusterfsd on B, e.g.:
iptables -A INPUT -p tcp -s 200.200.200.20 --dport 49154 -j REJECT
iptables -A OUTPUT -p tcp -d 200.200.200.20 --sport 49154 -j DROP

Note: 200.200.200.20 is the IP of node A, and 49154 is the listening port of glusterfsd on B.

These 2 rules are meant to make the socket on node A close first (by REJECT), while only dropping OUTPUT packets on B, which keeps the socket on node B alive.

3. After 30s, delete the 2 iptables rules with
iptables -A INPUT -p tcp -s 200.200.200.20 --dport 49154 -j REJECT
iptables -A OUTPUT -p tcp -d 200.200.200.20 --sport 49154 -j DROP

4. Repeat steps 2-3 several times. Exit the test program, then restart it; it cannot acquire the flock again (a loop that automates the disconnect/reconnect cycle is also sketched below).
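
Here is a minimal sketch of the kind of test program described in step 1, written as a shell script using flock(1). The mount point /mnt/testvol and the file name testfile are placeholders, not taken from the original report:

#!/bin/bash
# Take an exclusive flock once and never release it, while keeping I/O
# flowing on the locked file so the connection stays busy.
(
    flock -x 200 || exit 1                  # lock fd 200 exclusively, exactly once
    while true; do
        dd if=/mnt/testvol/testfile of=/dev/null bs=4k count=1 2>/dev/null   # frequent reads
        sleep 1
    done
) 200>>/mnt/testvol/testfile                # fd 200 refers to the test file on the gluster mount

And a rough sketch of the disconnect/reconnect cycle from steps 2-4, to be run on node B; the 15s intervals follow the failure timeline in the description, and -D simply removes the rules that -A added:

#!/bin/bash
CLIENT=200.200.200.20    # IP of node A
PORT=49154               # glusterfsd listening port on node B
for i in $(seq 1 5); do
    # block the connection: node A sees errors (REJECT), node B's replies are silently dropped
    iptables -A INPUT  -p tcp -s $CLIENT --dport $PORT -j REJECT
    iptables -A OUTPUT -p tcp -d $CLIENT --sport $PORT -j DROP
    sleep 15             # "disconnected" interval, shorter than network.tcp-timeout (30s)
    # restore the connection
    iptables -D INPUT  -p tcp -s $CLIENT --dport $PORT -j REJECT
    iptables -D OUTPUT -p tcp -d $CLIENT --sport $PORT -j DROP
    sleep 15             # "connected" interval
done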

Actual results:
File locks (flocks) are not released.

Expected results:
File locks (flocks) are released.

Additional info:
Here is a preview patch to fix the issue; it will be submitted to Gerrit later.

http://supercolony.gluster.org/pipermail/gluster-devel/2014-September/042315.html

The major modification is adding an id for each distinct TCP connection between a given client and server pair, so that the server can tell an old connection apart from a new one when the old connection's socket is not closed at the same time on both sides.

Comment 1 Anand Avati 2014-09-20 09:58:59 UTC
REVIEW: http://review.gluster.org/8787 (* protocol: fix file flock not released in frequently disconnects) posted (#1) for review on release-3.4 by Jaden Liang (jaden1q84)

Comment 2 Jaden Liang 2014-09-22 06:20:16 UTC
A small mistake in step 3 of the reproduction steps: use -D to delete the iptables rules.

3. After 30s, delete 2 iptables commands with 
iptables -D INPUT -p tcp -s 200.200.200.20 --dport 49154 -j REJECT
iptables -D OUTPUT -p tcp -d 200.200.200.20 --sport 49154 -j DROP

Also, port 49154 is the glusterfsd listening port; it may differ between servers. Use 'ps auxf | grep glusterfsd' to find the brick glusterfsd process serving the test file and its port.
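
If it is available on the installed version, 'gluster volume status' should also list the brick ports directly (the volume name below is a placeholder):

gluster volume status testvol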

Comment 3 Niels de Vos 2015-05-17 21:57:37 UTC
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" field below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

Comment 4 Raghavendra G 2015-07-29 17:31:39 UTC
Fix in 3.6 and later releases:
http://review.gluster.org/6669

Fix in 3.5 release:
http://review.gluster.org/8187

I think this bug can be closed.