Description of problem:

First of all, this issue happens after applying http://review.gluster.org/8065 and setting network.tcp-timeout to 30s.

In a replicated gluster volume with 2 server nodes, the client side uses mount.glusterfs to access the volume. A test program running on one of the nodes opens a file in the volume and flocks it (only flocks once), then reads and writes the file frequently. On one of the nodes, we simulate a network disconnect of 15s followed by a reconnect. Note that 15s is less than the network.tcp-timeout of 30s. Keep this disconnect/reconnect cycle going for some time and then exit the test program; the FD on the server side is not closed. If the test program is restarted, it fails on flock with a "Resource temporarily unavailable" error.

Network failure timeline:
---(15s connected)---|---(15s disconnected)---|---(15s connected)---|...repeat...

Version-Release number of selected component (if applicable):

How reproducible:
By simulation, this issue is reproducible.

Steps to Reproduce:
1. Set up a replicated volume with 2 server nodes (A and B). On node A, use the fuse client to access the volume.
Run a simple test program: create a file and flock it, then do some random reading in an endless loop, never unlocking or exiting (a rough sketch of such a program is given at the end of this comment).

2. On node B, add 2 iptables rules to block the connection between the fuse client on A and glusterfsd on B, e.g.
iptables -A INPUT -p tcp -s 200.200.200.20 --dport 49154 -j REJECT
iptables -A OUTPUT -p tcp -d 200.200.200.20 --sport 49154 -j DROP

Note: 200.200.200.20 is the IP of A, 49154 is the listen port of glusterfsd on B.

These 2 rules are meant to make the socket on node A close first (by REJECT), while only dropping OUTPUT packets on B, which keeps the socket on node B alive.

3. After 30s, delete the 2 iptables rules with
iptables -A INPUT -p tcp -s 200.200.200.20 --dport 49154 -j REJECT
iptables -A OUTPUT -p tcp -d 200.200.200.20 --sport 49154 -j DROP

4. Repeat steps 2-3 several times. Exit the test program, then restart it; it cannot flock again.

Actual results:
File flocks are not released.

Expected results:
File flocks are released.

Additional info:
Here is a preview patch to fix the issue; it will be submitted to Gerrit later.

http://supercolony.gluster.org/pipermail/gluster-devel/2014-September/042315.html

The major modification is adding an id to distinguish the different tcp connections between a client and server pair, to avoid problems when an old connection's socket is not closed at the same time.
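For reference, here is a minimal sketch of the kind of test program described in step 1, built around flock(1). This is an assumption about the reporter's program, not the actual one; the mount point /mnt/glustervol and the file name flock-testfile are made-up examples.

#!/bin/sh
# Hold one exclusive flock on a file in the gluster mount and keep reading it.
# /mnt/glustervol and flock-testfile are assumed names, not from the report.
MNT=/mnt/glustervol
FILE=$MNT/flock-testfile

exec 9>"$FILE"              # create/open the test file on shell fd 9
flock -x -n 9 || exit 1     # take the exclusive lock once (non-blocking); never unlock
while true; do
    # keep read traffic flowing on the locked file
    dd if="$FILE" of=/dev/null bs=4k count=1 2>/dev/null
    sleep 1
done

When the bug triggers, a restarted instance of this script fails at the flock line because the brick still holds the stale lock from the previous run.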
REVIEW: http://review.gluster.org/8787 (* protocol: fix file flock not released in frequently disconnects) posted (#1) for review on release-3.4 by Jaden Liang (jaden1q84)
A small mistake in step 3 of the reproduction steps: use -D to delete the iptables rules.

3. After 30s, delete the 2 iptables rules with
iptables -D INPUT -p tcp -s 200.200.200.20 --dport 49154 -j REJECT
iptables -D OUTPUT -p tcp -d 200.200.200.20 --sport 49154 -j DROP

Also note that port 49154 is the glusterfsd listening port and may differ from server to server. Use 'ps auxf | grep glusterfsd' to find the brick glusterfsd serving the test file. A sketch of a loop that drives steps 2-3 is included at the end of this comment.
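For convenience, steps 2 and 3 can be driven by a small loop like the sketch below. This is only an assumption-laden helper, not part of the original report: the client IP and brick port are the example values from the description and must be adapted, and the sleep intervals follow the 15s timeline given there (step 3 mentions 30s).

#!/bin/sh
# Run on node B: alternately block and unblock traffic between the fuse
# client on A and the local glusterfsd (values assumed from this report).
CLIENT_IP=200.200.200.20    # IP of node A (the fuse client)
BRICK_PORT=49154            # glusterfsd listen port on B; find it with 'ps auxf | grep glusterfsd'

while true; do
    # step 2: REJECT traffic from the client, DROP traffic towards it
    iptables -A INPUT  -p tcp -s $CLIENT_IP --dport $BRICK_PORT -j REJECT
    iptables -A OUTPUT -p tcp -d $CLIENT_IP --sport $BRICK_PORT -j DROP
    sleep 15                # disconnected interval (kept below the 30s network.tcp-timeout)
    # step 3: remove the rules again, using -D as corrected above
    iptables -D INPUT  -p tcp -s $CLIENT_IP --dport $BRICK_PORT -j REJECT
    iptables -D OUTPUT -p tcp -d $CLIENT_IP --sport $BRICK_PORT -j DROP
    sleep 15                # connected interval, then repeat
done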
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" below the comment box to "bugs".

If there is no response by the end of the month, this bug will get closed automatically.
Fix in 3.6 and later releases: http://review.gluster.org/6669
Fix in 3.5 release: http://review.gluster.org/8187

I think this bug can be closed.