Bug 1141940 - Mount -t glusterfs never completes and all file-system commands hang
Summary: Mount -t glusterfs never completes and all file-system commands hang
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: fuse
Version: 3.4.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-09-15 19:53 UTC by Anirban Ghoshal
Modified: 2015-10-07 13:50 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-07 13:50:53 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Anirban Ghoshal 2014-09-15 19:53:00 UTC
Description of problem:

I have been experimenting with server replacement procedures on a two-way replicated (replica 2) setup, say between server1 and server2. I do the following:
- Shut down server2.
- On server1, remove-brick the bricks hosted on server2 from all replicated volumes.
- On server1, detach peer server2.

After this, if you
- replace server2 with server3,
- create bricks on server3,
- add them with replica 2 to the (now distributed) volumes on server1,

and then attempt to mount the volumes, you may find that the mount.glusterfs program stalls. Any file-system operation on the mount also hangs and cannot be interrupted (although kill -9 does terminate it).

Here's an example:

root     21891 21890  0 16:47 ?        00:00:00 /bin/mount -t glusterfs server3:testvol /mnt
root     21892 21891  0 16:47 ?        00:00:00 /bin/sh /sbin/mount.glusterfs server3:testvol /mnt -o rw
root     21931 21892  0 16:47 ?        00:00:00 /bin/sh /sbin/mount.glusterfs server3:testvol /mnt -o rw
root     21932 21931  0 16:47 ?        00:00:00 stat -c %i /mnt

I checked the wchan for 21932 (stat) and found it to be 'fuse_get_req'. The odd thing is that if I set any volume option, the problem immediately corrects itself. For example, I set two options for debugging, server.statedump-path and diagnostics.client-log-level, and each time this resolved the hang.
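
For reference, a minimal sketch of the check and the workaround described above; the PID (21932), volume name and option value are taken from this report or are assumptions, so adjust them to your system:

# Where is the stalled stat process sleeping? (expected: fuse_get_req)
cat /proc/21932/wchan; echo

# Workaround observed here: setting any volume option un-wedges the mount.
gluster volume set testvol diagnostics.client-log-level INFO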
  
Version-Release number of selected component (if applicable):
glusterfs 3.4.2 over linux kernel 2.6.34.

How reproducible:
Intermittent.

Steps to Reproduce:
1. Take two servers, server1 and server2. Probe each other. 
2. Create bricks server1:/bricks/testvol and server2:/bricks/testvol
3. gluster volume create testvol replica 2 server1:/bricks/testvol server2:/bricks/testvol
4. gluster volume start testvol
5. On both servers, mount -t glusterfs server<n>:testvol /mnt
6. Turn off server2.
7. gluster volume remove-brick testvol replica 1 server2:/bricks/testvol
8. gluster peer detach server2.
9. Introduce server3.
10. On server3, gluster peer probe server1.
11. Create server3:/bricks/testvol
12. On server3, gluster volume add-brick testvol replica 2 server3:/bricks/testvol 
13. On server3, mount -t glusterfs server3:testvol /mnt
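
For convenience, the same steps consolidated as a rough shell sketch (hostnames and brick paths follow the report; the 'force' on remove-brick is an assumption, since the report's step 7 omits it and 3.4.x also accepts start/commit):

# On server1 and server2 (already peers of each other):
mkdir -p /bricks/testvol
# On server1:
gluster volume create testvol replica 2 server1:/bricks/testvol server2:/bricks/testvol
gluster volume start testvol
# On each server:
mount -t glusterfs server1:testvol /mnt    # server2:testvol on server2

# Power off server2, then on server1:
gluster volume remove-brick testvol replica 1 server2:/bricks/testvol force
gluster peer detach server2

# On server3 (the replacement):
gluster peer probe server1
mkdir -p /bricks/testvol
gluster volume add-brick testvol replica 2 server3:/bricks/testvol
mount -t glusterfs server3:testvol /mnt    # this mount may hang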


Actual results:

At this point, the mount command might hang indefinitely, like this:

root     21891 21890  0 16:47 ?        00:00:00 /bin/mount -t glusterfs server3:testvol /mnt
root     21892 21891  0 16:47 ?        00:00:00 /bin/sh /sbin/mount.glusterfs server3:testvol /mnt -o rw
root     21931 21892  0 16:47 ?        00:00:00 /bin/sh /sbin/mount.glusterfs server3:testvol /mnt -o rw
root     21932 21931  0 16:47 ?        00:00:00 stat -c %i /mnt

Any command on the file system, such as ls/stat/df, also hangs (as does an strace of these commands). Setting any volume option on testvol with the 'gluster volume set ...' command makes the problem disappear immediately.


Expected results:
Mount should complete successfully.

Additional info:

If you have two volumes, say testvol and testvol2, and the issue is observed on testvol, it will also be observed on testvol2. However, resolving it for testvol with a volume set command does not automatically resolve it for testvol2; each volume has to be handled separately.
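
Since the workaround has to be applied per volume, a rough per-volume sketch (the option and value are simply the ones mentioned above; any volume set appears to do):

for vol in $(gluster volume list); do
    gluster volume set "$vol" diagnostics.client-log-level INFO
done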

Comment 1 Baul 2014-11-17 06:17:47 UTC
We found a case that may be the same as yours.
The mount hung because one brick could not be reached by the GlusterFS client; ifconfig on the brick host showed a large number of dropped packets and overruns.

Maybe the network issue is not related to the mount hang; just a suggestion.
We are now investigating the packet-drop issue.
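
For anyone checking the same thing, the drop/overrun counters can be read as follows (the interface name eth0 is an assumption):

ifconfig eth0 | grep -E 'dropped|overruns'
# or, with iproute2:
ip -s link show dev eth0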

Comment 2 Baul 2014-11-17 07:21:15 UTC
Sorry, we also ran the gluster replace command. (In reply to Baul from comment #1)
> We found a case that may be the same as yours.
> The mount hung because one brick could not be reached by the GlusterFS client;
> ifconfig on the brick host showed a large number of dropped packets and overruns.
> 
> Maybe the network issue is not related to the mount hang; just a suggestion.
> We are now investigating the packet-drop issue.

Comment 3 Niels de Vos 2015-05-17 22:00:19 UTC
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. If updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" drop-down below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

Comment 4 Kaleb KEITHLEY 2015-10-07 13:50:53 UTC
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release, please reopen this bug and change the version, or open a new bug.

