Bug 763874 (GLUSTER-2142) - glusterfs mountpoint hangs on disconnecting second node
Summary: glusterfs mountpoint hangs on disconnecting second node
Keywords:
Status: CLOSED WORKSFORME
Alias: GLUSTER-2142
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.1.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Anand Avati
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-11-23 15:40 UTC by Andreas Kimpfler
Modified: 2015-12-01 16:45 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:



Description Andreas Kimpfler 2010-11-23 15:40:26 UTC
I set up a two-server scenario with "volume create shared replica 2 transport tcp 192.168.99.9:/mnt/glusterfs/shared 192.168.99.10:/mnt/glusterfs/shared" and started the volume. Then I mounted the volume on both servers.

Forcing the ethernet connection between the two servers down results in no response from the local gluster mount.
As an example:
I did a "dd" on the first machine, brought the ethernet link down and tried to "ls" on the mountpoint or executing "df -h" on both servers. All this results in a stalled shell not responding to anything (including Ctrl-C). Cancelling the "dd" job doesn't work too.

Killing all glusterfs processes and unmounting the mountpoint brings the system back to normal (all shells respond again). Mounting the volume again without re-enabling the ethernet link then works, and the local mount is accessible.
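
A minimal sketch of the setup and reproduction steps described above (the client mount point /mnt/shared and the interface name eth1 are assumptions for illustration; the volume-create line is the one quoted in the description):

gluster volume create shared replica 2 transport tcp 192.168.99.9:/mnt/glusterfs/shared 192.168.99.10:/mnt/glusterfs/shared
gluster volume start shared

# on each server, mount the volume through the local node's address
mkdir -p /mnt/shared
mount -t glusterfs 192.168.99.9:/shared /mnt/shared

# start I/O, then take the inter-node link down and watch the mount stall
dd if=/dev/zero of=/mnt/shared/testfile bs=1M count=1024 &
ip link set eth1 down
ls /mnt/shared    # hangs
df -h             # hangs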

Comment 1 Andreas Kimpfler 2010-11-26 11:28:45 UTC
A friend of mine tried this with four machines:
Volume Name: vstore
Type: Replicate
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 192.168.123.123:/srv/export
Brick2: 192.168.123.124:/srv/export
Brick3: 192.168.123.125:/srv/export
Brick4: 192.168.123.126:/srv/export

Setting up such a scenario and disconnecting the Ethernet link never runs into the same problem that I have, so we think this might be a quorum problem.
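
A sketch of the command that would create the four-brick volume shown above (the replica count of 4 is inferred from "Type: Replicate" with four bricks and is not stated in the comment):

gluster volume create vstore replica 4 transport tcp 192.168.123.123:/srv/export 192.168.123.124:/srv/export 192.168.123.125:/srv/export 192.168.123.126:/srv/export
gluster volume start vstore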

Comment 2 shishir gowda 2010-12-21 09:12:47 UTC
Hi,

GlusterFS uses IP addresses. If all the interfaces are down, the mount point cannot reach the bricks (even the local one, since it does not use the 127.0.0.1 address). Since all the bricks go down, the operations fail.

As for the hang, fixes in 3.1.1 have taken care of it: after a 42-second timeout the operations terminate. Once the network is back up and the bricks are up, the mount point is active again. You do not have to restart all the servers/bricks.

When there were 4 bricks/servers, the mount still had access to at least one of the bricks, so the operations were successful in that instance.

With regards,
Shishir
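
For reference, the 42-second window mentioned above matches the default of the network.ping-timeout volume option; assuming that is the timeout in play here (a guess, not confirmed in this comment), it can be tuned per volume, e.g.:

# lower the client-side ping timeout from the default 42 seconds to 10
# (very low values are generally discouraged, since reconnects are expensive)
gluster volume set shared network.ping-timeout 10
gluster volume info shared    # the changed option appears under "Options Reconfigured"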

Comment 3 Andreas Kimpfler 2011-01-17 12:28:13 UTC
Hi,

sorry for the late answer. 

Thanks for your explanation. So, if I understand correctly, when a 2-node setup loses interface connectivity to the other node, it takes 42 seconds until the mount point becomes active/responsive again?

I ask because I want to use GlusterFS to store the images of my KVM virtual machines, maildirs, and webhosting data on these two machines. Further plans are more machines that hold the same content.
If GlusterFS goes down, at the moment this means a totally unresponsive mount point, virtual machines that do not respond or may even crash, IMAP servers that cannot deliver mailbox content, and so on.


Regards,
Andreas

Comment 4 Amar Tumballi 2011-02-28 08:46:55 UTC
Hi Andreas,

We recently fixed some bugs similar to this one (bug 763737). Can you try the same experiment with the latest git head (now available at https://github.com/gluster/glusterfs), or wait a few more days for a QA release with these fixes?

Regards,
Amar

Comment 5 Amar Tumballi 2011-03-03 03:32:39 UTC
This works fine for us at the moment with the 3.1.3qa2 release. Please check whether it fixes the issue for you.

