Bug 1597821

Summary: glusterfs client mount point fails with transport endpoint is not connected.
Product: [Community] GlusterFS
Reporter: toma.todorov
Component: arbiter
Assignee: bugs <bugs>
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Version: 4.1
CC: bugs
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2018-07-04 15:29:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Attachments:
glusterfs-client log file (/var/log/glusterfs) (flags: none)
[IMPORTANT] Refer to this attachment for the glusterfs-client log instead. (flags: none)

Description toma.todorov 2018-07-03 16:33:00 UTC
Created attachment 1456280 [details]
glusterfs-client log file (/var/log/glusterfs)

Description of problem:
Assume a basic replica 3 arbiter 1 configuration with glusterfs-server 4.0.2 and glusterfs-client 4.0.2.
The glusterfs-client is installed on Ubuntu 18.04.

The volume was created with the following command:

gluster volume create brick01 replica 3 arbiter 1 \
    proxmoxVE-1:/mnt/gluster/bricks/brick01 \
    proxmoxVE-2:/mnt/gluster/bricks/brick01 \
    arbiter01:/mnt/gluster/bricks/brick01
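
The gluster volume info output below shows the volume in the Started state; presumably it was brought up with the standard start command (listed here only for completeness, not quoted from the report):

gluster volume start brick01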

Gluster volume info:

Volume Name: brick01
Type: Replicate
Volume ID: 2310c6f4-f83d-4691-97a7-cbebc01b3cf7
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: proxmoxVE-1:/mnt/gluster/bricks/brick01
Brick2: proxmoxVE-2:/mnt/gluster/bricks/brick01
Brick3: arbiter01:/mnt/gluster/bricks/brick01 (arbiter)

PROBLEM: While verifying that read/write operations are still permitted when one storage node is down, as stated in the docs (https://docs.gluster.org/en/v3/Administrator%20Guide/arbiter-volumes-and-quorum/), an unexpected result occurs. After killing the gluster processes on one of the non-arbiter nodes (using pkill ^gluster*), the client mount point fails with 'Transport endpoint is not connected.' (see attachment).
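
For reference, a minimal way to observe the state described above, using the node, volume, and mount names from this report (these verification commands are standard gluster CLI and are not part of the original report):

# On a surviving server node (e.g. proxmoxVE-2): the killed node's brick should be listed as offline (N/A)
gluster volume status brick01

# Pending heal entries accumulate while one data brick is down
gluster volume heal brick01 info

# On the client: any operation on the mount reproduces the error
ls /home/<user>/brick01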

Even if the following additional options are set, the same result occurs:
gluster volume set brick01 cluster.quorum-reads false
gluster volume set brick01 cluster.quorum-count 1
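
A quick way to confirm which quorum-related settings are actually in effect on the volume (standard gluster CLI, shown as a verification sketch rather than something run in the original report):

gluster volume get brick01 cluster.quorum-type
gluster volume get brick01 cluster.quorum-reads
gluster volume get brick01 cluster.quorum-count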

Version-Release number of selected component (if applicable): 4.0.2


How reproducible: Always.


Steps to Reproduce:
1. Set up a replica 3 arbiter 1 configuration (glusterfs-server 4.0.2) where the storage nodes are Debian-based (Proxmox) physical nodes and the arbiter is installed on an Ubuntu 18.04 VM.
2. Set up the gluster client on an Ubuntu 18.04 VM (glusterfs-client 4.0.2).
3. Mount the volume on the client (mount -t glusterfs proxmoxVE-1:/brick01 /home/<user>/brick01).
4. On proxmoxVE-1 or proxmoxVE-2, execute 'pkill ^gluster*'.
5. Operations on the client side fail with 'Transport endpoint is not connected.'.

Actual results:
Operations on the client side fail with 'Transport endpoint is not connected.'.

Expected results:
Operations on the client side should continue to work, as stated in the docs (https://docs.gluster.org/en/v3/Administrator%20Guide/arbiter-volumes-and-quorum/).

Additional info: See the attachment; it is the glusterfs-client log file.

Comment 1 toma.todorov 2018-07-04 08:00:48 UTC
Created attachment 1456404 [details]
[IMPORTANT] Refer to this attachment for the glusterfs-client log instead.

Comment 2 toma.todorov 2018-07-04 08:04:07 UTC
EDIT:
As stated in the docs, 'cluster.quorum-type' is set to auto for arbiter configurations and 'cluster.quorum-count' is ignored. Please ignore the additional settings

gluster volume set brick01 cluster.quorum-reads false
gluster volume set brick01 cluster.quorum-count 1

as well as attachment 1456280 [details], which does not give clear information.
Attachment 1456404 [details] gives correct information about the client-side logs for volume brick01 with no reconfigured options.
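
For anyone following along, the two options above can be returned to their defaults with the standard reset command (a sketch using the volume name from this report, not quoted from the original comment):

gluster volume reset brick01 cluster.quorum-reads
gluster volume reset brick01 cluster.quorum-count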

Comment 3 toma.todorov 2018-07-04 15:29:08 UTC
'gluster volume heal brick01 enable' resolved the issue.
This added the reconfigured option 'cluster.self-heal-daemon: enable' to the volume. It seems that, by default, the arbiter brick fails to heal (sync) at the same time as file operations occur.
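
A minimal sketch of the resolution plus a way to confirm the resulting option; the verification commands are standard gluster CLI and are assumptions rather than quotes from this comment:

# Enable the self-heal daemon for the volume (the command that resolved the issue)
gluster volume heal brick01 enable

# Confirm the option now shows as enabled on the volume
gluster volume get brick01 cluster.self-heal-daemon

# Optionally, watch pending heal entries drain once the daemon is running
gluster volume heal brick01 info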