Description of problem:
=======================
The `gluster volume heal <vol-name> info` command output shows the status of the bricks as "Transport endpoint is not connected" even though the bricks are up and running.

Version-Release number of selected component (if applicable):
=============================================================
mainline

How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and enable brick multiplexing on it.
2) Start the volume and FUSE-mount it on a client.
3) Set cluster.self-heal-daemon to off.
4) Create 10 directories on the mount point.
5) Kill one brick of one of the replica sets in the volume and modify the permissions of all directories.
6) Start the volume with the force option.
7) Kill the other brick in the same replica set and modify the permissions of the directories again.
8) Start the volume with the force option and examine the output of the `gluster volume heal <vol-name> info` command on the server.

Actual results:
===============
The heal info command output shows the status of the bricks as "Transport endpoint is not connected" even though the bricks are up and running.

RCA:
====
When we stop the volume, GlusterD actually sends two terminate requests to the brick process: one during the brick-op phase and another during the commit phase. Without multiplexing this caused no problem, because the process was going to stop anyway. With multiplexing, however, a terminate is just a detach, which therefore gets executed twice. Those two requests can run concurrently, and if that happens we may delete the graph entry twice, because no lock is taken while modifying the graph's linked list in glusterfs_handle_detach. The list pointer is then moved twice, which results in the deletion of an unrelated brick's entry.
--- Additional comment from Worker Ant on 2017-05-23 11:31:50 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach request inside lock) posted (#1) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-05-23 11:32:29 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach request inside lock) posted (#2) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-05-24 03:48:04 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach request inside lock) posted (#3) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-05-24 10:50:37 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach request inside lock) posted (#4) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-05-25 08:32:46 EDT ---

REVIEW: https://review.gluster.org/17374 (glusterfsd: process attach and detach request inside lock) posted (#5) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Worker Ant on 2017-05-26 08:11:32 EDT ---

COMMIT: https://review.gluster.org/17374 committed in master by Jeff Darcy (jeff.us)

------

commit 3ca5ae2f3bff2371042b607b8e8a218bf316b48c
Author: Atin Mukherjee <amukherj>
Date:   Fri May 19 21:04:53 2017 +0530

    glusterfsd: process attach and detach request inside lock

    With brick multiplexing, there is a high possibility that attach and
    detach requests might be parallely processed and to avoid a concurrent
    update to the same graph list, a mutex lock is required.

    Credits : Rafi (rkavunga) for the RCA of this issue

    Change-Id: Ic8e6d1708655c8a143c5a3690968dfa572a32a9c
    BUG: 1454865
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/17374
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jeff.us>
REVIEW: https://review.gluster.org/17402 (glusterfsd: process attach and detach request inside lock) posted (#1) for review on release-3.11 by Atin Mukherjee (amukherj)
REVIEW: https://review.gluster.org/17402 (glusterfsd: process attach and detach request inside lock) posted (#2) for review on release-3.11 by Atin Mukherjee (amukherj)
COMMIT: https://review.gluster.org/17402 committed in release-3.11 by Shyamsundar Ranganathan (srangana)

------

commit 12c5b9d774f6e03b69efc8e276165debdf360cb6
Author: Atin Mukherjee <amukherj>
Date:   Fri May 19 21:04:53 2017 +0530

    glusterfsd: process attach and detach request inside lock

    With brick multiplexing, there is a high possibility that attach and
    detach requests might be parallely processed and to avoid a concurrent
    update to the same graph list, a mutex lock is required.

    Please note this backport defines the volfile_lock mutex which was done
    as part of a different patch https://review.gluster.org/15036 in
    mainline but is not available in release-3.11 branch.

    Credits : Rafi (rkavunga) for the RCA of this issue

    >Reviewed-on: https://review.gluster.org/17374
    >Smoke: Gluster Build System <jenkins.org>
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Reviewed-by: Jeff Darcy <jeff.us>
    >(cherry picked from commit 3ca5ae2f3bff2371042b607b8e8a218bf316b48c)

    Change-Id: Ic8e6d1708655c8a143c5a3690968dfa572a32a9c
    BUG: 1455907
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: https://review.gluster.org/17402
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/