Bug 1286108

Summary: Write on fuse mount failed with "write error: Transport endpoint is not connected" after a successful remove brick operation
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: distribute
Version: rhgs-3.1
Status: CLOSED WONTFIX
Severity: unspecified
Priority: high
Reporter: Susant Kumar Palai <spalai>
Assignee: Raghavendra G <rgowdapp>
QA Contact: Anoop <annair>
Docs Contact:
CC: amukherj, nbalacha, rgowdapp, rhs-bugs, spalai, spandura, storage-qa-internal, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Keywords: Triaged, ZStream
Whiteboard: dht-rca-unknown, dht-must-fix
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 991402
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 991402
Bug Blocks: 1286180

Comment 2 Nithya Balachandran 2016-06-24 08:18:11 UTC
RCA required. To be tried against the latest build.

Comment 3 Raghavendra G 2016-06-24 11:53:53 UTC
I think when the commit operation is done, glusterd might've rebooted all the bricks as the volfile has changed. That might've caused the ENOTCONN.

@Atin,

If my volume is a plain distribute of 3 bricks b1, b2, b3, and if I remove b3 and commit the operation, would bricks b1 and b2 be rebooted?

regards,
Raghavendra

Comment 4 Atin Mukherjee 2016-06-29 05:01:27 UTC
(In reply to Raghavendra G from comment #3)
> I think when the commit operation is done, glusterd might've rebooted all
> the bricks as the volfile has changed. That might've caused the ENOTCONN.
> 
> @Atin,
> 
> If my volume is a plain distribute of 3 bricks b1, b2, b3, and if I remove b3
> and commit the operation, would bricks b1 and b2 be rebooted?

No, the bricks aren't rebooted in this case. I've checked this both in the code and with a quick test [1].

[1] https://paste.fedoraproject.org/386164/17642214/
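
For reference, a quick check along these lines could look like the sketch below (a single-node setup; the volume and brick names are illustrative and not taken from the paste above):

    # Create and start a plain distribute volume with three bricks.
    gluster volume create testvol host1:/bricks/b1 host1:/bricks/b2 host1:/bricks/b3
    gluster volume start testvol

    # Note the PIDs of the brick processes for b1 and b2.
    gluster volume status testvol

    # Remove b3, wait for the data migration to finish, then commit.
    gluster volume remove-brick testvol host1:/bricks/b3 start
    gluster volume remove-brick testvol host1:/bricks/b3 status
    gluster volume remove-brick testvol host1:/bricks/b3 commit

    # The PIDs reported for b1 and b2 here should be unchanged, i.e.
    # the remaining bricks were not restarted by the commit.
    gluster volume status testvol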

> 
> regards,
> Raghavendra

Comment 5 Raghavendra G 2016-07-04 07:50:03 UTC
Following is a hypothesis as to why this issue might've happened:

> [2013-08-02 10:42:18.062268] W [fuse-bridge.c:5103:fuse_migrate_fd]
> 0-glusterfs-fuse: syncop_fsync failed (Transport endpoint is not connected) on
> fd (0x1d9338c) (basefd:0x1d9338c basefd-inode.gfid:7a315be7-683f-4a4c-b6d6-85936bde21a1)
> (old-subvolume:vol_dis_rep-0-new-subvolume:vol_dis_rep-1)

As we can see from the logs above, the fsync issued during the graph switch failed. A possible hypothesis is that the removed brick was killed before the clients were given a chance to "react" to the remove-brick operation (for example, by flushing the cached writes in the old graph, which fd migration does). Since fuse was not able to "migrate" the fd, application continuity is broken. A possible fix would be to terminate the brick process only after _all_ clients have been given a chance to migrate their fds.
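
A minimal way to exercise this window (an illustrative sketch; the volume name, brick path, and mount point are assumptions, not taken from the original report) is to keep a writer holding a single fd open on the fuse mount while the remove-brick is started and committed:

    # Mount the volume and start a writer that keeps one fd open for the
    # whole run; this fd must be migrated to the new graph once the
    # remove-brick commit changes the volfile.
    mount -t glusterfs host1:/testvol /mnt/testvol
    ( while echo data; do sleep 0.1; done ) > /mnt/testvol/file &
    writer=$!

    gluster volume remove-brick testvol host1:/bricks/b3 start
    # ...wait until 'status' reports the migration as completed...
    gluster volume remove-brick testvol host1:/bricks/b3 commit

    # If the removed brick is killed before the client gets to migrate the
    # fd, the writer fails with
    # "write error: Transport endpoint is not connected".
    wait $writer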