Bug 1286108

Summary: Write on fuse mount failed with "write error: Transport endpoint is not connected" after a successful remove brick operation
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: distribute
Version: rhgs-3.1
Status: CLOSED WONTFIX
Severity: unspecified
Priority: high
Reporter: Susant Kumar Palai <spalai>
Assignee: Raghavendra G <rgowdapp>
QA Contact: Anoop <annair>
Docs Contact:
CC: amukherj, nbalacha, rgowdapp, rhs-bugs, spalai, spandura, storage-qa-internal, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Keywords: Triaged, ZStream
Whiteboard: dht-rca-unknown, dht-must-fix
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 991402
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 991402
Bug Blocks: 1286180

Comment 2 Nithya Balachandran 2016-06-24 08:18:11 UTC
RCA required. To be tried against the latest build.

Comment 3 Raghavendra G 2016-06-24 11:53:53 UTC
I think when the commit operation is done, glusterd might've rebooted all the bricks as the volfile has changed. That might've caused the ENOTCONN.

@Atin,

If my volume is a plain distribute of 3 bricks b1, b2, b3, and if I remove b3 and commit the operation, would bricks b1 and b2 be rebooted?

regards,
Raghavendra

Comment 4 Atin Mukherjee 2016-06-29 05:01:27 UTC
(In reply to Raghavendra G from comment #3)
> I think when the commit operation is done, glusterd might've rebooted all
> the bricks as the volfile has changed. That might've caused the ENOTCONN.
> 
> @Atin,
> 
> If my volume is a plain distribute of 3 bricks b1, b2, b3, and if I remove b3
> and commit the operation, would bricks b1 and b2 be rebooted?

No, the bricks aren't rebooted in this case. I've checked this both in the code and with a quick test [1].

[1] https://paste.fedoraproject.org/386164/17642214/
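
For reference, a quick check along these lines could look like the sketch below (a single-node setup; the volume and brick names are illustrative and not taken from the paste above):

    # Create and start a plain distribute volume with three bricks.
    gluster volume create testvol host1:/bricks/b1 host1:/bricks/b2 host1:/bricks/b3
    gluster volume start testvol

    # Note the PIDs of the brick processes for b1 and b2.
    gluster volume status testvol

    # Remove b3, wait for the data migration to finish, then commit.
    gluster volume remove-brick testvol host1:/bricks/b3 start
    gluster volume remove-brick testvol host1:/bricks/b3 status
    gluster volume remove-brick testvol host1:/bricks/b3 commit

    # The PIDs reported for b1 and b2 here should be unchanged, i.e.
    # the remaining bricks were not restarted by the commit.
    gluster volume status testvol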

> 
> regards,
> Raghavendra

Comment 5 Raghavendra G 2016-07-04 07:50:03 UTC
Following is a hypothesis as to why this issue might've happened:

> [2013-08-02 10:42:18.062268] W [fuse-bridge.c:5103:fuse_migrate_fd]
> 0-glusterfs-fuse: syncop_fsync failed (Transport endpoint is not connected) on
> fd (0x1d9338c) (basefd:0x1d9338c basefd-inode.gfid:7a315be7-683f-4a4c-b6d6-85936bde21a1)
> (old-subvolume:vol_dis_rep-0-new-subvolume:vol_dis_rep-1)

As we can see from the logs above, the fsync issued during the graph switch failed. A possible hypothesis is that the removed brick was killed before the clients were given a chance to "react" to the remove-brick operation (for example, by flushing the cached writes in the old graph, which fd migration does). Since fuse was not able to "migrate" the fd, application continuity is broken. A possible fix would be to terminate the brick process only after _all_ clients have been given a chance to migrate their fds.
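
A minimal way to exercise this window (an illustrative sketch; the volume name, brick path, and mount point are assumptions, not taken from the original report) is to keep a writer holding a single fd open on the fuse mount while the remove-brick is started and committed:

    # Mount the volume and start a writer that keeps one fd open for the
    # whole run; this fd must be migrated to the new graph once the
    # remove-brick commit changes the volfile.
    mount -t glusterfs host1:/testvol /mnt/testvol
    ( while echo data; do sleep 0.1; done ) > /mnt/testvol/file &
    writer=$!

    gluster volume remove-brick testvol host1:/bricks/b3 start
    # ...wait until 'status' reports the migration as completed...
    gluster volume remove-brick testvol host1:/bricks/b3 commit

    # If the removed brick is killed before the client gets to migrate the
    # fd, the writer fails with
    # "write error: Transport endpoint is not connected".
    wait $writer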