Red Hat Bugzilla – Bug 1286108
Write on fuse mount failed with "write error: Transport endpoint is not connected" after a successful remove brick operation
Last modified: 2018-02-14 13:03:40 EST
RCA required. To be tried against the latest build.
I think when the commit operation is done, glusterd might've restarted all the bricks, as the volfile has changed. That might've caused the ENOTCONN.
If my volume is a plain distribute of 3 bricks b1, b2, b3, and I remove b3 and commit the operation, would the bricks b1 and b2 be restarted?
(In reply to Raghavendra G from comment #3)
> I think when the commit operation is done, glusterd might've restarted all
> the bricks, as the volfile has changed. That might've caused the ENOTCONN.
> If my volume is a plain distribute of 3 bricks b1, b2, b3, and I remove b3
> and commit the operation, would the bricks b1 and b2 be restarted?
No, the bricks aren't restarted in this case. I've checked that both in the code and with a quick test.
The following is a hypothesis as to why this issue might have happened:
>[2013-08-02 10:42:18.062268] W [fuse-bridge.c:5103:fuse_migrate_fd]
> 0-glusterfs-fuse: syncop_fsync failed (Transport endpoint is not connected) on
> fd (0x1d9338c)(basefd:0x1d9338c basefd-inode.gfid:7a315be7-683f-
> 4a4c-b6d6-85936bde21a1) (old-subvolume:vol_dis_rep-0-new-subvolume:vol_dis_rep-1)
As the log above shows, the fsync issued during the graph switch failed. A possible hypothesis is that the removed brick was killed before the clients were given a chance to "react" to the remove-brick operation (for example, by flushing the cached writes in the old graph, which fd migration does). Since fuse was not able to "migrate" the fd, application continuity was broken. A possible fix is to terminate the brick process only after _all_ clients have been given a chance to migrate their fds.
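
To make the hypothesis concrete, here is a toy model of the ordering problem. It is only a sketch, not glusterd or fuse-bridge code: the Brick and Client classes, commit_remove_brick() and the wait_for_clients flag are all hypothetical names invented for illustration. It assumes that fd migration fsyncs each open fd against the old brick, and that an fsync against a stopped brick fails with ENOTCONN.

```python
import errno


class Brick:
    """Hypothetical stand-in for a brick process (not a GlusterFS API)."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def fsync(self, fd):
        # A stopped brick is modelled as a disconnected endpoint.
        if not self.running:
            raise OSError(errno.ENOTCONN, "Transport endpoint is not connected")
        return 0


class Client:
    """Hypothetical client holding open fds on the old graph."""

    def __init__(self, name, open_fds):
        self.name = name
        self.open_fds = list(open_fds)
        self.migrated = False

    def migrate_fds(self, old_brick):
        # Flush cached writes on the old graph before moving to the new one.
        for fd in self.open_fds:
            old_brick.fsync(fd)
        self.migrated = True


def commit_remove_brick(brick, clients, wait_for_clients):
    """Return the names of clients whose fd migration failed."""
    failed = []
    if not wait_for_clients:
        # Ordering suspected in this report: brick stopped first.
        brick.running = False
    for client in clients:
        try:
            client.migrate_fds(brick)
        except OSError as exc:
            if exc.errno != errno.ENOTCONN:
                raise
            failed.append(client.name)
    # Proposed ordering: stop the brick only after every client had its turn.
    brick.running = False
    return failed
```

With wait_for_clients=False every client with an open fd fails its migration, which corresponds to the syncop_fsync failure in the log; with wait_for_clients=True all clients migrate and the brick is stopped afterwards.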