Bug 1286108 - Write on fuse mount failed with "write error: Transport endpoint is not connected" after a successful remove brick operation
Status: ASSIGNED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: distribute
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Assigned To: Raghavendra G
QA Contact: Anoop
Whiteboard: dht-rca-unknown, dht-must-fix
Keywords: Triaged, ZStream
Depends On: 991402
Blocks: 1286180
Reported: 2015-11-27 05:51 EST by Susant Kumar Palai
Modified: 2018-02-14 13:03 EST
CC: 8 users

Doc Type: Bug Fix
Clone Of: 991402
Type: Bug
Attachments: None
Comment 2 Nithya Balachandran 2016-06-24 04:18:11 EDT
RCA required. To be tried against the latest build.
Comment 3 Raghavendra G 2016-06-24 07:53:53 EDT
I think that when the commit operation is done, glusterd might have restarted all the bricks because the volfile changed. That could have caused the ENOTCONN.

@Atin,

If my volume is a plain distribute of 3 bricks b1, b2, and b3, and I remove b3 and commit the operation, are the bricks b1 and b2 restarted?

regards,
Raghavendra
Comment 4 Atin Mukherjee 2016-06-29 01:01:27 EDT
(In reply to Raghavendra G from comment #3)
> I think that when the commit operation is done, glusterd might have
> restarted all the bricks because the volfile changed. That could have
> caused the ENOTCONN.
> 
> @Atin,
> 
> If my volume is a plain distribute of 3 bricks b1, b2, and b3, and I remove
> b3 and commit the operation, are the bricks b1 and b2 restarted?

No, the bricks aren't restarted in this case. I've verified this both in the code and with a quick test [1].

[1] https://paste.fedoraproject.org/386164/17642214/

Comment 5 Raghavendra G 2016-07-04 03:50:03 EDT
Following is a hypothesis as to why this issue might have happened:

> [2013-08-02 10:42:18.062268] W [fuse-bridge.c:5103:fuse_migrate_fd]
> 0-glusterfs-fuse: syncop_fsync failed (Transport endpoint is not connected)
> on fd (0x1d9338c)(basefd:0x1d9338c basefd-inode.gfid:7a315be7-683f-4a4c-b6d6-85936bde21a1)
> (old-subvolume:vol_dis_rep-0-new-subvolume:vol_dis_rep-1)

As the log above shows, the fsync issued during the graph switch failed. A possible hypothesis is that the removed brick was killed before the clients were given a chance to "react" to the remove-brick operation (for example, by flushing the writes cached in the old graph, which fd migration does). Since fuse was not able to migrate the fd, application continuity is broken. A possible fix would be to terminate the brick process only after _all_ clients have had a chance to migrate their fds.
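To make the ordering concrete, here is a minimal standalone C sketch of the idea. This is not GlusterFS source: the real logic lives in fuse_migrate_fd() and syncop_fsync() in fuse-bridge.c, and migrate_fd() plus the /tmp path below are hypothetical stand-ins that use plain POSIX fds in place of graph subvolumes.

/*
 * Simplified, standalone model of the fd-migration step described
 * above. Plain POSIX fds stand in for old/new graph subvolumes to
 * illustrate the ordering problem.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Migrate an open fd from the old graph to the new one. Step 1 must
 * succeed while the old brick is still reachable; if the brick process
 * was already killed, the flush fails (ENOTCONN on a network fs) and
 * migration aborts, which is what the log excerpt above shows. */
static int migrate_fd(int old_fd, const char *path, int *new_fd)
{
    /* Step 1: flush writes cached against the old graph. */
    if (fsync(old_fd) != 0) {
        fprintf(stderr, "flush on old graph failed (%s)\n",
                strerror(errno));
        return -1;              /* fd is stranded on the old graph */
    }

    /* Step 2: reopen the file on the new graph. */
    *new_fd = open(path, O_RDWR);
    if (*new_fd < 0)
        return -1;

    close(old_fd);              /* old graph can now be torn down */
    return 0;
}

int main(void)
{
    int old_fd = open("/tmp/migrate-demo", O_RDWR | O_CREAT, 0644);
    int new_fd = -1;

    if (old_fd < 0 || write(old_fd, "data", 4) != 4)
        return 1;

    /* The proposed fix: the brick may be killed only AFTER every
     * client has completed step 1 above, not before. */
    if (migrate_fd(old_fd, "/tmp/migrate-demo", &new_fd) == 0)
        printf("fd migrated: %d -> %d\n", old_fd, new_fd);

    if (new_fd >= 0)
        close(new_fd);
    return 0;
}

In this model, step 1 can only succeed while the old fd's backing store is still alive; the proposed fix is precisely to delay killing the removed brick until every client has completed that step.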
