[2016-12-15 11:30:02.898694] E [MSGID: 106054] [glusterd-utils.c:11666:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick: Read-only file system [Read-only file system]
It looks like http://review.gluster.org/12451 introduced this check.
Using git bisect, I identified http://review.gluster.org/#/c/15802/ as the patch that caused the regression.
@Pranith - Any guess as to what in this patch caused the regression?
upstream mainline patch http://review.gluster.org/16214 posted for review.
Copying the discussion from the upstream bug:
(In reply to Mohit Agrawal from comment #4)
> It seems this is expected behavior. As per the current DHT code, on the
> first attempt the layout is set only when all subvolumes are up;
> otherwise the layout is not set and an error is thrown.
At worst, we need a validation in GlusterD to block users from ending up in this situation. Otherwise GlusterD ends up in an inconsistent state where the commit fails on one of the nodes but goes through on the others, and the transaction is not rolled back due to a limitation of GlusterD's design.
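To make the intended validation concrete, here is a minimal sketch (illustrative Python, not actual glusterd code; the function name and data shape are hypothetical) of the kind of pre-commit check described above: reject a replica-count change while any brick of the volume is down, so that no node can commit a half-applied transaction.

```python
# Hypothetical sketch of the GlusterD pre-commit validation described
# above. `bricks` is a list of (brick_path, is_online) pairs; the check
# rejects the operation unless every brick is up.

def validate_replica_count_change(bricks):
    """Return (ok, error_message); ok is False if any brick is down."""
    for path, is_online in bricks:
        if not is_online:
            return (False,
                    "Brick %s is down, changing replica count needs all "
                    "the bricks to be up to avoid data loss" % path)
    return (True, "")
```

Run against a volume with one brick down, this produces the same kind of rejection shown in the verification transcript further below, instead of letting the commit succeed on some nodes and fail on others.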
> Below is the case of a plain distributed environment: when I killed one
> brick after starting the volume, the mount failed, as per current DHT behavior.
> [root@dhcp10-210 ~]# systemctl restart glusterd.service
> [root@dhcp10-210 ~]# gluster v create test 10.65.7.254:/dist1/brick1
> volume create: test: success: please start the volume to access data
> [root@dhcp10-210 ~]# gluster v start test
> volume start: test: success
> [root@dhcp10-210 ~]# gluster v status
> Status of volume: test
> Gluster process TCP Port RDMA Port Online Pid
> Brick 10.65.7.254:/dist1/brick1 49152 0 Y
> Brick 10.65.7.254:/dist2/brick2 49153 0 Y
> Task Status of Volume test
> There are no active volume tasks
> [root@dhcp10-210 ~]# kill 11136
> [root@dhcp10-210 ~]# mount -t glusterfs 10.65.7.254:/test /mnt
> Mount failed. Please check the log file for more detail
> [2016-12-26 06:11:14.871167] W [MSGID: 109005]
> [dht-selfheal.c:2102:dht_selfheal_directory] 0-test-dht: Directory selfheal
> failed: 1 subvolumes down.Not fixing. path = /, gfid =
> [2016-12-26 06:11:14.880232] W [fuse-bridge.c:767:fuse_attr_cbk]
> 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Stale file handle)
> As of now we consider this a corner case; it would be difficult to
> provide a fix unless there is data loss in this scenario.
> Mohit Agrawal
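The DHT behavior quoted above can be sketched in a few lines (illustrative Python, not the actual dht-selfheal code; the function name is hypothetical): on the first lookup, the root layout is healed only if every subvolume is up, otherwise the selfheal is skipped and the lookup fails.

```python
# Hypothetical sketch of the quoted DHT rule: fix the directory layout
# only when all subvolumes are up. `subvols_up` is a list of booleans,
# one per subvolume.

def should_fix_layout(subvols_up):
    """Return True only when every subvolume is up."""
    down = sum(1 for up in subvols_up if not up)
    if down:
        # Mirrors the warning seen in the mount log above.
        print("Directory selfheal failed: %d subvolumes down. Not fixing." % down)
        return False
    return True
```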
http://review.gluster.org/#/c/16330 is posted for review
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/94355
1) Created a 1*2 volume.
2) Ran a FIO workload of random writes from the fuse mount.
3) Killed a brick on the first server node.
4) Added a brick to the cluster to make it an arbiter volume (1*2+1):
[root@dhcp47-141 ~]# gluster volume add-brick blanc replica 3 arbiter 1 dhcp47-144.lab.eng.blr.redhat.com:/bricks/brick0/blanc
volume add-brick: failed: Brick /bricks/brick0/blanc is down, changing replica count needs all the bricks to be up to avoid data loss
Hence marking this bug as verified, per the design change.
Reviewed and updated the doc text, hence removing the needinfo flag.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.