Bug 1404989
Summary: | Fail add-brick command if replica count changes | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Karan Sandha <ksandha> | |
Component: | distribute | Assignee: | Karthik U S <ksubrahm> | |
Status: | CLOSED ERRATA | QA Contact: | Karan Sandha <ksandha> | |
Severity: | medium | Docs Contact: | ||
Priority: | unspecified | |||
Version: | rhgs-3.2 | CC: | amukherj, ksandha, ksubrahm, moagrawa, pkarampu, rcyriac, rhinduja, rhs-bugs, storage-qa-internal | |
Target Milestone: | --- | Keywords: | Regression | |
Target Release: | RHGS 3.2.0 | |||
Hardware: | All | |||
OS: | All | |||
Whiteboard: | ||||
Fixed In Version: | glusterfs-3.8.4-11 | Doc Type: | Bug Fix | |
Doc Text: |
Previously, the 'gluster volume add-brick' command failed on some nodes when a distributed volume was converted into a replicated volume while the volume was not mounted and no lookup had been performed. This could result in inconsistent data across gluster nodes. To avoid this situation, the 'gluster volume add-brick' command is no longer allowed when the replica count has increased and any replica bricks are unavailable.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1406411 (view as bug list) | Environment: | ||
Last Closed: | 2017-03-23 05:57:15 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1351528, 1351530, 1406411 |
Comment 3
Atin Mukherjee
2016-12-15 11:49:09 UTC
I managed to git bisect the patch http://review.gluster.org/#/c/15802/ which has caused the regression. @Pranith - Any guess what from this patch made this regression?

upstream mainline patch http://review.gluster.org/16214 posted for review.

Copying the discussion from the upstream bug:

(In reply to Mohit Agrawal from comment #4)
> Hi,
>
> It seems this is expected behavior. As per the current dht code, on the
> first attempt the layout is set only when all subvolumes are up;
> otherwise it does not set the layout and throws an error.

At worst, we'd need a validation in GlusterD to block users from ending up in this situation; otherwise GlusterD ends up in an inconsistent state where the commit fails on one of the nodes while it goes through on the others, and the transaction is not rolled back due to a limitation of GlusterD's design.

> Below is the case of a plain distributed environment where I killed one
> brick after starting the volume; the mount then fails, as per current
> dht behavior:
>
> [root@dhcp10-210 ~]# systemctl restart glusterd.service
> [root@dhcp10-210 ~]# gluster v create test 10.65.7.254:/dist1/brick1 10.65.7.254:/dist2/brick2
> volume create: test: success: please start the volume to access data
> [root@dhcp10-210 ~]# gluster v start test
> volume start: test: success
> [root@dhcp10-210 ~]# gluster v status
> Status of volume: test
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 10.65.7.254:/dist1/brick1             49152     0          Y       11117
> Brick 10.65.7.254:/dist2/brick2             49153     0          Y       11136
>
> Task Status of Volume test
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> [root@dhcp10-210 ~]# kill 11136
> [root@dhcp10-210 ~]# mount -t glusterfs 10.65.7.254:/test /mnt
> Mount failed.
> Please check the log file for more detail
>
> [2016-12-26 06:11:14.871167] W [MSGID: 109005]
> [dht-selfheal.c:2102:dht_selfheal_directory] 0-test-dht: Directory selfheal
> failed: 1 subvolumes down. Not fixing. path = /, gfid =
> [2016-12-26 06:11:14.880232] W [fuse-bridge.c:767:fuse_attr_cbk]
> 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Stale file handle)
>
> As of now we think it is a corner case; it would be difficult to provide a
> fix unless there is data loss in this case.
>
> Regards,
> Mohit Agrawal

http://review.gluster.org/#/c/16330 is posted for review.

downstream patch: https://code.engineering.redhat.com/gerrit/#/c/94355

Verification steps:

1) Created a 1*2 volume.
2) Ran a FIO workload of random writes from the fuse mount.
3) Killed a brick on the first server node (1).
4) Added a brick to the cluster, making it an arbiter volume 1*2+1:

[root@dhcp47-141 ~]# gluster volume add-brick blanc replica 3 arbiter 1 dhcp47-144.lab.eng.blr.redhat.com:/bricks/brick0/blanc
volume add-brick: failed: Brick /bricks/brick0/blanc is down, changing replica count needs all the bricks to be up to avoid data loss

Hence marking this bug verified as per the design change.

Reviewed and updated the doc text, hence removing the needinfo flag.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
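The behavior the fix enforces can be sketched in a few lines. This is an illustrative Python model of the validation described above, not the actual glusterd C code: an add-brick that raises the replica count is rejected whenever any existing brick is down, since DHT can only set a new layout when all subvolumes are up. The function name and the dict-based brick-state representation are assumptions made for the sketch.

```python
def validate_add_brick(old_replica, new_replica, brick_states):
    """Model of the glusterd-side check added by the fix (illustrative only).

    brick_states: dict mapping brick path -> True if the brick is online.
    Raises RuntimeError when the replica count increases while any
    existing brick is down; otherwise the operation is allowed.
    """
    if new_replica > old_replica:
        # Changing the replica count needs all bricks up to avoid data loss.
        down = [brick for brick, up in brick_states.items() if not up]
        if down:
            raise RuntimeError(
                f"volume add-brick: failed: Brick {down[0]} is down, "
                "changing replica count needs all the bricks to be up "
                "to avoid data loss")
    return "volume add-brick: success"
```

With this check performed before the commit phase, no node ever reaches the state where the commit fails on one peer but succeeds on others, which is what previously left GlusterD inconsistent with no rollback.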