Description of problem:
=======================
In one of my scenarios, the `heal info split-brain` command takes far longer than expected:

[root@dhcp37-64 glusterfs]# time gluster volume heal master info split-brain
Brick 10.70.37.64:/rhs/brick1/b1
Status: Connected
Number of entries in split-brain: 0

Brick 10.70.37.60:/rhs/brick1/b2
Status: Connected
Number of entries in split-brain: 0

Brick 10.70.37.64:/rhs/brick2/b3
Status: Connected
Number of entries in split-brain: 0

Brick 10.70.37.60:/rhs/brick2/b4
Status: Connected
Number of entries in split-brain: 0

real    1m25.705s
user    0m13.429s
sys     0m23.291s
[root@dhcp37-64 glusterfs]#

I was able to reproduce this twice via the following:
1. Create a 2x2 volume.
2. Turn off shd.
3. Bring down 1 brick.
4. Write data from the mount. For the record, I wrote:
   for i in {create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink}; do crefi --multi -n 5 -b 10 -d 10 --max=10K --min=500 --random -T 10 -t text --fop=$i /mnt/master/ ; sleep 10 ; done
5. Bring the brick back up.
6. Perform client-side healing.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-41.el7rhgs.x86_64

How reproducible:
=================
2/2

Actual results:
===============
`heal info split-brain` succeeds but takes a long time.
What was the output of heal info at this time? `info split-brain` goes through all files that need heal, performs lookups, examines xattrs, etc., and prints only the ones in split-brain. So if there are a million files that need heal but zero in split-brain, `info split-brain` would still take a long time. Unless heal info also had zero entries, I don't think this is a bug.
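The behaviour described above can be modelled roughly as follows. This is a hypothetical sketch, not GlusterFS source code: the entry structure and function name are made up, and the dictionary check stands in for the per-file lookups and xattr inspection the real code performs.

```python
# Rough model of why 'info split-brain' scales with the number of entries
# that need heal, not with the number of entries actually in split-brain.
def split_brain_entries(pending_heal):
    """Walk every pending-heal entry; report only those in split-brain."""
    reported = []
    for entry in pending_heal:            # one lookup + xattr check per entry
        if entry["split_brain"]:          # only split-brain entries are printed
            reported.append(entry["name"])
    return reported

# Many files pending heal, none in split-brain: the walk still touches
# every entry, so the command is slow even though it prints zero entries.
pending = [{"name": "file%d" % i, "split_brain": False} for i in range(10)]
print(split_brain_entries(pending))   # -> []
```

The cost is proportional to the full pending-heal queue either way, which is why a result of "0 entries in split-brain" can still take minutes to produce.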
(In reply to Ravishankar N from comment #3)
> What was the output of heal info at this time? `info split-brain` goes
> through all files that need heal, performs lookups, examines xattrs, etc.,
> and prints only the ones in split-brain. So if there are a million files
> that need heal but zero in split-brain, `info split-brain` would still take
> a long time. Unless heal info also had zero entries, I don't think this is
> a bug.

I had some files to be healed, so the time taken could be because of this explanation. However, if there are millions of files to be healed (which was the case in a recent customer case), this delay could be perceived as a hang.

In that case this is a usability bug and requires an enhancement in the design. A warning or info message is definitely required to let the user know that it might take a while and that it could be run in the background.
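As an interim workaround, a user can already push the long-running query into the background and collect its output in a log file. A minimal sketch of that pattern, where the `sleep`/`echo` pipeline stands in for the real `gluster volume heal master info split-brain` command (which this sketch deliberately does not invoke):

```shell
#!/bin/sh
# Sketch: run a slow heal-info style query in the background so the shell
# is not blocked for minutes. The subshell below is a stand-in for:
#   gluster volume heal master info split-brain
LOG=$(mktemp)
( sleep 1; echo "Number of entries in split-brain: 0" ) > "$LOG" 2>&1 &
PID=$!
echo "query running as pid $PID; output will be in $LOG"
wait "$PID"       # or skip the wait and inspect $LOG later
cat "$LOG"
```

This does not remove the delay; it only keeps the terminal usable while the query runs, which is roughly what a built-in "run in background" option would offer.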
(In reply to Rahul Hinduja from comment #5)
> (In reply to Ravishankar N from comment #3)
> > What was the output of heal info at this time? `info split-brain` goes
> > through all files that need heal, performs lookups, examines xattrs, etc.,
> > and prints only the ones in split-brain. So if there are a million files
> > that need heal but zero in split-brain, `info split-brain` would still
> > take a long time. Unless heal info also had zero entries, I don't think
> > this is a bug.
>
> I had some files to be healed, so the time taken could be because of this
> explanation. However, if there are millions of files to be healed (which
> was the case in a recent customer case), this delay could be perceived as
> a hang.
>
> In that case this is a usability bug and requires an enhancement in the
> design. A warning or info message is definitely required to let the user
> know that it might take a while and that it could be run in the background.

Thanks Rahul. I think we can fix this upstream first and not target it for 3.4.0. We could do it as part of https://bugzilla.redhat.com/show_bug.cgi?id=1349352#c12, which calls for more changes from a usability point of view. Does that sound ok?
> Thanks Rahul. I think we can fix this upstream first and not target it for
> 3.4.0. We could do it as part of
> https://bugzilla.redhat.com/show_bug.cgi?id=1349352#c12, which calls for
> more changes from a usability point of view. Does that sound ok?

Agreed. I am ok with deferring this from 3.4.0 and having it fixed as part of bug 1349352.
This bug is being fixed as part of https://bugzilla.redhat.com/show_bug.cgi?id=1721355. If this issue is seen even after the fix, please feel free to re-open this bug. *** This bug has been marked as a duplicate of bug 1721355 ***