Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 820518

Summary: Issues with rebalance and self heal running simultaneously
Product: [Community] GlusterFS
Reporter: shylesh <shmohan>
Component: core
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED DUPLICATE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: pre-release
CC: amarts, gluster-bugs, pierre.francois, pkarampu
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-11 08:26:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
sos report (Flags: none)

Description shylesh 2012-05-10 09:35:11 UTC
Created attachment 583493 [details]
sos report

Description of problem:
While self heal is in progress, rebalance does not completely migrate the files; rerunning rebalance migrates some more of the files.

Version-Release number of selected component (if applicable):
3.3.0qa40

How reproducible:


Steps to Reproduce (see the command sketch after the list):
1. Create a pure replica volume with replica count 2 (say brick1, brick2).
2. Create 100 files of 10 MB each, plus a directory tree of depth 100 with one 1 MB file at each level.
3. Peer probe another machine and do an add-brick of another pair (brick3, brick4).
4. Initiate rebalance and bring down brick2.
5. After some time, bring the brick back up.
6. Checking the rebalance status shows "completed" with a count of 34, while the log messages say "transport endpoint is not connected".
7. Rerunning rebalance migrates some more of the files, bringing the total up to 64.
8. Arequal-checksums on the mount point before and after rebalance are the same.
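
For reference, a minimal command sketch of the steps above, assuming a two-node setup (server1, server2); the volume name and brick paths are illustrative, not taken from the attached sosreport:

# create and start a 2-brick pure replica volume on server1 (names are illustrative)
gluster volume create repvol replica 2 server1:/bricks/b1 server1:/bricks/b2
gluster volume start repvol
# (populate the data set from a client mount of repvol)

# expand to a second node and add another brick pair
gluster peer probe server2
gluster volume add-brick repvol server2:/bricks/b3 server2:/bricks/b4

# start rebalance, then kill the glusterfsd process serving brick2
gluster volume rebalance repvol start
# (find the brick PID with `gluster volume status repvol` and kill it)

# bring the brick back up and check rebalance progress
gluster volume start repvol force
gluster volume rebalance repvol status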

Expected results:
Rebalance status should not report "completed" until data migration has actually completed.

Attached the SOS report. Rebalance log location: var/log/glusterfs/repl-rebalance.log

Comment 1 Amar Tumballi 2012-05-11 07:42:08 UTC
I suspect the bug is mostly due to the issue of migrating from pure replica to distributed-replicate, but considering the arequal-checksums are fine, I would reduce the priority.

Comment 2 shishir gowda 2012-06-08 08:42:10 UTC
I tried to re-create this on a single node. I still got failures.

When the brick is down, we seem to be getting duplicate entries from readdirp, and hence multiple migration attempts are made.

[2012-06-08 13:32:28.202224] I [dht-rebalance.c:639:dht_migrate_file] 0-new-dht: /55.file: attempting to move from new-replicate-0 to new-replicate-1
.
.
[2012-06-08 13:32:31.624447] I [dht-rebalance.c:639:dht_migrate_file] 0-new-dht: /55.file: attempting to move from new-replicate-0 to new-replicate-1
.
.
[2012-06-08 13:32:31.640475] I [dht-rebalance.c:639:dht_migrate_file] 0-new-dht: /55.file: attempting to move from new-replicate-0 to new-replicate-1

This leads to a few of the migrations failing.
It looks like when a child of the replica is down, readdir does not behave as expected.
Note: if a child of DHT is down, we stop rebalance (assert-on-child-down is set to on).
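
As a quick check, the duplicate pick-ups can be counted by grouping the "attempting to move" lines per file; this is a minimal sketch, assuming the rebalance log path from the report (the exact path varies with the volume name):

# count migration attempts per file; counts > 1 indicate duplicate readdirp entries
grep 'attempting to move' /var/log/glusterfs/repl-rebalance.log |
  awk -F': ' '{print $2}' | sort | uniq -c | sort -rn | head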

Comment 3 Amar Tumballi 2012-07-11 04:10:48 UTC
Part of the issue was fixed by loading distribute by default even in the case of a single subvolume (for the layout xattr creation). Also, considering that we disable replicate self-heal in the rebalance process, this issue is not seen much.

Other than that, the issue is happening because replicate returns a wrong (or, say, incorrect) offset in readdirp_cbk() when a brick goes down. Need more feedback from the replicate team to handle this issue.

Currently I don't see an issue with distribute (i.e., the rebalance process).

Comment 4 shishir gowda 2012-10-29 13:09:30 UTC
This is a replica-related bug. Readdir returns different entries at different offsets from the subvolume when a brick of a replica pair goes down.

Comment 5 Pranith Kumar K 2012-12-11 08:26:25 UTC

*** This bug has been marked as a duplicate of bug 859387 ***