Bug 820518 - Issues with rebalance and self heal going simultaneously
Summary: Issues with rebalance and self heal going simultaneously
Keywords:
Status: CLOSED DUPLICATE of bug 859387
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: pre-release
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-05-10 09:35 UTC by shylesh
Modified: 2013-11-22 23:14 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-12-11 08:26:25 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
sos report (1000.45 KB, application/x-xz)
2012-05-10 09:35 UTC, shylesh

Description shylesh 2012-05-10 09:35:11 UTC
Created attachment 583493 [details]
sos report

Description of problem:
While self-heal is in progress, rebalance does not completely migrate the files; rerunning rebalance migrates some more of the files.

Version-Release number of selected component (if applicable):
3.3.0qa40

How reproducible:


Steps to Reproduce:
1. Create a pure replica volume with count 2 (say brick1, brick2). A hedged gluster CLI sketch of these steps follows the list.
2. Created 100 files of 10 MB each, plus directories of depth 100 with each level containing one 1 MB file.
3. Peer probed another machine and did an add-brick (brick3, brick4).
4. Initiated the rebalance and brought down brick2.
5. After some time, brought the brick back up.
6. Checking the rebalance status says completed with a count of 34; the log messages say "transport endpoint not connected".
7. Rerunning rebalance migrates some more of the files, totaling up to 64.
8. Arequal-checksums on the mount point before and after rebalance are the same.
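
For reference, a minimal gluster CLI sketch of the steps above, assuming two nodes (node1, node2), brick paths under /export, a client mount at /mnt/repl and a volume named repl; all of these names are placeholders, not taken from the report:

# 1. Pure replica volume with count 2 on the first node
gluster volume create repl replica 2 node1:/export/brick1 node1:/export/brick2
gluster volume start repl

# 2. Populate data from a client mount (100 x 10 MB files; the deep directory
#    tree with one 1 MB file per level is created the same way)
mount -t glusterfs node1:/repl /mnt/repl
for i in $(seq 1 100); do dd if=/dev/zero of=/mnt/repl/$i.file bs=1M count=10; done

# 3. Probe the second node and add a new replica pair (volume becomes distributed-replicate)
gluster peer probe node2
gluster volume add-brick repl node2:/export/brick3 node2:/export/brick4

# 4. Start rebalance, then bring down brick2 (find its glusterfsd PID via
#    'gluster volume status' and kill it)
gluster volume rebalance repl start
kill <pid-of-brick2-glusterfsd>

# 5. Bring the brick back up
gluster volume start repl force

# 6./7. Check the status, then rerun rebalance and compare the migrated counts
gluster volume rebalance repl status
gluster volume rebalance repl start
gluster volume rebalance repl status

# 8. Compare checksums of the mount point before and after rebalance (the report
#    uses arequal-checksum; any recursive checksum of /mnt/repl works for this check)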

Expected results:
Status should not say completed until proper data migration happens.

Attached the sosreport. Rebalance log location: /var/log/glusterfs/repl-rebalance.log

Comment 1 Amar Tumballi 2012-05-11 07:42:08 UTC
I suspect the bug is mostly due to the issue of migrating from pure replica to distributed-replicate, but considering the arequal-checksums are fine, I would reduce the priority.

Comment 2 shishir gowda 2012-06-08 08:42:10 UTC
I tried to re-create this on a single node. I still got failures.

When the brick is down, we seem to be getting duplicate entries from readdirp, and hence multiple migration attempts are being made.

[2012-06-08 13:32:28.202224] I [dht-rebalance.c:639:dht_migrate_file] 0-new-dht: /55.file: attempting to move from new-replicate-0 to new-replicate-1
.
.
[2012-06-08 13:32:31.624447] I [dht-rebalance.c:639:dht_migrate_file] 0-new-dht: /55.file: attempting to move from new-replicate-0 to new-replicate-1
.
.
[2012-06-08 13:32:31.640475] I [dht-rebalance.c:639:dht_migrate_file] 0-new-dht: /55.file: attempting to move from new-replicate-0 to new-replicate-1

This leads to a few of the migrations failing.
It looks like when a child of replicate is down, readdir doesn't perform as expected.
Note: if a child of DHT is down, we stop rebalance (assert-on-child-down is set to on).
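
A quick way to see the duplicate readdirp entries from such a log (the log path and the awk field position are assumed from the format above, not part of the original comment) is to count the repeated migration attempts per file:

# Count 'attempting to move' lines per file; counts > 1 mean the same entry
# was handed to the rebalance crawl more than once (log name assumed for the
# 'new' volume seen in the messages above)
grep 'attempting to move' /var/log/glusterfs/new-rebalance.log \
  | awk '{print $6}' | sort | uniq -c | sort -rn | head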

Comment 3 Amar Tumballi 2012-07-11 04:10:48 UTC
Part of the issue was fixed by loading distribute by default even in the case of a single subvolume (for the layout xattr creation). Also, considering we disable replicate self-heal in the rebalance process, this issue is not seen much.

Other than that, the issue is happening because replicate returns a wrong (incorrect) offset in readdirp_cbk() when a brick goes down. Need more feedback from the replicate team to handle this issue.


Currently I don't see an issue with distribute (i.e., the rebalance process).
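
Not part of the original comment, but as a practical check one can verify that no self-heals are pending before rerunning rebalance, so the two don't overlap; a sketch assuming the volume from the report is named repl:

# List entries still pending self-heal on the replica pairs
gluster volume heal repl info

# Once the heal backlog is empty, rerun rebalance and re-check the migrated count
gluster volume rebalance repl start
gluster volume rebalance repl status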

Comment 4 shishir gowda 2012-10-29 13:09:30 UTC
This is a replica-related bug. Readdir returns different entries at different offsets from the subvolume when a brick in a replica pair goes down.

Comment 5 Pranith Kumar K 2012-12-11 08:26:25 UTC

*** This bug has been marked as a duplicate of bug 859387 ***

