Bug 1315781
| Summary: | AFR returns the node uuid of the same node for every file in the replica | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nithya Balachandran <nbalacha> |
| Component: | replicate | Assignee: | Karthik U S <ksubrahm> |
| Status: | CLOSED ERRATA | QA Contact: | Nag Pavan Chilakam <nchilaka> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.1 | CC: | amukherj, aspandey, asrivast, ksubrahm, nchilaka, ravishankar, rcyriac, rhinduja, rhs-bugs |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.4-26 | Doc Type: | Bug Fix |
| Doc Text: | The rebalance process uses an extended attribute to determine which node migrates a file. In replicated and erasure-coded (dispersed) volumes, only the first node of a replica set was listed in this attribute, so only the first node of a replica set migrated files. Replicated and erasure-coded volumes now list all nodes in a replica set, ensuring that rebalance processes on all nodes migrate files as expected. (See the sketch after this table for how to inspect this attribute.) | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1366817 (view as bug list) | Environment: | |
| Last Closed: | 2017-09-21 04:25:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1462693, 1462790, 1463250, 1464078, 1487647 | | |
| Bug Blocks: | 1366817, 1417147, 1451561, 1451573, 1487042 | | |
| Attachments: | | | |
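For reference, the extended attribute described in the Doc Text can be inspected from a client mount. A minimal sketch, assuming a volume mounted at /mnt/testvol and the trusted.glusterfs.node-uuid virtual xattr exposed through the client stack; the file path and the exact output format (a single UUID before the fix, the full list of replica-node UUIDs after it) are illustrative assumptions:

```sh
# Query the node-uuid virtual xattr that rebalance consults when
# deciding which node migrates a file (mount path is an assumption).
getfattr -n trusted.glusterfs.node-uuid /mnt/testvol/dir1/file1

# Illustrative output: before the fix, every file in a replica
# reported only the first node's UUID; after the fix, all nodes of
# the replica set should be listed, e.g.
#   trusted.glusterfs.node-uuid="<uuid-of-n1> <uuid-of-n2>"
```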
Description
Nithya Balachandran
2016-03-08 15:19:21 UTC
upstream patch: https://review.gluster.org/17084

one more upstream patch in addition to 17084: https://review.gluster.org/#/c/17239/

on_qa validation was blocked by BZ#1462693 - with AFR now making both nodes return the UUID for a file, geo-rep consumes more resources.

ON_QA VALIDATION:
TEST BUILD: 3.8.4-36

Terminology used regularly in the cases below:
- 1x2 volume with replicas b1 on n1 and b2 on n2
- add-brick to make the volume 2x2, with the new replica pair b3 on n1 and b4 on n2

TC#1) Both nodes in a replica set must now participate in rebalance; previously only one node migrated files (check rebalance status). ---> PASS. This also reduces overall rebalance time, as all nodes of a replica participate in rebalance instead of only the first node.

TC#2) When a brick is down, the node hosting the brick must continue with rebalance. --> PASSES in general (see the next case too), but it fails to rebalance the remaining files in the directory it was working on and moves on to the next directory. (Raised BZ#1476676 - Rebalance skips files when a brick goes down in spite of AFR passing both node ids of the replica to rebalance.)

TC#3) When one of the source bricks is down, the other node of the source replica must be able to rebalance all pending files. Nodes must be able to rebalance files from other nodes too: n1 must be able to rebalance files even if they are on n2, as long as n1 and n2 participate in the same DHT subvol range. E.g., on a 4-node setup with replicas on n1,n2 and n3,n4, if a rebalance is triggered and b1 goes down, n1 must still be able to rebalance files by reading them from n2 (it won't be able to rebalance the n3/n4 bricks, as they are in a different subvol). ---> PASS

TC#4) AFR must still pass both UUIDs to the DHT layer even if one of the source replicas is down. This was verified as below (see the sketch after this list). --> PASS
> on a 1x2 volume, mount the volume and create at least 3 directories (say dir{1..3}) with about 1 lakh (100,000) files in each
> now add-brick to make it 2x2
> now trigger rebalance
> while rebalance is in progress: at the start, rebalance picks the directories requiring rebalance. Once the first directory (say dir1) is being rebalanced, bring down b1
> rebalance from n1 may skip files in dir1 (the directory where rebalance was in progress); however, it must proceed to dir2 and rebalance those files, since AFR still sends both node UUIDs while b2 (the other replica) is up. If it did not send both, n1 would stop rebalancing, which would be a problem. AFR does send them, so this case works as expected: n1 goes ahead with rebalancing dir2 and dir3.

TC#5) Check with EC whether all nodes participate in rebalance. --> PASS; yes, all participate.

TC#6) Only nodes hosting replica sets that are participating in rebalance must work on rebalance. --> PASS
Had a 1x2 volume; added a new replica pair with b3 on n1 and b4 on n3 (a new node), then ran a rebalance. n3 does not participate. That makes sense, given that n3 is a destination and the AFR of the primary replica pair passes only the UUIDs of n1 and n2 to the DHT layer (that AFR replica exists only on n1 and n2). Same with remove-brick: only the nodes hosting the bricks being removed participate in rebalance.

TC#7) Check an arbiter volume; all nodes must participate. ====> PASS
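A minimal shell sketch of the TC#4 workflow above, using the terminology from this comment. Hostnames n1/n2, brick paths under /bricks, the volume name testvol, the mount point, and the reduced file count are all illustrative assumptions, not the exact commands used during validation:

```sh
# 1x2 replica volume: b1 on n1, b2 on n2 (names are assumptions)
gluster volume create testvol replica 2 n1:/bricks/b1 n2:/bricks/b2
gluster volume start testvol
mount -t glusterfs n1:/testvol /mnt/testvol

# At least 3 directories; the test used ~1 lakh files per directory,
# a smaller count is used here to keep the sketch quick
mkdir /mnt/testvol/dir{1..3}
for d in 1 2 3; do touch /mnt/testvol/dir$d/file{1..1000}; done

# Expand to 2x2 with the new replica pair b3 on n1, b4 on n2,
# then trigger rebalance
gluster volume add-brick testvol n1:/bricks/b3 n2:/bricks/b4
gluster volume rebalance testvol start

# While dir1 is being rebalanced, bring down b1 (kill its brick
# process on n1; the PID is visible in `gluster volume status testvol`),
# then confirm n1 still proceeds to dir2 and dir3
gluster volume rebalance testvol status
```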
Moving to VERIFIED, as most of the thematic test cases (those exercising the core functionality of the fix) are working and PASSED. However, the following bugs were raised:
- BZ#1476676 and BZ#1476828 (above)
- BZ#1476852 - DHT layer must dynamically load-balance rebalance activity instead of hard presetting entries for each node
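The per-node participation that the verdict rests on (TC#1, TC#6, TC#7) is read from the rebalance status output. A sketch, continuing the assumed testvol setup above; the expectation about which nodes show migration counts is taken from the test cases, not from captured output:

```sh
# Every node of a participating replica set should now report
# migration activity, not just the first node as before the fix
gluster volume rebalance testvol status

# For remove-brick, only the nodes hosting the removed bricks
# should show migration activity (TC#6)
gluster volume remove-brick testvol n1:/bricks/b3 n2:/bricks/b4 start
gluster volume remove-brick testvol n1:/bricks/b3 n2:/bricks/b4 status
```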
Created attachment 1307190 [details]: crude testcase and logs while validating
Looks good to me.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774