Description of problem
======================
Previously, when the node uuid of a file was read, AFR returned only one uuid out of the two replica bricks. This meant that when a rebalance was triggered, only one brick participated and load balancing was lost. That problem was fixed with BZ#1315781 - AFR returns the node uuid of the same node for every file in the replica.

However, that fix did not consider the other consumers of this facility. For example, geo-rep also relies on this information from AFR, and with the new behaviour a geo-rep session is served by both replica bricks, so the same data can be synced twice. Instead of one brick being Passive and the other Active, both turn Active, leading to unnecessary resource utilization (e.g. CPU). A minimal sketch of how geo-rep consumes this value follows below.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-28

Proposed Resolution:
--------------------
Option 1: fix this problem not just for geo-rep, but for all consumers of AFR,
or
Option 2: roll back the fix ("BZ#1315781 - AFR returns the node uuid of the same node for every file in the replica") if the above is too complicated for the time being.
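For illustration only, here is a minimal Python sketch (not gsyncd's actual code; the mount path, probe file, helper names, and value parsing are hypothetical assumptions) of how a consumer such as geo-rep can elect an Active worker from the node-uuid virtual xattr, and why returning every brick's uuid makes both replica workers turn Active.

# Hypothetical sketch, not gsyncd's real code: deciding Active vs. Passive
# from the node-uuid virtual xattr read through the master mount.
import os

NODE_UUID_XATTR = "trusted.glusterfs.node-uuid"  # virtual xattr served through the client stack

def read_node_uuids(mount_path, probe_file):
    # Old behaviour: the value holds a single uuid for the whole replica set.
    # BZ#1315781 behaviour: the value holds one uuid per brick of the set.
    raw = os.getxattr(os.path.join(mount_path, probe_file), NODE_UUID_XATTR)
    return raw.decode().strip("\x00").split()

def is_active_worker(mount_path, probe_file, local_node_uuid):
    uuids = read_node_uuids(mount_path, probe_file)
    # Membership test: with a single returned uuid only one node matches, but
    # when every brick's uuid is returned each replica node matches itself,
    # so both workers become Active and sync the same data twice.
    return local_node_uuid in uuids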
Note: this bug is blocking validation of BZ#1315781 - AFR returns the node uuid of the same node for every file in the replica.
We'd need changes from geo-rep and dht to fix this issue.

geo-rep upstream patch : https://review.gluster.org/17582
(In reply to Atin Mukherjee from comment #4)
> We'd need changes from geo-rep and dht to fix this issue.
>
> geo-rep upstream patch : https://review.gluster.org/17582

This approach has again changed as it breaks tiering.

Final decision was to fall back to the old behavior.

AFR patch : https://review.gluster.org/17576
EC patch : https://review.gluster.org/17594
DHT patch : yet to be posted
(In reply to Atin Mukherjee from comment #6)
> (In reply to Atin Mukherjee from comment #4)
> > We'd need changes from geo-rep and dht to fix this issue.
> >
> > geo-rep upstream patch : https://review.gluster.org/17582
>
> This approach has again changed as it breaks tiering.
>
> Final decision was to fall back to the old behavior.
>
> AFR patch : https://review.gluster.org/17576
> EC patch : https://review.gluster.org/17594
> DHT patch : yet to be posted

DHT & tiering changes : https://review.gluster.org/17595
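The actual change lives in the AFR/EC xlators and is written in C; the Python toy below only illustrates, under stated assumptions, the difference between the two behaviours discussed in this bug (the function names and inputs are hypothetical).

# Toy illustration of what "fall back to the old behavior" means for the
# node-uuid value; not the real xlator code.
def node_uuid_old(child_uuids, child_up):
    # Old behaviour: report one uuid for the whole replica set (the first
    # child that is up), so every reader sees the same single node and only
    # that node's geo-rep worker turns Active.
    for uid, up in zip(child_uuids, child_up):
        if up:
            return [uid]
    return []

def node_uuid_new(child_uuids, child_up):
    # BZ#1315781 behaviour: report every up brick's uuid. Rebalance can then
    # be spread across nodes, but a consumer that only checks membership
    # (like geo-rep) turns Active on every replica node.
    return [uid for uid, up in zip(child_uuids, child_up) if up]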
on_qa validation on 3.8.4-36

As part of the fix, in all of the following cases only one node must be Active for the master-slave session.

case 1:
=======
Create a master cluster (2 nodes) and a slave cluster (2 nodes).
Create a 1x2 volume on the master and on the slave.
meta_volume is not enabled.
Create and start the geo-replication session.
Geo-replication shows one Active brick (from one node).
Create data from the master mount.
Data syncs correctly, checksum matched (see the sketch after these cases).
RESULT: Only 1 Active is shown for the 1x2 volume on the 2-node setup.

case 2:
=======
(Continued from the previous case)
kill -9 the brick pid that is Active.
The other brick becomes Active and the killed brick becomes Passive.
Create data from the master mount.
Data syncs correctly, checksum matched.
RESULT: Bringing down the Active brick makes that worker Passive, and the other brick goes from Passive to Active.

case 3:
=======
(Continued from the previous case)
Bring the Active node down.
The brick that was Passive becomes Active (only one brick is Active).
RESULT: When the Active node is brought down, the only entry shown is for the online node, and that brick becomes ACTIVE.

case 4:
=======
Create a dispersed master volume, 1 x (4 + 2) = 6, on a 6-node cluster and a 2x2 DR volume on the slave (from a 2-node cluster).
Create and start the geo-replication session without enabling meta_volume.
Create data from the master mount.
Data syncs correctly, checksum matched.
RESULT: One brick is ACTIVE and the rest are Passive.

case 5:
=======
(Continued from the previous case)
kill -9 the Active brick.
Another brick becomes Active and the rest remain Passive.
Create data from the master mount.
Data syncs correctly, checksum matched.
RESULT: Bringing down the Active brick makes that worker Passive, and one of the other bricks goes from Passive to Active.
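The "checksum matched" step in the cases above compares the data trees on the master and slave. A minimal sketch of such a check, assuming both volumes are FUSE-mounted locally at hypothetical paths (/mnt/master and /mnt/slave are examples, not paths from this bug):

# Minimal sketch of the "checksum matched" verification; mount points are
# hypothetical examples.
import hashlib
import os

def tree_checksum(root):
    # Hash every file's relative path and contents under root in a stable
    # (sorted) order so two identical trees produce the same digest.
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as fh:
                digest.update(fh.read())
    return digest.hexdigest()

if __name__ == "__main__":
    master = tree_checksum("/mnt/master")   # hypothetical master mount point
    slave = tree_checksum("/mnt/slave")     # hypothetical slave mount point
    print("checksum matched" if master == slave else "checksum MISMATCH")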
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774