Bug 1462693 - with AFR now making both nodes to return UUID for a file will result in georep consuming more resources
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.3.0
Assigned To: Karthik U S
QA Contact: nchilaka
Whiteboard: 3.3.0-devel-freeze-exception
Depends On: 1464078 1462790 1463250 1487647
Blocks: 1315781 1417151
Reported: 2017-06-19 05:57 EDT by nchilaka
Modified: 2017-09-21 00:59 EDT (History)
CC: 9 users

See Also:
Fixed In Version: glusterfs-3.8.4-32
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1462790 1464078
Environment:
Last Closed: 2017-09-21 00:59:42 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description nchilaka 2017-06-19 05:57:30 EDT
Description of problem
======================
Previously, when the node-uuid virtual xattr was read, AFR returned the UUID of only one of the two replica bricks. This meant that when a rebalance was triggered, only one brick participated in migrating files, so the load was not balanced across the replicas.
This problem was fixed in BZ#1315781 - AFR returns the node uuid of the same node for every file in the replica.
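The load-balancing effect of that fix can be sketched as follows. This is an illustrative Python sketch with made-up UUIDs, not the actual gluster code: a rebalance process on a node migrates a file only when the node-uuid returned for that file is its own, so varying the returned UUID per file spreads migration work across the replicas.

```python
# Hypothetical sketch of DHT rebalance work distribution via node-uuid.

def migrates(own_uuid, file_node_uuid):
    # A node migrates a file only if the returned node-uuid is its own.
    return file_node_uuid == own_uuid

# Before BZ#1315781: AFR returned the same node's UUID for every file,
# so one replica node did all the migration work.
files_old = {"f1": "uuid-node1", "f2": "uuid-node1", "f3": "uuid-node1"}
assert sum(migrates("uuid-node2", u) for u in files_old.values()) == 0

# After the fix: the returned UUID varies per file, so both replica
# nodes share the migration load.
files_new = {"f1": "uuid-node1", "f2": "uuid-node2", "f3": "uuid-node1"}
assert sum(migrates("uuid-node2", u) for u in files_new.values()) == 1
```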

However, the fix did not consider other consumers of this facility.
For example, geo-replication also relies on this service from AFR. With both replica bricks now returning their UUIDs, geo-replication workers on both replicas turn Active, so the same data can be synced twice. Instead of one brick being Passive and the other Active, both become Active, leading to unnecessary resource utilization (e.g. CPU).
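The double-sync problem can be illustrated with a simplified sketch. This is not the actual gluster code, and the UUIDs are made up; the point is only that a worker which finds its own node UUID in the node-uuid value considers itself Active, so returning both UUIDs makes both workers Active.

```python
# Hypothetical sketch of the geo-rep Active/Passive decision.

def is_active(own_uuid, node_uuid_value):
    """A worker is Active if its node UUID appears in the returned value."""
    return own_uuid in node_uuid_value.split()

# Old AFR behavior: one UUID per replica -> exactly one Active worker.
old_value = "uuid-node1"
assert is_active("uuid-node1", old_value) is True
assert is_active("uuid-node2", old_value) is False

# New AFR behavior (after BZ#1315781): both UUIDs returned -> both
# workers consider themselves Active and sync the same data twice.
new_value = "uuid-node1 uuid-node2"
assert is_active("uuid-node1", new_value) is True
assert is_active("uuid-node2", new_value) is True
```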


Version-Release number of selected component (if applicable):
======
3.8.4-28



Proposed Resolution:
-------------------
option 1: fix this problem for all consumers of AFR, not just geo-replication
or
option 2: roll back the fix ("BZ#1315781 - AFR returns the node uuid of the same node for every file in the replica") for the time being, if the above is too complicated
Comment 2 nchilaka 2017-06-19 06:00:34 EDT
Note: this bug is blocking validation of BZ#1315781 - AFR returns the node uuid of the same node for every file in the replica
Comment 4 Atin Mukherjee 2017-06-20 11:18:10 EDT
We'd need changes from geo-rep and dht to fix this issue.

geo-rep upstream patch : https://review.gluster.org/17582
Comment 6 Atin Mukherjee 2017-06-21 07:42:46 EDT
(In reply to Atin Mukherjee from comment #4)
> We'd need changes from geo-rep and dht to fix this issue.
> 
> geo-rep upstream patch : https://review.gluster.org/17582

this approach has again changed as it breaks tiering.

Final decision was to fall back to the old behavior. 

AFR patch : https://review.gluster.org/17576
EC patch : https://review.gluster.org/17594
DHT patch : yet to be posted
Comment 7 Atin Mukherjee 2017-06-23 02:43:57 EDT
(In reply to Atin Mukherjee from comment #6)
> (In reply to Atin Mukherjee from comment #4)
> > We'd need changes from geo-rep and dht to fix this issue.
> > 
> > geo-rep upstream patch : https://review.gluster.org/17582
> 
> this approach has again changed as it breaks tiering.
> 
> Final decision was to fall back to the old behavior. 
> 
> AFR patch : https://review.gluster.org/17576
> EC patch : https://review.gluster.org/17594
> DHT patch : yet to be posted

DHT & tiering changes : https://review.gluster.org/17595
Comment 12 nchilaka 2017-07-31 11:37:24 EDT
on_qa validation on 3.8.4-36

As part of the fix, in all cases only one worker per replica set must be Active.



case 1 :
=======
create a master cluster (2 nodes) and a slave cluster (2 nodes)
create a 1x2 volume on the master and on the slave
meta_volume is not enabled
create and start the geo-replication session
Geo-replication shows one Active brick (from one node)
Create data from the master mount
Data syncs correctly; checksums matched

RESULT: Only 1 Active is shown from 1x2 volume on 2 node setup

case 2 :
======
(Continued from previous case)
kill -9 the brick pid that is Active
The other brick becomes Active and the killed brick becomes Passive
create data from the master mount
Data syncs correctly; checksums matched

RESULT: Bringing down the Active brick makes its worker Passive, and the previously Passive brick becomes Active


case 3:
======
(continued from previous case)
Bring the Active node down
The brick that was Passive becomes Active (only one brick remains, and it is Active)

RESULT: If the Active node is brought down, the only entry shown is the online node's, and that brick becomes Active

case 4:
=======
create a dispersed master volume, 1 x (4 + 2) = 6, on a 6-node cluster, and a 2x2 distributed-replicate volume on the slave (2-node cluster)
create and start the geo-replication session without enabling meta_volume
Create data from the master mount
Data syncs correctly; checksums matched

RESULT: one brick ACTIVE and the rest are passive

case 5:
======
(continued from previous case)
kill -9 the Active brick
Another brick becomes Active and the rest stay Passive
Create data from the master mount
Data syncs correctly; checksums matched

RESULT: Bringing down the Active brick makes its worker Passive, and one of the other bricks becomes Active
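The check performed in each case above amounts to counting Active workers in the geo-replication status output. A minimal sketch, with example volume/host names and a hard-coded sample of the status output (the real output would come from `gluster volume geo-replication mastervol slavehost::slavevol status`):

```shell
# Sample status output; in practice this is produced by:
#   gluster volume geo-replication mastervol slavehost::slavevol status
status_output='MASTER NODE  STATUS
node1        Active
node2        Passive'

# Count Active workers; exactly one is expected per replica set after the fix.
active=$(printf '%s\n' "$status_output" | grep -c 'Active')
echo "$active"
```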
Comment 15 errata-xmlrpc 2017-09-21 00:59:42 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
