Bug 1462305

Summary: [Doc] Improve geo-replication documentation
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: John Call <jcall>
Component: doc-Administration_Guide
Sub Component: 3.2 Release
Assignee: Divya <divya>
QA Contact: Rochelle <rallan>
Docs Contact:
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
CC: asriram, avishwan, bkunal, jcall, mhideo, nlevinki, rhinduja, rhs-bugs, rwheeler, storage-doc
Version: rhgs-3.2
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-21 04:23:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417157, 1468976

Description John Call 2017-06-16 16:48:56 UTC
Document URL: 
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.2/html-single/administration_guide/#chap-Managing_Geo-replication

Section Number and Name: 
10. Managing Geo-replication

Describe the issue: 
The creation of geo-replication sync processes is not described. In general, the entire geo-replication section does not provide enough information. It needs to be improved to mention the brick/volume type and configuration in addition to the "sync-jobs" tunable.

Suggestions for improvement: 
Please add language similar to the following, taken from a sme-storage conversation with Aravinda:

<quote>
If the volume type is Distribute, the Monitor process will start one worker for each brick. However, if the volume type is Replicate or Disperse, the Monitor process will make one brick in each replica/disperse group (subvolume) active and all others passive. This is to avoid duplicate syncing.

For example, a Distributed-Dispersed volume defined as "1 x (4 + 2)" will give one active session, and a volume defined as "2 x (4 + 2)" will give two active sessions.
</quote>
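
For reference, the Active/Passive state of each worker is visible in the geo-replication status output. A minimal illustration follows (the volume and host names are placeholders):

  # One status line is printed per brick/worker; the STATUS column
  # reports Active or Passive for each worker.
  gluster volume geo-replication mastervol slavehost::slavevol status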

Additional information: 
https://access.redhat.com/support/cases/#/case/01840400
https://bugzilla.redhat.com/show_bug.cgi?id=1451178

Comment 2 Aravinda VK 2017-06-17 11:36:52 UTC
(In reply to John Call from comment #0)
> [...]

Number of workers and sync-jobs are entirely different things.

The number of workers will always be equal to the number of bricks in the Master volume (this is the number of status lines for that session).

The number of sync-jobs represents the maximum number of syncer threads inside each worker. For example, if the number of bricks is 6 and sync-jobs is 3, then at full utilization the total number of syncer threads for that volume is 6 x 3 = 18, spread across the nodes and workers.

Number of active workers:
- In case of a Distribute volume, all bricks will be active and participate in syncing.
- In case of Replica/EC, one brick/worker from each Replica/EC group (subvolume) will be active and participate in syncing, to avoid duplicate syncing from the other bricks.
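
A minimal sketch of how this can be inspected and tuned from the CLI (the volume and slave names are placeholders; the sync-jobs option name is as discussed in this bug):

  # Workers: one per Master brick; the brick count is reported by volume info.
  gluster volume info mastervol | grep "Number of Bricks"

  # Raise the per-worker syncer thread count to 3
  # (with 6 bricks, full utilization is 6 x 3 = 18 threads).
  gluster volume geo-replication mastervol slavehost::slavevol config sync-jobs 3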

Comment 4 Divya 2017-06-28 10:48:37 UTC
Hi John, Bipin,

I had a meeting with Aravinda and Kotresh to discuss the doc updates for this bug.

1. Based on my discussion, I plan to update the "sync jobs" option's description to the following:

sync-jobs N - The number of sync-jobs represents the maximum number of syncer threads inside each worker. The number of workers is always equal to the number of bricks in the Master volume. For example, in a distributed-replicated volume (3 x 2), three workers will be active, and if sync-jobs is configured to 3, then the total number of syncer threads for that volume is 3 x 3 = 9, spread across the nodes and workers.

Note: Active and Passive Workers - The number of active workers is based on the volume configuration. In case of a distribute volume, all bricks (workers) will be active and participate in syncing. In case of a replicate or dispersed volume, one worker from each replicate/disperse group (subvolume) will be active and participate in syncing; this is to avoid duplicate syncing from other bricks. The remaining workers in each replicate/disperse group (subvolume) will be passive. If the active worker goes down, one of the passive workers from the same replicate/disperse group will become the active worker.
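
To make the arithmetic concrete, a quick sketch (the topology numbers are hypothetical):

  # Distributed-replicated 3 x 2 volume: 6 bricks, hence 6 workers,
  # with one Active worker per replica pair, i.e. 3 active workers.
  active_workers=3
  sync_jobs=3    # set via "config sync-jobs 3"
  echo $((active_workers * sync_jobs))   # 9 syncer threads across all nodes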

Please let me know if this is fine.

2. Regarding your comment "In general, the entire geo-replication section does not provide enough information."

Regarding this comment, Aravinda, Kotresh, and I feel that creating a KCS article to provide information on geo-replication architecture and how things work would be helpful.

If you agree, I will create a new bug to track the KCS article effort. Please share your thoughts.

Thanks!

Comment 5 Bipin Kunal 2017-06-28 11:28:13 UTC
(In reply to Divya from comment #4)
> Hi John, Bipin,
> 
> I had a meeting with Aravinda and Kotresh to discuss the doc updates for
> this bug.
> 
> 1. Based on my discussion, I plan to update the "sync jobs" option's
> description to the following:
> 
> sync-jobs N - The number of sync-jobs represents the maximum number of
> syncer threads inside each worker. The number of workers is always equal
> to the number of bricks in the Master volume. For example, in a
> distributed-replicated volume (3 x 2), three workers will be active, and if
> sync-jobs is configured to 3, then the total number of syncer threads for
> that volume is 3 x 3 = 9, spread across the nodes and workers.
> 
> Note: Active and Passive Workers - The number of active workers is based on
> the volume configuration. In case of a distribute volume, all bricks
> (workers) will be active and participate in syncing. In case of a replicate
> or dispersed volume, one worker from each replicate/disperse group
> (subvolume) will be active and participate in syncing; this is to avoid
> duplicate syncing from other bricks. The remaining workers in each
> replicate/disperse group (subvolume) will be passive. If the active worker
> goes down, one of the passive workers from the same replicate/disperse group
> will become the active worker.
> 
> Please let me know if this is fine.

Looks fine to me.

> 
> 2. Regarding your comment "In general, the entire geo-replication section
> does not provide enough information."
> 
> Regarding this comment, Aravinda, Kotresh, and I feel that creating a KCS
> article to provide information on geo-replication architecture and how
> things work would be helpful.

We are already working on creating a doc for internal consumption; we can create the KCS article out of that. In the long run, the plan is to also add feature details for each section.

But for now, creating the KCS article and maybe updating the upstream doc should be enough.

> 
> If you agree, I will create a new bug to track the KCS article effort.
> Please share your thoughts.

Sure, please let me know the Bug ID once you have it.

> 
> Thanks!

Comment 6 Bipin Kunal 2017-06-28 11:28:58 UTC
Setting needinfo for John to get his feedback.

Comment 7 John Call 2017-06-30 03:39:04 UTC
(In reply to Divya from comment #4)
> 
> sync-jobs N - The number of sync-jobs represents the maximum number of
> syncer threads inside each worker. The number of workers is always equal
> to the number of bricks in the Master volume. For example, in a
> distributed-replicated volume (3 x 2), three workers will be active, and if
> sync-jobs is configured to 3, then the total number of syncer threads for
> that volume is 3 x 3 = 9, spread across the nodes and workers.

This still seems too vague.  What is a thread?  How does this relate to rsync?  Are "threads" an rsync process, or some process creating/modifying the changelog, or something else?  The current doc describes sync-jobs as "The number of simultaneous files/directories that can be synchronized."  I understand rsync is a serial (one-file-at-a-time) operation.

I like the example and would consider a few punctuation changes, perhaps like this.

...  The number of workers is always equal to the number of bricks in the Master volume.  For example, a distributed-replicated volume of (3 x 2) with sync-jobs configured at 3 results in 9 total sync-jobs (aka threads) across all nodes/servers.

> 
> Note: Active and Passive Workers - The number of active workers is based on
> the volume configuration. In case of a distribute volume, all bricks
> (workers) will be active and participate in syncing. In case of a replicate
> or dispersed volume, one worker from each replicate/disperse group
> (subvolume) will be active and participate in syncing; this is to avoid
> duplicate syncing from other bricks. The remaining workers in each
> replicate/disperse group (subvolume) will be passive. If the active worker
> goes down, one of the passive workers from the same replicate/disperse group
> will become the active worker.

Great!

> 2. Regarding your comment "In general, the entire geo-replication section
> does not provide enough information."
> 
> Regarding this comment, Aravinda, Kotresh, and I feel that creating a KCS
> article to provide information on geo-replication architecture and how
> things work would be helpful.

Agreed, KCS is good, or a blog at http://redhatstorage.redhat.com/

Thanks for your help!

Comment 8 Divya 2017-07-03 10:06:09 UTC
(In reply to John Call from comment #7)
> (In reply to Divya from comment #4)
> > 
> > sync-jobs N - The number of sync-jobs represents the maximum number of
> > syncer threads inside each worker. The number of workers is always equal
> > to the number of bricks in the Master volume. For example, in a
> > distributed-replicated volume (3 x 2), three workers will be active, and
> > if sync-jobs is configured to 3, then the total number of syncer threads
> > for that volume is 3 x 3 = 9, spread across the nodes and workers.
> 
> This still seems too vague.  What is a thread?  How does this relate to
> rsync?  Are "threads" an rsync process, or some process creating/modifying
> the changelog, or something else?  The current doc describes sync-jobs as
> "The number of simultaneous files/directories that can be synchronized."  I
> understand rsync is a serial (one-file-at-a-time) operation.
> 
> I like the example and would consider a few punctuation changes, perhaps
> like this.
> 
> ...  The number of workers is always equal to the number of bricks in the
> Master volume.  For example, a distributed-replicated volume of (3 x 2) with
> sync-jobs configured at 3 results in 9 total sync-jobs (aka threads) across
> all nodes/servers.

John,

Thanks for your feedback. Based on your inputs, I am planning to update the description as follows:

sync-jobs N - The number of sync-jobs represents the maximum number of syncer threads (rsync processes or tar over ssh processes for syncing) inside each worker. The number of workers is always equal to the number of bricks in the Master volume. For example, a distributed-replicated volume of (3 x 2) with sync-jobs configured at 3 results in 9 total sync-jobs (aka threads) across all nodes/servers.
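
As a side note on the sync engine, a session can be switched between rsync and tar over ssh. A minimal sketch (the volume and host names are placeholders; use_tarssh is the option name documented for geo-replication):

  # Sync file data via tar over ssh instead of rsync.
  gluster volume geo-replication mastervol slavehost::slavevol config use_tarssh true

  # Revert to the default rsync-based syncing.
  gluster volume geo-replication mastervol slavehost::slavevol config use_tarssh false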

Could you confirm if this is fine?

Thanks!


Comment 9 Divya 2017-07-03 10:07:59 UTC
(In reply to Bipin Kunal from comment #5)
> (In reply to Divya from comment #4)
> > [...]
>
> Looks fine to me.

Thanks for the confirmation, Bipin!
>
> > [...]
> > If you agree, I will create a new bug to track the KCS article effort.
> > Please share your thoughts.
> 
> Sure, please let me know the Bug ID once you have it.

I have created https://bugzilla.redhat.com/show_bug.cgi?id=1467279 to track the KCS article effort.

Cheers!

Comment 10 John Call 2017-07-03 17:20:26 UTC
> Could you confirm if this is fine?

Yes, this seems fine.

Comment 11 Divya 2017-07-06 08:18:21 UTC
Thanks for the confirmation, John!

Updated the sync-jobs option's description. Link to the doc: https://access.qa.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/administration_guide/#sect-Configuring_a_Geo-replication_Session

Moving the bug ON_QA.

Comment 12 Rochelle 2017-07-18 06:30:33 UTC
Compared the information provided and discussed in comments 4 and 8 above against the document provided in comment 11.

The documentation looks good to me.

Moving this bug to verified.