Bug 1031682

Summary: dist-geo-rep: One of the nodes is taking more than 6 days to sync data when its Active replica pair went down unexpectedly.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: M S Vishwanath Bhat <vbhat>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED EOL
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Priority: high
Version: 2.1
CC: avishwan, chrisw, csaba, mzywusko
Keywords: ZStream
Hardware: x86_64
OS: Linux
Whiteboard: consistency
Doc Type: Bug Fix
Type: Bug

Description M S Vishwanath Bhat 2013-11-18 14:35:50 UTC
Description of problem:
I created a 12-node geo-rep master cluster with a 6*2 dist-rep volume and created 19+ million files on it. I then created a second 12-node slave cluster, also with a 6*2 dist-rep volume, and started geo-rep between them, using tar+ssh as the syncing method. For a reason not yet known, one of the glusterfsd processes went down unexpectedly, and the other node of that replica pair, which then became Active, is taking more than 6 days to sync. All other nodes finished syncing more than 24 hours ago.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.43rhs-1.el6rhs.x86_64

How reproducible:
I hit this in my 19+ million file setup. It is not easy to reproduce, if it is reproducible at all.

Steps to Reproduce:
1. Create a 6*2 master volume with 12 nodes.
2. Create around 20 million small files in it.
3. Create a 6*2 slave volume with 12 nodes.
4. Run geo-rep create between master and slave, but don't start the geo-rep session yet.
5. Use the config CLI to set tar+ssh as the syncing method (use-tarssh true).
6. Start the geo-rep session between master and slave.
7. Set the checkpoint to now.
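
Steps 4 to 7 can be sketched as a short Python helper that only builds the CLI invocations and does not run them. The volume names (master, slave) and the slave host (elton) are taken from the status output in this report; the push-pem argument to create and the exact spelling of "config checkpoint now" are assumptions about the RHS 2.1 CLI and are not confirmed by the steps above.

```python
def georep_commands(master_vol, slave_host, slave_vol):
    """Return the gluster CLI invocations for steps 4-7 as argument lists.

    Nothing is executed here; the caller decides how to run the commands.
    """
    session = ["gluster", "volume", "geo-replication",
               master_vol, f"{slave_host}::{slave_vol}"]
    return [
        # Step 4: create the session ("push-pem" is an assumption).
        session + ["create", "push-pem"],
        # Step 5: option name as written in the report.
        session + ["config", "use-tarssh", "true"],
        # Step 6: start the session.
        session + ["start"],
        # Step 7: set the checkpoint (spelling is an assumption).
        session + ["config", "checkpoint", "now"],
    ]


if __name__ == "__main__":
    for cmd in georep_commands("master", "elton", "slave"):
        print(" ".join(cmd))
```

Keeping the commands as argument lists makes it easy to pass each one to subprocess.run on a master node without any shell quoting.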

Actual results:

Status details on the 6th day after starting geo-rep of 20 million files.

MASTER NODE               MASTER VOL    MASTER BRICK           SLAVE             STATUS     CHECKPOINT STATUS                                                           CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Michal.blr.redhat.com     master        /rhs/bricks/brick0     elton::slave      Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-15 15:41:39    Changelog Crawl    3080427        0                0                0                  0
Garret.blr.redhat.com     master        /rhs/bricks/brick2     elbert::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 04:55:51    Changelog Crawl    4336049        0                0                0                  0
Javier.blr.redhat.com     master        /rhs/bricks/brick4     arden::slave      Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 12:22:15    Changelog Crawl    4819630        0                0                0                  0
Cruz.blr.redhat.com       master        /rhs/bricks/brick5     alvaro::slave     Passive    N/A                                                                         N/A                0              0                0                0                  0
Tim.blr.redhat.com        master        /rhs/bricks/brick1     ulysses::slave    Passive    N/A                                                                         N/A                0              0                0                0                  0
Harris.blr.redhat.com     master        /rhs/bricks/brick3     silas::slave      Passive    N/A                                                                         N/A                0              0                0                0                  0
Morgan.blr.redhat.com     master        /rhs/bricks/brick10    wilmer::slave     Passive    N/A                                                                         N/A                1785855        8192             0                0                  0
Barrett.blr.redhat.com    master        /rhs/bricks/brick6     arnoldo::slave    Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 00:47:43    Changelog Crawl    5668547        0                0                0                  0
Victor.blr.redhat.com     master        /rhs/bricks/brick9     maxwell::slave    Passive    N/A                                                                         N/A                0              0                0                0                  0
Danny.blr.redhat.com      master        /rhs/bricks/brick7     forest::slave     Passive    N/A                                                                         N/A                0              0                0                0                  0
Normand.blr.redhat.com    master        /rhs/bricks/brick8     dorsey::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 02:44:15    Changelog Crawl    5710836        0                0                0                  0
Willard.blr.redhat.com    master        /rhs/bricks/brick11    jasper::slave     Active     checkpoint as of 2013-11-12 21:23:31 is not reached yet                     Hybrid Crawl       7659513        8192             0                0                  0



If you look at the output, the worker running on Willard has not completed checkpointing yet. Initially brick10, which runs on Morgan, was the Active node. But when the glusterfsd on Morgan went down for some unknown reason, brick11 on Willard, which is the replica pair of brick10, took over and became Active. It has been 5 days since then and the data still hasn't been synced.
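
To pick out the lagging worker from a status detail listing like the one above, a minimal Python sketch follows. It assumes columns are separated by two or more spaces, as in the pasted output, and that the column order matches the header shown; parse_status_row and lagging_workers are illustrative names, not part of any gluster tooling.

```python
import re


def parse_status_row(line):
    """Split one status-detail row on runs of two or more spaces."""
    fields = re.split(r"\s{2,}", line.strip())
    return {
        "node": fields[0],
        "brick": fields[2],
        "status": fields[4],
        "checkpoint": fields[5],
        "crawl": fields[6],
        "files_syncd": int(fields[7]),
    }


def lagging_workers(lines):
    """Return Active workers whose checkpoint has not been reached."""
    rows = [parse_status_row(line) for line in lines if line.strip()]
    return [r for r in rows
            if r["status"] == "Active" and "not reached" in r["checkpoint"]]


# Three rows copied from the status output in this report.
SAMPLE = [
    "Normand.blr.redhat.com    master        /rhs/bricks/brick8     dorsey::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 02:44:15    Changelog Crawl    5710836        0                0                0                  0",
    "Morgan.blr.redhat.com     master        /rhs/bricks/brick10    wilmer::slave     Passive    N/A                                                                         N/A                1785855        8192             0                0                  0",
    "Willard.blr.redhat.com    master        /rhs/bricks/brick11    jasper::slave     Active     checkpoint as of 2013-11-12 21:23:31 is not reached yet                     Hybrid Crawl       7659513        8192             0                0                  0",
]

if __name__ == "__main__":
    for row in lagging_workers(SAMPLE):
        print(row["node"], row["brick"], row["crawl"], row["files_syncd"])
```

On the sample rows this reports only the Willard worker on brick11, which is still in Hybrid Crawl.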


The status details also indicate that one node, Michal, synced all its data in around 3 days, while the other nodes took 4 to 5 days to complete. This particular node, however, seems to be far too slow to sync.
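
The per-node figures can be checked from the checkpoint timestamps in the status output. The sketch below assumes the checkpoint was set right after the session was started (as in the steps to reproduce), so the time from checkpoint to completion approximates the sync time; days_to_complete is an illustrative helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the CHECKPOINT STATUS column of the status output.
CHECKPOINT_SET = "2013-11-12 21:23:31"
COMPLETED = {
    "Michal": "2013-11-15 15:41:39",
    "Garret": "2013-11-17 04:55:51",
    "Javier": "2013-11-17 12:22:15",
    "Barrett": "2013-11-18 00:47:43",
    "Normand": "2013-11-18 02:44:15",
}


def days_to_complete(set_at, done_at):
    """Elapsed days between setting the checkpoint and completing it."""
    delta = datetime.strptime(done_at, FMT) - datetime.strptime(set_at, FMT)
    return delta.total_seconds() / 86400


if __name__ == "__main__":
    for node, done in COMPLETED.items():
        print(f"{node}: {days_to_complete(CHECKPOINT_SET, done):.2f} days")
```

This gives about 2.8 days for Michal and between roughly 4.3 and 5.2 days for the other four completed nodes.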

Expected results:
It should not take this long to sync the data.

Additional info:


I still have the setup available for further investigation. Please let me know what other information is needed.

Comment 1 M S Vishwanath Bhat 2013-11-18 14:39:28 UTC
When I restarted the glusterfsd using volume start force, both nodes became Active. I have pasted the status detail below.


MASTER NODE               MASTER VOL    MASTER BRICK           SLAVE             STATUS     CHECKPOINT STATUS                                                           CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Michal.blr.redhat.com     master        /rhs/bricks/brick0     elton::slave      Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-15 15:41:39    Changelog Crawl    3080427        0                0                0                  0               
Javier.blr.redhat.com     master        /rhs/bricks/brick4     arden::slave      Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 12:22:15    Changelog Crawl    4819630        0                0                0                  0               
Morgan.blr.redhat.com     master        /rhs/bricks/brick10    wilmer::slave     Active     checkpoint as of 2013-11-12 21:23:31 is not reached yet                     Hybrid Crawl       1810430        8192             0                0                  0               
Harris.blr.redhat.com     master        /rhs/bricks/brick3     silas::slave      Passive    N/A                                                                         N/A                0              0                0                0                  0               
Barrett.blr.redhat.com    master        /rhs/bricks/brick6     arnoldo::slave    Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 00:47:43    Changelog Crawl    5668547        0                0                0                  0               
Victor.blr.redhat.com     master        /rhs/bricks/brick9     maxwell::slave    Passive    N/A                                                                         N/A                0              0                0                0                  0               
Garret.blr.redhat.com     master        /rhs/bricks/brick2     elbert::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 04:55:51    Changelog Crawl    4336049        0                0                0                  0               
Normand.blr.redhat.com    master        /rhs/bricks/brick8     dorsey::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 02:44:15    Changelog Crawl    5710836        0                0                0                  0               
Tim.blr.redhat.com        master        /rhs/bricks/brick1     ulysses::slave    Passive    N/A                                                                         N/A                0              0                0                0                  0               
Danny.blr.redhat.com      master        /rhs/bricks/brick7     forest::slave     Passive    N/A                                                                         N/A                0              0                0                0                  0               
Cruz.blr.redhat.com       master        /rhs/bricks/brick5     alvaro::slave     Passive    N/A                                                                         N/A                0              0                0                0                  0               
Willard.blr.redhat.com    master        /rhs/bricks/brick11    jasper::slave     Active     checkpoint as of 2013-11-12 21:23:31 is not reached yet                     Hybrid Crawl       7782393        8192             0                0                  0               



Morgan and Willard have generated 335 and 2105 xsync changelogs so far, respectively.

Comment 3 Aravinda VK 2015-11-25 08:50:26 UTC
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.

Comment 4 Aravinda VK 2015-11-25 08:51:50 UTC
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.