Bug 1031682 - dist-geo-rep: One of the nodes is taking more than 6 days to sync data when its Active replica pair went down unexpectedly.
Status: CLOSED EOL
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64 Linux
Priority: high   Severity: high
Assigned To: Bug Updates Notification Mailing List
storage-qa-internal@redhat.com
consistency
ZStream
Reported: 2013-11-18 09:35 EST by M S Vishwanath Bhat
Modified: 2016-05-31 21:57 EDT
4 users

Doc Type: Bug Fix
Type: Bug


Attachments: None
Description M S Vishwanath Bhat 2013-11-18 09:35:50 EST
Description of problem:
I created a 12-node geo-rep master cluster with a 6*2 dist-rep volume and created 19+ million files on it. I then created another 12-node slave cluster with a 6*2 dist-rep volume and started geo-rep between them, using tar+ssh as the syncing method. For a reason not yet known, one of the glusterfsd processes went down unexpectedly, and the node that then became Active is taking more than 6 days to sync. All the other nodes finished syncing more than 24 hours ago.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.43rhs-1.el6rhs.x86_64

How reproducible:
I hit this in my 19+ million file setup. It is not easy to reproduce, and it is not clear whether it is reproducible at all.

Steps to Reproduce:
1. Create a 6*2 master volume across 12 nodes.
2. Create around 20 million small files on it.
3. Create a 6*2 slave volume across 12 nodes.
4. Run geo-rep create between the master and slave, but don't start the session yet.
5. Use the config CLI to set tar+ssh as the syncing method (use-tarssh true).
6. Start the geo-rep session between the master and slave.
7. Set the checkpoint to now.
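The steps above map onto the geo-replication CLI roughly as follows. This is a sketch: the volume name `master` and the slave spec `elton::slave` are taken from the status output below, and the `push-pem` option is an assumption about how the session was created.

```shell
# Create the geo-rep session between master and slave (do not start yet)
gluster volume geo-replication master elton::slave create push-pem

# Switch the sync engine from the default rsync to tar+ssh
gluster volume geo-replication master elton::slave config use-tarssh true

# Start syncing
gluster volume geo-replication master elton::slave start

# Set a checkpoint at the current time, so status reports completion against it
gluster volume geo-replication master elton::slave config checkpoint now
```

The checkpoint is what produces the "checkpoint as of ... is completed at ..." column in the status detail output below.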

Actual results:

Status details on the 6th day after starting geo-rep of 20 million files.

MASTER NODE               MASTER VOL    MASTER BRICK           SLAVE             STATUS     CHECKPOINT STATUS                                                           CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Michal.blr.redhat.com     master        /rhs/bricks/brick0     elton::slave      Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-15 15:41:39    Changelog Crawl    3080427        0                0                0                  0
Garret.blr.redhat.com     master        /rhs/bricks/brick2     elbert::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 04:55:51    Changelog Crawl    4336049        0                0                0                  0
Javier.blr.redhat.com     master        /rhs/bricks/brick4     arden::slave      Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 12:22:15    Changelog Crawl    4819630        0                0                0                  0
Cruz.blr.redhat.com       master        /rhs/bricks/brick5     alvaro::slave     Passive    N/A                                                                         N/A                0              0                0                0                  0
Tim.blr.redhat.com        master        /rhs/bricks/brick1     ulysses::slave    Passive    N/A                                                                         N/A                0              0                0                0                  0
Harris.blr.redhat.com     master        /rhs/bricks/brick3     silas::slave      Passive    N/A                                                                         N/A                0              0                0                0                  0
Morgan.blr.redhat.com     master        /rhs/bricks/brick10    wilmer::slave     Passive    N/A                                                                         N/A                1785855        8192             0                0                  0
Barrett.blr.redhat.com    master        /rhs/bricks/brick6     arnoldo::slave    Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 00:47:43    Changelog Crawl    5668547        0                0                0                  0
Victor.blr.redhat.com     master        /rhs/bricks/brick9     maxwell::slave    Passive    N/A                                                                         N/A                0              0                0                0                  0
Danny.blr.redhat.com      master        /rhs/bricks/brick7     forest::slave     Passive    N/A                                                                         N/A                0              0                0                0                  0
Normand.blr.redhat.com    master        /rhs/bricks/brick8     dorsey::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 02:44:15    Changelog Crawl    5710836        0                0                0                  0
Willard.blr.redhat.com    master        /rhs/bricks/brick11    jasper::slave     Active     checkpoint as of 2013-11-12 21:23:31 is not reached yet                     Hybrid Crawl       7659513        8192             0                0                  0



Observe that one node, Willard, has not completed the checkpoint yet. Initially brick10, running in Morgan, was the Active node. But when the glusterfsd in Morgan went down for some unknown reason, brick11 in Willard, which is the replica pair of brick10, took over and became Active. It has been 5 days since, and the data still hasn't finished syncing.


The status details also show that one node, Michal, synced all its data in around 3 days, while the other nodes took 4 to 5 days to complete. This particular node seems to be far too slow to sync.

Expected results:
The failed-over node should complete syncing in a time comparable to the other nodes; it should not take this long.

Additional info:


I still have the setup ready for further investigation. Please let me know what other information is needed.
Comment 1 M S Vishwanath Bhat 2013-11-18 09:39:28 EST
When I restarted the glusterfsd using volume start force, both nodes became Active. I have pasted the status detail below.
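For reference, the restart described above would look roughly like this (volume name `master` assumed from the status output):

```shell
# Force-start the volume so the downed brick process (glusterfsd) is respawned
gluster volume start master force

# Re-check the session state afterwards
gluster volume geo-replication master status detail
```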


MASTER NODE               MASTER VOL    MASTER BRICK           SLAVE             STATUS     CHECKPOINT STATUS                                                           CRAWL STATUS       FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING    FILES SKIPPED   
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Michal.blr.redhat.com     master        /rhs/bricks/brick0     elton::slave      Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-15 15:41:39    Changelog Crawl    3080427        0                0                0                  0               
Javier.blr.redhat.com     master        /rhs/bricks/brick4     arden::slave      Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 12:22:15    Changelog Crawl    4819630        0                0                0                  0               
Morgan.blr.redhat.com     master        /rhs/bricks/brick10    wilmer::slave     Active     checkpoint as of 2013-11-12 21:23:31 is not reached yet                     Hybrid Crawl       1810430        8192             0                0                  0               
Harris.blr.redhat.com     master        /rhs/bricks/brick3     silas::slave      Passive    N/A                                                                         N/A                0              0                0                0                  0               
Barrett.blr.redhat.com    master        /rhs/bricks/brick6     arnoldo::slave    Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 00:47:43    Changelog Crawl    5668547        0                0                0                  0               
Victor.blr.redhat.com     master        /rhs/bricks/brick9     maxwell::slave    Passive    N/A                                                                         N/A                0              0                0                0                  0               
Garret.blr.redhat.com     master        /rhs/bricks/brick2     elbert::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 04:55:51    Changelog Crawl    4336049        0                0                0                  0               
Normand.blr.redhat.com    master        /rhs/bricks/brick8     dorsey::slave     Active     checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 02:44:15    Changelog Crawl    5710836        0                0                0                  0               
Tim.blr.redhat.com        master        /rhs/bricks/brick1     ulysses::slave    Passive    N/A                                                                         N/A                0              0                0                0                  0               
Danny.blr.redhat.com      master        /rhs/bricks/brick7     forest::slave     Passive    N/A                                                                         N/A                0              0                0                0                  0               
Cruz.blr.redhat.com       master        /rhs/bricks/brick5     alvaro::slave     Passive    N/A                                                                         N/A                0              0                0                0                  0               
Willard.blr.redhat.com    master        /rhs/bricks/brick11    jasper::slave     Active     checkpoint as of 2013-11-12 21:23:31 is not reached yet                     Hybrid Crawl       7782393        8192             0                0                  0               



Morgan and Willard have 335 and 2105 xsync changelogs generated so far, respectively.
Comment 3 Aravinda VK 2015-11-25 03:50:26 EST
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please re-open this issue if it is seen again.
Comment 4 Aravinda VK 2015-11-25 03:51:50 EST
Closing this bug since the RHGS 2.1 release has reached EOL. Required bugs have been cloned to RHGS 3.1. Please re-open this issue if it is seen again.
