Description of problem:
I created a 12-node geo-rep master cluster with a 6*2 distributed-replicate volume and created 19+ million files on it. I then created a 12-node slave cluster, also with a 6*2 distributed-replicate volume, and started geo-rep between them, using tar+ssh as the syncing method. For some as-yet-unknown reason, one of the glusterfsd processes went down unexpectedly, and the node that became Active in its place has been taking more than 6 days to sync. All the other nodes finished syncing more than 24 hours ago.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.43rhs-1.el6rhs.x86_64

How reproducible:
I hit this on my 19+ million file setup. It is not easy to reproduce, if it is reproducible at all.

Steps to Reproduce:
1. Create a 6*2 master volume with 12 nodes.
2. Create around 20 million small files in it.
3. Create a 6*2 slave volume with 12 nodes.
4. Run geo-rep create between the master and the slave, but don't start the geo-rep session yet.
5. Use the config CLI to select tar+ssh as the syncing method: use-tarssh true.
6. Start the geo-rep session between the master and the slave.
7. Set the checkpoint to now.

Actual results:
Status detail on the 6th day after starting geo-rep of the 20 million files:
MASTER NODE             MASTER VOL  MASTER BRICK         SLAVE           STATUS   CHECKPOINT STATUS                                                         CRAWL STATUS     FILES SYNCD  FILES PENDING  BYTES PENDING  DELETES PENDING  FILES SKIPPED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Michal.blr.redhat.com   master      /rhs/bricks/brick0   elton::slave    Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-15 15:41:39  Changelog Crawl  3080427      0              0              0                0
Garret.blr.redhat.com   master      /rhs/bricks/brick2   elbert::slave   Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 04:55:51  Changelog Crawl  4336049      0              0              0                0
Javier.blr.redhat.com   master      /rhs/bricks/brick4   arden::slave    Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 12:22:15  Changelog Crawl  4819630      0              0              0                0
Cruz.blr.redhat.com     master      /rhs/bricks/brick5   alvaro::slave   Passive  N/A                                                                       N/A              0            0              0              0                0
Tim.blr.redhat.com      master      /rhs/bricks/brick1   ulysses::slave  Passive  N/A                                                                       N/A              0            0              0              0                0
Harris.blr.redhat.com   master      /rhs/bricks/brick3   silas::slave    Passive  N/A                                                                       N/A              0            0              0              0                0
Morgan.blr.redhat.com   master      /rhs/bricks/brick10  wilmer::slave   Passive  N/A                                                                       N/A              1785855      8192           0              0                0
Barrett.blr.redhat.com  master      /rhs/bricks/brick6   arnoldo::slave  Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 00:47:43  Changelog Crawl  5668547      0              0              0                0
Victor.blr.redhat.com   master      /rhs/bricks/brick9   maxwell::slave  Passive  N/A                                                                       N/A              0            0              0              0                0
Danny.blr.redhat.com    master      /rhs/bricks/brick7   forest::slave   Passive  N/A                                                                       N/A              0            0              0              0                0
Normand.blr.redhat.com  master      /rhs/bricks/brick8   dorsey::slave   Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 02:44:15  Changelog Crawl  5710836      0              0              0                0
Willard.blr.redhat.com  master      /rhs/bricks/brick11  jasper::slave   Active   checkpoint as of 2013-11-12 21:23:31 is not reached yet                   Hybrid Crawl     7659513      8192           0              0                0

If you observe, the node Willard has not completed its checkpoint yet. Initially brick10, running on Morgan, was the Active node, but when the glusterfsd on Morgan went down for some unknown reason, brick11 on Willard (the replica pair of brick10) took over and became Active. It has been 5 days since then, and the data still hasn't been synced. The status detail also shows that one of the nodes, Michal, synced all of its data in around 3 days, while the other nodes took 4 to 5 days to complete; this particular node, however, seems far too slow to sync.

Expected results:
It should not take this long to sync the data.

Additional info:
I still have the setup ready for further investigation. Please let me know what other information is needed.
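For reference, the reproduction steps above correspond roughly to the CLI sequence below. This is a sketch, not a transcript from this setup: the brick ordering, slave hostname, and the exact bricks listed are illustrative assumptions.

```shell
# On the master cluster (12 peers already probed): create and start the
# 6x2 distributed-replicate volume. With "replica 2", consecutive bricks
# in the argument list form replica pairs. Bricks abbreviated for brevity.
gluster volume create master replica 2 \
    Michal:/rhs/bricks/brick0 Tim:/rhs/bricks/brick1 \
    Garret:/rhs/bricks/brick2 Harris:/rhs/bricks/brick3 \
    # ... remaining 8 bricks, one per node ...
gluster volume start master

# ... create ~20 million small files on a client mount of "master" ...

# On the slave cluster: create and start the 6x2 "slave" volume the same
# way, then set up the geo-rep session from a master node. Do NOT start it
# yet; first switch the sync engine to tar+ssh (option name as given in
# this report).
gluster volume geo-replication master elton::slave create push-pem
gluster volume geo-replication master elton::slave config use-tarssh true

# Start the session and set a checkpoint at the current time.
gluster volume geo-replication master elton::slave start
gluster volume geo-replication master elton::slave config checkpoint now

# Monitor per-brick progress (the output shown in this report).
gluster volume geo-replication master elton::slave status detail
```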
When I restarted the glusterfsd using volume start force, both of the nodes became Active. Status detail below:

MASTER NODE             MASTER VOL  MASTER BRICK         SLAVE           STATUS   CHECKPOINT STATUS                                                         CRAWL STATUS     FILES SYNCD  FILES PENDING  BYTES PENDING  DELETES PENDING  FILES SKIPPED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Michal.blr.redhat.com   master      /rhs/bricks/brick0   elton::slave    Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-15 15:41:39  Changelog Crawl  3080427      0              0              0                0
Javier.blr.redhat.com   master      /rhs/bricks/brick4   arden::slave    Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 12:22:15  Changelog Crawl  4819630      0              0              0                0
Morgan.blr.redhat.com   master      /rhs/bricks/brick10  wilmer::slave   Active   checkpoint as of 2013-11-12 21:23:31 is not reached yet                   Hybrid Crawl     1810430      8192           0              0                0
Harris.blr.redhat.com   master      /rhs/bricks/brick3   silas::slave    Passive  N/A                                                                       N/A              0            0              0              0                0
Barrett.blr.redhat.com  master      /rhs/bricks/brick6   arnoldo::slave  Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 00:47:43  Changelog Crawl  5668547      0              0              0                0
Victor.blr.redhat.com   master      /rhs/bricks/brick9   maxwell::slave  Passive  N/A                                                                       N/A              0            0              0              0                0
Garret.blr.redhat.com   master      /rhs/bricks/brick2   elbert::slave   Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-17 04:55:51  Changelog Crawl  4336049      0              0              0                0
Normand.blr.redhat.com  master      /rhs/bricks/brick8   dorsey::slave   Active   checkpoint as of 2013-11-12 21:23:31 is completed at 2013-11-18 02:44:15  Changelog Crawl  5710836      0              0              0                0
Tim.blr.redhat.com      master      /rhs/bricks/brick1   ulysses::slave  Passive  N/A                                                                       N/A              0            0              0              0                0
Danny.blr.redhat.com    master      /rhs/bricks/brick7   forest::slave   Passive  N/A                                                                       N/A              0            0              0              0                0
Cruz.blr.redhat.com     master      /rhs/bricks/brick5   alvaro::slave   Passive  N/A                                                                       N/A              0            0              0              0                0
Willard.blr.redhat.com  master      /rhs/bricks/brick11  jasper::slave   Active   checkpoint as of 2013-11-12 21:23:31 is not reached yet                   Hybrid Crawl     7782393      8192           0              0                0

Morgan and Willard have generated 335 and 2105 xsync changelogs so far, respectively.
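The restart mentioned above is a forced start of the master volume, which respawns only the brick processes that have died. A sketch of that step, assuming the volume name from this report:

```shell
# Respawn any dead glusterfsd brick processes without disturbing live ones.
gluster volume start master force

# Confirm all brick processes are back, then recheck geo-rep progress.
gluster volume status master
gluster volume geo-replication master status detail
```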
Closing this bug since the RHGS 2.1 release has reached EOL. The required bugs have been cloned to RHGS 3.1. Please reopen this issue if it is seen again.