Bug 1177527 - Geo-Replication: many files are missing in slave volume
Summary: Geo-Replication: many files are missing in slave volume
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Aravinda VK
QA Contact:
URL:
Whiteboard:
Depends On: 1174659
Blocks:
 
Reported: 2014-12-28 15:12 UTC by Aravinda VK
Modified: 2015-05-14 17:35 UTC
CC: 15 users

Fixed In Version: glusterfs-3.7.0beta1
Clone Of: 1174659
Environment:
Last Closed: 2015-05-14 17:26:19 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Aravinda VK 2014-12-28 15:12:30 UTC
+++ This bug was initially created as a clone of Bug #1174659 +++

Description of problem:
The slave aux-mount returns Input/output errors, and the geo-rep worker goes to the faulty state.

Popen: ssh> tar: .gfid/1b15499f-2ad0-4a45-9429-06281b72c111: Cannot open: Input/output error

The message above indicates that 'tar' failed to read certain files on the slave. Because it appears on the 'ssh' stdout, it is thrown back at the master as a lost connection, followed by a restart attempt:

~~~
finalize] <top>: exiting.
set_state] Monitor: new state: faulty
set_state] Monitor: new state: Initializing...
~~~

Geo-rep has eventually reached a stable state, but it has failed to copy a large number of files.

--- Additional comment from Aravinda VK on 2014-12-17 05:43:47 EST ---

The tar+ssh mode has no logic to retry and then skip a file that fails to sync, whereas rsync mode retries and skips files it is unable to sync. That explains why 4500+ files are missing on the slave. The geo-rep worker is stuck processing the files for which it gets I/O errors, and the other, good files from that brick simply queue up behind them.

If we fix the split-brain issue, the geo-rep worker will process those files and also sync all the files that are queued up.

As a workaround, we can switch off tarssh; once rsync mode has processed all the files (skipping the problematic ones), we can switch back to tarssh, as shown below.

gluster volume geo-replication <MASTER> <SLAVEHOST>::<SLAVEVOL> config use_tarssh false
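
A rough sketch of the rest of the workaround sequence (volume and host names are placeholders, as above): check the geo-rep status to confirm the backlog has drained, then switch back to tar+ssh.

gluster volume geo-replication <MASTER> <SLAVEHOST>::<SLAVEVOL> status detail
gluster volume geo-replication <MASTER> <SLAVEHOST>::<SLAVEVOL> config use_tarssh true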

Comment 1 Anand Avati 2014-12-28 15:26:11 UTC
REVIEW: http://review.gluster.org/9356 (geo-rep: Error handling in tar+ssh mode) posted (#1) for review on master by Aravinda VK (avishwan)

Comment 2 Anand Avati 2014-12-29 17:11:24 UTC
COMMIT: http://review.gluster.org/9356 committed in master by Venky Shankar (vshankar) 
------
commit c399cec72b9985f120a1495e93e1a380911547d9
Author: Aravinda VK <avishwan>
Date:   Fri Dec 26 19:12:22 2014 +0530

    geo-rep: Error handling in tar+ssh mode
    
    Geo-rep raises an exception if tar+ssh fails, and the worker
    dies due to that exception.
    
    This patch adds resilience to tar+ssh errors: the geo-rep
    worker retries on error and skips the affected changelogs
    after the maximum number of retries (same as rsync mode).
    
    Removed the warning messages logged for each rsync/tar+ssh
    failure per GFID, since the skipped list is populated after
    the maximum retries and the retried changelog files are also
    logged, which makes a per-GFID warning message redundant.
    
    BUG: 1177527
    Change-Id: I3019c5c1ada7fc0822e4b14831512d283755b1ea
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/9356
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Kotresh HR <khiremat>
    Reviewed-by: Venky Shankar <vshankar>
    Tested-by: Venky Shankar <vshankar>
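
To illustrate the retry-and-skip behaviour the commit describes, here is a minimal, self-contained sketch; sync_changelog, SyncError and MAX_RETRIES are hypothetical stand-ins, not the actual geo-rep internals:

~~~
MAX_RETRIES = 10  # hypothetical limit; the real value comes from geo-rep internals

class SyncError(Exception):
    """Stand-in for a sync failure (e.g. tar+ssh hitting an I/O error)."""

def sync_changelog(changelog):
    """Stand-in for syncing one changelog; here it always fails, to show skipping."""
    raise SyncError(changelog)

def process_changelogs(changelogs):
    """Retry each changelog on error; skip it after MAX_RETRIES instead of dying."""
    skipped = []
    for changelog in changelogs:
        for _attempt in range(MAX_RETRIES):
            try:
                sync_changelog(changelog)
                break               # synced, move to the next changelog
            except SyncError:
                continue            # retry instead of raising and killing the worker
        else:
            skipped.append(changelog)  # retries exhausted: record and skip
    return skipped

print(process_changelogs(["CHANGELOG.1", "CHANGELOG.2"]))  # -> both end up skipped
~~~

The point is that a changelog only lands in the skipped list once every retry has failed, so a persistent I/O error on a few files no longer takes the whole worker down.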

Comment 3 Niels de Vos 2015-05-14 17:26:19 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


