Bug 1177527

Summary: Geo-Replication: many files are missing in slave volume
Product: [Community] GlusterFS
Component: geo-replication
Reporter: Aravinda VK <avishwan>
Assignee: Aravinda VK <avishwan>
Status: CLOSED CURRENTRELEASE
QA Contact:
Docs Contact:
Severity: high
Priority: high
Version: mainline
CC: aavati, avishwan, bhubbard, bugs, csaba, cww, dblack, gluster-bugs, khiremat, nlevinki, nsathyan, storage-qa-internal, vgaikwad, vnosov, vumrao
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.0beta1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1174659
Environment:
Last Closed: 2015-05-14 17:26:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1174659
Bug Blocks:

Description Aravinda VK 2014-12-28 15:12:30 UTC
+++ This bug was initially created as a clone of Bug #1174659 +++

Description of problem:
Input/output errors on the slave aux-mount cause the geo-rep worker to go faulty.

Popen: ssh> tar: .gfid/1b15499f-2ad0-4a45-9429-06281b72c111: Cannot open: Input/output error

The message above indicates that 'tar' failed to read certain files on the slave. Since it appears on the 'ssh' stdout, it is reported back to the master as a lost connection, followed by a restart attempt:

~~~
finalize] <top>: exiting.
set_state] Monitor: new state: faulty
set_state] Monitor: new state: Initializing...
~~~

Geo-rep eventually reaches a stable state, but it has failed to copy a large number of files.

--- Additional comment from Aravinda VK on 2014-12-17 05:43:47 EST ---

The tar+ssh mode has no logic to retry and then skip a file that fails to sync, whereas rsync mode retries and skips files it is unable to sync. That explains why 4500+ files are missing on the slave. The geo-rep worker is stuck processing the files that return I/O errors, while the remaining good files from that brick simply stay queued up.

If we fix the split-brain issue, the geo-rep worker will process those files and also sync all the files that are queued up.

As a workaround, we can switch off tar+ssh; once geo-rep has processed all the files (skipping the problematic ones), we can switch back to tar+ssh.

gluster volume geo-replication <MASTER> <SLAVEHOST>::<SLAVEVOL> config use_tarssh false
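
Presumably, once geo-rep has worked through the backlog in rsync mode, tar+ssh can be switched back on by setting the same option to true (use_tarssh is a boolean config option, so this is simply the reverse of the command above):

gluster volume geo-replication <MASTER> <SLAVEHOST>::<SLAVEVOL> config use_tarssh true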

Comment 1 Anand Avati 2014-12-28 15:26:11 UTC
REVIEW: http://review.gluster.org/9356 (geo-rep: Error handling in tar+ssh mode) posted (#1) for review on master by Aravinda VK (avishwan)

Comment 2 Anand Avati 2014-12-29 17:11:24 UTC
COMMIT: http://review.gluster.org/9356 committed in master by Venky Shankar (vshankar) 
------
commit c399cec72b9985f120a1495e93e1a380911547d9
Author: Aravinda VK <avishwan>
Date:   Fri Dec 26 19:12:22 2014 +0530

    geo-rep: Error handling in tar+ssh mode
    
    Geo-rep raises an exception if tar+ssh fails, and the worker
    dies due to that exception.
    
    This patch adds resilience to tar+ssh errors: the geo-rep
    worker retries on error and skips the affected changelogs
    after the maximum number of retries (same as rsync mode).
    
    Removed the warning message logged for each rsync/tar+ssh
    failure per GFID, since the skipped list is populated after
    the maximum retries and the retried changelog files are also
    logged, making the per-GFID warning redundant.
    
    BUG: 1177527
    Change-Id: I3019c5c1ada7fc0822e4b14831512d283755b1ea
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/9356
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Kotresh HR <khiremat>
    Reviewed-by: Venky Shankar <vshankar>
    Tested-by: Venky Shankar <vshankar>
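
For context, the behaviour the commit message describes (retry on error, then skip after a maximum number of retries, as rsync mode already does) follows a common pattern. Below is a minimal, hypothetical Python sketch of that pattern; the names (sync_changelogs, sync_one, MAX_RETRIES) are illustrative and are not the actual gsyncd code.

~~~
# Hypothetical sketch of the retry-then-skip pattern described in the
# commit message above; not the actual gsyncd implementation.

MAX_RETRIES = 10  # assumed limit, standing in for geo-rep's retry count


def sync_changelogs(changelogs, sync_one):
    """Try to sync each changelog; retry on failure and skip it after
    MAX_RETRIES instead of letting the worker die on the exception."""
    skipped = []
    for changelog in changelogs:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                sync_one(changelog)   # e.g. one tar+ssh or rsync invocation
                break                 # synced successfully, move on
            except Exception as err:
                # Retry instead of propagating the exception, which
                # previously killed the tar+ssh worker.
                print("retry %d for %s: %s" % (attempt, changelog, err))
        else:
            # All retries exhausted: record the changelog as skipped so
            # one bad file cannot block the rest of the queue.
            skipped.append(changelog)
    return skipped
~~~

The point of the fix is the same: a persistent I/O error on one file no longer kills the worker or blocks the queue; after the maximum retries the entry is skipped and recorded, matching rsync mode.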

Comment 3 Niels de Vos 2015-05-14 17:26:19 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
