+++ This bug was initially created as a clone of Bug #1174659 +++

Description of problem:

Input/output error in slave aux-mount, and geo-rep worker going to faulty.

Popen: ssh> tar: .gfid/1b15499f-2ad0-4a45-9429-06281b72c111: Cannot open: Input/output error

^^ This indicates that 'tar' failed to read certain files on the slave. It is the message on the 'ssh' stdout, which is thrown back at the master as a lost connection, followed by a restart attempt:

~~~
finalize] <top>: exiting.
set_state] Monitor: new state: faulty
set_state] Monitor: new state: Initializing...
~~~

Geo-rep has somehow reached stability, but it has failed to copy a lot of files.

--- Additional comment from Aravinda VK on 2014-12-17 05:43:47 EST ---

tar+ssh mode does not have the intelligence to retry and then skip a file that fails to sync, whereas rsync mode retries and skips any file it is unable to sync. That explains why 4500+ files are missing on the slave. The geo-rep worker is stuck processing the files for which it gets an I/O error, and the other good files from that brick are just queued up behind them. If we fix the split-brain issue, the geo-rep worker will process those files and also sync all the files that are queued up.

As a workaround, we can switch off tarssh; once geo-rep has processed all the files, skipping the problematic ones, we can switch back to tarssh.

gluster volume geo-replication <MASTER> <SLAVEHOST>::<SLAVEVOL> config use_tarssh false
REVIEW: http://review.gluster.org/9356 (geo-rep: Error handling in tar+ssh mode) posted (#1) for review on master by Aravinda VK (avishwan)
COMMIT: http://review.gluster.org/9356 committed in master by Venky Shankar (vshankar)
------
commit c399cec72b9985f120a1495e93e1a380911547d9
Author: Aravinda VK <avishwan>
Date: Fri Dec 26 19:12:22 2014 +0530

geo-rep: Error handling in tar+ssh mode

Geo-rep raises an exception if tar+ssh fails, and the worker dies due to the exception. This patch adds resilience to tar+ssh errors: the geo-rep worker retries on error and skips those changelogs after the maximum number of retries (same as rsync mode).

Removed the warning messages for each rsync/tar+ssh failure per GFID, since the skipped list will be populated after the maximum number of retries. A log of the retried changelog files is also available, hence a warning message for each GFID is redundant.

BUG: 1177527
Change-Id: I3019c5c1ada7fc0822e4b14831512d283755b1ea
Signed-off-by: Aravinda VK <avishwan>
Reviewed-on: http://review.gluster.org/9356
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Kotresh HR <khiremat>
Reviewed-by: Venky Shankar <vshankar>
Tested-by: Venky Shankar <vshankar>
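The retry-then-skip behaviour described in the commit message above can be sketched roughly as follows. This is only an illustration of the general pattern, not the code from the patch: the function name `sync_with_retries`, the `sync_batch` callable, and the `MAX_RETRIES` value are all hypothetical, and the items are treated as opaque identifiers (they could stand for changelogs or GFIDs).

```python
from typing import Callable, Iterable, List, Set

MAX_RETRIES = 10  # hypothetical limit; not taken from the geo-rep source


def sync_with_retries(
    items: Iterable[str],
    sync_batch: Callable[[List[str]], Set[str]],
    max_retries: int = MAX_RETRIES,
) -> List[str]:
    """Retry failed items up to max_retries times, then skip them.

    sync_batch is a hypothetical callable that attempts to sync the
    given items and returns the subset that failed. The return value
    is the sorted list of items that were skipped.
    """
    pending = list(items)
    for _ in range(max_retries):
        if not pending:
            break
        failed = sync_batch(pending)
        # Only the failures are retried; successful items are done.
        pending = [item for item in pending if item in failed]
    # Anything still pending after the last attempt is skipped instead
    # of raising, so the worker can move on to the queued-up items.
    return sorted(pending)
```

The point of returning a skipped list rather than raising is that a single unreadable file no longer kills the worker; the caller can log the skipped items once, which matches the commit's reasoning for dropping the per-GFID warnings.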
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user