Bug 1313623

Summary: [georep+disperse]: Geo-Rep session went to faulty with errors "[Errno 5] Input/output error"
Product: [Community] GlusterFS Reporter: Kotresh HR <khiremat>
Component: disperseAssignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.7.8CC: aspandey, asrivast, avishwan, bugs, byarlaga, khiremat, rhinduja, rmekala, sankarshan, smohan
Target Milestone: ---Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.7.9 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1296496 Environment:
Last Closed: 2016-04-19 06:58:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1294062, 1296496    
Bug Blocks:    

Description Kotresh HR 2016-03-02 05:07:48 UTC
+++ This bug was initially created as a clone of Bug #1296496 +++

+++ This bug was initially created as a clone of Bug #1294062 +++

Description of problem:
=======================

Georep session went to faulty with following errors in geo-rep logs:

[2015-12-24 10:57:45.463694] E [syncdutils(/rhs/brick2/ct-8):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 662, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1439, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 586, in crawlwrap
    '.', '.'.join([str(self.uuid), str(gconf.slave_id)]))
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 323, in ff
    return f(*a)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 489, in stime_mnt
    8)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 55, in lgetxattr
    return cls._query_xattr(path, siz, 'lgetxattr', attr)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 47, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error


getfattr on slave mount logs: Input/Output error

[root@dhcp37-133 ~]# getfattr -d -m . -e hex /mnt/test/
getfattr: Removing leading '/' from absolute path names
# file: mnt/test/
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
/mnt/test/: trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime: Input/output error
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.xtime=0x3000
trusted.tier.tier-dht=0x000000010000000000000000ffffffff
trusted.tier.tier-dht.commithash=0x3330313736383334313800

[root@dhcp37-133 ~]# 

[root@dhcp37-133 ~]# mount | grep test
10.70.37.165:/tiervolume on /mnt/test type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@dhcp37-133 ~]# 


Client log snippet:
===================

# less /var/log/glusterfs/mnt-test.log 

[2015-12-24 10:28:50.791227] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-tiervolume-disperse-1: Heal failed [Input/output error]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-tiervolume-disperse-1: Heal failed [Input/output error]" repeated 6 times between [2015-12-24 10:28:50.791227] and [2015-12-24 10:28:51.062863]
[2015-12-24 10:39:42.715503] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-tiervolume-disperse-1: Mismatching xdata in answers of 'LOOKUP'
[2015-12-24 10:39:42.718869] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-tiervolume-disperse-1: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2015-12-24 10:39:42.727887] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-tiervolume-disperse-1: Heal failed [Input/output error]
[2015-12-24 10:39:42.919641] N [MSGID: 122031] [ec-generic.c:1133:ec_combine_xattrop] 0-tiervolume-disperse-1: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
[2015-12-24 10:39:42.919750] W [MSGID: 122040] [ec-common.c:907:ec_prepare_update_cbk] 0-tiervolume-disperse-1: Failed to get size and version [Input/output error]
[2015-12-24 10:39:42.926486] W [fuse-bridge.c:3355:fuse_xattr_cbk] 0-glusterfs-fuse: 15: GETXATTR(trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime) / => -1 (Input/output error)
[2015-12-24 10:39:42.925954] N [MSGID: 122031] [ec-generic.c:1133:ec_combine_xattrop] 0-tiervolume-disperse-1: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
[2015-12-24 10:39:42.926445] W [MSGID: 122040] [ec-common.c:907:ec_prepare_update_cbk] 0-tiervolume-disperse-1: Failed to get size and version [Input/output error]
[2015-12-24 10:58:58.908160] W [MSGID: 122056] [ec-combine.c:866:ec_combine_check] 0-tiervolume-disperse-1: Mismatching xdata in answers of 'LOOKUP'
[2015-12-24 10:58:58.909422] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-tiervolume-disperse-1: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3E, bad=1)
[2015-12-24 10:58:58.918637] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-tiervolume-disperse-1: Heal failed [Input/output error]
[2015-12-24 10:58:58.922502] N [MSGID: 122031] [ec-generic.c:1133:ec_combine_xattrop] 0-tiervolume-disperse-1: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
[2015-12-24 10:58:58.924043] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-tiervolume-disperse-1: Operation failed on some subvolumes (up=3F, mask=3E, remaining=0, good=3E, bad=1)
The message "N [MSGID: 122031] [ec-generic.c:1133:ec_combine_xattrop] 0-tiervolume-disperse-1: Mismatching dictionary in answers of 'GF_FOP_XATTROP'" repeated 2 times between [2015-12-24 10:58:58.922502] and [2015-12-24 10:58:58.972485]
[2015-12-24 10:58:58.973055] W [MSGID: 122040] [ec-common.c:907:ec_prepare_update_cbk] 0-tiervolume-disperse-1: Failed to get size and version [Input/output error]
[2015-12-24 10:58:58.973187] W [fuse-bridge.c:3355:fuse_xattr_cbk] 0-glusterfs-fuse: 19: GETXATTR(trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime) / => -1 (Input/output error)
[2015-12-24 10:58:58.989738] N [MSGID: 122031] [ec-generic.c:1133:ec_combine_xattrop] 0-tiervolume-disperse-1: Mismatching dictionary in answers of 'GF_FOP_XATTROP'


Following are the ec.version for disperse subvolume 1:
======================================================

[root@dhcp37-165 ~]# getfattr -d -e hex -m . /rhs/brick2/ct-7/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/ct-7/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.ec.version=0x00000000000000000000000000000011
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime=0x567bc34f00000000
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.xtime=0x567bc348000b7704
trusted.glusterfs.dht=0x00000001000000007ffa1668ffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.size.1=0x000000000000000000000000000000000000000000000003
trusted.glusterfs.volume-id=0x1dd755248cc94d939b14518021c8df3f
trusted.tier.tier-dht=0x000000010000000000000000ba2cafe3
trusted.tier.tier-dht.commithash=0x3330313736373533323400

[root@dhcp37-165 ~]# 



[root@dhcp37-133 ~]# getfattr -d -e hex -m . /rhs/brick2/ct-8/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/ct-8/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.ec.dirty=0x0000000000000000000000000000000e
trusted.ec.version=0x00000000000000000000000000000020
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime=0x567bc34c00000000
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.xtime=0x567bc348000b781e
trusted.glusterfs.dht=0x00000001000000007ffa1668ffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.size.1=0x000000000000000000000000000000000000000000000003
trusted.glusterfs.volume-id=0x1dd755248cc94d939b14518021c8df3f
trusted.tier.tier-dht=0x000000010000000000000000ffffffff
trusted.tier.tier-dht.commithash=0x3330313736383334313800

[root@dhcp37-133 ~]#


[root@dhcp37-160 ~]# getfattr -d -e hex -m . /rhs/brick2/ct-9/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/ct-9/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.ec.dirty=0x0000000000000000000000000000000e
trusted.ec.version=0x00000000000000000000000000000020
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime=0x567bc34b00000000
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.xtime=0x567bc348000b780d
trusted.glusterfs.dht=0x00000001000000007ffa1668ffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.size.1=0x000000000000000000000000000000000000000000000003
trusted.glusterfs.volume-id=0x1dd755248cc94d939b14518021c8df3f
trusted.tier.tier-dht=0x000000010000000000000000ffffffff
trusted.tier.tier-dht.commithash=0x3330313736383334313800

[root@dhcp37-160 ~]# 



[root@dhcp37-158 ~]# getfattr -d -e hex -m . /rhs/brick2/ct-10/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/ct-10/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.ec.dirty=0x0000000000000000000000000000000e
trusted.ec.version=0x00000000000000000000000000000020
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime=0x567bc34f00000000
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.xtime=0x567bc348000b7c51
trusted.glusterfs.dht=0x00000001000000007ffa1668ffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.size.1=0x000000000000000000000000000000000000000000000003
trusted.glusterfs.volume-id=0x1dd755248cc94d939b14518021c8df3f
trusted.tier.tier-dht=0x000000010000000000000000ffffffff
trusted.tier.tier-dht.commithash=0x3330313736383334313800

[root@dhcp37-158 ~]# 


[root@dhcp37-110 ~]# getfattr -d -e hex -m . /rhs/brick2/ct-11/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/ct-11/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.ec.dirty=0x0000000000000000000000000000000e
trusted.ec.version=0x00000000000000000000000000000020
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime=0x567bc34b00000000
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.xtime=0x567bc348000b7714
trusted.glusterfs.dht=0x00000001000000007ffa1668ffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.size.1=0x000000000000000000000000000000000000000000000003
trusted.glusterfs.volume-id=0x1dd755248cc94d939b14518021c8df3f
trusted.tier.tier-dht=0x000000010000000000000000ffffffff
trusted.tier.tier-dht.commithash=0x3330313736383334313800

[root@dhcp37-110 ~]# 


[root@dhcp37-155 ~]# getfattr -d -e hex -m . /rhs/brick2/ct-12/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/ct-12/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.ec.dirty=0x0000000000000000000000000000000e
trusted.ec.version=0x00000000000000000000000000000020
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.5fc0f095-23c4-4b96-9d94-69decd14f1d4.stime=0x567bc34f00000000
trusted.glusterfs.1dd75524-8cc9-4d93-9b14-518021c8df3f.xtime=0x567bc348000b7989
trusted.glusterfs.dht=0x00000001000000007ffa1668ffffffff
trusted.glusterfs.quota.dirty=0x3000
trusted.glusterfs.quota.size.1=0x000000000000000000000000000000000000000000000003
trusted.glusterfs.volume-id=0x1dd755248cc94d939b14518021c8df3f
trusted.tier.tier-dht=0x000000010000000000000000ffffffff
trusted.tier.tier-dht.commithash=0x3330313736383334313800

[root@dhcp37-155 ~]# 


[root@dhcp37-165 ~]# gluster volume info tiervolume 
 
Volume Name: tiervolume
Type: Distributed-Disperse
Volume ID: 1dd75524-8cc9-4d93-9b14-518021c8df3f
Status: Started
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.165:/rhs/brick1/ct-1
Brick2: 10.70.37.133:/rhs/brick1/ct-2
Brick3: 10.70.37.160:/rhs/brick1/ct-3
Brick4: 10.70.37.158:/rhs/brick1/ct-4
Brick5: 10.70.37.110:/rhs/brick1/ct-5
Brick6: 10.70.37.155:/rhs/brick1/ct-6
Brick7: 10.70.37.165:/rhs/brick2/ct-7
Brick8: 10.70.37.133:/rhs/brick2/ct-8
Brick9: 10.70.37.160:/rhs/brick2/ct-9
Brick10: 10.70.37.158:/rhs/brick2/ct-10
Brick11: 10.70.37.110:/rhs/brick2/ct-11
Brick12: 10.70.37.155:/rhs/brick2/ct-12
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable
[root@dhcp37-165 ~]# 

<Note: It was a tier volume when the original issue has been seen. The output above is after detach>



Version-Release number of selected component (if applicable):
=============================================================


How reproducible:
=================

1/1


Steps Carried:
==============
1. Create Master volume Tiered {HT: 3x2, CT: 2x(4+2)} 
2. Create Slave volume (4x2)
3. Create geo-rep session
4. Start geo-rep session

Actual results:
===============

All the passive bricks went to faulty


Expected results:
=================

Geo-Rep should be ACTIVE

Comment 1 Vijay Bellur 2016-03-02 05:24:48 UTC
REVIEW: http://review.gluster.org/13572 (geo-rep: Mask xtime and stime xattrs) posted (#1) for review on release-3.7 by Kotresh HR (khiremat)

Comment 2 Vijay Bellur 2016-03-07 10:24:35 UTC
COMMIT: http://review.gluster.org/13572 committed in release-3.7 by Aravinda VK (avishwan) 
------
commit fda9270c906d955f735b0fc0f85818a435325f87
Author: Kotresh HR <khiremat>
Date:   Thu Jan 14 17:14:25 2016 +0530

    geo-rep: Mask xtime and stime xattrs
    
    Allow access to xtime and stime xattrs only
    to gsyncd client and mask them for the rest.
    
    This is to prevent afr from performing self
    healing on marker xtime and geo-rep stime
    xattr which is not expected as each of which
    gets updated them from backend brick and
    should not be healed.
    
    BUG: 1313623
    Change-Id: I9b4b3ce30bbc09d300e6d5c6782e2446f2411c6f
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/13242
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Vijaikumar Mallikarjuna <vmallika>
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
    Reviewed-on: http://review.gluster.org/13572
    Reviewed-by: Aravinda VK <avishwan>

Comment 3 Vijay Bellur 2016-03-11 10:08:49 UTC
REVIEW: http://review.gluster.org/13679 (posix: Filter gsyncd stime xattr) posted (#1) for review on release-3.7 by Kotresh HR (khiremat)

Comment 4 Mike McCune 2016-03-28 22:52:26 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 5 Vijay Bellur 2016-03-29 08:36:15 UTC
COMMIT: http://review.gluster.org/13679 committed in release-3.7 by Pranith Kumar Karampuri (pkarampu) 
------
commit f58d2919a63002e4852d0be4d35b66c3ae9cce21
Author: Kotresh HR <khiremat>
Date:   Fri Mar 11 15:07:48 2016 +0530

    posix: Filter gsyncd stime xattr
    
    Filter gsyncd stime xattr in lookup as well.
    The value of stime would be different among
    replica bricks and EC bricks. AFR and EC
    should not take any action on these as it
    could be different.
    
    Change-Id: If577f6115b36e036af2292ea0eaae93110f006ba
    Reviewed on: http://review.gluster.org/13678
    BUG: 1313623
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/13679
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 6 Kaushal 2016-04-19 06:58:43 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.10, please open a new bug report.

glusterfs-3.7.10 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-April/026164.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

Comment 7 Kaushal 2016-04-19 07:21:27 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.9, please open a new bug report.

glusterfs-3.7.9 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-March/025922.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user