Bug 1325760
| Summary: | Worker dies with [Errno 5] Input/output error upon creation of entries at slave | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> | |
| Component: | distribute | Assignee: | Raghavendra G <rgowdapp> | |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.1 | CC: | amukherj, asrivast, avishwan, chrisw, csaba, mchangir, nbalacha, nlevinki, rgowdapp | |
| Target Milestone: | --- | Keywords: | Regression, ZStream | |
| Target Release: | RHGS 3.1.3 | Flags: | rgowdapp: needinfo+ | |
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.7.9-5 | Doc Type: | Bug Fix | |
| Doc Text: | An invalid inode that was not yet linked in the inode table could be used for a Distributed Hash Table self-heal operation. This caused self-heal to fail, and an incomplete directory layout to be set in the inode. This meant directory operations such as create, mknod, mkdir, rename, and link failed on that directory. This update ensures that inodes are linked in the inode table prior to performing self-heal operations, so directory operations succeed. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1334164 | Environment: | ||
| Last Closed: | 2016-06-23 05:16:34 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1311817, 1334164, 1336284, 1336285, 1348060 | |||
| Provided devel_ack+ for BZ with Regression.

Upstream patch http://review.gluster.org/14295 is in review.

Two patches needed for this bug:
1. https://code.engineering.redhat.com/gerrit/#/c/74412/
2. Waiting for regression run completion on: http://review.gluster.org/14319

(In reply to Raghavendra G from comment #9)
> Two patches needed for this bug:
>
> 1. https://code.engineering.redhat.com/gerrit/#/c/74412/
>
> 2. Waiting for regression run completion on:
> http://review.gluster.org/14319

An identical patch passed regression at: http://review.gluster.org/14365
Downstream port of the same can be found at: https://bugzilla.redhat.com/show_bug.cgi?id=1325760

Verified with the build: glusterfs-3.7.9-5
Ran the automated suite, which creates up to 20K entries {10k files, 7k symlinks, 3k directories}.
Sync Type: rsync and tarssh
Client: glusterfs
All cases successfully completed and did not see the "OSError: [Errno 5] Input/output error" in the logs.
Master:
=======
[root@dhcp37-162 master]# grep -ri "Errno 5" *
[root@dhcp37-162 master]# 
[root@dhcp37-40 master]# grep -ri "Errno 5" *
[root@dhcp37-40 master]# 
[root@dhcp37-116 master]# grep -ri "Errno 5" *
[root@dhcp37-116 master]# 
[root@dhcp37-189 master]# grep -ri "Errno 5" *
[root@dhcp37-189 master]# 
[root@dhcp37-121 master]# grep -ri "Errno 5" *
[root@dhcp37-121 master]# 
[root@dhcp37-190 master]# grep -ri "Errno 5" *
[root@dhcp37-190 master]# 
Slave:
======
[root@dhcp37-196 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-196 geo-replication-slaves]#
[root@dhcp37-88 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-88 geo-replication-slaves]# 
[root@dhcp37-200 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-200 geo-replication-slaves]# 
[root@dhcp37-43 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-43 geo-replication-slaves]# 
[root@dhcp37-213 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-213 geo-replication-slaves]# 
[root@dhcp37-52 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-52 geo-replication-slaves]# 
Moving the bug to verified state.
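For anyone repeating this check on many nodes, the per-directory `grep -ri "Errno 5" *` runs above can be approximated with a short script. This is only a minimal sketch: the function name `find_pattern` and the demo directory are hypothetical and not part of any gluster tooling, and the real log directories have to be supplied by whoever runs it.

```python
import os
import tempfile


def find_pattern(log_dir, pattern="Errno 5"):
    """Return (path, line_number, line) for every case-insensitive match under log_dir."""
    needle = pattern.lower()
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # errors="replace" keeps the scan going on logs with stray bytes
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line.lower():
                        hits.append((path, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Demo on a throwaway directory; point log_dir at a real log directory instead.
    demo_dir = tempfile.mkdtemp()
    with open(os.path.join(demo_dir, "worker.log"), "w") as handle:
        handle.write("I [repce] terminating on reaching EOF.\n")
        handle.write("OSError: [Errno 5] Input/output error\n")
    for path, number, line in find_pattern(demo_dir):
        print(f"{path}:{number}: {line}")
```

An empty result corresponds to the empty grep output shown for each master and slave node.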
Laura,
The doc text is fine.
regards,
Raghavendra

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2016:1240 | 
Description of problem:
=======================
With the latest build, a traceback is seen during the sync to slave:

Master Log:
===========
[2016-04-06 13:40:04.844175] E [syncdutils(/bricks/brick1/master_brick6):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 166, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 663, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1510, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 934, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error
[2016-04-06 13:40:04.846092] I [syncdutils(/bricks/brick1/master_brick6):220:finalize] <top>: exiting.
[2016-04-06 13:40:04.854729] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.

Slave Log:
==========
[2016-04-06 13:39:56.414524] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 779, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
[2016-04-06 13:39:56.431735] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-04-06 13:39:56.432101] I [syncdutils(slave):220:finalize] <top>: exiting.

Slave Client Logs Reports:
==========================
[root@dhcp37-123 geo-replication-slaves]# grep -i "split" 8cabd68c-37f8-4b36-87b1-70bc941d7823\:gluster%3A%2F%2F127.0.0.1%3Aslave.log
[root@dhcp37-123 geo-replication-slaves]# grep -i "split" 8cabd68c-37f8-4b36-87b1-70bc941d7823\:gluster%3A%2F%2F127.0.0.1%3Aslave.*
8cabd68c-37f8-4b36-87b1-70bc941d7823:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2016-04-06 13:42:15.735353] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-slave-replicate-3: Failing SETATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
8cabd68c-37f8-4b36-87b1-70bc941d7823:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2016-04-06 13:42:15.735791] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-slave-replicate-4: Failing SETATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
8cabd68c-37f8-4b36-87b1-70bc941d7823:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2016-04-06 13:42:46.110050] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-slave-replicate-2: Failing SETATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[root@dhcp37-123 geo-replication-slaves]# less 8cabd68c-37f8-4b36-87b1-70bc941d7823:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log
[root@dhcp37-123 geo-replication-slaves]#

Client logs show split-brain errors, but neither the shd logs nor the heal info split-brain output reports any files in split-brain:

[root@dhcp37-123 geo-replication-slaves]# gluster volume heal slave info split-brain
Brick 10.70.37.122:/bricks/brick0/slave_brick0
Number of entries in split-brain: 0

Brick 10.70.37.175:/bricks/brick0/slave_brick1
Number of entries in split-brain: 0

Brick 10.70.37.144:/bricks/brick0/slave_brick2
Number of entries in split-brain: 0

Brick 10.70.37.123:/bricks/brick0/slave_brick3
Number of entries in split-brain: 0

Brick 10.70.37.217:/bricks/brick0/slave_brick4
Number of entries in split-brain: 0

Brick 10.70.37.218:/bricks/brick0/slave_brick5
Number of entries in split-brain: 0

Brick 10.70.37.122:/bricks/brick1/slave_brick6
Number of entries in split-brain: 0

Brick 10.70.37.175:/bricks/brick1/slave_brick7
Number of entries in split-brain: 0

Brick 10.70.37.144:/bricks/brick1/slave_brick8
Number of entries in split-brain: 0

Brick 10.70.37.123:/bricks/brick1/slave_brick9
Number of entries in split-brain: 0

Brick 10.70.37.217:/bricks/brick1/slave_brick10
Number of entries in split-brain: 0

Brick 10.70.37.218:/bricks/brick1/slave_brick11
Number of entries in split-brain: 0

[root@dhcp37-123 geo-replication-slaves]# grep -i "split" /var/log/glusterfs/glustershd.log
[root@dhcp37-123 geo-replication-slaves]#

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-1.el7rhgs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
===================
1. Set up geo-rep between master and slave volume
2. Mount master volume (Fuse)
3. Use crefi tool for data set {create, chmod, chown}

Actual results:
===============
Files do eventually get synced to the slave, but lots of worker crashes are observed.

Expected results:
=================
Geo-rep worker shouldn't die during sync

Additional info:
================
None of the bricks were brought down.
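The slave traceback shows `lsetxattr` raising `OSError` with errno 5 inside `errno_wrap`, on a call whose visible arguments end in `[ESTALE, EINVAL]`. Assuming that list names errno values the wrapper treats specially, EIO falls outside it and propagates up to the worker. The sketch below only illustrates that pattern. It is not the gsyncd implementation, and `call_tolerating` and `failing_setxattr` are hypothetical names.

```python
import errno
import os


def call_tolerating(call, args=(), tolerated=()):
    """Run call(*args); return the errno if it is in `tolerated`, otherwise re-raise."""
    try:
        return call(*args)
    except OSError as exc:
        if exc.errno in tolerated:
            return exc.errno
        raise


def failing_setxattr():
    # Hypothetical stand-in for the failing xattr call seen in the slave log
    raise OSError(errno.EIO, os.strerror(errno.EIO))


if __name__ == "__main__":
    try:
        call_tolerating(failing_setxattr, tolerated=[errno.ESTALE, errno.EINVAL])
    except OSError as exc:
        # EIO is not in the tolerated list, so the error reaches the caller
        print("propagated:", exc)
```

Under this assumption, any errno outside the tolerated list ends the call, which matches the worker exiting after the I/O error in the logs above.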