Bug 1325760 - Worker dies with [Errno 5] Input/output error upon creation of entries at slave
Summary: Worker dies with [Errno 5] Input/output error upon creation of entries at slave
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Raghavendra G
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1311817 1334164 1336284 1336285 1348060
 
Reported: 2016-04-11 06:48 UTC by Rahul Hinduja
Modified: 2016-06-23 05:16 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.7.9-5
Doc Type: Bug Fix
Doc Text:
An invalid inode that was not yet linked in the inode table could be used for a Distributed Hash Table self-heal operation. This caused self-heal to fail, and an incomplete directory layout to be set in the inode. This meant directory operations such as create, mknod, mkdir, rename, and link failed on that directory. This update ensures that inodes are linked in the inode table prior to performing self-heal operations, so directory operations succeed.
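For illustration only (a sketch under assumptions, not part of the original doc text): on an unfixed build, the on-disk layout of an affected directory can be checked on each brick with getfattr. <affected-dir> is a hypothetical placeholder for a directory whose entry operations fail; the brick path follows the naming used later in this report.

# dump the DHT layout xattr of the affected directory on one brick;
# a missing or partial trusted.glusterfs.dht value is consistent with an incomplete layout
getfattr -n trusted.glusterfs.dht -e hex /bricks/brick0/slave_brick0/<affected-dir>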
Clone Of:
: 1334164
Environment:
Last Closed: 2016-06-23 05:16:34 UTC
Embargoed:
rgowdapp: needinfo+


Attachments


Links
Red Hat Product Errata RHBA-2016:1240 (normal, SHIPPED_LIVE): Red Hat Gluster Storage 3.1 Update 3, last updated 2016-06-23 08:51:28 UTC

Description Rahul Hinduja 2016-04-11 06:48:39 UTC
Description of problem:
=======================

With the latest build, the following traceback is observed during sync to the slave:

Master Log:
===========
[2016-04-06 13:40:04.844175] E [syncdutils(/bricks/brick1/master_brick6):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 166, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 663, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1510, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 934, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error
[2016-04-06 13:40:04.846092] I [syncdutils(/bricks/brick1/master_brick6):220:finalize] <top>: exiting.
[2016-04-06 13:40:04.854729] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.


Slave Log:
==========

[2016-04-06 13:39:56.414524] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 779, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
[2016-04-06 13:39:56.431735] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-04-06 13:39:56.432101] I [syncdutils(slave):220:finalize] <top>: exiting.


Slave Client Logs Report:
==========================

[root@dhcp37-123 geo-replication-slaves]# grep -i "split" 8cabd68c-37f8-4b36-87b1-70bc941d7823\:gluster%3A%2F%2F127.0.0.1%3Aslave.log
[root@dhcp37-123 geo-replication-slaves]# grep -i "split" 8cabd68c-37f8-4b36-87b1-70bc941d7823\:gluster%3A%2F%2F127.0.0.1%3Aslave.*
8cabd68c-37f8-4b36-87b1-70bc941d7823:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2016-04-06 13:42:15.735353] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-slave-replicate-3: Failing SETATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
8cabd68c-37f8-4b36-87b1-70bc941d7823:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2016-04-06 13:42:15.735791] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-slave-replicate-4: Failing SETATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
8cabd68c-37f8-4b36-87b1-70bc941d7823:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log:[2016-04-06 13:42:46.110050] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-slave-replicate-2: Failing SETATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[root@dhcp37-123 geo-replication-slaves]# less 8cabd68c-37f8-4b36-87b1-70bc941d7823:gluster%3A%2F%2F127.0.0.1%3Aslave.gluster.log
[root@dhcp37-123 geo-replication-slaves]# 




Client logs show split-brain errors, but neither the shd logs nor "gluster volume heal slave info split-brain" reports any files in split-brain:

[root@dhcp37-123 geo-replication-slaves]# gluster volume heal slave info split-brain
Brick 10.70.37.122:/bricks/brick0/slave_brick0
Number of entries in split-brain: 0

Brick 10.70.37.175:/bricks/brick0/slave_brick1
Number of entries in split-brain: 0

Brick 10.70.37.144:/bricks/brick0/slave_brick2
Number of entries in split-brain: 0

Brick 10.70.37.123:/bricks/brick0/slave_brick3
Number of entries in split-brain: 0

Brick 10.70.37.217:/bricks/brick0/slave_brick4
Number of entries in split-brain: 0

Brick 10.70.37.218:/bricks/brick0/slave_brick5
Number of entries in split-brain: 0

Brick 10.70.37.122:/bricks/brick1/slave_brick6
Number of entries in split-brain: 0

Brick 10.70.37.175:/bricks/brick1/slave_brick7
Number of entries in split-brain: 0

Brick 10.70.37.144:/bricks/brick1/slave_brick8
Number of entries in split-brain: 0

Brick 10.70.37.123:/bricks/brick1/slave_brick9
Number of entries in split-brain: 0

Brick 10.70.37.217:/bricks/brick1/slave_brick10
Number of entries in split-brain: 0

Brick 10.70.37.218:/bricks/brick1/slave_brick11
Number of entries in split-brain: 0

[root@dhcp37-123 geo-replication-slaves]# grep -i "split" /var/log/glusterfs/glustershd.log
[root@dhcp37-123 geo-replication-slaves]#

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-1.el7rhgs.x86_64

How reproducible:
=================

2/2


Steps to Reproduce:
===================
1. Set up geo-rep between the master and slave volumes
2. Mount the master volume (FUSE)
3. Use the crefi tool to generate the data set {create, chmod, chown}; an illustrative command sequence is sketched below
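An assumed minimal command sequence for the steps above (a sketch only, not the exact commands used in this run; the host names mnode/snode, the mount point, and the crefi options are assumptions, and passwordless root SSH from the master node to the slave node is assumed to be configured already):

# Step 1: create and start the geo-rep session between existing volumes 'master' and 'slave'
gluster system:: execute gsec_create
gluster volume geo-replication master snode::slave create push-pem
gluster volume geo-replication master snode::slave start

# Step 2: FUSE-mount the master volume on a client
mount -t glusterfs mnode:/master /mnt/master

# Step 3: generate the data set with crefi (create, then chmod, then chown)
crefi --multi -n 100 -b 10 -d 10 --random --max=10K --min=500 --fop=create /mnt/master/
crefi --multi -n 100 -b 10 -d 10 --random --max=10K --min=500 --fop=chmod  /mnt/master/
crefi --multi -n 100 -b 10 -d 10 --random --max=10K --min=500 --fop=chown  /mnt/master/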

Actual results:
===============

Files do eventually get synced to the slave, but many worker crashes are observed.


Expected results:
=================

The geo-rep worker should not die during sync.


Additional info:
================

None of the bricks were brought down.

Comment 5 Milind Changire 2016-04-19 06:35:31 UTC
Provided devel_ack+ for BZ with Regression.

Comment 8 Atin Mukherjee 2016-05-13 04:38:58 UTC
Upstream patch http://review.gluster.org/14295 is in review.

Comment 9 Raghavendra G 2016-05-16 18:06:04 UTC
Two patches needed for this bug:

1. https://code.engineering.redhat.com/gerrit/#/c/74412/

2. Waiting for regression run completion on:
  http://review.gluster.org/14319

Comment 10 Raghavendra G 2016-05-17 03:19:38 UTC
(In reply to Raghavendra G from comment #9)
> Two patches needed for this bug:
> 
> 1. https://code.engineering.redhat.com/gerrit/#/c/74412/
> 
> 2.Waiting for regression run completion on:
>   http://review.gluster.org/14319

An identical patch passed regression at:
http://review.gluster.org/14365

Downstream port of the same can be found at:
https://bugzilla.redhat.com/show_bug.cgi?id=1325760

Comment 12 Rahul Hinduja 2016-05-24 06:48:43 UTC
Verified with the build: glusterfs-3.7.9-5

Ran the automated suite, which creates up to 20K entries {10k files, 7k symlinks, 3k directories}.
Sync Type: rsync and tarssh
Client: glusterfs
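
(Illustrative note, assuming the volume/host naming from the sketch in the steps above: for a 3.7-era geo-rep session, the sync type is switched per session with the use_tarssh config option; rsync is the default.)

# sync via tar over ssh instead of rsync
gluster volume geo-replication master snode::slave config use_tarssh true

# switch back to rsync
gluster volume geo-replication master snode::slave config use_tarssh false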

All cases completed successfully, and the "OSError: [Errno 5] Input/output error" message was not seen in the logs.

Master:
=======
[root@dhcp37-162 master]# grep -ri "Errno 5" *
[root@dhcp37-162 master]# 

[root@dhcp37-40 master]# grep -ri "Errno 5" *
[root@dhcp37-40 master]# 

[root@dhcp37-116 master]# grep -ri "Errno 5" *
[root@dhcp37-116 master]# 

[root@dhcp37-189 master]# grep -ri "Errno 5" *
[root@dhcp37-189 master]# 

[root@dhcp37-121 master]# grep -ri "Errno 5" *
[root@dhcp37-121 master]# 

[root@dhcp37-190 master]# grep -ri "Errno 5" *
[root@dhcp37-190 master]# 

Slave:
======

[root@dhcp37-196 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-196 geo-replication-slaves]#

[root@dhcp37-88 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-88 geo-replication-slaves]# 

[root@dhcp37-200 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-200 geo-replication-slaves]# 

[root@dhcp37-43 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-43 geo-replication-slaves]# 

[root@dhcp37-213 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-213 geo-replication-slaves]# 

[root@dhcp37-52 geo-replication-slaves]# grep -ri "Errno 5" *
[root@dhcp37-52 geo-replication-slaves]# 

Moving the bug to verified state.

Comment 14 Raghavendra G 2016-06-10 04:24:19 UTC
Laura,

The doc text is fine.

regards,
Raghavendra

Comment 16 errata-xmlrpc 2016-06-23 05:16:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

