Bug 1412494

Summary: [geo-rep]: Worker crashes with OSError: [Errno 5] Input/output error
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Srijan Sivakumar <ssivakum>
Status: CLOSED ERRATA
QA Contact: Vinayak Papnoi <vpapnoi>
Severity: medium
Priority: medium
Version: rhgs-3.2
CC: bkunal, bturner, csaba, khiremat, puebele, rhs-bugs, sheggodu, ssivakum, storage-qa-internal, sunkumar
Keywords: ZStream
Target Release: RHGS 3.5.z Batch Update 4
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-6.0-50
Doc Type: No Doc Update
Last Closed: 2021-04-29 07:20:37 UTC
Type: Bug

Description Rahul Hinduja 2017-01-12 07:44:47 UTC
Description of problem:
=======================

While running the automated geo-rep test suite, the worker was found to crash with the following traceback at the master:

[2017-01-07 08:30:34.106961] E [syncdutils(/bricks/brick2/master_brick8):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1134, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1109, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error
[2017-01-07 08:30:34.218364] I [syncdutils(/bricks/brick2/master_brick8):237:finalize] <top>: exiting.
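
The traceback ends in repce.py's "raise res": the master-side RPC proxy re-raises whatever exception object the slave worker returned, which is how the slave's OSError(EIO) surfaces in the master worker log and kills the worker. A minimal sketch of that call-and-raise pattern (illustrative only, not the actual repce.py source; the send/receive callables are hypothetical):

    # Sketch of the RPC proxy pattern visible in the traceback above: ship
    # (method, args) to the slave and, if the reply is an exception object,
    # re-raise it locally in the master worker.
    class RepceCallSketch(object):
        def __init__(self, send, receive):
            self.send = send          # hypothetical: serialize and send the request
            self.receive = receive    # hypothetical: block for the slave's reply

        def __call__(self, method, *args):
            self.send(method, *args)
            res = self.receive()
            if isinstance(res, Exception):
                raise res             # corresponds to "raise res" in repce.py:208
            return res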

At Slave:
=========

[2017-01-07 07:29:32.11006] I [resource(slave):932:service_loop] GLUSTER: slave listening
[2017-01-07 08:30:34.98694] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 809, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
[2017-01-07 08:30:34.232213] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
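
In the slave traceback, entry_ops() runs the lsetxattr through errno_wrap() with [ESTALE, EINVAL] as the tolerated errnos; EIO is not in that list, so the OSError propagates back to the master. A minimal sketch of that tolerated-errno pattern, assuming the list is a simple whitelist (illustrative, not the actual syncdutils.py implementation; the paths and xattr names below are placeholders):

    import errno
    import os

    def errno_wrap_sketch(call, args, tolerated):
        # Run the fop; swallow only whitelisted errnos (e.g. ESTALE/EINVAL
        # from a racing rmdir on the slave), re-raise everything else.
        try:
            return call(*args)
        except OSError as e:
            if e.errno in tolerated:
                return None
            raise  # EIO (Errno 5) is not tolerated, so it reaches entry_ops()

    def lsetxattr_sketch(path, name, value):
        # os.setxattr is a standard Python 3 API on Linux; the real gsyncd
        # uses a ctypes wrapper (libcxattr.Xattr.lsetxattr), per the traceback.
        os.setxattr(path, name, value, follow_symlinks=False)

    # Shape of the failing call:
    # errno_wrap_sketch(lsetxattr_sketch,
    #                   ["/aux/mount/dir", "user.example", b"value"],
    #                   [errno.ESTALE, errno.EINVAL])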


The client mount log indicates that SETXATTR returned -1, which caused the input/output error.

Client-Mount log
=================

[2017-01-07 08:30:34.097085] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = 9c2b8e07-1e7e-47c5-befd-b71ff7d647ba). Holes=1 overlaps=0
[2017-01-07 08:30:34.098509] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 129685: SETXATTR() /.gfid/9c2b8e07-1e7e-47c5-befd-b71ff7d647ba => -1 (Input/output error)
[2017-01-07 08:30:34.266996] I [fuse-bridge.c:5082:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-HIyMWE
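
The "Holes=1" anomaly means the client found at least one slave subvolume whose copy of the parent directory has no DHT layout, after which DHT fails the SETXATTR with EIO. A hedged diagnostic sketch (not part of the original report; the brick paths are hypothetical) for checking which brick backend is missing the trusted.glusterfs.dht layout xattr on that directory:

    import os

    # Hypothetical backend paths of the same directory on each slave brick.
    brick_dirs = [
        "/bricks/brick0/slave_brick0/problem/dir",
        "/bricks/brick1/slave_brick1/problem/dir",
    ]

    for d in brick_dirs:
        try:
            layout = os.getxattr(d, "trusted.glusterfs.dht")
            print("%s: layout %s" % (d, layout.hex()))
        except OSError as e:
            # A brick where the xattr (or the directory) is missing matches
            # the "Holes=1" anomaly logged by dht_layout_normalize.
            print("%s: no layout (%s)" % (d, e))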



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-10.el7rhgs.x86_64


Steps to Reproduce:
===================

No concrete steps are available at the moment, other than the indication that it occurs after an rmdir (rm -rf *). Will update the bug if it can be reproduced.

Comment 6 Rochelle 2017-06-06 08:44:09 UTC

Hitting the same issue on version: glusterfs-geo-replication-3.8.4-18.4.el6rhs.x86_64

While running the automated tests for geo-replication, the following was found at the master:

Master:
=======

[2017-06-05 16:01:53.595813] E [syncdutils(/bricks/brick0/master_brick0):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 573, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1136, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1111, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 994, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 935, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error


At the slave:
=============

[2017-06-05 15:38:13.710927] I [resource(slave):932:service_loop] GLUSTER: slave listening
[2017-06-05 16:01:53.543239] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 809, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error

Comment 7 Kotresh HR 2017-06-06 11:18:47 UTC
Anything in the slave mount logs? EIO is coming during an entry operation on the slave.
Is the setup available to debug?

Comment 8 Aravinda VK 2017-06-06 11:43:29 UTC
(In reply to Kotresh HR from comment #7)
> Anything in slave mount logs ? EIO is coming during entry operation on slave.
> Is the setup available to debug?

Maybe a split-brain on the slave.

Comment 10 Rahul Hinduja 2017-07-24 05:18:33 UTC
(In reply to Kotresh HR from comment #7)
> Anything in slave mount logs ? EIO is coming during entry operation on slave.
> Is the setup available to debug?

Have hit a similar issue again at rmdir... No split-brain seen in the logs or in heal info split-brain at the slave.

Also, if it were a split-brain failing with an I/O error, the subsequent fops including rmdir would not succeed, which is not the case here; eventually we do not see the file.

Comment 11 Kotresh HR 2017-07-24 05:50:32 UTC
(In reply to Rahul Hinduja from comment #10)
> (In reply to Kotresh HR from comment #7)
> > Anything in slave mount logs ? EIO is coming during entry operation on slave.
> > Is the setup available to debug?
> 
> Have hit the similar issue again at rmdir... No split-brain from logs or
> heal info split-brain at slave. 
> 
> Also, if it was a split brain and failing because of I/O error, the
> subsequent fops including rmdir wouldn't succeed. Which is not the case
> here, eventually we do not see the file.

EIO is not only seen during split-brains; if I remember correctly, AFR throws EIO when it fails to acquire locks as well. Could you get the client log messages at the corresponding time?

Comment 12 Rahul Hinduja 2017-07-24 05:53:38 UTC
[2017-07-24 00:43:13.793227] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-7: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793611] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-9: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793988] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-11: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793994] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-10: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.806002] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584). Holes=1 overlaps=0
[2017-07-24 00:43:13.811313] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 74757: SETXATTR() /.gfid/cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584 => -1 (Input/output error)
[2017-07-24 00:43:13.856382] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: initating unmount of /tmp/gsyncd-aux-mount-VHnkv9
[2017-07-24 00:43:13.856752] W [glusterfsd.c:1290:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x33ed607aa1) [0x7fe715e1baa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fe71722f845] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fe71722f2b6] ) 0-: received signum (15), shutting down
[2017-07-24 00:43:13.856822] I [fuse-bridge.c:5827:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-VHnkv9'.
[2017-07-24 00:43:13.856894] I [fuse-bridge.c:5832:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-VHnkv9'.
[2017-07-24 00:46:16.601092] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication-slaves/3d9a6462-47f2-4c64-8000-c372ed348b60:10.70.42.33.%2Fbricks%2Fbrick0%2Fmaster_brick2.slave.gluster.log --volfile-server=localhost --volfile-id=slave --client-pid=-1 /tmp/gsyncd-aux-mount-SXm8XG)
[2017-07-24 00:46:16.633850] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-07-24 00:46:16.638433] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-slave-md-cache: adding option 'cache-posix-acl' for volume 'slave-md-cache' with value 'true'
[2017-07-24 00:46:16.639648] I [afr.c:94:fix_quorum_options] 0-slave-replicate-5: reindeer: incoming qtype = none

Comment 23 Amar Tumballi 2019-03-14 06:02:29 UTC
Would like this to be tested for a similar scenario with RHGS 3.5; depending on the results, we will take this up.

Comment 45 errata-xmlrpc 2021-04-29 07:20:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1462