Bug 1412494 - [geo-rep]: Worker crashes with OSError: [Errno 5] Input/output error
Keywords:
Status: POST
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Sunny Kumar
QA Contact: Prasanth
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-12 07:44 UTC by Rahul Hinduja
Modified: 2020-09-24 02:15 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Rahul Hinduja 2017-01-12 07:44:47 UTC
Description of problem:
=======================

While running the automated geo-rep test suite, the worker crashed with the following traceback at the master:

[2017-01-07 08:30:34.106961] E [syncdutils(/bricks/brick2/master_brick8):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1134, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1109, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error
[2017-01-07 08:30:34.218364] I [syncdutils(/bricks/brick2/master_brick8):237:finalize] <top>: exiting.
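The identical OSError appearing in both logs follows from gsyncd's RPC layer (RepCE), visible in the last frames of the traceback: the slave executes entry_ops, and on failure the exception object itself is shipped back to the master, where repce.py re-raises it ("raise res"). A minimal sketch of that pattern, using an in-process stand-in for the actual RPC transport (names and structure simplified, not the upstream implementation):

```python
import errno
import os


def slave_worker(obj, method, args):
    """Slave side: run the requested method, returning either the result
    or the exception object instead of letting it escape locally."""
    try:
        return True, getattr(obj, method)(*args)
    except OSError as ex:
        return False, ex


def master_call(obj, method, *args):
    """Master side: unpack the reply and re-raise remote failures locally,
    which is why one slave-side OSError shows up in both tracebacks."""
    ok, res = slave_worker(obj, method, args)  # stands in for the RPC round-trip
    if not ok:
        raise res
    return res
```

With a server object whose entry_ops raises OSError(EIO), master_call re-raises that same EIO on the caller's side, mirroring the two tracebacks above.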

At Slave:
=========

[2017-01-07 07:29:32.11006] I [resource(slave):932:service_loop] GLUSTER: slave listening
[2017-01-07 08:30:34.98694] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 809, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
[2017-01-07 08:30:34.232213] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
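The slave-side frames show why EIO escapes: errno_wrap retries or ignores a whitelist of errnos (here ESTALE and EINVAL are passed in), and anything outside that whitelist, such as EIO, is re-raised and kills the worker. A simplified sketch of that behavior (parameter names and retry policy are illustrative, not the exact upstream implementation):

```python
import errno
import time


def errno_wrap(call, args=None, ignore_errnos=None, max_retries=5):
    """Simplified sketch: run call(*args); swallow whitelisted errnos,
    retry briefly on ESTALE, and re-raise anything else."""
    args = args or []
    ignore_errnos = ignore_errnos or []
    tries = 0
    while True:
        try:
            return call(*args)
        except OSError as ex:
            if ex.errno in ignore_errnos:
                return  # whitelisted errno: treated as a no-op success
            if ex.errno == errno.ESTALE and tries < max_retries:
                tries += 1
                time.sleep(0.1)  # brief back-off before retrying
                continue
            raise  # EIO (Errno 5) lands here and crashes the worker
```

Called as in the traceback with an ignore list of [ESTALE, EINVAL], an lsetxattr failing with EINVAL would be swallowed, but the EIO seen here propagates up through entry_ops to the master.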


The client mount log indicates that SETXATTR returned -1, which caused the input/output error.

Client-Mount log
=================

[2017-01-07 08:30:34.097085] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = 9c2b8e07-1e7e-47c5-befd-b71ff7d647ba). Holes=1 overlaps=0
[2017-01-07 08:30:34.098509] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 129685: SETXATTR() /.gfid/9c2b8e07-1e7e-47c5-befd-b71ff7d647ba => -1 (Input/output error)
[2017-01-07 08:30:34.266996] I [fuse-bridge.c:5082:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-HIyMWE



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-10.el7rhgs.x86_64


Steps to Reproduce:
===================

No concrete steps are available at the moment, other than the indication that it occurs after rmdir (rm -rf *). Will update the bug if it can be reproduced.

Comment 6 Rochelle 2017-06-06 08:44:09 UTC

Hitting the same issue on version glusterfs-geo-replication-3.8.4-18.4.el6rhs.x86_64.

While running the automated geo-replication tests, the following appeared at the master:

Master:
=======

[2017-06-05 16:01:53.595813] E [syncdutils(/bricks/brick0/master_brick0):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 573, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1136, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1111, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 994, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 935, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error


At the slave:
=============

[2017-06-05 15:38:13.710927] I [resource(slave):932:service_loop] GLUSTER: slave listening
[2017-06-05 16:01:53.543239] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 809, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error

Comment 7 Kotresh HR 2017-06-06 11:18:47 UTC
Anything in the slave mount logs? EIO is coming during the entry operation on the slave.
Is the setup available to debug?

Comment 8 Aravinda VK 2017-06-06 11:43:29 UTC
(In reply to Kotresh HR from comment #7)
> Anything in the slave mount logs? EIO is coming during the entry operation on the slave.
> Is the setup available to debug?

Maybe split-brain on the slave.

Comment 10 Rahul Hinduja 2017-07-24 05:18:33 UTC
(In reply to Kotresh HR from comment #7)
> Anything in the slave mount logs? EIO is coming during the entry operation on the slave.
> Is the setup available to debug?

Hit a similar issue again at rmdir... No split-brain seen in the logs or in heal info split-brain at the slave.

Also, if it were a split-brain failing with an I/O error, the subsequent fops, including rmdir, wouldn't succeed. That is not the case here; eventually the file is no longer seen.

Comment 11 Kotresh HR 2017-07-24 05:50:32 UTC
(In reply to Rahul Hinduja from comment #10)
> (In reply to Kotresh HR from comment #7)
> > Anything in the slave mount logs? EIO is coming during the entry operation on the slave.
> > Is the setup available to debug?
> 
> Hit a similar issue again at rmdir... No split-brain seen in the logs or
> in heal info split-brain at the slave.
> 
> Also, if it were a split-brain failing with an I/O error, the subsequent
> fops, including rmdir, wouldn't succeed. That is not the case here;
> eventually the file is no longer seen.

EIO is not seen only during split-brains; if I remember correctly, AFR throws EIO when it fails to acquire locks as well. Could you get the client log messages from the corresponding time?

Comment 12 Rahul Hinduja 2017-07-24 05:53:38 UTC
[2017-07-24 00:43:13.793227] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-7: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793611] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-9: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793988] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-11: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793994] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-10: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.806002] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584). Holes=1 overlaps=0
[2017-07-24 00:43:13.811313] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 74757: SETXATTR() /.gfid/cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584 => -1 (Input/output error)
[2017-07-24 00:43:13.856382] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: initating unmount of /tmp/gsyncd-aux-mount-VHnkv9
[2017-07-24 00:43:13.856752] W [glusterfsd.c:1290:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x33ed607aa1) [0x7fe715e1baa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fe71722f845] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fe71722f2b6] ) 0-: received signum (15), shutting down
[2017-07-24 00:43:13.856822] I [fuse-bridge.c:5827:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-VHnkv9'.
[2017-07-24 00:43:13.856894] I [fuse-bridge.c:5832:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-VHnkv9'.
[2017-07-24 00:46:16.601092] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication-slaves/3d9a6462-47f2-4c64-8000-c372ed348b60:10.70.42.33.%2Fbricks%2Fbrick0%2Fmaster_brick2.slave.gluster.log --volfile-server=localhost --volfile-id=slave --client-pid=-1 /tmp/gsyncd-aux-mount-SXm8XG)
[2017-07-24 00:46:16.633850] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-07-24 00:46:16.638433] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-slave-md-cache: adding option 'cache-posix-acl' for volume 'slave-md-cache' with value 'true'
[2017-07-24 00:46:16.639648] I [afr.c:94:fix_quorum_options] 0-slave-replicate-5: reindeer: incoming qtype = none

Comment 23 Amar Tumballi 2019-03-14 06:02:29 UTC
We would like this to be tested for a similar scenario with RHGS 3.5 and, depending on the results, we will take this up.

