Bug 1412494
Summary: | [geo-rep]: Worker crashes with OSError: [Errno 5] Input/output error | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> |
Component: | geo-replication | Assignee: | Srijan Sivakumar <ssivakum> |
Status: | CLOSED ERRATA | QA Contact: | Vinayak Papnoi <vpapnoi> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | rhgs-3.2 | CC: | bkunal, bturner, csaba, khiremat, puebele, rhs-bugs, sheggodu, ssivakum, storage-qa-internal, sunkumar |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | RHGS 3.5.z Batch Update 4 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-6.0-50 | Doc Type: | No Doc Update |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-04-29 07:20:37 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Rahul Hinduja
2017-01-12 07:44:47 UTC
Hitting the same issue on version glusterfs-geo-replication-3.8.4-18.4.el6rhs.x86_64. While running the automated tests for Geo-Replication, found the following on the master:

Master:
=======

    [2017-06-05 16:01:53.595813] E [syncdutils(/bricks/brick0/master_brick0):296:log_raise_exception] <top>: FAIL:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
        main_i()
      File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
        local.service_loop(*[r for r in [remote] if r])
      File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
        g2.crawlwrap()
      File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 573, in crawlwrap
        self.crawl()
      File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1136, in crawl
        self.changelogs_batch_process(changes)
      File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1111, in changelogs_batch_process
        self.process(batch)
      File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 994, in process
        self.process_change(change, done, retry)
      File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 935, in process_change
        failures = self.slave.server.entry_ops(entries)
      File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
        return self.ins(self.meth, *a)
      File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
        raise res
    OSError: [Errno 5] Input/output error

At the slave:
=============

    [2017-06-05 15:38:13.710927] I [resource(slave):932:service_loop] GLUSTER: slave listening
    [2017-06-05 16:01:53.543239] E [repce(slave):117:worker] <top>: call failed:
    Traceback (most recent call last):
      File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
        res = getattr(self.obj, rmeth)(*in_data[2:])
      File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 809, in entry_ops
        [ESTALE, EINVAL])
      File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
        return call(*arg)
      File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
        cls.raise_oserr()
      File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
        raise OSError(errn, os.strerror(errn))
    OSError: [Errno 5] Input/output error

Anything in the slave mount logs? EIO is coming during an entry operation on the slave. Is the setup available to debug?

(In reply to Kotresh HR from comment #7)
> Anything in the slave mount logs? EIO is coming during an entry operation on
> the slave. Is the setup available to debug?

Maybe a split-brain on the slave.

(In reply to Kotresh HR from comment #7)
> Anything in the slave mount logs? EIO is coming during an entry operation on
> the slave. Is the setup available to debug?

Hit a similar issue again, at rmdir. No split-brain in the logs or in heal info split-brain on the slave.

Also, if it were a split-brain failing with an I/O error, the subsequent fops, including the rmdir, would not succeed. That is not the case here; eventually we do not see the file.

(In reply to Rahul Hinduja from comment #10)
> (In reply to Kotresh HR from comment #7)
> > Anything in the slave mount logs? EIO is coming during an entry operation
> > on the slave. Is the setup available to debug?
>
> Hit a similar issue again, at rmdir. No split-brain in the logs or in heal
> info split-brain on the slave.
>
> Also, if it were a split-brain failing with an I/O error, the subsequent
> fops, including the rmdir, would not succeed. That is not the case here;
> eventually we do not see the file.

EIO is not raised only during split-brains; if I remember correctly, AFR also throws EIO when it fails to acquire locks. Could you get the client log messages from the corresponding time?
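The two tracebacks fit together: the slave raises the OSError inside entry_ops, repce ships the exception back over the RPC channel, and the master re-raises it locally (the `raise res` frame in repce.py), which is why the same `[Errno 5]` appears on both sides. A minimal sketch of that propagation pattern, with illustrative names rather than the actual gsyncd API:

```python
import errno
import os
import pickle


def slave_dispatch(method, *args):
    """Slave side (compare repce.py worker()): run the requested
    operation and, on failure, serialize the exception back to the
    caller instead of crashing the RPC loop."""
    try:
        return ("ok", method(*args))
    except OSError as exc:
        return ("err", pickle.dumps(exc))


def master_call(method, *args):
    """Master side (compare 'raise res' in repce.py): unpack the
    response and re-raise the slave's exception locally, so the
    slave's EIO surfaces in the master's own traceback."""
    status, payload = slave_dispatch(method, *args)
    if status == "err":
        raise pickle.loads(payload)
    return payload


def failing_setxattr(path):
    # Hypothetical stand-in for the lsetxattr that failed on the slave.
    raise OSError(errno.EIO, os.strerror(errno.EIO))
```

This is why fixing the worker crash on the master is not enough by itself: the error originates on the slave mount, and the master merely replays it.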
    [2017-07-24 00:43:13.793227] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-7: remote operation failed [No such file or directory]
    [2017-07-24 00:43:13.793611] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-9: remote operation failed [No such file or directory]
    [2017-07-24 00:43:13.793988] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-11: remote operation failed [No such file or directory]
    [2017-07-24 00:43:13.793994] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-10: remote operation failed [No such file or directory]
    [2017-07-24 00:43:13.806002] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584). Holes=1 overlaps=0
    [2017-07-24 00:43:13.811313] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 74757: SETXATTR() /.gfid/cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584 => -1 (Input/output error)
    [2017-07-24 00:43:13.856382] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: initating unmount of /tmp/gsyncd-aux-mount-VHnkv9
    [2017-07-24 00:43:13.856752] W [glusterfsd.c:1290:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x33ed607aa1) [0x7fe715e1baa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fe71722f845] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fe71722f2b6] ) 0-: received signum (15), shutting down
    [2017-07-24 00:43:13.856822] I [fuse-bridge.c:5827:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-VHnkv9'.
    [2017-07-24 00:43:13.856894] I [fuse-bridge.c:5832:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-VHnkv9'.
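The DHT line "Found anomalies ... Holes=1 overlaps=0" says that after the setxattr failures the client could not assemble a complete layout for the directory: one slice of the 32-bit hash space had no subvolume owning it, and DHT returns EIO rather than operate on an incomplete layout. A toy version of that hole/overlap count, as a hypothetical helper rather than the actual dht_layout_normalize code:

```python
def layout_anomalies(ranges, space=2**32):
    """Toy check inspired by dht_layout_normalize: given each
    subvolume's inclusive (start, stop) hash range, count holes
    (uncovered slices) and overlaps across the hash space.
    Hypothetical helper, not the actual GlusterFS code."""
    holes = overlaps = 0
    expected = 0  # next hash value that should be covered
    for start, stop in sorted(ranges):
        if start > expected:
            holes += 1       # gap before this range begins
        elif start < expected:
            overlaps += 1    # this range re-covers earlier hashes
        expected = max(expected, stop + 1)
    if expected < space:
        holes += 1           # tail of the hash space uncovered
    return holes, overlaps
```

With one of the four subvolumes failing its setxattr, its slice of the hash space goes missing and the count comes back as Holes=1, which matches the `SETXATTR() ... => -1 (Input/output error)` seen at the FUSE layer.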
    [2017-07-24 00:46:16.601092] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication-slaves/3d9a6462-47f2-4c64-8000-c372ed348b60:10.70.42.33.%2Fbricks%2Fbrick0%2Fmaster_brick2.slave.gluster.log --volfile-server=localhost --volfile-id=slave --client-pid=-1 /tmp/gsyncd-aux-mount-SXm8XG)
    [2017-07-24 00:46:16.633850] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
    [2017-07-24 00:46:16.638433] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-slave-md-cache: adding option 'cache-posix-acl' for volume 'slave-md-cache' with value 'true'
    [2017-07-24 00:46:16.639648] I [afr.c:94:fix_quorum_options] 0-slave-replicate-5: reindeer: incoming qtype = none

Would like this to be tested in a similar scenario with RHGS 3.5; depending on the results, we will take this up.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1462