Description of problem:
=======================
While running the automated geo-rep test suite, found that the worker crashes with the following at the master:

[2017-01-07 08:30:34.106961] E [syncdutils(/bricks/brick2/master_brick8):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1134, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1109, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error
[2017-01-07 08:30:34.218364] I [syncdutils(/bricks/brick2/master_brick8):237:finalize] <top>: exiting.

At Slave:
=========
[2017-01-07 07:29:32.11006] I [resource(slave):932:service_loop] GLUSTER: slave listening
[2017-01-07 08:30:34.98694] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 809, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
[2017-01-07 08:30:34.232213] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.

The client mount log indicates that setxattr returned -1, which caused the input/output error.

Client-Mount log
================
[2017-01-07 08:30:34.097085] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = 9c2b8e07-1e7e-47c5-befd-b71ff7d647ba). Holes=1 overlaps=0
[2017-01-07 08:30:34.098509] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 129685: SETXATTR() /.gfid/9c2b8e07-1e7e-47c5-befd-b71ff7d647ba => -1 (Input/output error)
[2017-01-07 08:30:34.266996] I [fuse-bridge.c:5082:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-HIyMWE

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-geo-replication-3.8.4-10.el7rhgs.x86_64

Steps to Reproduce:
===================
No concrete steps are available at the moment, other than the indication that the failure occurs after rmdir (rm -rf *). Will update the bug in case it can be reproduced.
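For context on why the worker dies instead of retrying: the slave-side traceback goes through errno_wrap(), which tolerates a whitelist of errnos (ESTALE and EINVAL in the frame above) but re-raises anything else, and the EIO coming back from lsetxattr is not on that list. The following is a minimal sketch of that pattern, simplified and hypothetical, not the actual syncdutils.errno_wrap implementation:

# Simplified sketch of an errno_wrap-style helper: tolerate a whitelist of
# errnos, re-raise everything else (e.g. EIO), which then propagates to the
# caller. Illustrative only, not the actual gsyncd code.
import os
from errno import ESTALE, EINVAL, EIO


def errno_wrap_sketch(call, args=(), tolerated=(ESTALE, EINVAL)):
    try:
        return call(*args)
    except OSError as e:
        if e.errno in tolerated:
            # Tolerated errno: skip this entry and carry on.
            return None
        # Anything else (EIO here) is fatal for this entry batch.
        raise


def failing_lsetxattr(path, key, value):
    # Stand-in for the lsetxattr call failing on the aux mount.
    raise OSError(EIO, os.strerror(EIO))


if __name__ == '__main__':
    # Re-raises OSError: [Errno 5] Input/output error, as seen in the traceback.
    errno_wrap_sketch(failing_lsetxattr,
                      ('/.gfid/9c2b8e07-1e7e-47c5-befd-b71ff7d647ba',
                       'user.example', b'value'))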
Hitting the same issue on version: glusterfs-geo-replication-3.8.4-18.4.el6rhs.x86_64

While running the automated tests for Geo-Replication, found the following at the master:

Master:
=======
[2017-06-05 16:01:53.595813] E [syncdutils(/bricks/brick0/master_brick0):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 573, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1136, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1111, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 994, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 935, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error

At the slave:
=============
[2017-06-05 15:38:13.710927] I [resource(slave):932:service_loop] GLUSTER: slave listening
[2017-06-05 16:01:53.543239] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 809, in entry_ops
    [ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 79, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 5] Input/output error
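Note that the master-side crash is simply the slave-side exception being replayed: the repce RPC layer hands back the remote exception object as the call result and the caller re-raises it (the "raise res" frame in the traceback), so an unhandled EIO in entry_ops on the slave terminates the master worker. A rough sketch of that propagation pattern, hypothetical and not the actual repce code:

# Rough sketch of RPC-style exception propagation: the server returns either a
# normal result or the exception instance, and the client re-raises the
# exception locally. Hypothetical code, not the actual repce implementation.
import os
from errno import EIO


def server_call(method, *args):
    # Server side: run the requested method; on failure, ship the exception
    # back as the result instead of a normal return value.
    try:
        return method(*args), False
    except Exception as e:
        return e, True


def client_call(method, *args):
    # Client side: if the reply carries an exception, re-raise it here,
    # which is what kills the master worker in this bug.
    res, is_exc = server_call(method, *args)
    if is_exc:
        raise res
    return res


def entry_ops_stub(entries):
    # Stand-in for the slave's entry_ops() hitting EIO on lsetxattr.
    raise OSError(EIO, os.strerror(EIO))


if __name__ == '__main__':
    client_call(entry_ops_stub, [])  # raises OSError: [Errno 5] Input/output error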
Anything in slave mount logs ? EIO is coming during entry operation on slave. Is the setup available to debug?
(In reply to Kotresh HR from comment #7)
> Anything in slave mount logs ? EIO is coming during entry operation on slave.
> Is the setup available to debug?

Could be a split-brain on the slave.
(In reply to Kotresh HR from comment #7)
> Anything in slave mount logs ? EIO is coming during entry operation on slave.
> Is the setup available to debug?

Have hit a similar issue again at rmdir... No split-brain in the logs or in heal info split-brain at the slave.

Also, if it were a split-brain failing with an I/O error, the subsequent fops, including rmdir, wouldn't succeed. That is not the case here; eventually we do not see the file.
(In reply to Rahul Hinduja from comment #10)
> (In reply to Kotresh HR from comment #7)
> > Anything in slave mount logs ? EIO is coming during entry operation on slave.
> > Is the setup available to debug?
>
> Have hit a similar issue again at rmdir... No split-brain in the logs or in
> heal info split-brain at the slave.
>
> Also, if it were a split-brain failing with an I/O error, the subsequent fops,
> including rmdir, wouldn't succeed. That is not the case here; eventually we do
> not see the file.

EIO does not occur only during split-brains; if I remember correctly, AFR also throws EIO when it fails to acquire locks. Could you get the client log messages from the corresponding time?
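To pull the client-side messages from the slave mount log around the failure, a small helper along these lines can be used; the log path and timestamp below are placeholders and need to be adjusted for the actual setup:

# Small helper to print slave mount-log lines within a window around a given
# timestamp. LOG and AROUND are placeholders for this setup.
from datetime import datetime, timedelta

LOG = '/var/log/glusterfs/geo-replication-slaves/slave.gluster.log'  # placeholder path
AROUND = datetime(2017, 7, 24, 0, 43, 13)                            # failure time
WINDOW = timedelta(seconds=5)


def lines_around(log_path, center, window):
    with open(log_path, errors='replace') as f:
        for line in f:
            # gluster log lines start with "[YYYY-MM-DD HH:MM:SS.micros]"
            if not line.startswith('['):
                continue
            try:
                ts = datetime.strptime(line[1:20], '%Y-%m-%d %H:%M:%S')
            except ValueError:
                continue
            if abs(ts - center) <= window:
                yield line.rstrip()


if __name__ == '__main__':
    for entry in lines_around(LOG, AROUND, WINDOW):
        print(entry)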
[2017-07-24 00:43:13.793227] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-7: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793611] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-9: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793988] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-11: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.793994] W [MSGID: 114031] [client-rpc-fops.c:1031:client3_3_setxattr_cbk] 0-slave-client-10: remote operation failed [No such file or directory]
[2017-07-24 00:43:13.806002] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584). Holes=1 overlaps=0
[2017-07-24 00:43:13.811313] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 74757: SETXATTR() /.gfid/cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584 => -1 (Input/output error)
[2017-07-24 00:43:13.856382] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: initating unmount of /tmp/gsyncd-aux-mount-VHnkv9
[2017-07-24 00:43:13.856752] W [glusterfsd.c:1290:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x33ed607aa1) [0x7fe715e1baa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fe71722f845] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fe71722f2b6] ) 0-: received signum (15), shutting down
[2017-07-24 00:43:13.856822] I [fuse-bridge.c:5827:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-VHnkv9'.
[2017-07-24 00:43:13.856894] I [fuse-bridge.c:5832:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-VHnkv9'.
[2017-07-24 00:46:16.601092] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication-slaves/3d9a6462-47f2-4c64-8000-c372ed348b60:10.70.42.33.%2Fbricks%2Fbrick0%2Fmaster_brick2.slave.gluster.log --volfile-server=localhost --volfile-id=slave --client-pid=-1 /tmp/gsyncd-aux-mount-SXm8XG)
[2017-07-24 00:46:16.633850] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-07-24 00:46:16.638433] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-slave-md-cache: adding option 'cache-posix-acl' for volume 'slave-md-cache' with value 'true'
[2017-07-24 00:46:16.639648] I [afr.c:94:fix_quorum_options] 0-slave-replicate-5: reindeer: incoming qtype = none
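For reference, the SETXATTR seen failing above appears to be geo-rep's entry-creation path on the slave: the worker issues a setxattr of a virtual gfid-access key on /.gfid/<parent-gfid> in the aux mount, and the DHT layout anomaly (Holes=1) on that parent surfaces as EIO. A rough illustration of that call is below; it uses os.setxattr instead of gsyncd's ctypes wrapper, and the xattr key and payload shown are illustrative assumptions, not the exact blob gsyncd packs:

# Rough illustration of the entry-creation setxattr that geo-rep issues on the
# slave's aux (gfid-access) mount. The xattr key and payload format here are
# illustrative, not the exact on-wire format used by gsyncd.
import os
import errno

AUX_MOUNT = '/tmp/gsyncd-aux-mount-XXXXXX'            # placeholder aux mount
PARENT_GFID = 'cb18a4c3-21b0-4f8f-a9bb-4f1f53fb4584'  # parent dir gfid from the log
GFID_XATTR_KEY = 'glusterfs.gfid.newfile'             # assumed gfid-access virtual key

parent_path = os.path.join(AUX_MOUNT, '.gfid', PARENT_GFID)
payload = b'<packed gfid + entry name + mode ...>'    # illustrative payload

try:
    # lsetxattr equivalent: operate on the path itself, do not follow symlinks.
    os.setxattr(parent_path, GFID_XATTR_KEY, payload, follow_symlinks=False)
except OSError as e:
    if e.errno == errno.EIO:
        # This is the failure seen in the mount log: DHT found a layout hole on
        # the parent, so the virtual setxattr returns Input/output error.
        print('entry creation failed with EIO on', parent_path)
    else:
        raise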
Would like this to be tested for a similar scenario with RHGS 3.5 and, depending on the results, we will take this up.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1462