Description of problem:
Geo-rep worker goes faulty on some bricks (not all bricks) if there is file rotation inside the GlusterFS mount.

Version-Release number of selected component (if applicable):
4.1.5 on CentOS 7.5 (I have not tested on other versions or operating systems)

How reproducible:
Always

Steps to Reproduce:
1. Mount a geo-replicated volume from the Master node
2. Create a file (such as a log file)
3. Rotate that file a few times

Actual results:
Geo-rep worker goes faulty on some bricks (not all bricks).

gsyncd.log on Master
--------------------
[2018-12-01 20:39:49.653356] E [repce(worker /mnt/BRICK3):197:__call__] RepceClient: call failed call=25197:139717822179136:1543696787.31 method=entry_ops error=OSError
[2018-12-01 20:39:49.653767] E [syncdutils(worker /mnt/BRICK3):332:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 311, in main
    func(args)
  File "/usr/libexec/glusterfs/python/syncdaemon/subcmds.py", line 72, in subcmd_worker
    local.service_loop(remote)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1295, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 615, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1545, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1445, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1280, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1179, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 216, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 198, in __call__
    raise res
OSError: [Errno 116] Stale file handle
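The steps above can be sketched as a shell snippet. This is a minimal, hypothetical reproducer: it assumes a logrotate-style rename scheme (rename each suffix up by one, then recreate the base file), and `MNT` is a stand-in path you would point at the master-side GlusterFS mount (it defaults to a local demo directory here so the snippet runs anywhere).

```shell
# Hypothetical reproduction sketch. MNT is an assumption: point it at the
# geo-replicated GlusterFS mount on the Master node.
MNT="${MNT:-/tmp/georep-demo}"
mkdir -p "$MNT"

# Step 2: create a log file inside the mount
echo "line 1" > "$MNT/glue_app_debug_log.log"

# Step 3: rotate it logrotate-style: shift existing suffixes up by one...
for i in 3 2 1; do
    [ -f "$MNT/glue_app_debug_log.log.$i" ] && \
        mv "$MNT/glue_app_debug_log.log.$i" "$MNT/glue_app_debug_log.log.$((i+1))"
done
# ...then rename the live file to .1 and recreate the base file
mv "$MNT/glue_app_debug_log.log" "$MNT/glue_app_debug_log.log.1"
echo "new line" > "$MNT/glue_app_debug_log.log"
```

Run against a real geo-replicated mount, the renames generate the changelog entry_ops that the faulty workers then fail to replay on the slave.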
gsyncd.log on Slave
-------------------
[2018-12-01 20:59:52.571860] W [syncdutils(slave gluster-eadmin-data.vm/mnt/BRICK3):552:errno_wrap] <top>: reached maximum retries args=['.gfid/86ba8c38-5ab0-417e-9130-64dd2d7cf4aa/glue_app_debug_log.log.82', '.gfid/86ba8c38-5ab0-417e-9130-64dd2d7cf4aa/glue_app_debug_log.log.83'] error=[Errno 116] Stale file handle
[2018-12-01 20:59:52.572635] E [repce(slave gluster-eadmin-data.vm/mnt/BRICK3):105:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 101, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 675, in entry_ops
    uid, gid)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 526, in rename_with_disk_gfid_confirmation
    [ENOENT, EEXIST], [ESTALE, EBUSY])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 540, in errno_wrap
    return call(*arg)
OSError: [Errno 116] Stale file handle

Expected results:
Geo-rep worker stays in the normal (non-faulty) state.

Additional info:
Those errors go away if I move the rotated files (glue_app_debug_log.log.82 and glue_app_debug_log.log.83 in the log above) from the Gluster mount to a temporary place and then move them back to their original location in the Gluster mount.
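The move-out-and-back workaround described under "Additional info" could look roughly like the following sketch. `MNT` is again a stand-in for the master-side mount (defaulting to a local demo directory so the snippet is runnable), and the demo `touch` lines merely fabricate the rotated files that would already exist in a real setup.

```shell
# Hypothetical workaround sketch: move the rotated files that trigger
# ESTALE out of the Gluster mount, then move them back, so geo-rep sees
# fresh entries for them. MNT is an assumption.
MNT="${MNT:-/tmp/georep-demo}"
TMP="$(mktemp -d)"

mkdir -p "$MNT"
touch "$MNT/glue_app_debug_log.log.82" "$MNT/glue_app_debug_log.log.83"  # demo stand-ins

for f in "$MNT"/glue_app_debug_log.log.*; do
    mv "$f" "$TMP/"                    # move out of the mount...
    mv "$TMP/$(basename "$f")" "$f"    # ...and back to the original place
done
rmdir "$TMP"
```

This only clears the symptom for the files already affected; the rotation of new files will hit the same error until the underlying bug is fixed.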
I confirm that this error still happens on version 4.1.6. I have also tried invalidating the related volume config, but still no luck.
I think this should solve the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1694820. The patch is merged upstream. Can you please verify and let us know whether it solves the problem? -Sunny
Closing this bug as the fix is available upstream.