Description of problem:
=======================
While renaming directories in a loop, I am seeing the geo-replication worker crash with the following traceback:

Master:
=======
[2017-03-01 07:34:23.844472] E [master(/rhs/brick3/b5):785:log_failures] _GMaster: ENTRY FAILED: ({'stat': {'atime': 1488353577.9969134, 'gid': 0, 'mtime': 1488353577.9969134, 'mode': 16877, 'uid': 0}, 'entry1': '.gfid/00000000-0000-0000-0000-000000000001/rename_dir.124', 'gfid': 'a9adc254-3ec0-402d-945d-f1dcddbe411d', 'link': None, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/dir.124', 'op': 'RENAME'}, 2)
[2017-03-01 07:34:28.105679] E [repce(/rhs/brick3/b5):207:__call__] RepceClient: call 21221:140592415500096:1488353664.61 (entry_ops) failed on peer with OSError
[2017-03-01 07:34:28.109591] E [syncdutils(/rhs/brick3/b5):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 757, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1555, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 573, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1136, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1111, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 994, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 935, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy: '.gfid/00000000-0000-0000-0000-000000000001/dir.166'
[2017-03-01 07:34:28.117834] I [syncdutils(/rhs/brick3/b5):237:finalize] <top>: exiting.
[2017-03-01 07:34:28.138552] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-03-01 07:34:28.141488] I [syncdutils(agent):237:finalize] <top>: exiting.
[2017-03-01 07:34:36.280246] E [master(/rhs/brick1/b1):785:log_failures] _GMaster: ENTRY FAILED: ({'stat': {'atime': 1488353579.1139069, 'gid': 0, 'mtime': 1488353579.1139069, 'mode': 16877, 'uid': 0}, 'entry1': '.gfid/00000000-0000-0000-0000-000000000001/rename_dir.135', 'gfid': 'e15667ad-e647-4253-a84e-0a0c6143e730', 'link': None, 'entry': '.gfid/00000000-0000-0000-0000-000000000001/dir.135', 'op': 'RENAME'}, 2)

Slave:
======
[2017-03-01 07:20:33.380264] I [resource(slave):932:service_loop] GLUSTER: slave listening
[2017-03-01 07:34:28.50796] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 766, in entry_ops
    st = lstat(entry)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 512, in lstat
    return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
OSError: [Errno 16] Device or resource busy: '.gfid/00000000-0000-0000-0000-000000000001/dir.166'
[2017-03-01 07:34:28.146219] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-03-01 07:34:28.147622] I [syncdutils(slave):237:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================
mainline

How reproducible:
=================
Always

Steps to Reproduce:
===================
Seen on a non-root fanout setup, but it should also reproduce on a normal setup. Exact steps as carried out:
1. Create the master cluster (2 nodes) and the slave cluster (4 nodes).
2. Create and start the master volume and 2 slave volumes (each 2x2).
3. Create a mount-broker geo-rep session between the master volume and the 2 slave volumes.
4. Mount the master and slave volumes (NFS and FUSE).
5. Create directories on the master and rename them in a loop:
   From the first client:  for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
   From the second client: for i in {1..1000}; do mv dir.$i rename_dir.$i; done
   From the third client:  for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done

Actual results:
===============
Multiple worker crashes seen during renames.
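For context on the crash path: the slave traceback fails inside errno_wrap(os.lstat, [e], [ENOENT], [ESTALE]). Below is a minimal sketch of how such a wrapper behaves: errnos in the ignore list are swallowed, errnos in the retry list are retried, and everything else (EBUSY included, before the fix) is re-raised, which is what kills the worker. The names, retry count, and delay here are illustrative assumptions, not the shipped syncdutils.py code.

import errno
import os
import time

def errno_wrap(call, args=(), ignore_errnos=(), retry_errnos=(), max_retries=10):
    """Sketch of an errno_wrap-style helper (illustrative, not gsyncd's)."""
    retries = 0
    while True:
        try:
            return call(*args)
        except OSError as ex:
            if ex.errno in ignore_errnos:
                return ex.errno   # benign (e.g. ENOENT): report and move on
            if ex.errno not in retry_errnos:
                raise             # EBUSY landed here before the fix: worker crashed
            retries += 1
            if retries > max_retries:
                raise             # transient error persisted: give up
            time.sleep(0.25)      # brief back-off before retrying

# Mirrors the failing call site; with EBUSY in the retry list (the eventual
# fix), a transient "Device or resource busy" no longer crashes the worker.
st = errno_wrap(os.lstat, ["/tmp"], (errno.ENOENT,), (errno.ESTALE, errno.EBUSY))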
REVIEW: https://review.gluster.org/17049 (geo-rep: Retry on EBUSY) posted (#1) for review on release-3.10 by Kotresh HR (khiremat)
REVIEW: https://review.gluster.org/17050 (geo-rep: Fix EBUSY traceback) posted (#1) for review on release-3.10 by Kotresh HR (khiremat)
COMMIT: https://review.gluster.org/17050 committed in release-3.10 by Raghavendra Talur (rtalur)
------
commit fd6f5725a9fdbd6544548285d0853bdba83aeaff
Author: Kotresh HR <khiremat>
Date:   Fri Apr 7 05:33:34 2017 -0400

    geo-rep: Fix EBUSY traceback

    EBUSY was added to retry list of errno_wrap
    without importing. Fixing the same.

    > BUG: 1434018
    > Signed-off-by: Kotresh HR <khiremat>
    > Reviewed-on: https://review.gluster.org/17011
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Aravinda VK <avishwan>

    Change-Id: Ide81a9ccc9b948a96265b6890da078b722b45d51
    BUG: 1441927
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/17050
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
COMMIT: https://review.gluster.org/17049 committed in release-3.10 by Raghavendra Talur (rtalur)
------
commit eb1e3aebc152aa6ec2123376d479730185f3a031
Author: Kotresh HR <khiremat>
Date:   Mon Mar 20 05:21:59 2017 -0400

    geo-rep: Retry on EBUSY

    Do not crash on EBUSY error. Add EBUSY
    retry errno list. Crash only if the error
    persists even after max retries.

    > BUG: 1434018
    > Signed-off-by: Kotresh HR <khiremat>
    > Reviewed-on: https://review.gluster.org/16924
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Reviewed-by: Aravinda VK <avishwan>
    > CentOS-regression: Gluster Build System <jenkins.org>

    Change-Id: Ia067ccc6547731f28f2a315d400705e616cbf662
    BUG: 1441927
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/17049
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
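The net effect of the two backports, sketched against the illustrative errno_wrap shown earlier (this is not a literal diff of syncdutils.py): EBUSY is imported (review 17050) and added to the retry errno list passed by the lstat helper (review 17049), so the worker retries on "Device or resource busy" and raises only if the error persists past the retry limit.

# Sketch only; reuses the illustrative errno_wrap defined above.
from errno import ENOENT, ESTALE, EBUSY  # 17050: EBUSY must actually be imported
import os

def lstat(e):
    # 17049: EBUSY joins the retry list, so a transient EBUSY is retried
    # instead of propagating straight out of entry_ops and killing the worker.
    return errno_wrap(os.lstat, [e], (ENOENT,), (ESTALE, EBUSY))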
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.2, please open a new bug report.