+++ This bug was initially created as a clone of Bug #1385589 +++

Description of problem:
=======================
While testing create and rename of directories in a loop, multiple crashes were found, as follows:

[root@dhcp37-177 Master]# grep -ri "OSError: " *
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/00000000-0000-0000-0000-000000000001/nfs_dir.426'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/22'
[root@dhcp37-177 Master]#

Master:
=======
Crash 1: [Errno 2] No such file or directory:
=============================================
[2016-10-16 17:35:06.867371] E [syncdutils(/rhs/brick2/b4):289:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'

Crash 2: [Errno 21] Is a directory
==================================
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'

These crashes are propagated from the slave as:
===============================================
[2016-10-16 17:31:06.800229] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
[2016-10-16 17:31:06.825957] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:31:06.826287] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:31:18.37847] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:31:23.391367] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:35:06.864521] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
[2016-10-16 17:35:06.884804] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:35:06.885364] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:35:17.967597] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:35:23.303258] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:46:21.666467] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
[2016-10-16 17:46:21.687004] I [repce(slave):92:service_loop] RepceServer: terminating on

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
Seen on a non-root fanout setup, but it should also be reproducible on a normal setup. The exact steps carried out:
1. Create the Master cluster (2 nodes) and the Slave cluster (4 nodes).
2. Create and start the Master volume and 2 Slave volumes (each 2x2).
3. Create a mount-broker geo-rep session between the master and the 2 slave volumes.
4. Mount the Master and Slave volumes (NFS and FUSE).
5. Create directories on the master and rename them, using loops such as:

for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
for i in {1..1000}; do mv dir.$i rename_dir.$i ; done
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done

Actual results:
===============
Worker crashes seen with Errno 2 and Errno 21.

Master:
=======
[root@dhcp37-58 ~]# gluster v info

Volume Name: Master
Type: Distributed-Replicate
Volume ID: a4dc4c5c-95d7-4c71-ad52-3bbe70fc7240
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.58:/rhs/brick1/b1
Brick2: 10.70.37.177:/rhs/brick1/b2
Brick3: 10.70.37.58:/rhs/brick2/b3
Brick4: 10.70.37.177:/rhs/brick2/b4
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off
cluster.enable-shared-storage: enable

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: cb7be148-8b85-43a2-837b-bb9d7de41a20
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.177:/var/lib/glusterd/ss_brick
Brick2: dhcp37-58.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.enable-shared-storage: enable
[root@dhcp37-58 ~]#

Slave:
======
[root@dhcp37-214 ~]# gluster v info

Volume Name: Slave1
Type: Distributed-Replicate
Volume ID: 928051ec-0177-4d13-b1cc-71d7783bfd95
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.214:/rhs/brick1/b1
Brick2: 10.70.37.62:/rhs/brick1/b2
Brick3: 10.70.37.214:/rhs/brick2/b3
Brick4: 10.70.37.62:/rhs/brick2/b4
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
cluster.enable-shared-storage: enable

Volume Name: Slave2
Type: Distributed-Replicate
Volume ID: 72c1006b-135f-4641-b2a1-a10a5a1ac12b
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.178:/rhs/brick1/b1
Brick2: 10.70.37.59:/rhs/brick1/b2
Brick3: 10.70.37.178:/rhs/brick2/b3
Brick4: 10.70.37.59:/rhs/brick2/b4
Options Reconfigured:
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
cluster.enable-shared-storage: enable
[root@dhcp37-214 ~]#

--- Additional comment from Aravinda VK on 2016-10-17 07:55:22 EDT ---

Crash 1: [Errno 2] No such file or directory:
This looks like two workers attempting the unlink at the same time (as part of a rename, during changelog reprocessing).

Solution: Handle the ENOENT and ESTALE errors during unlink, along the lines of the sketch below.

Crash 2: [Errno 21] Is a directory:
The "Is a directory" issue is fixed upstream by http://review.gluster.org/15132
Related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1365694#c3
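A minimal sketch of that handling, assuming a small wrapper around os.unlink; the helper name safe_unlink is illustrative only, not the actual change to entry_ops() in syncdaemon/resource.py:

import errno
import os


def safe_unlink(entry):
    # Unlink entry, tolerating races where another worker (e.g. changelog
    # reprocessing of a RENAME) has already removed it.
    try:
        os.unlink(entry)
    except OSError as e:
        if e.errno in (errno.ENOENT, errno.ESTALE):
            pass  # already gone; nothing left to do
        else:
            raise

With such a guard, a second worker replaying the same RENAME no longer brings the worker down; any other OSError still propagates. ESTALE is handled alongside ENOENT because a gfid-based path on the slave can report a stale handle for an entry that was just removed.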
REVIEW: http://review.gluster.org/15868 (geo-rep: Handle ENOENT during unlink) posted (#1) for review on master by Aravinda VK (avishwan)
COMMIT: http://review.gluster.org/15868 committed in master by Vijay Bellur (vbellur)
------
commit ecd6da0a754f21909dbbd8189228f5a27a15df3e
Author: Aravinda VK <avishwan>
Date:   Thu Nov 17 17:07:36 2016 +0530

    geo-rep: Handle ENOENT during unlink

    Do not raise traceback if a file/dir not exists
    during unlink or rmdir

    BUG: 1396062
    Change-Id: Idd43ca1fa6ae6056c3cd493f0e2f151880a3968c
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/15868
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Vijay Bellur <vbellur>
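The commit message names rmdir as well as unlink. A compact way to express the same tolerance for both operations (ignore_missing is a hypothetical wrapper, not the committed code) might be:

import errno
import os


def ignore_missing(op, path):
    # Run op(path), swallowing ENOENT/ESTALE, i.e. do not raise a
    # traceback if the file/dir no longer exists.
    try:
        op(path)
    except OSError as e:
        if e.errno not in (errno.ENOENT, errno.ESTALE):
            raise

# Usage: neither call should crash the worker if the entry has vanished.
# ignore_missing(os.unlink, entry)
# ignore_missing(os.rmdir, entry)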
This bug is being closed because a release is available that should address the reported issue. If the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/