Description of problem:
=======================
While testing the create and rename of directories in a loop, multiple crashes were found, as follows:

[root@dhcp37-177 Master]# grep -ri "OSError: " *
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave1.log:OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/00000000-0000-0000-0000-000000000001/nfs_dir.426'
ssh%3A%2F%2Fgeoaccount%4010.70.37.214%3Agluster%3A%2F%2F127.0.0.1%3ASlave2.log:OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/22'
[root@dhcp37-177 Master]#

Master:
=======
Crash 1: [Errno 2] No such file or directory:
=============================================
[2016-10-16 17:35:06.867371] E [syncdutils(/rhs/brick2/b4):289:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'

Crash 2: [Errno 21] Is a directory
==================================
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 203, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 743, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1532, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 992, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 933, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'

These crashes are propagated from slave as:
===========================================
[2016-10-16 17:31:06.800229] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/957'
[2016-10-16 17:31:06.825957] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:31:06.826287] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:31:18.37847] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:31:23.391367] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:35:06.864521] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 2] No such file or directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1189'
[2016-10-16 17:35:06.884804] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
[2016-10-16 17:35:06.885364] I [syncdutils(slave):230:finalize] <top>: exiting.
[2016-10-16 17:35:17.967597] I [gsyncd(slave):733:main_i] <top>: syncing: gluster://localhost:Slave1
[2016-10-16 17:35:23.303258] I [resource(slave):914:service_loop] GLUSTER: slave listening
[2016-10-16 17:46:21.666467] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 784, in entry_ops
    os.unlink(entry)
OSError: [Errno 21] Is a directory: '.gfid/38b9984f-d602-40b9-8c6e-06d33204627e/1823'
[2016-10-16 17:46:21.687004] I [repce(slave):92:service_loop] RepceServer: terminating on

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.8.4-2.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-2.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
Seen on a non-root fanout setup, but it should also be reproducible on a normal setup. The exact steps carried out:
1. Create a Master cluster (2 nodes) and a Slave cluster (4 nodes)
2. Create and start the Master and 2 Slave volumes (each 2x2)
3. Create a mount-broker geo-rep session between the master and the 2 slave volumes
4. Mount the Master and Slave volumes (NFS and FUSE)
5. Create directories on the master and rename them:
for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done
for i in {1..1000}; do mv dir.$i rename_dir.$i; done
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done

Actual results:
===============
Worker crashes seen with Errno 2 and 21
Crash 1: [Errno 2] No such file or directory:
This looks like two workers attempting the unlink at the same time (as part of a rename, while the changelog is being reprocessed).
Solution: Handle ENOENT and ESTALE errors during unlink (see the sketch below).

Crash 2: [Errno 21] Is a directory:
The "Is a directory" issue is fixed upstream by http://review.gluster.org/15132
Related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1365694#c3
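For illustration only, a minimal sketch of the kind of errno-tolerant entry removal described above. This is not the actual upstream patch; the helper name safe_unlink and its exact error handling are assumptions:

import errno
import os


def safe_unlink(entry):
    """Hypothetical helper: remove an entry, tolerating races with another worker.

    ENOENT/ESTALE mean some other worker already removed the entry while an
    overlapping changelog was being reprocessed, so they are ignored.
    EISDIR means the entry is a directory, so rmdir is used instead of unlink.
    """
    try:
        os.unlink(entry)
    except OSError as e:
        if e.errno in (errno.ENOENT, errno.ESTALE):
            # Already gone: the other worker won the race; nothing to do.
            pass
        elif e.errno == errno.EISDIR:
            try:
                os.rmdir(entry)
            except OSError as e2:
                if e2.errno in (errno.ENOENT, errno.ESTALE):
                    pass
                else:
                    raise
        else:
            raise

Ignoring ENOENT/ESTALE makes the removal idempotent, so two workers replaying overlapping changelogs can both issue the unlink without crashing; falling back to rmdir covers the "Is a directory" case that the upstream patch targets.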
Upstream Patches:
http://review.gluster.org/15132
http://review.gluster.org/15868
Patches sent to downstream:
https://code.engineering.redhat.com/gerrit/91362
https://code.engineering.redhat.com/gerrit/91363
Release 3.8 patch: http://review.gluster.org/15939
Release 3.9 patch: http://review.gluster.org/15940
Given that verification of this BZ is blocked on BZ 1427870, and considering that neither of these BZs is a release blocker, all stakeholders, as part of the blocker bug triage and rhgs-3.2.0 bug status check exercise, agreed to drop this bug from the 3.2.0 release and to take up verification of this BZ as well as BZ 1427870 in rhgs-3.3.0. With that, resetting the flags.
Verified with build: glusterfs-geo-replication-3.8.4-32.el7rhgs.x86_64

The use case mentioned in the description was carried out with the following data sets:

Set 1:
------
#client 1
for i in {1..1999}; do mkdir $i ; sleep 1 ; mv $i rs.$i ; done

#client 2
mkdir dir.{1..1000}
for i in {1..1000}; do mv dir.$i rename_dir.$i; done

#client 3
for i in {1..500}; do mkdir h.$i ; mv h.$i rsh.$i ; done

#client 4
for i in {1..1999}; do mkdir rochelle.$i ; sleep 1 ; mv rochelle.$i allan.$i ; done

Set 2:
------
#client 1
for i in {1..1999}; do mkdir volks.$i ; sleep 1 ; mv volks.$i weagan.$i ; done

#client 2
touch Sun{1..1000}
for i in {1..1000}; do mv Sun.$i Moon.$i; done

#client 3
for i in {1..500}; do mkdir Flash.$i ; mv Flash.$i Red.$i ; done

#client 4
for i in {1..1999}; do touch brother.$i ; sleep 1 ; mv brother.$i sister.$i ; done

No worker crash was seen; moving this bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774