Description of problem:
=======================
While performing rm -rf on a cascaded setup, a worker crash was found on the primary master and the intermediate master volume, with the following tracebacks:

Master Volume:
==============
[2016-06-11 09:41:17.359086] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'

Intermediate Master:
====================
[2016-06-11 09:41:51.681622] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'
[2016-06-11 09:41:51.684969] I [syncdutils(/rhs/brick1/b1):220:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-10

How reproducible:
=================
Always, on a cascaded setup upon remove (rm -rf)

Steps to Reproduce:
===================
1. Create a geo-rep cascaded setup with (vol0, vol1, vol2), such that vol0 => vol1 and vol1 => vol2.
2. Mount the vol0 volume and perform fops such as cp, create, chmod, chown, chgrp, symlink, hardlink and truncate on vol0.
3. Let the data sync to the slaves (vol1) and (vol2).
4. Calculate the arequal checksum after every fop. It should match.
5. Perform rm -rf on vol0.

Actual results:
===============
Worker crashed on vol1 and vol0 with a KeyError.

Expected results:
=================
The worker shouldn't crash.

Additional info:
================
Performed rm -rf on a non-cascaded setup and did not see the crash. Also, the files are eventually removed from the master and all slaves.
Upstream Patch posted. http://review.gluster.org/#/c/14706/
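For context on the tracebacks above: Python's set.remove() raises KeyError when the element is absent, while set.discard() is a no-op in that case. The snippet below is only a minimal sketch of that difference, using the GFID from the log; it is not the actual upstream patch (see the review link above for the real change).

    # Minimal illustration, assuming datas_in_batch is a Python set of
    # ".gfid/<uuid>" strings, as the traceback suggests.
    datas_in_batch = {".gfid/aaaa", ".gfid/bbbb"}

    unlinked_gfid = ".gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988"

    # set.remove() raises KeyError for a missing element -- this is the
    # unguarded call that kills the worker:
    try:
        datas_in_batch.remove(unlinked_gfid)
    except KeyError:
        print("remove() raised KeyError for", unlinked_gfid)

    # set.discard() is the defensive variant: an UNLINK for a GFID that was
    # never queued in this batch simply becomes a no-op.
    datas_in_batch.discard(unlinked_gfid)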
Hello Aravinda,

The customer is still saying that the files are still not renamed.

From them:

It looks like whatever rename process should have taken place did not. The files are still in the limbo state. What are some next steps I can take? If I mount the slave bricks RW and rename the files to match the master, will I create an inconsistent state that cannot be recovered from?

Thanks & Regards

Oonkwee
Emerging Technologies
RedHat Global Support
(In reply to Oonkwee Lim_ from comment #8)
> Hello Aravinda,
>
> The customer is still saying that the files are still not renamed.
>
> From them:
>
> It looks like whatever rename process should have taken place did not.
> The files are still in the limbo state. What are some next steps I can take?
>
> If I mount the slave bricks RW and rename the files to match the master,
> will I create an inconsistent state that cannot be recovered from?
>
> Thanks & Regards
>
> Oonkwee
> Emerging Technologies
> RedHat Global Support

It looks like the files which are in the limbo state are due to errors seen previously (before the upgrade).

The safe workaround is:
- Delete the problematic file on the Slave.
- Trigger a resync of the file using a virtual setxattr on the Master mount:

  cd $MASTER_MOUNT/
  setfattr -n glusterfs.geo-rep.trigger-sync -v "1" <file-path-in-master-mount>
The virtual setxattr (glusterfs.geo-rep.trigger-sync) is similar to the touch command, but in a form that Geo-replication can understand. It should be set on each file or directory that needs a resync. If the problematic files are not deleted from the Slave volume first, resyncing may run into errors (in both options).
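Since the xattr has to be set per file or directory, a small wrapper can help when many paths need a resync. The following is a minimal sketch, not part of GlusterFS: it assumes the problematic paths are passed on the command line (under the master mount) and issues the same setxattr call that the setfattr command above performs.

    #!/usr/bin/env python3
    # Hypothetical helper: apply the trigger-sync virtual xattr to each path
    # given on the command line. Paths are assumed to live on the master mount.
    import os
    import sys

    TRIGGER_XATTR = "glusterfs.geo-rep.trigger-sync"

    def trigger_resync(paths):
        for path in paths:
            try:
                # Equivalent to: setfattr -n glusterfs.geo-rep.trigger-sync -v "1" <path>
                os.setxattr(path, TRIGGER_XATTR, b"1")
                print("resync triggered:", path)
            except OSError as err:
                print("failed:", path, err, file=sys.stderr)

    if __name__ == "__main__":
        trigger_resync(sys.argv[1:])

Run it from the master mount (for example: python3 trigger_resync.py dir1/file1 dir2/file2) after the corresponding files have been removed from the Slave.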
Post glusterfs.geo-rep.trigger-sync update:

The geo-repl status since performing this operation has been in a Crawl Status of 'History Crawl' and I can see that LAST_SYNCED is advancing, albeit at a snail's pace.

Is there any way to gauge where in the process it might be?
(In reply to Oonkwee Lim_ from comment #13)
> Post glusterfs.geo-rep.trigger-sync update:
>
> The geo-repl status since performing this operation has been in a Crawl
> Status of 'History Crawl' and I can see that LAST_SYNCED is advancing,
> albeit at a snail's pace.
>
> Is there any way to gauge where in the process it might be?

History Crawl will process historical changelogs until it reaches the worker start time (the worker register time can be found in the respective worker's log). Once it crosses the register time, it starts consuming live changelogs. We do not have a way to estimate the pending sync time, since Geo-rep has to reprocess all the changelogs up to the current time.
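As a rough external check (not a Geo-rep feature), the catch-up can be extrapolated from two readings of LAST_SYNCED taken from the geo-replication status detail output some time apart. The sketch below is plain arithmetic over two manually noted (wall clock, LAST_SYNCED) pairs; the timestamps shown are made-up examples.

    #!/usr/bin/env python3
    # Back-of-the-envelope catch-up estimate from two status observations.
    from datetime import datetime

    FMT = "%Y-%m-%d %H:%M:%S"

    def estimate_catch_up(obs1, obs2, fmt=FMT):
        """Each observation is (wall_clock, last_synced) as strings."""
        w1, s1 = (datetime.strptime(t, fmt) for t in obs1)
        w2, s2 = (datetime.strptime(t, fmt) for t in obs2)

        synced = (s2 - s1).total_seconds()    # changelog time processed
        elapsed = (w2 - w1).total_seconds()   # wall-clock time spent
        if elapsed <= 0 or synced <= elapsed:
            return None                       # not gaining ground; no estimate

        lag = (w2 - s2).total_seconds()       # how far LAST_SYNCED still trails "now"
        rate = synced / elapsed               # changelog seconds per wall-clock second
        return lag / (rate - 1.0)             # wall-clock seconds until the lag closes

    if __name__ == "__main__":
        eta = estimate_catch_up(
            ("2016-06-20 10:00:00", "2016-06-12 08:00:00"),
            ("2016-06-20 11:00:00", "2016-06-12 14:00:00"),
        )
        if eta is None:
            print("no estimate: sync is not gaining on the backlog")
        else:
            print("estimated catch-up in ~%.1f hours" % (eta / 3600.0))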
Upstream mainline : http://review.gluster.org/14706
Upstream 3.8      : http://review.gluster.org/14767

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
*** Bug 1400765 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html