Created attachment 863569 [details]
geo-rep logs from one Active node
Description of problem:
While running geo-replication, one of the gsyncd processes on the master crashed with an OSError. Later, a few file deletes were not synced to the slave, and I believe the crash is the reason for that.
Version-Release number of selected component (if applicable):
[root@mustang ~]# gluster --version
glusterfs 3.5.0beta3 built on Feb 15 2014 13:24:29
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
How reproducible:
Once out of one try.
Steps to Reproduce:
1. Create and start a geo-rep session from a 2*2 distributed-replicate master volume to a 2*2 distributed-replicate slave volume (session setup sketched below, after the steps).
2. Now create some files (untar linux kernel)
3. Wait for the crash to happen on one of the nodes.
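For reference, step 1's session setup corresponds to gluster CLI calls along the following lines. This is only a hedged sketch driven from Python; the volume names and slave host are hypothetical, not from this report:

# Hedged repro sketch for step 1; 'master-vol', 'slave-host' and
# 'slave-vol' are hypothetical names.
import subprocess

def gluster(*args):
    subprocess.check_call(('gluster',) + args)

gluster('volume', 'geo-replication', 'master-vol',
        'slave-host::slave-vol', 'create', 'push-pem')
gluster('volume', 'geo-replication', 'master-vol',
        'slave-host::slave-vol', 'start')

The crash then shows up in the geo-rep log on one of the Active nodes: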
[2014-02-15 14:01:47.166320] W [master(/rhs/bricks/brick2):877:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1392452987
[2014-02-15 14:02:56.410913] E [syncdutils(/rhs/bricks/brick2):240:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/local/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
File "/usr/local/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 542, in main_i
local.service_loop(*[r for r in [remote] if r])
File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 1176, in service_loop
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 467, in crawlwrap
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 1137, in crawl
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 884, in upd_stime
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 658, in sendmark
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 152, in set_slave_xtime
self.slave.server.set_stime(path, self.uuid, mark)
File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 1163, in <lambda>
slave.server.set_stime = types.MethodType(lambda _self, path, uuid, mark: brickserver.set_stime(path, uuid + '.' + gconf.slave_id, mark), slave.server)
File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 299, in ff
File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 496, in set_stime
Xattr.lsetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'stime']), struct.pack('!II', *mark))
File "/usr/local/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 66, in lsetxattr
File "/usr/local/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
[2014-02-15 14:02:56.412710] I [syncdutils(/rhs/bricks/brick2):192:finalize] <top>: exiting.
[2014-02-15 14:02:56.428134] I [monitor(monitor):81:set_state] Monitor: new state: faulty
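For context, the failing frame (resource.py:496) stamps a per-session stime xattr on a directory. Below is a minimal sketch of that operation, using Python 3's os.setxattr as a stand-in for the ctypes-based Xattr.lsetxattr in libcxattr.py; the ENOENT-tolerant handling at the end is only an assumed mitigation, not the shipped code:

# Sketch of what set_stime does (resource.py:496), not the actual code.
# os.setxattr with follow_symlinks=False is the lsetxattr(2) equivalent
# of the ctypes wrapper gsyncd uses.
import errno
import os
import struct

GX_NSPACE = 'trusted.glusterfs'   # xattr namespace on privileged mounts

def set_stime(path, uuid, mark):
    key = '.'.join([GX_NSPACE, uuid, 'stime'])
    try:
        # mark is a (sec, nsec) pair, packed big-endian as two uint32s
        os.setxattr(path, key, struct.pack('!II', *mark),
                    follow_symlinks=False)
    except OSError as e:
        # Assumed mitigation: if the directory was deleted between the
        # crawl and this call, there is nothing left to stamp; skipping
        # instead of raising would avoid the crash above.
        if e.errno != errno.ENOENT:
            raise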
Expected results:
There should be no crash.
I saw this crash twice on one node and once on another node. They recovered by themselves, but after the session came back online I did some deletes on the master which did not get synced to the slave, and I suspect this crash is the reason. I will do some more checks and file a separate report if it turns out to be caused by a different bug.
Will upload those files.
Created attachment 863570 [details]
gsync log from another "Active" node
Is there anything in the geo-rep mount logs? Please attach those too.
From the logs, geo-replication is running with xsync as the main crawl mode.
It seems a brick was killed (or crashed), which caused the geo-replication daemon to fall back to xsync mode. Since the xsync crawl cannot handle deletes (or renames), the deletes were not propagated to the slave; see the toy illustration after this comment.
For the OSError backtrace, I would need the geo-rep mount logs.
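To make the xsync limitation concrete, here is a toy illustration (not gsyncd code) of a filesystem-crawl sync: it can only visit entries that still exist on the master, so a delete never produces any action on the slave.

# Toy illustration, not gsyncd code: a crawl-based sync only sees what
# still exists, so deletes on the master leave stale copies on the slave.
import os

def xsync_style_crawl(master_root, sync_one):
    for dirpath, dirnames, filenames in os.walk(master_root):
        for name in filenames:
            sync_one(os.path.join(dirpath, name))
    # Nothing here can emit an UNLINK for a path already gone from
    # master_root; the changelog crawl, by contrast, replays a journal
    # of operations (CREATE/UNLINK/RENAME) and so can propagate deletes.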
The deletes were done only after all the data had been synced, and at that point all the gsyncd processes were up.
The test case is very simple and reproducible every time. I don't have the setup now; you should be able to reproduce it in the dev environment itself.
The pre-release version is ambiguous and about to be removed as a choice.
If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.