Created attachment 863569 [details]
geo-rep logs from one Active node
Description of problem:
While running geo-replication, one of the gsyncd processes on the master crashed with an OSError. Later, a few file deletes were not synced to the slave, and I believe the crash is the reason for that.
Version-Release number of selected component (if applicable):
[root@mustang ~]# gluster --version
glusterfs 3.5.0beta3 built on Feb 15 2014 13:24:29
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
How reproducible:
Once out of one try.
Steps to Reproduce:
1. Create and start a geo-rep session from a 2*2 distributed-replicate master volume to a 2*2 distributed-replicate slave volume (session setup sketched below, after the steps).
2. Now create some files (untar linux kernel)
3. Wait for the crash to happen on one of the nodes.
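For reference, step 1's session setup corresponds to gluster CLI calls along the following lines. This is only a hedged sketch driven from Python; the volume names and slave host are hypothetical, not from this report:

# Hedged repro sketch for step 1; 'master-vol', 'slave-host' and
# 'slave-vol' are hypothetical names.
import subprocess

def gluster(*args):
    subprocess.check_call(('gluster',) + args)

gluster('volume', 'geo-replication', 'master-vol',
        'slave-host::slave-vol', 'create', 'push-pem')
gluster('volume', 'geo-replication', 'master-vol',
        'slave-host::slave-vol', 'start')

The crash then shows up in the geo-rep log on one of the Active nodes: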
[2014-02-15 14:01:47.166320] W [master(/rhs/bricks/brick2):877:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1392452987
[2014-02-15 14:02:56.410913] E [syncdutils(/rhs/bricks/brick2):240:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/local/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
File "/usr/local/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 542, in main_i
local.service_loop(*[r for r in [remote] if r])
File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 1176, in service_loop
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 467, in crawlwrap
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 1137, in crawl
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 884, in upd_stime
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 658, in sendmark
File "/usr/local/libexec/glusterfs/python/syncdaemon/master.py", line 152, in set_slave_xtime
self.slave.server.set_stime(path, self.uuid, mark)
File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 1163, in <lambda>
slave.server.set_stime = types.MethodType(lambda _self, path, uuid, mark: brickserver.set_stime(path, uuid + '.' + gconf.slave_id, mark), slave.server)
File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 299, in ff
File "/usr/local/libexec/glusterfs/python/syncdaemon/resource.py", line 496, in set_stime
Xattr.lsetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'stime']), struct.pack('!II', *mark))
File "/usr/local/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 66, in lsetxattr
File "/usr/local/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
[2014-02-15 14:02:56.412710] I [syncdutils(/rhs/bricks/brick2):192:finalize] <top>: exiting.
[2014-02-15 14:02:56.428134] I [monitor(monitor):81:set_state] Monitor: new state: faulty
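For context, the failing frame (resource.py:496) stamps a per-session stime xattr on a directory. Below is a minimal sketch of that operation, using Python 3's os.setxattr as a stand-in for the ctypes-based Xattr.lsetxattr in libcxattr.py; the ENOENT-tolerant handling at the end is only an assumed mitigation, not the shipped code:

# Sketch of what set_stime does (resource.py:496), not the actual code.
# os.setxattr with follow_symlinks=False is the lsetxattr(2) equivalent
# of the ctypes wrapper gsyncd uses.
import errno
import os
import struct

GX_NSPACE = 'trusted.glusterfs'   # xattr namespace on privileged mounts

def set_stime(path, uuid, mark):
    key = '.'.join([GX_NSPACE, uuid, 'stime'])
    try:
        # mark is a (sec, nsec) pair, packed big-endian as two uint32s
        os.setxattr(path, key, struct.pack('!II', *mark),
                    follow_symlinks=False)
    except OSError as e:
        # Assumed mitigation: if the directory was deleted between the
        # crawl and this call, there is nothing left to stamp; skipping
        # instead of raising would avoid the crash above.
        if e.errno != errno.ENOENT:
            raise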
Expected results:
There should be no crash.
I saw this crash twice on one node and once on another node. They recovered by themselves, but after the session came back online I did some deletes on the master which did not get synced to the slave, and I suspect this crash is the reason. I will do some more checks and file a separate report if it turns out to be caused by a different bug.
Will upload those files.
Created attachment 863570 [details]
gsync log from another "Active" node
Is there anything in the geo-rep mount logs? Please attach those too.
From the logs, geo-replication is running with xsync as the main crawl mode.
It seems a brick was killed (or crashed), which caused the geo-replication daemon to fall back to xsync mode. Since the xsync crawl cannot handle deletes (or renames), the deletes were not propagated to the slave; see the toy illustration after this comment.
For the OSError backtrace, I would need the geo-rep mount logs.
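To make the xsync limitation concrete, here is a toy illustration (not gsyncd code) of a filesystem-crawl sync: it can only visit entries that still exist on the master, so a delete never produces any action on the slave.

# Toy illustration, not gsyncd code: a crawl-based sync only sees what
# still exists, so deletes on the master leave stale copies on the slave.
import os

def xsync_style_crawl(master_root, sync_one):
    for dirpath, dirnames, filenames in os.walk(master_root):
        for name in filenames:
            sync_one(os.path.join(dirpath, name))
    # Nothing here can emit an UNLINK for a path already gone from
    # master_root; the changelog crawl, by contrast, replays a journal
    # of operations (CREATE/UNLINK/RENAME) and so can propagate deletes.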
The deletes were done only after all the data had been synced, and at that point all the gsyncd processes were up.
The test case is very simple and reproducible every time. I don't have the setup now; you should be able to reproduce it in the dev environment itself.
The pre-release version is ambiguous and about to be removed as a choice.
If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.