rsync is used by geo-replication in the following way:

    rsync -sS -aR ./changed_file /proc/{pid}/cwd

This leads rsync to assume that source and target are on the same host. The delta-transfer algorithm is therefore disabled, which means that if only one block of a file has changed, rsync has to transfer the whole file again.

In some cases this behaviour is not tolerable, e.g. when storing encrypted containers inside glusterfs. Every small change inside such a container would lead to massive traffic.

rsync has a parameter "--no-whole-file" which forces the use of the delta-transfer algorithm. I added this parameter to the default value of --rsync-extra in gsyncd.py, but it made things even worse.

The reason is that there is no rsync running on the target host. Instead, I think the source host accesses the file via glusterfs/NFS/whatever, and rsync on the source host then tries to find the differences. The whole file therefore has to be transferred to the source host, and the differences have to be transferred back to the target host. So this was not a good idea...

I think a way should be found for geo-replication to use rsync's delta-transfer algorithm.
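To put a number on the traffic concern above, here is a minimal Python sketch comparing a whole-file transfer with a block-wise one. It is only an illustration: it compares fixed-size blocks at identical offsets, whereas rsync's real delta-transfer algorithm uses rolling checksums and picks its own block size. The names `block_digests`, `bytes_to_send` and `BLOCK_SIZE` are made up for this example.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative value only; rsync chooses its own block size


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Return one digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def bytes_to_send(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> int:
    """Count the payload bytes of blocks in `new` that differ from `old`.

    Simplification: blocks are compared at the same offset only, so an
    insertion that shifts data would mark every following block as changed.
    """
    old_digests = block_digests(old, block_size)
    total = 0
    for index, offset in enumerate(range(0, len(new), block_size)):
        block = new[offset:offset + block_size]
        unchanged = (
            index < len(old_digests)
            and hashlib.sha256(block).digest() == old_digests[index]
        )
        if not unchanged:
            total += len(block)
    return total


old = bytes(1024 * 1024)        # a 1 MiB "container" full of zeros
changed = bytearray(old)
changed[500_000] = 0xFF         # a single byte changes inside the container
new = bytes(changed)

whole_file = len(new)           # what a whole-file transfer would send
delta = bytes_to_send(old, new) # what a block-wise transfer would send
print(whole_file, delta)        # prints: 1048576 4096
```

With these toy numbers, a one-byte change costs one 4096-byte block instead of the full 1 MiB, which is the saving that is lost when rsync falls back to whole-file mode.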
(In reply to comment #0)
> rsync is used by geo-replication the following way:
>
> rsync -sS -aR ./changed_file /proc/{pid}/cwd
>
> This leads rsync to the assumption that source and target are on the same host.
> Therefore the delta-transfer algorithm is disabled which means that if only one
> block of a file has changed rsync has to transfer the whole file again.
>
> In some cases this behaviour is not tolerable, e.g. when storing encrypted
> containers inside of glusterfs. Every small change inside of this container
> would lead to massive traffic.
>
> rsync has a parameter "--no-whole-file" which forces the use of delta-transfer
> algorithm. I added this parameter to the default value of --rsync-extra in
> gsyncd.py, but it made things even worse.
>
> The reason is that on the target host there is no rsync running. Instead I
> think the source host accesses the file via glusterfs/NFS/whatever and now
> rsync on the source host tries to find differences. Therefore the whole file
> has to be transferred to the source host and the differences have to be
> transferred back to the target host. So this was no good idea...
>
> I think a way should be found how geo-replication can use rsync's delta-transfer
> algorithm.

Parametrizing rsync is a matter of configuration. You are advised not to fiddle with gsyncd.py, but to use

    gluster volume geo-replication <master> <slave> config <key> <value>

(understandably, you may be annoyed by the fact that this way you tune only a particular geo-rep session, while hacking the code has a system-wide effect; however, 3.3 will support glob-style wildcards for master and slave, so system-wide configuration will be possible in this manner).

To make geo-rep effective, the local resources (gluster volume, paths) involved in the setup should really be "localish", ie. either be hosted on the machine you use or, if it is a networked resource, be reachable over a fast link (intranet link or better).
If your geo-rep slave is a local path which is in fact a network mount over an insufficiently fast link, then you are advised to switch to a slave host which can access that resource fast, and to access that slave through ssh (ssh:// url).

Our philosophy is reflected in the etymology: the name of gsyncd, the daemon behind geo-rep, is obtained from rsync by changing "r" to "g" and appending a "d". The first change refers to gsyncd being gluster-aware (it uses some glusterfs-specific hints in the fs to efficiently locate unsynced changes); the trailing "d" refers to it being a daemon, ie. doing continuous rather than one-shot synchronization. In all other matters, our operation logic is that of rsync.

If doing rsync from /foo to /bar is ineffective because /bar is a network mount over a slow link, then we do not expect geo-rep to be any better; and the same fix applies: specify the target url so that it explicitly instructs rsync/gsyncd to access the resource behind /bar through ssh.
Thank you for your answer. You're absolutely right, it is all a question of configuration. My mistake was that I created a volume on the slave host and started the replication by doing:

    gluster volume geo-replication vol-name slave:repl-vol-name start

This leads to the described behavior and to the fact that rsync's delta-transfer cannot be used efficiently.

Instead, I have now started the replication using:

    gluster volume geo-replication vol-name slave:/path/to/replication start

Now it connects using ssh and transfers the files using rsync on both master and slave. So this bug can be closed as invalid.

But I do not really know how to parametrize rsync without changing the code. "gluster volume geo-replication <master> <slave> config" shows the following keys that can be changed:

- gluster_log_file
- ssh_command
- session_owner
- remote_gsyncd
- state_file
- pid_file
- log_file
- gluster_command

I need to change the parameter --rsync-extra of gsyncd on the master host, but there is only the config key remote_gsyncd. I know this is a bug tracker and not a support forum, but maybe you can clarify that.
(In reply to comment #2)
> Thank you for your answer. You're absolutely right, it is all a question of
> configuration. My fault was that I created a volume on the slave host and
> started the replication by doing:
>
> gluster volume geo-replication vol-name slave:repl-vol-name start
>
> This leads to the described behavior and to the fact that rsync's
> delta-replication cannot be used efficiently.
>
> Instead I now started the replication by using:
>
> gluster volume geo-replication vol-name slave:/path/to/replication start
>
> Now it connects using ssh and transfers the files using rsync on master and
> slave. So this bug can be closed as invalid.

You can do better!

    slave:repl-vol-name

is shorthand for

    gluster://slave:repl-vol-name

ie. use volume 'repl-vol-name' of the gluster service on 'slave' via the gluster protocol.

    slave:/path/to/replication

is shorthand for

    ssh://root@slave:/path/to/replication

ie. use path '/path/to/replication' accessed through an ssh tunnel to 'slave' (and thus it presupposes that 'repl-vol-name' is mounted on '/path/to/replication' to make these two match).

Indeed, you can wrap the gluster protocol as such in ssh, no need to set up a mount manually -- then the full url would look like:

    ssh://root@slave:gluster://localhost:repl-vol-name

ie. on the remote side you use the gluster proto to access 'repl-vol-name' (which is local access over there, thus we get the inner url 'gluster://localhost:repl-vol-name') and wrap it in ssh (hence it is to be prefixed with 'ssh://root@slave:').

Now the inner url can be shortened to ':repl-vol-name' (by omitting the proto and localhost [the default host to connect to]), and the outer part can be shortened to 'slave:' (by omitting the proto and username [as it's the same on both ends]), thus we arrive at the abbreviated form:

    slave::repl-vol-name

I suggest you use this as the slave url. (Cf.
http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Managing_GlusterFS_Geo-replication on urls used in geo-rep; yes, the shortened forms are unambiguous, but I won't digress into the formal syntax here [nor does the referenced doc].)

> But I do not really know how to parametrize rsync without changing the code.
> "gluster volume geo-replication <master> <slave> config" shows the following
> keys that can be changed:
>
> - gluster_log_file
> - ssh_command
> - session_owner
> - remote_gsyncd
> - state_file
> - pid_file
> - log_file
> - gluster_command
>
> I need to change the parameter --rsync-extra of gsyncd on the master host, but
> there is only the config key remote_gsyncd. I know this is a bugtracker and no
> support forum, but maybe you can clarify that.

The parameters you cite are the ones which do have a value set. The whole set of tunables which can be set by the user is different: some of the above, like 'pid-file' [*], are read-only; some of the user tunables are not listed because they are not set.

Those tunables which are part of the command line interface are listed in the documentation:

http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_gluster_Command

Now there are some tunables -- like 'rsync-extra' -- which are actually possible to set, but are considered internal (we can change the way they work, or drop them at any time), so you are not encouraged to rely on them. If you want to add an option to rsync, the supported way is to use 'rsync-command', eg.

    gluster volume geo-replication <master> <slave> config rsync-command 'rsync --no-whole-file'

[*] '-' and '_' are interchangeable in names of tunables
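The shorthand rules explained in this comment can be sketched in Python as follows. This is not gsyncd's actual parser, just a hypothetical helper (`expand_slave_url`) that applies the expansions exactly as described above; the default user `root` is taken from the example urls and is an assumption, since the comment says the username defaults to the one used on the master side.

```python
def expand_slave_url(short: str, user: str = "root") -> str:
    """Expand an abbreviated slave url using the rules described in the comment.

    Hypothetical helper for illustration; `user` defaults to "root" only
    because the example urls in the thread use it.
    """
    if "://" in short:
        return short  # already a full url, nothing to expand
    host, sep, rest = short.partition(":")
    if not sep:
        return short  # no ':' at all, e.g. a plain local path; left as-is
    if not host:
        # ':vol' -> gluster proto, localhost being the default host
        return f"gluster://localhost:{rest}"
    if rest.startswith(":"):
        # 'host::vol' -> inner gluster url wrapped in ssh
        return f"ssh://{user}@{host}:{expand_slave_url(rest, user)}"
    if rest.startswith("/"):
        # 'host:/path' -> path accessed through ssh
        return f"ssh://{user}@{host}:{rest}"
    # 'host:vol' -> gluster proto straight to the remote volume
    return f"gluster://{host}:{rest}"


for spec in ("slave:repl-vol-name",
             "slave:/path/to/replication",
             "slave::repl-vol-name"):
    print(f"{spec:28} -> {expand_slave_url(spec)}")
```

The third form is the one recommended above: the master talks to the slave host over ssh, and the volume is accessed locally on that host.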
Thank you very much for that detailed explanation. I should have read that note in the documentation more carefully...

Just another short note, in case somebody ever tries this:

If you add the --inplace parameter to rsync-command, replication fails because you cannot combine --inplace and -S. Of course, you can solve that problem by changing rsync-extra.
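For anyone who wants to catch this before restarting a session, a small Python check along these lines might help. It is a sketch under assumptions: `rsync_flags` and `inplace_sparse_conflict` are made-up helper names, the `-sS` value for rsync-extra is taken from the rsync command line quoted in the first comment, and whether rsync actually rejects the combination may depend on the rsync version installed.

```python
import shlex


def rsync_flags(command: str) -> set[str]:
    """Collect the flags from an rsync command string or option string.

    Bundled short options such as '-sS' are split into '-s' and '-S'.
    Simplification: short options that take an argument (e.g. '-e ssh')
    are not treated specially.
    """
    flags: set[str] = set()
    for token in shlex.split(command):
        if token.startswith("--"):
            flags.add(token.split("=", 1)[0])
        elif token.startswith("-") and len(token) > 1:
            flags.update("-" + ch for ch in token[1:])
    return flags


def inplace_sparse_conflict(rsync_command: str, rsync_extra: str) -> bool:
    """Return True if the combined options contain both --inplace and -S/--sparse."""
    flags = rsync_flags(rsync_command) | rsync_flags(rsync_extra)
    sparse = "-S" in flags or "--sparse" in flags
    return "--inplace" in flags and sparse


# rsync-extra assumed to be '-sS', as in the command line quoted in comment #0
print(inplace_sparse_conflict("rsync --inplace", "-sS"))        # prints: True
print(inplace_sparse_conflict("rsync --no-whole-file", "-sS"))  # prints: False
```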
(In reply to comment #4)
> Just another short note, if somebody ever tries:
>
> If you add the --inplace parameter to rsync-command, replication fails because
> you cannot combine --inplace and -S. Of course you can solve that problem by
> changing rsync-extra.

A valid point indeed!