Description of problem: ======================= had setup a georep session b/w a 4x3 master volume and 4x(4+2) ec volume. I see below issues in my test-bed 1) the volume has two main directories, called IOs and logs, with IOs directory being the place where all the workloads related IOs are happpening. logs directory is hosting a dedicated file for each client which is collecting the resource output every few minutes in append mode. The problem is till now, ie after about 3 days, the logs directory hasn't even been created 2) the syncing has been very slow paced, even after 3 days, slave is yet to catch up. Master had about 1.1TB data while slave has just about 350gb of data 3) I have seen some tracebacks in gsync log as below /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714 [2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FA IL: Traceback (most recent call last): File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap tf(*aargs) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob po = self.sync_engine(pb, self.log_err) File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync rconf.ssh_ctl_args + \ AttributeError: 'NoneType' object has no attribute 'split' [2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF. [2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty [2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing... [root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log [2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster- brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888 [2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FA IL: Traceback (most recent call last): File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap tf(*aargs) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob po = self.sync_engine(pb, self.log_err) File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync rconf.ssh_ctl_args + \ AttributeError: 'NoneType' object has no attribute 'split' [2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF. Version-Release number of selected component (if applicable): ===================== 6.0.7 rsync-3.1.2-6.el7_6.1.x86_64 Steps to Reproduce: ==================== note: no brickmux enabled 1. created a 4x3 volume on 4 nodes , with below volume settings, which will act as master in georep Volume Name: nonfuncvol Type: Distributed-Replicate Volume ID: 4d44936f-312d-431a-905d-813e8ee63668 Status: Started Snapshot Count: 1 Number of Bricks: 4 x 3 = 12 Transport-type: tcp Bricks: Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01 Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01 Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01 Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02 Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02 Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02 Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03 Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03 Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03 Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04 Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04 Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04 Options Reconfigured: changelog.changelog: on geo-replication.ignore-pid-check: on geo-replication.indexing: on features.barrier: disable cluster.shd-max-threads: 24 client.event-threads: 8 server.event-threads: 8 transport.address-family: inet storage.fips-mode-rchecksum: on nfs.disable: on performance.client-io-threads: off cluster.enable-shared-storage: enable 2. mounted the volume on 10 clients, started capturing resource info on clients 3. created another 3node cluster to be used as slave, with a 4x(4+2) ecvol as slave 4. started IOs on clients of master, just linux untar 50 times from all clients 5. setup georep from master->slave 6. started georep only after about 4hrs so that master has some data to propagate. 7. left the setup for weekend. Actual results: =================== seen below issues 1) the volume has two main directories, called IOs and logs, with IOs directory being the place where all the workloads related IOs are happpening. logs directory is hosting a dedicated file for each client which is collecting the resource output every few minutes in append mode. The problem is till now, ie after about 3 days, the logs directory hasn't even been created 2) the syncing has been very slow paced, even after 3 days, slave is yet to catch up. Master had about 1.1TB data while slave has just about 350gb of data 3) I have seen some tracebacks in gsync log as below /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714 [2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FA IL: Traceback (most recent call last): File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap tf(*aargs) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob po = self.sync_engine(pb, self.log_err) File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync rconf.ssh_ctl_args + \ AttributeError: 'NoneType' object has no attribute 'split' [2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF. [2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty [2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing... [root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log [2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster- brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888 [2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FA IL: Traceback (most recent call last): File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap tf(*aargs) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob po = self.sync_engine(pb, self.log_err) File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync rconf.ssh_ctl_args + \ AttributeError: 'NoneType' object has no attribute 'split' [2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.
REVIEW: https://review.gluster.org/23580 (geo-rep: Fix Config Get Race) posted (#1) for review on release-7 by Kotresh HR
REVIEW: https://review.gluster.org/23580 (geo-rep: Fix Config Get Race) merged (#2) on release-7 by Kotresh HR