+++ This bug was initially created as a clone of Bug #1737484 +++
+++ This bug was initially created as a clone of Bug #1729915 +++

Description of problem:
=======================
Had set up a geo-rep session between a 4x3 master volume and a 4x(4+2) EC slave volume. I see the below issues in my test bed:

1) The volume has two main directories, IOs and logs. The IOs directory is where all the workload-related I/O happens, while the logs directory hosts a dedicated file per client that collects resource output every few minutes in append mode. The problem is that even now, i.e. after about 3 days, the logs directory hasn't even been created on the slave.

2) Syncing has been very slow paced; even after 3 days the slave is yet to catch up. The master had about 1.1 TB of data while the slave has just about 350 GB.

3) I have seen some tracebacks in the gsyncd log, as below:

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...

[root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log
[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.

Version-Release number of selected component (if applicable):
=====================
6.0.7
rsync-3.1.2-6.el7_6.1.x86_64

Steps to Reproduce:
====================
Note: no brick-mux enabled
1. Created a 4x3 volume on 4 nodes, with the below volume settings, which acts as the master in geo-rep:

Volume Name: nonfuncvol
Type: Distributed-Replicate
Volume ID: 4d44936f-312d-431a-905d-813e8ee63668
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03
Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
cluster.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable

2. Mounted the volume on 10 clients and started capturing resource info on the clients.
3. Created another 3-node cluster to be used as the slave, with a 4x(4+2) EC volume as the slave volume.
4. Started I/O on the master clients: just a Linux untar, 50 times from all clients.
5. Set up geo-rep from master to slave (the typical commands are sketched just after this list).
6. Started geo-rep only after about 4 hours so that the master had some data to propagate.
7. Left the setup over the weekend.
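For reference, steps 5 and 6 boil down to the standard geo-replication CLI calls. The short Python sketch below is not part of the original test harness; it only illustrates driving those calls via subprocess, and it assumes passwordless SSH from the master nodes to rhs-gp-srv13 is already configured.

#!/usr/bin/env python3
"""Hypothetical helper illustrating steps 5-6: create and start the geo-rep
session used in this report, then print its status."""
import subprocess

MASTER_VOL = "nonfuncvol"
SLAVE = "rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave"

def georep(*args):
    # Run one "gluster volume geo-replication ..." command and return stdout.
    cmd = ["gluster", "volume", "geo-replication", MASTER_VOL, SLAVE] + list(args)
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Step 5: create the session; push-pem distributes the SSH keys to the slave.
georep("create", "push-pem")

# Step 6: start syncing once the master already has data, then check status.
georep("start")
print(georep("status"))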
Actual results:
===================
Saw the three issues listed in the Description: the logs directory was never created on the slave, syncing lagged far behind (about 1.1 TB on the master vs. about 350 GB on the slave after 3 days), and the gsyncd log repeatedly showed the same AttributeError traceback ('NoneType' object has no attribute 'split' in resource.py rsync), with workers cycling through Faulty and Initializing states, as in the excerpts above.

--- Additional comment from RHEL Product and Program Management on 2019-07-15 10:18:53 UTC ---

This bug is automatically being proposed for the next minor release of Red Hat Gluster Storage by setting the release flag 'rhgs-3.5.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from nchilaka on 2019-07-15 10:21:43 UTC ---

Proposing as a blocker, as syncing is falling behind significantly and tracebacks are seen. We can revisit based on the RC from dev.
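For context on issue 3: the traceback points at the rsync ssh argument construction in resource.py, where rconf.ssh_ctl_args came back as None. The sketch below is not the actual gsyncd code; the RConf class and function names are stand-ins. It only shows how a None config value produces exactly this AttributeError and knocks the worker into Faulty.

# Simplified illustration of the traceback above (NOT the real gsyncd code).
class RConf:
    # Hypothetical stand-in for gsyncd's runtime config object; the value was
    # expected to be a string of ssh control options, but the read returned None.
    ssh_ctl_args = None

rconf = RConf()

def build_rsync_ssh_opts():
    # When the config lookup yields None, this fails the same way the logged
    # resource.py:1444 line does.
    return ["--rsh=ssh"] + rconf.ssh_ctl_args.split()

try:
    build_rsync_ssh_opts()
except AttributeError as err:
    print("worker would go Faulty here:", err)
    # -> 'NoneType' object has no attribute 'split'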
^C
[root@rhs-gp-srv5 bricks]# date;gluster volume geo-replication nonfuncvol rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave status
Mon Jul 15 15:50:57 IST 2019

MASTER NODE    MASTER VOL    MASTER BRICK    SLAVE USER    SLAVE    SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A    N/A
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active    Hybrid Crawl    N/A
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv03    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A    N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active    Hybrid Crawl    N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A    N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A    N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A    N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A    N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A    N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv02    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A    N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active    Hybrid Crawl    N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active    Hybrid Crawl    N/A
[root@rhs-gp-srv5 bricks]#

Slave volinfo:

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 34c52663-f47b-42e5-a33c-abe5d16382a8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv16.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick3: rhs-gp-srv13.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable

Volume Name: nonfuncvol-slave
Type: Distributed-Disperse
Volume ID: b5753c86-ea76-4e0e-8306-acc1d5237ced
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick3: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick4: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick5: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick6: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick7: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick8: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick9: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick10: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick11: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick12: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick13: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick14: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick15: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick16: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick17: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick18: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick19: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick20: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick21: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick22: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick23: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick24: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Options Reconfigured:
features.read-only: on
performance.quick-read: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
cluster.enable-shared-storage: enable

--- Additional comment from nchilaka on 2019-07-15 10:46:07 UTC ---

sosreports and logs @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1729915/

--- Additional comment from nchilaka on 2019-07-15 11:37:59 UTC ---

Do let me know ASAP if you need the setup (I can wait till EOD); otherwise I will go ahead with further testing, which may involve reconfiguring part or all of the testbed.

--- Additional comment from Atin Mukherjee on 2019-07-15 12:21:25 UTC ---

Hari - since Sunny is on PTO for this week and this is proposed as a blocker, can you please work on this bug? Please don't hesitate to contact Kotresh/Aravinda should you need any help.

--- Additional comment from Atin Mukherjee on 2019-07-15 13:37:05 UTC ---

(In reply to Atin Mukherjee from comment #5)
> Hari - since Sunny is on PTO for this week and this is proposed as a
> blocker, can you please work on this bug? Please don't hesitate to contact
> Kotresh/Aravinda should you need any help.

I see that Hari is on PTO as well till the 17th. Aravinda - would you be able to assist here? Kotresh has a couple of bugs on his plate which he's focusing on, hence I've requested your help.
>
> Nag - Please note that Sunny is on PTO, so we'd have to expect some delay in
> picking this up, till then don't destroy the setup.

--- Additional comment from Rochelle on 2019-07-16 05:30:24 UTC ---

I'm seeing this while running automation on the latest builds as well:

[2019-07-15 10:31:26.713311] E [syncdutils(worker /bricks/brick0/master_brick0):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'

There was no functionality impact in my case. However, there were additional 'No such file or directory' messages in the brick logs:

mnt-bricks-brick0-master_brick1.log-20190716:[2019-07-15 11:19:49.950747] E [fuse-bridge.c:220:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fac24305b3b] (-->
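The patch referenced below is titled "geo-rep: Fix Config Get Race". As a hedged illustration only (not the actual patch), the sketch shows how a config "get" that races with a rewrite of the session config can transiently return None, and one defensive pattern to keep such a None from ever reaching a .split() call; GsyncConfig and get_with_retry are hypothetical names.

# Illustrative sketch only -- not the real gsyncd change.
import time

class GsyncConfig:
    """Hypothetical config wrapper; get() may transiently return None if the
    underlying session config file is being rewritten by another process."""
    def __init__(self, values):
        self._values = values

    def get(self, key):
        return self._values.get(key)

def get_with_retry(conf, key, default="", attempts=3, delay=0.5):
    # Re-read the key a few times before falling back to a default, so a
    # transient None from a racy read never propagates to callers.
    for _ in range(attempts):
        value = conf.get(key)
        if value is not None:
            return value
        time.sleep(delay)
    return default

conf = GsyncConfig({"ssh-options": None})   # simulate the racy read
ssh_opts = get_with_retry(conf, "ssh-options")
rsync_ssh_args = ssh_opts.split()           # safe even when the read raced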
REVIEW: https://review.gluster.org/23594 (geo-rep: Fix Config Get Race) posted (#1) for review on release-6 by Kotresh HR
REVIEW: https://review.gluster.org/23594 (geo-rep: Fix Config Get Race) merged (#1) on release-6 by Kotresh HR