1764015 – geo-rep syncing significantly behind and also only one of the directories are synced with tracebacks seen

Bug 1764015 - geo-rep syncing significantly behind and also only one of the directories are synced with tracebacks seen

Summary: geo-rep syncing significantly behind and also only one of the directories are...

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	geo-replication
Sub Component:
Version:	7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kotresh HR
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1729915 1737484 1764174
Blocks:
TreeView+	depends on / blocked

Reported:	2019-10-22 07:29 UTC by Kotresh HR
Modified:	2019-11-11 08:56 UTC (History)
CC List:	16 users (show)
Fixed In Version:
Clone Of:	1737484
Environment:
Last Closed:	2019-11-11 08:56:07 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Gluster.org Gerrit	23580	0	None	Merged	geo-rep: Fix Config Get Race	2019-11-11 08:56:06 UTC

Comment 1 Kotresh HR 2019-10-22 07:34:55 UTC

Description of problem:
=======================
had setup a georep session b/w a 4x3 master volume and 4x(4+2) ec volume.
I see below issues in my test-bed
1) the volume has two main directories, called IOs and logs, with IOs directory being the place where all the workloads related IOs are happpening. logs directory is hosting a dedicated file for each client which is collecting the resource output every few minutes in append mode. The problem is till now, ie after about 3 days, the logs directory hasn't even been created
2) the syncing has been very slow paced, even after 3 days, slave is yet to catch up. Master had about 1.1TB data while slave has just about 350gb of data
3) I have seen some tracebacks in gsync log as below

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FA
IL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync
 changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-
brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FA
IL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.


Version-Release number of selected component (if applicable):
=====================
6.0.7
rsync-3.1.2-6.el7_6.1.x86_64



Steps to Reproduce:
====================
note: no brickmux enabled
1. created a 4x3 volume on 4 nodes , with below volume settings, which will act as master in georep
Volume Name: nonfuncvol
Type: Distributed-Replicate
Volume ID: 4d44936f-312d-431a-905d-813e8ee63668
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03
Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
cluster.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable



2. mounted the volume on 10 clients, started capturing resource info on clients
3. created another 3node cluster to be used as slave, with a 4x(4+2) ecvol as slave
4. started IOs on clients of master, just linux untar 50 times from all clients
5. setup georep from master->slave
6. started georep only after about 4hrs so that master has some data to propagate.
7. left the setup for weekend.
 


Actual results:
===================
seen below issues
1) the volume has two main directories, called IOs and logs, with IOs directory being the place where all the workloads related IOs are happpening. logs directory is hosting a dedicated file for each client which is collecting the resource output every few minutes in append mode. The problem is till now, ie after about 3 days, the logs directory hasn't even been created
2) the syncing has been very slow paced, even after 3 days, slave is yet to catch up. Master had about 1.1TB data while slave has just about 350gb of data
3) I have seen some tracebacks in gsync log as below

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FA
IL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync
 changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-
brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FA
IL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.

Comment 2 Worker Ant 2019-10-22 07:54:58 UTC

REVIEW: https://review.gluster.org/23580 (geo-rep: Fix Config Get Race) posted (#1) for review on release-7 by Kotresh HR

Comment 3 Worker Ant 2019-11-11 08:56:07 UTC

REVIEW: https://review.gluster.org/23580 (geo-rep: Fix Config Get Race) merged (#2) on release-7 by Kotresh HR

Note You need to log in before you can comment on or make changes to this bug.