Bug 1764174 - geo-rep syncing significantly behind and also only one of the directories is synced, with tracebacks seen
Summary: geo-rep syncing significantly behind and also only one of the directories is synced, with tracebacks seen
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: 6
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On: 1729915 1737484
Blocks: 1764015
 
Reported: 2019-10-22 12:28 UTC by Kotresh HR
Modified: 2019-10-23 04:08 UTC
CC List: 16 users

Fixed In Version:
Clone Of: 1737484
Environment:
Last Closed: 2019-10-23 04:08:02 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments


Links
System ID: Gluster.org Gerrit 23594
Private: 0
Priority: None
Status: Merged
Summary: geo-rep: Fix Config Get Race
Last Updated: 2019-10-23 04:08:01 UTC

Description Kotresh HR 2019-10-22 12:28:30 UTC
+++ This bug was initially created as a clone of Bug #1737484 +++

+++ This bug was initially created as a clone of Bug #1729915 +++

Description of problem:
=======================
Set up a geo-rep session between a 4x3 master volume and a 4x(4+2) EC slave volume.
I see the following issues in my test bed:
1) The volume has two main directories, IOs and logs. The IOs directory is where all the workload-related I/O happens; the logs directory hosts a dedicated file for each client, to which resource output is appended every few minutes. The problem: even after about 3 days, the logs directory has not been created on the slave.
2) Syncing has been very slow; even after 3 days the slave is yet to catch up. The master has about 1.1 TB of data while the slave has only about 350 GB.
3) Some tracebacks are seen in the gsyncd log, as below.

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.
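
For context when reading the tracebacks: the failing statement in resource.py builds the ssh/rsync argument list partly from values read from the geo-rep session config, and the fix linked on this bug is titled "geo-rep: Fix Config Get Race". Below is a minimal, hypothetical sketch (not the actual gsyncd code; the key and function names are made up) of how a config read that transiently returns None during a concurrent reload produces exactly this AttributeError:

    # Hypothetical illustration of the suspected race -- not the gsyncd source.
    # Assumption: one thread reloads the session config while a sync job reads
    # from it; the read can briefly return None, and None.split() then raises
    # "AttributeError: 'NoneType' object has no attribute 'split'".
    config = {"ssh-command": "ssh -oPasswordAuthentication=no"}  # hypothetical key

    def reload_config():
        # A reload that momentarily removes the key opens the race window.
        config.pop("ssh-command", None)
        config["ssh-command"] = "ssh -oPasswordAuthentication=no"

    def build_ssh_args():
        value = config.get("ssh-command")    # returns None if read mid-reload
        return value.split() + ["-p", "22"]  # AttributeError when value is None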


Version-Release number of selected component (if applicable):
=====================
6.0.7
rsync-3.1.2-6.el7_6.1.x86_64



Steps to Reproduce:
====================
Note: brick multiplexing was not enabled.
1. Created a 4x3 volume on 4 nodes, with the volume settings below, to act as the geo-rep master:
Volume Name: nonfuncvol
Type: Distributed-Replicate
Volume ID: 4d44936f-312d-431a-905d-813e8ee63668
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03
Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
cluster.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable



2. Mounted the volume on 10 clients and started capturing resource info on the clients.
3. Created another 3-node cluster to be used as the slave, with a 4x(4+2) EC volume as the slave volume.
4. Started I/O on the master clients: just a Linux untar, run 50 times from all clients.
5. Set up geo-rep from master to slave (see the command sketch after this list).
6. Started geo-rep only after about 4 hours, so that the master had some data to propagate.
7. Left the setup over the weekend.
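
For reference, step 5 ("set up geo-rep from master to slave") normally boils down to the standard gluster CLI geo-replication commands. The sketch below is only an illustration of that workflow, assuming passwordless root SSH from a master node to the slave host; the hostnames and volume names are taken from this report, and it is not a capture of the exact commands run on this test bed:

    # Hedged sketch of the geo-rep session setup for steps 5/6 -- assumes the
    # standard gluster CLI workflow and root SSH access to the slave node.
    import subprocess

    MASTER_VOL = "nonfuncvol"
    SLAVE = "rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave"

    def gluster(*args):
        cmd = ["gluster", "volume", "geo-replication", MASTER_VOL, SLAVE, *args]
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    gluster("create", "push-pem")  # create the session and push pem keys to the slave
    gluster("start")               # started ~4 hours after client I/O began (step 6)
    gluster("status")              # same status command captured later in this bug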
 


Actual results:
===================
The following issues were seen:
1) The volume has two main directories, IOs and logs. The IOs directory is where all the workload-related I/O happens; the logs directory hosts a dedicated file for each client, to which resource output is appended every few minutes. Even after about 3 days, the logs directory had not been created on the slave.
2) Syncing has been very slow; even after 3 days the slave was yet to catch up. The master had about 1.1 TB of data while the slave had only about 350 GB.
3) Some tracebacks were seen in the gsyncd log, as below.

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.

--- Additional comment from RHEL Product and Program Management on 2019-07-15 10:18:53 UTC ---

This bug is automatically being proposed for the next minor release of Red Hat Gluster Storage by setting the release flag 'rhgs-3.5.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from nchilaka on 2019-07-15 10:21:43 UTC ---

Proposing as a blocker, as syncing is falling significantly behind and tracebacks are being seen. Can revisit based on RC from dev.


^C
[root@rhs-gp-srv5 bricks]#  date;gluster volume geo-replication  nonfuncvol rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave status
Mon Jul 15 15:50:57 IST 2019
 
MASTER NODE                           MASTER VOL    MASTER BRICK                       SLAVE USER    SLAVE                                                    SLAVE NODE                             STATUS     CRAWL STATUS    LAST_SYNCED          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
[root@rhs-gp-srv5 bricks]# 





slave volinfo

 
Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 34c52663-f47b-42e5-a33c-abe5d16382a8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv16.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick3: rhs-gp-srv13.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable
 
Volume Name: nonfuncvol-slave
Type: Distributed-Disperse
Volume ID: b5753c86-ea76-4e0e-8306-acc1d5237ced
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick3: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick4: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick5: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick6: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick7: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick8: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick9: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick10: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick11: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick12: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick13: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick14: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick15: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick16: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick17: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick18: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick19: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick20: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick21: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick22: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick23: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick24: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Options Reconfigured:
features.read-only: on
performance.quick-read: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
cluster.enable-shared-storage: enable

--- Additional comment from nchilaka on 2019-07-15 10:46:07 UTC ---

sosreports and logs @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1729915/

--- Additional comment from nchilaka on 2019-07-15 11:37:59 UTC ---

Do let me know ASAP (I can wait till EOD) if you need the setup; otherwise I will go ahead with further testing, which may involve reconfiguring part or all of the testbed.

--- Additional comment from Atin Mukherjee on 2019-07-15 12:21:25 UTC ---

Hari - since Sunny is on PTO for this week and this is proposed as a blocker, can you please work on this bug? Please don't hesitate to contact Kotresh/Aravinda should you need any help.

Nag - Please note that Sunny is on PTO, so we'd have to expect some delay in picking this up; till then, please don't destroy the setup.

--- Additional comment from Atin Mukherjee on 2019-07-15 13:37:05 UTC ---

(In reply to Atin Mukherjee from comment #5)
> Hari - since Sunny is on PTO for this week and this is proposed as a
> blocker, can you please work on this bug? Please don't hesitate to contact
> Kotresh/Aravinda should you need any help.

I see that Hari is on PTO as well till 17th.

Aravinda - would you be able to assist here? Kotresh has a couple of bugs on his plate which he's focusing on, hence the request for your help.

> 
> Nag - Please note that Sunny is on PTO, so we'd have to expect some delay in
> picking this up, till then don't destroy the setup.

--- Additional comment from Rochelle on 2019-07-16 05:30:24 UTC ---

I'm seeing this while running automation on the latest builds as well:

[2019-07-15 10:31:26.713311] E [syncdutils(worker /bricks/brick0/master_brick0):338:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'


There was no functionality impact in my case.

However there were additional 'No such file or directory' messages in brick logs:
mnt-bricks-brick0-master_brick1.log-20190716:[2019-07-15 11:19:49.950747] E [fuse-bridge.c:220:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fac24305b3b] (-->

Comment 1 Worker Ant 2019-10-22 12:56:27 UTC
REVIEW: https://review.gluster.org/23594 (geo-rep: Fix Config Get Race) posted (#1) for review on release-6 by Kotresh HR

Comment 2 Worker Ant 2019-10-23 04:08:02 UTC
REVIEW: https://review.gluster.org/23594 (geo-rep: Fix Config Get Race) merged (#1) on release-6 by Kotresh HR
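
The patch title indicates it fixes a race in the config get path; the change itself is not reproduced here. As a generic illustration only (not the code merged in Gerrit 23594), a read/reload race of this kind is typically closed by serializing config reads against reloads and never handing back None:

    # Illustrative pattern for closing a config read/reload race -- an
    # assumption-based sketch, not the actual gsyncd change.
    import threading

    class SafeConfig:
        def __init__(self, values):
            self._values = dict(values)
            self._lock = threading.Lock()

        def reload(self, new_values):
            # Swap the whole mapping under the lock so readers never observe
            # a half-reloaded state.
            with self._lock:
                self._values = dict(new_values)

        def get(self, key, default=""):
            # Readers take the same lock and always get back a usable string.
            with self._lock:
                return self._values.get(key, default)

    cfg = SafeConfig({"ssh-command": "ssh -oPasswordAuthentication=no"})
    ssh_args = cfg.get("ssh-command").split()  # never None, so .split() is safe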

