Bug 1737484 - geo-rep syncing significantly behind and also only one of the directories are synced with tracebacks seen
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1729915
Blocks: 1764015 1764174
 
Reported: 2019-08-05 13:35 UTC by Aravinda VK
Modified: 2020-01-09 12:49 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1729915
: 1764015 1764174 (view as bug list)
Environment:
Last Closed: 2020-01-09 12:49:26 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:




Links
System ID Priority Status Summary Last Updated
Gluster.org Gerrit 23158 None Open geo-rep: Fix Config Get Race 2019-08-06 09:34:46 UTC
Gluster.org Gerrit 23247 None Open geo-rep: Fix worker connection issue 2019-08-20 06:21:14 UTC

Description Aravinda VK 2019-08-05 13:35:15 UTC
+++ This bug was initially created as a clone of Bug #1729915 +++

Description of problem:
=======================
Had set up a geo-rep session between a 4x3 master volume and a 4x(4+2) EC slave volume.
I see the below issues in my test-bed:
1) The volume has two main directories, IOs and logs. IOs is where all the workload-related I/O is happening; logs hosts a dedicated file for each client, to which resource output is appended every few minutes. The problem is that even now, after about 3 days, the logs directory hasn't even been created.
2) Syncing has been very slow-paced; even after 3 days the slave has yet to catch up. The master had about 1.1TB of data while the slave has just about 350GB.
3) I have seen some tracebacks in the gsyncd log, as below:

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.


Version-Release number of selected component (if applicable):
=====================
6.0.7
rsync-3.1.2-6.el7_6.1.x86_64



Steps to Reproduce:
====================
Note: brick multiplexing is not enabled.
1. Created a 4x3 volume on 4 nodes with the volume settings below; this acts as the master in geo-rep.
Volume Name: nonfuncvol
Type: Distributed-Replicate
Volume ID: 4d44936f-312d-431a-905d-813e8ee63668
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03
Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
cluster.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable



2. Mounted the volume on 10 clients and started capturing resource info on the clients.
3. Created another 3-node cluster to be used as the slave, with a 4x(4+2) EC volume as the slave volume.
4. Started I/O on the master clients: just a Linux kernel untar, 50 times from all clients.
5. Set up geo-rep from master to slave.
6. Started geo-rep only after about 4 hours so that the master had some data to propagate.
7. Left the setup for the weekend.
 


Actual results:
===================
Same three issues as described above:
1) After about 3 days, the logs directory has still not been created on the slave.
2) Syncing is far behind: the master had about 1.1TB of data while the slave has just about 350GB.
3) Repeated "AttributeError: 'NoneType' object has no attribute 'split'" tracebacks in gsyncd.log, as quoted in the Description.

--- Additional comment from RHEL Product and Program Management on 2019-07-15 10:18:53 UTC ---

This bug is automatically being proposed for the next minor release of Red Hat Gluster Storage by setting the release flag 'rhgs-3.5.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from nchilaka on 2019-07-15 10:21:43 UTC ---

Proposing as a blocker, as syncing is falling behind significantly and tracebacks are also seen. Can revisit based on RC from dev.


[root@rhs-gp-srv5 bricks]#  date;gluster volume geo-replication  nonfuncvol rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave status
Mon Jul 15 15:50:57 IST 2019
 
MASTER NODE                           MASTER VOL    MASTER BRICK                       SLAVE USER    SLAVE                                                    SLAVE NODE                             STATUS     CRAWL STATUS    LAST_SYNCED          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
[root@rhs-gp-srv5 bricks]# 





slave volinfo

 
Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 34c52663-f47b-42e5-a33c-abe5d16382a8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv16.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick3: rhs-gp-srv13.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable
 
Volume Name: nonfuncvol-slave
Type: Distributed-Disperse
Volume ID: b5753c86-ea76-4e0e-8306-acc1d5237ced
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick3: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick4: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick5: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick6: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick7: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick8: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick9: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick10: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick11: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick12: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick13: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick14: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick15: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick16: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick17: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick18: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick19: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick20: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick21: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick22: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick23: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick24: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Options Reconfigured:
features.read-only: on
performance.quick-read: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
cluster.enable-shared-storage: enable

--- Additional comment from nchilaka on 2019-07-15 10:46:07 UTC ---

sosreports and logs @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1729915/

--- Additional comment from nchilaka on 2019-07-15 11:37:59 UTC ---

Do let me know ASAP (can wait till EOD) if you need the setup; otherwise I would go ahead with further testing, which may involve reconfiguring part or all of the testbed.

--- Additional comment from Atin Mukherjee on 2019-07-15 12:21:25 UTC ---

Hari - since Sunny is on PTO for this week and this is proposed as a blocker, can you please work on this bug? Please don't hesitate to contact Kotresh/Aravinda should you need any help.

Nag - Please note that Sunny is on PTO, so we'd have to expect some delay in picking this up, till then don't destroy the setup.

--- Additional comment from Atin Mukherjee on 2019-07-15 13:37:05 UTC ---

(In reply to Atin Mukherjee from comment #5)
> Hari - since Sunny is on PTO for this week and this is proposed as a
> blocker, can you please work on this bug? Please don't hesitate to contact
> Kotresh/Aravinda should you need any help.

I see that Hari is on PTO as well till 17th.

Aravinda - would you be able to assist here? Kotresh has a couple of bugs on his plate which he's focusing on, hence I requested your help.

> 
> Nag - Please note that Sunny is on PTO, so we'd have to expect some delay in
> picking this up, till then don't destroy the setup.

--- Additional comment from Rochelle on 2019-07-16 05:30:24 UTC ---

I'm seeing this while running automation on the latest builds as well:

[2019-07-15 10:31:26.713311] E [syncdutils(worker /bricks/brick0/master_brick0):338:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'


There was no functionality impact in my case.

However there were additional 'No such file or directory' messages in brick logs:
mnt-bricks-brick0-master_brick1.log-20190716:[2019-07-15 11:19:49.950747] E [fuse-bridge.c:220:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fac24305b3b] (--> /usr/lib64/glusterfs/6.0/xlator/mount/fuse.so(+0x81c1)[0x7fac1b6b01c1] (--> /usr/lib64/glusterfs/6.0/xlator/mount/fuse.so(+0x8a9a)[0x7fac1b6b0a9a] (--> /lib64/libpthread.so.0(+0x7ea5)[0x7fac23142ea5] (--> /lib64/libc.so.6(clone+0x6d)[0x7fac22a088cd] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory
mnt-bricks-brick0-master_brick1.log-20190716:[2019-07-15 12:29:27.262096] E [fuse-bridge.c:220:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fe169d97b3b] (--> /usr/lib64/glusterfs/6.0/xlator/mount/fuse.so(+0x81c1)[0x7fe1611421c1] (--> /usr/lib64/glusterfs/6.0/xlator/mount/fuse.so(+0x8a9a)[0x7fe161142a9a] (--> /lib64/libpthread.so.0(+0x7ea5)[0x7fe168bd4ea5] (--> /lib64/libc.so.6(clone+0x6d)[0x7fe16849a8cd] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory


gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-6.0-7.el7rhgs.x86_64
vdsm-gluster-4.30.18-1.0.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-23.el7.x86_64
glusterfs-client-xlators-6.0-7.el7rhgs.x86_64
glusterfs-6.0-7.el7rhgs.x86_64
glusterfs-server-6.0-7.el7rhgs.x86_64
glusterfs-events-6.0-7.el7rhgs.x86_64
glusterfs-api-6.0-7.el7rhgs.x86_64
glusterfs-geo-replication-6.0-7.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
python2-gluster-6.0-7.el7rhgs.x86_64
glusterfs-libs-6.0-7.el7rhgs.x86_64
glusterfs-fuse-6.0-7.el7rhgs.x86_64
glusterfs-rdma-6.0-7.el7rhgs.x86_64



The volume configuration was 1x3 on both master and slave.

--- Additional comment from Rochelle on 2019-07-16 05:32:22 UTC ---

I updated this BZ as the same traceback is seen in my case as well,
though with respect to syncing there was no issue.

--- Additional comment from Aravinda VK on 2019-07-16 06:02:17 UTC ---

Getting a Forbidden error while accessing sosreport

Forbidden

You don't have permission to access /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-xiqycze.tar.xz on this server.

--- Additional comment from nchilaka on 2019-07-16 06:04:13 UTC ---

(In reply to Aravinda VK from comment #9)
> Getting a Forbidden error while accessing sosreport
> 
> Forbidden
> 
> You don't have permission to access
> /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.
> blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-
> xiqycze.tar.xz on this server.

you can retry now

--- Additional comment from nchilaka on 2019-07-16 06:05:04 UTC ---

(In reply to Atin Mukherjee from comment #6)
> (In reply to Atin Mukherjee from comment #5)
> > Hari - since Sunny is on PTO for this week and this is proposed as a
> > blocker, can you please work on this bug? Please don't hesitate to contact
> > Kotresh/Aravinda should you need any help.
> 
> I see that Hari is on PTO as well till 17th.
> 
> Aravinda - would you be able to assist here? Kotresh has couple of bugs in
> his plate which he's focusing on and hence requested for your help.
> 
> > 
> > Nag - Please note that Sunny is on PTO, so we'd have to expect some delay in
> > picking this up, till then don't destroy the setup.

understood, acked

--- Additional comment from Aravinda VK on 2019-07-16 06:06:14 UTC ---

(In reply to nchilaka from comment #10)
> (In reply to Aravinda VK from comment #9)
> > Getting a Forbidden error while accessing sosreport
> > 
> > Forbidden
> > 
> > You don't have permission to access
> > /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.
> > blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-
> > xiqycze.tar.xz on this server.
> 
> you can retry now

I can access now. Thanks

--- Additional comment from Aravinda VK on 2019-07-17 13:09:48 UTC ---

Hi Nag/Rochelle,

I failed to reproduce this in my setup. I also looked at the setup for issues but couldn't find any. Please provide the steps if you have a reproducer.

One suspect is that gconf.get() is called before gconf is loaded. I am still looking for the possible window in which this can happen.
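That suspicion can be illustrated with a minimal sketch (hypothetical names, not the actual gsyncdconfig.py): if the module-level config table stays None until load() runs, any get() issued before load() hands the caller None, which later blows up exactly like the rsync code path above:

```python
# Minimal sketch of the "get() before load()" suspect.
# Hypothetical stand-in for gsyncdconfig, not the real module.
_config = None  # populated by load(); stays None until then


def load(defaults):
    """Populate the config table (normally parsed from gsyncd.conf)."""
    global _config
    _config = dict(defaults)


def get(name, default_value=None):
    if _config is None:        # get() called before load()
        return default_value   # silently hands back None
    return _config.get(name, default_value)


# A caller that assumes a string, like the rsync code path:
opts = get("rsync-ssh-options")
try:
    opts.split()
except AttributeError as e:
    # AttributeError: 'NoneType' object has no attribute 'split'
    print(e)
```

The crash therefore surfaces far from its cause: the config layer returns None quietly, and only the consumer that calls `.split()` raises.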

--- Additional comment from Aravinda VK on 2019-07-17 13:41:44 UTC ---

Another possibility is a race condition between gconf.getr and gconf.get.

gconf.getr fetches the realtime config value: it re-reads the config file on disk and returns the value. While gconf.getr is loading the latest config, if another thread calls `gconf.get`, it may get `None` as the value.

```
getr:
    values = {}
    load_again()
```

If this is the issue, the fix should be to add a thread lock around the `gconf._load` function.
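A sketch of that fix, under the assumptions above (hypothetical names; the real `gsyncdconfig._load` differs): build the new table in a local dict and swap it in under a lock, so `get` can never observe the emptied/half-built table:

```python
import threading

# Hypothetical sketch of the proposed fix, not the actual gsyncdconfig.py.
_lock = threading.Lock()
_config = {"rsync-ssh-options": ""}


def _load():
    # Parse into a local dict first (parsing stubbed out here), then
    # swap it in under the lock, so readers never see a partial table.
    new_values = {"rsync-ssh-options": "", "log-rsync-performance": "true"}
    global _config
    with _lock:
        _config = new_values


def get(name, default_value=None):
    with _lock:
        return _config.get(name, default_value)


def getr(name, default_value=None):
    # "Realtime" get: reload from disk first, then read.
    _load()
    return get(name, default_value)
```

The key design point is that the lock only guards the pointer swap and the read, not the file parsing, so reload cost doesn't block readers for long.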

--- Additional comment from Sunil Kumar Acharya on 2019-07-18 06:51:48 UTC ---

Placing needinfo on comment 13.

--- Additional comment from Aravinda VK on 2019-07-18 08:11:17 UTC ---

Example program to reproduce the issue: three threads, one calling
`gconf.getr` and two calling `gconf.get`. Create this file in the
`/usr/libexec/glusterfs/python/syncdaemon/` directory:

```
from threading import Thread
import gsyncdconfig as gconf
import time


def worker():
    while True:
        o = gconf.get("rsync-ssh-options")
        print("Worker 1 rsync-ssh-options=%s" % o)
        o.split()
        time.sleep(0.25)


def worker2():
    while True:
        o = gconf.getr("log-rsync-performance")
        print("Worker 2 log-rsync-performance=%s" % o)
        time.sleep(0.5)


gconf.load("/etc/glusterfs/gsyncd.conf", "sample.conf")

t1 = Thread(target=worker)
t2 = Thread(target=worker)
t3 = Thread(target=worker2)

t1.start()
t2.start()
t3.start()

t1.join()
t2.join()
t3.join()
```

Run the script using `python3 crash_reproduce.py` and in another
terminal, run the following

```
for i in {1..100}; do echo "[vars]
log-rsync-performance=True
rsync-ssh-options=" > /usr/libexec/glusterfs/python/syncdaemon/sample.conf; sleep 1; done
```

After a few iterations, it produces the following traceback:

```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "crash_reproduce.py", line 10, in worker
    o.split()
AttributeError: 'NoneType' object has no attribute 'split'
```

I am working on adding a thread lock to fix the issue and will send an upstream patch soon.

--- Additional comment from Aravinda VK on 2019-07-18 14:43:05 UTC ---

Number of worker restarts:

```
for i in {5..8};
do
    echo rhs-gp-srv${i}.lab.eng.blr.redhat.com;
    ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com grep "Agent listining..." /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log  | awk '{print $5}' | sed 's/):72:__init__]//' | sort | uniq -c;
done
```

rhs-gp-srv5.lab.eng.blr.redhat.com
     84 /gluster/brick1/nonfuncvol-sv01
     61 /gluster/brick3/nonfuncvol-sv03
rhs-gp-srv6.lab.eng.blr.redhat.com
     48 /gluster/brick1/nonfuncvol-sv01
     48 /gluster/brick2/nonfuncvol-sv02
     71 /gluster/brick3/nonfuncvol-sv04
rhs-gp-srv7.lab.eng.blr.redhat.com
      3 /gluster/brick1/nonfuncvol-sv01
     68 /gluster/brick2/nonfuncvol-sv03
     34 /gluster/brick3/nonfuncvol-sv04
rhs-gp-srv8.lab.eng.blr.redhat.com
     38 /gluster/brick1/nonfuncvol-sv02
      2 /gluster/brick3/nonfuncvol-sv04

Number of times worker failed with Python traceback:

```
for i in {5..8}; do echo rhs-gp-srv${i}.lab.eng.blr.redhat.com; ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com grep "split" -R /var/log/glusterfs/geo-replication/ | sort | uniq -c; done
```

rhs-gp-srv5.lab.eng.blr.redhat.com
      1 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv6.lab.eng.blr.redhat.com
      3 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv7.lab.eng.blr.redhat.com
      3 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv8.lab.eng.blr.redhat.com

Out of the many restarts (400+), only 7 were due to the Python traceback.

Other reasons observed:

Workers failed with ENOTCONN

```
for i in {5..8}; do echo rhs-gp-srv${i}.lab.eng.blr.redhat.com; ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com 'grep -R "error=ENOTCONN" /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd* | wc -l'; done
```
rhs-gp-srv5.lab.eng.blr.redhat.com
43
rhs-gp-srv6.lab.eng.blr.redhat.com
54
rhs-gp-srv7.lab.eng.blr.redhat.com
41
rhs-gp-srv8.lab.eng.blr.redhat.com
24

Many errors were found in the master mount logs (/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/mnt-gluster-*):

[2019-07-18 01:41:24.349009] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for (null)
[2019-07-18 01:41:24.369058] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in (null) (gfid = ec3d7d53-51ae-4cdf-bbef-e3de147b4e75). Holes=1 overlaps=0
[2019-07-18 01:41:24.372697] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs
[2019-07-18 01:41:24.372755] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.372805] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = c44c6cef-6c76-4d3e-a56c-db21c71450be
[2019-07-18 01:41:24.374539] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel
[2019-07-18 01:41:24.384280] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.384340] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = c9076528-fea7-459c-b29a-2364b1f43bdc
[2019-07-18 01:41:24.386233] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com
[2019-07-18 01:41:24.386270] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.386302] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = be25fe8e-501a-485b-9b39-7caf2255ff6d
[2019-07-18 01:41:24.386419] W [MSGID: 109011] [dht-layout.c:152:dht_layout_search] 0-nonfuncvol-dht: no subvolume for hash (value) = 1716939520
[2019-07-18 01:41:24.387535] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24
[2019-07-18 01:41:24.388206] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24 (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.388267] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = d89f373a-08f4-42ab-b1ad-c60c92ee5181


I will send a patch to fix the Python traceback, but I need help from the DHT/AFR team to root-cause the ENOTCONN and "Directory selfheal failed" errors.

--- Additional comment from Atin Mukherjee on 2019-07-19 03:58:42 UTC ---

Susant, Karthik - please have a look at comment 17.

--- Additional comment from Susant Kumar Palai on 2019-07-19 06:23:27 UTC ---

(In reply to Aravinda VK from comment #17)
> Number of worker restarts:
> 
> ```
> for i in {5..8};
> do
>     echo rhs-gp-srv${i}.lab.eng.blr.redhat.com;
>     ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com grep "Agent listining..."
> /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.
> redhat.com_nonfuncvol-slave/gsyncd.log  | awk '{print $5}' | sed
> 's/):72:__init__]//' | sort | uniq -c;
> done
> ```
> 
> rhs-gp-srv5.lab.eng.blr.redhat.com
>      84 /gluster/brick1/nonfuncvol-sv01
>      61 /gluster/brick3/nonfuncvol-sv03
> rhs-gp-srv6.lab.eng.blr.redhat.com
>      48 /gluster/brick1/nonfuncvol-sv01
>      48 /gluster/brick2/nonfuncvol-sv02
>      71 /gluster/brick3/nonfuncvol-sv04
> rhs-gp-srv7.lab.eng.blr.redhat.com
>       3 /gluster/brick1/nonfuncvol-sv01
>      68 /gluster/brick2/nonfuncvol-sv03
>      34 /gluster/brick3/nonfuncvol-sv04
> rhs-gp-srv8.lab.eng.blr.redhat.com
>      38 /gluster/brick1/nonfuncvol-sv02
>       2 /gluster/brick3/nonfuncvol-sv04
> 
> Number of times worker failed with Python traceback:
> 
> ```
> for i in {5..8}; do echo rhs-gp-srv${i}.lab.eng.blr.redhat.com; ssh
> root@rhs-gp-srv${i}.lab.eng.blr.redhat.com grep "split" -R
> /var/log/glusterfs/geo-replication/ | sort | uniq -c; done
> ```
> 
> rhs-gp-srv5.lab.eng.blr.redhat.com
>       1
> /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.
> redhat.com_nonfuncvol-slave/gsyncd.log-20190714:AttributeError: 'NoneType'
> object has no attribute 'split'
> rhs-gp-srv6.lab.eng.blr.redhat.com
>       3
> /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.
> redhat.com_nonfuncvol-slave/gsyncd.log:AttributeError: 'NoneType' object has
> no attribute 'split'
> rhs-gp-srv7.lab.eng.blr.redhat.com
>       3
> /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.
> redhat.com_nonfuncvol-slave/gsyncd.log:AttributeError: 'NoneType' object has
> no attribute 'split'
> rhs-gp-srv8.lab.eng.blr.redhat.com
> 
> Out of many restarts(400+) only 7 times restarted due to Python traceback.
> 
> Other reasons observed:
> 
> Workers failed with ENOTCONN
> 
> ```
> for i in {5..8}; do echo rhs-gp-srv${i}.lab.eng.blr.redhat.com; ssh
> root@rhs-gp-srv${i}.lab.eng.blr.redhat.com 'grep -R "error=ENOTCONN"
> /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.
> redhat.com_nonfuncvol-slave/gsyncd* | wc -l'; done
> ```
> rhs-gp-srv5.lab.eng.blr.redhat.com
> 43
> rhs-gp-srv6.lab.eng.blr.redhat.com
> 54
> rhs-gp-srv7.lab.eng.blr.redhat.com
> 41
> rhs-gp-srv8.lab.eng.blr.redhat.com
> 24
> 
> Many errors found in master mount
> logs(/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.
> redhat.com_nonfuncvol-slave/mnt-gluster-*)
> 
> [2019-07-18 01:41:24.349009] W [MSGID: 108027]
> [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no
> read subvols for (null)
> [2019-07-18 01:41:24.369058] I [MSGID: 109063]
> [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in
> (null) (gfid = ec3d7d53-51ae-4cdf-bbef-e3de147b4e75). Holes=1 overlaps=0
> [2019-07-18 01:41:24.372697] W [MSGID: 108027]
> [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no
> read subvols for /IOs
> [2019-07-18 01:41:24.372755] I [MSGID: 109063]
> [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in
> /IOs (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
> [2019-07-18 01:41:24.372805] W [MSGID: 109005]
> [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs:
> Directory selfheal failed: 1 subvolumes down.Not fixing. gfid =
> c44c6cef-6c76-4d3e-a56c-db21c71450be
> [2019-07-18 01:41:24.374539] W [MSGID: 108027]
> [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no
> read subvols for /IOs/kernel
> [2019-07-18 01:41:24.384280] I [MSGID: 109063]
> [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in
> /IOs/kernel (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
> [2019-07-18 01:41:24.384340] W [MSGID: 109005]
> [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel:
> Directory selfheal failed: 1 subvolumes down.Not fixing. gfid =
> c9076528-fea7-459c-b29a-2364b1f43bdc
> [2019-07-18 01:41:24.386233] W [MSGID: 108027]
> [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no
> read subvols for /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com
> [2019-07-18 01:41:24.386270] I [MSGID: 109063]
> [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in
> /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com (gfid =
> 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
> [2019-07-18 01:41:24.386302] W [MSGID: 109005]
> [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht:
> /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com: Directory selfheal failed: 1
> subvolumes down.Not fixing. gfid = be25fe8e-501a-485b-9b39-7caf2255ff6d
> [2019-07-18 01:41:24.386419] W [MSGID: 109011]
> [dht-layout.c:152:dht_layout_search] 0-nonfuncvol-dht: no subvolume for hash
> (value) = 1716939520
> [2019-07-18 01:41:24.387535] W [MSGID: 108027]
> [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no
> read subvols for /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24
> [2019-07-18 01:41:24.388206] I [MSGID: 109063]
> [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in
> /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24 (gfid =
> 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
> [2019-07-18 01:41:24.388267] W [MSGID: 109005]
> [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht:
> /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24: Directory selfheal
> failed: 1 subvolumes down.Not fixing. gfid =
> d89f373a-08f4-42ab-b1ad-c60c92ee5181
> 
> 
> I will send a patch to fix the Python traceback But need help from DHT/AFR
> team to root cause ENOTCONN and Directory self heal failed errors.

DHT does not heal directories when a subvolume is down. But here is the thing: if DHT is seeing an ENOTCONN error, that means the entire replica set was disconnected. Either the bricks were really down (possibly a crash) or there was a genuine network issue.

Susant
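To act on that diagnosis, one hedged way to tell the two cases apart is to scan the brick logs: a crashing glusterfsd typically dumps a "pending frames:" / "signal received: <N>" backtrace into its brick log before dying, while a pure network drop leaves no such signature. The helper names and log paths below are illustrative, not from this bug:

```shell
# Sketch: classify brick logs as "crash" vs "clean" based on the crash
# backtrace signature glusterfsd writes on a fatal signal.
brick_crashed() {
    # $1 = path to a brick log, e.g. /var/log/glusterfs/bricks/<brick>.log
    grep -q "signal received:" "$1"
}

scan_brick_logs() {
    for log in "$@"; do
        if brick_crashed "$log"; then
            echo "crash: $log"
        else
            echo "clean: $log"
        fi
    done
}
```

If no crash signature shows up on any brick, the ENOTCONN errors would point at the network (consistent with the ping-timer expirations reported later in this bug).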

--- Additional comment from nchilaka on 2019-07-24 11:12:22 UTC ---

(In reply to Aravinda VK from comment #13)
> Hi Nag/Rochelle,
> 
> I failed to reproduce in my setup. Also looked at the setup to find any
> issue but couldn't find any. Please provide the steps if you have reproducer.
> 
> One suspect is that gconf.get() is called before gconf is loaded. Still
> looking for the possible window when this can happen.

The steps to reproduce have been mentioned in detail in c#0; nothing additional to add.

--- Additional comment from nchilaka on 2019-07-24 11:12:56 UTC ---

adding back needinfo on other requestees

--- Additional comment from nchilaka on 2019-07-24 11:13:18 UTC ---

adding back needinfo on other requestees

--- Additional comment from Karthik U S on 2019-07-24 11:30:56 UTC ---

The need-info was raised for the ENOTCONN errors seen during directory self-heal, which Susant responded to in comment #19. @Aravinda, did you get a chance to check this?
AFR was failing to set the read subvols, which indicates either that no readable copy was available due to heals pending on all the bricks for that entry, or that all the bricks were down, or a network issue, as Susant mentioned.

--- Additional comment from Sunil Kumar Acharya on 2019-07-25 09:30:24 UTC ---

Mohit,

Setting needinfo to check the ENOTCONN errors we are seeing.

--- Additional comment from Kotresh HR on 2019-07-30 15:16:23 UTC ---

This looks to me like an unstable setup. I could not find the geo-rep slave logs, and a few gluster mount logs are missing.
But from whatever logs are available from the master, I can see a large number of ping-timer expirations and disconnects.

Sample log output of disconnects:

[root@kotresh Master $]find . | grep geo-replication | xargs grep "last 42 seconds, disconnecting" 2>/dev/null | tail -f
./sosreport-rhs-gp-srv8-georep-issues-15jul2019-2019-07-15-nsbwhbp/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/mnt-gluster-brick2-nonfuncvol-sv03.log:[2019-07-15 08:29:11.761442] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-nonfuncvol-client-7: server 10.70.36.12:49153 has not responded in the last 42 seconds, disconnecting.
./sosreport-rhs-gp-srv8-georep-issues-15jul2019-2019-07-15-nsbwhbp/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/mnt-gluster-brick2-nonfuncvol-sv03.log:[2019-07-15 08:32:09.811456] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-nonfuncvol-client-7: server 10.70.36.12:49153 has not responded in the last 42 seconds, disconnecting.
./sosreport-rhs-gp-srv8-georep-issues-15jul2019-2019-07-15-nsbwhbp/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/mnt-gluster-brick2-nonfuncvol-sv03.log:[2019-07-15 08:33:26.833158] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-nonfuncvol-client-7: server 10.70.36.12:49153 has not responded in the last 42 seconds, disconnecting.

Number of disconnects from available logs on Master cluster:
[root@kotresh Master $]find . | grep geo-replication | xargs grep "last 42 seconds, disconnecting" 2>/dev/null | wc -l
51729
[root@kotresh Master $]

51729 is a huge number of disconnects.

Please redo the test with the same volume-type combination and see if you still observe the issue.
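To see whether those 51729 disconnects cluster on particular bricks, the quoted grep can be extended into a small per-client aggregation. A sketch (the helper name is mine; the parsing matches the rpc_clnt_ping_timer_expired lines quoted above):

```shell
# Sketch: attribute ping-timeout disconnects to the client xlator and brick
# address they came from, so a large total can be broken down per brick.
count_disconnects_per_client() {
    grep -h "has not responded in the last" "$@" \
        | sed -n 's/.* \(0-[^:]*client-[0-9]*\): server \([0-9.:]*\) .*/\1 \2/p' \
        | sort | uniq -c | sort -rn
}
```

Run it against the mnt-gluster-* mount logs; if the counts concentrate on one or two bricks, those servers (or their network links) are the first suspects. The 42-second threshold in the messages is the default network.ping-timeout.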

--- Additional comment from nchilaka on 2019-07-31 06:33:37 UTC ---

(In reply to Kotresh HR from comment #25)
> This looks to me unstable setup. I could not find the geo-rep slave logs and
> few gluster mount logs are missing.
> But from what ever logs available from master, I could see lot number of
> ping timer expiration and disconnects.
> 
> Sample log output of disconnects:
> 
> [root@kotresh Master $]find . | grep geo-replication | xargs grep "last 42
> seconds, disconnecting" 2>/dev/null | tail -f
> ./sosreport-rhs-gp-srv8-georep-issues-15jul2019-2019-07-15-nsbwhbp/var/log/
> glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.
> com_nonfuncvol-slave/mnt-gluster-brick2-nonfuncvol-sv03.log:[2019-07-15
> 08:29:11.761442] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
> 0-nonfuncvol-client-7: server 10.70.36.12:49153 has not responded in the
> last 42 seconds, disconnecting.
> ./sosreport-rhs-gp-srv8-georep-issues-15jul2019-2019-07-15-nsbwhbp/var/log/
> glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.
> com_nonfuncvol-slave/mnt-gluster-brick2-nonfuncvol-sv03.log:[2019-07-15
> 08:32:09.811456] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
> 0-nonfuncvol-client-7: server 10.70.36.12:49153 has not responded in the
> last 42 seconds, disconnecting.
> ./sosreport-rhs-gp-srv8-georep-issues-15jul2019-2019-07-15-nsbwhbp/var/log/
> glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.
> com_nonfuncvol-slave/mnt-gluster-brick2-nonfuncvol-sv03.log:[2019-07-15
> 08:33:26.833158] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
> 0-nonfuncvol-client-7: server 10.70.36.12:49153 has not responded in the
> last 42 seconds, disconnecting.
> 
> Number of disconnects from available logs on Master cluster:
> [root@kotresh Master $]find . | grep geo-replication | xargs grep "last 42
> seconds, disconnecting" 2>/dev/null | wc -l
> 51729
> [root@kotresh Master $]
> 
> 51729 is a huge number of disconnects.
> 
> Please redo the test with same volume type combination and see if you still
> observe the issue.

I have restarted the tests.

--- Additional comment from nchilaka on 2019-08-01 07:40:57 UTC ---

I have reproduced the issue: one of the main parent directories is still not synced, and the slave sync is lagging behind a lot.

As before, one of the parent directories has not been created even after close to 4 days. For example, I had 2 IO patterns running, linux untar and small files; however, the smallfiles dir is still not created on the slave:
[root@rhs-gp-srv11 IOs]# pwd
/mnt/slave-mount/IOs
[root@rhs-gp-srv11 IOs]# ls
kernel
Also, data on the master is about 1.4 TB while the slave has less than 300 GB.
Cluster details:
master node: rhs-gp-srv5
slave: rhs-gp-srv11 (the slave vol is mounted locally on this node itself)
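A quick way to enumerate exactly which top-level entries are missing on the slave is to diff the two mount listings. A hedged sketch (the helper name and mount paths are illustrative):

```shell
# Sketch: print entries present under the master mount but absent under the
# slave mount, one per line (e.g. a missing 'smallfiles' directory).
missing_on_slave() {
    # $1 = master dir, $2 = slave dir
    master_list=$(mktemp); slave_list=$(mktemp)
    ls -1A "$1" | sort > "$master_list"
    ls -1A "$2" | sort > "$slave_list"
    comm -23 "$master_list" "$slave_list"   # lines only in the master listing
    rm -f "$master_list" "$slave_list"
}
```

For example, `missing_on_slave /mnt/master-mount/IOs /mnt/slave-mount/IOs` would print the directories that geo-rep has not yet created on the slave.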

Comment 1 Worker Ant 2019-08-05 13:44:12 UTC
REVIEW: https://review.gluster.org/23158 (geo-rep: Fix Config Get Race) posted (#1) for review on master by Aravinda VK

Comment 2 Worker Ant 2019-08-06 09:34:48 UTC
REVIEW: https://review.gluster.org/23158 (geo-rep: Fix Config Get Race) merged (#2) on master by Amar Tumballi

Comment 3 Worker Ant 2019-08-16 10:16:42 UTC
REVIEW: https://review.gluster.org/23247 (geo-rep: Fix worker connection issue) posted (#1) for review on master by Kotresh HR

Comment 4 Worker Ant 2019-08-20 06:21:15 UTC
REVIEW: https://review.gluster.org/23247 (geo-rep: Fix worker connection issue) merged (#4) on master by Amar Tumballi

Comment 6 Sunny Kumar 2020-01-09 12:49:26 UTC
Closing this bug as patches are merged.

