Bug 1729915 - geo-rep syncing significantly behind; only one of the directories is synced, with tracebacks seen
Summary: geo-rep syncing significantly behind and also only one of the directories are...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.5.0
Assignee: Kotresh HR
QA Contact: Kshithij Iyer
URL:
Whiteboard:
Depends On:
Blocks: 1696809 1737484 1764015 1764174
 
Reported: 2019-07-15 10:18 UTC by Nag Pavan Chilakam
Modified: 2019-11-06 10:19 UTC (History)
15 users

Fixed In Version: glusterfs-6.0-13
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1737484 (view as bug list)
Environment:
Last Closed: 2019-11-06 10:19:04 UTC
Embargoed:
khiremat: needinfo-


Attachments

Description Nag Pavan Chilakam 2019-07-15 10:18:52 UTC
Description of problem:
=======================
Set up a geo-replication session between a 4x3 master volume and a 4x(4+2) EC slave volume.
I see the following issues on my test bed:
1) The volume has two main directories, IOs and logs. IOs is where all the workload I/O happens; logs hosts a dedicated file per client that collects resource output every few minutes in append mode. The problem is that even now, i.e. after about 3 days, the logs directory has not been created on the slave.
2) Syncing has been very slow; even after 3 days the slave has yet to catch up. The master has about 1.1 TB of data while the slave has only about 350 GB.
3) I have seen tracebacks in the gsyncd log, as below:

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...



[root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log


[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.


Version-Release number of selected component (if applicable):
=====================
6.0.7
rsync-3.1.2-6.el7_6.1.x86_64



Steps to Reproduce:
====================
Note: brick multiplexing is not enabled.
1. Created a 4x3 volume on 4 nodes with the volume settings below; this acts as the master in geo-rep:
Volume Name: nonfuncvol
Type: Distributed-Replicate
Volume ID: 4d44936f-312d-431a-905d-813e8ee63668
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03
Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
cluster.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable



2. Mounted the volume on 10 clients and started capturing resource info on the clients.
3. Created another 3-node cluster to be used as the slave, with a 4x(4+2) EC volume as the slave volume.
4. Started I/O on the master clients: a Linux kernel untar, 50 times, from all clients.
5. Set up geo-rep from master to slave (see the sketch after this list).
6. Started geo-rep only after about 4 hours, so that the master had some data to propagate.
7. Left the setup over the weekend.
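
For reference, a geo-rep session of this kind is typically created and started with commands along the lines below. This is a sketch only; the exact options used on this test bed are not recorded here, and the master/slave names are taken from the status output later in this report:

```
# On the master cluster: one-time pem setup for passwordless SSH to the slave
gluster system:: execute gsec_create

# Create the geo-rep session from the master volume to the slave volume
gluster volume geo-replication nonfuncvol \
    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave create push-pem

# Start syncing and check session status
gluster volume geo-replication nonfuncvol \
    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave start
gluster volume geo-replication nonfuncvol \
    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave status
```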
 


Actual results:
===================
Same three issues as described above: the logs directory was never created on the slave even after ~3 days, syncing lags far behind (~1.1 TB on the master vs. ~350 GB on the slave), and the rsync worker tracebacks (AttributeError: 'NoneType' object has no attribute 'split') shown in the description.

Comment 2 Nag Pavan Chilakam 2019-07-15 10:21:43 UTC
Proposing as a blocker: syncing is falling significantly behind and tracebacks are seen. Can revisit based on the root cause (RC) from dev.


[root@rhs-gp-srv5 bricks]#  date;gluster volume geo-replication  nonfuncvol rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave status
Mon Jul 15 15:50:57 IST 2019
 
MASTER NODE                           MASTER VOL    MASTER BRICK                       SLAVE USER    SLAVE                                                    SLAVE NODE                             STATUS     CRAWL STATUS    LAST_SYNCED          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A                  
[root@rhs-gp-srv5 bricks]# 





slave volinfo

 
Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 34c52663-f47b-42e5-a33c-abe5d16382a8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv16.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick3: rhs-gp-srv13.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable
 
Volume Name: nonfuncvol-slave
Type: Distributed-Disperse
Volume ID: b5753c86-ea76-4e0e-8306-acc1d5237ced
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick3: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick4: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick5: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick6: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick7: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick8: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick9: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick10: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick11: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick12: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick13: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick14: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick15: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick16: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick17: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick18: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick19: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick20: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick21: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick22: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick23: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick24: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Options Reconfigured:
features.read-only: on
performance.quick-read: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
cluster.enable-shared-storage: enable

Comment 9 Aravinda VK 2019-07-16 06:02:17 UTC
Getting a Forbidden error while accessing sosreport

Forbidden

You don't have permission to access /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-xiqycze.tar.xz on this server.

Comment 10 Nag Pavan Chilakam 2019-07-16 06:04:13 UTC
(In reply to Aravinda VK from comment #9)
> Getting a Forbidden error while accessing sosreport
> 
> Forbidden
> 
> You don't have permission to access
> /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.
> blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-
> xiqycze.tar.xz on this server.

you can retry now

Comment 12 Aravinda VK 2019-07-16 06:06:14 UTC
(In reply to nchilaka from comment #10)
> (In reply to Aravinda VK from comment #9)
> > Getting a Forbidden error while accessing sosreport
> > 
> > Forbidden
> > 
> > You don't have permission to access
> > /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.
> > blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-
> > xiqycze.tar.xz on this server.
> 
> you can retry now

I can access now. Thanks

Comment 13 Aravinda VK 2019-07-17 13:09:48 UTC
Hi Nag/Rochelle,

I failed to reproduce this in my setup. I also looked at the setup to find any issue but couldn't find any. Please provide the steps if you have a reproducer.

One suspect is that gconf.get() is called before gconf is loaded. I am still looking for the possible window in which this can happen.

Comment 14 Aravinda VK 2019-07-17 13:41:44 UTC
Another possibility is a race condition between gconf.getr and gconf.get.

gconf.getr returns the real-time config value, i.e. it re-reads the config file on disk and returns the value. While gconf.getr is loading the latest config, if another thread calls `gconf.get`, it may get `None` as the value.

```
getr:
    values = {}      # the in-memory config dict is cleared first...
    load_again()     # ...then reloaded from disk; a get() in this window returns None
```

If this is the issue, the fix should be to add a thread lock around the `gconf._load` function.

Comment 16 Aravinda VK 2019-07-18 08:11:17 UTC
Example program to reproduce the issue: three threads, one calling
`gconf.getr` and two calling `gconf.get`. Create this file in the
`/usr/libexec/glusterfs/python/syncdaemon/` directory:

```
from threading import Thread
import gsyncdconfig as gconf
import time


def worker():
    # Plain reader: crashes with AttributeError if get() returns None mid-reload
    while True:
        o = gconf.get("rsync-ssh-options")
        print("Worker 1 rsync-ssh-options=%s" % o)
        o.split()
        time.sleep(0.25)


def worker2():
    # Real-time reader: getr() reloads the config from disk on every call
    while True:
        o = gconf.getr("log-rsync-performance")
        print("Worker 2 log-rsync-performance=%s" % o)
        time.sleep(0.5)


gconf.load("/etc/glusterfs/gsyncd.conf", "sample.conf")

t1 = Thread(target=worker)
t2 = Thread(target=worker)
t3 = Thread(target=worker2)

t1.start()
t2.start()
t3.start()

t1.join()
t2.join()
t3.join()
```

Run the script using `python3 crash_reproduce.py`, and in another
terminal run the following:

```
for i in {1..100}; do echo "[vars]
log-rsync-performance=True
rsync-ssh-options=" > /usr/libexec/glusterfs/python/syncdaemon/sample.conf; sleep 1; done
```

While these run, the script eventually produces the following traceback:

```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "crash_reproduce.py", line 10, in worker
    o.split()
AttributeError: 'NoneType' object has no attribute 'split'
```

I am working on adding a thread lock to fix the issue. I will send an upstream patch soon.
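
For illustration, a minimal sketch of what such a lock-based fix might look like, assuming gsyncdconfig keeps its values in a module-level dict that a reload clears and repopulates. The names below (`config_data`, `_load_lock`, `_parse`) are stand-ins for illustration, not the actual gsyncdconfig code or the final patch:

```
import threading

# Illustrative stand-ins for gsyncdconfig's module-level state (assumed names)
config_data = {}
_load_lock = threading.Lock()


def _parse(config_file):
    # Placeholder parser; the real gsyncdconfig reads an INI-style file.
    return {"rsync-ssh-options": "", "log-rsync-performance": "True"}


def _load(config_file):
    # Clear and repopulate the shared dict under the lock, so readers
    # never observe the half-cleared (None-returning) state.
    with _load_lock:
        new_values = _parse(config_file)
        config_data.clear()
        config_data.update(new_values)


def get(name, default_value=None):
    # Readers take the same lock, so a reload cannot run between the
    # clear() and update() above.
    with _load_lock:
        return config_data.get(name, default_value)


def getr(name, config_file="sample.conf", default_value=None):
    # Real-time read: reload from disk first, then return the value.
    _load(config_file)
    return get(name, default_value)
```

With this structure, the two-thread interleaving from the reproducer above can no longer return `None` for a key that is present in the config file.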

Comment 17 Aravinda VK 2019-07-18 14:43:05 UTC
Number of worker restarts:

```
for i in {5..8};
do
    echo rhs-gp-srv${i}.lab.eng.blr.redhat.com;
    ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com grep "Agent listining..." /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log  | awk '{print $5}' | sed 's/):72:__init__]//' | sort | uniq -c;
done
```

rhs-gp-srv5.lab.eng.blr.redhat.com
     84 /gluster/brick1/nonfuncvol-sv01
     61 /gluster/brick3/nonfuncvol-sv03
rhs-gp-srv6.lab.eng.blr.redhat.com
     48 /gluster/brick1/nonfuncvol-sv01
     48 /gluster/brick2/nonfuncvol-sv02
     71 /gluster/brick3/nonfuncvol-sv04
rhs-gp-srv7.lab.eng.blr.redhat.com
      3 /gluster/brick1/nonfuncvol-sv01
     68 /gluster/brick2/nonfuncvol-sv03
     34 /gluster/brick3/nonfuncvol-sv04
rhs-gp-srv8.lab.eng.blr.redhat.com
     38 /gluster/brick1/nonfuncvol-sv02
      2 /gluster/brick3/nonfuncvol-sv04

Number of times worker failed with Python traceback:

```
for i in {5..8}; do echo rhs-gp-srv${i}.lab.eng.blr.redhat.com; ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com grep "split" -R /var/log/glusterfs/geo-replication/ | sort | uniq -c; done
```

rhs-gp-srv5.lab.eng.blr.redhat.com
      1 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv6.lab.eng.blr.redhat.com
      3 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv7.lab.eng.blr.redhat.com
      3 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv8.lab.eng.blr.redhat.com

Out of the many restarts (400+), only 7 were due to the Python traceback.

Other reasons observed:

Workers failed with ENOTCONN

```
for i in {5..8}; do echo rhs-gp-srv${i}.lab.eng.blr.redhat.com; ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com 'grep -R "error=ENOTCONN" /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd* | wc -l'; done
```
rhs-gp-srv5.lab.eng.blr.redhat.com
43
rhs-gp-srv6.lab.eng.blr.redhat.com
54
rhs-gp-srv7.lab.eng.blr.redhat.com
41
rhs-gp-srv8.lab.eng.blr.redhat.com
24

Many errors were found in the master mount logs (/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/mnt-gluster-*):

[2019-07-18 01:41:24.349009] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for (null)
[2019-07-18 01:41:24.369058] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in (null) (gfid = ec3d7d53-51ae-4cdf-bbef-e3de147b4e75). Holes=1 overlaps=0
[2019-07-18 01:41:24.372697] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs
[2019-07-18 01:41:24.372755] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.372805] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = c44c6cef-6c76-4d3e-a56c-db21c71450be
[2019-07-18 01:41:24.374539] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel
[2019-07-18 01:41:24.384280] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.384340] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = c9076528-fea7-459c-b29a-2364b1f43bdc
[2019-07-18 01:41:24.386233] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com
[2019-07-18 01:41:24.386270] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.386302] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = be25fe8e-501a-485b-9b39-7caf2255ff6d
[2019-07-18 01:41:24.386419] W [MSGID: 109011] [dht-layout.c:152:dht_layout_search] 0-nonfuncvol-dht: no subvolume for hash (value) = 1716939520
[2019-07-18 01:41:24.387535] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24
[2019-07-18 01:41:24.388206] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24 (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.388267] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = d89f373a-08f4-42ab-b1ad-c60c92ee5181


I will send a patch to fix the Python traceback, but we need help from the DHT/AFR team to root-cause the ENOTCONN and directory self-heal failures.

Comment 18 Atin Mukherjee 2019-07-19 03:58:42 UTC
Susant, Karthik - please have a look at comment 17.

Comment 19 Susant Kumar Palai 2019-07-19 06:23:27 UTC
(In reply to Aravinda VK from comment #17)
> [...]
> I will send a patch to fix the Python traceback, but we need help from the
> DHT/AFR team to root-cause the ENOTCONN and directory self-heal failures.

DHT does not heal directories when a subvolume is down. But here is the thing: if DHT is seeing an ENOTCONN error, that means the entire replica set was disconnected. Either the bricks were really down (possibly a crash) or there was a genuine network issue.

Susant

Comment 20 Nag Pavan Chilakam 2019-07-24 11:12:22 UTC
(In reply to Aravinda VK from comment #13)
> Hi Nag/Rochelle,
> 
> I failed to reproduce in my setup. Also looked at the setup to find any
> issue but couldn't find any. Please provide the steps if you have reproducer.
> 
> One suspect is that gconf.get() is called before gconf is loaded. Still
> looking for the possible window when this can happen.

The steps to reproduce are described in detail in comment #0; I have nothing additional to add.

