Description of problem:
=======================
I had set up a geo-replication session between a 4x3 master volume and a 4x(4+2) EC slave volume. I see the issues below on my test bed:

1) The volume has two main directories, IOs and logs. All workload-related I/O happens under the IOs directory, while the logs directory hosts a dedicated file per client that collects resource output every few minutes in append mode. The problem is that even now, after about 3 days, the logs directory has not been created on the slave.

2) Syncing has been very slow; even after 3 days the slave has yet to catch up. The master has about 1.1 TB of data while the slave has only about 350 GB.

3) I have seen tracebacks in the gsyncd log, as below:

/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714

[2019-07-13 12:26:53.408348] E [syncdutils(worker /gluster/brick1/nonfuncvol-sv01):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 12:26:53.490714] I [repce(agent /gluster/brick1/nonfuncvol-sv01):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-07-13 12:26:53.494467] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
[2019-07-13 12:27:03.508502] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...

[root@rhs-gp-srv7 glusterfs]# #less geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log
[2019-07-13 13:33:48.859147] I [master(worker /gluster/brick1/nonfuncvol-sv01):1682:crawl] _GMaster: processing xsync changelog path=/var/lib/misc/gluster/gsyncd/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gluster-brick1-nonfuncvol-sv01/xsync/XSYNC-CHANGELOG.1563020888
[2019-07-13 13:40:39.412694] E [syncdutils(worker /gluster/brick3/nonfuncvol-sv04):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1987, in syncjob
    po = self.sync_engine(pb, self.log_err)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1444, in rsync
    rconf.ssh_ctl_args + \
AttributeError: 'NoneType' object has no attribute 'split'
[2019-07-13 13:40:39.484643] I [repce(agent /gluster/brick3/nonfuncvol-sv04):97:service_loop] RepceServer: terminating on reaching EOF.

Version-Release number of selected component (if applicable):
=====================
6.0.7
rsync-3.1.2-6.el7_6.1.x86_64

Steps to Reproduce:
====================
Note: brick multiplexing is not enabled.
1. Created a 4x3 volume on 4 nodes with the volume settings below, which acts as the master in geo-rep:

Volume Name: nonfuncvol
Type: Distributed-Replicate
Volume ID: 4d44936f-312d-431a-905d-813e8ee63668
Status: Started
Snapshot Count: 1
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick2: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick3: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv01
Brick4: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-sv02
Brick5: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick6: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv02
Brick7: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick8: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-sv03
Brick9: rhs-gp-srv5.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv03
Brick10: rhs-gp-srv6.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick11: rhs-gp-srv7.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Brick12: rhs-gp-srv8.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-sv04
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.barrier: disable
cluster.shd-max-threads: 24
client.event-threads: 8
server.event-threads: 8
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable

2. Mounted the volume on 10 clients and started capturing resource info on the clients.
3. Created another 3-node cluster to be used as the slave, with a 4x(4+2) EC volume as the slave volume.
4. Started I/O on the master clients: just Linux untar, 50 times from all clients.
5. Set up geo-rep from master to slave (see the command sketch after these steps).
6. Started geo-rep only after about 4 hours so that the master had some data to propagate.
7. Left the setup over the weekend.
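For context on step 5: the session was created with the standard Gluster geo-replication CLI. The exact invocation used on this test bed is not captured in this report, so the following is only a sketch of the usual sequence (root slave user; hostnames and volume names taken from the volinfo above):

```
# Sketch of the usual geo-rep setup sequence (run on one master node);
# the exact options used on this test bed were not recorded here.
gluster system:: execute gsec_create
gluster volume geo-replication nonfuncvol \
    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave create push-pem
gluster volume geo-replication nonfuncvol \
    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave config use_meta_volume true
gluster volume geo-replication nonfuncvol \
    rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave start
```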
Actual results:
===================
The same three issues described above were seen: the logs directory was never created on the slave, syncing is far behind (about 1.1 TB on the master vs. about 350 GB on the slave after 3 days), and the workers repeatedly hit the traceback ending in "AttributeError: 'NoneType' object has no attribute 'split'" in gsyncd.log.
Proposing as a blocker, as syncing is falling behind significantly and tracebacks are seen. Can revisit based on the RC from dev.

[root@rhs-gp-srv5 bricks]# date;gluster volume geo-replication nonfuncvol rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave status
Mon Jul 15 15:50:57 IST 2019

MASTER NODE                           MASTER VOL    MASTER BRICK                       SLAVE USER    SLAVE                                                    SLAVE NODE                             STATUS     CRAWL STATUS    LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv5.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv7.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv01    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv6.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick1/nonfuncvol-sv02    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv11.lab.eng.blr.redhat.com    Passive    N/A             N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick2/nonfuncvol-sv03    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv16.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
rhs-gp-srv8.lab.eng.blr.redhat.com    nonfuncvol    /gluster/brick3/nonfuncvol-sv04    root          rhs-gp-srv13.lab.eng.blr.redhat.com::nonfuncvol-slave    rhs-gp-srv13.lab.eng.blr.redhat.com    Active     Hybrid Crawl    N/A
[root@rhs-gp-srv5 bricks]#

Slave volinfo:

Volume Name: gluster_shared_storage
Type: Replicate
Volume ID: 34c52663-f47b-42e5-a33c-abe5d16382a8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv16.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Brick3: rhs-gp-srv13.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable
Volume Name: nonfuncvol-slave
Type: Distributed-Disperse
Volume ID: b5753c86-ea76-4e0e-8306-acc1d5237ced
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick3: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick1/nonfuncvol-slave-sv1
Brick4: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick5: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick6: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick2/nonfuncvol-slave-sv1
Brick7: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick8: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick9: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick3/nonfuncvol-slave-sv2
Brick10: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick11: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick12: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick4/nonfuncvol-slave-sv2
Brick13: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick14: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick15: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick5/nonfuncvol-slave-sv3
Brick16: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick17: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick18: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick6/nonfuncvol-slave-sv3
Brick19: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick20: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick21: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick7/nonfuncvol-slave-sv4
Brick22: rhs-gp-srv13.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick23: rhs-gp-srv11.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Brick24: rhs-gp-srv16.lab.eng.blr.redhat.com:/gluster/brick8/nonfuncvol-slave-sv4
Options Reconfigured:
features.read-only: on
performance.quick-read: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
cluster.enable-shared-storage: enable
Getting a Forbidden error while accessing the sosreport:

Forbidden
You don't have permission to access /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-xiqycze.tar.xz on this server.
(In reply to Aravinda VK from comment #9)
> Getting a Forbidden error while accessing the sosreport:
>
> Forbidden
> You don't have permission to access
> /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-xiqycze.tar.xz
> on this server.

You can retry now.
(In reply to nchilaka from comment #10)
> (In reply to Aravinda VK from comment #9)
> > Getting a Forbidden error while accessing the sosreport:
> >
> > Forbidden
> > You don't have permission to access
> > /sosreports/nchilaka/bug.1729915/servers/master-nodes/rhs-gp-srv6.lab.eng.blr.redhat.com/sosreport-rhs-gp-srv6-georep-issues-15jul2019-2019-07-15-xiqycze.tar.xz
> > on this server.
>
> You can retry now.

I can access it now. Thanks.
Hi Nag/Rochelle,

I failed to reproduce this in my setup. I also looked at the setup to find any issue but couldn't find one. Please provide the steps if you have a reproducer.

One suspect is that gconf.get() is called before gconf is loaded. Still looking for the possible window when this can happen.
Another possible race condition, between gconf.getr and gconf.get:

gconf.getr gets the real-time config value, which means it re-reads the config file on disk and returns the value from it. While gconf.getr is loading the latest config, if another thread calls `gconf.get`, it may get `None` as the value, because the reload briefly empties the in-memory values:

```
getr:
    values = {}
    load_again()
```

If this is the issue, the fix should be to add a thread lock around the `gconf._load` function.
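A minimal, self-contained sketch of that locking approach, assuming gsyncdconfig keeps its parsed options in a module-level `values` dict that `_load()` clears and repopulates (illustrative only, not the actual gsyncdconfig.py patch; `_read_config_from_disk` is a hypothetical stand-in for the real parser):

```
import threading

values = {}
_lock = threading.Lock()


def _read_config_from_disk():
    # Hypothetical stand-in for parsing gsyncd.conf / the session config file.
    return {"rsync-ssh-options": "", "log-rsync-performance": "true"}


def _load():
    # Clear and repopulate under the lock so no reader can observe the
    # intermediate empty dict.
    with _lock:
        new_values = _read_config_from_disk()
        values.clear()
        values.update(new_values)


def get(name, default_value=None):
    # Readers take the same lock, so get() returns either the old value or
    # the freshly loaded one, never None for a configured option.
    with _lock:
        return values.get(name, default_value)


def getr(name, default_value=None):
    # "Realtime" get: reload from disk first, then return the value.
    _load()
    return get(name, default_value)
```

With the reload and the readers serialized on the same lock, the clear-and-repopulate window described above can no longer interleave with a concurrent `get()`.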
Example program to reproduce the issue: three threads, one calling `gconf.getr` and two calling `gconf.get`.

Create this file (crash_reproduce.py) in the `/usr/libexec/glusterfs/python/syncdaemon/` directory:

```
from threading import Thread
import gsyncdconfig as gconf
import time


def worker():
    # Plain get(); crashes with AttributeError if it ever returns None,
    # just like the gsyncd worker does.
    while True:
        o = gconf.get("rsync-ssh-options")
        print("Worker 1 rsync-ssh-options=%s" % o)
        o.split()
        time.sleep(0.25)


def worker2():
    # Realtime get; reloads the config file from disk on every call.
    while True:
        o = gconf.getr("log-rsync-performance")
        print("Worker 2 log-rsync-performance=%s" % o)
        time.sleep(0.5)


# Load the defaults plus the sample session config that the shell loop
# below keeps rewriting.
gconf.load("/etc/glusterfs/gsyncd.conf", "sample.conf")

t1 = Thread(target=worker)
t2 = Thread(target=worker)
t3 = Thread(target=worker2)
t1.start()
t2.start()
t3.start()
t1.join()
t2.join()
t3.join()
```

Run the script using `python3 crash_reproduce.py`, and in another terminal run the following, which keeps rewriting the config file the script reads:

```
for i in {1..100}; do
    echo "[vars]
log-rsync-performance=True
rsync-ssh-options=" > /usr/libexec/glusterfs/python/syncdaemon/sample.conf
    sleep 1
done
```

In between, it produces the following traceback:

```
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "crash_reproduce.py", line 10, in worker
    o.split()
AttributeError: 'NoneType' object has no attribute 'split'
```

I am working on adding a thread lock to fix the issue and will send an upstream patch soon.
Number of worker restarts:

```
for i in {5..8}; do
    echo rhs-gp-srv${i}.lab.eng.blr.redhat.com
    ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com grep "Agent listining..." /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log | awk '{print $5}' | sed 's/):72:__init__]//' | sort | uniq -c
done
```

rhs-gp-srv5.lab.eng.blr.redhat.com
     84 /gluster/brick1/nonfuncvol-sv01
     61 /gluster/brick3/nonfuncvol-sv03
rhs-gp-srv6.lab.eng.blr.redhat.com
     48 /gluster/brick1/nonfuncvol-sv01
     48 /gluster/brick2/nonfuncvol-sv02
     71 /gluster/brick3/nonfuncvol-sv04
rhs-gp-srv7.lab.eng.blr.redhat.com
      3 /gluster/brick1/nonfuncvol-sv01
     68 /gluster/brick2/nonfuncvol-sv03
     34 /gluster/brick3/nonfuncvol-sv04
rhs-gp-srv8.lab.eng.blr.redhat.com
     38 /gluster/brick1/nonfuncvol-sv02
      2 /gluster/brick3/nonfuncvol-sv04

Number of times the worker failed with the Python traceback:

```
for i in {5..8}; do
    echo rhs-gp-srv${i}.lab.eng.blr.redhat.com
    ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com grep "split" -R /var/log/glusterfs/geo-replication/ | sort | uniq -c
done
```

rhs-gp-srv5.lab.eng.blr.redhat.com
      1 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log-20190714:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv6.lab.eng.blr.redhat.com
      3 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv7.lab.eng.blr.redhat.com
      3 /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd.log:AttributeError: 'NoneType' object has no attribute 'split'
rhs-gp-srv8.lab.eng.blr.redhat.com

Out of the many restarts (400+), only 7 were due to the Python traceback.

Other reasons observed:

Workers failed with ENOTCONN:

```
for i in {5..8}; do
    echo rhs-gp-srv${i}.lab.eng.blr.redhat.com
    ssh root@rhs-gp-srv${i}.lab.eng.blr.redhat.com 'grep -R "error=ENOTCONN" /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/gsyncd* | wc -l'
done
```

rhs-gp-srv5.lab.eng.blr.redhat.com
43
rhs-gp-srv6.lab.eng.blr.redhat.com
54
rhs-gp-srv7.lab.eng.blr.redhat.com
41
rhs-gp-srv8.lab.eng.blr.redhat.com
24

Many errors were found in the master mount logs (/var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/mnt-gluster-*):

[2019-07-18 01:41:24.349009] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for (null)
[2019-07-18 01:41:24.369058] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in (null) (gfid = ec3d7d53-51ae-4cdf-bbef-e3de147b4e75). Holes=1 overlaps=0
[2019-07-18 01:41:24.372697] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs
[2019-07-18 01:41:24.372755] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.372805] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = c44c6cef-6c76-4d3e-a56c-db21c71450be
[2019-07-18 01:41:24.374539] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel
[2019-07-18 01:41:24.384280] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.384340] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = c9076528-fea7-459c-b29a-2364b1f43bdc
[2019-07-18 01:41:24.386233] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com
[2019-07-18 01:41:24.386270] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.386302] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = be25fe8e-501a-485b-9b39-7caf2255ff6d
[2019-07-18 01:41:24.386419] W [MSGID: 109011] [dht-layout.c:152:dht_layout_search] 0-nonfuncvol-dht: no subvolume for hash (value) = 1716939520
[2019-07-18 01:41:24.387535] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 0-nonfuncvol-replicate-2: no read subvols for /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24
[2019-07-18 01:41:24.388206] I [MSGID: 109063] [dht-layout.c:650:dht_layout_normalize] 0-nonfuncvol-dht: Found anomalies in /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24 (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2019-07-18 01:41:24.388267] W [MSGID: 109005] [dht-selfheal.c:2143:dht_selfheal_directory] 0-nonfuncvol-dht: /IOs/kernel/dhcp42-223.lab.eng.blr.redhat.com/dir.24: Directory selfheal failed: 1 subvolumes down.Not fixing. gfid = d89f373a-08f4-42ab-b1ad-c60c92ee5181

I will send a patch to fix the Python traceback, but I need help from the DHT/AFR team to root-cause the ENOTCONN and directory self-heal failures.
Susant, Karthik - please have a look at comment 17.
(In reply to Aravinda VK from comment #17)
> [...]
> I will send a patch to fix the Python traceback, but I need help from the
> DHT/AFR team to root-cause the ENOTCONN and directory self-heal failures.

DHT does not heal directories when a subvolume is down. But here is the thing: if DHT is seeing an ENOTCONN error, that means the entire replica set was disconnected. Either the bricks were really down (possibly a crash) or there was a genuine network issue.

Susant
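One way to narrow down which of the two it was (a suggested check, not something recorded in this bug; the grep patterns are typical glusterfs log strings and may need adjusting for this build):

```
# Any brick currently offline on the master volume?
gluster volume status nonfuncvol

# Did any brick process stop or crash around the ENOTCONN timestamps?
grep -lE "crash|received signum|shutting down" /var/log/glusterfs/bricks/*.log

# Client-side disconnect messages seen by the geo-rep aux mounts?
grep -c "disconnected from" /var/log/glusterfs/geo-replication/nonfuncvol_rhs-gp-srv13.lab.eng.blr.redhat.com_nonfuncvol-slave/mnt-gluster-*
```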
(In reply to Aravinda VK from comment #13)
> Hi Nag/Rochelle,
>
> I failed to reproduce this in my setup. I also looked at the setup to find
> any issue but couldn't find one. Please provide the steps if you have a
> reproducer.
>
> One suspect is that gconf.get() is called before gconf is loaded. Still
> looking for the possible window when this can happen.

The steps to reproduce have been mentioned in detail in c#0; there is nothing additional to add.