Bug 1668118

Summary: Failure to start geo-replication for tiered volume.
Product: [Community] GlusterFS
Component: geo-replication
Version: 5
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED CANTFIX
Reporter: vnosov <vnosov>
Assignee: bugs <bugs>
CC: atumball, bugs, vnosov
Type: Bug
Last Closed: 2019-05-27 16:20:50 UTC

Description vnosov 2019-01-21 23:21:09 UTC
Description of problem: Geo-replication fails to start when the master volume is tiered: the status of the geo-replication workers on the master node is reported as "inconsistent".


Version-Release number of selected component (if applicable):

GlusterFS 5.2, built and installed from the source code TAR file


How reproducible:  100%


Steps to Reproduce:

1. Set up two nodes. One will host the geo-replication master volume, which has to be tiered; the other will host the geo-replication slave volume.

[root@SC-10-10-63-182 log]# glusterfsd --version
glusterfs 5.2

[root@SC-10-10-63-183 log]# glusterfsd --version
glusterfs 5.2

 
2. On the master node, create the tiered volume:
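
A possible creation sequence (a sketch: the brick paths are taken from the "gluster volume info" output below, but the exact create/attach options the reporter used are not recorded):

# gluster volume create master-volume-1 10.10.60.182:/exports/master-segment-1/master-volume-1
# gluster volume start master-volume-1
# gluster volume tier master-volume-1 attach 10.10.60.182:/exports/master-hot-tier/master-volume-1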

[root@SC-10-10-63-182 log]# gluster volume info master-volume-1

Volume Name: master-volume-1
Type: Tier
Volume ID: aa95df34-f181-456c-aa26-9756b68ed679
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distribute
Number of Bricks: 1
Brick1: 10.10.60.182:/exports/master-hot-tier/master-volume-1
Cold Tier:
Cold Tier Type : Distribute
Number of Bricks: 1
Brick2: 10.10.60.182:/exports/master-segment-1/master-volume-1
Options Reconfigured:
features.ctr-sql-db-wal-autocheckpoint: 25000
features.ctr-sql-db-cachesize: 12500
cluster.tier-mode: cache
features.ctr-enabled: on
server.allow-insecure: on
performance.quick-read: off
performance.stat-prefetch: off
nfs.addr-namelookup: off
transport.address-family: inet
nfs.disable: on
cluster.enable-shared-storage: disable
snap-activate-on-create: enable

[root@SC-10-10-63-182 log]# gluster volume status master-volume-1
Status of volume: master-volume-1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.10.60.182:/exports/master-hot-tier
/master-volume-1                            62001     0          Y       15690
Cold Bricks:
Brick 10.10.60.182:/exports/master-segment-
1/master-volume-1                           62000     0          Y       9762
Tier Daemon on localhost                    N/A       N/A        Y       15713

Task Status of Volume master-volume-1
------------------------------------------------------------------------------
There are no active volume tasks

[root@SC-10-10-63-182 log]# gluster volume tier master-volume-1 status
Node                 Promoted files       Demoted files        Status               run time in h:m:s
---------            ---------            ---------            ---------            ---------
localhost            0                    0                    in progress          0:3:40
Tiering Migration Functionality: master-volume-1: success



3. On the slave node, create the slave volume:
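
A possible creation sequence (a sketch; the brick path matches the "gluster volume info" output below):

# gluster volume create slave-volume-1 10.10.60.183:/exports/slave-segment-1/slave-volume-1
# gluster volume start slave-volume-1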

[root@SC-10-10-63-183 log]# gluster volume info slave-volume-1

Volume Name: slave-volume-1
Type: Distribute
Volume ID: 569a340b-35f8-4109-8816-720982b11806
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.10.60.183:/exports/slave-segment-1/slave-volume-1
Options Reconfigured:
server.allow-insecure: on
performance.quick-read: off
performance.stat-prefetch: off
nfs.addr-namelookup: off
transport.address-family: inet
nfs.disable: on
cluster.enable-shared-storage: disable
snap-activate-on-create: enable

[root@SC-10-10-63-183 log]# gluster volume status slave-volume-1
Status of volume: slave-volume-1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.10.60.183:/exports/slave-segment-1
/slave-volume-1                             62000     0          Y       2532

Task Status of Volume slave-volume-1
------------------------------------------------------------------------------
There are no active volume tasks

4. Set up SSH access to the slave node (a setup sketch follows the verification log below):

SSH from 182 to 183:

20660 01/21/2019 13:58:54.930122501 1548107934 command: /usr/bin/ssh nasgorep.60.183 /bin/pwd
20660 01/21/2019 13:58:55.021906148 1548107935 status=0 /usr/bin/ssh nasgorep.60.183 /bin/pwd
20694 01/21/2019 13:58:56.169890800 1548107936 command: /usr/bin/ssh -q -oConnectTimeout=5 nasgorep.60.183 /bin/pwd 2>&1
20694 01/21/2019 13:58:56.256032202 1548107936 status=0 /usr/bin/ssh -q -oConnectTimeout=5 nasgorep.60.183 /bin/pwd 2>&1
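
The log entries above only verify the connection. A typical setup for a non-root geo-replication user such as nasgorep would look like the following (a sketch consistent with the gluster-mountbroker output in step 6; the useradd step is an assumption, as the report does not show how the user was created):

On the slave node (183):
# useradd -m nasgorep
# gluster-mountbroker setup /var/mountbroker-root nasgorep
# gluster-mountbroker add slave-volume-1 nasgorep
# systemctl restart glusterd

On the master node (182):
# ssh-keygen
# ssh-copy-id nasgorep@nasgorep.60.183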


5. Initialize geo-replication from the master volume to the slave volume:

[root@SC-10-10-63-182 log]# vi /var/log/glusterfs/cmd_history.log

[2019-01-21 21:59:08.942567]  : system:: execute gsec_create : SUCCESS
[2019-01-21 21:59:42.722194]  : volume geo-replication master-volume-1 nasgorep.60.183::slave-volume-1 create push-pem : SUCCESS
[2019-01-21 21:59:49.527353]  : volume geo-replication master-volume-1 nasgorep.60.183::slave-volume-1 start : SUCCESS
[2019-01-21 21:59:55.636198]  : volume geo-replication master-volume-1 nasgorep.60.183::slave-volume-1 status detail : SUCCESS
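
These entries correspond to the following CLI invocations on the master node (reconstructed directly from the log above):

# gluster system:: execute gsec_create
# gluster volume geo-replication master-volume-1 nasgorep.60.183::slave-volume-1 create push-pem
# gluster volume geo-replication master-volume-1 nasgorep.60.183::slave-volume-1 start
# gluster volume geo-replication master-volume-1 nasgorep.60.183::slave-volume-1 status detail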

6. Check the geo-replication status:

Actual results:

[root@SC-10-10-63-183 log]# /usr/sbin/gluster-mountbroker status
+-----------+-------------+---------------------------+--------------+---------------------------+
|    NODE   | NODE STATUS |         MOUNT ROOT        |    GROUP     |           USERS           |
+-----------+-------------+---------------------------+--------------+---------------------------+
| localhost |          UP | /var/mountbroker-root(OK) | nasgorep(OK) | nasgorep(slave-volume-1)  |
+-----------+-------------+---------------------------+--------------+---------------------------+

[root@SC-10-10-63-182 log]# gluster volume geo-replication master-volume-1 nasgorep.60.183::slave-volume-1 status

MASTER NODE     MASTER VOL         MASTER BRICK                                 SLAVE USER    SLAVE                                    SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.10.60.182    master-volume-1    /exports/master-hot-tier/master-volume-1     nasgorep      nasgorep.60.183::slave-volume-1    N/A           Stopped    N/A             N/A
10.10.60.182    master-volume-1    /exports/master-segment-1/master-volume-1    nasgorep      nasgorep.60.183::slave-volume-1    N/A           Stopped    N/A             N/A


Expected results:

The status of the geo-replication workers on the master node should be "Active".


Additional info:

The contents of /var/log/glusterfs/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.log on the master node explain what went wrong:

[root@SC-10-10-63-182 log]# vi /var/log/glusterfs/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.log

[2019-01-21 21:59:39.347943] W [gsyncd(config-get):304:main] <top>: Session config file not exists, using the default config    path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:42.438145] I [gsyncd(monitor-status):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:42.454929] I [subcmds(monitor-status):29:subcmd_monitor_status] <top>: Monitor Status Change  status=Created
[2019-01-21 21:59:48.756702] I [gsyncd(config-get):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:49.4720] I [gsyncd(config-get):308:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:49.239733] I [gsyncd(config-get):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:49.475193] I [gsyncd(monitor):308:main] <top>: Using session config file  path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:49.868150] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...
[2019-01-21 21:59:49.868396] I [monitor(monitor):157:monitor] Monitor: starting gsyncd worker   slave_node=10.10.60.183 brick=/exports/master-segment-1/master-volume-1
[2019-01-21 21:59:49.871593] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing...
[2019-01-21 21:59:49.871963] I [monitor(monitor):157:monitor] Monitor: starting gsyncd worker   slave_node=10.10.60.183 brick=/exports/master-hot-tier/master-volume-1
[2019-01-21 21:59:50.4395] I [monitor(monitor):268:monitor] Monitor: worker died before establishing connection brick=/exports/master-segment-1/master-volume-1
[2019-01-21 21:59:50.7447] I [monitor(monitor):268:monitor] Monitor: worker died before establishing connection brick=/exports/master-hot-tier/master-volume-1
[2019-01-21 21:59:50.8415] I [gsyncd(agent /exports/master-segment-1/master-volume-1):308:main] <top>: Using session config file    path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:50.10383] I [gsyncd(agent /exports/master-hot-tier/master-volume-1):308:main] <top>: Using session config file    path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:50.14039] I [repce(agent /exports/master-segment-1/master-volume-1):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-01-21 21:59:50.15556] I [changelogagent(agent /exports/master-hot-tier/master-volume-1):72:__init__] ChangelogAgent: Agent listining...
[2019-01-21 21:59:50.15964] I [repce(agent /exports/master-hot-tier/master-volume-1):97:service_loop] RepceServer: terminating on reaching EOF.
[2019-01-21 21:59:55.141768] I [gsyncd(config-get):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:55.380496] I [gsyncd(status):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 21:59:55.625045] I [gsyncd(status):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/master-volume-1_10.10.60.183_slave-volume-1/gsyncd.conf
[2019-01-21 22:00:00.66032] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change  status=inconsistent
[2019-01-21 22:00:00.66289] E [syncdutils(monitor):338:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 368, in twrap
    tf(*aargs)
  File "/usr/libexec/glusterfs/python/syncdaemon/monitor.py", line 339, in wmon
    slave_host, master, suuid, slavenodes)
TypeError: 'int' object is not iterable


The traceback indicates that wmon() in monitor.py receives an int where it expects an iterable (presumably the per-brick slave node list), which only happens when the master volume is tiered. A similar test on GlusterFS 3.12.14 does not show this failure.

Comment 1 Amar Tumballi 2019-05-27 16:20:50 UTC
We have deprecated the 'tier' feature of glusterfs, hence it is not possible to fix this in a future release.