Bug 1286587 - [geo-rep]: Attaching tier breaks the existing geo-rep session
Summary: [geo-rep]: Attaching tier breaks the existing geo-rep session
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Saravanakumar
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1268895 1299184
 
Reported: 2015-11-30 09:46 UTC by Rahul Hinduja
Modified: 2016-06-13 07:41 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When geo-replication is in use alongside tiering, bricks attached as part of a tier are incorrectly set to passive. If geo-replication is subsequently restarted, these bricks can become faulty.

Workaround: Stop the geo-replication session prior to attaching or detaching bricks that are part of a tier.

To attach a tier:
1. Stop geo-replication:
   # gluster volume geo-replication master_vol slave_host::slave_vol stop
2. Attach the tier:
   # gluster volume attach-tier master_vol replica 2 <server1>:/path/to/brick1 <server2>:/path/to/brick2 [force]
3. Restart geo-replication:
   # gluster volume geo-replication master_vol slave_host::slave_vol start
4. Verify that the tier bricks are available in the geo-replication session:
   # gluster volume geo-replication master_vol slave_host::slave_vol status

To detach a tier:
1. Start the tier detachment:
   # gluster volume detach-tier master_vol start
2. Ensure all data in that tier is synced to the slave by setting a checkpoint:
   # gluster volume geo-replication master_vol slave_host::slave_vol config checkpoint now
3. Monitor the checkpoint until the displayed status is 'checkpoint as of <time of checkpoint creation> is completed at <completion time>':
   # gluster volume geo-replication master_vol slave_host::slave_vol status detail
4. Verify that detachment is complete:
   # gluster volume detach-tier master_vol status
5. Stop geo-replication:
   # gluster volume geo-replication master_vol slave_host::slave_vol stop
6. Commit the tier detachment:
   # gluster volume detach-tier master_vol commit
7. Verify that the tier is detached:
   # gluster volume info master_vol
8. Restart geo-replication:
   # gluster volume geo-replication master_vol slave_host::slave_vol start
Clone Of:
Environment:
Last Closed: 2016-05-10 03:55:29 UTC
Embargoed:
sarumuga: needinfo+



Description Rahul Hinduja 2015-11-30 09:46:31 UTC
Description of problem:
=======================

On a geo-rep setup, if bricks are added from an existing node (add-brick), the new subvolume is appended at the end of the volume --xml output, and geo-rep correctly picks the bricks up and marks them ACTIVE or PASSIVE accordingly.

But if the bricks are attached as part of a tier, they are added at the top of the volume --xml output, which reorders the existing subvolumes and leaves the newly added bricks PASSIVE only.
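
The brick ordering geo-rep sees can be inspected directly; a minimal illustrative check (not from the original report), assuming the volume name "master" used here, this release's volume-info XML layout, and that xmllint is installed:

# gluster volume info master --xml | xmllint --xpath '//bricks/brick' -

Bricks added via add-brick appear at the end of this list, whereas attach-tier places the hot tier bricks at the top, which is the reordering described above.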

Restarting geo-rep afterwards causes a changelog exception.


Existing geo-rep setup:
=======================

 [root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status
 
MASTER NODE                          MASTER VOL    MASTER BRICK         SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1    root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    N/A                  
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7    root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    N/A                  
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4    root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A                  
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6    root          10.70.37.99::slave    10.70.37.88     Active     Changelog Crawl    N/A                  
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3    root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A                  
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5    root          10.70.37.99::slave    10.70.37.112    Passive    N/A                N/A                  
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                  
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                  
[root@dhcp37-165 ~]# 


After attaching a tier (2x2):
=============================

[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status
 
MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    2015-11-30 13:49:28          
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    2015-11-30 13:49:28          
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b4    root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A                          
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A                          
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                          
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                          
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b1    root          10.70.37.99::slave    10.70.37.87     Active     Changelog Crawl    2015-11-30 13:49:37          
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A                          
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b3    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                          
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A                          
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b2    root          10.70.37.99::slave    10.70.37.162    Passive    N/A                N/A                          
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A            



After stopping and starting geo-rep session:
============================================

[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status
 
MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.112    Active     Changelog Crawl    2015-11-30 13:49:28          
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.112    Active     Changelog Crawl    2015-11-30 13:49:28          
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b3    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A                          
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A                          
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.88     Passive    N/A                N/A                          
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.88     Passive    N/A                N/A                          
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b2    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A                          
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.199    Active     Hybrid Crawl       N/A                          
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b4    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A                          
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A                          
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b1    root          10.70.37.99::slave    10.70.37.87     Active     History Crawl      2015-11-30 13:49:37          
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A                          
[root@dhcp37-165 ~]#


Changelog exception:
====================

[2015-11-30 13:52:25.885133] I [resource(/rhs/brick3/hot-b3):1432:service_loop] GLUSTER: Register time: 1448871745
[2015-11-30 13:52:25.923217] I [master(/rhs/brick3/hot-b3):532:crawlwrap] _GMaster: primary master with volume id 282f2070-1821-411a-9c99-a4d34fe7e1f8 ...
[2015-11-30 13:52:25.992933] I [master(/rhs/brick3/hot-b3):541:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-11-30 13:52:26.1095] I [master(/rhs/brick3/hot-b3):488:mgmt_lock] _GMaster: Got lock : /rhs/brick3/hot-b3 : Becoming ACTIVE
[2015-11-30 13:52:26.121913] I [master(/rhs/brick3/hot-b3):1155:crawl] _GMaster: starting history crawl... turns: 1, stime: (1448868535, 0), etime: 1448871746
[2015-11-30 13:52:26.123167] E [repce(agent):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-11-30 13:52:26.134814] E [repce(/rhs/brick3/hot-b3):207:__call__] RepceClient: call 19467:140712638633792:1448871746.12 (history) failed on peer with ChangelogException
[2015-11-30 13:52:26.135267] E [resource(/rhs/brick3/hot-b3):1452:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory
[2015-11-30 13:52:26.135981] I [syncdutils(/rhs/brick3/hot-b3):220:finalize] <top>: exiting.
[2015-11-30 13:52:26.142213] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-11-30 13:52:26.142409] I [syncdutils(agent):220:finalize] <top>: exiting.


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.5-7.el7rhgs.x86_64


How reproducible:
=================

1/1

Steps to Reproduce:
===================
1. Establish a geo-rep session between the master and slave volumes.
2. Attach a tier to the existing master volume.
3. Check the geo-rep status of the newly attached bricks.
4. Restart the geo-rep session (a command sketch of these steps is shown below).
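
A hedged command sketch of the steps above, using the volume and slave names from this report; the attach-tier invocation is reconstructed from the Doc Text syntax and the 2x2 hot tier shown in the status output, so the server placeholders and exact argument order are illustrative:

# gluster volume geo-replication master 10.70.37.99::slave status
# gluster volume attach-tier master replica 2 <server1>:/rhs/brick3/hot-b1 <server2>:/rhs/brick3/hot-b2 <server3>:/rhs/brick3/hot-b3 <server4>:/rhs/brick3/hot-b4
# gluster volume geo-replication master 10.70.37.99::slave status    <-- step 3: newly attached bricks show only Passive
# gluster volume geo-replication master 10.70.37.99::slave stop
# gluster volume geo-replication master 10.70.37.99::slave start
# gluster volume geo-replication master 10.70.37.99::slave status    <-- step 4: some newly attached bricks show Faulty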

Actual results:
===============

At step 3, all the newly attached bricks become PASSIVE only.
At step 4, some of the newly attached bricks become FAULTY.

Expected results:
=================

All the newly attached bricks should correctly acquire the lock and become ACTIVE or PASSIVE as appropriate.

Comment 5 Aravinda VK 2015-11-30 16:17:37 UTC
I think documentation changes are required for attach-tier, since a geo-rep worker behaves differently depending on whether it serves a cold brick or a hot brick. If geo-rep is not restarted after attaching a tier, the already running workers have no way of knowing whether their brick is hot or cold.

Stop geo-replication before attaching a tier.

Comment 13 Saravanakumar 2016-01-25 07:31:05 UTC
Changes look fine.

Comment 16 Aravinda VK 2016-05-03 06:48:22 UTC
As per comment 10, we need to stop geo-replication before attaching a tier; the documentation is available at
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Managing_Data_Tiering-Attach_Volumes.html#idp8297696

Moving this bug to ON_QA.

Comment 19 Aravinda VK 2016-05-10 03:55:29 UTC
When a tier is attached, Gluster rearranges the brick details in the volume info output so that the hot tier bricks are listed first. Because of this, geo-replication does not work as expected when a tier is attached while geo-rep is running; geo-replication must be stopped before attach-tier. (The same is documented at https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Managing_Data_Tiering-Attach_Volumes.html#idp8297696.)

Please open a new RFE to support attaching a tier while geo-rep is running.

Closing this bug as NOTABUG as discussed; please reopen if this requires a fix. Thanks.

