Bug 1286587

Summary: [geo-rep]: Attaching tier breaks the existing geo-rep session
Product: [Red Hat Storage] Red Hat Gluster Storage    Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication    Assignee: Saravanakumar <sarumuga>
Status: CLOSED NOTABUG QA Contact: Rahul Hinduja <rhinduja>
Severity: urgent Docs Contact:
Priority: high    
Version: rhgs-3.1    CC: asriram, asrivast, avishwan, chrisw, csaba, lbailey, nchilaka, nlevinki, rcyriac, sankarshan, sarumuga
Target Milestone: ---    Keywords: ZStream
Target Release: ---    Flags: sarumuga: needinfo+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
When geo-replication is in use alongside tiering, bricks attached as part of a tier are incorrectly set to passive. If geo-replication is subsequently restarted, these bricks can become faulty.

Workaround: Stop the geo-replication session prior to attaching or detaching bricks that are part of a tier.

To attach a tier:

1. Stop geo-replication:
   # gluster volume geo-replication master_vol slave_host::slave_vol stop
2. Attach the tier:
   # gluster volume attach-tier master_vol replica 2 <server1>:/path/to/brick1 <server2>:/path/to/brick2 [force]
3. Restart geo-replication:
   # gluster volume geo-replication master_vol slave_host::slave_vol start
4. Verify that the bricks in the tier are available in the geo-replication session:
   # gluster volume geo-replication master_vol slave_host::slave_vol status

To detach a tier:

1. Detach the tier:
   # gluster volume detach-tier master_vol start
2. Ensure all data in that tier is synced to the slave:
   # gluster volume geo-replication master_vol slave_host::slave_vol config checkpoint now
3. Monitor the checkpoint until the displayed status is 'checkpoint as of <time of checkpoint creation> is completed at <time>':
   # gluster volume geo-replication master_vol slave_host::slave_vol status detail
4. Verify that detachment is complete:
   # gluster volume detach-tier master_vol status
5. Stop geo-replication:
   # gluster volume geo-replication master_vol slave_host::slave_vol stop
6. Commit the tier detachment:
   # gluster volume detach-tier master_vol commit
7. Verify that the tier is detached:
   # gluster volume info master_vol
8. Restart geo-replication:
   # gluster volume geo-replication master_vol slave_host::slave_vol start
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-10 03:55:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1268895, 1299184    

Description Rahul Hinduja 2015-11-30 09:46:31 UTC
Description of problem:
=======================

On a geo-rep setup, if bricks are added from an existing node, the new subvolume is appended at the end of the volume --xml output, and geo-rep correctly picks them up and marks them ACTIVE or PASSIVE accordingly.

But if the bricks are attached as part of a tier, they are added at the top of the volume --xml output, which reorders the existing subvolumes and leaves the newly added bricks PASSIVE only.

Restarting geo-rep then causes a changelog exception.
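
For illustration, the reordering can be seen directly in the brick listing that geo-rep parses. A minimal sketch, assuming the master volume name "master" from the setup below and hypothetical hot-brick paths (the grep simply filters the <brick> entries out of the XML output):

Brick order before attaching the tier (cold bricks in their original order):

# gluster volume info master --xml | grep '<brick'

Attach the tier:

# gluster volume attach-tier master replica 2 <server1>:/rhs/brick3/hot-b1 <server2>:/rhs/brick3/hot-b2 <server3>:/rhs/brick3/hot-b3 <server4>:/rhs/brick3/hot-b4

Brick order after attaching the tier: the hot tier bricks are now listed first, so every existing cold brick shifts position relative to what the already-running workers read when they started:

# gluster volume info master --xml | grep '<brick'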


Existing geo-rep setup:
=======================

 [root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status
 
MASTER NODE                          MASTER VOL    MASTER BRICK         SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1    root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    N/A                  
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7    root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    N/A                  
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4    root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A                  
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6    root          10.70.37.99::slave    10.70.37.88     Active     Changelog Crawl    N/A                  
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3    root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A                  
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5    root          10.70.37.99::slave    10.70.37.112    Passive    N/A                N/A                  
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                  
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                  
[root@dhcp37-165 ~]# 


After attaching a tier (2x2):
=============================

[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status
 
MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    2015-11-30 13:49:28          
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    2015-11-30 13:49:28          
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b4    root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A                          
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A                          
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                          
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                          
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b1    root          10.70.37.99::slave    10.70.37.87     Active     Changelog Crawl    2015-11-30 13:49:37          
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A                          
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b3    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A                          
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A                          
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b2    root          10.70.37.99::slave    10.70.37.162    Passive    N/A                N/A                          
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A            



After stopping and starting geo-rep session:
============================================

[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status
 
MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED                  
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.112    Active     Changelog Crawl    2015-11-30 13:49:28          
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.112    Active     Changelog Crawl    2015-11-30 13:49:28          
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b3    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A                          
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A                          
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.88     Passive    N/A                N/A                          
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.88     Passive    N/A                N/A                          
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b2    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A                          
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.199    Active     Hybrid Crawl       N/A                          
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b4    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A                          
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A                          
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b1    root          10.70.37.99::slave    10.70.37.87     Active     History Crawl      2015-11-30 13:49:37          
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A                          
[root@dhcp37-165 ~]#


Changelog Exception as:
=======================

[2015-11-30 13:52:25.885133] I [resource(/rhs/brick3/hot-b3):1432:service_loop] GLUSTER: Register time: 1448871745
[2015-11-30 13:52:25.923217] I [master(/rhs/brick3/hot-b3):532:crawlwrap] _GMaster: primary master with volume id 282f2070-1821-411a-9c99-a4d34fe7e1f8 ...
[2015-11-30 13:52:25.992933] I [master(/rhs/brick3/hot-b3):541:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-11-30 13:52:26.1095] I [master(/rhs/brick3/hot-b3):488:mgmt_lock] _GMaster: Got lock : /rhs/brick3/hot-b3 : Becoming ACTIVE
[2015-11-30 13:52:26.121913] I [master(/rhs/brick3/hot-b3):1155:crawl] _GMaster: starting history crawl... turns: 1, stime: (1448868535, 0), etime: 1448871746
[2015-11-30 13:52:26.123167] E [repce(agent):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-11-30 13:52:26.134814] E [repce(/rhs/brick3/hot-b3):207:__call__] RepceClient: call 19467:140712638633792:1448871746.12 (history) failed on peer with ChangelogException
[2015-11-30 13:52:26.135267] E [resource(/rhs/brick3/hot-b3):1452:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory
[2015-11-30 13:52:26.135981] I [syncdutils(/rhs/brick3/hot-b3):220:finalize] <top>: exiting.
[2015-11-30 13:52:26.142213] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-11-30 13:52:26.142409] I [syncdutils(agent):220:finalize] <top>: exiting.


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.5-7.el7rhgs.x86_64


How reproducible:
=================

1/1

Steps to Reproduce:
===================
1. Establish a geo-rep session between the master and slave volumes
2. Attach a tier to the existing master volume
3. Check the geo-rep status of the newly attached bricks.
4. Restart the geo-rep session.
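
As a concrete command sequence (a minimal sketch; the volume and slave names are taken from the status output above, the brick paths are hypothetical, and the create options are assumed):

# gluster volume geo-replication master 10.70.37.99::slave create push-pem
# gluster volume geo-replication master 10.70.37.99::slave start
# gluster volume attach-tier master replica 2 <server1>:/rhs/brick3/hot-b1 <server2>:/rhs/brick3/hot-b2 <server3>:/rhs/brick3/hot-b3 <server4>:/rhs/brick3/hot-b4
# gluster volume geo-replication master 10.70.37.99::slave status
# gluster volume geo-replication master 10.70.37.99::slave stop
# gluster volume geo-replication master 10.70.37.99::slave start
# gluster volume geo-replication master 10.70.37.99::slave status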

Actual results:
===============

At step 3, all of the newly attached bricks become PASSIVE only.
At step 4, some of the newly attached bricks become FAULTY.

Expected results:
=================

All of the newly attached bricks should correctly acquire the lock and become ACTIVE or PASSIVE accordingly.

Comment 5 Aravinda VK 2015-11-30 16:17:37 UTC
I think documentation changes are required for Attach Tier, since a Geo-rep worker behaves differently depending on whether it is the worker for a cold brick or a hot brick. If Geo-rep is not restarted after attaching the tier, the already-started workers will not know whether their bricks are hot or cold.

Stop Geo-replication before attaching Tier.
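
As a command sequence, that amounts to the following (a sketch, using the generic names from the Doc Text above):

# gluster volume geo-replication master_vol slave_host::slave_vol stop
# gluster volume attach-tier master_vol replica 2 <server1>:/path/to/brick1 <server2>:/path/to/brick2
# gluster volume geo-replication master_vol slave_host::slave_vol start
# gluster volume geo-replication master_vol slave_host::slave_vol status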

Comment 13 Saravanakumar 2016-01-25 07:31:05 UTC
Changes look fine.

Comment 16 Aravinda VK 2016-05-03 06:48:22 UTC
As per comment 10, we need to stop Geo-replication before attaching the tier; the documentation is available at
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Managing_Data_Tiering-Attach_Volumes.html#idp8297696

Moving this bug to ON_QA

Comment 19 Aravinda VK 2016-05-10 03:55:29 UTC
When we attach a tier, Gluster rearranges the brick details in the volume info to show the hot tier bricks first. Because of this, Geo-replication will not work as expected when a tier is attached while Geo-rep is running. We need to stop Geo-replication before attach-tier. (The same is documented at https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Managing_Data_Tiering-Attach_Volumes.html#idp8297696.)

Please open a new RFE to support attaching a tier while Geo-rep is running.

Closing this bug as "NOTABUG" as discussed; please reopen if this requires a fix. Thanks.