When geo-replication is in use alongside tiering, bricks attached as part of a tier are incorrectly set to passive. If geo-replication is subsequently restarted, these bricks can become faulty.
Workaround:
Stop the geo-replication session before attaching or detaching a tier.
To attach a tier:
1. Stop geo-replication:
# gluster volume geo-replication master_vol slave_host::slave_vol stop
2. Attach the tier:
# gluster volume attach-tier master_vol replica 2 <server1>:/path/to/brick1 <server2>:/path/to/brick2 [force]
3. Restart geo-replication:
# gluster volume geo-replication master_vol slave_host::slave_vol start
4. Verify that the bricks in the tier are listed in the geo-replication session:
# gluster volume geo-replication master_vol slave_host::slave_vol status
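The attach steps above can be wrapped in a small script. This is a minimal sketch, assuming the placeholder names master_vol, slave_host, and slave_vol used in the steps and a two-brick replica 2 hot tier; adjust the brick list to your layout.
#!/bin/bash
# Sketch only: attach a hot tier while geo-replication is stopped.
# MASTER_VOL, SLAVE and the brick list below are placeholders.
MASTER_VOL=master_vol
SLAVE=slave_host::slave_vol

# 1. Stop geo-replication so no workers are running during the attach.
gluster volume geo-replication "$MASTER_VOL" "$SLAVE" stop

# 2. Attach the hot tier (replica 2 across two bricks in this sketch).
gluster volume attach-tier "$MASTER_VOL" replica 2 server1:/path/to/brick1 server2:/path/to/brick2

# 3. Restart geo-replication so the workers pick up the new hot/cold brick layout.
gluster volume geo-replication "$MASTER_VOL" "$SLAVE" start

# 4. Confirm that the tier bricks now appear in the session.
gluster volume geo-replication "$MASTER_VOL" "$SLAVE" status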
To detach a tier:
1. Detach the tier:
# gluster volume detach-tier master_vol start
2. Ensure that all data in the tier is synced to the slave by setting a checkpoint:
# gluster volume geo-replication master_vol slave_host::slave_vol config checkpoint now
3. Monitor the checkpoint until the displayed status is 'checkpoint as of <time of checkpoint creation> is completed at <completion time>' (a polling sketch is shown after this procedure):
# gluster volume geo-replication master_vol slave_host::slave_vol status detail
4. Verify that detachment is complete:
# gluster volume detach-tier master_vol status
5. Stop geo-replication:
# gluster volume geo-replication master_vol slave_host::slave_vol stop
6. Commit tier detachment:
# gluster volume detach-tier master_vol commit
7. Verify tier is detached:
# gluster volume info master_vol
8. Restart geo-replication:
# gluster volume geo-replication master_vol slave_host::slave_vol start
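Step 3 of the detach procedure (waiting for the checkpoint) can be automated with a simple polling loop. A minimal sketch, again using the placeholder names master_vol, slave_host, and slave_vol; the exact checkpoint wording in the status output differs between releases, so the grep pattern below is an assumption and may need adjusting.
#!/bin/bash
# Sketch only: poll geo-rep status until the checkpoint set in step 2 completes.
MASTER_VOL=master_vol
SLAVE=slave_host::slave_vol

# The "is completed at" pattern matches the message quoted in step 3;
# adjust it if your release reports checkpoint completion differently.
until gluster volume geo-replication "$MASTER_VOL" "$SLAVE" status detail | grep -qi "is completed at"; do
    echo "checkpoint not yet completed, waiting..."
    sleep 30
done
echo "Checkpoint completed; continue with the detach-tier status check and commit."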
I think documentation changes are required for Attach Tier, since a Geo-rep worker behaves differently depending on whether it is the worker for a cold brick or for a hot brick. If Geo-rep is not restarted after attaching a tier, the already started workers will not know whether their brick is hot or cold until they are restarted.
Stop Geo-replication before attaching a tier.
When we attach a tier, Gluster rearranges the brick details in the volume info to show the Hot Tier bricks first. Because of this, Geo-replication will not work as expected when a tier is attached while Geo-rep is running, so we need to stop Geo-replication before attach-tier. (The same is documented at https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Managing_Data_Tiering-Attach_Volumes.html#idp8297696.)
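To see the reordering described above, compare the brick list before and after the attach (master_vol is a placeholder volume name):
# gluster volume info master_vol | grep -i brick
Running the same command again after attach-tier shows the newly added hot tier bricks at the top of the list; this is the reordering that already running Geo-rep workers do not pick up until they are restarted.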
Please open a new RFE to support attaching a tier while Geo-rep is running.
Closing this bug as "NOTABUG" as discussed; please reopen if this requires a fix. Thanks.
Description of problem:
=======================
On a geo-rep setup, if bricks are added from an existing node, the new subvolume is added at the end of the volume --xml output and geo-rep correctly picks it up and makes the bricks ACTIVE and PASSIVE accordingly. But if the bricks are attached as part of a tier, they are added at the top of the volume --xml output, which reorders the existing subvolumes and leaves the newly added bricks PASSIVE only. Restarting geo-rep then causes a changelog exception.

Existing geo-rep setup:
=======================
[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status

MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    N/A
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.88     Active     Changelog Crawl    N/A
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.112    Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
[root@dhcp37-165 ~]#

After attaching a tier (2x2):
=============================
[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status

MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    2015-11-30 13:49:28
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    2015-11-30 13:49:28
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b4    root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b1    root          10.70.37.99::slave    10.70.37.87     Active     Changelog Crawl    2015-11-30 13:49:37
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b3    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b2    root          10.70.37.99::slave    10.70.37.162    Passive    N/A                N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A

After stopping and starting geo-rep session:
============================================
[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status

MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.112    Active     Changelog Crawl    2015-11-30 13:49:28
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.112    Active     Changelog Crawl    2015-11-30 13:49:28
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b3    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.88     Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.88     Passive    N/A                N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b2    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.199    Active     Hybrid Crawl       N/A
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b4    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b1    root          10.70.37.99::slave    10.70.37.87     Active     History Crawl      2015-11-30 13:49:37
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A
[root@dhcp37-165 ~]#

Changelog Exception as:
=======================
[2015-11-30 13:52:25.885133] I [resource(/rhs/brick3/hot-b3):1432:service_loop] GLUSTER: Register time: 1448871745
[2015-11-30 13:52:25.923217] I [master(/rhs/brick3/hot-b3):532:crawlwrap] _GMaster: primary master with volume id 282f2070-1821-411a-9c99-a4d34fe7e1f8 ...
[2015-11-30 13:52:25.992933] I [master(/rhs/brick3/hot-b3):541:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-11-30 13:52:26.1095] I [master(/rhs/brick3/hot-b3):488:mgmt_lock] _GMaster: Got lock : /rhs/brick3/hot-b3 : Becoming ACTIVE
[2015-11-30 13:52:26.121913] I [master(/rhs/brick3/hot-b3):1155:crawl] _GMaster: starting history crawl... turns: 1, stime: (1448868535, 0), etime: 1448871746
[2015-11-30 13:52:26.123167] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-11-30 13:52:26.134814] E [repce(/rhs/brick3/hot-b3):207:__call__] RepceClient: call 19467:140712638633792:1448871746.12 (history) failed on peer with ChangelogException
[2015-11-30 13:52:26.135267] E [resource(/rhs/brick3/hot-b3):1452:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory
[2015-11-30 13:52:26.135981] I [syncdutils(/rhs/brick3/hot-b3):220:finalize] <top>: exiting.
[2015-11-30 13:52:26.142213] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-11-30 13:52:26.142409] I [syncdutils(agent):220:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-7.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Establish a geo-rep session between the master and slave volumes.
2. Attach a tier to the existing master volume.
3. Check the geo-rep status of the newly attached bricks.
4. Restart the geo-rep session.
(A command-level sketch of these steps follows the Expected results section.)

Actual results:
===============
At step 3, all the newly attached bricks become PASSIVE only.
At step 4, some of the newly attached bricks become FAULTY.

Expected results:
=================
All the newly attached bricks should correctly acquire the lock and become ACTIVE or PASSIVE accordingly.
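For reference, a minimal command-level sketch of the reproduction steps above, using the volume and host names from this report. The geo-rep session from step 1 is assumed to already exist, and the pairing of the hot bricks into replica sets is an assumption:
# gluster volume attach-tier master replica 2 <hot_server1>:/rhs/brick3/hot-b1 <hot_server2>:/rhs/brick3/hot-b2 <hot_server3>:/rhs/brick3/hot-b3 <hot_server4>:/rhs/brick3/hot-b4
# gluster volume geo-replication master 10.70.37.99::slave status
# gluster volume geo-replication master 10.70.37.99::slave stop
# gluster volume geo-replication master 10.70.37.99::slave start
# gluster volume geo-replication master 10.70.37.99::slave status
The first status call corresponds to step 3, where the tier bricks show up only as Passive; the last one corresponds to step 4, where some of the tier bricks turn Faulty.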