Description of problem:
=======================
On a geo-rep setup, if bricks are added from the existing nodes, the new subvolume is appended at the end of the volume --xml output and geo-rep picks it up correctly, marking the bricks ACTIVE and PASSIVE accordingly. But if the bricks are attached as part of a tier, they are added at the top of the volume --xml output, which reorders the existing subvolumes and leaves the newly added bricks PASSIVE only. Restarting geo-rep then causes a changelog exception.

Existing geo-rep setup:
=======================
[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status

MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    N/A
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.88     Active     Changelog Crawl    N/A
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.112    Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
[root@dhcp37-165 ~]#

After attaching a tier (2x2):
=============================
[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status

MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    2015-11-30 13:49:28
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.99     Active     Changelog Crawl    2015-11-30 13:49:28
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b4    root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b1    root          10.70.37.99::slave    10.70.37.87     Active     Changelog Crawl    2015-11-30 13:49:37
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b3    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b2    root          10.70.37.99::slave    10.70.37.162    Passive    N/A                N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4    root          10.70.37.99::slave    10.70.37.199    Passive    N/A                N/A

After stopping and starting geo-rep session:
============================================
[root@dhcp37-165 ~]# gluster volume geo-replication master 10.70.37.99::slave status

MASTER NODE                          MASTER VOL    MASTER BRICK          SLAVE USER    SLAVE                 SLAVE NODE      STATUS     CRAWL STATUS       LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b1     root          10.70.37.99::slave    10.70.37.112    Active     Changelog Crawl    2015-11-30 13:49:28
dhcp37-165.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b7     root          10.70.37.99::slave    10.70.37.112    Active     Changelog Crawl    2015-11-30 13:49:28
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b3    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A
dhcp37-110.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b5     root          10.70.37.99::slave    10.70.37.162    Active     Changelog Crawl    N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b2     root          10.70.37.99::slave    10.70.37.88     Passive    N/A                N/A
dhcp37-133.lab.eng.blr.redhat.com    master        /rhs/brick2/ct-b8     root          10.70.37.99::slave    10.70.37.88     Passive    N/A                N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b2    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A
dhcp37-158.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b4     root          10.70.37.99::slave    10.70.37.199    Active     Hybrid Crawl       N/A
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b4    root          10.70.37.99::slave    N/A             Faulty     N/A                N/A
dhcp37-155.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b6     root          10.70.37.99::slave    10.70.37.87     Passive    N/A                N/A
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick3/hot-b1    root          10.70.37.99::slave    10.70.37.87     Active     History Crawl      2015-11-30 13:49:37
dhcp37-160.lab.eng.blr.redhat.com    master        /rhs/brick1/ct-b3     root          10.70.37.99::slave    10.70.37.99     Passive    N/A                N/A
[root@dhcp37-165 ~]#

Changelog Exception as:
=======================
[2015-11-30 13:52:25.885133] I [resource(/rhs/brick3/hot-b3):1432:service_loop] GLUSTER: Register time: 1448871745
[2015-11-30 13:52:25.923217] I [master(/rhs/brick3/hot-b3):532:crawlwrap] _GMaster: primary master with volume id 282f2070-1821-411a-9c99-a4d34fe7e1f8 ...
[2015-11-30 13:52:25.992933] I [master(/rhs/brick3/hot-b3):541:crawlwrap] _GMaster: crawl interval: 1 seconds
[2015-11-30 13:52:26.1095] I [master(/rhs/brick3/hot-b3):488:mgmt_lock] _GMaster: Got lock : /rhs/brick3/hot-b3 : Becoming ACTIVE
[2015-11-30 13:52:26.121913] I [master(/rhs/brick3/hot-b3):1155:crawl] _GMaster: starting history crawl... turns: 1, stime: (1448868535, 0), etime: 1448871746
[2015-11-30 13:52:26.123167] E [repce(agent):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/changelogagent.py", line 54, in history
    num_parallel)
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 100, in cl_history_changelog
    cls.raise_changelog_err()
  File "/usr/libexec/glusterfs/python/syncdaemon/libgfchangelog.py", line 27, in raise_changelog_err
    raise ChangelogException(errn, os.strerror(errn))
ChangelogException: [Errno 2] No such file or directory
[2015-11-30 13:52:26.134814] E [repce(/rhs/brick3/hot-b3):207:__call__] RepceClient: call 19467:140712638633792:1448871746.12 (history) failed on peer with ChangelogException
[2015-11-30 13:52:26.135267] E [resource(/rhs/brick3/hot-b3):1452:service_loop] GLUSTER: Changelog History Crawl failed, [Errno 2] No such file or directory
[2015-11-30 13:52:26.135981] I [syncdutils(/rhs/brick3/hot-b3):220:finalize] <top>: exiting.
[2015-11-30 13:52:26.142213] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-11-30 13:52:26.142409] I [syncdutils(agent):220:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-7.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Establish a geo-rep session between the master and slave volumes.
2. Attach a tier to the existing master volume.
3. Check the geo-rep status of the newly attached bricks.
4. Restart the geo-rep session.

Actual results:
===============
At step 3, all the newly attached bricks become PASSIVE only.
At step 4, some of the newly attached bricks become FAULTY.

Expected results:
=================
All the newly attached bricks should correctly acquire the lock and become ACTIVE or PASSIVE accordingly.
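To make the reordering effect concrete, here is a minimal Python sketch. It is an illustration only, not geo-rep's actual code: the function group_into_subvolumes and the short brick names are stand-ins for the bricks above. It groups a positional brick list into replica subvolumes and shows that prepending the hot-tier bricks renumbers every existing subvolume.

# Illustration only: group a positional brick list into replica subvolumes
# and show how prepending hot-tier bricks renumbers the existing groups.
def group_into_subvolumes(bricks, replica_count):
    """Group consecutive bricks into replica sets based on their position."""
    return [bricks[i:i + replica_count]
            for i in range(0, len(bricks), replica_count)]

cold_bricks = ["ct-b1", "ct-b2", "ct-b3", "ct-b4",
               "ct-b5", "ct-b6", "ct-b7", "ct-b8"]
hot_bricks = ["hot-b1", "hot-b2", "hot-b3", "hot-b4"]

# Before attach-tier: ct-b1/ct-b2 form subvolume 0, ct-b3/ct-b4 subvolume 1, ...
for idx, subvol in enumerate(group_into_subvolumes(cold_bricks, 2)):
    print("before:", idx, subvol)

# After attach-tier the hot bricks are listed first, so the hot pairs take
# indices 0 and 1 and every existing pair shifts to a new subvolume index.
# Workers started against the old numbering and workers started against the
# new numbering therefore disagree about the position of their replica set.
for idx, subvol in enumerate(group_into_subvolumes(hot_bricks + cold_bricks, 2)):
    print("after: ", idx, subvol)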
I think a documentation change is required for attach-tier, since a geo-rep worker behaves differently depending on whether it is the worker for a cold brick or for a hot brick. If geo-rep is not restarted after attaching the tier, the workers that are already running will not know whether their brick is hot or cold. Geo-replication must be stopped before attaching the tier.
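For reference, a rough sketch of that sequence as a Python wrapper around the gluster CLI. The attach-tier syntax is assumed for the 3.7 series, the run() helper is hypothetical, and the brick list is only a placeholder taken from this report; verify the exact command against gluster volume help on your build before using it.

# Rough sketch of the documented order of operations: stop geo-rep, attach the
# tier, then start geo-rep so the workers come up against the new (hot-first)
# brick layout.  The attach-tier syntax is assumed for the 3.7 series and the
# brick list below is only a placeholder taken from this report.
import subprocess

MASTER = "master"
SLAVE = "10.70.37.99::slave"
HOT_BRICKS = [
    "dhcp37-160.lab.eng.blr.redhat.com:/rhs/brick3/hot-b1",
    "dhcp37-158.lab.eng.blr.redhat.com:/rhs/brick3/hot-b2",
    "dhcp37-110.lab.eng.blr.redhat.com:/rhs/brick3/hot-b3",
    "dhcp37-155.lab.eng.blr.redhat.com:/rhs/brick3/hot-b4",
]

def run(cmd):
    # Print and execute one gluster CLI command, failing loudly on error.
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Stop geo-replication so no worker keeps running against the old layout.
run(["gluster", "volume", "geo-replication", MASTER, SLAVE, "stop"])

# 2. Attach the hot tier (a 2x2 distributed-replicate tier in this report).
run(["gluster", "volume", "attach-tier", MASTER, "replica", "2"] + HOT_BRICKS)

# 3. Start geo-replication again; new workers read the reordered brick list.
run(["gluster", "volume", "geo-replication", MASTER, SLAVE, "start"])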
Changes look fine.
As per comment 10, we need to stop geo-replication before attaching the tier; the documentation is available at https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Managing_Data_Tiering-Attach_Volumes.html#idp8297696

Moving this bug to ON_QA.
When we attach a tier, Gluster rearranges the brick details in the volume info so that the hot-tier bricks are listed first. Because of this, geo-replication does not work as expected when a tier is attached while geo-rep is running; geo-replication must be stopped before attach-tier. (The same is documented at https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Managing_Data_Tiering-Attach_Volumes.html#idp8297696.) Please open a new RFE to support attaching a tier while geo-rep is running.

Closing this bug as "NOTABUG" as discussed; please reopen if this requires a fix. Thanks.
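Follow-up note for anyone verifying this on their own setup: the reordering is visible by comparing the brick order in gluster volume info VOLNAME --xml before and after attach-tier. Below is a rough Python sketch of how to print that order; the brick_order helper is hypothetical and the XML element names are assumed from the 3.7-era CLI output, so they may differ on other versions.

# Rough sketch: print the brick order that the CLI reports for a volume, to
# compare before and after attach-tier.  Element names ("brick", "name") are
# assumed from the 3.7-era --xml output; adjust if your version differs.
import subprocess
import xml.etree.ElementTree as ET

def brick_order(volname):
    out = subprocess.check_output(
        ["gluster", "volume", "info", volname, "--xml"])
    root = ET.fromstring(out)
    bricks = []
    for brick in root.iter("brick"):
        # Some outputs carry a <name> child; others put host:/path in the text.
        name = brick.findtext("name") or (brick.text or "").strip()
        if name:
            bricks.append(name)
    return bricks

if __name__ == "__main__":
    for idx, brick in enumerate(brick_order("master")):
        print(idx, brick)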