Description of problem: After a remove-brick commit on a machine that hosts multiple bricks, the geo-rep worker on that machine starts using xsync for the other, still-running brick, because it fails to get a connection to the removed brick:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-07-24 18:45:40.343144] I [master(/bricks/s3):461:volinfo_query] _GMaster: new master is 00261f58-a7d5-4c9f-a6a7-af41b70d92a3
[2013-07-24 18:45:40.343264] I [master(/bricks/s3):465:volinfo_query] _GMaster: primary master with volume id 00261f58-a7d5-4c9f-a6a7-af41b70d92a3
...
[2013-07-24 18:45:50.191620] I [master(/bricks/s6):780:fallback_xsync] _GMaster: falling back to xsync mode
[2013-07-24 18:45:50.194266] I [syncdutils(/bricks/s6):158:finalize] <top>: exiting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

As the log shows, workers are crawling two bricks, /bricks/s3 and /bricks/s6. /bricks/s6 was removed, so the worker failed on it and fell back to xsync.

Version-Release number of selected component (if applicable): 3.4.0.12rhs.beta6-1.el6rhs.x86_64

How reproducible:
Didn't try to reproduce again.

Steps to Reproduce:
1. Create and start a geo-rep relationship between the master (volume configured as given in the additional info) and the slave.
2. Create some data on the master and let it sync.
3. remove-brick, so that the master volume ends up as in the second config in the additional info (see the command sketch after the volume info below).
4. Check the geo-rep log file on the machine from which the bricks were removed.

Actual results: The geo-rep change_detector falls back to xsync.

Expected results: It shouldn't fall back to xsync when everything is working fine.

Additional info:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
volume info before remove-brick
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Volume Name: mastervol
Type: Distributed-Replicate
Volume ID: 00261f58-a7d5-4c9f-a6a7-af41b70d92a3
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: machine1:/bricks/s1
Brick2: machine2:/bricks/s2
Brick3: machine3:/bricks/s3
Brick4: machine4:/bricks/s4
Brick5: machine2:/bricks/s5
Brick6: machine3:/bricks/s6
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
volume info after the remove-brick
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Volume Name: mastervol
Type: Distributed-Replicate
Volume ID: 00261f58-a7d5-4c9f-a6a7-af41b70d92a3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: machine1:/bricks/s1
Brick2: machine2:/bricks/s2
Brick3: machine3:/bricks/s3
Brick4: machine4:/bricks/s4
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
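For reference, step 3 would look roughly like the following. This is a minimal sketch assuming the brick pair from the volume info above (machine2:/bricks/s5 and machine3:/bricks/s6) and the standard remove-brick CLI:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# Migrate data off the replica pair being removed:
gluster volume remove-brick mastervol machine2:/bricks/s5 machine3:/bricks/s6 start

# Poll until status reports "completed" for both bricks:
gluster volume remove-brick mastervol machine2:/bricks/s5 machine3:/bricks/s6 status

# Commit the removal; step 4 then checks the geo-rep log on machine3:
gluster volume remove-brick mastervol machine2:/bricks/s5 machine3:/bricks/s6 commit
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>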
This still happens in build glusterfs-3.4.0.44rhs-1.
Also, all the passive gsyncd workers crashed with this traceback:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-11-20 12:35:35.899013] I [master(/bricks/brick6):426:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-11-20 12:35:35.905201] E [syncdutils(/bricks/brick6):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1156, in service_loop
    g1.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 457, in crawlwrap
    self.slave.server.set_stime(self.FLAT_DIR_HIERARCHY, self.uuid, cluster_stime)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1143, in <lambda>
    slave.server.set_stime = types.MethodType(lambda _self, path, uuid, mark: brickserver.set_stime(path, uuid + '.' + gconf.slave_id, mark), slave.server)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 299, in ff
    return f(*a)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 484, in set_stime
    Xattr.lsetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'stime']), struct.pack('!II', *mark))
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 66, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
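The OSError at the bottom is the lsetxattr(2) call failing with ENOENT: set_stime records crawl progress by packing two 32-bit integers and writing them into a trusted.glusterfs.<uuid>.stime extended attribute on the brick-side path, which no longer exists once the brick has been removed. A rough shell equivalent (run as root; the uuid component and the packed value below are made-up placeholders, not from a real run):

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# Roughly the xattr write that set_stime performs, shown with setfattr(1);
# the uuid component and the 8-byte packed value are placeholders.
setfattr -n trusted.glusterfs.00000000-0000-0000-0000-000000000000.stime \
         -v 0x0000000000000000 /bricks/brick6
# If /bricks/brick6 has been removed, this fails the same way:
# setfattr: /bricks/brick6: No such file or directory
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>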
The xsync fallback still happens in build glusterfs-3.4.0.59rhs-1, although the gsyncd crash no longer occurs.
Verified with build glusterfs-3.7.1-7.el6rhs.x86_64. With the new steps as mentioned in comment 8, the geo-rep session needs to be stopped before the remove-brick commit. After the commit, the restarted geo-rep session correctly goes to History and then to Changelog (see the sketch below). Moving this bug to the verified state.
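The verified sequence, sketched with a placeholder slave (slavehost::slavevol) and the brick pair from the original report; the log path is the usual geo-replication worker log location:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# Stop the geo-rep session before committing the remove-brick:
gluster volume geo-replication mastervol slavehost::slavevol stop

gluster volume remove-brick mastervol machine2:/bricks/s5 machine3:/bricks/s6 commit

# Restart the session; the workers should begin with a History crawl
# and then switch to the Changelog crawl:
gluster volume geo-replication mastervol slavehost::slavevol start

# Confirm the crawl type in the worker logs:
grep -i crawl /var/log/glusterfs/geo-replication/mastervol/*.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>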
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html