Bug 987980 - Dist-geo-rep: after remove-brick commit from a machine having multiple bricks, the change_detector becomes xsync.
Summary: Dist-geo-rep: after remove-brick commit from a machine having multiple bricks, the change_detector becomes xsync.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Aravinda VK
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On: 1026831
Blocks: 1202842 1223636
 
Reported: 2013-07-24 13:47 UTC by Vijaykumar Koppad
Modified: 2015-07-29 04:28 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.7.0-2.el6rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-29 04:28:28 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2015:1495 (SHIPPED_LIVE): Important: Red Hat Gluster Storage 3.1 update, last updated 2015-07-29 08:26:26 UTC

Description Vijaykumar Koppad 2013-07-24 13:47:27 UTC
Description of problem: After a remove-brick commit on a machine that hosts multiple bricks, the geo-rep worker on that machine starts using xsync for the other running brick on that machine, because it fails to get a connection to the removed brick:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-07-24 18:45:40.343144] I [master(/bricks/s3):461:volinfo_query] _GMaster: new master is 00261f58-a7d5-4c9f-a6a7-af41b70d92a3
[2013-07-24 18:45:40.343264] I [master(/bricks/s3):465:volinfo_query] _GMaster: primary master with volume id 00261f58-a7d5-4c9f-a6a7-af41b70d92a3 ...
[2013-07-24 18:45:50.191620] I [master(/bricks/s6):780:fallback_xsync] _GMaster: falling back to xsync mode
[2013-07-24 18:45:50.194266] I [syncdutils(/bricks/s6):158:finalize] <top>: exiting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

As the log shows, the session is crawling on two bricks, /bricks/s3 and /bricks/s6. /bricks/s6 was removed, so the worker failed on it and fell back to xsync.
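
For illustration, here is a minimal Python sketch (not the actual gsyncd code) of the fallback pattern seen in the log above; pick_change_detector and register_changelog are hypothetical names standing in for the worker's changelog registration against a brick:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
def pick_change_detector(brick_path, register_changelog):
    """Return 'changelog' if registration succeeds, else fall back to 'xsync'."""
    try:
        # Hypothetical callable: register a changelog consumer for this brick.
        register_changelog(brick_path)
        return "changelog"
    except OSError:
        # The connection to the brick could not be established (for example,
        # the brick was just removed by remove-brick commit); the worker
        # silently falls back to the slower xsync crawl instead of stopping.
        return "xsync"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>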


Version-Release number of selected component (if applicable): 3.4.0.12rhs.beta6-1.el6rhs.x86_64


How reproducible: Didn't try to reproduce again.


Steps to Reproduce:
1. Create and start a geo-rep session between the master (volume configured as shown in the additional info) and the slave.
2. Create some data on the master and let it sync.
3. Run remove-brick so that the master volume ends up as in the second configuration in the additional info (see the sketch after this list).
4. Check the geo-rep log file on the machine from which the bricks were removed.
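
The remove-brick part of these steps as a runnable sketch, assuming hypothetical names (mastervol, slavehost::slavevol, and the machine2/machine3 bricks from the additional info) and the usual create/start and remove-brick start/commit CLI flow:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
import subprocess

def gluster(*args):
    # Run one gluster CLI command, echoing it and failing on a non-zero exit.
    cmd = ["gluster", "volume"] + list(args)
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 1: create and start the geo-rep session (slavehost::slavevol is hypothetical).
gluster("geo-replication", "mastervol", "slavehost::slavevol", "create", "push-pem")
gluster("geo-replication", "mastervol", "slavehost::slavevol", "start")

# Step 3: shrink the 3 x 2 volume to 2 x 2 by removing the pair on machine2/machine3.
gluster("remove-brick", "mastervol", "replica", "2",
        "machine2:/bricks/s5", "machine3:/bricks/s6", "start")
# ...wait until remove-brick status reports completed, then:
gluster("remove-brick", "mastervol", "replica", "2",
        "machine2:/bricks/s5", "machine3:/bricks/s6", "commit")
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>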

Actual results: The geo-rep change_detector falls back to xsync.


Expected results: It should not fall back to xsync when everything is working fine.


Additional info:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
volume info before remove-brick 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Volume Name: mastervol
Type: Distributed-Replicate
Volume ID: 00261f58-a7d5-4c9f-a6a7-af41b70d92a3
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: machine1:/bricks/s1
Brick2: machine2:/bricks/s2
Brick3: machine3:/bricks/s3
Brick4: machine4:/bricks/s4
Brick5: machine2:/bricks/s5
Brick6: machine3:/bricks/s6

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
volume info after the remove-brick
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Volume Name: mastervol
Type: Distributed-Replicate
Volume ID: 00261f58-a7d5-4c9f-a6a7-af41b70d92a3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: machine1:/bricks/s1
Brick2: machine2:/bricks/s2
Brick3: machine3:/bricks/s3
Brick4: machine4:/bricks/s4
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 2 Vijaykumar Koppad 2013-11-20 07:10:08 UTC
This still happens in the build glusterfs-3.4.0.44rhs-1.

Comment 3 Vijaykumar Koppad 2013-11-20 07:12:00 UTC
Also, all the passive gsyncd workers crashed with the following traceback:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-11-20 12:35:35.899013] I [master(/bricks/brick6):426:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-11-20 12:35:35.905201] E [syncdutils(/bricks/brick6):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1156, in service_loop
    g1.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 457, in crawlwrap
    self.slave.server.set_stime(self.FLAT_DIR_HIERARCHY, self.uuid, cluster_stime)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1143, in <lambda>
    slave.server.set_stime = types.MethodType(lambda _self, path, uuid, mark: brickserver.set_stime(path, uuid + '.' + gconf.slave_id, mark), slave.server)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 299, in ff
    return f(*a)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 484, in set_stime
    Xattr.lsetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'stime']), struct.pack('!II', *mark))
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 66, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
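
For illustration, a minimal sketch of the kind of guard that would avoid this crash, assuming a hypothetical lsetxattr wrapper with the same shape as libcxattr.Xattr.lsetxattr (this is not the actual fix): an ENOENT from a path that has just disappeared is tolerated instead of being raised as a fatal error.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
import errno

def set_stime_safely(lsetxattr, path, key, value):
    """Set the stime xattr, tolerating a path that no longer exists."""
    try:
        lsetxattr(path, key, value)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
        # The directory is gone (for example the brick was removed by
        # remove-brick commit); skip the stime update instead of crashing
        # the passive worker.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>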

Comment 4 Vijaykumar Koppad 2014-02-11 14:04:13 UTC
This still happens in the build glusterfs-3.4.0.59rhs-1, although the gsyncd crash did not occur.

Comment 12 Rahul Hinduja 2015-07-06 17:11:21 UTC
Verified with the build: glusterfs-3.7.1-7.el6rhs.x86_64

With the new steps mentioned in comment 8, the geo-rep session needs to be stopped before the remove-brick commit.

After the commit, restarting geo-rep correctly goes through History and then Changelog, as sketched below.
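
A short sketch of the verified procedure, assuming the same hypothetical names as in the reproduction steps and the standard geo-replication stop/start commands:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
import subprocess

SESSION = ["gluster", "volume", "geo-replication", "mastervol", "slavehost::slavevol"]

# Stop the session, commit the remove-brick, then start the session again.
subprocess.run(SESSION + ["stop"], check=True)
subprocess.run(["gluster", "volume", "remove-brick", "mastervol", "replica", "2",
                "machine2:/bricks/s5", "machine3:/bricks/s6", "commit"], check=True)
subprocess.run(SESSION + ["start"], check=True)
# After the restart the workers go through History and then Changelog,
# as described above, instead of falling back to xsync.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>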

Moving this bug to verified state.

Comment 15 errata-xmlrpc 2015-07-29 04:28:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

