Bug 983927 - Dist-geo-rep : After bringing up killed first replica pair of the sub-volume, it uses xsync instead of changelog.
Summary: Dist-geo-rep : After bringing up killed first replica pair of the sub-volume, it uses xsync instead of changelog.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: Vijaykumar Koppad
URL:
Whiteboard:
Depends On: 980808
Blocks:
 
Reported: 2013-07-12 09:56 UTC by Vijaykumar Koppad
Modified: 2014-08-25 00:50 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-25 07:40:26 UTC
Embargoed:


Attachments:

Description Vijaykumar Koppad 2013-07-12 09:56:51 UTC
Description of problem: After bringing up the brick process of the first replica pair of the sub-volume that was killed, gsyncd on that node starts using xsync, because it had lost its connection with changelog and fails to re-establish it. If it keeps using only xsync, it inherits xsync's limitations, such as the inability to propagate deletes.
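
The fallback can be pictured with a minimal Python sketch. This is not the actual gsyncd/libgfchangelog code: the socket path and the 5-attempt retry mirror the logs in comment 2, while the function names and intervals are assumed purely for illustration. The point is that once the brick (and so its changelog socket) is down, the worker ends up in xsync mode and stays there.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Minimal sketch, not the actual gsyncd/libgfchangelog code: probe the brick's
# changelog socket a few times and fall back to xsync if it never becomes
# reachable. Socket path and retry count are taken from the logs in comment 2;
# names and intervals are made up for illustration.
import socket
import time

CHANGELOG_SOCK = "/var/run/gluster/changelog-cfdffea3581f40685f18a34384edc263.sock"

def connect_changelog(retries=5, interval=2):
    """Try to attach to the brick's changelog socket; return it, or None on failure."""
    for attempt in range(1, retries + 1):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(CHANGELOG_SOCK)
            return sock
        except OSError:
            sock.close()
            print("connection attempt %d/%d..." % (attempt, retries))
            time.sleep(interval)
    return None

def pick_change_detector():
    # With the brick (and hence its changelog socket) down, this returns "xsync",
    # which is the mode the worker then stays in; that is what this bug reports.
    return "changelog" if connect_changelog() else "xsync"

if __name__ == "__main__":
    print("change_detector =", pick_change_detector())
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~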

Version-Release number of selected component (if applicable): glusterfs-3.4.0.12rhs.beta3-1.el6rhs.x86_64


How reproducible: Always


Steps to Reproduce:
1. Create and start a geo-rep relationship between a master (dist-rep) volume and a slave.
2. Kill the brick process of the first replica pair.
3. Create some data (around 10K files distributed across multiple directories) and delete it, in a loop (a rough sketch of this churn follows the steps below).
4. After 1 or 2 iterations, start the brick process again.
5. Observe that gsyncd on that node now uses xsync for syncing on every crawl.
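
A rough Python sketch of the data churn in step 3; the mount point is an assumption, so use wherever the master volume is actually FUSE-mounted:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create roughly 10K small files spread across a few directories on the master
# mount, then delete them again, and repeat. The deletes are the interesting
# part, since xsync cannot propagate them. MASTER_MNT is an assumed mount point.
import os
import shutil

MASTER_MNT = "/mnt/master"   # assumption: FUSE mount of the master volume
FILES = 10000
DIRS = 10
ITERATIONS = 2

for it in range(ITERATIONS):
    top = os.path.join(MASTER_MNT, "churn-%d" % it)
    for d in range(DIRS):
        os.makedirs(os.path.join(top, "dir%d" % d), exist_ok=True)
    for i in range(FILES):
        path = os.path.join(top, "dir%d" % (i % DIRS), "file%d" % i)
        with open(path, "w") as f:
            f.write("some data\n")
    shutil.rmtree(top)   # delete everything that was just created
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~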


Actual results: the brick that was brought back up starts using xsync for syncing.

 
Expected results: Only the initial crawl should go through xsync; after that, gsyncd should establish the connection with changelog and use changelog for further syncing.


Additional info:

Comment 2 Vijaykumar Koppad 2013-07-17 13:32:23 UTC
Simpler steps:

1. Create and start a geo-rep relationship between a master (dist-rep) volume and a slave.
2. Start creating data on the master.
3. Kill the brick process of the first replica pair while the creation is happening.
4. Bring the brick process back up.
5. Check the change_detector on that node (one way to do this is sketched below).
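
One way to do the check in step 5 is to scan the worker's geo-rep log for the "setting up ... change detection mode" lines, as in the small Python sketch below. The default log path is an assumption; pass the actual gsyncd log file on that node. The log excerpts that follow show the same messages.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Print the "setting up <mode> change detection mode" lines from a gsyncd log.
# After the brick comes back, a healthy worker would eventually report
# "changelog" here; in this bug it keeps reporting "xsync".
import re
import sys

# Assumed default path; the real file lives under
# /var/log/glusterfs/geo-replication/ on the master node.
LOG = sys.argv[1] if len(sys.argv) > 1 else "/var/log/glusterfs/geo-replication/gsyncd.log"

pattern = re.compile(r"setting up (\w+) change detection mode")
with open(LOG) as f:
    for line in f:
        if pattern.search(line):
            print(line.rstrip())
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~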

Logs from changes.log

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2013-07-17 12:42:47.697343] I [gf-changelog.c:164:gf_changelog_notification_init] 0-glusterfs: connecting to changelog socket: /var/run/gluster/changelog-cfdffea3581f40685f18a34384edc263.sock (brick: /bricks/brick3)
[2013-07-17 12:42:47.697369] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 1/5...
[2013-07-17 12:42:49.697521] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 2/5...
[2013-07-17 12:42:51.697824] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 3/5...
[2013-07-17 12:42:53.698108] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 4/5...
[2013-07-17 12:42:55.698357] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 5/5...
[2013-07-17 12:42:57.698655] E [gf-changelog.c:189:gf_changelog_notification_init] 0-glusterfs: could not connect to changelog socket! bailing out...
[2013-07-17 12:42:57.698956] D [gf-changelog-process.c:584:gf_changelog_process] 0-glusterfs: byebye (1) from processing thread...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Logs from geo-rep logs, 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2013-07-17 18:25:19.405018] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-07-17 18:25:19.405311] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-07-17 18:25:19.648494] I [gsyncd(/bricks/brick3):500:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@10.70.43.141:gluster://localhost:slave
[2013-07-17 18:25:22.731255] I [master(/bricks/brick3):61:gmaster_builder] <top>: setting up xsync change detection mode
[2013-07-17 18:25:22.735107] I [master(/bricks/brick3):61:gmaster_builder] <top>: setting up xsync change detection mode
[2013-07-17 18:25:22.737579] I [master(/bricks/brick3):961:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.141%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync
[2013-07-17 18:25:22.738049] I [master(/bricks/brick3):961:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.141%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync
[2013-07-17 18:25:22.740072] I [master(/bricks/brick3):500:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-07-17 18:25:22.854288] I [master(/bricks/brick3):465:volinfo_query] _GMaster: new master is f8d12e80-2fc9-4c2c-a5da-a5ad4e800e87
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Comment 3 Venky Shankar 2013-07-21 07:13:46 UTC
This is because the brick was killed, which caused gsyncd to switch to the hybrid crawl mode (xsync). There is no logic as of now to try to re-establish the connection with the changelog socket while a crawl is ongoing.

If it were the node that went down and came back up, geo-replication would use changelog.
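
For context, a hypothetical sketch of the logic that is missing today: probe the changelog socket between hybrid crawls and switch back once the brick is reachable again. This is not gsyncd code; all names are made up for illustration.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Hypothetical reconnect loop (illustration only, not the gsyncd implementation):
# while in xsync mode, probe the brick's changelog socket before each crawl and
# switch to changelog-based crawling once the socket is reachable again.
import socket
import time

def changelog_available(sock_path):
    """Probe the changelog socket once; True if it accepts a connection."""
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(sock_path)
        s.close()
        return True
    except OSError:
        return False

def crawl_loop(sock_path, xsync_crawl, changelog_crawl, interval=60):
    """Run crawls forever, upgrading from xsync to changelog when possible."""
    mode = "xsync"
    while True:
        if mode == "xsync" and changelog_available(sock_path):
            mode = "changelog"   # re-register with changelog and stop falling back
        (changelog_crawl if mode == "changelog" else xsync_crawl)()
        time.sleep(interval)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~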

Marking this as FutureFeature.

Comment 4 Amar Tumballi 2013-09-11 14:26:11 UTC
Bug 980808 is the same issue as this one, but the symptoms are different.

Comment 5 Vivek Agarwal 2013-09-25 07:40:26 UTC
Per discussion with Amar/Venky, this is working as per design.

