Bug 983927 - Dist-geo-rep : After bringing up the killed first replica pair of the sub-volume, it uses xsync instead of changelog.
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
x86_64 Linux
Priority: high Severity: high
Assigned To: Bug Updates Notification Mailing List
Vijaykumar Koppad
Keywords: FutureFeature
Depends On: 980808
Reported: 2013-07-12 05:56 EDT by Vijaykumar Koppad
Modified: 2014-08-24 20:50 EDT (History)
6 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-09-25 03:40:26 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Vijaykumar Koppad 2013-07-12 05:56:51 EDT
Description of problem: After the killed brick process of the first replica pair of the sub-volume is brought back up, gsyncd on that node starts using xsync, because it had lost its connection to the changelog and fails to re-establish it. If it keeps using only xsync, it inherits xsync's limitations, such as the inability to propagate deletes.

Version-Release number of selected component (if applicable):glusterfs-

How reproducible: Always

Steps to Reproduce:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave.
2. Kill the brick process of the first replica pair.
3. Create some data (around 10K files distributed in multiple directories) and delete it in a loop.
4. After 1 or 2 iterations, start the brick process.
5. gsyncd on that node now starts using xsync for every crawl.
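The steps above can be sketched as a shell session. All names here (master-vol, slave-host, the /mnt/master mount point, the awk pattern for picking the first brick's PID) are placeholders for illustration, not taken from the report:

```shell
# 1. Start a geo-rep session between master and slave (names are placeholders).
gluster volume geo-replication master-vol slave-host::slave-vol start

# 2. Kill the brick process of the first replica pair on its node
#    (the Pid column of "volume status" identifies the brick process).
brick_pid=$(gluster volume status master-vol | awk '/brick1/ {print $NF}')
kill -KILL "$brick_pid"

# 3. Create ~10K files on the master mount and delete them in a loop.
for i in 1 2; do
    mkdir -p /mnt/master/run$i && touch /mnt/master/run$i/f{1..10000}
    rm -rf /mnt/master/run$i
done

# 4. Restart the killed brick ("start force" respawns dead brick processes).
gluster volume start master-vol force

# 5. Check which change detector the gsyncd worker on that node is using.
grep "change detection" /var/log/glusterfs/geo-replication/master-vol/*.log
```

This fragment needs a running GlusterFS cluster with a configured geo-rep slave, so it is a reproduction outline rather than a standalone script.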

Actual results: The brought-up brick starts using xsync for syncing.

Expected results: Only the initial crawl should go through xsync; after that, gsyncd should connect to the changelog and use it for further syncing.

Additional info:
Comment 2 Vijaykumar Koppad 2013-07-17 09:32:23 EDT
Simpler steps:

1. Create and start a geo-rep relationship between the master (dist-rep) and the slave.
2. Start creating data on the master.
3. Kill the brick process of the first replica pair while creation is in progress.
4. Bring the brick process back up.
5. Check the change_detector on that node.

From changes.log:

[2013-07-17 12:42:47.697343] I [gf-changelog.c:164:gf_changelog_notification_init] 0-glusterfs: connecting to changelog socket: /var/run/gluster/changelog-cfdffea3581f40685f18a34384edc263.sock (brick: /bricks/brick3)
[2013-07-17 12:42:47.697369] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 1/5...
[2013-07-17 12:42:49.697521] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 2/5...
[2013-07-17 12:42:51.697824] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 3/5...
[2013-07-17 12:42:53.698108] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 4/5...
[2013-07-17 12:42:55.698357] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 5/5...
[2013-07-17 12:42:57.698655] E [gf-changelog.c:189:gf_changelog_notification_init] 0-glusterfs: could not connect to changelog socket! bailing out...
[2013-07-17 12:42:57.698956] D [gf-changelog-process.c:584:gf_changelog_process] 0-glusterfs: byebye (1) from processing thread...

From the geo-replication log:

[2013-07-17 18:25:19.405018] I [monitor(monitor):129:monitor] Monitor: ----------------------------------------------
[2013-07-17 18:25:19.405311] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-07-17 18:25:19.648494] I [gsyncd(/bricks/brick3):500:main_i] <top>: syncing: gluster://localhost:master -> ssh:
[2013-07-17 18:25:22.731255] I [master(/bricks/brick3):61:gmaster_builder] <top>: setting up xsync change detection m
[2013-07-17 18:25:22.735107] I [master(/bricks/brick3):61:gmaster_builder] <top>: setting up xsync change detection m
[2013-07-17 18:25:22.737579] I [master(/bricks/brick3):961:register] _GMaster: xsync temp directory: /var/run/gluster
[2013-07-17 18:25:22.738049] I [master(/bricks/brick3):961:register] _GMaster: xsync temp directory: /var/run/gluster
[2013-07-17 18:25:22.740072] I [master(/bricks/brick3):500:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-07-17 18:25:22.854288] I [master(/bricks/brick3):465:volinfo_query] _GMaster: new master is f8d12e80-2fc9-4c2c-
Comment 3 Venky Shankar 2013-07-21 03:13:46 EDT
This happens because the brick was killed, which made gsyncd switch to the hybrid crawl mode (xsync). There is currently no logic to re-establish the connection to the changelog socket while a crawl is ongoing.
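The "connection attempt 1/5" lines in changes.log come from a bounded retry at registration time; once those five attempts fail, the worker stays on xsync for good. A minimal sketch of the kind of behavior the comment describes as missing, i.e. retrying the changelog socket between crawls and switching detectors once it succeeds. All names here (try_connect, pick_change_detector, the crawl-loop shape) are hypothetical and are not the actual gsyncd code:

```python
import time


def try_connect(connect, attempts=5, delay=0):
    """Bounded retry, mirroring the 5 attempts seen in changes.log."""
    for _ in range(attempts):
        if connect():
            return True
        if delay:
            time.sleep(delay)
    return False


def pick_change_detector(connect, crawls=4):
    """Hypothetical crawl loop: keep retrying the changelog socket
    between xsync crawls instead of giving up after the first failure."""
    detector = "xsync"
    history = []
    for _ in range(crawls):
        if detector == "xsync" and try_connect(connect):
            detector = "changelog"  # switch once the socket is back
        history.append(detector)
    return history


# Simulate a socket that only becomes connectable during the 3rd crawl
# (each of the first two crawls burns 5 failed attempts).
state = {"calls": 0}

def connect():
    state["calls"] += 1
    return state["calls"] > 10

print(pick_change_detector(connect))  # ['xsync', 'xsync', 'changelog', 'changelog']
```

The point of the sketch is the loop structure: the retry budget resets on every crawl, so a brick that comes back later still gets picked up, instead of the one-shot registration the logs show.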

If it had been the whole node that went down and came back up, geo-replication would use the changelog.

Marking this as FutureFeature.
Comment 4 Amar Tumballi 2013-09-11 10:26:11 EDT
Bug 980808 is the same as this one, but the symptoms are different.
Comment 5 Vivek Agarwal 2013-09-25 03:40:26 EDT
Per discussion with Amar/Venky, this is working as designed.
