Description of problem:
After the killed brick process of the first replica pair of the sub-volume is brought back up, gsyncd on that node starts using xsync, because it lost its connection to the changelog when the brick was killed and fails to re-establish it. If it keeps using only xsync, it is subject to all of xsync's limitations, such as being unable to propagate deletes.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.12rhs.beta3-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave.
2. Kill the brick process of the first replica pair.
3. Create some data (around 10K files distributed across multiple directories) and delete it in a loop.
4. After 1 or 2 iterations, start the brick process again.
5. gsyncd on that node now uses xsync for every crawl.

Actual results:
The brick that was brought back up keeps using xsync for syncing.

Expected results:
Only the initial crawl should go through xsync; after that, gsyncd should reconnect to the changelog and use it for further syncing.

Additional info:
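For context on why the xsync fallback loses deletes: a filesystem-walk crawler can only enumerate entries that still exist on the master, so a file removed between crawls never produces any record to replay on the slave. The sketch below is purely illustrative Python (not gsyncd code; all names are hypothetical):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import os

def xsync_style_crawl(master_root):
    """Yield every entry currently present under master_root.

    A crawl of this kind sees only what still exists; a file deleted
    between two crawls simply never shows up, so no UNLINK is ever
    propagated to the slave.  A journal-based detector (changelog)
    records each operation as it happens, e.g.:
        CREATE /bricks/brick3/f1
        UNLINK /bricks/brick3/f1   <- xsync has no equivalent record
    """
    for dirpath, dirnames, filenames in os.walk(master_root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~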
Simple steps:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave.
2. Start creating data on the master.
3. Kill the brick process of the first replica pair while the creation is in progress.
4. Bring the brick process back up.
5. Check the change_detector on that node.

Logs from changes.log:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2013-07-17 12:42:47.697343] I [gf-changelog.c:164:gf_changelog_notification_init] 0-glusterfs: connecting to changelog socket: /var/run/gluster/changelog-cfdffea3581f40685f18a34384edc263.sock (brick: /bricks/brick3)
[2013-07-17 12:42:47.697369] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 1/5...
[2013-07-17 12:42:49.697521] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 2/5...
[2013-07-17 12:42:51.697824] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 3/5...
[2013-07-17 12:42:53.698108] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 4/5...
[2013-07-17 12:42:55.698357] W [gf-changelog.c:174:gf_changelog_notification_init] 0-glusterfs: connection attempt 5/5...
[2013-07-17 12:42:57.698655] E [gf-changelog.c:189:gf_changelog_notification_init] 0-glusterfs: could not connect to changelog socket! bailing out...
[2013-07-17 12:42:57.698956] D [gf-changelog-process.c:584:gf_changelog_process] 0-glusterfs: byebye (1) from processing thread...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Logs from the geo-rep log:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2013-07-17 18:25:19.405018] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-07-17 18:25:19.405311] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-07-17 18:25:19.648494] I [gsyncd(/bricks/brick3):500:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@10.70.43.141:gluster://localhost:slave
[2013-07-17 18:25:22.731255] I [master(/bricks/brick3):61:gmaster_builder] <top>: setting up xsync change detection mode
[2013-07-17 18:25:22.735107] I [master(/bricks/brick3):61:gmaster_builder] <top>: setting up xsync change detection mode
[2013-07-17 18:25:22.737579] I [master(/bricks/brick3):961:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.141%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync
[2013-07-17 18:25:22.738049] I [master(/bricks/brick3):961:register] _GMaster: xsync temp directory: /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.141%3Agluster%3A%2F%2F127.0.0.1%3Aslave/cfdffea3581f40685f18a34384edc263/xsync
[2013-07-17 18:25:22.740072] I [master(/bricks/brick3):500:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-07-17 18:25:22.854288] I [master(/bricks/brick3):465:volinfo_query] _GMaster: new master is f8d12e80-2fc9-4c2c-a5da-a5ad4e800e87
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
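The changes.log excerpt above shows gf_changelog_notification_init making 5 connection attempts roughly 2 seconds apart, then bailing out for good. The actual implementation is C in gf-changelog.c; below is a minimal Python sketch of that bounded-retry pattern, with hypothetical names and paths, just to make the behavior concrete:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import socket
import time

def connect_changelog_socket(path, attempts=5, interval=2.0):
    """Try to connect to the changelog UNIX socket a fixed number of
    times; return the socket on success, or None after the final
    failure.  Mirrors the 1/5 ... 5/5 attempts seen in changes.log."""
    for attempt in range(1, attempts + 1):
        print("connection attempt %d/%d..." % (attempt, attempts))
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            s.connect(path)
            return s
        except OSError:
            s.close()
            time.sleep(interval)
    # No further retries happen after this point, which is why the
    # worker stays in xsync mode for every subsequent crawl.
    print("could not connect to changelog socket! bailing out...")
    return None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~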
This happens because the brick was killed, which made gsyncd switch to the hybrid crawl mode (xsync). There is currently no logic to re-establish the connection with the changelog socket while a crawl is ongoing. If the whole node had gone down and come back up, geo-replication would use the changelog. Marking this as FutureFeature.
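For illustration only, the missing logic described above could look roughly like the following: retry changelog registration on each crawl cycle instead of only at worker startup, and switch the change_detector once registration succeeds. This is hypothetical sketch-level Python, not gsyncd's actual structure:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import time

class Worker:
    def __init__(self):
        # Fallback mode the worker lands in after the brick was killed.
        self.change_detector = "xsync"

    def try_register_changelog(self):
        # Placeholder: in gsyncd this would retry the UNIX-socket
        # connection shown in the changes.log excerpt above.
        return False

    def crawl_once(self):
        print("crawling via %s" % self.change_detector)

def crawl_loop(worker, interval=60):
    """Crawl forever; while stuck in xsync mode, retry changelog
    registration each cycle and switch detectors on success."""
    while True:
        if worker.change_detector == "xsync" and worker.try_register_changelog():
            worker.change_detector = "changelog"
        worker.crawl_once()
        time.sleep(interval)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~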
Bug 980808 is the same issue as this one, though the symptoms are different.
Per discussion with Amar/Venky, this is working as designed.