Description of problem: Geo-rep fails to sync a few hardlinks to one of the slaves when there are many slaves for the same master. There are log entries about those failures in the geo-rep logs on the master. The missing files had entries in the changelog, and those entries were processed by geo-rep. In the slave-side gluster logs of geo-rep, the missing files had entries like:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-06 13:20:08.380684] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 48090: MKNOD() <gfid:c4c692f1-207c-4cb8-a23b-15127fc2be2b>/5200f7ef~~99F2FREJYJ => -1 (No such file or directory)
[2013-08-06 13:20:08.381286] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3779275654
[2013-08-06 13:20:08.383461] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3779275654
[2013-08-06 13:20:08.383558] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 48092: MKNOD() <gfid:c4c692f1-207c-4cb8-a23b-15127fc2be2b>/5200f7ef~~0HSNPNXR1G => -1 (No such file or directory)
[2013-08-06 13:20:08.387635] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3118733508
[2013-08-06 13:20:08.389595] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3118733508
[2013-08-06 13:20:08.389641] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 48096: MKNOD() <gfid:c4c692f1-207c-4cb8-a23b-15127fc2be2b>/5200f7ef~~60K9ASYI7R => -1 (No such file or directory)
[2013-08-06 13:20:08.394278] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3092897598
[2013-08-06 13:20:08.394324] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 48099: MKNOD() <gfid:c4c692f1-207c-4cb8-a23b-15127fc2be2b>/5200f7ef~~HPI1Z2CL8Q => -1 (No such file or directory)
[2013-08-06 13:20:08.394576] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3126779288
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The missed files have "No such file or directory" log entries, which might be a problem with glusterfs or with geo-rep; I am not sure.

Version-Release number of selected component (if applicable): glusterfs-3.4.0.15rhs-1.el6rhs.x86_64

How reproducible: Haven't tried to reproduce it yet.

Steps to Reproduce:
1. Create and start a geo-rep relationship between the master and multiple slaves
2. Create some 5k files on the master and let them sync to all the slaves
3. Create hardlinks to all the files on the master
4. Check whether all the hardlinks were created on all the slaves

Actual results: A few hardlinks were not synced to a slave.

Expected results: It shouldn't miss syncing any files to the slaves.

Additional info:
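Step 4 of the reproduction can be automated with a small sketch. This assumes (hypothetically) that the master volume and one slave volume are both mounted locally, e.g. at /mnt/master and /mnt/slave1; a hardlink is just another directory entry, so a missed hardlink shows up as a path present on the master but absent on the slave:

```python
import os

def missing_on_slave(master_root, slave_root):
    """List relative paths that exist under master_root but not under
    slave_root. Both arguments are hypothetical local mount points of
    the master volume and of one slave volume."""
    missing = []
    for dirpath, _dirs, files in os.walk(master_root):
        rel = os.path.relpath(dirpath, master_root)
        for name in files:
            relpath = os.path.normpath(os.path.join(rel, name))
            if not os.path.lexists(os.path.join(slave_root, relpath)):
                missing.append(relpath)
    return sorted(missing)
```

Running this for every slave mount after the sync has settled gives an empty list when nothing was missed.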
I am hitting this kind of failure to sync a few files to the slave quite often. It is not limited to hardlinks; it can happen for any operation that consists only of an entry operation, i.e. one with no data operation after it, for example creating a symlink to a file or just touching a file. According to the developer, this is not strictly a geo-rep problem; it is related to fuse, which fails to return an error when the entry operation fails. This is potential data loss on the slave, and AFAIK it can't be recovered in any case.
I was able to hit it again, in a cascaded-fanout setup. The setup is: for one master there are 4 imasters, and for each imaster, 4 slaves; 21 volumes in total. In this setup, some of the files did not sync to the level-2 slaves, and the corresponding imaster kept retrying to sync those files. The corresponding logs look like this:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-22 14:04:28.244749] W [master(/bricks/brick1):745:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/imaster2/ssh%3A%2F%2Froot%4010.70.43.74%3Agluster%3A%2F%2F127.0.0.1%3Aslave8/bd42ad17ef8864d51407b1c6478f5dc6/.processing/CHANGELOG.1377156494
[2013-08-22 14:04:28.972485] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/0335c1dc-1a9d-4136-a385-41bae6df7e49 [errcode: 23]
[2013-08-22 14:04:28.973099] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/07b82513-f019-4bf0-a670-f98a975b6c0a [errcode: 23]
[2013-08-22 14:04:28.975903] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/00ae2199-ecf0-4014-af9b-7fe0cc977a77 [errcode: 23]
[2013-08-22 14:04:28.983009] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/05d96416-0d9c-47b4-aaa6-59aa7e2f108e [errcode: 23]
[2013-08-22 14:04:28.984671] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/018370fe-865c-40b4-bbff-c57c89d44829 [errcode: 23]
[2013-08-22 14:04:28.984946] W [master(/bricks/brick1):745:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/imaster2/ssh%3A%2F%2Froot%4010.70.43.74%3Agluster%3A%2F%2F127.0.0.1%3Aslave8/bd42ad17ef8864d51407b1c6478f5dc6/.processing/CHANGELOG.1377156494
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Consider the file with gfid "018370fe-865c-40b4-bbff-c57c89d44829".
If we check it in the .processing directory in the working dir of geo-rep, the output looks like:

[root@Chase .processing]# grep 018370fe-865c-40b4-bbff-c57c89d44829 *
CHANGELOG.1377156494:D 018370fe-865c-40b4-bbff-c57c89d44829
CHANGELOG.1377156494:M 018370fe-865c-40b4-bbff-c57c89d44829
[root@Chase .processing]#

It has only D and M entries in the changelogs in .processing. If we check it in the .processed directory in the working dir of geo-rep, the output looks like:

[root@Chase .processed]# grep 018370fe-865c-40b4-bbff-c57c89d44829 *
CHANGELOG.1377156474:E 018370fe-865c-40b4-bbff-c57c89d44829 MKNOD 0a92bc5c-8bbc-47b2-b47b-fffcf832eec7%2F5215bbb9~~ZITVTH5YV6
CHANGELOG.1377156474:M 018370fe-865c-40b4-bbff-c57c89d44829

So it has an E entry in .processed, which means the entry operation has been processed. But if we check the slave-side logs, we have entries like:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-22 08:37:25.999381] W [fuse-bridge.c:2398:fuse_create_cbk] 0-glusterfs-fuse: 218647: <gfid:00000000-0000-0000-0000-00000000000d>/018370fe-865c-40b4-bbff-c57c89d44829 => -1 (No such file or directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

which means the entry operation failed on the slave, and that failure was not captured anywhere.
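The grep-based check above can be expressed as a small sketch that classifies the record types for one gfid and flags exactly this situation: the E (entry) record already in .processed while only D/M records remain in .processing. The dict-of-file-contents input and the helper names are hypothetical simplifications, not geo-rep's own API:

```python
def record_types(changelogs, gfid):
    """Map changelog name -> list of record types (E, D, M) that mention
    the given gfid. `changelogs` maps a changelog filename to its text,
    standing in for the grep over .processing / .processed shown above."""
    hits = {}
    for name, text in changelogs.items():
        types = [line.split()[0]
                 for line in text.splitlines()
                 if gfid in line]
        if types:
            hits[name] = types
    return hits

def entry_done_but_retrying(processed, processing, gfid):
    """True when the E record for gfid sits in .processed while only
    D/M records remain in .processing -- the situation in this bug."""
    done = record_types(processed, gfid)
    pending = record_types(processing, gfid)
    entry_done = any("E" in types for types in done.values())
    only_data_meta = bool(pending) and all(
        set(types) <= {"D", "M"} for types in pending.values())
    return entry_done and only_data_meta
```

On the data from this bug, the second function returns True for gfid 018370fe-865c-40b4-bbff-c57c89d44829, matching the diagnosis: the entry operation was marked processed even though it failed on the slave, so the remaining D/M records can never succeed.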
https://code.engineering.redhat.com/gerrit/#/c/11999
https://code.engineering.redhat.com/gerrit/#/c/12028
Hold off testing until the next build; one more patch is needed to fix this completely. https://code.engineering.redhat.com/gerrit/#/c/12088
Verified on the build glusterfs-3.4.0.30rhs-2.el6rhs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html