Description of problem: Geo-rep fails to sync a few hardlinks to one of the slaves when there are many slaves for the same master. There are log entries about those failures in the geo-rep logs on the master. The missing files had entries in the changelog, and those entries were processed by geo-rep. In the slave-side gluster logs of geo-rep, the missing files had entries like:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-06 13:20:08.380684] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 48090: MKNOD() <gfid:c4c692f1-207c-4cb8-a23b-15127fc2be2b>/5200f7ef~~99F2FREJYJ => -1 (No such file or directory)
[2013-08-06 13:20:08.381286] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3779275654
[2013-08-06 13:20:08.383461] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3779275654
[2013-08-06 13:20:08.383558] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 48092: MKNOD() <gfid:c4c692f1-207c-4cb8-a23b-15127fc2be2b>/5200f7ef~~0HSNPNXR1G => -1 (No such file or directory)
[2013-08-06 13:20:08.387635] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3118733508
[2013-08-06 13:20:08.389595] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3118733508
[2013-08-06 13:20:08.389641] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 48096: MKNOD() <gfid:c4c692f1-207c-4cb8-a23b-15127fc2be2b>/5200f7ef~~60K9ASYI7R => -1 (No such file or directory)
[2013-08-06 13:20:08.394278] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3092897598
[2013-08-06 13:20:08.394324] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 48099: MKNOD() <gfid:c4c692f1-207c-4cb8-a23b-15127fc2be2b>/5200f7ef~~HPI1Z2CL8Q => -1 (No such file or directory)
[2013-08-06 13:20:08.394576] W [dht-layout.c:179:dht_layout_search] 0-imaster-dht: no subvolume for hash (value) = 3126779288
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The missed files have "No such file or directory" log entries, which might be a problem with glusterfs or with geo-rep; I am not sure.

Version-Release number of selected component (if applicable): glusterfs-3.4.0.15rhs-1.el6rhs.x86_64

How reproducible: Haven't tried to reproduce it yet.

Steps to Reproduce:
1. Create and start a geo-rep relationship between the master and multiple slaves
2. Create some 5k files on the master and let them sync to all the slaves
3. Create hardlinks to all the files on the master
4. Check whether all the hardlinks were created on all the slaves

Actual results: A few hardlinks were not synced to a slave.

Expected results: It shouldn't miss syncing any files to the slaves.

Additional info:
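Step 4 of the reproduction can be automated with a small sketch. This assumes (hypothetically) that the master volume and one slave volume are both mounted locally, e.g. at /mnt/master and /mnt/slave1; a hardlink is just another directory entry, so a missed hardlink shows up as a path present on the master but absent on the slave:

```python
import os

def missing_on_slave(master_root, slave_root):
    """List relative paths that exist under master_root but not under
    slave_root. Both arguments are hypothetical local mount points of
    the master volume and of one slave volume."""
    missing = []
    for dirpath, _dirs, files in os.walk(master_root):
        rel = os.path.relpath(dirpath, master_root)
        for name in files:
            relpath = os.path.normpath(os.path.join(rel, name))
            if not os.path.lexists(os.path.join(slave_root, relpath)):
                missing.append(relpath)
    return sorted(missing)
```

Running this for every slave mount after the sync has settled gives an empty list when nothing was missed.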
I am hitting this kind of failure to sync a few files to the slave quite often. It is not limited to hardlinks; it can happen for any operation that consists only of an entry operation, i.e. one with no data operation after it, for example creating a symlink to a file or just touching a file. According to the developer, this is not strictly a geo-rep problem; it is related to fuse, which fails to return an error when the entry operation fails. This is potential data loss on the slave, and AFAIK it can't be recovered in any case.
I was able to hit it again, in a cascaded-fanout setup. The setup is: for one master there are 4 imasters, and for each imaster, 4 slaves; 21 volumes in total. In this setup, some of the files did not sync to the level-2 slaves, and the corresponding imaster kept retrying to sync those files. The corresponding logs look like this:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-22 14:04:28.244749] W [master(/bricks/brick1):745:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/imaster2/ssh%3A%2F%2Froot%4010.70.43.74%3Agluster%3A%2F%2F127.0.0.1%3Aslave8/bd42ad17ef8864d51407b1c6478f5dc6/.processing/CHANGELOG.1377156494
[2013-08-22 14:04:28.972485] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/0335c1dc-1a9d-4136-a385-41bae6df7e49 [errcode: 23]
[2013-08-22 14:04:28.973099] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/07b82513-f019-4bf0-a670-f98a975b6c0a [errcode: 23]
[2013-08-22 14:04:28.975903] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/00ae2199-ecf0-4014-af9b-7fe0cc977a77 [errcode: 23]
[2013-08-22 14:04:28.983009] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/05d96416-0d9c-47b4-aaa6-59aa7e2f108e [errcode: 23]
[2013-08-22 14:04:28.984671] W [master(/bricks/brick1):618:regjob] <top>: Rsync: .gfid/018370fe-865c-40b4-bbff-c57c89d44829 [errcode: 23]
[2013-08-22 14:04:28.984946] W [master(/bricks/brick1):745:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/imaster2/ssh%3A%2F%2Froot%4010.70.43.74%3Agluster%3A%2F%2F127.0.0.1%3Aslave8/bd42ad17ef8864d51407b1c6478f5dc6/.processing/CHANGELOG.1377156494
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Consider the file with gfid "018370fe-865c-40b4-bbff-c57c89d44829".
If we check it in the .processing directory in the working dir of geo-rep, the output looks like:

[root@Chase .processing]# grep 018370fe-865c-40b4-bbff-c57c89d44829 *
CHANGELOG.1377156494:D 018370fe-865c-40b4-bbff-c57c89d44829
CHANGELOG.1377156494:M 018370fe-865c-40b4-bbff-c57c89d44829
[root@Chase .processing]#

It has only D and M entries in the changelogs in .processing. If we check it in the .processed directory in the working dir of geo-rep, the output looks like:

[root@Chase .processed]# grep 018370fe-865c-40b4-bbff-c57c89d44829 *
CHANGELOG.1377156474:E 018370fe-865c-40b4-bbff-c57c89d44829 MKNOD 0a92bc5c-8bbc-47b2-b47b-fffcf832eec7%2F5215bbb9~~ZITVTH5YV6
CHANGELOG.1377156474:M 018370fe-865c-40b4-bbff-c57c89d44829

So it has an E entry in .processed, which means the entry operation has been processed. But if we check the slave-side logs, we have entries like:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-22 08:37:25.999381] W [fuse-bridge.c:2398:fuse_create_cbk] 0-glusterfs-fuse: 218647: <gfid:00000000-0000-0000-0000-00000000000d>/018370fe-865c-40b4-bbff-c57c89d44829 => -1 (No such file or directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

which means the entry operation failed on the slave, and that failure was not captured anywhere.
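The grep-based check above can be expressed as a small sketch that classifies the record types for one gfid and flags exactly this situation: the E (entry) record already in .processed while only D/M records remain in .processing. The dict-of-file-contents input and the helper names are hypothetical simplifications, not geo-rep's own API:

```python
def record_types(changelogs, gfid):
    """Map changelog name -> list of record types (E, D, M) that mention
    the given gfid. `changelogs` maps a changelog filename to its text,
    standing in for the grep over .processing / .processed shown above."""
    hits = {}
    for name, text in changelogs.items():
        types = [line.split()[0]
                 for line in text.splitlines()
                 if gfid in line]
        if types:
            hits[name] = types
    return hits

def entry_done_but_retrying(processed, processing, gfid):
    """True when the E record for gfid sits in .processed while only
    D/M records remain in .processing -- the situation in this bug."""
    done = record_types(processed, gfid)
    pending = record_types(processing, gfid)
    entry_done = any("E" in types for types in done.values())
    only_data_meta = bool(pending) and all(
        set(types) <= {"D", "M"} for types in pending.values())
    return entry_done and only_data_meta
```

On the data from this bug, the second function returns True for gfid 018370fe-865c-40b4-bbff-c57c89d44829, matching the diagnosis: the entry operation was marked processed even though it failed on the slave, so the remaining D/M records can never succeed.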
https://code.engineering.redhat.com/gerrit/#/c/11999
https://code.engineering.redhat.com/gerrit/#/c/12028
Hold off testing until the next build; one more patch is needed to fix this completely. https://code.engineering.redhat.com/gerrit/#/c/12088
Verified on the build glusterfs-3.4.0.30rhs-2.el6rhs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html