Bug 1001684 - Dist-geo-rep : geo-rep failed to sync some regular files to the slave with slave saying "no such file or directory"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Vijaykumar Koppad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-08-27 13:58 UTC by Vijaykumar Koppad
Modified: 2014-08-25 00:50 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.4.0.31rhs-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-23 22:38:52 UTC
Embargoed:


Attachments
10.70.43.25_geo-rep-slave-client log file (4.38 MB, text/plain)
2013-08-30 14:54 UTC, Vijaykumar Koppad
10.70.43.30_geo-rep-slave-client log file (8.69 KB, text/plain)
2013-08-30 14:54 UTC, Vijaykumar Koppad
10.70.43.26_geo-rep-slave-client log file (3.81 MB, text/plain)
2013-08-30 14:54 UTC, Vijaykumar Koppad
10.70.43.31_geo-rep-slave-client log file (8.69 KB, text/plain)
2013-08-30 14:54 UTC, Vijaykumar Koppad
List of missing files from slave. (32.19 KB, text/plain)
2013-08-30 15:00 UTC, Vijaykumar Koppad

Description Vijaykumar Koppad 2013-08-27 13:58:55 UTC
Description of problem: After creating regular files on the master, geo-rep failed to sync a few files to the slave. The missing files had been processed by geo-rep from the .processing changelogs.

Consider the file 521ca484~~12JF2W3YRM, which is missing from the slave.

On the master:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redlemon bd42ad17ef8864d51407b1c6478f5dc6]# grep 521ca484~~12JF2W3YRM .processed/*
.processed/CHANGELOG.1377608854:E 187c5ed4-81c4-4bc9-9dcb-b829041a2789 CREATE 4496bd88-652c-479e-92b8-f2f40f39b512%2F521ca484~~12JF2W3YRM

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The slave log had entries like the following for the same file:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-27 13:16:29.856620] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~12JF2W3YRM with gfid (187c5ed4-81c4-4bc9-9dcb-b829041a2789): No such file or directory
[2013-08-27 13:16:29.856675] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 18136: MKNOD() <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~12JF2W3YRM => -1 (No such file or directory)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
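One way to check whether the parent directory (gfid 4496bd88-652c-479e-92b8-f2f40f39b512) ever made it to the slave is to look it up through the .glusterfs backend path on a slave brick; a sketch, assuming a hypothetical brick path:

# Every object on a brick has an entry under .glusterfs/<first-2-hex>/<next-2-hex>/<gfid>;
# if the parent gfid is absent on all slave bricks, the MKNOD of the child has nothing to land in.
ls -ld /bricks/slavevol/brick*/.glusterfs/44/96/4496bd88-652c-479e-92b8-f2f40f39b512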

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.24rhs-1.el6rhs.x86_64

How reproducible: Didn't try to reproduce it.


Steps to Reproduce:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave (dist-rep).
2. Create regular files on the master using the command "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 /mnt/master/".
3. Check whether everything syncs completely. (A hedged sketch of these commands follows below.)
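A minimal sketch of these steps, assuming hypothetical names (master host master.example.com, master volume "mastervol", slave host "slavehost", slave volume "slavevol") and the dist-geo-rep CLI of this release; the exact syntax may differ by build:

# Step 1: create and start the geo-rep session (hypothetical names).
gluster volume geo-replication mastervol slavehost::slavevol create push-pem
gluster volume geo-replication mastervol slavehost::slavevol start
# Step 2: mount the master volume and generate regular files with crefi.py (command from this report).
mount -t glusterfs master.example.com:/mastervol /mnt/master
./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 /mnt/master/
# Step 3: watch the session and compare master and slave contents once the changelogs are processed.
gluster volume geo-replication mastervol slavehost::slavevol status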

Actual results: geo-rep fails to sync some files to the slave.


Expected results: geo-rep should not fail to sync any files.


Additional info:

For a missing directory, the slave log had entries like:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2013-08-27 13:16:29.858833] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/level19 with gfid (c00bd253-693f-4cd1-82e2-7a7f04240a59): Invalid argument
[2013-08-27 13:16:29.858861] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 18144: MKDIR() <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/level19 => -1 (Invalid argument)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 1 Vijaykumar Koppad 2013-08-27 14:06:17 UTC
Along with the above entries, the geo-rep slave log also had "no subvolume for hash (value)" messages:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-27 13:16:29.856620] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~12JF2W3YRM with gfid (187c5ed4-81c4-4bc9-9dcb-b829041a2789): No such file or directory
[2013-08-27 13:16:29.856675] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 18136: MKNOD() <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~12JF2W3YRM => -1 (No such file or directory)
[2013-08-27 13:16:29.856898] W [dht-layout.c:179:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 2358681008
[2013-08-27 13:16:29.856932] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~MGODYPWTRX with gfid (fcd47fee-395e-4429-a316-252d4697d703): No such file or directory
[2013-08-27 13:16:29.856960] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 18137: MKNOD() <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~MGODYPWTRX => -1 (No such file or directory)
[2013-08-27 13:16:29.857145] W [dht-layout.c:179:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 2235270489


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 2 Vijaykumar Koppad 2013-08-27 14:22:44 UTC
It is happening consistently.

Comment 5 Amar Tumballi 2013-08-27 16:57:05 UTC
vkoppad, was there any brick disconnect during this test? The missing 'hash' subvolume error can happen in that situation.
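A quick way to check for that, assuming the slave volume is named "slave" as in the logs above:

# Verify all slave bricks show as Online.
gluster volume status slave
# Look for client/brick disconnect messages around the failure window.
grep -i disconnect /var/log/glusterfs/bricks/*.log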

Comment 6 Venky Shankar 2013-08-27 17:13:02 UTC
Vijaykumar,

Does the changelog have an MKDIR entry operation for gfid 4496bd88-652c-479e-92b8-f2f40f39b512?

If that too failed, there should be an error in the logs.

On which hostname(s) can I log in and take a look?
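A check along these lines in the geo-rep working directory on the master (the .processed directory shown in the description) would answer that, assuming the entry was recorded:

# Look for an entry operation creating the parent directory with that gfid.
grep MKDIR .processed/CHANGELOG.* | grep 4496bd88-652c-479e-92b8-f2f40f39b512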

Comment 7 Venky Shankar 2013-08-28 03:43:44 UTC
(In reply to Vijaykumar Koppad from comment #0)
> Description of problem: after creating regular files on the master, geo-rep
> failed to sync few files to the slave. The files which were missing were
> processed by the geo-rep from .processing files, 
> 
> consider the file 521ca484~~12JF2W3YRM which is missing from the slave
> 
> on master 

I don't see the file on the master either.

> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> [root@redlemon bd42ad17ef8864d51407b1c6478f5dc6]# grep 521ca484~~12JF2W3YRM
> .processed/*
> .processed/CHANGELOG.1377608854:E 187c5ed4-81c4-4bc9-9dcb-b829041a2789
> CREATE 4496bd88-652c-479e-92b8-f2f40f39b512%2F521ca484~~12JF2W3YRM


There is an UNLINK for this file in the changelog:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redlemon .processed]# grep '521ca484~~12JF2W3YRM' *
CHANGELOG.1377608854:E 187c5ed4-81c4-4bc9-9dcb-b829041a2789 CREATE 4496bd88-652c-479e-92b8-f2f40f39b512%2F521ca484~~12JF2W3YRM
CHANGELOG.1377612197:E 187c5ed4-81c4-4bc9-9dcb-b829041a2789 UNLINK 4496bd88-652c-479e-92b8-f2f40f39b512%2F521ca484~~12JF2W3YRM
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Comment 8 Venky Shankar 2013-08-28 03:49:39 UTC
OK, looks like this file was never created on the master after the UNLINK. The parent GFID is different (which implies that the directory was removed and created again - so it got a different GFID and has that entry in the changelog, but not for the file in question):

CHANGELOG.1377612277:E b5ddec9e-8d2e-4487-8917-f37168aa3a09 MKDIR 00000000-0000-0000-0000-000000000001%2Flevel09
CHANGELOG.1377612277:M b5ddec9e-8d2e-4487-8917-f37168aa3a09

[root@redeye slave]# getfattr -m . -n glusterfs.gfid.string level09
# file: level09
glusterfs.gfid.string="b5ddec9e-8d2e-4487-8917-f37168aa3a09"

So, the log entries you have given in the comments are actually old log entries.

Comment 9 Vijaykumar Koppad 2013-08-28 06:28:50 UTC
DHT is finding holes, due to which it cannot find the subvolume for a given hash range.
This could happen when bricks are offline, but no network failures are reported.
What is reported instead is blocking inodelk failures due to ENOENT:

[2013-08-27 13:12:29.646006] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-2: remote operation failed: No such file or directory
[2013-08-27 13:12:29.646249] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-0: remote operation failed: No such file or directory
[2013-08-27 13:12:29.646578] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-1: remote operation failed: No such file or directory
[2013-08-27 13:12:29.646979] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-3: remote operation failed: No such file or directory
[2013-08-27 13:12:29.647368] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-0: remote operation failed: No such file or directory
[2013-08-27 13:12:29.647911] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-2: remote operation failed: No such file or directory
[2013-08-27 13:12:29.648353] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-1: remote operation failed: No such file or directory
[2013-08-27 13:12:29.648388] I [afr-lk-common.c:1085:afr_lock_blocking] 0-slave-replicate-0: unable to lock on even one child
[2013-08-27 13:12:29.648406] I [afr-transaction.c:1063:afr_post_blocking_inodelk_cbk] 0-slave-replicate-0: Blocking inodelks failed.
[2013-08-27 13:12:29.649062] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-3: remote operation failed: No such file or directory
[2013-08-27 13:12:29.649096] I [afr-lk-common.c:1085:afr_lock_blocking] 0-slave-replicate-1: unable to lock on even one child
[2013-08-27 13:12:29.649113] I [afr-transaction.c:1063:afr_post_blocking_inodelk_cbk] 0-slave-replicate-1: Blocking inodelks failed.
[2013-08-27 13:12:29.650934] I [dht-layout.c:633:dht_layout_normalize] 0-slave-dht: found anomalies in (null). holes=1 overlaps=0 missing=0 down=0 misc=0
[2013-08-27 13:12:29.651214] W [dht-layout.c:179:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 2715612820
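One way to confirm the holes is to dump the DHT layout xattr of the affected directory on each slave brick; a sketch, assuming a hypothetical brick path:

# trusted.glusterfs.dht encodes the hash range this brick serves for the directory;
# a hash range covered by no brick is the "hole" dht_layout_normalize complains about.
getfattr -n trusted.glusterfs.dht -e hex /bricks/slave/brick1/<affected-directory>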

Comment 10 Vijaykumar Koppad 2013-08-28 07:05:25 UTC
Not able to reproduce it again on a fresh setup.

Comment 11 shishir gowda 2013-08-28 09:31:35 UTC
Based on comment 10, reducing the severity.

Also, can the blocker flag be removed, as the issue is no longer blocking QE testing?

Comment 12 Sachidananda Urs 2013-08-28 11:37:00 UTC
Vijaykumar is trying to reproduce the issue; removing the blocker flag until he hits it again. However, DEV has taken a look at the setup and is working on the RCA.

Comment 13 Vijaykumar Koppad 2013-08-30 14:54:22 UTC
Created attachment 792210 [details]
10.70.43.25_geo-rep-slave-client log file

Comment 14 Vijaykumar Koppad 2013-08-30 14:54:42 UTC
Created attachment 792211 [details]
10.70.43.30_geo-rep-slave-client log file

Comment 15 Vijaykumar Koppad 2013-08-30 14:54:42 UTC
Created attachment 792212 [details]
10.70.43.26_geo-rep-slave-client log file

Comment 16 Vijaykumar Koppad 2013-08-30 14:54:57 UTC
Created attachment 792213 [details]
10.70.43.31_geo-rep-slave-client log file

Comment 17 Vijaykumar Koppad 2013-08-30 15:00:17 UTC
Created attachment 792217 [details]
List of missing files from slave.

Comment 18 Vijaykumar Koppad 2013-08-30 15:05:44 UTC
I was able to hit it again in glusterfs-3.4.0.30rhs-2.el6rhs.x86_64. I have attached all the geo-rep slave client logs, as well as the list of files that were missing from the slave.

As in this bug's description, all the missing files had entries in changelogs that were processed by geo-rep. I hope I have given enough information.

Comment 19 Rachana Patel 2013-09-02 13:52:17 UTC
Faced a similar issue with glusterfs-3.4.0.24rhs-2.el6rhs.x86_64.

Steps:
1. Create and start a geo-rep session between the master and slave volumes (both distribute volumes). No data is present on the master.
2. Start creating data on the master volume.
3. Stop data creation after some time.
4. Once the sync is completed and there are no changelogs in the .processing directory, verify the data on the master and the slave (one way to compare is sketched after these steps).
5. Around 17 files were missing out of 50k files/directories.
6. Look into the logs for those files.
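One way to compare the two volumes, assuming both are mounted locally at /mnt/master and /mnt/slave:

# List entries present on the master mount but missing on the slave mount.
( cd /mnt/master && find . | sort ) > /tmp/master.list
( cd /mnt/slave && find . | sort ) > /tmp/slave.list
comm -23 /tmp/master.list /tmp/slave.list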

For example, the file with gfid bb3a710b-dc3d-4bd6-a25b-6186a38c4bce is present on the master volume but not on the slave.

The changelog entry is present on the master:

[root@4DVM5 .processed]# grep 'bb3a710b-dc3d-4bd6-a25b-6186a38c4bce' *
CHANGELOG.1377844357:E bb3a710b-dc3d-4bd6-a25b-6186a38c4bce SYMLINK caf14ecd-68c5-4dad-a687-d465030eb03d%2Fservertool
CHANGELOG.1377844357:M bb3a710b-dc3d-4bd6-a25b-6186a38c4bce
[root@4DVM5 .processed]# pwd
/var/run/gluster/change_add_master/ssh%3A%2F%2Froot%4010.70.37.1%3Agluster%3A%2F%2F127.0.0.1%3Achange_add__slave/00ce598912748d08d8518cb05bdcb67b/.processed

In the slave log:
[root@rhsauto026 geo-replication-slaves]# pwd
/var/log/glusterfs/geo-replication-slaves
[root@rhsauto026 geo-replication-slaves]# grep -B2 'bb3a710b-dc3d-4bd6-a25b-6186a38c4bce' 48a93238-d126-449a-a11b-98f9e1fffb99:gluster%3A%2F%2F127.0.0.1%3Achange_add__slave.gluster.log
[2013-08-30 11:47:01.927129] I [dht-layout.c:633:dht_layout_normalize] 0-change_add__slave-dht: found anomalies in (null). holes=1 overlaps=0 missing=0 down=0 misc=0
[2013-08-30 11:47:01.927385] W [dht-layout.c:179:dht_layout_search] 0-change_add__slave-dht: no subvolume for hash (value) = 899727189
[2013-08-30 11:47:01.927424] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:caf14ecd-68c5-4dad-a687-d465030eb03d>/servertool with gfid (bb3a710b-dc3d-4bd6-a25b-6186a38c4bce): No such file or directory

Comment 23 Vijaykumar Koppad 2013-09-06 13:32:49 UTC
I was able to hit it again in the build glusterfs-3.4.0.31rhs-1, in a fanout setup, while syncing symlinks in the first xsync crawl.

Steps I followed (a hedged sketch of the session commands follows the list):
1. Create and start a geo-rep fanout setup from one master to 4 slaves.
2. Stop all geo-rep sessions and create files on the master.
3. Start all the geo-rep sessions and let the files sync to all the slaves.
4. Stop all geo-rep sessions again and create symlinks to all those files.
5. Start the geo-rep sessions.
6. The first xsync crawl fails to sync around 16 to 20 files out of 2000, to all the slaves.
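A minimal sketch of the session handling in steps 2-5, assuming hypothetical names (one master volume "mastervol", one slave volume "slavevol" per slave host) and the CLI of this release:

# Stop every session before creating data on the master (steps 2 and 4), then restart them (steps 3 and 5).
for s in slave1 slave2 slave3 slave4; do
    gluster volume geo-replication mastervol ${s}::slavevol stop
done
# ... create files (step 2) or symlinks to them (step 4) on the master mount ...
for s in slave1 slave2 slave3 slave4; do
    gluster volume geo-replication mastervol ${s}::slavevol start
done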

On the master, the missing file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@Ray ~]# getfattr -h -n glusterfs.gfid.string /mnt/master/level05/level15/symlink_to_files/5229cf3e~~4CA6FC77NJ
getfattr: Removing leading '/' from absolute path names
# file: mnt/master/level05/level15/symlink_to_files/5229cf3e~~4CA6FC77NJ
glusterfs.gfid.string="0854f9ef-aa93-462a-aa02-10e9d510f34d"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

On the master, the missing file had an entry in the XSYNC-CHANGELOG:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@Ray bd42ad17ef8864d51407b1c6478f5dc6]# grep 0854f9ef-aa93-462a-aa02-10e9d510f34d xsync/*
xsync/XSYNC-CHANGELOG.1378471766:E 0854f9ef-aa93-462a-aa02-10e9d510f34d SYMLINK fddd8985-0185-4655-a6fd-565b7c5a54ea%2F5229cf3e%7E%7E4CA6FC77NJ
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

On the slave, for the same GFID:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-09-06 12:49:35.192857] W [dht-layout.c:179:dht_layout_search] 0-imaster1-dht:
 no subvolume for hash (value) = 3927056761
[2013-09-06 12:49:35.192898] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:fddd8985-0185-4655-a6fd-565b7c5a54ea>/5229cf3e~~4CA6FC77NJ with gfid (0854f9ef-aa93-462a-aa02-10e9d510f34d): No such file or directory
[2013-09-06 12:49:35.192930] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 726: SYMLINK() <gfid:fddd8985-0185-4655-a6fd-565b7c5a54ea>/5229cf3e~~4CA6FC77NJ => -1 (No such file or directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Since the logs are similar to this issue, I am moving it to the ASSIGNED state, though this happened while syncing symlinks in the first xsync crawl.

Comment 24 Amar Tumballi 2013-09-06 16:16:40 UTC
Vijaykumar,

Can you please open a separate bug for the new issue? The issue we tracked here was for the steps given in the description:

----

Steps to Reproduce:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave (dist-rep).
2. Create regular files on the master using the command "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 /mnt/master/".
3. Check whether everything syncs completely.

Actual results: geo-rep fails to sync some files to the slave.

----

In the latest verification steps:

----
Steps I followed:
1. Create and start a geo-rep fanout setup from one master to 4 slaves.
2. Stop all geo-rep sessions and create files on the master.
3. Start all the geo-rep sessions and let the files sync to all the slaves.
4. Stop all geo-rep sessions again and create symlinks to all those files.
5. Start the geo-rep sessions.
6. The first xsync crawl fails to sync around 16 to 20 files out of 2000, to all the slaves.

----

Considering it's a 'fanout' setup, it would be good to open another bug and close this one.

Comment 25 Vijaykumar Koppad 2013-09-09 11:13:03 UTC
Since this bug's description differs from the scenario in which I am hitting the similar issue, I am moving this bug to VERIFIED.

The symlink issue is hit only in cascaded and fanout setups, but the logs are similar. It is being tracked in Bug 1003580.

Comment 26 Scott Haines 2013-09-23 22:38:52 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html


