Description of problem:

After creating regular files on the master, geo-rep failed to sync a few files to the slave. The files which were missing had already been processed by geo-rep from the .processing changelogs. Consider the file 521ca484~~12JF2W3YRM, which is missing from the slave.

On the master:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redlemon bd42ad17ef8864d51407b1c6478f5dc6]# grep 521ca484~~12JF2W3YRM .processed/*
.processed/CHANGELOG.1377608854:E 187c5ed4-81c4-4bc9-9dcb-b829041a2789 CREATE 4496bd88-652c-479e-92b8-f2f40f39b512%2F521ca484~~12JF2W3YRM
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The slave log for the same file had entries like:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-27 13:16:29.856620] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~12JF2W3YRM with gfid (187c5ed4-81c4-4bc9-9dcb-b829041a2789): No such file or directory
[2013-08-27 13:16:29.856675] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 18136: MKNOD() <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~12JF2W3YRM => -1 (No such file or directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.24rhs-1.el6rhs.x86_64

How reproducible:
Didn't try to reproduce it.

Steps to Reproduce:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave (dist-rep).
2. Create regular files on the master using the command "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 /mnt/master/".
3. Check whether everything syncs completely (see the sketch after this comment).

Actual results:
geo-rep fails to sync some files to the slave.

Expected results:
geo-rep shouldn't fail to sync any files.

Additional info:
For a directory which is missing, the slave log had entries like:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-27 13:16:29.858833] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/level19 with gfid (c00bd253-693f-4cd1-82e2-7a7f04240a59): Invalid argument
[2013-08-27 13:16:29.858861] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 18144: MKDIR() <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/level19 => -1 (Invalid argument)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
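For step 3, one way to check whether the sync completed is to walk both volumes and diff the listings. Below is a minimal sketch, assuming the master and slave volumes are FUSE-mounted locally; /mnt/slave is a hypothetical mount point for illustration, not part of the reported setup.

#!/usr/bin/env python3
# Minimal sketch: report entries present on the master mount but missing on
# the slave mount. Mount points are illustrative assumptions.
import os

MASTER_MNT = '/mnt/master'   # mount used in the crefi.py command above
SLAVE_MNT = '/mnt/slave'     # hypothetical local mount of the slave volume

def listing(mount):
    """Collect paths of all files/dirs below 'mount', relative to it."""
    paths = set()
    for root, dirs, files in os.walk(mount):
        for name in dirs + files:
            paths.add(os.path.relpath(os.path.join(root, name), mount))
    return paths

missing = sorted(listing(MASTER_MNT) - listing(SLAVE_MNT))
print('%d entries missing on the slave' % len(missing))
for path in missing:
    print(path)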
Along with the above logs, the geo-rep slave log also had "no subvolume for hash (value)" messages:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-08-27 13:16:29.856620] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~12JF2W3YRM with gfid (187c5ed4-81c4-4bc9-9dcb-b829041a2789): No such file or directory
[2013-08-27 13:16:29.856675] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 18136: MKNOD() <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~12JF2W3YRM => -1 (No such file or directory)
[2013-08-27 13:16:29.856898] W [dht-layout.c:179:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 2358681008
[2013-08-27 13:16:29.856932] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~MGODYPWTRX with gfid (fcd47fee-395e-4429-a316-252d4697d703): No such file or directory
[2013-08-27 13:16:29.856960] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 18137: MKNOD() <gfid:4496bd88-652c-479e-92b8-f2f40f39b512>/521ca484~~MGODYPWTRX => -1 (No such file or directory)
[2013-08-27 13:16:29.857145] W [dht-layout.c:179:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 2235270489
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
It is happening consistently.
vkoppad, was there any brick disconnect during this test? The missing 'hash' subvolume error can happen in that situation.
Vijaykumar, does the changelog have an MKDIR entry operation for gfid 4496bd88-652c-479e-92b8-f2f40f39b512? If this too has failed, there should be an error in the logs. Hostname(s) where I can log in and take a look?
(In reply to Vijaykumar Koppad from comment #0)
> Description of problem: after creating regular files on the master, geo-rep
> failed to sync few files to the slave. The files which were missing were
> processed by the geo-rep from .processing files,
>
> consider the file 521ca484~~12JF2W3YRM which is missing from the slave
>
> on master

I don't see the file on the master either.

> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> [root@redlemon bd42ad17ef8864d51407b1c6478f5dc6]# grep 521ca484~~12JF2W3YRM
> .processed/*
> .processed/CHANGELOG.1377608854:E 187c5ed4-81c4-4bc9-9dcb-b829041a2789
> CREATE 4496bd88-652c-479e-92b8-f2f40f39b512%2F521ca484~~12JF2W3YRM

There is an UNLINK for this file in the changelog:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@redlemon .processed]# grep '521ca484~~12JF2W3YRM' *
CHANGELOG.1377608854:E 187c5ed4-81c4-4bc9-9dcb-b829041a2789 CREATE 4496bd88-652c-479e-92b8-f2f40f39b512%2F521ca484~~12JF2W3YRM
CHANGELOG.1377612197:E 187c5ed4-81c4-4bc9-9dcb-b829041a2789 UNLINK 4496bd88-652c-479e-92b8-f2f40f39b512%2F521ca484~~12JF2W3YRM
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
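As a side note, the grep above can also be done programmatically to spot CREATE/UNLINK pairs for any missing file. A minimal sketch, assuming the line-oriented "E <gfid> <FOP> <pgfid>%2F<basename>" record format visible in the output above; the .processed directory path is a placeholder to be replaced with the real session working directory.

#!/usr/bin/env python3
# Minimal sketch: scan processed changelogs for entry (E) records that name a
# given file, so CREATE/UNLINK pairs like the one above stand out. Record
# format assumed from the grep output in this bug.
import glob
import os
from urllib.parse import unquote

def entries_for(name, processed_dir):
    """Yield (changelog, gfid, fop) for E records whose entry name is 'name'."""
    suffix = '/' + name
    for changelog in sorted(glob.glob(os.path.join(processed_dir, 'CHANGELOG.*'))):
        with open(changelog, errors='replace') as f:
            for line in f:
                fields = line.split()
                if len(fields) < 4 or fields[0] != 'E':
                    continue  # only entry records carry <pgfid>%2F<basename>
                if any(unquote(fld).endswith(suffix) for fld in fields[3:]):
                    yield os.path.basename(changelog), fields[1], fields[2]

if __name__ == '__main__':
    processed = '/var/run/gluster/<session>/<hash>/.processed'  # placeholder
    for changelog, gfid, fop in entries_for('521ca484~~12JF2W3YRM', processed):
        print(changelog, fop, gfid)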
OK, looks like this file was never created on the master after the UNLINK. The parent GFID is different, which implies that the directory was removed and created again - so it got a different GFID and has that entry in the changelog, but not for the file in question:

CHANGELOG.1377612277:E b5ddec9e-8d2e-4487-8917-f37168aa3a09 MKDIR 00000000-0000-0000-0000-000000000001%2Flevel09
CHANGELOG.1377612277:M b5ddec9e-8d2e-4487-8917-f37168aa3a09

[root@redeye slave]# getfattr -m . -n glusterfs.gfid.string level09
# file: level09
glusterfs.gfid.string="b5ddec9e-8d2e-4487-8917-f37168aa3a09"

So, the log entries you have given in the comments are actually old log entries.
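The same GFID check can be scripted when many paths need to be compared. A minimal sketch, assuming both volumes are FUSE-mounted locally and expose the glusterfs.gfid.string virtual xattr used with getfattr above; the mount points are illustrative assumptions.

#!/usr/bin/env python3
# Minimal sketch: compare the glusterfs.gfid.string virtual xattr of the same
# relative path on a master mount and a slave mount, to tell "missing on the
# slave" apart from "recreated with a different GFID".
import errno
import os

MASTER_MNT = '/mnt/master'   # hypothetical mount of the master volume
SLAVE_MNT = '/mnt/slave'     # hypothetical mount of the slave volume

def gfid_of(mount, relpath):
    """Return the GFID string of mount/relpath, or None if the entry is absent."""
    try:
        return os.getxattr(os.path.join(mount, relpath),
                           'glusterfs.gfid.string').decode()
    except OSError as err:
        if err.errno == errno.ENOENT:
            return None
        raise

def compare(relpath):
    master_gfid = gfid_of(MASTER_MNT, relpath)
    slave_gfid = gfid_of(SLAVE_MNT, relpath)
    if slave_gfid is None:
        print('%s: missing on slave (master gfid %s)' % (relpath, master_gfid))
    elif master_gfid != slave_gfid:
        print('%s: gfid mismatch master=%s slave=%s'
              % (relpath, master_gfid, slave_gfid))
    else:
        print('%s: in sync (%s)' % (relpath, master_gfid))

if __name__ == '__main__':
    compare('level09')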
DHT is finding holes in the layout, due to which it cannot find the subvolume for a given hash range. This can happen when bricks are offline, but no network failures are reported here. What is reported instead is blocking inodelk failures due to ENOENT:

[2013-08-27 13:12:29.646006] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-2: remote operation failed: No such file or directory
[2013-08-27 13:12:29.646249] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-0: remote operation failed: No such file or directory
[2013-08-27 13:12:29.646578] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-1: remote operation failed: No such file or directory
[2013-08-27 13:12:29.646979] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-3: remote operation failed: No such file or directory
[2013-08-27 13:12:29.647368] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-0: remote operation failed: No such file or directory
[2013-08-27 13:12:29.647911] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-2: remote operation failed: No such file or directory
[2013-08-27 13:12:29.648353] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-1: remote operation failed: No such file or directory
[2013-08-27 13:12:29.648388] I [afr-lk-common.c:1085:afr_lock_blocking] 0-slave-replicate-0: unable to lock on even one child
[2013-08-27 13:12:29.648406] I [afr-transaction.c:1063:afr_post_blocking_inodelk_cbk] 0-slave-replicate-0: Blocking inodelks failed.
[2013-08-27 13:12:29.649062] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-slave-client-3: remote operation failed: No such file or directory
[2013-08-27 13:12:29.649096] I [afr-lk-common.c:1085:afr_lock_blocking] 0-slave-replicate-1: unable to lock on even one child
[2013-08-27 13:12:29.649113] I [afr-transaction.c:1063:afr_post_blocking_inodelk_cbk] 0-slave-replicate-1: Blocking inodelks failed.
[2013-08-27 13:12:29.650934] I [dht-layout.c:633:dht_layout_normalize] 0-slave-dht: found anomalies in (null). holes=1 overlaps=0 missing=0 down=0 misc=0
[2013-08-27 13:12:29.651214] W [dht-layout.c:179:dht_layout_search] 0-slave-dht: no subvolume for hash (value) = 2715612820
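A minimal sketch for pulling these related warnings out of a slave client log in one pass, so they can be lined up by timestamp with the failed entry creations. The patterns are taken from the messages quoted in this bug; pass one of the attached slave client logs as the argument.

#!/usr/bin/env python3
# Minimal sketch: filter a geo-rep slave client log for DHT layout anomalies,
# missing-hash-subvolume warnings and inodelk failures, printing matching
# lines verbatim so their timestamps can be correlated.
import re
import sys

PATTERNS = [re.compile(p) for p in (
    r'no subvolume for hash',
    r'found anomalies .* holes=[1-9]',
    r'inodelk_cbk.*remote operation failed',
    r'Blocking inodelks failed',
)]

def scan(logfile):
    with open(logfile, errors='replace') as f:
        for line in f:
            if any(p.search(line) for p in PATTERNS):
                sys.stdout.write(line)

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit('usage: scan_slave_log.py <slave-client-log>')
    scan(sys.argv[1])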
Not able to reproduce it again in a fresh setup.
Based on comment 10, reducing the severity. Also, can the blocker flag be removed, since the issue is not blocking QE testing anymore?
Vijaykumar is trying to reproduce the issue; removing the blocker flag till he hits it again. However, DEV has taken a look at the setup and is working on the RCA.
Created attachment 792210 [details] 10.70.43.25_geo-rep-slave-client log file
Created attachment 792211 [details] 10.70.43.30_geo-rep-slave-client log file
Created attachment 792212 [details] 10.70.43.26_geo-rep-slave-client log file
Created attachment 792213 [details] 10.70.43.31_geo-rep-slave-client log file
Created attachment 792217 [details] List of missing files from slave.
I was able to hit it again with glusterfs-3.4.0.30rhs-2.el6rhs.x86_64. I have attached all the geo-rep slave client logs along with a list of the files that were missing from the slave. As in the bug description, all the missing files had entries in changelogs that geo-rep had already processed. Hope I have given enough information.
Faced a similar issue with glusterfs-3.4.0.24rhs-2.el6rhs.x86_64.

Steps:
1. Create and start a geo-rep session between a master and a slave volume (both distribute volumes). No data is present on the master.
2. Now start creating data on the master volume.
3. Stop data creation after some time.
4. Once the sync is completed and there are no changelogs left in the .processing directory, verify the data on the master and the slave.
5. Around 17 files out of 50k files/directories were missing.
6. Look into the logs for those files.

For example, the file with gfid bb3a710b-dc3d-4bd6-a25b-6186a38c4bce is present on the master volume but not on the slave.

The changelog entry is present on the master:

[root@4DVM5 .processed]# grep 'bb3a710b-dc3d-4bd6-a25b-6186a38c4bce' *
CHANGELOG.1377844357:E bb3a710b-dc3d-4bd6-a25b-6186a38c4bce SYMLINK caf14ecd-68c5-4dad-a687-d465030eb03d%2Fservertool
CHANGELOG.1377844357:M bb3a710b-dc3d-4bd6-a25b-6186a38c4bce
[root@4DVM5 .processed]# pwd
/var/run/gluster/change_add_master/ssh%3A%2F%2Froot%4010.70.37.1%3Agluster%3A%2F%2F127.0.0.1%3Achange_add__slave/00ce598912748d08d8518cb05bdcb67b/.processed

In the slave log:

[root@rhsauto026 geo-replication-slaves]# pwd
/var/log/glusterfs/geo-replication-slaves
[root@rhsauto026 geo-replication-slaves]# grep -B2 'bb3a710b-dc3d-4bd6-a25b-6186a38c4bce' 48a93238-d126-449a-a11b-98f9e1fffb99:gluster%3A%2F%2F127.0.0.1%3Achange_add__slave.gluster.log
[2013-08-30 11:47:01.927129] I [dht-layout.c:633:dht_layout_normalize] 0-change_add__slave-dht: found anomalies in (null). holes=1 overlaps=0 missing=0 down=0 misc=0
[2013-08-30 11:47:01.927385] W [dht-layout.c:179:dht_layout_search] 0-change_add__slave-dht: no subvolume for hash (value) = 899727189
[2013-08-30 11:47:01.927424] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:caf14ecd-68c5-4dad-a687-d465030eb03d>/servertool with gfid (bb3a710b-dc3d-4bd6-a25b-6186a38c4bce): No such file or directory
https://code.engineering.redhat.com/gerrit/#/c/12111
I was able to hit it again in the build glusterfs-3.4.0.31rhs-1, in a fanout setup, while syncing symlinks in the first xsync crawl.

Steps I followed:
1. Create and start a geo-rep fanout setup from one master to 4 slaves.
2. Stop all geo-rep sessions and create files on the master.
3. Start all the geo-rep sessions and let the files sync to all the slaves.
4. Now stop all geo-rep sessions and create symlinks to all those files.
5. Start the geo-rep sessions.
6. The first xsync crawl fails to sync around 16 to 20 files out of 2000 to all the slaves.

On the master, the missing file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@Ray ~]# getfattr -h -n glusterfs.gfid.string /mnt/master/level05/level15/symlink_to_files/5229cf3e~~4CA6FC77NJ
getfattr: Removing leading '/' from absolute path names
# file: mnt/master/level05/level15/symlink_to_files/5229cf3e~~4CA6FC77NJ
glusterfs.gfid.string="0854f9ef-aa93-462a-aa02-10e9d510f34d"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

On the master, the missing file had entries in the XSYNC-CHANGELOG:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[root@Ray bd42ad17ef8864d51407b1c6478f5dc6]# grep 0854f9ef-aa93-462a-aa02-10e9d510f34d xsync/*
xsync/XSYNC-CHANGELOG.1378471766:E 0854f9ef-aa93-462a-aa02-10e9d510f34d SYMLINK fddd8985-0185-4655-a6fd-565b7c5a54ea%2F5229cf3e%7E%7E4CA6FC77NJ
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

On the slave, for the same GFID:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-09-06 12:49:35.192857] W [dht-layout.c:179:dht_layout_search] 0-imaster1-dht: no subvolume for hash (value) = 3927056761
[2013-09-06 12:49:35.192898] I [fuse-bridge.c:3515:fuse_auxgfid_newentry_cbk] 0-fuse-aux-gfid-mount: failed to create the entry <gfid:fddd8985-0185-4655-a6fd-565b7c5a54ea>/5229cf3e~~4CA6FC77NJ with gfid (0854f9ef-aa93-462a-aa02-10e9d510f34d): No such file or directory
[2013-09-06 12:49:35.192930] W [fuse-bridge.c:1627:fuse_err_cbk] 0-glusterfs-fuse: 726: SYMLINK() <gfid:fddd8985-0185-4655-a6fd-565b7c5a54ea>/5229cf3e~~4CA6FC77NJ => -1 (No such file or directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Since the logs are similar to this issue, I am moving it back to ASSIGNED, though this happened while syncing symlinks in the first xsync crawl.
VijayKoppad, can you please open a new bug for this latest issue? The issue we tracked here was for the steps given in the description:

----
Steps to Reproduce:
1. Create and start a geo-rep relationship between the master (dist-rep) and the slave (dist-rep).
2. Create regular files on the master using the command "./crefi.py -n 10 --multi -b 10 -d 10 --random --max=500K --min=10 /mnt/master/".
3. Check whether everything syncs completely.

Actual results: geo-rep fails to sync some files to the slave.
----

In the latest verification the steps were:

----
1. Create and start a geo-rep fanout setup from one master to 4 slaves.
2. Stop all geo-rep sessions and create files on the master.
3. Start all the geo-rep sessions and let the files sync to all the slaves.
4. Now stop all geo-rep sessions and create symlinks to all those files.
5. Start the geo-rep sessions.
6. The first xsync crawl fails to sync around 16 to 20 files out of 2000 to all the slaves.
----

Considering it is a 'fanout' setup, it would be good to open another bug and close this one.
Since the description of this issue differs from the scenario in which I am hitting the similar issue, I am moving this bug to VERIFIED. The issue with symlinks is hit only in cascaded and fanout setups, but the logs are similar. The symlink issue is being tracked in Bug 1003580.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html