Description of problem:
Directory syncing is causing changelog retries because of rsync errors on directory gfids.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Set up geo-replication.
2. Create a directory on the master.
3. Let geo-rep sync it to the slave.

Actual results:
Directory syncing is causing retries. The directory is synced, but with a lot of retries. I think rsync on the directory gfid is failing:

  rsync /mastermnt/.gfid/<dirgfid> /slavemnt/

Expected results:
Directory syncing should not involve any retries; the retries cause performance degradation.

Analysis:
I found that the issue was introduced by the patch below.

commit 9b4de61a136b8e5ba7bf0e48690cdb1292d0dee8
cluster/dht : User xattrs are not healed after brick stop/start

Additional info:

Geo-replication log (/var/log/glusterfs/geo-replication/master/ssh\%3A\%2F\%2Froot\%40192.168.122.13\%3Agluster\%3A\%2F\%2F127.0.0.1\%3Aslave.log):

[2017-10-08 11:12:31.657124] I [master(/bricks/brick0/b0):1382:crawl] _GMaster: slave's time stime=(1507460923, 0)
[2017-10-08 11:12:31.867773] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1440 num_files=1 job=1 return_code=23
[2017-10-08 11:12:31.871272] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:32.875348] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1471 num_files=1 job=3 return_code=23
[2017-10-08 11:12:32.880616] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:33.907165] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1759 num_files=1 job=2 return_code=23
[2017-10-08 11:12:33.908428] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:35.39695] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1532 num_files=1 job=1 return_code=23
[2017-10-08 11:12:35.42959] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:36.42484] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1516 num_files=1 job=3 return_code=23
[2017-10-08 11:12:36.47372] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:37.95174] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1764 num_files=1 job=2 return_code=23
[2017-10-08 11:12:37.98913] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:38.209673] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1562 num_files=1 job=1 return_code=23
[2017-10-08 11:12:38.214554] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:39.236735] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1782 num_files=1 job=3 return_code=23
[2017-10-08 11:12:39.241420] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:40.267714] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1557 num_files=1 job=2 return_code=23
[2017-10-08 11:12:40.272130] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs files=['CHANGELOG.1507461149']
[2017-10-08 11:12:41.384635] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1578 num_files=1 job=1 return_code=23
[2017-10-08 11:12:41.389205] E [master(/bricks/brick0/b0):1246:process] _GMaster: changelogs could not be processed completely - moving on... files=['CHANGELOG.1507461149']
[2017-10-08 11:12:41.408358] I [master(/bricks/brick0/b0):1295:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=0 duration=0.0000 UNL=0
[2017-10-08 11:12:41.410573] I [master(/bricks/brick0/b0):1305:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0.0000 data_duration=1507461161.4104 DATA=0 XATT=0
[2017-10-08 11:12:41.412946] I [master(/bricks/brick0/b0):1315:process] _GMaster: Batch Completed changelog_end=1507461149 entry_stime=(1507461148, 0) changelog_start=1507461149 stime=(1507461148, 0) duration=9.7488 num_changelogs=1 mode=live_changelog

Slave mount log (/var/log/glusterfs/geo-replication-slaves/3ab6f770-d2a9-440f-ba28-83bcf1a40eb6:fedora3.\%2Fbricks\%2Fbrick0\%2Fb0.slave.gluster.log):

[2017-10-08 11:12:31.680512] E [MSGID: 101046] [dht-common.c:702:dht_discover_complete] 0-slave-dht: dict is null
[2017-10-08 11:12:31.687599] I [dict.c:471:dict_get] (-->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28c31) [0x7f4af9601c31] -->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28448) [0x7f4af9601448] -->/usr/local/lib/libglusterfs.so.0(dict_get+0x79) [0x7f4b0765662d] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]
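The retry pattern visible in the log above (a failed sync with rsync return_code=23 is retried roughly once a second until gsyncd gives up and logs "changelogs could not be processed completely - moving on...") can be sketched as follows. This is a minimal illustration, not gsyncd's actual code: the function names and the retry limit of 10 are assumptions based only on the timestamps in the log.

```python
# Minimal sketch of a changelog retry loop, loosely modelled on the
# behaviour visible in the geo-rep log above. The names and the retry
# limit are illustrative assumptions, not gsyncd internals.
MAX_RETRIES = 10

def sync_changelog(changelog, syncer):
    """Retry a changelog sync until rsync succeeds or retries run out."""
    for attempt in range(MAX_RETRIES):
        return_code = syncer(changelog)   # e.g. wraps an rsync invocation
        if return_code == 0:
            return True                   # synced cleanly, no retry needed
        # rsync exit code 23 means "partial transfer due to error";
        # gsyncd logs "incomplete sync, retrying changelogs" and retries.
    # Retries exhausted; gsyncd would log
    # "changelogs could not be processed completely - moving on..."
    return False
```

Because each failing directory sync burns through the full retry budget before the batch completes, every directory creation adds roughly MAX_RETRIES seconds of wasted work, which is the performance degradation described above.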
> [2017-10-08 11:12:31.687599] I [dict.c:471:dict_get] (-->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28c31) [0x7f4af9601c31] -->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28448) [0x7f4af9601448] -->/usr/local/lib/libglusterfs.so.0(dict_get+0x79) [0x7f4b0765662d] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]

I think we are not creating "trusted.glusterfs.dht.mds" in the discover codepath. Note that dht on geo-rep slave mounts receives only nameless lookups. We do create the mds xattr in the nameless lookup codepath, but only if the directory is missing on one of the subvolumes. I guess in this case the directory is present on all subvols, but only the mds xattr is missing.

@Kotresh,

The above hypothesis is only true for those directories that were created before the patch in question was merged. If a directory is created _after_ upgrading, you shouldn't see these errors. Can you please check whether this issue is seen for directories that were created after the upgrade is done?
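To verify the hypothesis above, one can check whether a directory actually carries the mds xattr; a missing extended attribute comes back as ENODATA. A minimal sketch (the helper name and the example path are assumptions for illustration; on a real brick the attribute is in the trusted. namespace and reading it requires root):

```python
import errno
import os

def has_xattr(path, name):
    """Return True if `path` carries the extended attribute `name`."""
    try:
        os.getxattr(path, name)
        return True
    except OSError as e:
        if e.errno in (errno.ENODATA, errno.ENOTSUP):
            return False  # attribute absent (or fs lacks xattr support)
        raise

# On a brick backend (as root) the check would look like:
#   has_xattr("/bricks/brick0/b0/somedir", "trusted.glusterfs.dht.mds")
```

The same check can be done from the shell with `getfattr -n trusted.glusterfs.dht.mds -m . <dir>` on the brick.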
(In reply to Raghavendra G from comment #1)
> @Kotresh,
>
> The above hypothesis is only true for those directories that were created
> before the patch in question is merged. If a directory is created _after_
> upgrading, you shouldn't see these errors. Can you please check whether this
> issue is seen for directories that were created after upgrade is done?

Retries are happening for new directory creations as well.
As per the slave logs in #c1, it seems the xattr is NULL on the local frame; I think the retries are happening because the xattr is NULL.
(In reply to Mohit Agrawal from comment #3)
> As per slave logs in #c1 it seems xattr is NULL on local frame,I think
> because xattr is NULL so retries are happening.

The retries are happening in the normal use case for every directory being synced. This needs to be analysed further and fixed. Could you take a deeper look at it?

-Kotresh HR
Hi Mohit,

Even with the patch https://review.gluster.org/#/c/18436/, the issue is reproducible. Please take a look at it. I have made this a blocker for the 3.13.0 release.

Thanks,
Kotresh HR
This bug has been moved to https://github.com/gluster/glusterfs/issues/914 and will be tracked there from now on. Visit the GitHub issue for further details.