Bug 1499520 - Geo-rep: Lot of changelog(rsync) retries and failure during directory syncing. [NEEDINFO]
Summary: Geo-rep: Lot of changelog(rsync) retries and failure during directory syncing.
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: glusterfs-3.13.0 1565577 1571069 1576767 1580215 1600671
 
Reported: 2017-10-08 11:23 UTC by Kotresh HR
Modified: 2020-03-12 12:37 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1600671 (view as bug list)
Environment:
Last Closed: 2020-03-12 12:37:31 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
khiremat: needinfo? (moagrawa)



Description Kotresh HR 2017-10-08 11:23:09 UTC
Description of problem:
Directory syncing is causing changelog retries because of rsync errors on directory gfids.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Setup geo-replication
2. Create directory on master
3. Let geo-rep sync to slave

Actual results:
Directory syncing causes changelog retries. The directory is eventually synced, but only after many retries.
I think rsync on the directory gfid is failing. (rsync /mastermnt/.gfid/<dirgfid> /slavemnt/)
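For reference, the return_code=23 seen in the geo-rep log below is rsync's documented "partial transfer due to error" status, which is consistent with the sync failing on the directory gfid. A minimal sketch of interpreting the worker's rsync return code (the helper name and code table are illustrative, taken from rsync(1), not from gsyncd itself):

```python
# Map of common rsync exit codes to their documented meanings
# (see rsync(1), EXIT VALUES). Illustrative subset only.
RSYNC_EXIT_CODES = {
    0: "success",
    1: "syntax or usage error",
    3: "errors selecting input/output files, dirs",
    12: "error in rsync protocol data stream",
    23: "partial transfer due to error",
    24: "partial transfer due to vanished source files",
}

def describe_rsync_rc(rc):
    """Return a human-readable description for an rsync return code."""
    return RSYNC_EXIT_CODES.get(rc, "unknown error (code %d)" % rc)
```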

Expected results:
Directory syncing should not involve any retries; the retries cause performance degradation.

Analysis:
I found that the issue was introduced by the patch below.

commit 9b4de61a136b8e5ba7bf0e48690cdb1292d0dee8
cluster/dht : User xattrs are not healed after brick stop/start


Additional info:

Geo-replication log:

[2017-10-08 11:12:31.657124] I [master(/bricks/brick0/b0):1382:crawl] _GMaster: slave's time    stime=(1507460923, 0)
[2017-10-08 11:12:31.867773] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1440 num_files=1     job=1   return_code=23
[2017-10-08 11:12:31.871272] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:32.875348] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1471 num_files=1     job=3   return_code=23
[2017-10-08 11:12:32.880616] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:33.907165] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1759 num_files=1     job=2   return_code=23
[2017-10-08 11:12:33.908428] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:35.39695] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken  duration=0.1532 num_files=1     job=1   return_code=23
[2017-10-08 11:12:35.42959] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs   files=['CHANGELOG.1507461149']
[2017-10-08 11:12:36.42484] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken  duration=0.1516 num_files=1     job=3   return_code=23
[2017-10-08 11:12:36.47372] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs   files=['CHANGELOG.1507461149']
[2017-10-08 11:12:37.95174] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken  duration=0.1764 num_files=1     job=2   return_code=23
[2017-10-08 11:12:37.98913] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs   files=['CHANGELOG.1507461149']
[2017-10-08 11:12:38.209673] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1562 num_files=1     job=1   return_code=23
[2017-10-08 11:12:38.214554] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:39.236735] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1782 num_files=1     job=3   return_code=23
[2017-10-08 11:12:39.241420] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:40.267714] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1557 num_files=1     job=2   return_code=23
[2017-10-08 11:12:40.272130] W [master(/bricks/brick0/b0):1267:process] _GMaster: incomplete sync, retrying changelogs  files=['CHANGELOG.1507461149']
[2017-10-08 11:12:41.384635] I [master(/bricks/brick0/b0):1860:syncjob] Syncer: Sync Time Taken duration=0.1578 num_files=1     job=1   return_code=23
[2017-10-08 11:12:41.389205] E [master(/bricks/brick0/b0):1246:process] _GMaster: changelogs could not be processed completely - moving on...   files=['CHANGELOG.1507461149']
[2017-10-08 11:12:41.408358] I [master(/bricks/brick0/b0):1295:process] _GMaster: Entry Time Taken      MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0   CRE=0   duration=0.0000 UNL=0
[2017-10-08 11:12:41.410573] I [master(/bricks/brick0/b0):1305:process] _GMaster: Data/Metadata Time Taken      SETA=0  SETX=0  meta_duration=0.0000    data_duration=1507461161.4104   DATA=0  XATT=0
[2017-10-08 11:12:41.412946] I [master(/bricks/brick0/b0):1315:process] _GMaster: Batch Completed       changelog_end=1507461149        entry_stime=(1507461148, 0)     changelog_start=1507461149      stime=(1507461148, 0)   duration=9.7488 num_changelogs=1        mode=live_changelog
/var/log/glusterfs/geo-replication/master/ssh\%3A\%2F\%2Froot\%40192.168.122.13\%3Agluster\%3A\%2F\%2F127.0.0.1\%3Aslave.log
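The pattern in the log above (repeated "incomplete sync, retrying changelogs" followed by "changelogs could not be processed completely - moving on...") reflects geo-rep's changelog retry behaviour: a failing batch is retried a fixed number of times before the worker gives up and advances the stime. A simplified sketch of that control flow (function names and the retry limit are illustrative, not gsyncd's actual code):

```python
def process_changelog_batch(sync_batch, changelogs, max_retries=10):
    """Retry syncing a changelog batch; give up after max_retries.

    `sync_batch` is a callable returning True when the batch synced
    cleanly (rsync return code 0) and False otherwise. Returns True
    if the batch eventually synced, False in the case where geo-rep
    would log "changelogs could not be processed completely".
    """
    for attempt in range(1, max_retries + 1):
        if sync_batch(changelogs):
            return True
        # mirrors: "incomplete sync, retrying changelogs files=[...]"
        print("incomplete sync, retrying changelogs files=%s "
              "(attempt %d/%d)" % (changelogs, attempt, max_retries))
    # mirrors: "changelogs could not be processed completely - moving on..."
    return False
```

With a persistently failing rsync, as in this bug, every directory creation burns through the full retry budget before the worker moves on, which is where the performance degradation comes from.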


Slave mount logs:

[2017-10-08 11:12:31.680512] E [MSGID: 101046] [dht-common.c:702:dht_discover_complete] 0-slave-dht: dict is null
[2017-10-08 11:12:31.687599] I [dict.c:471:dict_get] (-->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28c31) [0x7f4af9601c31] -->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28448) [0x7f4af9601448] -->/usr/local/lib/libglusterfs.so.0(dict_get+0x79) [0x7f4b0765662d] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]
/var/log/glusterfs/geo-replication-slaves/3ab6f770-d2a9-440f-ba28-83bcf1a40eb6:fedora3.\%2Fbricks\%2Fbrick0\%2Fb0.slave.gluster.log

Comment 1 Raghavendra G 2017-10-09 05:38:41 UTC
> [2017-10-08 11:12:31.687599] I [dict.c:471:dict_get] (-->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28c31) [0x7f4af9601c31] -->/usr/local/lib/glusterfs/4.0dev/xlator/cluster/distribute.so(+0x28448) [0x7f4af9601448] -->/usr/local/lib/libglusterfs.so.0(dict_get+0x79) [0x7f4b0765662d] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]

I think we are not creating "trusted.glusterfs.dht.mds" in the discover codepath. Note that dht on geo-rep slave mounts receives only nameless lookups. We do create the mds xattr in the nameless lookup codepath, but only if the directory is missing on one of the subvolumes. I guess in this case the directory is present on all subvols and only the mds xattr is missing.
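The hypothesis that trusted.glusterfs.dht.mds is missing on existing directories could be checked directly on a brick. A small sketch using os.getxattr (the xattr name comes from the log above; note that reading trusted.* xattrs requires root, and the injectable getter is only there to make the helper testable):

```python
import errno
import os

MDS_XATTR = "trusted.glusterfs.dht.mds"

def has_mds_xattr(path, xattr_name=MDS_XATTR, _getxattr=os.getxattr):
    """Return True if `path` carries the dht mds xattr, False if absent.

    Other errors (e.g. EPERM when not running as root, since trusted.*
    xattrs need CAP_SYS_ADMIN) propagate to the caller.
    """
    try:
        _getxattr(path, xattr_name)
        return True
    except OSError as e:
        if e.errno == errno.ENODATA:
            return False
        raise
```

Run as root against a brick directory, this would distinguish pre-patch directories (xattr absent) from directories created after the patch (xattr present).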

@Kotresh,

The above hypothesis is only true for those directories that were created before the patch in question is merged. If a directory is created _after_ upgrading, you shouldn't see these errors. Can you please check whether this issue is seen for directories that were created after upgrade is done?

Comment 2 Kotresh HR 2017-10-10 12:56:36 UTC
(In reply to Raghavendra G from comment #1)

> @Kotresh,
> 
> The above hypothesis is only true for those directories that were created
> before the patch in question is merged. If a directory is created _after_
> upgrading, you shouldn't see these errors. Can you please check whether this
> issue is seen for directories that were created after upgrade is done?

Retries are happening for the new directory creations as well.

Comment 3 Mohit Agrawal 2017-10-11 02:21:41 UTC
As per the slave logs in #c1, the xattr appears to be NULL on the local frame; I think the retries are happening because the xattr is NULL.

Comment 4 Kotresh HR 2017-11-02 11:32:37 UTC
(In reply to Mohit Agrawal from comment #3)
> As per slave logs in #c1 it seems xattr is NULL on local frame,I think
> because xattr is NULL so retries are happening.

The retries happen in the normal use case for every directory being synced. This needs to be analysed and fixed. Could you take a further look at this?

-Kotresh HR

Comment 5 Kotresh HR 2017-11-21 05:21:40 UTC
Hi Mohit,

Even with the patch https://review.gluster.org/#/c/18436/, the issue is still reproducible. Please take a look at it. I have made this a blocker for the 3.13.0 release.

Thanks,
Kotresh HR

Comment 8 Worker Ant 2020-03-12 12:37:31 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/914 and will be tracked there from now on. Visit the GitHub issue for further details.

