Bug 1111171
| Summary: | Dist-geo-rep: geo-rep xsync crawl takes too much time to sync metadata changes. | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Vijaykumar Koppad <vkoppad> |
| Component: | geo-replication | Assignee: | Venky Shankar <vshankar> |
| Status: | CLOSED ERRATA | QA Contact: | Bhaskar Bandari <bbandari> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.0 | CC: | aavati, avishwan, bbandari, bturner, csaba, david.macdonald, nlevinki, nsathyan, sharne, smanjara, ssamanta, vshankar |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.0.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.6.0.25-1 | Doc Type: | Known Issue |
| Doc Text: | Geo-replication takes more time to synchronize the initial data while using Hybrid crawl. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1111490 (view as bug list) | Environment: | |
| Last Closed: | 2014-09-22 19:42:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1103155, 1111490 | | |
Description
Vijaykumar Koppad
2014-06-19 11:55:17 UTC
Upstream patch sent for review: http://review.gluster.org/#/c/8124/

Vijaykumar, do you have a setup where I can profile the system? Essentially, I would like to strace and check which calls are taking up most of the time. As of now, patch http://review.gluster.org/#/c/8124/ cuts down the extra stat() [on the master] and the redundant ch[mod,own]() on the slave, but I would still like to check whether any other calls are taking time.

Please review and sign off on the edited doc text.

Hit this issue while testing xsync crawl for symlinks. It took too long to sync the changes. Tested on glusterfs-3.6.0.22.

(In reply to shilpa from comment #7)
> Hit this issue while testing xsync crawl for symlinks. It took too long to
> sync the changes. Tested on glusterfs-3.6.0.22

Hey Shilpa, do you have the setup I can look into? I am eager to thrash this out.

Venky, I have provided the hardware details in the email.

(In reply to shilpa from comment #10)
> Venky,
>
> I have provided the hardware details in the email.

Thanks. I am looking into it.

Update: Followed the steps mentioned in Comment #1 (FUSE client) and did not experience the huge delay for metadata synchronization. I captured syscall invocations/timings on the bricks and they do not show delays on setattr() calls. Also, gsyncd on the master nodes is able to sync metadata without the delays mentioned in Comment #1. The slowness does not appear to be due to syncing metadata (i.e. chown() or chmod()) but due to gsyncd getting EINVAL during creation of a file which already exists.

Xsync-based crawl scans the filesystem based on xtime and generates a changelog consisting of the list of changes to be replicated. During the crawl, if a directory or file satisfies xtime(master) > xtime(slave), an entry is made in the changelog. Since the changes are identified solely by crawling the master volume, entries for files/directories are made in the changelog irrespective of their existence on the slave (otherwise we would need to query the slave, which is an expensive operation).

Now, when the actual replication is done, the changelog is replayed -- files/directories which already exist on the slave should return EEXIST (entry already exists) on creation, which is handled gracefully by geo-replication. But what we get on such an operation is EINVAL (Invalid argument), as shown in the logs below:

[2014-07-04 13:20:41.930759] W [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-slave-client-9: remote operation failed: File exists. Path: <gfid:7bc16d1b-e63f-4adf-bde9-ed9f4ebf220a>/level99
[2014-07-04 13:20:41.930934] W [client-rpc-fops.c:306:client3_3_mkdir_cbk] 0-slave-client-8: remote operation failed: File exists. Path: <gfid:7bc16d1b-e63f-4adf-bde9-ed9f4ebf220a>/level99
[2014-07-04 13:20:41.930986] W [fuse-bridge.c:1237:fuse_err_cbk] 0-glusterfs-fuse: 89160: SETXATTR() /.gfid/7bc16d1b-e63f-4adf-bde9-ed9f4ebf220a => -1 (Invalid argument)

[NOTE: geo-replication uses a special setxattr()-based interface to create entries on the slave]

EINVAL is given separate handling in geo-replication: the operation is retried a number of times. Historically this was done because entries were getting missed during creation due to a race that was unknown at the time (it is known now), in the hope that retrying would eventually fix it. After hitting the retry limit, geo-replication moves on to the next operation (after logging the failed op, as shown below), which runs into the same problem again.

[2014-07-04 18:44:47.376432] W [syncdutils(slave):480:errno_wrap] <top>: reached maximum retries (['.gfid/7bc16d1b-e63f-4adf-bde9-ed9f4ebf220a', 'glusterfs.gfid.newfile', '\x00\x00%\x8a\x00\x00%\x8a20b99f65-149e-4ff7-a043-1afa881357da\x00\x00\x00\x81\xa453b68684%%BKPGIUFLRB\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x00\x00'])...
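To make the behaviour above concrete, here is a minimal, hypothetical Python sketch of the two pieces described in this comment: the xtime-based crawl decision and the retry-on-EINVAL handling during changelog replay. The names (needs_sync, retry_on_einval, MAX_RETRIES) and the back-off are illustrative assumptions, not the actual gsyncd/syncdutils code.

```python
import errno
import time

MAX_RETRIES = 10  # assumed retry cap; the real limit may differ


def needs_sync(master_xtime, slave_xtime):
    """Xsync-style crawl decision: pick up an entry for replication when
    the master's xtime is newer than the slave's. Treating xtimes as
    (sec, nsec) tuples lets plain tuple comparison give the ordering."""
    return master_xtime > slave_xtime


def retry_on_einval(op, *args):
    """Replay one changelog operation against the slave.

    EEXIST is benign (the entry is already on the slave) and is handled
    gracefully, but EINVAL is retried up to MAX_RETRIES before the
    operation is skipped. Because the buggy slave path reported EINVAL
    for every pre-existing entry, each such entry cost a full round of
    retries -- which is where the xsync crawl lost its time."""
    for _ in range(MAX_RETRIES):
        try:
            return op(*args)
        except OSError as e:
            if e.errno == errno.EEXIST:
                return None      # already replicated: nothing to do
            if e.errno != errno.EINVAL:
                raise            # unexpected error: surface it
            time.sleep(1)        # assumed back-off between retries
    # mirrors the "reached maximum retries" warning in the log above
    print("reached maximum retries (%r)" % (args,))
```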
SIDE NOTE: setxattr() does not have EINVAL as a valid errno (and EEXIST has another meaning there).

It looks like EINVAL was generated in cases where we get an EEXIST during entry creation and then perform a setattr() [to set uid/gid] on an inode on which a lookup has not happened [which is when DHT errors out with "layout missing" logs]. The gfid-access translator uses a fresh inode for the entry to be created if it is not found in the inode table. On failure (errno EEXIST, etc.), DHT does not populate the inode context with the layout information. gfid-access then blindly uses this inode for the setattr() after entry creation, so the EINVAL generated by DHT trickles up to the client application.

Fix: handle the error cases in gfid-access entry creation (callback path) and return the correct errno (EEXIST, etc.).

Sent patch to downstream: https://code.engineering.redhat.com/gerrit/#/c/29095/
Upstream patch: http://review.gluster.org/#/c/8124/

One more patch from Venky: https://code.engineering.redhat.com/gerrit/#/c/29133/
Upstream patch: http://review.gluster.org/#/c/8260/

Verified on the build glusterfs-3.6.0.25-1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html
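For illustration, here is a minimal, hypothetical sketch of the replay-side behaviour that the fix described above enables. The real change is in the gfid-access/DHT translators (C code) on the slave; the function names below are placeholders, not the actual gsyncd API. Once the slave reports the correct errno, a pre-existing entry surfaces as EEXIST and can be skipped immediately instead of burning retries on EINVAL.

```python
import errno


def replay_create(create_entry, entry):
    """create_entry stands in for the setxattr()-based entry creation that
    geo-replication issues against the slave mount; it is a placeholder,
    not a real gsyncd function."""
    try:
        create_entry(entry)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return      # entry already exists on the slave: benign, move on
        raise           # any other error is a genuine failure
```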