Bug 1020352 - Dist-geo-rep : geo-rep failed to sync few files to level 2 slave in cascaded-fanout setup.
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 2.1.2
Assigned To: Venky Shankar
QA Contact: Vijaykumar Koppad
Keywords: Regression, ZStream
Depends On:
Blocks:
Reported: 2013-10-17 09:45 EDT by Vijaykumar Koppad
Modified: 2015-05-13 12:32 EDT (History)
CC: 6 users

See Also:
Fixed In Version: glusterfs-3.4.0.55rhs
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-25 02:54:46 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Vijaykumar Koppad 2013-10-17 09:45:43 EDT
Description of problem: geo-rep failed to sync a few files to a level 2 slave in a cascaded-fanout setup. The files that were missing had entries in the processed changelogs in the geo-rep working directory. These files failed to be synced by rsync with [errcode: 23]:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-17 16:59:53.159749] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/d033a27b-820a-4720-9686-89d5f79e1d0e [errcode: 23]
[2013-10-17 16:59:53.161875] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/6b75e1fd-962a-4214-95c3-b88ca733c706 [errcode: 23]
[2013-10-17 16:59:53.164371] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/d78e87ab-7624-430e-88ba-262eff1ff18f [errcode: 23]
[2013-10-17 16:59:53.166728] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/69942e83-9874-4698-a113-a4317253410b [errcode: 23]
[2013-10-17 16:59:53.168988] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/cda11caa-a083-4a6a-bab1-34c05e9aca9c [errcode: 23]
[2013-10-17 16:59:53.171253] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/d4389302-cfe0-40fc-a178-27af3031081c [errcode: 23]
[2013-10-17 16:59:53.173624] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/6c5a603c-7f12-419c-8697-568a88258f90 [errcode: 23]
[2013-10-17 16:59:53.175545] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/d44790d6-6e6e-4381-b2e5-3f1407546350 [errcode: 23]
[2013-10-17 16:59:53.177774] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/762c9e92-b585-46ea-9f12-734d4ee01828 [errcode: 23]
[2013-10-17 16:59:53.180067] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/cbfcc035-fd28-42c0-95a4-e019bc4f7d80 [errcode: 23]

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
These files were retried 10 times, after which geo-rep gave up and skipped them without syncing.
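
For reference, rsync's exit status 23 is its generic "partial transfer due to error" code, returned whenever some of the requested files could not be transferred. A minimal, gluster-independent illustration (all paths below are made up for the example):

  # One of the two listed files does not exist on the source, so rsync copies
  # what it can and exits with code 23 ("partial transfer due to error").
  mkdir -p /tmp/src /tmp/dst
  touch /tmp/src/present.txt
  printf 'present.txt\nmissing.txt\n' > /tmp/filelist
  rsync -a --files-from=/tmp/filelist /tmp/src/ /tmp/dst/
  echo $?   # prints 23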


Version-Release number of selected component (if applicable): 3.4.0.35rhs-1.el6rhs.x86_64


How reproducible: Didn't try to reproduce 


Steps to Reproduce:
1. Create and start a geo-rep cascaded-fanout setup (1-4-4: 21 volumes in total, with 1 master, 4 intermediate masters (imasters), and 16 slaves); a rough command sketch follows the list.
2. Stop all the geo-rep sessions.
3. Create data on the master. After the data creation completes, start geo-rep between the master and the imasters and let them sync.
4. After that sync completes, start the geo-rep sessions between the imasters and the slaves and let the files sync to the slaves.
5. Check that all the files synced to all the slaves.
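
A rough sketch of the commands involved for one master -> imaster -> slave chain (volume and host names are placeholders, and the exact create syntax may differ between builds; repeat per imaster/slave pair):

  # Level 1: session from the master volume to one intermediate master (imaster).
  gluster volume geo-replication mastervol imaster1::imastervol1 create push-pem
  gluster volume geo-replication mastervol imaster1::imastervol1 start

  # Level 2: session from that imaster volume to one of its slaves (run on the imaster node).
  gluster volume geo-replication imastervol1 slave1::slavevol1 create push-pem
  gluster volume geo-replication imastervol1 slave1::slavevol1 start

  # Stop / status for any session, as used in steps 2-5 above.
  gluster volume geo-replication mastervol imaster1::imastervol1 stop
  gluster volume geo-replication mastervol imaster1::imastervol1 status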

Actual results: One of the slaves at level 2 was missing 7 files.


Expected results: All the files should be synced to all the slaves.


Additional info:
Comment 2 Vijaykumar Koppad 2013-10-18 03:00:40 EDT
I had filed a few bugs which might be related to the same issue, but the scenarios are different. One of the bugs I filed for bigbend, Bug 1003580 (ON_QA for glusterfs-3.4.0.34rhs), seems to be the same problem, but I hit this bug on glusterfs-3.4.0.35rhs. Since the scenario was different, I filed a separate bug.

Based on my investigation, the cause of files getting missed like this is rsync failures with [errcode: 23].

This is how it works, as far as I know:

1. In geo-replication, when we get the list of files to be synced to the slave, we create the entries on the slave with the same GFIDs through RPC. During this creation we ignore a few errors, for various reasons.

2. If we ignore errors that are actually genuine, we end up without that entry on the slave side. As far as I have seen, there would be no log entries or warnings related to these errors in any of the logs.

3. After the entry creation, we hand the list to the rsync process to create the data on the slave. Rsync copies the data on the assumption that the entry has already been created; if the entry is not there, it simply errors out, and we get errors like:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-17 16:59:53.164371] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/d78e87ab-7624-430e-88ba-262eff1ff18f [errcode: 23]
[2013-10-17 16:59:53.166728] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/69942e83-9874-4698-a113-a4317253410b [errcode: 23]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

These errors could be for the same reason, or for a different one; I have no solid evidence to prove it either way.

4. Geo-rep has a mechanism for retrying failed sync attempts: if some files in a changelog fail to sync, it retries that changelog. Earlier these retries would continue until the files synced, potentially forever; later this was changed to 10 tries, after which the changelog is skipped with a log like:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-10-18 11:26:10.283335] W [master(/bricks/brick1):750:process] _GMaster: changelog CHANGELOG.1382075354 could not be processed - moving on..
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Files that are skipped this way will never get synced, unless a later data operation on those files happens to sync them to the slave without any problem. (A rough sketch of the rsync data leg and the retry/skip behaviour follows below.)
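
To make points 3 and 4 concrete, here is a rough conceptual sketch, not gsyncd's actual code. It assumes a build where the master and slave volumes can be mounted with the aux-gfid-mount option (so files are reachable as .gfid/<GFID>); all host and volume names and mount points are placeholders, and the GFID is taken from the logs above. The real worker drives rsync against the slave's own aux GFID mount over ssh, but the failure mode is the same:

  # Mount master and slave volumes with GFID access (assumption: aux-gfid-mount is available).
  mount -t glusterfs -o aux-gfid-mount masterhost:/mastervol /mnt/master
  mount -t glusterfs -o aux-gfid-mount slavehost:/slavevol /mnt/slave

  GFID=d78e87ab-7624-430e-88ba-262eff1ff18f

  # Data leg: copy the contents onto the entry that should already exist on the slave.
  # If the entry was never created there (a genuine error got ignored earlier),
  # the destination path is missing, the copy fails, and rsync exits with errcode 23.
  # Retry leg: up to 10 attempts, then give up and skip, as described in point 4.
  for try in $(seq 1 10); do
      rsync -a --inplace "/mnt/master/.gfid/$GFID" "/mnt/slave/.gfid/$GFID" && break
      if [ "$try" -eq 10 ]; then
          echo "changelog could not be processed - moving on.."
      fi
  done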


Consider this issue hitting a set of 1000 files in a cascaded-fanout setup as explained in the steps to reproduce, and suppose that 7 of those 1000 files don't sync to one of the 16 level-2 slaves. This might look like a very rare case, but the implication is disastrous: after this is hit, if I delete all the files on the master they are removed from all the slaves properly, but if I then create the same set of 1000 files, those files won't get synced to many slaves, with the same rsync errors.

I had tried a similar setup on 3.4.0.33rhs-1.el6rhs.x86_64 and didn't see this issue with regular files. I saw the issue only with symlinks, and that too only during the first xsync crawl; the bug for that is Bug 1003580. That is why I am marking this as a regression.
Comment 4 Vivek Agarwal 2013-12-26 08:44:15 EST
Setting corbett and other flags as this is being tracked for corbett.
Comment 5 Venky Shankar 2013-12-30 04:18:03 EST
(In reply to Vijaykumar Koppad from comment #2)

Vijaykumar,

Geo-replication syncing (and error handling) has gone through a tremendous amount of change in the last month. I'll try to answer your concerns inline.

[snip]

> This is how it works as far as I know,
> 
> 1. In geo-replication, when we get list of files to be synced to slave, we
> create entries on the slave with the same GFID through RPC. During this
> creation, we ignore few errors, for various reasons.
> 
> 2. If we ignore the errors which are actually genuine errors, we end up not
> having that entry on the slave side. As far as I have seen, there would be
> no logs related to these errors or warning in any logs. 

We DO NOT ignore genuine errors; that would be disastrous. And yes, as I've mentioned in other bugs, the error logs would have entries for the missing GFIDs along with other error messages. For reference, please see https://bugzilla.redhat.com/show_bug.cgi?id=1000462#c16. These log entries are very helpful in debugging. In a setup as large as this, it's good to capture these additional log entries as soon as possible and attach them to the BZ.
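
For example, something along these lines (these are the usual default log locations; adjust for the setup):

  # Grep the master-side and slave-side geo-rep logs for one of the missing GFIDs.
  GFID=d78e87ab-7624-430e-88ba-262eff1ff18f
  grep -r "$GFID" /var/log/glusterfs/geo-replication/ 2>/dev/null          # on the master nodes
  grep -r "$GFID" /var/log/glusterfs/geo-replication-slaves/ 2>/dev/null   # on the slave nodes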

> 
> 2. After the creation of the entry, we give the list to rsync process to
> create data on the slave. Rsync creates data in the assumption that the
> entry has already been created. If the entry is not there, it just errors
> out, and we get errors like, 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> [2013-10-17 16:59:53.164371] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/d78e87ab-7624-430e-88ba-262eff1ff18f [errcode: 23]
> [2013-10-17 16:59:53.166728] W [master(/bricks/brick3):621:regjob] <top>: Rsync: .gfid/69942e83-9874-4698-a113-a4317253410b [errcode: 23]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> 
> These errors could be because of the same reason or could be different. I
> have no proper evidence to prove that.
> 
> 3. In geo-rep, there is a mechanism of retrying the failed efforts to sync.
> If some files in a changelog are failed to sync, it will retry that
> changelog again. Earlier these retries would go on until they are synced or
> would go on forever. Later it was changed to 10 tries, and after that, that
> changelog will be skipped with logs like 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> [2013-10-18 11:26:10.283335] W [master(/bricks/brick1):750:process]
> _GMaster: changelog CHANGELOG.1382075354 could not be processed - moving on..
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>  We can say those files which are skipped will never get synced, unless we
> will have data operation on those files and those files are synced to slave
> without any problem.

This is the part which has changed now. You should see the number of skipped entries under "SKIPPED COUNT" in status detail (I agree, status detail is a bit screwed up, but this is the only accurate thing in it as of now); for example, see below.
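
For example (volume and host names are placeholders; the exact column names vary a bit between builds):

  # Run on a master node; the skipped-files counter appears in the detailed status output.
  gluster volume geo-replication mastervol slavehost::slavevol status detail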

> 
> 
>    Consider this issue hits for a set of 1000 files in cascaded-fanout setup
> as explained in the steps to reproduce. Out of those 1000 files, consider 7
> files didn't sync to one of the 16 level-2 slaves. This might look like a
> very rare case, but the implication of this is disastrous. After this is
> hit, if I delete all the files on the master, they will be removed from all
> the slaves properly, but if i create same set of 1000 files, all those files
> won't get synced to many slaves with same errors of RSYNC 
> 
> I had tried the similar setup in 3.4.0.33rhs-1.el6rhs.x86_64, I didn't see
> this issue with regular files. I had seen issue only with symlinks, and that
> too with first xsync crawl only , and the bug for that is Bug 1003580. That
> is why I am moving it to regression.

It would be good to try reproducing this once more.
Comment 6 Vijaykumar Koppad 2014-01-09 01:24:45 EST
I tried it in the build glusterfs-3.4.0.55rhs-1. I was unable to reproduce the issue.
Comment 7 Vijaykumar Koppad 2014-01-09 07:52:28 EST
Considering I was unable to reproduce this issue in the build glusterfs-3.4.0.55rhs-1, I am moving it to Verified. If it happens again, I will reopen it.
Comment 9 errata-xmlrpc 2014-02-25 02:54:46 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html
