867345 – geo-rep failed to sync large file of order GB through ssh session.

Summary: geo-rep failed to sync large file of order GB through ssh session.

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	geo-replication
Sub Component:
Version:	2.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Venky Shankar
QA Contact:	Vijaykumar Koppad
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-10-17 11:01 UTC by Vijaykumar Koppad
Modified:	2014-08-25 00:49 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-01-10 07:06:57 UTC
Embargoed:
Dependent Products:

Description Vijaykumar Koppad 2012-10-17 11:01:05 UTC

Description of problem: A file of the order of 1 GB fails to sync to the slave of a geo-rep session that runs over ssh. At the time, 3 other geo-rep sessions were active on the same volume, so that every possible type of geo-rep session was in use: over ssh to a volume, over ssh to a file, over gluster to a volume, and to a local file.

These are the DEBUG logs:

[2012-10-17 08:31:09.21575] D [repce:190:__call__] RepceClient: call 5734:139746862941952:1350442869.02 keep_alive -> 66
[2012-10-17 08:31:09.977252] W [master:786:regjob] _GMaster: failed to sync ./file_1G
[2012-10-17 08:31:16.274018] W [master:786:regjob] _GMaster: failed to sync ./file_1G
[2012-10-17 08:31:16.274231] D [master:660:crawl] _GMaster: ... crawl #979 done, took 14.937212 seconds
[2012-10-17 08:31:17.277674] D [master:615:volinfo_state_machine] <top>: (None, 426253f7) << (None, 426253f7) -> (None, 426253f7)
[2012-10-17 08:31:17.277858] D [master:696:crawl] _GMaster: entering .
[2012-10-17 08:31:17.278656] D [repce:175:push] RepceClient: call 5734:139747139553024:1350442877.28 xtime('.', '426253f7-b423-4a69-91c7-53e736e17d00') ...
[2012-10-17 08:31:17.280658] D [repce:190:__call__] RepceClient: call 5734:139747139553024:1350442877.28 xtime -> (1350442787, 818281)
[2012-10-17 08:31:17.283113] D [repce:175:push] RepceClient: call 5734:139747139553024:1350442877.28 entries('.',) ...
[2012-10-17 08:31:17.286344] D [repce:190:__call__] RepceClient: call 5734:139747139553024:1350442877.28 entries -> ['.file_1G.b65DJ1']
[2012-10-17 08:31:17.286490] D [repce:175:push] RepceClient: call 5734:139747139553024:1350442877.29 purge('.', set(['.file_1G.b65DJ1'])) ...
[2012-10-17 08:31:17.288861] D [repce:190:__call__] RepceClient: call 5734:139747139553024:1350442877.29 purge -> None
[2012-10-17 08:31:17.290184] D [master:778:crawl] _GMaster: syncing ./file_1G ...
[2012-10-17 08:31:17.355895] D [resource:526:rsync] SSH: files: ./file_1G
[2012-10-17 08:31:25.858294] W [master:786:regjob] _GMaster: failed to sync ./file_1G
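To gauge how often the warning recurs, the log can be scanned for "failed to sync" lines. The following is a minimal sketch, assuming the warning always has the shape seen in the excerpt above; the function name count_sync_failures is hypothetical and not part of gsyncd.

```python
import re
from collections import Counter

# Matches warnings of the form seen in the excerpt above, e.g.
# [2012-10-17 08:31:09.977252] W [master:786:regjob] _GMaster: failed to sync ./file_1G
# Assumption: other gsyncd versions may word or number this line differently.
FAILED_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] W \[master:\d+:regjob\] _GMaster: "
    r"failed to sync (?P<path>.+)$"
)


def count_sync_failures(lines):
    """Return a Counter mapping file path -> number of 'failed to sync' warnings."""
    failures = Counter()
    for line in lines:
        match = FAILED_RE.match(line.strip())
        if match:
            failures[match.group("path")] += 1
    return failures
```

Passing an open log file to count_sync_failures gives one count per path, which makes it easier to see whether a single file keeps failing across crawls.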


Version-Release number of selected component (if applicable): RHS-2.0.z u3


How reproducible: Consistently


Steps to Reproduce:
1. Start a geo-rep session over ssh between the master (dist-replicate) and the slave (dist-rep).
2. Create all the other types of geo-rep session mentioned above on the same volume.
3. Create a 1GB sparse file.
4. Check for the file on the slave mount point.
5. The data fails to sync, even though all the other slaves got the file synced.
6. If you check the geo-rep log file, you should see entries similar to the ones above.
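Step 3 can be sketched as follows. The report does not say how the sparse file was created, so this is only one possible way, and the helper name create_sparse_file is hypothetical. Whether the file actually ends up sparse depends on the underlying filesystem.

```python
import os


def create_sparse_file(path, size_bytes=1 << 30):
    """Create a file with an apparent size of size_bytes without writing data.

    Returns (apparent_size, allocated_bytes). On filesystems that support
    sparse files, allocated_bytes should be far smaller than apparent_size.
    """
    with open(path, "wb") as f:
        # Extending the file with truncate() leaves a hole instead of
        # writing zero-filled blocks (filesystem permitting).
        f.truncate(size_bytes)
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512
```

Calling create_sparse_file with a path inside the master volume's mount point (the path itself is an assumption, as the report does not name it) would produce a file comparable to file_1G in the logs.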

Actual results: Large file failed to sync 


Expected results: File should sync. 


Additional info:

Comment 1 Vijaykumar Koppad 2012-10-17 11:10:24 UTC

Just to record the state at the time it happened, this is the detailed setup.

There is one master machine and one slave machine.
The master machine has one volume called master (dist-rep).
The slave machine has two volumes called slave (dist-rep) and slave_gfs (dist-stripe).

MASTER               SLAVE                                      STATUS    
--------------------------------------------------------------------------------
master               file:///root/slave_local                       OK        
master               ssh://<slave>:/mnt/slave_ssh                   OK        
master               gluster://<slave>:slave_gfs                    OK        
master               ssh://<slave>::slave                           OK

Comment 3 Vijaykumar Koppad 2012-10-18 06:24:41 UTC

Apparently it did get synced, after 50 min, which is still very bad. Whenever a file is created, it should at least get an entry on the slave right away, or the rsync temp file should show some progress, which was not the case.

Comment 4 Csaba Henk 2012-10-18 09:08:29 UTC

A "failed to sync" message is not necessarily bad, but it is at least a warning sign. It should not be seen if the setup is static.

However, what do you mean about the rsync temp file not being updated? Did it stay the same size over a period of time?

Comment 5 Vijaykumar Koppad 2012-10-18 09:44:55 UTC

The rsync temp file stayed at size 0 for a long time.
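One way to check this observation is to list the rsync temp files for a given name along with their sizes. This is a rough sketch, assuming the temp file follows the pattern seen in the log (.file_1G.b65DJ1, that is, a leading dot, the file name, and a six-character suffix); the helper name rsync_temp_sizes is hypothetical.

```python
import fnmatch
import os


def rsync_temp_sizes(directory, filename):
    """Return {temp_file_name: size_in_bytes} for rsync temp files of filename.

    Assumes temp files are named ".<filename>.XXXXXX", as in the log entry
    '.file_1G.b65DJ1'.
    """
    pattern = "." + filename + ".??????"
    sizes = {}
    for entry in os.listdir(directory):
        if fnmatch.fnmatch(entry, pattern):
            sizes[entry] = os.path.getsize(os.path.join(directory, entry))
    return sizes
```

Running it against the slave directory every few seconds and comparing the results would show whether the temp file grows or stays at 0.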

Comment 6 Venky Shankar 2013-01-10 07:06:57 UTC

Could be a setup problem (NTP, sync delays).

Reopen if needed.
