Bug 867345

Summary: geo-rep failed to sync a large file (order of GB) through an ssh session.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Venky Shankar <vshankar>
Status: CLOSED WORKSFORME
QA Contact: Vijaykumar Koppad <vkoppad>
Severity: urgent
Docs Contact:
Priority: high
Version: 2.0
CC: aavati, bbandari, csaba, rhs-bugs, shaines, vbellur
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-01-10 07:06:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vijaykumar Koppad 2012-10-17 11:01:05 UTC
Description of problem: If the file is on the order of 1GB, it fails to sync to the slave of the geo-rep session that runs through ssh. During this time there were 3 other geo-rep sessions on the same volume, covering all the types of geo-rep possible: through ssh to a volume, through ssh to a plain path, through gluster to a volume, and to a local path.

These are the DEBUG logs:

[2012-10-17 08:31:09.21575] D [repce:190:__call__] RepceClient: call 5734:139746862941952:1350442869.02 keep_alive -> 66
[2012-10-17 08:31:09.977252] W [master:786:regjob] _GMaster: failed to sync ./file_1G
[2012-10-17 08:31:16.274018] W [master:786:regjob] _GMaster: failed to sync ./file_1G
[2012-10-17 08:31:16.274231] D [master:660:crawl] _GMaster: ... crawl #979 done, took 14.937212 seconds
[2012-10-17 08:31:17.277674] D [master:615:volinfo_state_machine] <top>: (None, 426253f7) << (None, 426253f7) -> (None, 426253f7)
[2012-10-17 08:31:17.277858] D [master:696:crawl] _GMaster: entering .
[2012-10-17 08:31:17.278656] D [repce:175:push] RepceClient: call 5734:139747139553024:1350442877.28 xtime('.', '426253f7-b423-4a69-91c7-53e736e17d00') ...
[2012-10-17 08:31:17.280658] D [repce:190:__call__] RepceClient: call 5734:139747139553024:1350442877.28 xtime -> (1350442787, 818281)
[2012-10-17 08:31:17.283113] D [repce:175:push] RepceClient: call 5734:139747139553024:1350442877.28 entries('.',) ...
[2012-10-17 08:31:17.286344] D [repce:190:__call__] RepceClient: call 5734:139747139553024:1350442877.28 entries -> ['.file_1G.b65DJ1']
[2012-10-17 08:31:17.286490] D [repce:175:push] RepceClient: call 5734:139747139553024:1350442877.29 purge('.', set(['.file_1G.b65DJ1'])) ...
[2012-10-17 08:31:17.288861] D [repce:190:__call__] RepceClient: call 5734:139747139553024:1350442877.29 purge -> None
[2012-10-17 08:31:17.290184] D [master:778:crawl] _GMaster: syncing ./file_1G ...
[2012-10-17 08:31:17.355895] D [resource:526:rsync] SSH: files: ./file_1G
[2012-10-17 08:31:25.858294] W [master:786:regjob] _GMaster: failed to sync ./file_1G
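
The "SSH: files: ./file_1G" line above is gsyncd handing the file off to rsync over the ssh tunnel. As a rough way to check whether the transport itself can move the file, here is a minimal Python sketch; the slave host name, the paths, and the rsync option set are placeholders, not the exact invocation gsyncd builds:

#!/usr/bin/env python
# Rough transport check (placeholders throughout): copy the same 1GB file
# over ssh with rsync by hand and time it, to see whether rsync/ssh itself
# is slow or failing independently of gsyncd.

import subprocess
import time

SRC = "/mnt/master/file_1G"              # placeholder: file on the master mount
DEST = "root@slave.example.com:/tmp/"    # placeholder: slave host and directory

start = time.time()
rc = subprocess.call([
    "rsync",
    "-a",        # archive mode (stand-in for gsyncd's real option set)
    "--sparse",  # handle the sparse 1GB file efficiently
    "-e", "ssh", # go over ssh, like the ssh:// geo-rep session
    SRC, DEST,
])
print("rsync exited with %d after %.0f seconds" % (rc, time.time() - start))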


Version-Release number of selected component (if applicable): RHS-2.0.z u3


How reproducible: Consistently


Steps to Reproduce:
1. Start a geo-rep session between the master (dist-replicate) and the slave (dist-rep) through ssh.
2. Create all the other types of geo-rep sessions mentioned above to the same volume.
3. Create a 1GB sparse file on the master (see the sketch after this list).
4. Check for the file on the slave mount point.
5. The data fails to sync, even though all the other slaves got the file synced.
6. If you check the geo-rep log file, you should see logs similar to the ones above.
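
A minimal sketch of steps 3 and 4, assuming the master volume is mounted at /mnt/master and the ssh slave's target is reachable at /mnt/slave_ssh (both paths are placeholders for the actual setup):

#!/usr/bin/env python
# Sketch of steps 3 and 4: create a sparse 1GB file on the master mount and
# poll the slave mount until the file appears with its full size.
# MASTER_MNT and SLAVE_MNT are placeholders for the actual mount points.

import os
import time

MASTER_MNT = "/mnt/master"     # placeholder master mount point
SLAVE_MNT = "/mnt/slave_ssh"   # placeholder slave mount point
SIZE = 1024 * 1024 * 1024      # 1GB

# Create the sparse file: seek to the last byte and write a single zero.
with open(os.path.join(MASTER_MNT, "file_1G"), "wb") as f:
    f.seek(SIZE - 1)
    f.write(b"\0")

# Wait for it to show up on the slave and report how long that took.
start = time.time()
slave_file = os.path.join(SLAVE_MNT, "file_1G")
while not (os.path.exists(slave_file) and os.path.getsize(slave_file) >= SIZE):
    time.sleep(10)
print("file_1G synced after %.0f seconds" % (time.time() - start))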

Actual results: The large file failed to sync.


Expected results: The file should sync.


Additional info:

Comment 1 Vijaykumar Koppad 2012-10-17 11:10:24 UTC
To record the state when this happened, here is the detailed setup.

There is a master machine and a slave machine:
one volume on the master machine, called master (dist-rep)
two volumes on the slave machine, called slave (dist-rep) and slave_gfs (dist-stripe)

MASTER    SLAVE                            STATUS
--------------------------------------------------
master    file:///root/slave_local         OK
master    ssh://<slave>:/mnt/slave_ssh     OK
master    gluster://<slave>:slave_gfs      OK
master    ssh://<slave>::slave             OK

Comment 3 Vijaykumar Koppad 2012-10-18 06:24:41 UTC
Apparently, I found out it got synced after 50 minutes, which is again very bad. Initially, whenever you create a file, it should at least get an entry on the slave, or there should be an update to the rsync temp file, which was not the case.

Comment 4 Csaba Henk 2012-10-18 09:08:29 UTC
"failed to sync" is not necessarily bad, but at least a warning sign. Should not be seen if the setup is static.

However, what do you mean by "there should be update on the rysnc temp file"? It kept being the same size over a period of time?

Comment 5 Vijaykumar Koppad 2012-10-18 09:44:55 UTC
The rsync temp file stayed at 0 bytes for a long time.
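
One way to quantify this, as a minimal sketch: poll the slave-side rsync temp file (named like the .file_1G.b65DJ1 seen in the log above) and print its size once a minute; the slave directory path is a placeholder:

#!/usr/bin/env python
# Sketch: log the size of the slave-side rsync temp file (e.g. .file_1G.b65DJ1
# from the log above) once a minute, so "no update on the rsync temp file"
# can be shown with timestamps. SLAVE_DIR is a placeholder; stop with Ctrl-C.

import glob
import os
import time

SLAVE_DIR = "/mnt/slave_ssh"   # placeholder slave-side directory

while True:
    for tmp in glob.glob(os.path.join(SLAVE_DIR, ".file_1G.*")):
        print("%s %s %d bytes" % (time.strftime("%H:%M:%S"), tmp,
                                  os.path.getsize(tmp)))
    time.sleep(60)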

Comment 6 Venky Shankar 2013-01-10 07:06:57 UTC
Could be a setup problem (NTP, sync delays).

Reopen if needed.