Bug 989192

Summary: Dist-geo-rep: geo-rep failover-failback is broken: special-sync-mode blind results in faulty state.

Product: [Red Hat Storage] Red Hat Gluster Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: TestBlocker
Reporter: Vijaykumar Koppad <vkoppad>
Assignee: Venky Shankar <vshankar>
QA Contact: M S Vishwanath Bhat <vbhat>
CC: aavati, amarts, bbandari, csaba, mzywusko, rhs-bugs, sdharane, shaines, vbhat
Fixed In Version: glusterfs-3.4.0.18rhs-1
Doc Type: Bug Fix
Type: Bug
Cloned As: 994462
Bug Blocks: 957769, 994462
Last Closed: 2013-09-23 22:29:51 UTC

Description Vijaykumar Koppad 2013-07-28 07:59:01 UTC
Description of problem: In the geo-rep failover-failback process, when geo-rep is started in the reverse direction with special-sync-mode set to blind, the status becomes faulty with the following traceback in the geo-rep logs:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-07-28 13:04:19.435298] I [master(/bricks/brick3):60:gmaster_builder] <top>: setting up changelog change detection mode
[2013-07-28 13:04:19.437689] I [master(/bricks/brick3):980:register] _GMaster: xsync temp directory: /var/run/gluster/slave/ssh%3A%2F%2Froot%4010.70.43.86%3Agluster%3A%2F%2F127.0.0.1%3Amaster/cfdffea3581f40685f18a34384edc263/xsync
[2013-07-28 13:04:19.494443] I [master(/bricks/brick3):496:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-07-28 13:04:19.495254] E [syncdutils(/bricks/brick3):206:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 232, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 447, in keep_alive
    vi, gap = cls.keepalive_payload_hook(timo, timo * 0.5)
TypeError: keepalive_payload_hook() takes exactly 3 arguments (2 given)
[2013-07-28 13:04:19.500574] I [syncdutils(/bricks/brick3):158:finalize] <top>: exiting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Version-Release number of selected component (if applicable): 3.4.0.12rhs.beta6-1.el6rhs.x86_64


How reproducible: Didn't try it again.


Steps to Reproduce:
1. Create and start a geo-rep session between a master and a slave volume.
2. Create some data on the master.
3. Before it is completely synced to the slave, bring down one sub-volume of the master.
4. Start creating files on the slave, then rsync from the slave to the master with "rsync -Pvza --numeric-ids --ignore-existing".
5. Stop the geo-rep session between master and slave.
6. Create a geo-rep session from the slave to the master.
7. Change the config option special-sync-mode to blind and start the geo-rep session from the slave to the master (a command sketch follows this list).
8. Check the status of the geo-rep session.
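
A minimal command sketch of steps 5-8, assuming a master volume named "master" on host "master_node" and a slave volume named "slave" on host "slave_node" (all placeholder names; exact syntax and the hyphen/underscore spelling of the config option may differ on a given build):

   # Step 5: stop the original master-to-slave session (run on a master node)
   gluster volume geo-replication master slave_node::slave stop

   # Step 6: create the reverse session (run on a slave node)
   gluster volume geo-replication slave master_node::master create push-pem force

   # Step 7: switch the reverse session to blind sync mode and start it
   gluster volume geo-replication slave master_node::master config special_sync_mode blind
   gluster volume geo-replication slave master_node::master start

   # Step 8: check the session status
   gluster volume geo-replication slave master_node::master status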

Actual results: After starting geo-rep in blind mode, the status becomes faulty.


Expected results: During the whole failover-failback process, the geo-rep status should not go to faulty.


Additional info:

Comment 3 Amar Tumballi 2013-07-29 17:45:39 UTC
This needs to be changed with the newer geo-replication. I am not sure if this is still valid. Venky, it would be good to look into this.

Comment 5 M S Vishwanath Bhat 2013-08-20 10:18:25 UTC
Fixed now. 


Tested in Version:
[root@mustang ~]# rpm -q glusterfs
glusterfs-3.4.0.20rhs-2.el6rhs.x86_64


One point to note: the marker (geo-replication.indexing) must be enabled on the slave volume before the application starts writing to the slave. If it is not enabled while I/O is happening on the slave, special_sync_mode will *not* work.
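
For illustration, a minimal sketch of enabling and checking the marker on the slave volume before any slave-side I/O (volume name "slave" is a placeholder):

   gluster volume set slave geo-replication.indexing on
   gluster volume info slave    # geo-replication.indexing: on should be listed under Options Reconfigured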

The steps I followed for verification are listed below (a command sketch follows the list):

1. Create some files on the master and let all of them sync to the slave.
2. Shut down all the master nodes.
3. Turn on indexing on the slave before any I/O happens on the slave:
   gluster v set slave geo-replication.indexing on
4. Create some files from the slave. (*deletes/renames/hardlinks will have issues*)
The part above, where the application is moved from the master to the slave, is the failover part.
5. Bring up all the master nodes. The geo-rep status will be defunct. Use "stop force" and "delete" to stop and delete the geo-rep session from master to slave.
6. Create a geo-rep session from the slave to the master. Also make sure to set up the ssh pem from the slave to the master.
7. Set special_sync_mode to recover:
   gluster v geo slave master_node::master config special_sync_mode recover
8. Start the geo-rep session.
9. Use "status detail" to monitor file syncing. Make sure there is no I/O from the slave during this time.
10. Once all the files are synced, stop and delete the session from slave to master.
11. Re-establish the session from master to slave and move the application back to the master.
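
A minimal command sketch of the failback half (steps 5-11), assuming a master volume "master" on host "master_node" and a slave volume "slave" on host "slave_node" (placeholder names; exact syntax may differ on a given build):

   # Step 5: after the master nodes are back, tear down the old master-to-slave session
   gluster volume geo-replication master slave_node::slave stop force
   gluster volume geo-replication master slave_node::slave delete

   # Steps 6-8: reverse session from slave to master in recover mode
   gluster volume geo-replication slave master_node::master create push-pem force
   gluster volume geo-replication slave master_node::master config special_sync_mode recover
   gluster volume geo-replication slave master_node::master start

   # Step 9: monitor syncing; keep slave-side I/O stopped meanwhile
   gluster volume geo-replication slave master_node::master status detail

   # Steps 10-11: once synced, remove the reverse session and re-establish master to slave
   gluster volume geo-replication slave master_node::master stop
   gluster volume geo-replication slave master_node::master delete
   gluster volume geo-replication master slave_node::slave create push-pem force
   gluster volume geo-replication master slave_node::slave start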

Moving this bug to VERIFIED.

Comment 6 Scott Haines 2013-09-23 22:29:51 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html