Bug 1000372 - Dist-geo-rep: Files are not getting synced after failover-failback with normal reversal of the direction of sync. [NEEDINFO]
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Assigned To: Bug Updates Notification Mailing List
Keywords: consistency, failover
: ZStream
Depends On:
Blocks: 1285206
Reported: 2013-08-23 06:05 EDT by M S Vishwanath Bhat
Modified: 2016-05-31 21:57 EDT (History)
8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1285206
Last Closed: 2015-11-25 03:47:57 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
amarts: needinfo? (vbhat)

Attachments: None
Description M S Vishwanath Bhat 2013-08-23 06:05:13 EDT
Description of problem:
I was testing failover-failback with normal reversal of the syncing direction (without special_sync_mode set to recover). After failover, i.e. when the slave becomes the master and the master becomes the slave, the files created on the slave during the original master's downtime are not being synced to the original master.

It has been about 19 hours and still no files have been synced; status detail shows zero files being synced.

Version-Release number of selected component (if applicable):
[root@ramanujan ~]# rpm -q glusterfs

How reproducible:
Hit once. Not sure if reproducible.

Steps to Reproduce:
1. Create two 2x2 distributed-replicated volumes, master and slave, then create and start a geo-rep session between them.
2. From the master, create a few tar files and a few text files using crefi.py.
3. Create 1000 more text files and keep truncating them in a while loop, again using crefi.py.
4. Shut down all the master nodes.
5. Redirect the application to the slave. In this case the application is just a while loop truncating the text files.
6. Write a few more files to the slave volume. Delete two of the directories from the slave volume (which were synced from the original master).
7. Bring the master nodes back up. The geo-rep status will be defunct, so do a stop force and delete the session from master to slave.
8. Set up password-less SSH from one slave node to one master node.
9. Create a geo-rep session from the original slave to the original master with push-pem after generating the ssh_pem_pub_file.
10. Again start truncating the files in a while loop, and stop after some time.
11. Wait for the files to eventually get synced from the original slave to the original master.
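The session reversal in steps 7-9 corresponds roughly to the following CLI sequence. This is a sketch only: the hostnames (euclid for the original master side, the slave-side node names from the status output below) and the exact order are assumptions based on this report, not commands verified against this setup.

```shell
# Step 7: on a surviving original-master node, tear down the now-defunct
# original session (master volume "master" -> slave volume "slave").
# SLAVEHOST is a placeholder for the original slave-side node.
gluster volume geo-replication master SLAVEHOST::slave stop force
gluster volume geo-replication master SLAVEHOST::slave delete

# Steps 8-9: on one original-slave node, generate the common pem keys and
# create the reversed session with push-pem, then start it.
gluster system:: execute gsec_create
gluster volume geo-replication slave euclid::master create push-pem
gluster volume geo-replication slave euclid::master start
```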

Actual results:
Even after some 18 hours, the files are not getting synced.

                                         MASTER: slave  SLAVE: euclid::master

NODE                       STATUS  UPTIME    FILES SYNCD  FILES PENDING  BYTES PENDING  DELETES PENDING
pythagoras.blr.redhat.com  Stable  00:11:17  0            3974           1.6GB          0
ramanujan.blr.redhat.com   Stable  18:30:18  0            0              0Bytes         0

I also saw that the session on pythagoras went faulty for some time before coming back to good health; you can see that in the uptime in the status above.

Expected results:
Files should get synced properly, and the session should not go into a faulty state in the meantime.

Additional info:
I keep seeing these errors many times in the log file.

[2013-08-23 10:02:03.814223] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a91b66d2-9593-4e0e-86d3-0dd3f8bdc11f [errcode: 23]
[2013-08-23 10:02:03.815316] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/aedaf168-b03c-4896-a262-93459d1bdbbf [errcode: 23]
[2013-08-23 10:02:03.816378] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/6933210c-d800-487d-890b-0eb1fdb1949f [errcode: 23]
[2013-08-23 10:02:03.817516] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f0f3db69-4e11-47b0-b065-99c25e7c9acb [errcode: 23]
[2013-08-23 10:02:03.818588] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a67655aa-6bd0-4adb-8ae7-fe81d8adefce [errcode: 23]
[2013-08-23 10:02:03.819664] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/fb70279c-db84-4cab-951e-4329bf7220fb [errcode: 23]
[2013-08-23 10:02:03.820691] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/31130090-8c53-4c7f-b687-13f9297ea9ef [errcode: 23]
[2013-08-23 10:02:03.821742] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a6bc7713-bec9-48bf-80c7-72a01261cdbc [errcode: 23]
[2013-08-23 10:02:03.822782] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/5ff97c63-a470-44c3-8898-d7b4f6773094 [errcode: 23]
[2013-08-23 10:02:03.823844] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/b417020f-11bb-4fc3-899f-63a19a752111 [errcode: 23]
[2013-08-23 10:02:03.824928] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/06cad29b-6f29-4c76-b405-5bfda39e6da8 [errcode: 23]
[2013-08-23 10:02:03.826318] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/2a86b950-952e-4a83-b676-9b625be56f36 [errcode: 23]
[2013-08-23 10:02:03.827397] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a6808035-0156-48b5-ae22-8ebd455e888d [errcode: 23]
[2013-08-23 10:02:03.828501] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7ba652a4-ca39-4d7d-8ac6-dc31c221fbb9 [errcode: 23]
[2013-08-23 10:02:03.829513] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/55fbaaf9-ed02-484f-bf1f-c807831f53d2 [errcode: 23]
[2013-08-23 10:02:03.830559] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7dbd3165-9422-47c6-a5a8-d2db91033c9c [errcode: 23]
[2013-08-23 10:02:03.831664] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/bc3ef685-5df3-49f0-b658-cf700df1fb70 [errcode: 23]
[2013-08-23 10:02:03.832727] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9ebfa7e5-b1a1-490d-a08c-7937c2041c37 [errcode: 23]
[2013-08-23 10:02:03.833779] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/06adb421-b36e-40e3-86e8-68e5a7625103 [errcode: 23]
[2013-08-23 10:02:03.834961] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/ebaa8716-1081-4124-b8fb-0282ff6c8c7f [errcode: 23]
[2013-08-23 10:02:03.835979] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/3e240f53-f90a-4e36-a6d1-bcc9ecf6f6ce [errcode: 23]
[2013-08-23 10:02:03.837038] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f20c194e-b45e-4699-ab1c-7076a6926eda [errcode: 23]
[2013-08-23 10:02:03.838247] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/c474c533-48ef-46ca-be51-0333977c44d1 [errcode: 23]
[2013-08-23 10:02:03.839344] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/97972934-a2fb-4a15-8f3d-c87138146ed7 [errcode: 23]
[2013-08-23 10:02:03.840460] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/e52ecee3-9888-4265-92cb-41af02ce9af3 [errcode: 23]
[2013-08-23 10:02:03.841609] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/5a67eb33-81cd-4a87-a03b-3692d1e0f8c4 [errcode: 23]
[2013-08-23 10:02:03.842641] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/984490d3-3ce5-4724-94be-137c9da179b7 [errcode: 23]
[2013-08-23 10:02:03.843815] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7eb8ab80-be34-4cd5-8a01-8c7465cbfa2c [errcode: 23]
[2013-08-23 10:02:03.844984] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/dbaed18e-2383-404e-b3db-8813e04b75b1 [errcode: 23]
[2013-08-23 10:02:03.849615] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9d448e17-e6be-4461-9aa6-706d27e640be [errcode: 23]
[2013-08-23 10:02:03.850678] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/6cd497ff-0482-48ef-9e7f-2188bd68d853 [errcode: 23]
[2013-08-23 10:02:03.856074] W [master(/rhs/bricks/brick0):745:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/slave/ssh%3A%2F%2Froot%4010.70.35.90%3Agluster%3A%2F%2F127.0.0.1%3Amaster/59ddf777397e52a13ba1333653d63854/xsync/XSYNC-CHANGELOG.1377231464
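Rsync exit code 23 means "partial transfer due to error", i.e. some files could not be transferred at all. When triaging logs like the above, it can help to tally the failing gfids and error codes. The following is a small hypothetical helper (not part of geo-replication itself); the log-line pattern is taken from the excerpt above:

```python
import re
from collections import Counter

# Matches geo-rep worker warnings of the form seen above, e.g.:
# [2013-08-23 10:02:03.814223] W [master(/rhs/bricks/brick0):618:regjob]
#     <top>: Rsync: .gfid/a91b66d2-... [errcode: 23]
LINE_RE = re.compile(
    r"\[(?P<ts>[\d\- :.]+)\] W \[master\((?P<brick>[^)]+)\).*?"
    r"Rsync: \.gfid/(?P<gfid>[0-9a-f-]+) \[errcode: (?P<code>\d+)\]"
)

def tally_rsync_failures(lines):
    """Return (Counter of errcodes, set of failing gfids) from log lines."""
    codes = Counter()
    gfids = set()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            codes[int(m.group("code"))] += 1
            gfids.add(m.group("gfid"))
    return codes, gfids
```

Feeding the excerpt above through this would show all failures carrying errcode 23, which points at rsync being unable to transfer those gfid-addressed files rather than at a changelog-processing problem.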

I will archive all the logs and sosreport for further debugging.
Comment 2 Amar Tumballi 2013-09-11 10:42:47 EDT
The logs look similar to a few other tests which failed and were fixed in the latest releases. Can we run a round of tests for this with the latest build?
Comment 3 Amar Tumballi 2013-11-13 05:00:21 EST
Were the SSH keys set up properly between slave -> master in this case? It looks like that is the issue.
Comment 4 Aravinda VK 2015-11-25 03:47:57 EST
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
