Bug 1000372 - Dist-geo-rep: Files are not getting synced after failover-failback with normal reversal of the direction of sync. [NEEDINFO]
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Assigned To: Bug Updates Notification Mailing List
Keywords: consistency, failover
: ZStream
Depends On:
Blocks: 1285206
Reported: 2013-08-23 06:05 EDT by M S Vishwanath Bhat
Modified: 2016-05-31 21:57 EDT (History)
8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1285206
Last Closed: 2015-11-25 03:47:57 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
amarts: needinfo? (vbhat)

Attachments: None
Description M S Vishwanath Bhat 2013-08-23 06:05:13 EDT
Description of problem:
I was testing failover-failback with normal reversal of the syncing direction (without special_sync_mode set to recover). After failover, i.e. when the slave becomes the master and the master becomes the slave, the files created on the slave during the original master's downtime are not being synced to the original master.

It has been about 19 hours and still no files have been synced; status detail shows zero files being synced.

Version-Release number of selected component (if applicable):
[root@ramanujan ~]# rpm -q glusterfs

How reproducible:
Hit once. Not sure if reproducible.

Steps to Reproduce:
1. Create two 2x2 distributed-replicated volumes, master and slave, then create and start a geo-rep session between them.
2. From the master, create a few tar files and a few text files using crefi.py.
3. Create 1000 more text files and keep truncating them in a while loop, again using crefi.py.
4. Shut down all the master nodes.
5. Redirect the application to the slave. In this case the application is just a while loop truncating the text files.
6. Write a few more files to the slave volume. Delete two of the directories from the slave volume (which were synced from the original master).
7. Bring the master nodes back up. The geo-rep status will be defunct, so do a stop force and delete the session from master to slave.
8. Set up password-less SSH from one slave node to one master node.
9. Create a geo-rep session from the original slave to the original master with push-pem after generating the ssh_pem_pub_file.
10. Again start truncating the files in a while loop, and stop after some time.
11. Wait for the files to eventually get synced from the original slave to the original master.
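The session reversal in steps 7-9 corresponds roughly to the following CLI sequence. This is a sketch only: the hostnames (euclid for the original master side, the slave-side node names from the status output below) and the exact order are assumptions based on this report, not commands verified against this setup.

```shell
# Step 7: on a surviving original-master node, tear down the now-defunct
# original session (master volume "master" -> slave volume "slave").
# SLAVEHOST is a placeholder for the original slave-side node.
gluster volume geo-replication master SLAVEHOST::slave stop force
gluster volume geo-replication master SLAVEHOST::slave delete

# Steps 8-9: on one original-slave node, generate the common pem keys and
# create the reversed session with push-pem, then start it.
gluster system:: execute gsec_create
gluster volume geo-replication slave euclid::master create push-pem
gluster volume geo-replication slave euclid::master start
```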

Actual results:
Even after some 18 hours, the files are not getting synced.

                                         MASTER: slave  SLAVE: euclid::master

NODE                       STATUS  UPTIME    FILES SYNCD  FILES PENDING  BYTES PENDING  DELETES PENDING
pythagoras.blr.redhat.com  Stable  00:11:17  0            3974           1.6GB          0
ramanujan.blr.redhat.com   Stable  18:30:18  0            0              0Bytes         0

I also saw that the session on pythagoras went faulty for some time before coming back to good health; you can see that in the uptime in the status above.

Expected results:
Files should get synced properly, and the session should not go into a faulty state in the meantime.

Additional info:
I keep seeing these errors many times in the log file.

[2013-08-23 10:02:03.814223] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a91b66d2-9593-4e0e-86d3-0dd3f8bdc11f [errcode: 23]
[2013-08-23 10:02:03.815316] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/aedaf168-b03c-4896-a262-93459d1bdbbf [errcode: 23]
[2013-08-23 10:02:03.816378] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/6933210c-d800-487d-890b-0eb1fdb1949f [errcode: 23]
[2013-08-23 10:02:03.817516] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f0f3db69-4e11-47b0-b065-99c25e7c9acb [errcode: 23]
[2013-08-23 10:02:03.818588] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a67655aa-6bd0-4adb-8ae7-fe81d8adefce [errcode: 23]
[2013-08-23 10:02:03.819664] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/fb70279c-db84-4cab-951e-4329bf7220fb [errcode: 23]
[2013-08-23 10:02:03.820691] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/31130090-8c53-4c7f-b687-13f9297ea9ef [errcode: 23]
[2013-08-23 10:02:03.821742] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a6bc7713-bec9-48bf-80c7-72a01261cdbc [errcode: 23]
[2013-08-23 10:02:03.822782] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/5ff97c63-a470-44c3-8898-d7b4f6773094 [errcode: 23]
[2013-08-23 10:02:03.823844] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/b417020f-11bb-4fc3-899f-63a19a752111 [errcode: 23]
[2013-08-23 10:02:03.824928] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/06cad29b-6f29-4c76-b405-5bfda39e6da8 [errcode: 23]
[2013-08-23 10:02:03.826318] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/2a86b950-952e-4a83-b676-9b625be56f36 [errcode: 23]
[2013-08-23 10:02:03.827397] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a6808035-0156-48b5-ae22-8ebd455e888d [errcode: 23]
[2013-08-23 10:02:03.828501] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7ba652a4-ca39-4d7d-8ac6-dc31c221fbb9 [errcode: 23]
[2013-08-23 10:02:03.829513] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/55fbaaf9-ed02-484f-bf1f-c807831f53d2 [errcode: 23]
[2013-08-23 10:02:03.830559] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7dbd3165-9422-47c6-a5a8-d2db91033c9c [errcode: 23]
[2013-08-23 10:02:03.831664] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/bc3ef685-5df3-49f0-b658-cf700df1fb70 [errcode: 23]
[2013-08-23 10:02:03.832727] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9ebfa7e5-b1a1-490d-a08c-7937c2041c37 [errcode: 23]
[2013-08-23 10:02:03.833779] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/06adb421-b36e-40e3-86e8-68e5a7625103 [errcode: 23]
[2013-08-23 10:02:03.834961] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/ebaa8716-1081-4124-b8fb-0282ff6c8c7f [errcode: 23]
[2013-08-23 10:02:03.835979] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/3e240f53-f90a-4e36-a6d1-bcc9ecf6f6ce [errcode: 23]
[2013-08-23 10:02:03.837038] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f20c194e-b45e-4699-ab1c-7076a6926eda [errcode: 23]
[2013-08-23 10:02:03.838247] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/c474c533-48ef-46ca-be51-0333977c44d1 [errcode: 23]
[2013-08-23 10:02:03.839344] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/97972934-a2fb-4a15-8f3d-c87138146ed7 [errcode: 23]
[2013-08-23 10:02:03.840460] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/e52ecee3-9888-4265-92cb-41af02ce9af3 [errcode: 23]
[2013-08-23 10:02:03.841609] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/5a67eb33-81cd-4a87-a03b-3692d1e0f8c4 [errcode: 23]
[2013-08-23 10:02:03.842641] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/984490d3-3ce5-4724-94be-137c9da179b7 [errcode: 23]
[2013-08-23 10:02:03.843815] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7eb8ab80-be34-4cd5-8a01-8c7465cbfa2c [errcode: 23]
[2013-08-23 10:02:03.844984] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/dbaed18e-2383-404e-b3db-8813e04b75b1 [errcode: 23]
[2013-08-23 10:02:03.849615] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9d448e17-e6be-4461-9aa6-706d27e640be [errcode: 23]
[2013-08-23 10:02:03.850678] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/6cd497ff-0482-48ef-9e7f-2188bd68d853 [errcode: 23]
[2013-08-23 10:02:03.856074] W [master(/rhs/bricks/brick0):745:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/slave/ssh%3A%2F%2Froot%4010.70.35.90%3Agluster%3A%2F%2F127.0.0.1%3Amaster/59ddf777397e52a13ba1333653d63854/xsync/XSYNC-CHANGELOG.1377231464
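Rsync exit code 23 means "partial transfer due to error", i.e. some files could not be transferred at all. When triaging logs like the above, it can help to tally the failing gfids and error codes. The following is a small hypothetical helper (not part of geo-replication itself); the log-line pattern is taken from the excerpt above:

```python
import re
from collections import Counter

# Matches geo-rep worker warnings of the form seen above, e.g.:
# [2013-08-23 10:02:03.814223] W [master(/rhs/bricks/brick0):618:regjob]
#     <top>: Rsync: .gfid/a91b66d2-... [errcode: 23]
LINE_RE = re.compile(
    r"\[(?P<ts>[\d\- :.]+)\] W \[master\((?P<brick>[^)]+)\).*?"
    r"Rsync: \.gfid/(?P<gfid>[0-9a-f-]+) \[errcode: (?P<code>\d+)\]"
)

def tally_rsync_failures(lines):
    """Return (Counter of errcodes, set of failing gfids) from log lines."""
    codes = Counter()
    gfids = set()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            codes[int(m.group("code"))] += 1
            gfids.add(m.group("gfid"))
    return codes, gfids
```

Feeding the excerpt above through this would show all failures carrying errcode 23, which points at rsync being unable to transfer those gfid-addressed files rather than at a changelog-processing problem.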

I will archive all the logs and sosreport for further debugging.
Comment 2 Amar Tumballi 2013-09-11 10:42:47 EDT
The logs look similar to a few other tests which failed and were fixed in the latest releases. Can we run a round of tests for this with the latest build?
Comment 3 Amar Tumballi 2013-11-13 05:00:21 EST
Were the SSH keys set up properly between slave -> master in this case? It looks like that is the issue.
Comment 4 Aravinda VK 2015-11-25 03:47:57 EST
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
