Description of problem:
I was testing failover-failback with normal reversal of the syncing direction (without special_sync_mode set to recover). After failover, i.e. when the slave becomes the master and the master becomes the slave, the files created on the slave during the master's downtime are not being synced to the original master. It has been about 19 hours and still no files have been synced; the status details show zero files synced.

Version-Release number of selected component (if applicable):
[root@ramanujan ~]# rpm -q glusterfs
glusterfs-3.4.0.21rhs-1.el6rhs.x86_64

How reproducible:
Hit once. Not sure if reproducible.

Steps to Reproduce:
1. Create two 2x2 distributed-replicated volumes, master and slave, and create and start a geo-rep session between them.
2. From the master, create a few tar files and a few text files using crefi.py.
3. Create 1000 more text files and keep truncating them in a while loop, again using crefi.py.
4. Shut down all the master nodes.
5. Redirect the application to the slave. In this case the application is just a while loop truncating the text files.
6. Write a few more files to the slave volume. Delete two of the directories from the slave volume (which were synced from the original master).
7. Bring the master nodes back up. The geo-rep status will be defunct, so do a stop force and delete the session from master to slave.
8. Set up passwordless ssh from one slave node to one master node.
9. Create a geo-rep session from the original slave to the original master with push-pem, after generating the ssh_pem_pub_file.
10. Again start truncating the files in a while loop and stop after some time.
11. Wait for the files to eventually get synced from the original slave to the original master.

Actual results:
Even after 18-odd hours, files are not getting synced.
MASTER: slave    SLAVE: euclid::master

NODE                       HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING
-----------------------------------------------------------------------------------------------------------------
pythagoras.blr.redhat.com  Stable    00:11:17    0              3974             1.6GB            0
ramanujan.blr.redhat.com   Stable    18:30:18    0              0                0Bytes           0

I also saw that the session on pythagoras went faulty for some time before coming back to good health; you can see that in the uptime in the status above.

Expected results:
Files should get synced properly, and the session should not go into the faulty state in the meantime.

Additional info:
I keep seeing this error many times in the log file:

[2013-08-23 10:02:03.814223] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a91b66d2-9593-4e0e-86d3-0dd3f8bdc11f [errcode: 23]
[2013-08-23 10:02:03.815316] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/aedaf168-b03c-4896-a262-93459d1bdbbf [errcode: 23]
[2013-08-23 10:02:03.816378] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/6933210c-d800-487d-890b-0eb1fdb1949f [errcode: 23]
[2013-08-23 10:02:03.817516] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f0f3db69-4e11-47b0-b065-99c25e7c9acb [errcode: 23]
[2013-08-23 10:02:03.818588] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a67655aa-6bd0-4adb-8ae7-fe81d8adefce [errcode: 23]
[2013-08-23 10:02:03.819664] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/fb70279c-db84-4cab-951e-4329bf7220fb [errcode: 23]
[2013-08-23 10:02:03.820691] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/31130090-8c53-4c7f-b687-13f9297ea9ef [errcode: 23]
[2013-08-23 10:02:03.821742] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a6bc7713-bec9-48bf-80c7-72a01261cdbc [errcode: 23]
[2013-08-23 10:02:03.822782] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/5ff97c63-a470-44c3-8898-d7b4f6773094 [errcode: 23]
[2013-08-23 10:02:03.823844] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/b417020f-11bb-4fc3-899f-63a19a752111 [errcode: 23]
[2013-08-23 10:02:03.824928] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/06cad29b-6f29-4c76-b405-5bfda39e6da8 [errcode: 23]
[2013-08-23 10:02:03.826318] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/2a86b950-952e-4a83-b676-9b625be56f36 [errcode: 23]
[2013-08-23 10:02:03.827397] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/a6808035-0156-48b5-ae22-8ebd455e888d [errcode: 23]
[2013-08-23 10:02:03.828501] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7ba652a4-ca39-4d7d-8ac6-dc31c221fbb9 [errcode: 23]
[2013-08-23 10:02:03.829513] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/55fbaaf9-ed02-484f-bf1f-c807831f53d2 [errcode: 23]
[2013-08-23 10:02:03.830559] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7dbd3165-9422-47c6-a5a8-d2db91033c9c [errcode: 23]
[2013-08-23 10:02:03.831664] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/bc3ef685-5df3-49f0-b658-cf700df1fb70 [errcode: 23]
[2013-08-23 10:02:03.832727] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9ebfa7e5-b1a1-490d-a08c-7937c2041c37 [errcode: 23]
[2013-08-23 10:02:03.833779] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/06adb421-b36e-40e3-86e8-68e5a7625103 [errcode: 23]
[2013-08-23 10:02:03.834961] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/ebaa8716-1081-4124-b8fb-0282ff6c8c7f [errcode: 23]
[2013-08-23 10:02:03.835979] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/3e240f53-f90a-4e36-a6d1-bcc9ecf6f6ce [errcode: 23]
[2013-08-23 10:02:03.837038] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/f20c194e-b45e-4699-ab1c-7076a6926eda [errcode: 23]
[2013-08-23 10:02:03.838247] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/c474c533-48ef-46ca-be51-0333977c44d1 [errcode: 23]
[2013-08-23 10:02:03.839344] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/97972934-a2fb-4a15-8f3d-c87138146ed7 [errcode: 23]
[2013-08-23 10:02:03.840460] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/e52ecee3-9888-4265-92cb-41af02ce9af3 [errcode: 23]
[2013-08-23 10:02:03.841609] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/5a67eb33-81cd-4a87-a03b-3692d1e0f8c4 [errcode: 23]
[2013-08-23 10:02:03.842641] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/984490d3-3ce5-4724-94be-137c9da179b7 [errcode: 23]
[2013-08-23 10:02:03.843815] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/7eb8ab80-be34-4cd5-8a01-8c7465cbfa2c [errcode: 23]
[2013-08-23 10:02:03.844984] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/dbaed18e-2383-404e-b3db-8813e04b75b1 [errcode: 23]
[2013-08-23 10:02:03.849615] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/9d448e17-e6be-4461-9aa6-706d27e640be [errcode: 23]
[2013-08-23 10:02:03.850678] W [master(/rhs/bricks/brick0):618:regjob] <top>: Rsync: .gfid/6cd497ff-0482-48ef-9e7f-2188bd68d853 [errcode: 23]
[2013-08-23 10:02:03.856074] W [master(/rhs/bricks/brick0):745:process] _GMaster: incomplete sync, retrying changelog: /var/run/gluster/slave/ssh%3A%2F%2Froot%4010.70.35.90%3Agluster%3A%2F%2F127.0.0.1%3Amaster/59ddf777397e52a13ba1333653d63854/xsync/XSYNC-CHANGELOG.1377231464

I will archive all the logs and the sosreport for further debugging.
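For context on the warnings above: rsync exit code 23 means "partial transfer due to error" (per the EXIT VALUES section of rsync(1)), which the geo-rep worker logs verbatim as `errcode`. A minimal sketch of decoding these codes (the helper name is mine, not part of the gluster code base):

```python
# Sketch (not gluster code): decode the rsync exit codes that appear as
# "errcode" in the geo-replication log. The table follows the EXIT VALUES
# section of rsync(1); code 23 is the one flooding the log above.
RSYNC_EXIT_CODES = {
    0: "Success",
    23: "Partial transfer due to error",
    24: "Partial transfer due to vanished source files",
    30: "Timeout in data send/receive",
}

def describe_rsync_errcode(code):
    return RSYNC_EXIT_CODES.get(code, "Unknown rsync exit code: %d" % code)

print(describe_rsync_errcode(23))  # Partial transfer due to error
```

So every one of those warnings is rsync reporting that it could not fully transfer the file for that gfid, which is consistent with the "incomplete sync, retrying changelog" message that follows.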
The logs look similar to a few other tests that failed and got fixed in the latest releases. Can we run a round of tests for this with the latest build?
Were the ssh keys set up properly from slave to master in this case? It looks like that is the issue.
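One way to check this is a batch-mode ssh probe from the slave node to the master node, which should succeed without a password prompt if the session keys are in place. A hedged sketch (the host name "euclid" is taken from this report; the secret.pem path is the usual geo-replication default and may differ by release):

```python
# Sketch (assumption-laden): build the ssh probe that the reversed
# slave -> master geo-rep session depends on. "euclid" is the target
# master node from this report; the pem path is the usual
# geo-replication default and may differ by release.
import shlex  # for splitting the command before passing it to subprocess

def ssh_probe(host, user="root",
              pem="/var/lib/glusterd/geo-replication/secret.pem"):
    # BatchMode=yes makes ssh exit non-zero instead of prompting for a
    # password, so exit status 0 from this command proves key-based login.
    return "ssh -i {pem} -oBatchMode=yes -oConnectTimeout=5 {user}@{host} true".format(
        pem=pem, user=user, host=host)

cmd = ssh_probe("euclid")
print(cmd)
# On the slave node, run it with: subprocess.call(shlex.split(cmd)) == 0
```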
Closing this bug since the RHGS 2.1 release has reached EOL. The required bugs have been cloned to RHGS 3.1. Please re-open this issue if it is found again.