1393362 – [geo-rep]: Sync stuck in Hybrid crawl since 2 days

Bug 1393362 - [geo-rep]: Sync stuck in Hybrid crawl since 2 days

Summary: [geo-rep]: Sync stuck in Hybrid crawl since 2 days

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	geo-replication
Sub Component:
Version:	rhgs-3.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Bug Updates Notification Mailing List
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1351530
TreeView+	depends on / blocked

Reported:	2016-11-09 12:11 UTC by Rahul Hinduja
Modified:	2018-10-11 10:10 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	If a geo-replication session is created while gluster volume rebalance is in progress, then geo-replication may miss some files/directories sync to slave volume. This is caused because of internal movement of files due to rebalance. Workaround: Do not create a geo-replication session if the master Volume rebalance is in progress.
Clone Of:
Environment:
Last Closed:	2018-10-11 10:10:19 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Rahul Hinduja 2016-11-09 12:11:58 UTC

Description of problem:
=======================

I have data set of 4.5G with 6Lac files in it. Started rebalance and created geo-rep session. It picked up Hybrid crawl correctly but after a while the sync stopped. 

Following are the last logs at geo-rep:

[2016-11-07 10:47:25.684036] I [master(/rhs/brick1/b1):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:25.699337] I [master(/rhs/brick2/b5):519:crawlwrap] _GMaster: crawl interval: 60 seconds
[2016-11-07 10:47:25.775741] I [master(/rhs/brick2/b5):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:25.796564] I [master(/rhs/brick3/b9):519:crawlwrap] _GMaster: crawl interval: 60 seconds
[2016-11-07 10:47:25.911012] I [master(/rhs/brick3/b9):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:43.776889] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%V2QR9QLZP0: ENOENT
[2016-11-07 10:47:43.777194] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%W8RDJO0C0J: ENOENT
[2016-11-07 10:47:43.777380] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%CA4KDL5VY9: ENOENT
[2016-11-07 10:47:43.777580] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%9243ET58Q3: ENOENT
[2016-11-07 10:47:43.777765] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%IW26Q6YJTP: ENOENT
[2016-11-07 10:47:43.777935] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%TAVDD5GI7T: ENOENT
[2016-11-07 10:47:43.778105] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%XZ68HFOKP7: ENOENT
[2016-11-07 10:47:43.778285] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%Q1586CIEV8: ENOENT
[2016-11-07 10:47:43.778450] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%G3ADPJ93YG: ENOENT
[2016-11-07 10:47:43.778644] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%TONI1KSYGJ: ENOENT
[2016-11-07 10:47:43.778808] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%2730385I98: ENOENT
[2016-11-07 10:52:28.307632] I [master(/rhs/brick3/b9):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/7276
cecc9d436837c6dcbfda8f8e2d3b/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 10:52:58.265819] I [master(/rhs/brick2/b5):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/1c53
557c54e8d43b89fcc316baed04a9/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 10:53:46.210700] I [master(/rhs/brick1/b1):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c19b
89ac45352ab8c894d210d136dd56/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 11:17:33.266061] W [master(/rhs/brick3/b9):1058:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1478515645
[2016-11-07 11:22:13.797106] I [master(/rhs/brick3/b9):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/7276
cecc9d436837c6dcbfda8f8e2d3b/xsync/XSYNC-CHANGELOG.1478515948
[2016-11-07 11:23:06.200653] W [master(/rhs/brick2/b5):1058:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1478515645
[2016-11-07 11:28:43.559046] I [master(/rhs/brick2/b5):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/1c53
557c54e8d43b89fcc316baed04a9/xsync/XSYNC-CHANGELOG.1478515979

Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.8.4-3.el7rhgs.x86_64

Steps Carried:
==============
1. Create 4 node cluster and create 6x2 master volume
2. Create 2 node cluster and create 2x2 slave volume
3. Mount the master volume to Fuse client and create Huge data set (6L small files)
crefi -T 10 -n 1000 --multi -d 10 -b 5 --random --max=5K --min=1K --f=create /mnt/master/
4. Add 2 more nodes to the master cluster
5. Add bricks from these nodes to convert it into 7x2
6. Start the rebalance on master node
7. While rebalance is in progress, setup geo-rep (enabling shared volume, gsec_create, push-pem, config, start)
8. After geo-rep start, it picks up Hybrid crawl but stucks after syncing a while

Following are strace:
=====================

[root@dhcp37-76 ~]# strace -f -s 500 -p 18196
Process 18196 attached
write(2, "rsync: get_acl: sys_acl_get_file(.gfid/4b7bba53-bcc8-41c5-93be-ae2090c242ae, ACL_TYPE_ACCESS): No such file or directory (2)", 124
^CProcess 18196 detached
 <detached ...>
[root@dhcp37-76 ~]#

Comment 3 Aravinda VK 2016-11-09 14:16:16 UTC

Seeing lot of rsync errors about creating directories, that means mkdir failed for some reason and all the files from that directory failed to sync. (Checked Georep logs from *.121 node). Not yet traced the reason for mkdir failure.

Comment 4 Aravinda VK 2016-11-09 14:25:08 UTC

Operation not permitted are expected for rsync, rsync expects directory should be present before syncing files. Yet to find the reason for entry failure by entry_ops

[2016-11-09 10:09:24.794229] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/00201812-179b-400e-a6f2-b332bf72c7de" failed: Operation not permitted (1)
[2016-11-09 10:09:24.794754] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.794974] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/01c699d5-8955-4e7e-85ae-b8c2f6dcb9da" failed: Operation not permitted (1)
[2016-11-09 10:09:24.795208] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.795418] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/01d22973-d93b-435b-9d25-f7ec2ebc262a" failed: Operation not permitted (1)
[2016-11-09 10:09:24.795630] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.795819] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/022988cd-b59c-4200-85d1-010343ce23a7" failed: Operation not permitted (1)
[2016-11-09 10:09:24.796001] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.796231] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/0234671b-0b85-48c4-bdd6-e2108805f631" failed: Operation not permitted (1)

Comment 5 Rahul Hinduja 2016-11-11 07:25:02 UTC

Logs mentioned in comment 4 are for different issue(May be related but perceived difference) where crawl is moved to Changelog after restarting the geo-rep. But it still misses the data. (log: 2016-11-09)

Original issue was Hybrid crawls being stuck (log: 2016-11-07)

Comment 17 Amar Tumballi 2018-10-11 10:10:19 UTC

Not worked on this in last 2 years! and not planning to pick up in any immediate releases! 

RCA available, and agreed long back that it is not an issue, and documented the Known Issue!

Please reopen the bug if found important for any release!

Note You need to log in before you can comment on or make changes to this bug.