Bug 1393362 - [geo-rep]: Sync stuck in Hybrid crawl for 2 days
Status: CLOSED DEFERRED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 3.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Bug Updates Notification Mailing List
QA Contact: Rahul Hinduja
Keywords: ZStream
Depends On:
Blocks: 1351530
Reported: 2016-11-09 07:11 EST by Rahul Hinduja
Modified: 2018-10-11 06:10 EDT
CC: 9 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
If a geo-replication session is created while a gluster volume rebalance is in progress, geo-replication may fail to sync some files or directories to the slave volume. This is caused by the internal movement of files during rebalance. Workaround: do not create a geo-replication session while a rebalance of the master volume is in progress.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-11 06:10:19 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Rahul Hinduja 2016-11-09 07:11:58 EST
Description of problem:
=======================

I have a data set of 4.5G with 6 lakh (600,000) files in it. I started a rebalance and created a geo-rep session. It picked up the Hybrid crawl correctly, but after a while the sync stopped.

Following are the last entries in the geo-rep log:

[2016-11-07 10:47:25.684036] I [master(/rhs/brick1/b1):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:25.699337] I [master(/rhs/brick2/b5):519:crawlwrap] _GMaster: crawl interval: 60 seconds
[2016-11-07 10:47:25.775741] I [master(/rhs/brick2/b5):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:25.796564] I [master(/rhs/brick3/b9):519:crawlwrap] _GMaster: crawl interval: 60 seconds
[2016-11-07 10:47:25.911012] I [master(/rhs/brick3/b9):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:43.776889] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%V2QR9QLZP0: ENOENT
[2016-11-07 10:47:43.777194] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%W8RDJO0C0J: ENOENT
[2016-11-07 10:47:43.777380] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%CA4KDL5VY9: ENOENT
[2016-11-07 10:47:43.777580] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%9243ET58Q3: ENOENT
[2016-11-07 10:47:43.777765] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%IW26Q6YJTP: ENOENT
[2016-11-07 10:47:43.777935] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%TAVDD5GI7T: ENOENT
[2016-11-07 10:47:43.778105] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%XZ68HFOKP7: ENOENT
[2016-11-07 10:47:43.778285] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%Q1586CIEV8: ENOENT
[2016-11-07 10:47:43.778450] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%G3ADPJ93YG: ENOENT
[2016-11-07 10:47:43.778644] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%TONI1KSYGJ: ENOENT
[2016-11-07 10:47:43.778808] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%2730385I98: ENOENT
[2016-11-07 10:52:28.307632] I [master(/rhs/brick3/b9):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/7276cecc9d436837c6dcbfda8f8e2d3b/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 10:52:58.265819] I [master(/rhs/brick2/b5):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/1c53557c54e8d43b89fcc316baed04a9/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 10:53:46.210700] I [master(/rhs/brick1/b1):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c19b89ac45352ab8c894d210d136dd56/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 11:17:33.266061] W [master(/rhs/brick3/b9):1058:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1478515645
[2016-11-07 11:22:13.797106] I [master(/rhs/brick3/b9):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/7276cecc9d436837c6dcbfda8f8e2d3b/xsync/XSYNC-CHANGELOG.1478515948
[2016-11-07 11:23:06.200653] W [master(/rhs/brick2/b5):1058:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1478515645
[2016-11-07 11:28:43.559046] I [master(/rhs/brick2/b5):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/1c53557c54e8d43b89fcc316baed04a9/xsync/XSYNC-CHANGELOG.1478515979

Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.8.4-3.el7rhgs.x86_64

Steps Carried:
==============
1. Create 4 node cluster and create 6x2 master volume
2. Create 2 node cluster and create 2x2 slave volume
3. Mount the master volume on a FUSE client and create a huge data set (6 lakh small files):
crefi -T 10 -n 1000 --multi -d 10 -b 5 --random --max=5K --min=1K --f=create /mnt/master/
4. Add 2 more nodes to the master cluster
5. Add bricks from these nodes to convert it into 7x2
6. Start the rebalance on master node
7. While the rebalance is in progress, set up geo-rep (enable shared volume, gsec_create, push-pem, config, start)
8. After geo-rep starts, it picks up the Hybrid crawl but gets stuck after syncing for a while (see the CLI sketch below)
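
For reference, a minimal sketch of the CLI sequence behind steps 6 and 7, assuming illustrative names (master volume "master", slave volume "slave" on host "slave-host"):

# Step 6: start the rebalance on the expanded master volume
gluster volume rebalance master start
# Step 7: geo-rep setup while the rebalance is still running
gluster volume set all cluster.enable-shared-storage enable    # shared volume
gluster system:: execute gsec_create                           # generate pem keys
gluster volume geo-replication master slave-host::slave create push-pem
gluster volume geo-replication master slave-host::slave config use_meta_volume true
gluster volume geo-replication master slave-host::slave start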

Following is the strace output of the rsync worker:
=====================

[root@dhcp37-76 ~]# strace -f -s 500 -p 18196
Process 18196 attached
write(2, "rsync: get_acl: sys_acl_get_file(.gfid/4b7bba53-bcc8-41c5-93be-ae2090c242ae, ACL_TYPE_ACCESS): No such file or directory (2)", 124
^CProcess 18196 detached
 <detached ...>
[root@dhcp37-76 ~]#
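
For anyone reproducing the strace step above: the rsync worker PID can be located among the gsyncd workers' child processes; a hedged sketch, with illustrative volume and host names:

ps -ef | grep '[r]sync'    # rsync workers spawned by the gsyncd workers
gluster volume geo-replication master slave-host::slave status detail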
Comment 3 Aravinda VK 2016-11-09 09:16:16 EST
Seeing a lot of rsync errors about creating directories, which means mkdir failed for some reason and all the files from those directories failed to sync (checked geo-rep logs from the *.121 node). The reason for the mkdir failure has not yet been traced.
Comment 4 Aravinda VK 2016-11-09 09:25:08 EST
The "Operation not permitted" errors are expected from rsync: rsync expects the directory to be present before syncing files into it. The reason for the entry failure in entry_ops is yet to be found.

[2016-11-09 10:09:24.794229] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/00201812-179b-400e-a6f2-b332bf72c7de" failed: Operation not permitted (1)
[2016-11-09 10:09:24.794754] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.794974] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/01c699d5-8955-4e7e-85ae-b8c2f6dcb9da" failed: Operation not permitted (1)
[2016-11-09 10:09:24.795208] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.795418] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/01d22973-d93b-435b-9d25-f7ec2ebc262a" failed: Operation not permitted (1)
[2016-11-09 10:09:24.795630] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.795819] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/022988cd-b59c-4200-85d1-010343ce23a7" failed: Operation not permitted (1)
[2016-11-09 10:09:24.796001] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.796231] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/0234671b-0b85-48c4-bdd6-e2108805f631" failed: Operation not permitted (1)
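
One way to confirm that these parent directories were never created on the slave is to stat the failing gfids through an aux-gfid mount of the slave volume; a minimal sketch, with the slave host and volume names assumed:

mount -t glusterfs -o aux-gfid-mount slave-host:/slave /mnt/slave-gfid
stat /mnt/slave-gfid/.gfid/00201812-179b-400e-a6f2-b332bf72c7de    # gfid from the rsync error above; ENOENT would confirm the mkdir never happened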
Comment 5 Rahul Hinduja 2016-11-11 02:25:02 EST
Logs mentioned in comment 4 are for a different issue (possibly related, but it appears distinct), where the crawl moved to Changelog after restarting geo-rep, yet it still misses data. (log date: 2016-11-09)

The original issue was the Hybrid crawl being stuck. (log date: 2016-11-07)
Comment 17 Amar Tumballi 2018-10-11 06:10:19 EDT
This has not been worked on in the last 2 years, and there is no plan to pick it up in any immediate release.

An RCA is available; it was agreed long back that this is not a code issue, and it has been documented as a Known Issue.

Please reopen the bug if it is found to be important for any release.
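
For reference, the documented workaround amounts to not creating a geo-rep session until the rebalance completes; a minimal check, with illustrative volume and host names:

gluster volume rebalance master status    # wait until every node reports "completed"
gluster volume geo-replication master slave-host::slave create push-pem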
