Description of problem: ======================= I have data set of 4.5G with 6Lac files in it. Started rebalance and created geo-rep session. It picked up Hybrid crawl correctly but after a while the sync stopped. Following are the last logs at geo-rep: [2016-11-07 10:47:25.684036] I [master(/rhs/brick1/b1):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0) [2016-11-07 10:47:25.699337] I [master(/rhs/brick2/b5):519:crawlwrap] _GMaster: crawl interval: 60 seconds [2016-11-07 10:47:25.775741] I [master(/rhs/brick2/b5):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0) [2016-11-07 10:47:25.796564] I [master(/rhs/brick3/b9):519:crawlwrap] _GMaster: crawl interval: 60 seconds [2016-11-07 10:47:25.911012] I [master(/rhs/brick3/b9):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0) [2016-11-07 10:47:43.776889] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%V2QR9QLZP0: ENOENT [2016-11-07 10:47:43.777194] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%W8RDJO0C0J: ENOENT [2016-11-07 10:47:43.777380] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%CA4KDL5VY9: ENOENT [2016-11-07 10:47:43.777580] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%9243ET58Q3: ENOENT [2016-11-07 10:47:43.777765] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%IW26Q6YJTP: ENOENT [2016-11-07 10:47:43.777935] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%TAVDD5GI7T: ENOENT [2016-11-07 10:47:43.778105] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%XZ68HFOKP7: ENOENT [2016-11-07 10:47:43.778285] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%Q1586CIEV8: ENOENT [2016-11-07 10:47:43.778450] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%G3ADPJ93YG: ENOENT [2016-11-07 10:47:43.778644] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%TONI1KSYGJ: ENOENT [2016-11-07 10:47:43.778808] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%2730385I98: ENOENT [2016-11-07 10:52:28.307632] I [master(/rhs/brick3/b9):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/7276 cecc9d436837c6dcbfda8f8e2d3b/xsync/XSYNC-CHANGELOG.1478515645 [2016-11-07 10:52:58.265819] I [master(/rhs/brick2/b5):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/1c53 557c54e8d43b89fcc316baed04a9/xsync/XSYNC-CHANGELOG.1478515645 [2016-11-07 10:53:46.210700] I [master(/rhs/brick1/b1):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c19b 89ac45352ab8c894d210d136dd56/xsync/XSYNC-CHANGELOG.1478515645 [2016-11-07 11:17:33.266061] W [master(/rhs/brick3/b9):1058:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1478515645 [2016-11-07 11:22:13.797106] I [master(/rhs/brick3/b9):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/7276 cecc9d436837c6dcbfda8f8e2d3b/xsync/XSYNC-CHANGELOG.1478515948 [2016-11-07 11:23:06.200653] W [master(/rhs/brick2/b5):1058:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1478515645 [2016-11-07 11:28:43.559046] I [master(/rhs/brick2/b5):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/1c53 557c54e8d43b89fcc316baed04a9/xsync/XSYNC-CHANGELOG.1478515979 Version-Release number of selected component (if applicable): ============================================================= glusterfs-3.8.4-3.el7rhgs.x86_64 Steps Carried: ============== 1. Create 4 node cluster and create 6x2 master volume 2. Create 2 node cluster and create 2x2 slave volume 3. Mount the master volume to Fuse client and create Huge data set (6L small files) crefi -T 10 -n 1000 --multi -d 10 -b 5 --random --max=5K --min=1K --f=create /mnt/master/ 4. Add 2 more nodes to the master cluster 5. Add bricks from these nodes to convert it into 7x2 6. Start the rebalance on master node 7. While rebalance is in progress, setup geo-rep (enabling shared volume, gsec_create, push-pem, config, start) 8. After geo-rep start, it picks up Hybrid crawl but stucks after syncing a while Following are strace: ===================== [root@dhcp37-76 ~]# strace -f -s 500 -p 18196 Process 18196 attached write(2, "rsync: get_acl: sys_acl_get_file(.gfid/4b7bba53-bcc8-41c5-93be-ae2090c242ae, ACL_TYPE_ACCESS): No such file or directory (2)", 124 ^CProcess 18196 detached <detached ...> [root@dhcp37-76 ~]#
Seeing lot of rsync errors about creating directories, that means mkdir failed for some reason and all the files from that directory failed to sync. (Checked Georep logs from *.121 node). Not yet traced the reason for mkdir failure.
Operation not permitted are expected for rsync, rsync expects directory should be present before syncing files. Yet to find the reason for entry failure by entry_ops [2016-11-09 10:09:24.794229] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/00201812-179b-400e-a6f2-b332bf72c7de" failed: Operation not permitted (1) [2016-11-09 10:09:24.794754] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory *** [2016-11-09 10:09:24.794974] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/01c699d5-8955-4e7e-85ae-b8c2f6dcb9da" failed: Operation not permitted (1) [2016-11-09 10:09:24.795208] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory *** [2016-11-09 10:09:24.795418] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/01d22973-d93b-435b-9d25-f7ec2ebc262a" failed: Operation not permitted (1) [2016-11-09 10:09:24.795630] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory *** [2016-11-09 10:09:24.795819] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/022988cd-b59c-4200-85d1-010343ce23a7" failed: Operation not permitted (1) [2016-11-09 10:09:24.796001] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory *** [2016-11-09 10:09:24.796231] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/0234671b-0b85-48c4-bdd6-e2108805f631" failed: Operation not permitted (1)
Logs mentioned in comment 4 are for different issue(May be related but perceived difference) where crawl is moved to Changelog after restarting the geo-rep. But it still misses the data. (log: 2016-11-09) Original issue was Hybrid crawls being stuck (log: 2016-11-07)
Not worked on this in last 2 years! and not planning to pick up in any immediate releases! RCA available, and agreed long back that it is not an issue, and documented the Known Issue! Please reopen the bug if found important for any release!