Bug 1393362

Summary: [geo-rep]: Sync stuck in Hybrid crawl for 2 days
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED DEFERRED
QA Contact: Rahul Hinduja <rhinduja>
Severity: urgent
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, atumball, avishwan, bmohanra, csaba, rcyriac, rhs-bugs, storage-qa-internal, vnosov
Keywords: ZStream
Hardware: x86_64
OS: Linux
Doc Type: Known Issue
Doc Text:
If a geo-replication session is created while a gluster volume rebalance is in progress, geo-replication may miss syncing some files and directories to the slave volume. This happens because rebalance moves files internally between bricks. Workaround: Do not create a geo-replication session while a rebalance of the master volume is in progress (a sketch of the safe sequence follows this header block).
Last Closed: 2018-10-11 10:10:19 UTC
Type: Bug
Bug Blocks: 1351530
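
The workaround from the Doc Text, as a minimal sketch. Volume and host names are placeholders: the master volume is assumed to be named "master" and the slave volume "slave", served from "slavenode".

# Wait until every node reports "completed" before touching geo-rep
gluster volume rebalance master status

# Only then create and start the session
gluster system:: execute gsec_create
gluster volume geo-replication master slavenode::slave create push-pem
gluster volume geo-replication master slavenode::slave start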

Description Rahul Hinduja 2016-11-09 12:11:58 UTC
Description of problem:
=======================

I have a data set of 4.5 GB containing 600,000 (6 lakh) files. I started a rebalance and then created a geo-rep session. It correctly picked up the Hybrid crawl, but after a while the sync stopped.

Following are the last logs from geo-rep:

[2016-11-07 10:47:25.684036] I [master(/rhs/brick1/b1):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:25.699337] I [master(/rhs/brick2/b5):519:crawlwrap] _GMaster: crawl interval: 60 seconds
[2016-11-07 10:47:25.775741] I [master(/rhs/brick2/b5):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:25.796564] I [master(/rhs/brick3/b9):519:crawlwrap] _GMaster: crawl interval: 60 seconds
[2016-11-07 10:47:25.911012] I [master(/rhs/brick3/b9):1271:crawl] _GMaster: starting hybrid crawl..., stime: (-1, 0)
[2016-11-07 10:47:43.776889] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%V2QR9QLZP0: ENOENT
[2016-11-07 10:47:43.777194] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%W8RDJO0C0J: ENOENT
[2016-11-07 10:47:43.777380] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%CA4KDL5VY9: ENOENT
[2016-11-07 10:47:43.777580] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970d%%9243ET58Q3: ENOENT
[2016-11-07 10:47:43.777765] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%IW26Q6YJTP: ENOENT
[2016-11-07 10:47:43.777935] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%TAVDD5GI7T: ENOENT
[2016-11-07 10:47:43.778105] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%XZ68HFOKP7: ENOENT
[2016-11-07 10:47:43.778285] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%Q1586CIEV8: ENOENT
[2016-11-07 10:47:43.778450] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%G3ADPJ93YG: ENOENT
[2016-11-07 10:47:43.778644] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%TONI1KSYGJ: ENOENT
[2016-11-07 10:47:43.778808] W [master(/rhs/brick1/b1):1398:Xcrawl] _GMaster: irregular xtime for ./thread0/level00/level10/level20/level30/level40/581f970e%%2730385I98: ENOENT
[2016-11-07 10:52:28.307632] I [master(/rhs/brick3/b9):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/7276cecc9d436837c6dcbfda8f8e2d3b/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 10:52:58.265819] I [master(/rhs/brick2/b5):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/1c53557c54e8d43b89fcc316baed04a9/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 10:53:46.210700] I [master(/rhs/brick1/b1):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c19b89ac45352ab8c894d210d136dd56/xsync/XSYNC-CHANGELOG.1478515645
[2016-11-07 11:17:33.266061] W [master(/rhs/brick3/b9):1058:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1478515645
[2016-11-07 11:22:13.797106] I [master(/rhs/brick3/b9):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/7276cecc9d436837c6dcbfda8f8e2d3b/xsync/XSYNC-CHANGELOG.1478515948
[2016-11-07 11:23:06.200653] W [master(/rhs/brick2/b5):1058:process] _GMaster: incomplete sync, retrying changelogs: XSYNC-CHANGELOG.1478515645
[2016-11-07 11:28:43.559046] I [master(/rhs/brick2/b5):1281:crawl] _GMaster: processing xsync changelog /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.37.207%3Agluster%3A%2F%2F127.0.0.1%3Aslave/1c53557c54e8d43b89fcc316baed04a9/xsync/XSYNC-CHANGELOG.1478515979

Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.8.4-3.el7rhgs.x86_64

Steps Carried:
==============
1. Create a 4-node cluster and create a 6x2 master volume
2. Create a 2-node cluster and create a 2x2 slave volume
3. Mount the master volume on a FUSE client and create a huge data set (600,000 small files):
crefi -T 10 -n 1000 --multi -d 10 -b 5 --random --max=5K --min=1K --f=create /mnt/master/
4. Add 2 more nodes to the master cluster
5. Add bricks from these nodes to convert the volume into 7x2
6. Start the rebalance on the master volume
7. While the rebalance is in progress, set up geo-rep (enable the shared volume, gsec_create, push-pem, config, start)
8. After geo-rep starts, it picks up the Hybrid crawl but gets stuck after syncing for a while (see the status checks sketched below)
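
A minimal way to observe the stuck state from step 8, assuming the placeholder names from the sketch above ("master" and "slave" on slavenode) and the usual geo-rep log location; a wedged worker keeps reporting Hybrid Crawl and its brick log stops advancing:

# Per-brick worker status; look for workers stuck in Hybrid Crawl
gluster volume geo-replication master slavenode::slave status detail

# Follow the per-brick worker logs quoted in the description
tail -f /var/log/glusterfs/geo-replication/master/*.log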

Following is the strace output:
=====================

[root@dhcp37-76 ~]# strace -f -s 500 -p 18196
Process 18196 attached
write(2, "rsync: get_acl: sys_acl_get_file(.gfid/4b7bba53-bcc8-41c5-93be-ae2090c242ae, ACL_TYPE_ACCESS): No such file or directory (2)", 124
^CProcess 18196 detached
 <detached ...>
[root@dhcp37-76 ~]#
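
For anyone reproducing this: the PID traced above is an rsync child spawned by the gsyncd worker. A sketch of how such a PID can be located (assuming a single stuck worker; <PID> is a placeholder):

ps -ef | grep '[r]sync'     # rsync children spawned by the gsyncd workers
strace -f -s 500 -p <PID>   # then attach, as above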

Comment 3 Aravinda VK 2016-11-09 14:16:16 UTC
Seeing a lot of rsync errors about creating directories, which means mkdir failed for some reason and all the files under those directories failed to sync (checked geo-rep logs from the *.121 node). The reason for the mkdir failure has not yet been traced.

Comment 4 Aravinda VK 2016-11-09 14:25:08 UTC
The "Operation not permitted" errors are expected from rsync: rsync expects the directories to already be present before syncing files into them, and those entries are created beforehand by entry_ops. Yet to find the reason for the entry failure in entry_ops. (A sketch for checking the missing entries on the slave follows the log excerpt below.)

[2016-11-09 10:09:24.794229] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/00201812-179b-400e-a6f2-b332bf72c7de" failed: Operation not permitted (1)
[2016-11-09 10:09:24.794754] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.794974] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/01c699d5-8955-4e7e-85ae-b8c2f6dcb9da" failed: Operation not permitted (1)
[2016-11-09 10:09:24.795208] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.795418] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/01d22973-d93b-435b-9d25-f7ec2ebc262a" failed: Operation not permitted (1)
[2016-11-09 10:09:24.795630] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.795819] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/022988cd-b59c-4200-85d1-010343ce23a7" failed: Operation not permitted (1)
[2016-11-09 10:09:24.796001] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): *** Skipping any contents from this failed directory ***
[2016-11-09 10:09:24.796231] E [resource(/rhs/brick1/b14):1003:rsync] SSH: SYNC Error(Rsync): rsync: recv_generator: mkdir "/proc/9429/cwd/.gfid/0234671b-0b85-48c4-bdd6-e2108805f631" failed: Operation not permitted (1)

Comment 5 Rahul Hinduja 2016-11-11 07:25:02 UTC
The logs mentioned in comment 4 are for a different issue (possibly related, but observably different), where the crawl moved to Changelog after geo-rep was restarted but data is still missed (log: 2016-11-09).

The original issue is the Hybrid crawl being stuck (log: 2016-11-07).

Comment 17 Amar Tumballi 2018-10-11 10:10:19 UTC
Not worked on in the last 2 years, and there are no plans to pick this up in any immediate release.

An RCA is available; it was agreed long back that this is not a product issue, and it has been documented as a Known Issue.

Please reopen the bug if it is found to be important for any release.