| Summary: | geo-rep+sharding : Checkpoint reports completion, but all files are not copied to slave | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Sahina Bose <sabose> | ||||
| Component: | geo-replication | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | storage-qa-internal <storage-qa-internal> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | rhgs-3.1 | CC: | avishwan, chrisw, csaba, nlevinki, rhinduja, sasundar | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2016-04-26 05:57:52 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1258386 | ||||||
| Attachments: |
Description
Sahina Bose
2016-04-22 14:30:54 UTC
Created attachment 1149770
geo-rep-logs-from-master
RCA:
The issue was caused by deleting the stime xattrs only from the brick root and not from all directories. Hybrid Crawl (the filesystem crawl performed during initial sync) compares xtime and stime. xtime is similar to a modification time and is set by the marker translator on every change. stime is the Slave time, which indicates the time up to which the Slave is in sync with the Master. stime is maintained on the brick root as well as on directories; xtime is available on all files and directories.
/(xtime=10, stime=8)
  d1/(xtime=10, stime=8)
    f1(xtime=10)
    f2(xtime=7)
  d2/(xtime=7, stime=7)
    f3(xtime=6)
    f4(xtime=7)
The directory structure above shows two directories, d1 and d2, with two files in each. Hybrid crawl detects the files and directories that need to be synced by checking xtime > stime. In this example, Hybrid crawl picks only "d1/f1" to be synced.
When we remove the stime xattr from only the brick root, Geo-rep may fail to get the complete list of files/directories for syncing.
/(xtime=10, stime=)
  d1/(xtime=10, stime=8)
    f1(xtime=10)
    f2(xtime=7)
  d2/(xtime=7, stime=7)
    f3(xtime=6)
    f4(xtime=7)
Even after the stime reset, Geo-rep picks only "d1/f1" for syncing, since Geo-rep thinks all the other files are already synced to the Slave.
This xtime > stime comparison is required to avoid re-crawling and re-syncing the same dirs/files when a worker crashes.
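The comparison described above can be modelled with a small Python sketch. This is only an illustration of the xtime > stime rule applied to the example tree, not the actual Geo-rep crawl code; the `Node` class and the `hybrid_crawl` and `build_tree` names are hypothetical, and a missing stime xattr is represented as `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a file or directory on the brick."""
    name: str
    xtime: int
    stime: Optional[int] = None  # only directories carry stime; None = xattr absent
    is_dir: bool = False
    children: List["Node"] = field(default_factory=list)


def needs_sync(xtime: int, stime: Optional[int]) -> bool:
    # With no stime recorded, everything is treated as not yet synced.
    return stime is None or xtime > stime


def hybrid_crawl(directory: Node, prefix: str = "") -> List[str]:
    """Return the file paths the crawl would pick, using xtime > stime."""
    picked: List[str] = []
    # A directory whose xtime is not newer than its own stime is skipped entirely.
    if not needs_sync(directory.xtime, directory.stime):
        return picked
    for child in directory.children:
        path = prefix + child.name
        if child.is_dir:
            picked.extend(hybrid_crawl(child, path + "/"))
        elif needs_sync(child.xtime, directory.stime):
            picked.append(path)
    return picked


def build_tree(root_stime, d1_stime, d2_stime) -> Node:
    """Build the example tree from the description with the given stime values."""
    d1 = Node("d1", 10, d1_stime, True, [Node("f1", 10), Node("f2", 7)])
    d2 = Node("d2", 7, d2_stime, True, [Node("f3", 6), Node("f4", 7)])
    return Node("", 10, root_stime, True, [d1, d2])


print(hybrid_crawl(build_tree(8, 8, 7)))           # original state
print(hybrid_crawl(build_tree(None, 8, 7)))        # stime removed from brick root only
print(hybrid_crawl(build_tree(None, None, None)))  # stime removed everywhere
```

In this model, removing stime from the brick root alone still yields only "d1/f1", because d1 and d2 keep their own stime values. Only when stime is cleared on every directory does the crawl pick all four files.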
Workaround:
-------------------
Delete the stime xattrs from all directories and the brick root if a resync is required.
Possible solution:
-------------------
Instead of deleting the stime xattr, mark the xattr value as -2. Geo-rep should skip the xtime > stime comparison if the brick root stime xattr value is -2. This should be part of the Geo-rep delete command. (BZ 1205162)
Closing this, as this will not be a supported use case. While recovering from the slave, files on the slave should not be deleted.
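A minimal Python sketch of how the proposed -2 marker could behave, assuming a simplified dict-based model of the brick. The names `RESYNC_MARKER`, `crawl`, `hybrid_crawl` and `example_tree` are hypothetical and do not correspond to existing Geo-rep code; the proposal itself was not implemented as part of this bug.

```python
RESYNC_MARKER = -2  # proposed brick root stime value meaning "resync everything"


def crawl(directory, prefix="", force=False):
    """Collect file paths to sync; `force` bypasses the xtime > stime check."""
    picked = []
    if not force and directory["xtime"] <= directory["stime"]:
        return picked  # directory already in sync, skip its whole subtree
    for name, xtime in directory["files"].items():
        if force or xtime > directory["stime"]:
            picked.append(prefix + name)
    for name, sub in directory["dirs"].items():
        picked.extend(crawl(sub, prefix + name + "/", force))
    return picked


def hybrid_crawl(root):
    # Only the brick root is inspected for the marker; when it is present,
    # the stime values on subdirectories are ignored for this crawl.
    return crawl(root, force=(root["stime"] == RESYNC_MARKER))


def example_tree(root_stime):
    """The example tree from the description, with a configurable root stime."""
    return {
        "xtime": 10, "stime": root_stime, "files": {},
        "dirs": {
            "d1": {"xtime": 10, "stime": 8, "dirs": {},
                   "files": {"f1": 10, "f2": 7}},
            "d2": {"xtime": 7, "stime": 7, "dirs": {},
                   "files": {"f3": 6, "f4": 7}},
        },
    }


print(hybrid_crawl(example_tree(8)))              # normal comparison
print(hybrid_crawl(example_tree(RESYNC_MARKER)))  # marker set on brick root
```

With the marker on the brick root, the stale stime values left on d1 and d2 no longer hide files from the crawl, so a single xattr update would be enough to trigger a full resync.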