Bug 1109557
| Field | Value |
|---|---|
| Summary | Dist-geo-rep: after renames on master, the slave has more files than the master when synced through history crawl |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | geo-replication |
| Version | rhgs-3.0 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Vijaykumar Koppad <vkoppad> |
| Assignee | Kotresh HR <khiremat> |
| QA Contact | Bhaskar Bandari <bbandari> |
| CC | aavati, avishwan, bbandari, csaba, david.macdonald, nlevinki, nsathyan, ssamanta, vagarwal, vshankar |
| Target Milestone | --- |
| Target Release | RHGS 3.0.0 |
| Fixed In Version | glusterfs-3.6.0.27-1 |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2014-09-22 19:41:36 UTC |
| Bug Depends On | 1111577 |
Description
Vijaykumar Koppad
2014-06-15 10:50:40 UTC
Created attachment 908894
sosreport of the master and slave nodes.
This issue was easily reproducible before commit 62265f4 was merged. Commit 62265f4 introduces a change in the mknod() path that ignores internal fops. With this change I was unable to reproduce the issue in @vkoppad's test setup with a fair amount of rename/hardlink workload, but for some reason this @ajha is still able to hit it in his setup. Therefore this needs some more debugging for RCA. Will update soon.

s/this @ajha/@ajha/ in comment #3 ;)

I think this bug is not related to capturing internal fops. It is related to a small window where we still do an xsync crawl, and that xsync crawl for some reason tries to sync some already-synced files, which ultimately results in more files on the slave than on the master, because xsync cannot handle renames (see Bug 984591).

We were seeing two issues with @ajha's setup:
1. Sticky-bit files on the mount.
2. Data loss during rename.
And a third could be Vijaykumar's prediction:
3. xsync came in between, and xsync does not handle deletes.

It turns out that after commit 62265f4 (http://review.gluster.org/#/c/8070/), the first issue is fixed.

The data loss during rename was happening because master and slave were deployed as a distributed-replicate (2x2) setup with both replicas on the same node. With this sub-optimal setup, the gsyncds of both replicas become active and both try to sync the changelogs collected on their replica pairs. This is equivalent to a test case where two renames of the same file happen simultaneously, and it results in data loss; there could be a race on simultaneous renames in dht causing the loss. The data loss is not reproducible at all with an optimal setup where the replica bricks fall on different nodes, so there is no data-loss issue in geo-replication itself.

The third issue could have been hit with the steps mentioned above only if the pause exceeded 120 sec. After a pause of more than 120 sec, geo-rep goes faulty on resume and restarts: first with a history crawl, then for a small window with xsync, and then with changelog. So it is possible that during that small xsync window the deletes were missed, hence the larger number of files on the slave mount. The fix for this is being done as a separate patch (http://review.gluster.org/8151) and is tracked by bug 1112238 (https://bugzilla.redhat.com/show_bug.cgi?id=1112238). Once that patch goes in, this will automatically be fixed.

Marking this ON_QA based on the following reasons for the three issues seen:
1. Sticky-bit files on the mount: http://review.gluster.org/#/c/8070/ fixes it.
2. Data loss during rename: this is caused by two gsyncds processing the same changelog entries in parallel from two replicas, an un-optimal setup with both replicas on a single node. I could hit this without geo-replication, and a corresponding dht bug exists: https://bugzilla.redhat.com/show_bug.cgi?id=1117135
3. The chance of xsync coming in between and missing renames/deletes: it is a known issue that xsync does not handle renames/deletes, but the problem of geo-rep falling back to xsync even when history crawl is available is tracked by a different bug, https://bugzilla.redhat.com/show_bug.cgi?id=1112238. I think that is a separate problem and should not block this bug.
Hence moving to ON_QA.

After much discussion, it was decided that this bug should depend on Bug 1112238, as the extra files on the slave volume are specifically seen with the history crawl. Hence moving it back to ASSIGNED and marking it as dependent.
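To make the xsync limitation discussed above concrete, here is a minimal Python sketch of why a filesystem crawl propagates a rename as a bare create, leaving the stale old name behind on the slave, while changelog replay removes it. This is an illustrative model only; the dictionaries, function names, and operation tuples are hypothetical simplifications, not gsyncd's actual code or changelog format.

```python
def crawl_sync(master, slave):
    # Crawl-based sync (like xsync): walks the *current* master tree and
    # copies whatever it finds. It never sees the rename itself, so an
    # old name already present on the slave is never removed.
    for name, data in master.items():
        slave[name] = data

def changelog_sync(changelog, slave):
    # Changelog-based sync: replays recorded operations, so a RENAME
    # entry removes the old name on the slave as well.
    for op in changelog:
        if op[0] == "CREATE":
            _, name, data = op
            slave[name] = data
        elif op[0] == "RENAME":
            _, old, new = op
            slave[new] = slave.pop(old)

# Master state after "f1" was created and then renamed to "f2".
master = {"f2": "contents"}
rename_entry = [("RENAME", "f1", "f2")]

# Two slaves that had already synced "f1" before the rename happened.
slave_crawl = {"f1": "contents"}
slave_changelog = {"f1": "contents"}

crawl_sync(master, slave_crawl)
changelog_sync(rename_entry, slave_changelog)

print(sorted(slave_crawl))      # ['f1', 'f2'] -> extra file, as in this bug
print(sorted(slave_changelog))  # ['f2']       -> matches the master
```

The same reasoning applies to deletes: a crawl of the current tree never learns that a file used to exist, so the slave's copy is never unlinked, which is why the small xsync window after a long pause can leave the slave with more files than the master.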
Verified on the build glusterfs-3.6.0.27-1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHEA-2014-1278.html