Bug 1065332
| Summary: | Directory deletion fails when a replicate sub-volume goes down and comes back up again | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Sachidananda Urs <surs> |
| Component: | distribute | Assignee: | Nithya Balachandran <nbalacha> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | RajeshReddy <rmekala> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.0 | CC: | kramdoss, mzywusko, nbalacha, nlevinki, pkarampu, rgowdapp, smohan, vbellur |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | triaged, dht-fixed | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-06-24 05:06:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sachidananda Urs
2014-02-14 11:23:26 UTC
Please find sosreports here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1065332/

Sachidananda, could you change the sosreports' permissions so I can access them? Thanks in advance, Krutika

Krutika Dhananjay, I've changed the permissions. You're welcome, Sachidananda.

I am able to recreate this bug consistently, even with a glusterfs untar on the mount point as the method of creating data on a 2x2 volume.

ROOT CAUSE ANALYSIS:

What rm -rf does in a nutshell:
------------------------------
As part of rm -rf on the mount point, readdirs are first performed starting from the root (STEP-0), regular files under the directories are unlinked (STEP-1), and then rmdir is performed on the directories themselves (STEP-2).

How DHT does rmdir:
------------------
DHT performs rmdir by first winding the RMDIR FOP on all but the hashed sub-volume of the directory concerned. Once that is done, the RMDIR is finally wound on the hashed sub-volume.

Observations:
------------
What Pranith and I observed was that there were a few directories (for instance /glusterfs-3.5qa2/contrib/libexecinfo, /glusterfs-3.5qa2/contrib/rbtree, etc.) whose cached sub-volume happened to be the replicate xlator that was not in quorum. In this case, dht_rmdir() on these directories was failing with EROFS (as expected). Despite seeing this error, after STEP-1 DHT still goes ahead and winds an RMDIR on the hashed sub-volume. The result: the directory is removed from the hashed sub-volume but is still present on the remaining sub-volumes of DHT.

After bringing the downed brick back up (that is, after quorum is restored), when rm -rf is attempted again, READDIRPs are issued on the directories as part of STEP-0. dht_readdirp() works by taking into account only those directory entries whose hashed sub-volume is the same as the sub-volume on which the current readdirp was performed.
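The rmdir ordering described above, and the inconsistency it leaves behind, can be sketched in Python. This is a hypothetical model, not GlusterFS source: `Subvol`, `dht_rmdir_buggy`, and `dht_rmdir_fixed` are illustrative names, and the "fixed" variant only models the behavior a fix would need (abort before touching the hashed sub-volume), not the actual patch under review.

```python
# Hypothetical model of DHT's rmdir sequence; not GlusterFS source.
# A read-only Subvol stands in for a replicate pair that lost quorum
# and answers write FOPs with EROFS.

EROFS = "EROFS"

class Subvol:
    def __init__(self, name, dirs, read_only=False):
        self.name = name
        self.dirs = set(dirs)        # directories present on this sub-volume
        self.read_only = read_only   # models the out-of-quorum replicate xlator

    def rmdir(self, path):
        if self.read_only:
            return EROFS             # write refused without quorum
        self.dirs.discard(path)
        return None                  # success

def dht_rmdir_buggy(path, subvols, hashed):
    # First, wind RMDIR on all but the hashed sub-volume.
    errors = [err for sv in subvols if sv is not hashed
              if (err := sv.rmdir(path)) is not None]
    # BUG: the EROFS error is ignored, and RMDIR is wound on the hashed
    # sub-volume anyway, so the directory survives only on the cached side.
    hashed.rmdir(path)
    return errors

def dht_rmdir_fixed(path, subvols, hashed):
    errors = [err for sv in subvols if sv is not hashed
              if (err := sv.rmdir(path)) is not None]
    if errors:
        return errors                # leave the hashed copy intact on failure
    hashed.rmdir(path)
    return errors

hashed = Subvol("replicate-0", {"contrib/libexecinfo"})
cached = Subvol("replicate-1", {"contrib/libexecinfo"}, read_only=True)
dht_rmdir_buggy("contrib/libexecinfo", [hashed, cached], hashed)
print("contrib/libexecinfo" in hashed.dirs,   # False: gone from hashed subvol
      "contrib/libexecinfo" in cached.dirs)   # True: still on cached subvol
```

After the buggy sequence the directory exists on the cached sub-volume but not on the hashed one, which is exactly the inconsistent state that the next rm -rf trips over.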
In this example, READDIRP on the parent of the directories libexecinfo and rbtree (i.e. /glusterfs-3.5qa2/contrib) returned no entries (barring . and ..) on the hashed sub-volume, and the names 'libexecinfo' and 'rbtree' from the cached sub-volumes. Since these entries were found on the cached sub-volumes alone, dht_readdirp() ignores them and treats the parent directory as empty. This causes a subsequent RMDIR on the parent to eventually fail with ENOTEMPTY.

I will try the same test case on an NFS mount point and update the bug with the RCA.

Two updates:

1. I tried the same test case on an NFS mount 3 times, with the same result: I got the same error as on the fuse mount - ENOTEMPTY. The root cause of this behavior is the same as the one described in comment #5.

2. It turns out Susant had already sent a patch for dht_rmdir() in April, which fixes this issue and is currently under review: http://review.gluster.org/#/c/7460/. I applied this patch and ran the test again, and everything worked fine.

Assigning bug to Susant/dht-component as per https://bugzilla.redhat.com/show_bug.cgi?id=1065332#c6

Triage update: need to refresh http://review.gluster.org/#/c/7460/ and test.

This should have been fixed by http://review.gluster.org/#/c/14060/. We will need to retest on RHGS 3.1.3 and confirm.

The issue reported is no longer seen in the 3.1.3 build. I tried the test mentioned in the steps to reproduce a couple of times; rm -rf deletes all directories as expected.
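The dht_readdirp() filtering that hides the surviving entries can also be sketched in Python. Again a hypothetical model, not GlusterFS source: `hashed_subvol` is a stand-in for the real DHT layout hash, and the listings dict is an assumed representation of the per-sub-volume readdirp results.

```python
# Hypothetical model of dht_readdirp() entry filtering; not GlusterFS
# source. An entry is kept only when it is read back from its own hashed
# sub-volume, so names surviving only on a cached sub-volume vanish
# from the aggregated listing.

def hashed_subvol(name, subvols):
    # Stand-in for the DHT layout hash (the real code hashes the name
    # against the parent directory's layout ranges).
    return subvols[sum(name.encode()) % len(subvols)]

def dht_readdirp(listings, subvols):
    # listings: {subvol: [names returned by readdirp on that subvol]}
    return [name
            for sv, names in listings.items()
            for name in names
            if hashed_subvol(name, subvols) == sv]  # keep "own" entries only

subvols = ["subvol-0", "subvol-1"]
hashed = hashed_subvol("libexecinfo", subvols)
cached = next(sv for sv in subvols if sv != hashed)

# State after the failed rm -rf: the hashed copies of libexecinfo and
# rbtree are gone; only the cached sub-volume still lists them.
listings = {hashed: [], cached: ["libexecinfo", "rbtree"]}
print(dht_readdirp(listings, subvols))   # [] -> parent looks empty
```

Because the aggregated listing comes back empty, rm -rf skips the (invisible) children and issues RMDIR on the parent directly, which the bricks that still hold the entries reject with ENOTEMPTY.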