Bug 1121099
Summary: | DHT: Accessing a directory fails with ENOENT on nfs mount after add-brick | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | shylesh <shmohan> |
Component: | distribute | Assignee: | Susant Kumar Palai <spalai> |
Status: | CLOSED ERRATA | QA Contact: | shylesh <shmohan> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | rhgs-3.0 | CC: | nbalacha, nsathyan, rhs-bugs, spalai, srangana, ssaha, ssamanta |
Target Milestone: | --- | ||
Target Release: | RHGS 3.0.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-3.6.0.28-1 | Doc Type: | Bug Fix |
Doc Text: |
When the cluster topology changes due to add-brick, not all DHT subvolumes contain the existing directories until a rebalance completes. Until the rebalance is run, if a caller bypasses lookup and calls access directly using saved/cached inode information (as the NFS server does), dht_access misreads the resulting error (ESTALE/ENOENT) from the newly added subvolumes and incorrectly treats the inode as a file. This corrupts DHT's in-memory directory state, and the corruption does not heal even after a rebalance.
The problem is fixed in dht_access, preventing DHT from misrepresenting a directory as a file in the case described above.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2014-09-22 19:44:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1125958 |
Description
shylesh
2014-07-18 10:58:32 UTC
Observations: Able to reproduce this with ease. Started with a 2-brick setup and moved to 3 -> rebalance -> 4 -> rebalance, and was able to see the problem after the add-brick and before the rebalance, since the directory only gets created on the newly added brick after the rebalance.

Steps to create the issue seem basic: create the volume, add some files and a directory (with some files inside this directory as well), all from the NFS mount point. Also list all entries from the mount point before changing the graph with an add-brick. Post add-brick, listing any entry inside the created directory returns ESTALE.

The issue seems to be that we get a dht_access that determines local->cached_subvol as the newly added brick. This comes from dht_local_init, where local->cached_subvol is taken from layout->list[0].xlator. On sending an access to this brick we get an ESTALE error, as the directory does not exist there (obviously), and then we decide to do a dht_rebalance_complete_check in dht_access_cbk, where the discover happens. BUT this syncop is sent to the one subvolume which is the cached_subvol, and as a result gets ESTALE again in the discover part of the lookup (as this is from NFS we get nameless lookups).

Overall, the starting point seems to be determining the cached_subvol incorrectly. This appears to be due to the layout sorting that puts the 0-0 layout first (the newly added brick), so the calls end up there.

Tested with the following patch to change the sorting of layouts that are 0-0 (see the standalone sketch further below):

```diff
--- a/xlators/cluster/dht/src/dht-layout.c
+++ b/xlators/cluster/dht/src/dht-layout.c
@@ -483,10 +483,10 @@ dht_layout_entry_cmp (dht_layout_t *layout, int i, int j)
 {
         int64_t diff = 0;
 
-        /* swap zero'ed out layouts to front, if needed */
+        /* swap zero'ed out layouts to back, if needed */
         if (!layout->list[j].start && !layout->list[j].stop) {
-                diff = (int64_t) layout->list[i].stop
-                       - (int64_t) layout->list[j].stop;
+                diff = (int64_t) layout->list[j].stop
+                       - (int64_t) layout->list[i].stop;
                 goto out;
         }
 
         diff = (int64_t) layout->list[i].start
```

The result was that the problem was no longer seen. Need to understand what layout->list[0] should be for directories. Should this not be the hashed subvolume (where determining this is possible)? For files this would be the cached subvol, I think.

Some questions on this bug:

When I issue "cd dir", dht_access gets triggered from the NFS mount. Here is a small picture of the dht_access function:

```c
        if ((op_ret == -1) && (op_errno == ENOTCONN) &&
            IA_ISDIR(local->loc.inode->ia_type)) {
                subvol = dht_subvol_next_available (this, prev->this);
                if (!subvol)
                        goto out;

                /* check if we are done with visiting every node */
                if (subvol == local->cached_subvol) {
                        goto out;
                }

                STACK_WIND (frame, dht_access_cbk, subvol, subvol->fops->access,
                            &local->loc, local->rebalance.flags, NULL);
                return 0;
        }

        if ((op_ret == -1) && dht_inode_missing(op_errno)) {
                /* File would be migrated to other node */
                local->op_errno = op_errno;
                local->rebalance.target_op_fn = dht_access2;
                ret = dht_rebalance_complete_check (frame->this, frame);
                if (!ret)
                        return 0;
        }
```

Behaviour: as the cached subvol may not have the directory entry (for many reasons), dht_access fails.

Q1: In case a directory is not present on one of the subvols, why should we not continue checking the accessibility of the directory on the other subvols? Instead we call dht_rebalance_complete_check.
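To make the tested sort order concrete, here is a minimal standalone sketch (plain C, compilable on its own). The struct and names are simplified stand-ins, not the real dht_layout_t or dht_layout_entry_cmp; it only demonstrates the idea of pushing zeroed-out (0-0) ranges to the back so that list[0] never ends up being the empty, newly added brick.

```c
/* Standalone sketch of the sorting behaviour discussed above.
 * "struct range" and "range_cmp" are hypothetical stand-ins for the
 * real DHT layout types; the comparator mirrors the patched logic so
 * that zeroed-out (0-0) ranges sort to the back instead of the front. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct range {
        uint32_t    start;
        uint32_t    stop;
        const char *subvol;   /* which brick/subvolume owns this range */
};

/* qsort comparator: zeroed ranges (newly added brick, no layout yet)
 * are pushed to the end; everything else sorts by start offset. */
static int
range_cmp (const void *a, const void *b)
{
        const struct range *x = a;
        const struct range *y = b;

        int x_zero = (x->start == 0 && x->stop == 0);
        int y_zero = (y->start == 0 && y->stop == 0);

        if (x_zero != y_zero)
                return x_zero ? 1 : -1;   /* zeroed entry goes last */

        if (x->start < y->start)
                return -1;
        return (x->start > y->start);
}

int
main (void)
{
        struct range layout[] = {
                { 0,          0,          "newbrick" },  /* added, not rebalanced */
                { 0x80000000, 0xffffffff, "brick2"   },
                { 0x00000000, 0x7fffffff, "brick1"   },
        };

        qsort (layout, 3, sizeof (layout[0]), range_cmp);

        /* layout[0] is now a populated range, so code that blindly picks
         * list[0] as the cached subvolume no longer lands on the empty,
         * newly added brick. */
        for (int i = 0; i < 3; i++)
                printf ("%s: 0x%08x - 0x%08x\n", layout[i].subvol,
                        (unsigned) layout[i].start, (unsigned) layout[i].stop);
        return 0;
}
```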
dht_rebalance_complete_check:

```c
        /* getxattr on cached_subvol for 'linkto' value. Do path based getxattr
         * as root:root. If a fd is already open, access check wont be done */

        if (!local->loc.inode) {
                ret = syncop_fgetxattr (src_node, local->fd, &dict,
                                        conf->link_xattr_name);
        } else {
                SYNCTASK_SETID (0, 0);
                ret = syncop_getxattr (src_node, &local->loc, &dict,
                                       conf->link_xattr_name);
                SYNCTASK_SETID (frame->root->uid, frame->root->gid);
        }
```

Q2: We unconditionally fetch the linkto value, assuming the inode represents a regular file that might be under migration. Why?

Shyam,

For the above bug we can mostly handle it the following way [just the starting point of the fix]. I tested this and was not able to hit the issue:

```c
        if ((op_ret == -1) &&
            (op_errno == ENOTCONN || op_errno == ESTALE) &&
            IA_ISDIR(local->loc.inode->ia_type)) {
                subvol = dht_subvol_next_available (this, prev->this);
                if (!subvol)
                        goto out;
        }
```

The intention is to iterate over the next subvols, assuming some subvolumes do have the directory (a standalone sketch of this idea appears at the end of this report). Further investigation is needed to find out whether this has any relation to the Splunk issue.

@Susant, Comment #6 is right. We probably need to add ENOENT as well to the list of errors for which we will check the other DHT subvolumes for the directory. I am assuming Splunk is hitting things along the same lines, based on the available data in the log files and Avati's debugging of the same. It would/should have hit this path, and then we get into some indeterminate state in between that causes the issue, is my understanding. Although we need to tie the knots in order before making that claim :)

I think this is too important a piece of functionality to be pushed off to a z-stream release. I would consider this a blocker for RHSS 3.0.

Fix submitted here: http://review.gluster.org/8462
Fixed the issue as identified in the various comments of this bug.

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/30795/
Upstream patch: http://review.gluster.org/#/c/8462/

Verified on glusterfs-3.6.0.28-1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html
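For reference, a minimal standalone sketch of the iterate-over-subvolumes idea discussed in the comments above. This is not the actual patch at http://review.gluster.org/8462; all types and function names (subvol_t, access_on_subvol, dir_access) are hypothetical stand-ins. The point it illustrates: directory access tolerates ESTALE/ENOENT/ENOTCONN from individual subvolumes (such as a freshly added brick before rebalance) and only fails if no subvolume knows the directory.

```c
/* Standalone simulation of the error handling the fix moves toward:
 * on ENOENT/ESTALE/ENOTCONN for a directory, keep trying the remaining
 * subvolumes instead of assuming the inode is a file under migration. */
#include <stdio.h>
#include <errno.h>

typedef struct {
        const char *name;
        int         has_dir;   /* does this brick have the directory yet? */
} subvol_t;

/* Pretend access() on one subvolume: ESTALE if the directory is missing. */
static int
access_on_subvol (const subvol_t *sv, const char *path)
{
        (void) path;
        return sv->has_dir ? 0 : -ESTALE;
}

/* Directory access across a DHT-like volume: tolerate ESTALE/ENOENT/ENOTCONN
 * from individual subvolumes as long as at least one still has the directory. */
static int
dir_access (const subvol_t *subvols, int count, const char *path)
{
        int last_err = -ENOENT;

        for (int i = 0; i < count; i++) {
                int ret = access_on_subvol (&subvols[i], path);
                if (ret == 0)
                        return 0;                 /* directory found: success */
                if (ret == -ESTALE || ret == -ENOENT || ret == -ENOTCONN) {
                        last_err = ret;           /* try the next subvolume */
                        continue;
                }
                return ret;                       /* real error: give up */
        }
        return last_err;
}

int
main (void)
{
        subvol_t subvols[] = {
                { "newbrick", 0 },   /* added brick, rebalance not yet run */
                { "brick1",   1 },
                { "brick2",   1 },
        };

        int ret = dir_access (subvols, 3, "/dir");
        printf ("access(/dir) -> %s\n", ret == 0 ? "OK" : "failed");
        return 0;
}
```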