Description of problem: After doing an add-brick , if we try to access existing directories from NFS mount, operation fails with ENOENT Version-Release number of selected component (if applicable): 3.6.0.24-1.el6_5.x86_64 How reproducible: always Steps to Reproduce: 1. create a 2 brick distribute volume 2. do nfs mount and create some directories 3. do add-brick and try to cd or find on the existing directories Actual results: operation fails with ENOENT "no such file or directory" Additional info: This is because lookup call goes to the newly added directory but nfs doesn't do named lookups so dht_selfheal doesn't heal the directories on the new bricks hence the error. (gdb) p *loc $1 = {path = 0x2f403f0 "<gfid:cc37ad96-b3a8-470c-9dc4-b659c034fbb7>", name = 0x0, inode = 0x7f8e5a61a184, parent = 0x0, gfid = "\314\067\255\226\263\250G\f\235ĶY\300\064", <incomplete sequence \373\267>, pargfid = '\000' <repeats 15 times>} in the code name is 0x0 so lookup fails on the new bricks [2014-07-18 09:52:29.152516] W [client-rpc-fops.c:2758:client3_3_lookup_cbk] 0-alto-client-12: remote operation failed: No such file or directory. Path: <gfid:cc37ad96 -b3a8-470c-9dc4-b659c034fbb7> (cc37ad96-b3a8-470c-9dc4-b659c034fbb7) [2014-07-18 09:52:29.152575] W [client-rpc-fops.c:2758:client3_3_lookup_cbk] 0-alto-client-13: remote operation failed: No such file or directory. Path: <gfid:cc37ad96 -b3a8-470c-9dc4-b659c034fbb7> (cc37ad96-b3a8-470c-9dc4-b659c034fbb7) cluster info ============ [root@rhs-client4 alto12]# gluster v info alto Volume Name: alto Type: Distribute Volume ID: 515d1a3a-6d94-4322-8baa-50e28eeee5b3 Status: Started Snap Volume: no Number of Bricks: 14 Transport-type: tcp Bricks: Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/alto0 Brick2: rhs-client39.lab.eng.blr.redhat.com:/home/alto1 Brick3: rhs-client4.lab.eng.blr.redhat.com:/home/alto2 Brick4: rhs-client39.lab.eng.blr.redhat.com:/home/alto3 Brick5: rhs-client4.lab.eng.blr.redhat.com:/home/alto4 Brick6: rhs-client39.lab.eng.blr.redhat.com:/home/alto5 Brick7: rhs-client4.lab.eng.blr.redhat.com:/home/alto6 Brick8: rhs-client39.lab.eng.blr.redhat.com:/home/alto7 Brick9: rhs-client4.lab.eng.blr.redhat.com:/home/alto8 Brick10: rhs-client39.lab.eng.blr.redhat.com:/home/alto9 Brick11: rhs-client4.lab.eng.blr.redhat.com:/home/alto10 Brick12: rhs-client39.lab.eng.blr.redhat.com:/home/alto11 Brick13: rhs-client4.lab.eng.blr.redhat.com:/home/alto12 Brick14: rhs-client39.lab.eng.blr.redhat.com:/home/alto13 Options Reconfigured: performance.readdir-ahead: on snap-max-hard-limit: 256 snap-max-soft-limit: 90 auto-delete: disable attaching the sosreports
Observations: Able to reproduce this with ease, started with a 2 brick setup and moved to 3->rebalance->4->rebalance and was able to see the problem after add brick and before rebalance as the directory gets created after rebalance in the newly added brick. Steps to create the issue seems basic, create the volume and add some files and a directory and some files inside this directory as well. All from the NFS mount point. Also list all entries from the mount point before changing the graph by an add-brick. Post add brick try listing any entry inside the created directory and ESTALE. The issue seems to be that we get a dht_access, that determines the local->cached_subvol as the newly added brick. This comes from dht_local_init where local->cached_subvol is taken from layout->list[0].xlator. On sending a access to this brick, we get an ESTALE error, as the directory does not exist here (obviously) and then we decide to do a dht_rebalance_complete_check in dht_access_cbk where the discover happens, BUT this syncop is sent to the one subvolume which is the cached_subvol as a result gets ESTALE again in discover part of the lookup (as this is from NFS we get nameless lookups). Overall, the start point seems to be determining the cached_subvol incorrectly. This seems to be due to some layout sorting that puts the 0-0 layout first (the newly added brick) and so the calls end up here.
Tested with the following patch to change the sorting of layouts that are 0-0 --- a/xlators/cluster/dht/src/dht-layout.c +++ b/xlators/cluster/dht/src/dht-layout.c @@ -483,10 +483,10 @@ dht_layout_entry_cmp (dht_layout_t *layout, int i, int j) { int64_t diff = 0; - /* swap zero'ed out layouts to front, if needed */ + /* swap zero'ed out layouts to back, if needed */ if (!layout->list[j].start && !layout->list[j].stop) { - diff = (int64_t) layout->list[i].stop - - (int64_t) layout->list[j].stop; + diff = (int64_t) layout->list[j].stop + - (int64_t) layout->list[i].stop; goto out; } diff = (int64_t) layout->list[i].start The result was that the problem was not noticed or disappeared. Need to understand what the layout->list[0] should be for directories? Should this not be the hashed subvolume (where determination of this is possible?). For files this would be the cached subvol I think.
Some questions on this bug: When I try to issue "cd dir" then dht_access gets triggered from nfs mount. Now let's get a small picture of dht_access function: ========================================================================== if ((op_ret == -1) && (op_errno == ENOTCONN) && IA_ISDIR(local->loc.inode->ia_type)) { subvol = dht_subvol_next_available (this, prev->this); if (!subvol) goto out; /* check if we are done with visiting every node */ if (subvol == local->cached_subvol) { goto out; } STACK_WIND (frame, dht_access_cbk, subvol, subvol->fops->access, &local->loc, local->rebalance.flags, NULL); return 0; } if ((op_ret == -1) && dht_inode_missing(op_errno)) { /* File would be migrated to other node */ local->op_errno = op_errno; local->rebalance.target_op_fn = dht_access2; ret = dht_rebalance_complete_check (frame->this, frame); if (!ret) return 0; } ================================================ Behaviour: As the cached may not have the dir entry(for many reasons) dht_access fails. Q1: In case a directory is not present on one of the subvol why should not we continue the check of accessibility of the directory on other subvols. Instead we check "dht_rebalance_complete_check". dht_rebalance_complete_check: ============================================ /* getxattr on cached_subvol for 'linkto' value. Do path based getxattr * as root:root. If a fd is already open, access check wont be done*/ if (!local->loc.inode) { ret = syncop_fgetxattr (src_node, local->fd, &dict, conf->link_xattr_name); } else { SYNCTASK_SETID (0, 0); ret = syncop_getxattr (src_node, &local->loc, &dict, conf->link_xattr_name); SYNCTASK_SETID (frame->root->uid, frame->root->gid); } Q2. We unconditionally want to fetch the linkto value assuming the inode represents a regular file and might be under migration. Why ? =========================================================================== Shyam, Mostly for the above bug we can handle the following way[Just the starting point of the fixing]. I tested and was not able to hit the issue. if ((op_ret == -1) && (op_errno == ENOTCONN || op_errno == ENOTCONN || op_errno == ESTALE) && IA_ISDIR(local->loc.inode->ia_type)) { subvol = dht_subvol_next_available (this, prev->this); if (!subvol) goto out; } The intention is to iterate over the next subvols assuming some subvolumes have directories. Need further investigation to find whether this has any relation with the splunk issue.
@Susant, Comment #6 is right. We probably need to add ENOENT as well to the list of errors where we will check other subvolumes of DHT for the directory. I am assuming Splunk is hitting things along the same lines, based on available data in the log files and Avati's debugging of the same. It would/should have hit this path, and then we get into some indeterinate state in between and causes the issue is my understanding. Although we need to tie the knots in order before the claim :)
I think this is too important a functionality to be pushed off to a z-stream release. I would consider this a blocker for RHSS 3.0
Fix submitted here, http://review.gluster.org/8462 Fixed issue as identified in the various comments of this bug.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/30795/ Upstream patch: http://review.gluster.org/#/c/8462/
verified on glusterfs-3.6.0.28-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html