Bug 1121099

Summary: DHT: Accessing a directory fails with ENOENT on nfs mount after add-brick
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: distribute
Assignee: Susant Kumar Palai <spalai>
Status: CLOSED ERRATA
QA Contact: shylesh <shmohan>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.0
CC: nbalacha, nsathyan, rhs-bugs, spalai, srangana, ssaha, ssamanta
Target Milestone: ---
Target Release: RHGS 3.0.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.6.0.28-1
Doc Type: Bug Fix
Doc Text:
When the cluster topology changes due to add-brick, not all DHT subvolumes contain the existing directories until a rebalance completes. Until the rebalance is run, if a caller bypasses lookup and calls access using saved/cached inode information (as the NFS server does), dht_access misreads the error (ESTALE/ENOENT) returned by the new subvolumes and incorrectly handles the inode as a file. This corrupts the in-memory directory state in DHT, which does not heal even after a rebalance. The problem is fixed in dht_access, preventing DHT from misrepresenting a directory as a file in this case.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-09-22 19:44:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1125958    

Description shylesh 2014-07-18 10:58:32 UTC
Description of problem:
After doing an add-brick, accessing existing directories from an NFS mount fails with ENOENT.


Version-Release number of selected component (if applicable):
3.6.0.24-1.el6_5.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a 2-brick distribute volume.
2. Mount it over NFS and create some directories.
3. Do an add-brick, then try to cd into or run find on the existing directories.

Actual results:

The operation fails with ENOENT ("No such file or directory").
 

Additional info:
This is because the lookup goes to the newly added bricks, but NFS does not do named lookups, so dht_selfheal does not heal the directories onto the new bricks; hence the error.
(gdb) p *loc
$1 = {path = 0x2f403f0 "<gfid:cc37ad96-b3a8-470c-9dc4-b659c034fbb7>", name = 0x0, inode = 0x7f8e5a61a184, parent = 0x0,
  gfid = "\314\067\255\226\263\250G\f\235ĶY\300\064", <incomplete sequence \373\267>, pargfid = '\000' <repeats 15 times>}



In the loc above, name is 0x0 (NULL), so the lookup fails on the new bricks:

[2014-07-18 09:52:29.152516] W [client-rpc-fops.c:2758:client3_3_lookup_cbk] 0-alto-client-12: remote operation failed: No such file or directory. Path: <gfid:cc37ad96
-b3a8-470c-9dc4-b659c034fbb7> (cc37ad96-b3a8-470c-9dc4-b659c034fbb7)
[2014-07-18 09:52:29.152575] W [client-rpc-fops.c:2758:client3_3_lookup_cbk] 0-alto-client-13: remote operation failed: No such file or directory. Path: <gfid:cc37ad96
-b3a8-470c-9dc4-b659c034fbb7> (cc37ad96-b3a8-470c-9dc4-b659c034fbb7)
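
For illustration only, a minimal standalone sketch (a simplified stand-in for loc_t, not gluster code) of what this gfid-only, nameless loc amounts to: with name and parent unset, DHT cannot issue the named lookup that would let dht_selfheal create the directory on the new bricks.

/* Simplified stand-in for loc_t, for illustration only (not gluster code). */
#include <stdio.h>

typedef struct {
        const char    *path;      /* "<gfid:...>" for nameless lookups        */
        const char    *name;      /* NULL when the client revalidates by gfid */
        void          *parent;    /* NULL as well: no parent inode known      */
        unsigned char  gfid[16];
        unsigned char  pargfid[16];
} demo_loc_t;

int
main (void)
{
        demo_loc_t nfs_loc = {
                .path   = "<gfid:cc37ad96-b3a8-470c-9dc4-b659c034fbb7>",
                .name   = NULL,
                .parent = NULL,
        };

        /* Without name/parent only a gfid-based (nameless) lookup is possible,
         * so dht_selfheal never creates the directory on the new bricks. */
        printf ("named lookup possible: %s\n",
                (nfs_loc.name && nfs_loc.parent) ? "yes" : "no");
        return 0;
}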

cluster info
============
[root@rhs-client4 alto12]# gluster v info alto
 
Volume Name: alto
Type: Distribute
Volume ID: 515d1a3a-6d94-4322-8baa-50e28eeee5b3
Status: Started
Snap Volume: no
Number of Bricks: 14
Transport-type: tcp
Bricks:
Brick1: rhs-client4.lab.eng.blr.redhat.com:/home/alto0
Brick2: rhs-client39.lab.eng.blr.redhat.com:/home/alto1
Brick3: rhs-client4.lab.eng.blr.redhat.com:/home/alto2
Brick4: rhs-client39.lab.eng.blr.redhat.com:/home/alto3
Brick5: rhs-client4.lab.eng.blr.redhat.com:/home/alto4
Brick6: rhs-client39.lab.eng.blr.redhat.com:/home/alto5
Brick7: rhs-client4.lab.eng.blr.redhat.com:/home/alto6
Brick8: rhs-client39.lab.eng.blr.redhat.com:/home/alto7
Brick9: rhs-client4.lab.eng.blr.redhat.com:/home/alto8
Brick10: rhs-client39.lab.eng.blr.redhat.com:/home/alto9
Brick11: rhs-client4.lab.eng.blr.redhat.com:/home/alto10
Brick12: rhs-client39.lab.eng.blr.redhat.com:/home/alto11
Brick13: rhs-client4.lab.eng.blr.redhat.com:/home/alto12
Brick14: rhs-client39.lab.eng.blr.redhat.com:/home/alto13
Options Reconfigured:
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable



attaching the sosreports

Comment 3 Shyamsundar 2014-07-31 22:13:44 UTC
Observations:

Able to reproduce this with ease: started with a 2-brick setup and moved to 3 bricks -> rebalance -> 4 bricks -> rebalance, and the problem appeared after the add-brick and before the rebalance, since the directory is only created on the newly added brick after the rebalance.

The steps to create the issue are basic: create the volume, add some files and a directory (with some files inside it as well), all from the NFS mount point. Also list all entries from the mount point before changing the graph with an add-brick.

Post add-brick, trying to list any entry inside the created directory returns ESTALE.

The issue seems to be that we get a dht_access that determines local->cached_subvol to be the newly added brick. This comes from dht_local_init, where local->cached_subvol is taken from layout->list[0].xlator.

On sending an access to this brick, we get an ESTALE error, as the directory does not exist there (obviously). We then decide to do a dht_rebalance_complete_check in dht_access_cbk, where the discover happens, BUT this syncop is sent to the one subvolume that is the cached_subvol, and as a result it gets ESTALE again in the discover part of the lookup (as this comes from NFS, we get nameless lookups).

Overall, the starting point seems to be determining the cached_subvol incorrectly. This appears to be due to the layout sorting that puts the 0-0 layout (the newly added brick) first, so the calls end up there.
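
To make the suspected ordering problem concrete, here is a small standalone sketch (simplified entries and a comparator that only mimics the effect of dht_layout_entry_cmp, not the actual gluster code): the zeroed-out (0-0) range of the newly added brick sorts to the front, so layout->list[0].xlator, and therefore the cached_subvol picked by dht_local_init, is the one brick that does not yet have the directory.

/* Standalone sketch, not gluster code: a comparator that mimics the current
 * behaviour of pushing zeroed-out (0-0) layout ranges to the front. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct entry {
        const char *subvol;
        uint32_t    start;
        uint32_t    stop;
};

static int
cmp (const void *a, const void *b)
{
        const struct entry *i = a, *j = b;

        /* zeroed-out range (newly added brick) compares as "smallest" */
        if (j->start == 0 && j->stop == 0)
                return (i->stop > j->stop) ? 1 : (i->stop < j->stop) ? -1 : 0;
        if (i->start == 0 && i->stop == 0)
                return -1;

        return (i->start > j->start) ? 1 : (i->start < j->start) ? -1 : 0;
}

int
main (void)
{
        struct entry list[] = {
                { "client-0",  0x00000000, 0x7fffffff },
                { "client-1",  0x80000000, 0xffffffff },
                { "client-12", 0,          0          }, /* newly added brick */
        };

        qsort (list, sizeof (list) / sizeof (list[0]), sizeof (list[0]), cmp);

        /* list[0] is now the new brick; dht_local_init would take
         * layout->list[0].xlator as the cached_subvol for the directory. */
        printf ("list[0] = %s\n", list[0].subvol);
        return 0;
}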

Comment 4 Shyamsundar 2014-07-31 22:55:04 UTC
Tested with the following patch to change the sorting of layouts that are 0-0

--- a/xlators/cluster/dht/src/dht-layout.c
+++ b/xlators/cluster/dht/src/dht-layout.c
@@ -483,10 +483,10 @@ dht_layout_entry_cmp (dht_layout_t *layout, int i, int j)
 {
         int64_t diff = 0;
 
-        /* swap zero'ed out layouts to front, if needed */
+        /* swap zero'ed out layouts to back, if needed */
         if (!layout->list[j].start && !layout->list[j].stop) {
-                diff = (int64_t) layout->list[i].stop
-                       - (int64_t) layout->list[j].stop;
+                diff = (int64_t) layout->list[j].stop
+                       - (int64_t) layout->list[i].stop;
                        goto out;
         }
         diff = (int64_t) layout->list[i].start

With this change the problem was no longer observed.

Need to understand what layout->list[0] should be for directories. Should it not be the hashed subvolume (where determining this is possible)?

For files this would be the cached subvol I think.

Comment 6 Susant Kumar Palai 2014-08-11 12:57:07 UTC
Some questions on this bug:

When I issue "cd dir" from the NFS mount, dht_access gets triggered.

Now let's look at a small part of dht_access's error handling:
==========================================================================
if ((op_ret == -1) && (op_errno == ENOTCONN) &&  
            IA_ISDIR(local->loc.inode->ia_type)) {

                subvol = dht_subvol_next_available (this, prev->this);
                if (!subvol)
                        goto out;

                /* check if we are done with visiting every node */
                if (subvol == local->cached_subvol) {
                        goto out;
                }

                STACK_WIND (frame, dht_access_cbk, subvol, subvol->fops->access,
                            &local->loc, local->rebalance.flags, NULL);
                return 0;
        }
        if ((op_ret == -1) && dht_inode_missing(op_errno)) {
                /* File would be migrated to other node */
                local->op_errno = op_errno;
                local->rebalance.target_op_fn = dht_access2;
                ret = dht_rebalance_complete_check (frame->this, frame);
                if (!ret)
                        return 0;
        }


================================================
Behaviour: as the cached subvolume may not have the directory entry (for many reasons), dht_access fails.

Q1: When a directory is not present on one of the subvolumes, why don't we continue checking the accessibility of the directory on the other subvolumes? Instead, we call dht_rebalance_complete_check.

dht_rebalance_complete_check:
============================================
 /* getxattr on cached_subvol for 'linkto' value. Do path based getxattr
         * as root:root. If a fd is already open, access check wont be done*/

        if (!local->loc.inode) {
                ret = syncop_fgetxattr (src_node, local->fd, &dict,
                                        conf->link_xattr_name);
        } else {
                SYNCTASK_SETID (0, 0);
                ret = syncop_getxattr (src_node, &local->loc, &dict,
                                       conf->link_xattr_name);
                SYNCTASK_SETID (frame->root->uid, frame->root->gid);
        }


Q2: We unconditionally fetch the linkto value, assuming the inode represents a regular file that might be under migration. Why?
===========================================================================
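
A standalone sketch of the guard Q2 is pointing at (the helper name is hypothetical, and this is not the actual merged patch): only treat an inode-missing error as a possible file migration when the inode is not a directory; for directories, the other subvolumes should be tried instead.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only. dht_inode_missing() in DHT
 * covers ENOENT/ESTALE; the point here is the !is_dir condition. */
static bool
should_check_migration (int op_ret, int op_errno, bool is_dir)
{
        bool inode_missing = (op_errno == ENOENT || op_errno == ESTALE);

        return (op_ret == -1) && inode_missing && !is_dir;
}

int
main (void)
{
        printf ("check migration for a directory: %d\n",
                should_check_migration (-1, ESTALE, true));   /* 0: retry elsewhere  */
        printf ("check migration for a file:      %d\n",
                should_check_migration (-1, ESTALE, false));  /* 1: may be migrating */
        return 0;
}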

Shyam,
   Mostly, for the above bug, we can handle it the following way [just the starting point of the fix]. I tested this and was not able to hit the issue.

if ((op_ret == -1) && (op_errno == ENOTCONN ||
    op_errno == ESTALE) && IA_ISDIR(local->loc.inode->ia_type)) {

                subvol = dht_subvol_next_available (this, prev->this);
                if (!subvol)
                        goto out;
}

The intention is to iterate over the next subvolumes, assuming some of them do have the directory.
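
As a toy standalone simulation of that intention (not DHT code), the loop winds to the next subvolume until one that holds the directory answers, or until it wraps back to the subvolume it started from:

#include <stdio.h>

#define NSUBVOL 3

int
main (void)
{
        int has_dir[NSUBVOL] = { 1, 1, 0 };  /* subvol 2: newly added brick   */
        int cached = 2;                      /* where the first access failed */
        int found  = -1;

        for (int i = (cached + 1) % NSUBVOL; i != cached; i = (i + 1) % NSUBVOL) {
                if (has_dir[i]) {            /* access succeeds here */
                        found = i;
                        break;
                }
        }

        printf ("access served by subvol %d\n", found);
        return 0;
}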

Need further investigation to find out whether this has any relation to the Splunk issue.

Comment 7 Shyamsundar 2014-08-11 13:29:39 UTC
@Susant, Comment #6 is right. We probably need to add ENOENT as well to the list of errors for which we check the other DHT subvolumes for the directory.
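
A minimal sketch of the resulting error list (the helper name is hypothetical; the actual change went in via the patch linked in comment 9): for directories, each of these errors should mean "try the next DHT subvolume" rather than "assume a migrating file".

#include <errno.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only. */
static int
dir_access_should_try_next_subvol (int op_errno)
{
        return (op_errno == ENOTCONN ||   /* brick unreachable                */
                op_errno == ESTALE   ||   /* gfid not known on that brick     */
                op_errno == ENOENT);      /* directory absent until rebalance */
}

int
main (void)
{
        printf ("retry on ENOENT: %s\n",
                dir_access_should_try_next_subvol (ENOENT) ? "yes" : "no");
        return 0;
}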

I am assuming Splunk is hitting things along the same lines, based on available data in the log files and Avati's debugging of the same.

It would/should have hit this path, and then we get into some indeterminate state in between, which causes the issue, is my understanding. Although we need to tie the knots in order before making that claim :)

Comment 8 Sayan Saha 2014-08-11 14:28:56 UTC
I think this is too important a piece of functionality to be pushed off to a z-stream release. I would consider this a blocker for RHSS 3.0.

Comment 9 Shyamsundar 2014-08-12 17:55:11 UTC
Fix submitted here: http://review.gluster.org/8462

The fix addresses the issue as identified in the various comments on this bug.

Comment 10 Susant Kumar Palai 2014-08-18 11:09:58 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/30795/

Upstream patch: http://review.gluster.org/#/c/8462/

Comment 11 shylesh 2014-09-19 09:58:27 UTC
Verified on glusterfs-3.6.0.28-1.

Comment 13 errata-xmlrpc 2014-09-22 19:44:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html