Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1335373

Summary:	cyclic dentry loop in inode table
Product:	[Community] GlusterFS	Reporter:	Raghavendra G <rgowdapp>
Component:	fuse	Assignee:	bugs <bugs>
Status:	CLOSED UPSTREAM	QA Contact:
Severity:	low	Docs Contact:
Priority:	low
Version:	mainline	CC:	bugs, csaba, khiremat, kiyer
Target Milestone:	---	Keywords:	Reopened, Triaged
Target Release:	---	Flags:	khiremat: needinfo-
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-03-12 12:51:12 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1092606
Bug Blocks:

Description Raghavendra G 2016-05-12 06:31:09 UTC

Description of problem:

Assume the following directory structure:

               1
              / \
             2   3
             |
             a
            / \
           b   c

Now, Consider following sequence of operations:
1. lookup <2,a> at storage/posix.
2. rename <2,a> to <3,a> at storage/posix
3. rename completes at protocol/server and it modifies dentries appropriately.
4. lookup completes at protocol/server and it creates a dentry <2,a> in inode table.


After the above sequence of operations, the dentry structure in inode table looks like:
            
               1
              / \
             2   3
             |   |   
             a   a
                / \
               b   c


Now, if we mv 2 into either of b or c, we'll have a loop at a.

               1
                \
                 3
                 |   
                 a
                / \
               b   c
               |
               2
               |
               a

The loop here being (a, b, 2, a).

Note that, though I've given example of protocol/server here, in theory this loop can be formed in any access-layer (like fuse-bridge, api, nfsv3 etc) doing inode management.

Version-Release number of selected component (if applicable):
Day 1 bug, present in all releases

How reproducible:
race condition

Steps to Reproduce:
1. There is no specific test case, though we've seen this scenario in many other tests.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Raghavendra G 2016-05-12 06:36:01 UTC

http://review.gluster.org/#/c/13451 is one way of solving this issue. DFS should be loaded as an immediate child of these access layers. With the dentry operations being serialized, we don't see the race outlined above.

Comment 2 Raghavendra G 2016-05-17 09:21:43 UTC

Other problems due to stale dentries for a directory are discussed in this mail thread:
http://comments.gmane.org/gmane.comp.file-systems.gluster.devel/15287

Comment 3 Csaba Henk 2017-09-01 17:39:50 UTC

Setting to POST in acknowledgement of https://review.gluster.org/13451 as an attempt to solve this.

Comment 4 Amar Tumballi 2019-05-11 02:08:45 UTC

This can be resolved by `gluster volume set $VOL features.sdfs enable` (the above patch finally made it to repo).

Comment 5 Raghavendra G 2019-05-11 02:22:00 UTC

(In reply to Amar Tumballi from comment #4)
> This can be resolved by `gluster volume set $VOL features.sdfs enable` (the
> above patch finally made it to repo).

But doesn't enabling sdfs regresses performance significantly and hence not a viable solution? Note that we have run into this problem even in client's inode table. Serializing on client makes glusterfs almost useless due to serious performance drop. So, even though sdfs is available (that too only on bricks not on client, but this problem exists on client too), in its current form it cannot be used practically and hence I would say this bug is not fixed.

Comment 6 Raghavendra G 2019-05-11 02:25:49 UTC

(In reply to Raghavendra G from comment #5)
> (In reply to Amar Tumballi from comment #4)
> > This can be resolved by `gluster volume set $VOL features.sdfs enable` (the
> > above patch finally made it to repo).
> 
> But doesn't enabling sdfs regresses performance significantly and hence not
> a viable solution? 

commit 829337ed3971a53086f1562d826e79d4f3e3ed39
Author: Amar Tumballi <amarts>
Date:   Mon Jan 28 18:30:24 2019 +0530

    features/sdfs: disable by default
    
    With the feature enabled, some of the performance testing results,
    specially those which create millions of small files, got approximately
    4x regression compared to version before enabling this.
    
    On master without this patch:  765 creates/sec
    On master with this patch   : 3380 creates/sec
    
    Also there seems to be regression caused by this in 'ls -l' workload.
    
    On master without this patch:  3030 files/sec
    On master with this patch   : 16610 files/sec
    
    This is a feature added to handle multiple clients parallely operating
    (specially those which race for file creates with same name) on a single
    namespace/directory. Considering that is < 3% of Gluster's usecase right
    now, it makes sense to disable the feature by default, so we don't
    penalize the default users who doesn't bother about this usecase.
    Also note that the client side translators, specially, distribute,
    replicate and disperse already handle the issue upto 99.5% of the cases
    without SDFS, so it makes sense to keep the feature disabled by default.
    
    Credits: Shyamsunder <srangana> for running the tests and
    getting the numbers.
    
    Change-Id: Iec49ce1d82e621e9db25eb633fcb1d932e74f4fc
    Updates: bz#1670031
    Signed-off-by: Amar Tumballi <amarts>

Comment 7 Raghavendra G 2019-05-11 02:26:56 UTC

Moving back this to assigned till the discussion about perf impact of sdfs is resolved.

Comment 8 Raghavendra G 2019-05-11 02:37:01 UTC

Another data point for this bug is cyclic dentry loops can cause *serious* performance regression in lookup codepath. Especially if cyclic loops are formed relatively deep in directory hierarchy as they increase the dentry search time exponentially (2 pow number-of-duplicate-dentries). So, this bug is correctness as well as a performance one.

And I've seen the issue in one of production environments (though we couldn't exactly measure perf impact of this in this setup, but perf impact was seen in test setups) - https://bugzilla.redhat.com/show_bug.cgi?id=1696353#c11

Comment 9 Raghavendra G 2019-05-11 03:52:50 UTC

s/cyclic dentry loops/stale dentries.

What I've said about performance in previous comments is still valid.

Comment 10 Amar Tumballi 2019-05-11 10:11:07 UTC

Raghavendra, Yes, SDFS has performance impact. But the feature is still present and can be enabled if user's work load demands it.

My reasoning of closing this bug was mainly because for none of our users, this particular case was not hit (ie, there are no reports of it). So, focusing on that work when no one needs it is meaningless, that too when there are options available.

Comment 11 Raghavendra G 2019-05-11 13:19:19 UTC

Stale dentries on client itable can end up making both src and dst exist after a "mv src dst" succeeds and they resolve to same inode.

From one of the production setups using SAS:

md-cache doesn't resolve a (parent, basename) pair to an inode. Instead its the access layers (fuse, gfapi) that resolve the path to an inode. md-cache just gives back the stat stored in the context of inode. Which means both,
* (Nodeid:139842860710128, dm_errori.sas7bdat)
* (Nodeid:139842860710128, dm_errori.sas7bdat.lck)

are resolving to same inode. Since they resolve to same inode, lookup is served from same cache and hence identical stats. When md-cache is turned off, both lookups go on same inode, but server-resolver on brick ignores the gfid sent by client. Instead it resolves the entry freshly and hence will get a different inode for each lookup.

So the actual problem is in fuse inode table, where there are two dentries for the same inode. How did we end up in that situation? Its likely that a lookup (Nodeid:139842860710128, dm_errori.sas7bdat.lck) was racing with rename (dm_errori.sas7bdat, dm_errori.sas7bdat.lck) and rename updated inode table on client first. The lookup which hit storage/posix before rename relinked the stale dentry. Without lookups reaching bricks, client never got a chance to flush its stale dentry. The problem is exactly the same DFS is trying to solve on brick stack.

So, the role of stat-prefetch here is that it is preventing lookups from reaching bricks. Once lookups reach bricks, stale dentry from inode table will be purged and the error condition goes away.

Comment 12 Raghavendra G 2019-05-11 13:28:17 UTC

(In reply to Amar Tumballi from comment #10)
> Raghavendra, Yes, SDFS has performance impact. But the feature is still
> present and can be enabled if user's work load demands it.
> 
> My reasoning of closing this bug was mainly because for none of our users,
> this particular case was not hit (ie, there are no reports of it). So,
> focusing on that work when no one needs it is meaningless, that too when
> there are options available.

The second claim about no user has hit it is wrong. To repost the link I posted in comment #8 - https://bugzilla.redhat.com/show_bug.cgi?id=1696353#c11. Also I've found this scenario on SAS setups too - https://bugzilla.redhat.com/show_bug.cgi?id=1581306#c47

I've already reasoned about non-feasibility of sdfs:
1. Its not available on client. There is no option which can load it in client graphs. It can be loaded only on bricks. So, even if the user can accept perf impact (which is highly unlikely, see point 2) the solution is not complete.
2. The commit I posted in comment #6 measured the impact of sdfs being loaded on brick stack. We don't have any perf data to measure the impact if it gets loaded on client graph. It's highly likely that perf impact is much greater than the numbers posted in comment #6

All it takes to hit this bug is rename heavy workloads and there are many of them as the following paradigm is pretty common while copying a file (rsync, postgresql, SAS etc to give specific examples):
* create a tmp file
* write to it
* rename tmp file to actual file

Comment 13 Raghavendra G 2019-05-11 13:40:09 UTC

(In reply to Raghavendra G from comment #12)
> (In reply to Amar Tumballi from comment #10)
> > Raghavendra, Yes, SDFS has performance impact. But the feature is still
> > present and can be enabled if user's work load demands it.
> > 
> > My reasoning of closing this bug was mainly because for none of our users,
> > this particular case was not hit (ie, there are no reports of it). So,
> > focusing on that work when no one needs it is meaningless, that too when
> > there are options available.
> 
> The second claim about no user has hit it is wrong. To repost the link I
> posted in comment #8 -
> https://bugzilla.redhat.com/show_bug.cgi?id=1696353#c11. Also I've found
> this scenario on SAS setups too -
> https://bugzilla.redhat.com/show_bug.cgi?id=1581306#c47

Also, https://bugzilla.redhat.com/show_bug.cgi?id=1600923 on geo-rep setups. IIRC, this bug when hit affects Geo-rep. Putting needinfo on Kotresh and Rochelle to explain how this bug, affects geo-rep.

Comment 18 Worker Ant 2020-03-12 12:51:12 UTC

This bug is moved to https://github.com/gluster/glusterfs/issues/962, and will be tracked there from now on. Visit GitHub issues URL for further details