Description of problem:
We have an application which leverages the POSIX atomic rename (move) semantics. It allows the same file to be uploaded multiple times, since each upload can be committed atomically to the file system. However, when multiple clients try to upload the same file concurrently, some get an ESTALE error on the move operation.

Version-Release number of selected component (if applicable): 3.7.5, 3.8.4

How reproducible:
Create lots of temporary files concurrently on multiple machines and try to move them to the same final location.

Steps to Reproduce:
1. Log on to multiple machines
2. On each, execute: while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f; done &
3. Wait until the move command fails

Actual results:
mv: cannot move ‘test5f4c981f-efcb-4ba8-b017-cf4acb76abcc’ to ‘test’: No such file or directory
mv: cannot move ‘test7cf00867-4982-4206-abcf-e5e836460eda’ to ‘test’: No such file or directory
mv: cannot move ‘testcacb6c40-c164-435f-b118-7a14687bf4bd’ to ‘test’: No such file or directory
mv: cannot move ‘test956ff19d-0a16-49bd-a877-df18311570dc’ to ‘test’: No such file or directory
mv: cannot move ‘test6e36eb01-9e54-4b50-8de8-cebb063554ba’ to ‘test’: Structure needs cleaning

Expected results:
No output, because no error occurs.

Additional info:

--- Additional comment from Pranith Kumar K on 2016-10-17 05:22:19 EDT ---

Du, Nitya,

Based on my debugging, inodelk keeps failing with ESTALE. When I checked dht_rename(), I see that the inodelk is done on both the source and destination inodes. But because the test above can lead to deletion of the file we are trying to lock on by the other 'while ...' process, the inodelk fails with ESTALE. When I changed the test to rename to independent filenames, everything works as expected.
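The steps above can be run as a self-contained sketch (uuidgen is replaced by mktemp for portability; the directory is illustrative). On a local POSIX filesystem every rename succeeds, which is the behaviour the report expects from the Gluster mount:

```shell
#!/bin/sh
# Reproducer sketch: several concurrent writers each create a unique
# temporary file and rename it onto the same destination. On a local
# POSIX filesystem every rename succeeds; on the affected Gluster
# mount some renames fail with ESTALE or ENOENT.
dir="$(mktemp -d)"

writer() {
    i=0
    while [ "$i" -lt 50 ]; do
        tmp="$(mktemp "$dir/test.XXXXXX")"   # stands in for test$uuid
        echo "some data" > "$tmp"
        mv -f "$tmp" "$dir/test" || echo "FAIL"
        i=$((i + 1))
    done
}

# run several writers concurrently, as the original test does across mounts
writer & writer & writer &
wait
```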
On mount1:
while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f; done

On mount2:
while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test2" -f; done

Not sure how to fix this in DHT though. For now re-assigning the bug to DHT.

--- Additional comment from Raghavendra G on 2016-10-17 07:09:14 EDT ---

(In reply to Pranith Kumar K from comment #1)

Locking in dht_rename has two purposes:
1. Serialize and ensure atomicity (of each rename) when two parallel renames are done on the same file.
2. Serialize a rename with file migration during rebalance.

The current use-case falls into category 1. I think using entrylk instead of inodelk solves the problem. However, I need to think more about this.

Assigning the bug to Kotresh as he is working on synchronization issues.

--- Additional comment from Pranith Kumar K on 2016-10-17 08:10:22 EDT ---

(In reply to Raghavendra G from comment #2)

Just a word of caution: it is important to do it in a backward-compatible way.

--- Additional comment from Raghavendra G on 2017-04-22 03:28:45 EDT ---

As Pranith explained, it's a bug in the dht_rename code. The fact that dht_rename expects a lock to be successful on "dst" in "mv src dst" is not POSIX compliant.

<man 2 rename>
ENOENT The link named by oldpath does not exist; or, a directory component in newpath does not exist; or, oldpath or newpath is an empty string.
</man>

It should ignore ESTALE/ENOENT errors while trying to acquire the lock on the "dst" inode. The issue is that "dst" exists when the lookup happened, but it got deleted by the time the rename fop hits dht. Dht, relying on the information it got in lookup, sends a lock on "dst" which fails with ESTALE.

As mentioned in the bz, exploring using entrylk instead of inodelk is one option. I'll get back to you on this. Sorry about the delay.
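The rename(2) behaviour cited from the man page can be checked locally; this is a plain-POSIX illustration (paths are illustrative), not a Gluster test:

```shell
#!/bin/sh
# POSIX rename(2) semantics that dht_rename must preserve: the rename
# succeeds whether or not the destination exists, and an existing
# destination is replaced atomically.
set -e
dir="$(mktemp -d)"

echo "v1" > "$dir/src"
mv -f "$dir/src" "$dir/dst"    # dst does not exist: rename succeeds

echo "v2" > "$dir/src"
mv -f "$dir/src" "$dir/dst"    # dst exists: it is replaced atomically
```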
REVIEW: https://review.gluster.org/19546 (cluster/dht: store the 'reaction' on failures per lock) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/19547 (cluster/dht: acquire entrylk instead of inodelk during renames) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/19556 (libglusterfs/syncop: Add syncop_entrylk) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/19556 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

libglusterfs/syncop: Add syncop_entrylk

Change-Id: Idd86b9f0fa144c2316ab6276e2def28b696ae18a
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
COMMIT: https://review.gluster.org/19546 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

cluster/dht: store the 'reaction' on failures per lock

Currently it is passed to dht_blocking_inode(entry)lk, which makes it a global value for all the locks passed in the argument. This is a limitation for cases where we want to ignore failures on only a few locks and fail for others.

Change-Id: I02cfbcaafb593ad8140c0e5af725c866b630fb6b
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
REVIEW: https://review.gluster.org/19707 (tests/bug-1110262.t: fix a race condition) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/19707 committed in master by "Shyamsundar Ranganathan" <srangana> with a commit message-

tests/bug-1110262.t: fix a race condition

This test does:
1. mount a volume
2. kill a brick in the volume
3. mkdir (/somedir)

In my local tests and in [1], I see that mkdir in step 3 fails because there is no dht-layout on the root directory. The reason, I think, is that by the time the first lookup on "/" hit dht, a brick had been killed as per step 2. This means the layout was not healed for "/", and since this is a new volume, no layout is present on it. Note that the first lookup done on "/" by fuse-bridge is not synchronized with the parent process of the daemonized glusterfs mount completing. IOW, by the time the glusterfs cmd has executed, there is no guarantee that the lookup on "/" is complete. So, if step 2 races ahead of fuse_first_lookup on "/", we end up with an invalid dht-layout on "/", resulting in failures. Doing an operation like ls makes sure that the lookup on "/" is completed before we kill a brick.

Change-Id: Ie0c4e442c4c629fad6f7ae850437e3d63fe4bea9
Signed-off-by: Raghavendra G <rgowdapp>
BUG: 1543279
REVIEW: https://review.gluster.org/19727 (cluster/dht: log error only if layout healing is required) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/19732 (server/resolver: don't trust inode-table for RESOLVE_NOT) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/19732 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

server/resolver: don't trust inode-table for RESOLVE_NOT

There have been known races between fops which add a dentry (like lookup, create, mknod etc.) and fops that remove a dentry (like rename, unlink, rmdir etc.) due to which stale dentries are left in the inode table even though the dentry doesn't exist on the backend.

For example, consider a lookup (parent/bname) and an unlink (parent/bname) racing in the following order:
* lookup hits storage/posix and finds that the dentry exists
* unlink removes the dentry on storage/posix
* unlink reaches protocol/server, where the dentry (parent/bname) is unlinked from the inode
* lookup reaches protocol/server and creates a dentry (parent/bname) on the inode

Now we have a stale dentry (parent/bname) associated with the inode in the itable. This situation is bad for fops like link, create etc. which invoke the resolver with type RESOLVE_NOT. These fops fail with EEXIST even though there is no such dentry on the backend fs.

This issue can be solved in two ways:
* Enable the "dentry fop serializer" xlator [1]:
  # gluster volume set features.sdfs on
* Make sure the resolver does a lookup on the backend when it finds a dentry in the itable, and validate the state of the itable:
  - If the dentry is not found, unlink the stale dentry from the itable and continue with the fop
  - If the dentry is found, fail the fop with EEXIST

This patch implements the second solution, as sdfs is not enabled by default in the brick xlator stack. Once sdfs is enabled by default, this patch can be reverted.

[1] https://github.com/gluster/glusterfs/issues/397

Change-Id: Ia8bb0cf97f97cb0e72639bce8aadb0f6d3f4a34a
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
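The second solution above can be sketched as a toy model (all names and helpers here are invented for illustration; this is not GlusterFS code): a cached dentry may be stale, so before failing a RESOLVE_NOT-style operation with EEXIST, re-check the backend filesystem.

```shell
#!/bin/sh
# Toy model of the resolver fix: validate a cached "dentry" against the
# backend before trusting it.
dir="$(mktemp -d)"

cache_says_exists="yes"    # stale itable entry for "$dir/file":
                           # the backend file was unlinked after the
                           # lookup populated the cache

resolve_not() {
    # fixed behaviour: validate the cached dentry against the backend
    if [ "$cache_says_exists" = "yes" ] && [ ! -e "$dir/file" ]; then
        cache_says_exists="no"    # drop the stale dentry
    fi
    if [ "$cache_says_exists" = "yes" ]; then
        echo "EEXIST"             # dentry really exists on the backend
    else
        echo "OK"                 # safe to proceed with the fop
    fi
}

resolve_not
```

The old behaviour corresponds to skipping the backend check and printing EEXIST whenever the cache says the entry exists.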
COMMIT: https://review.gluster.org/19727 committed in master by "N Balachandran" <nbalacha> with a commit message-

cluster/dht: log error only if layout healing is required

Self-healing of a directory is invoked on two conditions:
1. no layout on disk, or the layout has some anomalies (holes/overlaps)
2. the mds xattr is not set on the directory

When dht_selfheal_directory is called with a correct layout just to set the mds xattr, we see error msgs complaining about "not able to form layout on directory", which is misleading as the layout is correct. So, log this msg only if the layout has anomalies.

Change-Id: I4af25246fc3a2450c2426e9902d1a5b372eab125
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
COMMIT: https://review.gluster.org/19547 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

cluster/dht: fixes to parallel renames to same destination codepath

Test case:
# while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f || break; echo "done:$uuid"; done

This script was run in parallel from multiple mountpoints. Along the course of getting the above usecase working, many issues were found:

Issue 1:
========
Consider a case of rename (src, dst). We can encounter a situation where:
* dst is a file present at the time of lookup
* dst is removed by the time the rename fop reaches glusterfs

In this scenario, acquiring inodelk on dst fails with ESTALE, resulting in failure of the rename. However, as per POSIX, irrespective of whether dst is present or not, the rename should be successful. Acquiring entrylk provides synchronization even in races like this.

Algorithm:
1. Take inodelks on src and dst (if dst is present) on the respective cached subvols. These inodelks are done to preserve backward compatibility with older clients, so that synchronization is preserved when a volume is mounted by clients of different versions. Once the relevant older versions (3.10, 3.12, 3.13) reach EOL, this code can be removed.
2. Ignore ENOENT/ESTALE errors of inodelk on dst.
3. Protect the namespace of src and dst. To protect the namespace of a file, take inodelk on the parent on the hashed subvol, then take entrylk on the same subvol on the parent with the basename of the file. The inodelk on the parent is done to guard against changes to the parent layout, so that the hashed subvol won't change during the rename.
4. <rest of rename continues>
5. Unlock all locks.

Issue 2:
========
Linkfile creation in the lookup codepath can race with a rename. Imagine the following scenario:
* lookup finds a data-file with gfid - gfid-dst - without a corresponding linkto file on the hashed-subvol. It decides to create a linkto file with gfid - gfid-dst.
  - Note that some codepaths of dht-rename delete the linkto file of dst as a first step. So, a lookup racing with an in-progress rename can easily run into this situation.
* a rename (src-path:gfid-src, dst-path:gfid-dst) renames the data-file, and hence the gfid of the data-file changes to gfid-src with path dst-path.
* lookup proceeds and creates a linkto file - dst-path - with gfid - dst-gfid - on the hashed-subvol.
* rename tries to create a linkto file dst-path with src-gfid on the hashed-subvol, but it fails with EEXIST. However, EEXIST is ignored during linkto file creation.

Now we've ended up with dst-path having different gfids - dst-gfid on the linkto file and src-gfid on the data file. Future lookups on dst-path will always fail with ESTALE, due to the differing gfids.

The fix is to synchronize linkfile creation in the lookup path with rename, using the same mechanism of protecting the namespace explained in the solution of Issue 1. Once locks are acquired, before proceeding with linkfile creation, we check whether the conditions for linkto file creation are still valid. If not, we skip linkto file creation.

Issue 3:
========
The gfid of dst-path can change by the time locks are acquired. This means either another rename overwrote dst-path, or dst-path was deleted and recreated by a different client. When this happens, the cached-subvol for dst can change. If the rename proceeds with the old gfid and the old cached subvol, we'll end up in inconsistent state(s) like dst-path having different gfids on different subvols, more than one data-file being present, etc.

The fix is to do the lookup with a new inode after protecting the namespace of dst. Post lookup, we have to compare gfids and correct local state appropriately to be in sync with the backend.

Issue 4:
========
During revalidate lookup, if following a linkto file doesn't lead to a valid data-file, local->cached-subvol was not reset to NULL. This means we would be operating on a stale state, which can lead to inconsistency. As a fix, reset it to NULL before proceeding with lookup everywhere.

Issue 5:
========
Stale dentries left in the inode table on a brick resulted in failures of the link fop even though the file/dentry didn't exist on the backend fs. A patch has been submitted to fix this issue. Please check the dependency tree of the current patch on gerrit for details. In short, we fix the problem by not blindly trusting the inode-table. Instead, we validate whether the dentry is present by doing a lookup on the backend fs.

Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
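The namespace-protection scheme of step 3 in the algorithm above can be sketched in userspace. This is only an analogy (invented names; mkdir used as a portable atomic lock primitive), not the actual inodelk/entrylk implementation:

```shell
#!/bin/sh
# Analogy of namespace protection during rename: take a lock on the
# parent namespace, then a per-basename lock, before renaming.
dir="$(mktemp -d)"

lock()   { while ! mkdir "$1" 2>/dev/null; do sleep 0.1; done; }
unlock() { rmdir "$1"; }

rename_locked() {    # rename_locked <src> <dstdir> <dstname>
    lock "$2/.parent.lk"     # analogue of inodelk on the parent
    lock "$2/.$3.lk"         # analogue of entrylk on the basename
    mv -f "$1" "$2/$3"
    unlock "$2/.$3.lk"
    unlock "$2/.parent.lk"
}

echo "some data" > "$dir/tmp1"
rename_locked "$dir/tmp1" "$dir" "test"
```

Two concurrent rename_locked calls to the same destination name serialize on the per-basename lock, which is the property the entrylk-based fix relies on.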
REVIEW: https://review.gluster.org/20085 (tests/bug-1543279: mark it as bad) posted (#1) for review on master by Raghavendra G
Still, some races are not fixed. See the following patch for details: https://review.gluster.org/#/c/20085/
COMMIT: https://review.gluster.org/20085 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

tests/bug-1543279: mark it as bad

There seem to be races which are not fixed by commit 9704d203f0. Though the test itself is not bad, it is failing very frequently. So, till the issue is fixed, marking this test as bad.

updates: bz#1543279
Change-Id: I4a5869da1586178914ffb9563414879e6beab183
Signed-off-by: Raghavendra G <rgowdapp>
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/