Bug 1488120
| Summary: | Moving multiple temporary files to the same destination concurrently causes ESTALE error | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Raghavendra G <rgowdapp> |
| Component: | distribute | Assignee: | Raghavendra G <rgowdapp> |
| Status: | CLOSED ERRATA | QA Contact: | Prasad Desala <tdesala> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.3 | CC: | amukherj, bugs, couture.danny, jgalvez, nbalacha, pkarampu, rbhat, rgowdapp, rhinduja, rhs-bugs, sheggodu, simon.turcotte-langevin, storage-qa-internal, tdesala |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.2-10 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1378550 | | |
| : | 1543279 (view as bug list) | Environment: | |
| Last Closed: | 2018-09-04 06:35:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1378550 | | |
| Bug Blocks: | 1425421, 1503134, 1543279, 1576291 | | |
|
Description (Raghavendra G, 2017-09-04 11:23:28 UTC)
The initial RCA of failing inodelks, while necessary, is not sufficient to get the use case working. Over the course of getting the test case working, I found the following issues:

1. Entrylk has to be acquired on the hashed subvol, as the cached subvol can change due to migration of the file. Credit for identifying this during code review goes to Nithya. This part of the solution is done.
2. After locking, the lookup done on dst should handle the ESTALE error scenario, as the gfid associated with the path can change due to rename(s) that happened between the lookup and the time the lock is acquired. This part of the solution is done.
3. dht_lookup has to handle the scenario of the gfids of the linkto and data files differing during a fresh lookup, as the lookup might be done in the middle of a rename. While I have a hacky way of resolving this (by making fuse-bridge retry even a fresh lookup on ESTALE), I need to implement a proper fix (make dht_lookup retry the lookup under locks when it encounters this scenario).
4. Linkfile creation during lookup has to be done under a lock which synchronizes linkfile creation with any renames involving the file. This part of the solution is implemented. But I am still thinking through whether this locking is actually required; IOW, I am not able to find the RCA which requires this solution. But having this lock gets the test working.
5. server_link has to handle the scenario of a stale dentry left in the inode table due to a racing lookup and rename involving the dentry. I have a hacky implementation which changes the resolve type to RESOLVE_MAY from RESOLVE_NOT in server_link. But the correct solution would be to enhance the RESOLVE_NOT implementation in the server resolver to check the backend for the existence of the file (and then fail only if the file exists on the backend too) before failing the fop on finding an inode in the itable. This part of the solution is still pending.
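Points 1 and 2 above boil down to a classic "revalidate under lock" pattern: the gfid observed by a lookup done before the lock may be stale by the time the lock is held, so the operation must re-check and return ESTALE rather than act on the cached view. The sketch below is illustrative only (the real DHT code is C inside glusterfs); `Volume`, `lookup`, and `rename` here are hypothetical stand-ins, with a plain dict playing the backend and a single `threading.Lock` playing the entrylk on the hashed subvol.

```python
import threading

# Illustrative sketch (not the real DHT API) of the revalidate-under-lock
# pattern from points 1 and 2 above. All names here are hypothetical.

ESTALE = "ESTALE"

class Volume:
    def __init__(self):
        self.gfid_of = {}                 # path -> gfid, stands in for backend state
        self.entrylk = threading.Lock()   # stands in for the entrylk taken on
                                          # the hashed subvol of the basename

    def lookup(self, path):
        return self.gfid_of.get(path)

    def rename(self, src, dst, cached_dst_gfid):
        # The caller did lookup(dst) before taking the lock; a concurrent
        # rename may have changed the gfid behind dst since then.
        with self.entrylk:
            if self.lookup(dst) != cached_dst_gfid:
                return ESTALE             # stale view: caller must re-lookup
            self.gfid_of[dst] = self.gfid_of.pop(src)
            return "OK"
```

A caller that sees `ESTALE` simply repeats the lookup and retries, which mirrors the proposed fix of retrying the lookup under locks.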
With the above set of solutions, I am able to get the test working (with 4 clients simultaneously executing the above script and one client continuously doing lookups on the contents of the directory in which the renames are being done) for a couple of hours. But after that I end up with the rename failing and two dst data files in the volume. I am in the process of debugging this.

(In reply to Raghavendra G from comment #6)
> 4. linkfile creation during lookup has to be done under lock which
> synchronizes linkfile creation with any renames involving the file. This
> part of the solution is implemented. But, I am still thinking through
> whether this locking is actually required. IOW, I am not able to find the
> RCA which requires this solution. But, having this lock gets test working.

The reason we need locks here is that a half-done rename can result in multiple gfids for the same path (dst). (This is transient and gets corrected once the rename completes, either successfully or with a failure; the exception is a client crashing in the middle of a rename.) The gfid of the cached file at the time of lookup (outside locks) can be different by the time the linkfile is created. This results in a permanent condition of the linkto file having a different gfid than the data file. So, before attempting linkto creation, lookup must:

* acquire an entrylk on the parent, so that renames are blocked;
* check whether the conditions for linkto creation are still valid, e.g. the data file has the same gfid as the inode in the glusterfs process, the linkto file is absent, etc.

If any of these checks fail, abandon linkto creation.

> With the above set of solutions, I am able to get the test working (with 4
> clients simultaneously executing the above script and on client continuously
> doing lookup on the contents of directory in which renames are being done)
> for couple of hours. But after that I end up with rename failing and two
> dst data files in the volume. I am in the process of debugging this.
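The two-step checklist above (block renames via the parent entrylk, then re-verify the preconditions) can be sketched as follows. This is a hypothetical illustration, not the glusterfs/DHT API: `maybe_create_linkto`, the `backend` dict, and the `linkto_exists` callback are all invented names, and a racing rename is simulated simply by whatever state the caller passes in.

```python
import threading

# Hypothetical sketch of the linkto-creation checklist above; none of these
# names are the real glusterfs/DHT API.

parent_entrylk = threading.Lock()   # stands in for the entrylk on the parent dir

def maybe_create_linkto(backend, path, inode_gfid, linkto_exists):
    """Decide, under the parent entrylk, whether linkto creation is still valid."""
    with parent_entrylk:
        # Re-check the preconditions *after* acquiring the lock:
        # the data file must still carry the gfid cached in the inode,
        # and no linkto file may have appeared meanwhile.
        if backend.get(path) != inode_gfid or linkto_exists(path):
            return "abandon"        # a racing rename got in first
        return "create-linkto"
```

The key design point, matching the comment above, is that both checks happen while renames are blocked; checking before the lock is exactly the bug described in the next comment.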
Previously I was not verifying that the conditions for creation of the linkto are still valid _after_ acquiring the entrylk. This resulted in the lookup of dst failing with ESTALE and dst-cached getting set to NULL. Subsequent renames would then result in more than one data file, each having a different gfid. Tests have been running successfully for the past hour and I am optimistic that they'll continue to run successfully.

Tests were running overnight and no errors are seen. I also reverted fixes 3 and 5, as I suspected they were not contributing to the failures. So, the final fix will have 1, 2 and 4, along with the added checks in linkfile creation during lookup.

*** Bug 1425421 has been marked as a duplicate of this bug. ***

Do we have a patch posted in upstream against comment 8?

(In reply to Atin Mukherjee from comment #10)
> Do we have a patch posted in upstream against comment 8?

https://review.gluster.org/#/c/19547/

Verified this BZ on glusterfs version 3.12.2-14.el7rhgs.x86_64. From 8 clients, started moving multiple temp files to the same destination. It ran for almost 3 hrs without any issues, and then rename started throwing EEXIST errors on the client mount points. Also, have seen issues while doing some fops on the destination file. Filed separate issues to track those:

https://bugzilla.redhat.com/show_bug.cgi?id=1609210
https://bugzilla.redhat.com/show_bug.cgi?id=1609224
https://bugzilla.redhat.com/show_bug.cgi?id=1610258

Considering that ESTALE/ENOENT errors are no longer seen while moving multiple temp files to the same destination, I am moving this BZ to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607