Description of problem:
We have an application which leverages the POSIX atomic rename (move) semantics. It allows the same file to be uploaded multiple times, since each upload can be committed atomically to the file system. However, when multiple clients try to upload the same file concurrently, some get an ESTALE error on the move operation.

Version-Release number of selected component (if applicable): 3.7.5, 3.8.4

How reproducible:
Create lots of temporary files concurrently on multiple machines and try to move them to the same final location.

Steps to Reproduce:
1. Log on to multiple machines
2. On each, execute: while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f; done &
3. Wait until the move command fails

Actual results:
mv: cannot move ‘test5f4c981f-efcb-4ba8-b017-cf4acb76abcc’ to ‘test’: No such file or directory
mv: cannot move ‘test7cf00867-4982-4206-abcf-e5e836460eda’ to ‘test’: No such file or directory
mv: cannot move ‘testcacb6c40-c164-435f-b118-7a14687bf4bd’ to ‘test’: No such file or directory
mv: cannot move ‘test956ff19d-0a16-49bd-a877-df18311570dc’ to ‘test’: No such file or directory
mv: cannot move ‘test6e36eb01-9e54-4b50-8de8-cebb063554ba’ to ‘test’: Structure needs cleaning

Expected results:
No output, because no error occurs.

Additional info:

--- Additional comment from Pranith Kumar K on 2016-10-17 05:22:19 EDT ---

Du, Nitya,

Based on my debugging, inodelk keeps failing with ESTALE. When I checked dht_rename(), I see that the inodelk is done on both the source and destination inodes. But because the test above can lead to deletion of the file we are trying to lock on by the other 'while ...' process, the inodelk fails with ESTALE. When I changed the test to rename to independent filenames, everything works as expected.
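The steps above can be run as a self-contained sketch (uuidgen is replaced by mktemp for portability; the directory is illustrative). On a local POSIX filesystem every rename succeeds, which is the behaviour the report expects from the Gluster mount:

```shell
#!/bin/sh
# Reproducer sketch: several concurrent writers each create a unique
# temporary file and rename it onto the same destination. On a local
# POSIX filesystem every rename succeeds; on the affected Gluster
# mount some renames fail with ESTALE or ENOENT.
dir="$(mktemp -d)"

writer() {
    i=0
    while [ "$i" -lt 50 ]; do
        tmp="$(mktemp "$dir/test.XXXXXX")"   # stands in for test$uuid
        echo "some data" > "$tmp"
        mv -f "$tmp" "$dir/test" || echo "FAIL"
        i=$((i + 1))
    done
}

# run several writers concurrently, as the original test does across mounts
writer & writer & writer &
wait
```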
On mount1:
while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f; done

On mount2:
while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test2" -f; done

Not sure how to fix this in DHT though. For now re-assigning the bug to DHT.

--- Additional comment from Raghavendra G on 2016-10-17 07:09:14 EDT ---

(In reply to Pranith Kumar K from comment #1)

Locking in dht_rename has two purposes:
1. Serialize and ensure atomicity (of each rename) when two parallel renames are done on the same file.
2. Serialize a rename with file migration during rebalance.

The current use-case falls into category 1. I think using entrylk instead of inodelk solves the problem. However, I need to think more about this.

Assigning the bug to Kotresh as he is working on synchronization issues.

--- Additional comment from Pranith Kumar K on 2016-10-17 08:10:22 EDT ---

(In reply to Raghavendra G from comment #2)

Just a word of caution: it is important to do it in a backward-compatible way.

--- Additional comment from Raghavendra G on 2017-04-22 03:28:45 EDT ---

As Pranith explained, it's a bug in the dht_rename code. The fact that dht_rename expects a lock to be successful on "dst" in "mv src dst" is not POSIX compliant.

<man 2 rename>
ENOENT The link named by oldpath does not exist; or, a directory component in newpath does not exist; or, oldpath or newpath is an empty string.
</man>

It should ignore ESTALE/ENOENT errors while trying to acquire the lock on the "dst" inode. The issue is that "dst" exists when the lookup happened, but it got deleted by the time the rename fop hits dht. Dht, relying on the information it got in lookup, sends a lock on "dst" which fails with ESTALE.

As mentioned in the bz, exploring using entrylk instead of inodelk is one option. I'll get back to you on this. Sorry about the delay.
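The rename(2) behaviour cited from the man page can be checked locally; this is a plain-POSIX illustration (paths are illustrative), not a Gluster test:

```shell
#!/bin/sh
# POSIX rename(2) semantics that dht_rename must preserve: the rename
# succeeds whether or not the destination exists, and an existing
# destination is replaced atomically.
set -e
dir="$(mktemp -d)"

echo "v1" > "$dir/src"
mv -f "$dir/src" "$dir/dst"    # dst does not exist: rename succeeds

echo "v2" > "$dir/src"
mv -f "$dir/src" "$dir/dst"    # dst exists: it is replaced atomically
```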
REVIEW: https://review.gluster.org/19546 (cluster/dht: store the 'reaction' on failures per lock) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/19547 (cluster/dht: acquire entrylk instead of inodelk during renames) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/19556 (libglusterfs/syncop: Add syncop_entrylk) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/19556 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

libglusterfs/syncop: Add syncop_entrylk

Change-Id: Idd86b9f0fa144c2316ab6276e2def28b696ae18a
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
COMMIT: https://review.gluster.org/19546 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

cluster/dht: store the 'reaction' on failures per lock

Currently it is passed to dht_blocking_inode(entry)lk, which makes it a global value for all the locks passed in the argument. This is a limitation for cases where we want to ignore failures on only a few locks and fail for others.

Change-Id: I02cfbcaafb593ad8140c0e5af725c866b630fb6b
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
REVIEW: https://review.gluster.org/19707 (tests/bug-1110262.t: fix a race condition) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/19707 committed in master by "Shyamsundar Ranganathan" <srangana> with a commit message-

tests/bug-1110262.t: fix a race condition

This test does:
1. mount a volume
2. kill a brick in the volume
3. mkdir (/somedir)

In my local tests and in [1], I see that mkdir in step 3 fails because there is no dht-layout on the root directory. The reason, I think, is that by the time the first lookup on "/" hit dht, a brick had been killed as per step 2. This means the layout was not healed for "/", and since this is a new volume, no layout is present on it. Note that the first lookup done on "/" by fuse-bridge is not synchronized with the parent process of the daemonized glusterfs mount completing. IOW, by the time the glusterfs cmd has executed, there is no guarantee that the lookup on "/" is complete. So, if step 2 races ahead of fuse_first_lookup on "/", we end up with an invalid dht-layout on "/", resulting in failures. Doing an operation like ls makes sure that the lookup on "/" is completed before we kill a brick.

Change-Id: Ie0c4e442c4c629fad6f7ae850437e3d63fe4bea9
Signed-off-by: Raghavendra G <rgowdapp>
BUG: 1543279
REVIEW: https://review.gluster.org/19727 (cluster/dht: log error only if layout healing is required) posted (#1) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/19732 (server/resolver: don't trust inode-table for RESOLVE_NOT) posted (#1) for review on master by Raghavendra G
COMMIT: https://review.gluster.org/19732 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

server/resolver: don't trust inode-table for RESOLVE_NOT

There have been known races between fops which add a dentry (like lookup, create, mknod etc.) and fops that remove a dentry (like rename, unlink, rmdir etc.) due to which stale dentries are left in the inode table even though the dentry doesn't exist on the backend.

For example, consider a lookup (parent/bname) and an unlink (parent/bname) racing in the following order:
* lookup hits storage/posix and finds that the dentry exists
* unlink removes the dentry on storage/posix
* unlink reaches protocol/server, where the dentry (parent/bname) is unlinked from the inode
* lookup reaches protocol/server and creates a dentry (parent/bname) on the inode

Now we have a stale dentry (parent/bname) associated with the inode in the itable. This situation is bad for fops like link, create etc. which invoke the resolver with type RESOLVE_NOT. These fops fail with EEXIST even though there is no such dentry on the backend fs.

This issue can be solved in two ways:
* Enable the "dentry fop serializer" xlator [1]:
  # gluster volume set features.sdfs on
* Make sure the resolver does a lookup on the backend when it finds a dentry in the itable, and validate the state of the itable:
  - If the dentry is not found, unlink the stale dentry from the itable and continue with the fop
  - If the dentry is found, fail the fop with EEXIST

This patch implements the second solution, as sdfs is not enabled by default in the brick xlator stack. Once sdfs is enabled by default, this patch can be reverted.

[1] https://github.com/gluster/glusterfs/issues/397

Change-Id: Ia8bb0cf97f97cb0e72639bce8aadb0f6d3f4a34a
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
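The second solution above can be sketched as a toy model (all names and helpers here are invented for illustration; this is not GlusterFS code): a cached dentry may be stale, so before failing a RESOLVE_NOT-style operation with EEXIST, re-check the backend filesystem.

```shell
#!/bin/sh
# Toy model of the resolver fix: validate a cached "dentry" against the
# backend before trusting it.
dir="$(mktemp -d)"

cache_says_exists="yes"    # stale itable entry for "$dir/file":
                           # the backend file was unlinked after the
                           # lookup populated the cache

resolve_not() {
    # fixed behaviour: validate the cached dentry against the backend
    if [ "$cache_says_exists" = "yes" ] && [ ! -e "$dir/file" ]; then
        cache_says_exists="no"    # drop the stale dentry
    fi
    if [ "$cache_says_exists" = "yes" ]; then
        echo "EEXIST"             # dentry really exists on the backend
    else
        echo "OK"                 # safe to proceed with the fop
    fi
}

resolve_not
```

The old behaviour corresponds to skipping the backend check and printing EEXIST whenever the cache says the entry exists.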
COMMIT: https://review.gluster.org/19727 committed in master by "N Balachandran" <nbalacha> with a commit message-

cluster/dht: log error only if layout healing is required

Self-healing of a directory is invoked on two conditions:
1. no layout on disk, or the layout has some anomalies (holes/overlaps)
2. the mds xattr is not set on the directory

When dht_selfheal_directory is called with a correct layout just to set the mds xattr, we see error msgs complaining about "not able to form layout on directory", which is misleading as the layout is correct. So, log this msg only if the layout has anomalies.

Change-Id: I4af25246fc3a2450c2426e9902d1a5b372eab125
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
COMMIT: https://review.gluster.org/19547 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

cluster/dht: fixes to parallel renames to same destination codepath

Test case:
# while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv "test$uuid" "test" -f || break; echo "done:$uuid"; done

This script was run in parallel from multiple mountpoints. Along the course of getting the above usecase working, many issues were found:

Issue 1:
========
Consider a case of rename (src, dst). We can encounter a situation where:
* dst is a file present at the time of lookup
* dst is removed by the time the rename fop reaches glusterfs

In this scenario, acquiring inodelk on dst fails with ESTALE, resulting in failure of the rename. However, as per POSIX, irrespective of whether dst is present or not, the rename should be successful. Acquiring entrylk provides synchronization even in races like this.

Algorithm:
1. Take inodelks on src and dst (if dst is present) on the respective cached subvols. These inodelks are done to preserve backward compatibility with older clients, so that synchronization is preserved when a volume is mounted by clients of different versions. Once the relevant older versions (3.10, 3.12, 3.13) reach EOL, this code can be removed.
2. Ignore ENOENT/ESTALE errors of inodelk on dst.
3. Protect the namespace of src and dst. To protect the namespace of a file, take inodelk on the parent on the hashed subvol, then take entrylk on the same subvol on the parent with the basename of the file. The inodelk on the parent is done to guard against changes to the parent layout, so that the hashed subvol won't change during the rename.
4. <rest of rename continues>
5. Unlock all locks.

Issue 2:
========
Linkfile creation in the lookup codepath can race with a rename. Imagine the following scenario:
* lookup finds a data-file with gfid - gfid-dst - without a corresponding linkto file on the hashed-subvol. It decides to create a linkto file with gfid - gfid-dst.
  - Note that some codepaths of dht-rename delete the linkto file of dst as a first step. So, a lookup racing with an in-progress rename can easily run into this situation.
* a rename (src-path:gfid-src, dst-path:gfid-dst) renames the data-file, and hence the gfid of the data-file changes to gfid-src with path dst-path.
* lookup proceeds and creates a linkto file - dst-path - with gfid - dst-gfid - on the hashed-subvol.
* rename tries to create a linkto file dst-path with src-gfid on the hashed-subvol, but it fails with EEXIST. However, EEXIST is ignored during linkto file creation.

Now we've ended up with dst-path having different gfids - dst-gfid on the linkto file and src-gfid on the data file. Future lookups on dst-path will always fail with ESTALE, due to the differing gfids.

The fix is to synchronize linkfile creation in the lookup path with rename, using the same mechanism of protecting the namespace explained in the solution of Issue 1. Once locks are acquired, before proceeding with linkfile creation, we check whether the conditions for linkto file creation are still valid. If not, we skip linkto file creation.

Issue 3:
========
The gfid of dst-path can change by the time locks are acquired. This means either another rename overwrote dst-path, or dst-path was deleted and recreated by a different client. When this happens, the cached-subvol for dst can change. If the rename proceeds with the old gfid and the old cached subvol, we'll end up in inconsistent state(s) like dst-path having different gfids on different subvols, more than one data-file being present, etc.

The fix is to do the lookup with a new inode after protecting the namespace of dst. Post lookup, we have to compare gfids and correct local state appropriately to be in sync with the backend.

Issue 4:
========
During revalidate lookup, if following a linkto file doesn't lead to a valid data-file, local->cached-subvol was not reset to NULL. This means we would be operating on a stale state, which can lead to inconsistency. As a fix, reset it to NULL before proceeding with lookup everywhere.

Issue 5:
========
Stale dentries left in the inode table on a brick resulted in failures of the link fop even though the file/dentry didn't exist on the backend fs. A patch has been submitted to fix this issue. Please check the dependency tree of the current patch on gerrit for details. In short, we fix the problem by not blindly trusting the inode-table. Instead, we validate whether the dentry is present by doing a lookup on the backend fs.

Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165
updates: bz#1543279
BUG: 1543279
Signed-off-by: Raghavendra G <rgowdapp>
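The namespace-protection scheme of step 3 in the algorithm above can be sketched in userspace. This is only an analogy (invented names; mkdir used as a portable atomic lock primitive), not the actual inodelk/entrylk implementation:

```shell
#!/bin/sh
# Analogy of namespace protection during rename: take a lock on the
# parent namespace, then a per-basename lock, before renaming.
dir="$(mktemp -d)"

lock()   { while ! mkdir "$1" 2>/dev/null; do sleep 0.1; done; }
unlock() { rmdir "$1"; }

rename_locked() {    # rename_locked <src> <dstdir> <dstname>
    lock "$2/.parent.lk"     # analogue of inodelk on the parent
    lock "$2/.$3.lk"         # analogue of entrylk on the basename
    mv -f "$1" "$2/$3"
    unlock "$2/.$3.lk"
    unlock "$2/.parent.lk"
}

echo "some data" > "$dir/tmp1"
rename_locked "$dir/tmp1" "$dir" "test"
```

Two concurrent rename_locked calls to the same destination name serialize on the per-basename lock, which is the property the entrylk-based fix relies on.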
REVIEW: https://review.gluster.org/20085 (tests/bug-1543279: mark it as bad) posted (#1) for review on master by Raghavendra G
Still, some races are not fixed. See the following patch for details: https://review.gluster.org/#/c/20085/
COMMIT: https://review.gluster.org/20085 committed in master by "Raghavendra G" <rgowdapp> with a commit message-

tests/bug-1543279: mark it as bad

There seem to be races which are not fixed by commit 9704d203f0. Though the test itself is not bad, it is failing very frequently. So, till the issue is fixed, marking this test as bad.

updates: bz#1543279
Change-Id: I4a5869da1586178914ffb9563414879e6beab183
Signed-off-by: Raghavendra G <rgowdapp>
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/