Bug 762766 (GLUSTER-1034) - rename() is not atomic
Summary: rename() is not atomic
Keywords:
Status: CLOSED EOL
Alias: GLUSTER-1034
Product: GlusterFS
Classification: Community
Component: fuse
Version: 3.5.2
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-06-28 09:23 UTC by Dushyanth Harinath
Modified: 2016-06-17 15:56 UTC
CC List: 14 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-06-17 15:56:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
reader.c (339 bytes, text/x-c)
2010-06-28 06:24 UTC, Dushyanth Harinath
writer.c (1.01 KB, text/x-c)
2010-06-28 06:24 UTC, Dushyanth Harinath
volfiles.zip (1.57 KB, application/zip)
2010-06-28 06:28 UTC, Dushyanth Harinath
Arris test program for reproduction (20.00 KB, application/x-tar)
2011-07-07 03:31 UTC, Stuart Friedberg

Description Dushyanth Harinath 2010-06-28 06:24:08 UTC
Created attachment 238 [details]
reader.c

Comment 1 Dushyanth Harinath 2010-06-28 06:24:30 UTC
Created attachment 239 [details]
writer.c

Comment 2 Dushyanth Harinath 2010-06-28 06:28:56 UTC
Created attachment 240 [details]
volfiles.zip

Comment 3 Dushyanth Harinath 2010-06-28 09:23:44 UTC
While testing frequent dovecot index corruption on gluster, we have noted that rename() is not being handled atomically by gluster.

Attached are two test programs and gluster config to reproduce this bug.

reader.c and writer.c. Usage:

gcc reader.c -o reader -Wall
gcc writer.c -o writer -Wall
touch dovecot.index
./writer&
./reader

It should keep running without printing anything. On local filesystems
it does. But when running on glusterfs, it fails:

#./reader
open(dovecot.index): No such file or directory

This should never happen. Apparently glusterfs doesn't handle
rename() atomically in all situations.

Comment 4 Anand Avati 2010-07-06 03:52:56 UTC
PATCH: http://patches.gluster.com/patch/3528 in master (mount/fuse: Handle setting entry-timeout to 0.)

Comment 5 Anand Avati 2010-07-06 03:53:00 UTC
PATCH: http://patches.gluster.com/patch/3526 in release-3.0 (glusterfsd: Handle setting entry-timeout to 0.)

Comment 6 Anand Avati 2010-07-06 03:53:04 UTC
PATCH: http://patches.gluster.com/patch/3527 in release-3.0 (mount/fuse: Handle setting entry-timeout to 0.)

Comment 7 Anand Avati 2010-07-06 09:58:44 UTC
PATCH: http://patches.gluster.com/patch/3536 in master (glusterfsd: Handle setting entry-timeout to 0)

Comment 8 Amar Tumballi 2010-10-05 06:01:20 UTC
Most of the self-heal (replicate-related) bugs are now fixed in the 3.1.0 branch. As we are just a week away from the GA release, we would like you to test this particular bug against the 3.1.0 RC releases and let us know if it is fixed.

Comment 9 Amar Tumballi 2011-04-25 09:33:07 UTC
Please update the status of this bug, as it has been more than 6 months since it was filed (bug id < 2000).

Please resolve it with the proper resolution if it is no longer valid. If it is still valid but not critical, move it to 'enhancement' severity.

Comment 10 Amar Tumballi 2011-06-28 02:47:50 UTC
Can someone on the QA team confirm whether this issue is fixed now (perhaps in master/3.2.1), and also make sure the scripts provided here are included in our sanity tests?

Comment 11 Stuart Friedberg 2011-07-05 21:04:30 UTC
(In reply to comment #10)
> can you confirm whether this issue is fixed now ?

I was about to file a duplicate bug for this problem, found in both 3.1.3 and 3.2.1.  So, NO, this is not fixed in 3.2.1.

In fact, I can add a bit more characterization to the problem.  We are using replicated volumes and two test programs.  One program renames new files onto old files.  The other program opens and reads (only) the old files by name.  The open/read test program gets frequent ENOENT returns from both open(2) and from read(2) following a successful open, but only when run on the same node as the rename test program.  When run from another node, the open/read test program never returns ENOENT.

In fact, we have a dandy test case where one instance of the rename test program is launched on each node hosting a brick, renaming a disjoint set of files.  One instance of the open/read test program is launched on each of the same nodes, but scanning the entire set of files being renamed.  In other words, all instances of the open/read program are scanning the same total set of files, and each instance of the rename program is touching a non-overlapping set of files (in different directories, in fact).  Each instance of the open/read program gets ENOENT errors for exactly and only the files being touched by the rename program running on the same node.

Under our operating conditions, we can generate multiple spurious ENOENT returns per second.

Comment 12 shishir gowda 2011-07-06 07:01:47 UTC
Hi,

There is a workaround for this issue, which is related to renames followed by stat/open (along with all the patches already committed).

http://patches.gluster.com/patch/3632/

This prevents queuing of rename/stat/open in io-threads.

Please let us know if this patch fixes the issue.

This fix is a workaround for this specific case only, and hence will not be available in the releases.

Comment 13 Stuart Friedberg 2011-07-07 03:31:18 UTC
Created attachment 545 [details]
Arris test program for reproduction

Comment 14 Stuart Friedberg 2011-07-07 03:31:35 UTC
(In reply to comment #12)
> Please let us know if this patch fixes the issue.

We applied the patch to 3.2.1 and gave it a try. Unfortunately,
that doesn't seem to have any impact on the problem we're seeing.

I've attached our test programs, with instructions.

Comment 15 Stuart Friedberg 2011-07-07 22:03:25 UTC
(In reply to comment #14)
> We applied the patch to 3.2.1 and gave it a try. Unfortunately,
> that doesn't seem to have any impact on the problem we're seeing.

I've been reminded of something in our test configuration that may be
relevant to reproducing the problem.  Our bricks are all on USB flash
devices, which are _slow_ and therefore could have a significant
impact on any timing-related bugs.

Assuming the timing is relevant, the dm-delay device mapper target
(DM_DELAY Kconfig option) can be used to increase read and write
latency for selected devices.  That would be especially useful if
you put timing-stressed regression tests in your QA suite.

Comment 16 Anand Avati 2011-07-08 01:30:08 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > We applied the patch to 3.2.1 and gave it a try. Unfortunately,
> > that doesn't seem to have any impact on the problem we're seeing.
> 
> I've been reminded of something in our test configuration that may be
> relevant to reproducing the problem.  Our bricks are all on USB flash
> devices, which are _slow_ and therefore could have a significant
> impact on any timing-related bugs.
> 
> Assuming the timing is relevant, the dm-delay device mapper target
> (DM_DELAY Kconfig option) can be used to increase read and write
> latency for selected devices.  That would be especially useful if
> you put timing-stressed regression tests in your QA suite.

Thanks Stuart. We will test with this and get back to you..

Avati

Comment 17 Amar Tumballi 2011-09-30 05:58:06 UTC
We just need to run tests checking for rename failures. In our preliminary testing, we could not find issues. We can't take this bug as a 'blocker' for the 3.3.0 release.

Comment 18 Stuart Friedberg 2011-09-30 18:40:16 UTC
(In reply to comment #17)
Just confirming that the problem has not gone away spontaneously for us,
and that this is a critical problem preventing use of gluster for our
application.

Please let me know if there is any additional information that would
be helpful in reproducing this problem.  In our environment, we can
reproduce this multiple times a second.

Also, our suspicions point to either the fuse module or the gluster-fuse
interface.  This "smells" like a bad cached entry on the node doing the
rename.  The repositories and all remote nodes have no difficulty seeing
the results of the rename.  Only the node doing the rename is affected.

Comment 19 Stuart Friedberg 2011-10-21 02:39:25 UTC
We just upgraded the backing store for the gluster repository on our test cluster from USB flash drives to SSDs, and we are still seeing multiple occurrences per second.  So it is not necessary to have incredibly slow devices to reproduce this problem.  Same test programs as previously submitted.

About 80% of the errors are ENOENT on open(2), while the rest are ENOENT on read(2) after a successful open(2) (which remains wonderfully bizarre to me).

Comment 20 Niels de Vos 2014-09-25 07:53:07 UTC
This is still an issue with glusterfs-3.5.2. The steps in comment #3 result in the following error:

    open(dovecot.index): No such file or directory

This error is hit in just a few seconds while running on a single-brick volume. Mounting with '-o entry-timeout=0' does not make a noticeable difference.

Comment 21 Anand Avati 2014-09-25 21:38:03 UTC
Several changes are needed for the test program to work

1. FUSE needs to send open intent along with lookup, if the lookup is being performed for an open()

2. Gluster needs to somehow use the intent flag, and keep the server side inode "alive" (as if already opened) for some time for a future actual open's guaranteed success even in the event of the file getting unlink'ed or rename'd over.

3. Server side resolver must serialize over dentries to fix the lookup/open + unlink race.

1 and 2 are somewhat related, 3 fixes a separate issue. all fixes are needed for this bug.

Comment 22 meher.gara@gmail.com 2015-02-06 15:59:37 UTC
Guys, any news on this? I still have this issue:
a file-not-found error while renaming.
Thanks.

Comment 23 Maddy Markovitz 2015-05-01 00:10:53 UTC
Hi, I am also having this issue, running GlusterFS 3.6.2.  I'm using a test setup similar to the ones that other users have, and I get problems trying to read during a rename - usually a "Stale file handle" message.  
Can we raise the priority on this?  Needing rename() to behave atomically is a common use case, and it also breaks POSIX compliance.

Comment 24 dietmar 2015-10-20 15:33:16 UTC
Hi all,

we also detected problems with rename. After testing our own file system, it turned out to be a problem in FUSE itself, i.e. you can reproduce this bug on the basic fusexmp example file system, or on any other FUSE-based file system (sshfs, for example).

Comment 25 Niels de Vos 2016-06-17 15:56:43 UTC
This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. If you are still facing this issue in a more current release, please open a new bug against a version that still receives bugfixes.

