Description of problem:
processes on two nodes loop doing this in one dir:
1. create a new file foo.pid.count
2. rename foo.pid.count bar
We'll regularly get ENOENT returned from the rename(2).
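The attached test itself is not reproduced in this report; a minimal sketch of the loop described above (filenames follow the description; everything else is assumed) would look something like:

/* Minimal reproducer sketch based on the description above: each node
 * creates a uniquely named file and renames it over a shared
 * destination, in a loop, all in the same GFS directory. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char fname[64];
	unsigned long count = 0;

	for (;;) {
		snprintf(fname, sizeof(fname), "foo.%d.%lu",
			 (int)getpid(), count++);

		int fd = open(fname, O_CREAT | O_RDWR, 0644);
		if (fd < 0) {
			perror("open");
			continue;
		}
		close(fd);

		/* rename() over an existing destination should never fail
		 * with ENOENT, yet it regularly does here when another
		 * node runs the same loop in the same directory. */
		if (rename(fname, "bar") < 0)
			fprintf(stderr, "rename: %s\n", strerror(errno));
	}
	return 0;
}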
Version-Release number of selected component (if applicable):
Reported on IRC; seen by users running the Dovecot IMAP server.
Steps to Reproduce:
Run the attached test on two machines at once in the same directory.
Expected results:
no rename errors
Created attachment 128507 [details]
test that shows the problem, run this on two machines at once in one dir
Created attachment 128513 [details]
updated version of test printing open errors
I've attached a slightly modified version of the test that
will make the error cases clearer:
- if there's one instance of this test running on each
node, then I just get ENOENT errors from rename(2).
- if there are two instances of this test running on a node,
then I get ENOENT errors from rename(2) and ENOENT errors from
the open(2) of "rename.test" (open errors only on the node
running two instances).
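For reference, a rough sketch of the extra check the modified test appears to add inside that loop (the name "rename.test" comes from the comment above; the rest is assumed):

		/* Added check (sketch): open the file the renames target
		 * and report errors; with two local instances this
		 * open(2) also starts failing with ENOENT. */
		int ofd = open("rename.test", O_RDWR);
		if (ofd < 0)
			perror("open rename.test");
		else
			close(ofd);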
Looks like we have a couple of mail server issues that may all relate to this
one. I suspect we do not invalidate the dentry correctly when the unlink (rename)
is done by another node.
I'll start working on this issue today.
There are two issues here:
1) I'm not sure this test case is valid. When two processes try to rename
different files to the very same name, should they use either a POSIX lock
or flock on the directory? The reason ENOENT doesn't show up on ext3 is
that the VFS layer does an early "lock_rename" (it locks the directories)
*before* it does the final lookup (the filename without the pathname). By the
time GFS is called, the VFS layer has already obtained the file's dentry. So there
is a (big) window between the VFS layer getting the file's dentry (with its inode
pointer) and GFS being able to do anything to notify this node that the inode has
been removed by another cluster node. If we want GFS to behave the same way
as ext3, the VFS's "lock_rename" has to call us. Even if the upstream folks agree
to call us in "lock_rename", there is currently no hook in the VFS that can do
that (see the ordering sketch below).
2) The locking sequence in gfs_rename() doesn't look right either. Will update
this issue soon.
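For reference, a rough outline of the VFS rename path being described in 1) (a sketch from memory, not the exact kernel source):

/*
 * sys_rename()
 *   lookup of the two parent directories
 *   lock_rename(new_dir, old_dir)       -- VFS takes the (node-local)
 *                                          parent directory locks
 *   final lookup of "fromA" and "toB"   -- VFS now holds toB's dentry
 *                                          and inode pointer
 *     ... window: another cluster node can unlink or rename toB here,
 *         and nothing tells this node that its dentry has gone stale ...
 *   ->rename()                          -- gfs_rename() is only called now
 *
 * On ext3 the parent directory locks are enough because everything runs
 * on one node; on GFS they are node-local, so the lookup result can go
 * stale before gfs_rename() ever takes its cluster locks.
 */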
It is a valid test case. The rename call should act as if it is atomic, so that
given a destination that always exists, it should find it and replace it.
Exactly which file it finds is of course subject to races in this particular
case, but this kind of thing can be used as part of locking schemes etc.
Rename used to do its locking at the VFS layer in a similar way to GFS, by
using the addresses of the various mutexes (well, semaphores in those days) to
avoid deadlocks. That was found to have problems, hence the top-down approach
(i.e. parents before children).
I think we can use the same lock ordering in GFS to avoid deadlocks, as we
already use that ordering in file creation for example, where there can be no
deadlock.
Dealing with the race, however, is a more tricky problem as you say...
hmm... how about this: instead of messing around with VFS's lock_rename() (and
the upstream folks), we (GFS) could simply unhash the dentry and free the
inode that are passed to us by the VFS, then start over again (with an expected
performance hit). This way we could get rid of this ENOENT error.
In any case, we still need to examine ("fix" is a better word) gfs_rename itself.
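Roughly, the unhash-and-retry idea might look like this (a sketch only, with a pseudocode staleness check; not an actual patch):

	/* Sketch: if gfs_rename() discovers that the target inode has been
	 * removed by another node, unhash the cached dentry so the lookup
	 * can be repeated, instead of failing the rename with ENOENT. */
	if (target_removed_by_other_node) {	/* pseudocode condition */
		d_drop(ndentry);		/* unhash the stale dentry */
		/* ... then redo the lookup and retry the rename */
	}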
If that's possible, and I believe it is, then it's the best approach open to us, I
think. I'd be quite happy to see it fixed that way and I think the upstream
folks will be too.
This issue is messier than I expected. The "open" is broken too - will give that
(open) some more thought tomorrow. However, I do have a draft patch for rename
now - after some more sanity checks, I will post it here for review. The good news is
that this may be the cause of several double-free inode panics we have been
seeing in mail server environments.
This applies to GFS2 as well, btw. Can you explain a bit more about the "open"
issue?
You missed Dave's comment #3 above (as I did :)). Looks like open() can become
the victim of this stale dentry issue too.
Ah, I see it now :-) In which case it should really be solved by locking at
lookup time with the open intents. We need to check whether the current
implementation gives us enough information in the nameidata structure to solve
that. If not, we may need to patch the upstream VFS, at least for GFS2.
The patch didn't work well. Looks like dropping the dentry would create orphan GFS
inodes that cannot be deallocated. That results in umount hangs (and other
obvious combinations such as buffer/memory leaks, etc.). Still scratching my head
at this moment.
Created attachment 135183 [details]
Draft patch (with restriction)
When rename(fromA, toB) is called, if another node happens to
either delete and/or rename the "toB" file at the same time,
current GFS can return ENOENT and fail the rename() call.
This seems to violate the rename API. This patch fixes that.
Restriction: if the "toB" file has been referenced before the
rename() (and has not been released) on the same node, while
another node deletes/renames it, this patch follows the current
GFS logic - that is, it will return ENOENT and fail the call.
Using the above test program as an example, if we don't include
the logic of "oldfd" as in: "oldfd = open(fname, O_RDWR);", this
patch will fix the issue. However, if we let that "open" stay,
I don't know how to fix it at this moment.
Sorry, the second node must do a rename (not a delete) for this error to occur. It
is late at night so my head is not clear.
We need to discuss the possibility of pushing for upstream lock_rename()
changes. If lock_rename can call us, ~10 lines of code should do the trick, vs.
the current ~300 lines of workaround. The difficulty (of the GFS workaround) is that
by the time we find the inode is stale, it is already inside the onion layers of 5
VFS locks and 5 GFS glocks - so it is not possible to call the existing routines.
This implies lots of cutting, pasting, unlocking, re-locking, etc. Very annoying and dangerous.
Created attachment 135292 [details]
Working draft patch
The test ran fine last night - looks like it completely fixed the issue without
restrictions. Uploading the draft patch here so it won't get lost.
What I'll do next week (my machine needs to get reset to work on something
else) is to refine this gigantic patch into three smaller patches for easier
management and review:
Patch 1 will be locking stuff.
Patch 2 will be inode changes.
Patch 3 will be the core rename changes.
Created attachment 146972 [details]
This is the 3rd patch that was scheduled to be checked into CVS last week. However,
the final sanity check shows that it will try to read an already-deleted file,
which causes EIO. It ends up withdrawing the filesystem. I originally tested
this on two i686 SMP machines, but later one of my test nodes was replaced with
a UP x86_64 machine. The error always happens with the UP x86_64 machine, and it
looks to me like both nodes think they have the lock. So the i686 SMP machine
deletes the file while the x86_64 UP machine tries to read it. The issue seems to
be somewhere around the asynchronous locking.
I'll take a closer look tonight (didn't have a chance to get to it in the office
today). If I can't finish this up, Ben's help will be greatly appreciated.
I see the error with this patch on two i686 machines with SMP kernels. One
machine has two CPUs; the other only has one. For what it's worth, I have always
seen the error on the one with two CPUs.
It does relate to the asynchronous locking...
1) The panic is always on the "faster" node. When the other (slower) node is
   still updating the on-disk structures, the faster node thinks it has
   the lock and does a gfs_dir_search() that reads in garbage. The new
   rename patch relies on the new ino from the gfs_dir_search() call to
   update its dentry. Since the contents of the new ino are garbage, the
   next disk read can sometimes go to never-land.
2) In previous testing, I always had the gfs global rename lock. It was the
   first lock taken, using gfs_glock_nq_init (which uses a synchronous lock
   by default), so I never saw this problem.
3) There are a few more bugs in gfs_rename(); for example, it never checks
   whether odip == ndip and requests these two locks (odip and ndip)
   regardless (see the sketch below). Under synchronous locking, this would
   immediately generate an assertion failure, but asynchronous locking lets
   it get away with it. Interestingly, GFS2 doesn't have this issue (it does
   check odip == ndip before requesting the lock).
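A sketch of the missing check mentioned in 3) (illustrative only; GFS2 already does the equivalent, and the names here only approximate the GFS source of that era):

	/* Only ask for the second directory glock when the two parent
	 * directories actually differ; locking the same glock twice
	 * trips an assertion under synchronous locking. */
	if (odip != ndip) {
		error = gfs_glock_nq_init(ndip->i_gl, LM_ST_EXCLUSIVE, 0,
					  &nd_gh);
		if (error)
			goto fail;
	}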
The good news is that after changing the rename locking to synchronous
mode, gfs no longer crashes.
The bad news is that gfs_rename now deadlocks "correctly" from time to
time. For example, each rename is normally associated with 4 locks (two
directories and two files) - it never bothered to check their ordering,
unlike other parts of the GFS kernel. In one instance it deadlocks
with gfs_lookupi, which locks the file first, followed by the directory,
while rename locks the directory first, followed by the file.
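Schematically, the conflicting lock orders described above:

/*
 *   gfs_lookupi              gfs_rename
 *   -----------              ----------
 *   1. file glock            1. directory glock
 *   2. directory glock       2. file glock
 *
 * Each path can end up holding the glock the other one wants next,
 * hence the deadlock.
 */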
The issue is in the routine "gfs_glock_nq_m", which silently switches
the locking mode to asynchronous regardless of the request. I'm still
deciding whether I should chase the asynchronous locking in this
bugzilla (and fix gfs_glock_nq_m() accordingly) or just fix gfs_rename
now and leave the asynchronous locking issue (and gfs_glock_nq_m) for
the next update (RHEL 4.6).
Chasing glock async locking issues could take a while ...
I've isolated the changes within gfs_rename() and left gfs_glock_nq_m()
untouched. Will need to look into the asynchronous locking issues as soon as
we get a chance.
After the debug statements are cleaned up, I will check the changes into CVS.
Messing with rename is like walking through a field full of land mines. Each
time I try a new test case, a new deadlock explodes. The newest deadlock
is very messy to get out of under the current GFS lock ordering rules.
When rename finds its destination file has been deleted by another node,
it already has all the locks (two directory locks plus two file locks).
My original patch discards the stale dentry and creates a new dentry
based on the new file (created by the other node). We need to lock this
new file with the new ino. If the new ino is smaller than its directory's
lock number, we could deadlock with another node if it happens to be doing a
lookup on this file (where it would get this lock first, followed by the
directory lock, while this process, on the other hand, already has the
directory lock but would be required to get this new lock). I tried to
enforce the lock order by releasing all the old locks, then re-acquiring
them all again, but it makes the code very messy. Under Dave's test
program it looks terrible, since as soon as I release the directory
locks, another node picks up these locks and removes this new file.
A compromise made by today's CVS check-in is to check the new ino
number to see whether it could cause a deadlock. If so, we give
up by returning the original error (-ENOENT).
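A sketch of that compromise (illustrative only; the lock-order helper and variable names are simplified, not the ones actually used in gfs_rename):

	/* If the new target inode would have to be locked "out of order"
	 * relative to the directory glock we already hold, chasing it
	 * could AB-BA deadlock with a concurrent lookup on another node,
	 * so fall back to the old behaviour instead. */
	if (lock_ordinal(new_ino) < lock_ordinal(dir_ino)) {	/* hypothetical helper */
		error = -ENOENT;
		goto fail;
	}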
Sadly moving this bugzilla into the MODIFIED state. This is about all I can do
in RHEL 4.5.
So in short, after such a long and exhausting effort, we only reduce
the probability of returning this -ENOENT (which violates the rename API) by
roughly 25%, assuming GFS allocates the ino in a truly random manner.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.