Bug 190475 - error when two nodes rename two files to same new name
Summary: error when two nodes rename two files to same new name
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Wendy Cheng
QA Contact: GFS Bugs
URL:
Whiteboard:
Depends On:
Blocks: 204616
 
Reported: 2006-05-02 18:55 UTC by David Teigland
Modified: 2010-01-12 03:11 UTC
CC List: 2 users

Fixed In Version: RHBA-2007-0142
Clone Of:
Environment:
Last Closed: 2007-05-10 21:13:03 UTC
Embargoed:


Attachments
test that shows the problem, run this on two machines at once in one dir (787 bytes, text/plain) - 2006-05-02 18:55 UTC, David Teigland
updated version of test printing open errors (942 bytes, text/plain) - 2006-05-02 19:54 UTC, David Teigland
Draft patch (with restriction) (13.69 KB, patch) - 2006-08-30 04:59 UTC, Wendy Cheng
OCFS2 implementation (9.38 KB, text/plain) - 2006-08-30 14:54 UTC, Wendy Cheng
Working draft patch (14.00 KB, patch) - 2006-08-31 14:38 UTC, Wendy Cheng
gfs_rename_core.patch (5.91 KB, text/plain) - 2007-01-30 22:38 UTC, Wendy Cheng


Links
System ID: Red Hat Product Errata RHBA-2007:0142
Priority: normal
Status: SHIPPED_LIVE
Summary: GFS-kernel bug fix update
Last Updated: 2007-05-10 21:12:20 UTC

Description David Teigland 2006-05-02 18:55:14 UTC
Description of problem:

processes on two nodes loop doing this in one dir:
1. create a new file foo.pid.count
2. rename foo.pid.count bar

We'll regularly get ENOENT returned from the rename(2).
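
For reference, a minimal sketch of the reproduction loop described above (the
attached test program is the authoritative version; the file names and the
destination name "bar" here simply follow the description, so details may
differ):

/* Minimal reproduction sketch; run one copy of this in the same GFS
 * directory on each of two nodes. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char fname[64];
        unsigned long count;

        for (count = 0; ; count++) {
                /* 1. create a new file foo.<pid>.<count> */
                snprintf(fname, sizeof(fname), "foo.%d.%lu",
                         (int) getpid(), count);
                int fd = open(fname, O_CREAT | O_RDWR, 0644);
                if (fd < 0) {
                        printf("create %s error %d (%s)\n",
                               fname, errno, strerror(errno));
                        exit(1);
                }
                close(fd);

                /* 2. rename it onto the shared destination name */
                if (rename(fname, "bar") < 0)
                        printf("rename error %d (%s)\n",
                               errno, strerror(errno));
        }
        return 0;
}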

Version-Release number of selected component (if applicable):


How reproducible:

Always.
Reported on IRC; seen by users running the Dovecot IMAP server.

Steps to Reproduce:
1.
2.
3.
  
Actual results:

rename errors

Expected results:

no rename errors

Additional info:

Comment 1 David Teigland 2006-05-02 18:55:14 UTC
Created attachment 128507 [details]
test that shows the problem, run this on two machines at once in one dir

Comment 2 David Teigland 2006-05-02 19:54:53 UTC
Created attachment 128513 [details]
updated version of test printing open errors

Comment 3 David Teigland 2006-05-02 19:59:25 UTC
I've attached a slightly modified version of the test that
will make the error cases clearer:

- if there's one instance of this test running on each
node, then I just get ENOENT errors from rename(2).

- if there are two instances of this test running on a node,
then I get ENOENT errors from rename(2) and ENOENT errors from
the open(2) of "rename.test" (open errors only on the node
running two instances).


Comment 4 Wendy Cheng 2006-08-16 15:27:44 UTC
It looks like we have a couple of mail server issues that may all relate to this
one. I suspect we do not invalidate the dentry correctly when the unlink (rename)
is done by another node.

I'll start to work on this issue today. 

Comment 5 Wendy Cheng 2006-08-17 14:26:59 UTC
There are two issues here:

1) I'm not sure this test case is valid. When two processes try to rename 
   different files to the very same name, should they use either a POSIX lock 
   or flock on the directory? The reason ENOENT doesn't show up in ext3 is 
   that the VFS layer does an early "lock_rename" (it locks the directories)
   *before* it does the final lookup (the filename without the pathname). By the
   time GFS is called, the VFS layer has already obtained the file's dentry. So
   there is a (big) window between the VFS layer getting the file's dentry (which
   carries the inode pointer) and the earliest point at which GFS could notify
   this node that the inode has been removed by another cluster node (see the
   sketch after this list). If we want GFS to behave the same way as ext3, the
   VFS's "lock_rename" has to call into us. Even if upstream folks agreed to
   call us from "lock_rename", there is currently no hook in the VFS that can do
   this job. 
2) The locking sequence in gfs_rename() doesn't look right either. Will update
   this issue soon.
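
For reference, a simplified pseudocode paraphrase of the 2.6-era VFS rename path
described in point 1 (approximate, from memory of fs/namei.c; error handling,
the "trap" checks, and other details are omitted). The key point is that
lock_rename() and the final-component lookups happen before the filesystem's
->rename() is ever called:

/* Approximate paraphrase of the 2.6-era do_rename(); not verbatim kernel
 * source. */
static int do_rename_sketch(struct nameidata *oldnd, struct nameidata *newnd)
{
        struct dentry *trap, *old_dentry, *new_dentry;
        int error;

        /* The VFS locks both parent directories first ... */
        trap = lock_rename(newnd->dentry, oldnd->dentry);

        /* ... then resolves the last path components itself ... */
        old_dentry = lookup_hash(&oldnd->last, oldnd->dentry);
        new_dentry = lookup_hash(&newnd->last, newnd->dentry);
        /* (the real code also checks these against "trap" and handles
         *  lookup errors here) */

        /* ... and only then calls into the filesystem (->rename, i.e.
         * gfs_rename).  The dentry/inode GFS receives was looked up above,
         * before GFS had any chance to learn that another cluster node had
         * already removed or replaced the target. */
        error = vfs_rename(oldnd->dentry->d_inode, old_dentry,
                           newnd->dentry->d_inode, new_dentry);

        unlock_rename(newnd->dentry, oldnd->dentry);
        return error;
}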

Comment 6 Steve Whitehouse 2006-08-17 14:48:14 UTC
It is a valid test case. The rename call should act as if it's atomic, so that
given a destination that always exists, it should find it and replace it.
Exactly which file it finds is of course subject to races in this particular
case, but this kind of thing can be used as part of locking schemes etc.

Rename used to do its locking at the VFS layer in a similar way to GFS, by
using the addresses of the various mutexes (well, semaphores in those days) to
avoid deadlocks. That was found to have problems, hence the top-down approach
(i.e. parents before children).

I think we can use the same lock ordering in GFS to avoid deadlocks, as we
already use that ordering in file creation, for example, where there can be no
deadlocks.

Dealing with the race however, is a more tricky problem as you say...




Comment 7 Wendy Cheng 2006-08-17 15:06:25 UTC
hmm... how about this: instead of messing around with VFS's lock_rename() (and
upstream folks), we (GFS) could simply unhash the dentry and free the 
inode that were passed to us by the VFS, then start over again (with an expected
performance hit). This way we could get rid of this ENOENT error.

In any case, we still need to examine ("fix" is a better word) gfs_rename 
locking issues. 

Comment 8 Steve Whitehouse 2006-08-17 15:18:11 UTC
If that's possible, and I believe it is, then it's the best approach open to us, I
think. I'd be quite happy to see it fixed that way, and I think the upstream
folks will be too.

Comment 9 Wendy Cheng 2006-08-18 04:32:32 UTC
This issue is messier than I expected. The "open" is broken too - will give that
(open) some more thought tomorrow. However, I do have a draft patch for rename
now - after some more sanity checks, I will post it here for review. The good news
is that this may be the cause of several double-free inode panics we have been
seeing in mail server environments.

Comment 10 Steve Whitehouse 2006-08-18 08:24:28 UTC
This applies to GFS2 as well, btw. Can you explain a bit more about the "open"
issue?

Comment 11 Wendy Cheng 2006-08-18 14:20:10 UTC
You missed Dave's comment #3 above (as I did :)). Looks like open() can become
a victim of this stale dentry issue too. 

Comment 12 Steve Whitehouse 2006-08-18 14:25:17 UTC
Ah, I see it now :-) In which case it should really be solved by locking at lookup
time with the open intents. We need to check whether the current implementation
gives us enough information in the nameidata structure to solve that. If not, we
may need to patch the upstream VFS, at least for GFS2.


Comment 13 Wendy Cheng 2006-08-21 21:51:54 UTC
The patch didn't work well. Looks like dropping the dentry would create orphan GFS 
inodes that cannot be de-allocated. That results in umount hangs (and other
obvious problems such as buffer/memory leaks, etc). Still scratching my head 
at this moment.   




Comment 14 Wendy Cheng 2006-08-30 04:59:50 UTC
Created attachment 135183 [details]
Draft patch (with restriction)

When rename(fromA, toB) is called, if another node happens to 
either delete and/or rename the "toB" file at the same time, 
current GFS can return ENOENT and fail the rename() call. 
This seems to violate the rename API. This patch fixes that.

Restriction: If the "toB" file has been referenced before 
rename() (and has not been released) on the same node, while 
another node deletes/renames it, this patch follows the current 
GFS logic - that is, it will return ENOENT and fail the call.

Using the above test program as an example: if we don't include 
the logic of "oldfd", as in "oldfd = open(fname, O_RDWR);", this 
patch fixes the issue. However, if we let that "open" stay, 
I don't know how to fix it at this moment.
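
For illustration, a hypothetical fragment of the kind of "oldfd" usage being
discussed (not the attached test itself; the destination name "rename.test" is
taken from comment #3). The point is that the node keeps an open file descriptor
on the destination, so it already holds a reference to the old inode when
another node renames over it:

/* Hypothetical sketch; names and details are assumptions, the attached test
 * program is authoritative. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static void rename_with_held_reference(const char *fname)
{
        /* Keep a reference on the current destination file ... */
        int oldfd = open("rename.test", O_RDWR);
        if (oldfd < 0)
                printf("open error %d (%s)\n", errno, strerror(errno));

        /* ... then rename our own file over that destination; another node
         * may be doing the same thing at the same time. */
        if (rename(fname, "rename.test") < 0)
                printf("rename error %d (%s)\n", errno, strerror(errno));

        if (oldfd >= 0)
                close(oldfd);
}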

Comment 15 Wendy Cheng 2006-08-30 05:18:20 UTC
Sorry, the second node must do a rename (not a delete) for this error to occur. It
is late at night, so my head is not clear. 


Comment 16 Wendy Cheng 2006-08-30 14:06:33 UTC
We need to discuss the possibility of pushing for lock_rename() upstream
changes. If lock_rename can call us, ~10 lines of code should do the trick, vs.
the current ~300 lines of workaround. The difficulty (of the GFS workaround) is
that by the time we find the inode is stale, we are already inside the onion
layers of 5 VFS locks and 5 GFS glocks - so it is not possible to call the
existing routines. This implies lots of cutting, pasting, unlocking, re-locking,
etc. Very annoying and dangerous. 



Comment 19 Wendy Cheng 2006-08-31 14:38:09 UTC
Created attachment 135292 [details]
Working draft patch

The test ran fine last night - it looks like it completely fixed the issue without 
restrictions. Uploading the draft patch here so it won't get lost. 

What I'll do next week (my machine needs to be reset to work on something 
else) is refine this gigantic patch into three smaller patches for easier
management and review:

Patch 1 will be locking stuff.
Patch 2 will be inode changes.
Patch 3 will be the core rename changes.

Comment 21 Wendy Cheng 2007-01-30 22:38:47 UTC
Created attachment 146972 [details]
gfs_rename_core.patch

This is the 3rd patch, which was scheduled to be checked into CVS last week.
However, the final sanity check shows that it will try to read an already deleted
file, which causes EIO. It ends up withdrawing the filesystem. I originally tested
this on two i686 SMP machines, but later one of my test nodes was replaced with
a UP x86_64 machine. The error always happens with the UP x86_64 machine, and it
looks to me like both nodes think they have the lock. So the i686 SMP machine
deletes the file while the x86_64 UP machine tries to read it. The issue seems to
be somewhere around the asynchronous locking.

I'll take a closer look tonight (didn't have a chance to get to it in the office
today). If I can't finish this up, Ben's help will be greatly appreciated.

Comment 22 Ben Marzinski 2007-02-02 21:12:34 UTC
I see the error with this patch on two i686 machines with SMP kernels. One
machine has two CPUs; the other has only one. For what it's worth, I have always
seen the error on the one with two CPUs.

Comment 23 Wendy Cheng 2007-02-09 23:29:21 UTC
It does relate to the asynchronous locking.

1) The panic is always on the "faster" node. While the other (slower) node is 
   still updating the on-disk structures, the faster node thinks it has 
   the lock and does a gfs_dir_search() that reads in garbage. The new
   rename patch relies on the new ino from the gfs_dir_search() call to 
   update its dentry. Since the contents of the new ino are garbage, the next
   disk read can sometimes go off to never-land. 
2) In previous testing, I always had the gfs global rename lock. It was the first
   lock taken, using gfs_glock_nq_init (which uses a synchronous lock by default),
   so I never saw this problem. 
3) There are a few more bugs in gfs_rename(); for example, it never checks 
   whether odip == ndip and requests these two locks (odip and ndip) 
   regardless. Under synchronous locking this would immediately generate 
   an assertion failure, but asynchronous locking lets it get away with it. 
   Interestingly, GFS2 doesn't have this issue (it does check odip == ndip
   before requesting the lock). 

Comment 24 Wendy Cheng 2007-02-11 06:31:32 UTC
The good news is that after changing the rename locking into synchronous 
mode, gfs no longer crashes.

The bad news is that gfs_rename now deadlocks "correctly" from time to 
time. For example, each rename is normally associated with 4 locks (two
directories and two files), but gfs_rename never bothered to check their 
ordering, unlike other parts of the GFS kernel. In one instance it deadlocks
with gfs_lookupi, which locks the file first, followed by the directory, while 
rename locks the directory first, followed by the file.

The issue is in the routine "gfs_glock_nq_m", which silently switches 
the locking mode to asynchronous, regardless of the request. I'm still 
deciding whether I should chase the asynchronous locking using this 
bugzilla (and fix gfs_glock_nq_m() accordingly) or just fix gfs_rename 
now and leave the asynchronous locking issue (and gfs_glock_nq_m) for the
next update (RHEL 4.6). 

Chasing glock async locking issues could take a while ...
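
As a general illustration of the ordering rule being violated here (a userspace
analogy with made-up names, not GFS code): if every code path acquires the locks
it needs in one global order, for example by ascending lock number, the ABBA
deadlock between the rename path and the lookup path cannot occur.

/* Userspace analogy only: acquire an arbitrary set of locks deadlock-free
 * by always taking them in ascending "lock number" order. */
#include <pthread.h>
#include <stdlib.h>

struct numbered_lock {
        unsigned long long number;      /* e.g. derived from an inode number */
        pthread_mutex_t mutex;
};

static int by_number(const void *a, const void *b)
{
        const struct numbered_lock *la = *(const struct numbered_lock * const *)a;
        const struct numbered_lock *lb = *(const struct numbered_lock * const *)b;
        return (la->number > lb->number) - (la->number < lb->number);
}

/* Sort first, then lock in that order; every caller doing this cannot
 * form an ABBA cycle with any other caller doing the same. */
static void lock_in_order(struct numbered_lock **locks, size_t n)
{
        qsort(locks, n, sizeof(*locks), by_number);
        for (size_t i = 0; i < n; i++)
                pthread_mutex_lock(&locks[i]->mutex);
}

static void unlock_all(struct numbered_lock **locks, size_t n)
{
        while (n--)
                pthread_mutex_unlock(&locks[n]->mutex);
}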

Comment 25 Wendy Cheng 2007-02-12 19:51:09 UTC
I've isolated the changes within gfs_rename() and left gfs_glock_nq_m()
untouched. We will need to look into the asynchronous locking issues as soon as 
we get a chance. 

After the debug statements are cleaned up, I will check the changes into 
CVS today. 

Comment 26 Wendy Cheng 2007-02-14 23:11:45 UTC
Messing with rename is like walking through a field full of land mines. Each 
time I try a new test case, a new deadlock explodes. The newest deadlock
is very messy to get out of under the current GFS lock ordering rule.

When rename finds that its destination file has been deleted by another node, 
it already holds all the locks (two directory locks plus two file locks). 
My original patch discards the stale dentry and creates a new dentry 
based on the new file (created by another node). We then need to lock this 
new file with its new ino. If the new ino sorts below its directory 
lock, we could deadlock with another node that happens to be doing a 
lookup on this file (the lookup would take the file lock first, followed by 
the directory lock, while this process already holds the directory 
lock but would still be required to take this new lock). I tried to 
enforce the lock order by releasing all the old locks, then re-acquiring 
them all again, but it makes the code very messy. Under Dave's test 
program it looks terrible, since as soon as I release the directory 
locks, another node picks up these locks and removes the new file 
*again* :(.

The compromise made by today's CVS check-in is to check the new ino 
number to see whether taking its lock could cause a deadlock. If it could, we
give up by returning the original error (-ENOENT); a minimal sketch of that
check follows. 
Comment 27 Wendy Cheng 2007-02-14 23:21:03 UTC
Sadly moving this bugzilla into MODIFIED state. This is about all I can do
in RHEL 4.5. 



Comment 28 Wendy Cheng 2007-02-14 23:27:26 UTC
So, in short, after such a long and exhausting effort, we only reduce the
probability of returning this -ENOENT (which violates the rename API) by
roughly 25%, assuming GFS allocates inos in a truly random manner. 

Comment 31 Red Hat Bugzilla 2007-05-10 21:13:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0142.html


