Description of problem:
processes on two nodes loop doing this in one dir:
1. create a new file foo.pid.count
2. rename foo.pid.count bar
We'll regularly get ENOENT returned from the rename(2).
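The attached test itself is not reproduced in this report; a minimal sketch of the loop described above (filenames follow the description; everything else is assumed) would look something like:

/* Minimal reproducer sketch based on the description above: each node
 * creates a uniquely named file and renames it over a shared
 * destination, in a loop, all in the same GFS directory. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char fname[64];
	unsigned long count = 0;

	for (;;) {
		snprintf(fname, sizeof(fname), "foo.%d.%lu",
			 (int)getpid(), count++);

		int fd = open(fname, O_CREAT | O_RDWR, 0644);
		if (fd < 0) {
			perror("open");
			continue;
		}
		close(fd);

		/* rename() over an existing destination should never fail
		 * with ENOENT, yet it regularly does here when another
		 * node runs the same loop in the same directory. */
		if (rename(fname, "bar") < 0)
			fprintf(stderr, "rename: %s\n", strerror(errno));
	}
	return 0;
}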
Version-Release number of selected component (if applicable):
Reported on IRC; seen by users running the Dovecot IMAP server.
Steps to Reproduce:
Run the attached test on two machines at once in the same directory.
Expected results:
no rename errors
Created attachment 128507 [details]
test that shows the problem, run this on two machines at once in one dir
Created attachment 128513 [details]
updated version of test printing open errors
I've attached a slightly modified version of the test that
will make the error cases clearer:
- if there's one instance of this test running on each
node, then I just get ENOENT errors from rename(2).
- if there are two instances of this test running on a node,
then I get ENOENT errors from rename(2) and ENOENT errors from
the open(2) of "rename.test" (open errors only on the node
running two instances).
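For reference, a rough sketch of the extra check the modified test appears to add inside that loop (the name "rename.test" comes from the comment above; the rest is assumed):

		/* Added check (sketch): open the file the renames target
		 * and report errors; with two local instances this
		 * open(2) also starts failing with ENOENT. */
		int ofd = open("rename.test", O_RDWR);
		if (ofd < 0)
			perror("open rename.test");
		else
			close(ofd);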
Looks like we have a couple of mail server issues that may all relate to this
one. I suspect we do not invalidate the dentry correctly when the unlink (rename)
is done by another node.
I'll start working on this issue today.
There are two issues here:
1) I'm not sure this test case is valid. When two processes try to rename
different files to the very same name, should they use either a POSIX lock
or flock on the directory? The reason ENOENT doesn't show up on ext3 is
that the VFS layer does an early "lock_rename" (it locks the directories)
*before* it does the final lookup (the filename without the pathname). By the
time GFS is called, the VFS layer has already obtained the file's dentry. So there
is a (big) window between the VFS layer getting the file's dentry (with its inode
pointer) and GFS being able to do anything to notify this node that the inode has
been removed by another cluster node. If we want GFS to behave the same way
as ext3, the VFS's "lock_rename" has to call us. Even if the upstream folks agree
to call us in "lock_rename", there is currently no hook in the VFS that can do
that (see the ordering sketch below).
2) The locking sequence in gfs_rename() doesn't look right either. Will update
this issue soon.
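For reference, a rough outline of the VFS rename path being described in 1) (a sketch from memory, not the exact kernel source):

/*
 * sys_rename()
 *   lookup of the two parent directories
 *   lock_rename(new_dir, old_dir)       -- VFS takes the (node-local)
 *                                          parent directory locks
 *   final lookup of "fromA" and "toB"   -- VFS now holds toB's dentry
 *                                          and inode pointer
 *     ... window: another cluster node can unlink or rename toB here,
 *         and nothing tells this node that its dentry has gone stale ...
 *   ->rename()                          -- gfs_rename() is only called now
 *
 * On ext3 the parent directory locks are enough because everything runs
 * on one node; on GFS they are node-local, so the lookup result can go
 * stale before gfs_rename() ever takes its cluster locks.
 */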
It is a valid test case. The rename call should act as if it is atomic, so that
given a destination that always exists, it should find it and replace it.
Exactly which file it finds is of course subject to races in this particular
case, but this kind of thing can be used as part of locking schemes etc.
Rename used to do its locking at the VFS layer in a similar way to GFS, by
using the addresses of the various mutexes (well, semaphores in those days) to
avoid deadlocks. That was found to have problems, hence the top-down approach
(i.e. parents before children).
I think we can use the same lock ordering in GFS to avoid deadlocks, as we
already use that ordering in file creation for example, where there can be no
deadlock.
Dealing with the race, however, is a more tricky problem as you say...
hmm... how about this: instead of messing around with VFS's lock_rename() (and
the upstream folks), we (GFS) could simply unhash the dentry and free the
inode that are passed to us by the VFS, then start over again (with an expected
performance hit). This way we could get rid of this ENOENT error.
In any case, we still need to examine ("fix" is a better word) gfs_rename itself.
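Roughly, the unhash-and-retry idea might look like this (a sketch only, with a pseudocode staleness check; not an actual patch):

	/* Sketch: if gfs_rename() discovers that the target inode has been
	 * removed by another node, unhash the cached dentry so the lookup
	 * can be repeated, instead of failing the rename with ENOENT. */
	if (target_removed_by_other_node) {	/* pseudocode condition */
		d_drop(ndentry);		/* unhash the stale dentry */
		/* ... then redo the lookup and retry the rename */
	}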
If that's possible, and I believe it is, then it's the best approach open to us, I
think. I'd be quite happy to see it fixed that way and I think the upstream
folks will be too.
This issue is messier than I expected. The "open" is broken too - will give that
(open) some more thought tomorrow. However, I do have a draft patch for rename
now - after some more sanity checks, I will post it here for review. The good news is
that this may be the cause of several double-free inode panics we have been
seeing in mail server environments.
This applies to GFS2 as well, btw. Can you explain a bit more about the "open"
issue?
You missed Dave's comment #3 above (as I did :)). Looks like open() can become
the victim of this stale dentry issue too.
Ah, I see it now :-) In which case it should really be solved by locking at
lookup time with the open intents. We need to check whether the current
implementation gives us enough information in the nameidata structure to solve
that. If not, we may need to patch the upstream VFS, at least for GFS2.
The patch didn't work well. Looks like dropping the dentry would create orphan GFS
inodes that cannot be deallocated. That results in umount hangs (and other
obvious combinations such as buffer/memory leaks, etc.). Still scratching my head
at this moment.
Created attachment 135183 [details]
Draft patch (with restriction)
When rename(fromA, toB) is called, if another node happens to
either delete and/or rename the "toB" file at the same time,
current GFS can return ENOENT and fail the rename() call.
This seems to violate the rename API. This patch fixes that.
Restriction: if the "toB" file has been referenced before the
rename() (and has not been released) on the same node, while
another node deletes/renames it, this patch follows the current
GFS logic - that is, it will return ENOENT and fail the call.
Using the above test program as an example, if we don't include
the logic of "oldfd" as in: "oldfd = open(fname, O_RDWR);", this
patch will fix the issue. However, if we let that "open" stay,
I don't know how to fix it at this moment.
Sorry, the second node must do a rename (not a delete) for this error to occur. It
is late at night so my head is not clear.
We need to discuss the possibility of pushing for upstream lock_rename()
changes. If lock_rename can call us, ~10 lines of code should do the trick, vs.
the current ~300 lines of workaround. The difficulty (of the GFS workaround) is that
by the time we find the inode is stale, it is already inside the onion layers of 5
VFS locks and 5 GFS glocks - so it is not possible to call the existing routines.
This implies lots of cutting, pasting, unlocking, re-locking, etc. Very annoying and dangerous.
Created attachment 135292 [details]
Working draft patch
The test ran fine last night - looks like it completely fixed the issue without
restrictions. Uploading the draft patch here so it won't get lost.
What I'll do next week (my machine needs to get reset to work on something
else) is to refine this gigantic patch into three smaller patches for easier
management and review:
Patch 1 will be locking stuff.
Patch 2 will be inode changes.
Patch 3 will be the core rename changes.
Created attachment 146972 [details]
This is the 3rd patch that was scheduled to be checked into CVS last week. However,
the final sanity check shows that it will try to read an already-deleted file,
which causes EIO. It ends up withdrawing the filesystem. I originally tested
this on two i686 SMP machines, but later one of my test nodes was replaced with
a UP x86_64 machine. The error always happens with the UP x86_64 machine, and it
looks to me like both nodes think they have the lock. So the i686 SMP machine
deletes the file while the x86_64 UP machine tries to read it. The issue seems to
be somewhere around the asynchronous locking.
I'll take a closer look tonight (didn't have a chance to get to it in the office
today). If I can't finish this up, Ben's help will be greatly appreciated.
I see the error with this patch on two i686 machines with SMP kernels. One
machine has two CPUs; the other only has one. For what it's worth, I have always
seen the error on the one with two CPUs.
It does relate to the asynchronous locking...
1) The panic is always on the "faster" node. When the other (slower) node is
   still updating the on-disk structures, the faster node thinks it has
   the lock and does a gfs_dir_search() that reads in garbage. The new
   rename patch relies on the new ino from the gfs_dir_search() call to
   update its dentry. Since the contents of the new ino are garbage, the
   next disk read can sometimes go to never-land.
2) In previous testing, I always had the gfs global rename lock. It was the
   first lock taken, using gfs_glock_nq_init (which uses a synchronous lock
   by default), so I never saw this problem.
3) There are a few more bugs in gfs_rename(); for example, it never checks
   whether odip == ndip and requests these two locks (odip and ndip)
   regardless (see the sketch below). Under synchronous locking, this would
   immediately generate an assertion failure, but asynchronous locking lets
   it get away with it. Interestingly, GFS2 doesn't have this issue (it does
   check odip == ndip before requesting the lock).
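A sketch of the missing check mentioned in 3) (illustrative only; GFS2 already does the equivalent, and the names here only approximate the GFS source of that era):

	/* Only ask for the second directory glock when the two parent
	 * directories actually differ; locking the same glock twice
	 * trips an assertion under synchronous locking. */
	if (odip != ndip) {
		error = gfs_glock_nq_init(ndip->i_gl, LM_ST_EXCLUSIVE, 0,
					  &nd_gh);
		if (error)
			goto fail;
	}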
The good news is that after changing the rename locking to synchronous
mode, gfs no longer crashes.
The bad news is that gfs_rename now deadlocks "correctly" from time to
time. For example, each rename is normally associated with 4 locks (two
directories and two files) - it never bothered to check their ordering,
unlike other parts of the GFS kernel. In one instance it deadlocks
with gfs_lookupi, which locks the file first, followed by the directory,
while rename locks the directory first, followed by the file.
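Schematically, the conflicting lock orders described above:

/*
 *   gfs_lookupi              gfs_rename
 *   -----------              ----------
 *   1. file glock            1. directory glock
 *   2. directory glock       2. file glock
 *
 * Each path can end up holding the glock the other one wants next,
 * hence the deadlock.
 */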
The issue is in the routine "gfs_glock_nq_m", which silently switches
the locking mode to asynchronous regardless of the request. I'm still
deciding whether I should chase the asynchronous locking in this
bugzilla (and fix gfs_glock_nq_m() accordingly) or just fix gfs_rename
now and leave the asynchronous locking issue (and gfs_glock_nq_m) for
the next update (RHEL 4.6).
Chasing glock async locking issues could take a while ...
I've isolated the changes within gfs_rename() and left gfs_glock_nq_m()
untouched. Will need to look into the asynchronous locking issues as soon as
we get a chance.
After the debug statements are cleaned up, I will check the changes into CVS.
Messing with rename is like walking through a field full of land mines. Each
time I try a new test case, a new deadlock explodes. The newest deadlock
is very messy to get out of under the current GFS lock ordering rules.
When rename finds its destination file has been deleted by another node,
it already has all the locks (two directory locks plus two file locks).
My original patch discards the stale dentry and creates a new dentry
based on the new file (created by the other node). We need to lock this
new file with the new ino. If the new ino is smaller than its directory's
lock number, we could deadlock with another node if it happens to be doing a
lookup on this file (where it would get this lock first, followed by the
directory lock, while this process, on the other hand, already has the
directory lock but would be required to get this new lock). I tried to
enforce the lock order by releasing all the old locks, then re-acquiring
them all again, but it makes the code very messy. Under Dave's test
program it looks terrible, since as soon as I release the directory
locks, another node picks up these locks and removes this new file.
A compromise made by today's CVS check-in is to check the new ino
number to see whether it could cause a deadlock. If so, we give
up by returning the original error (-ENOENT).
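A sketch of that compromise (illustrative only; the lock-order helper and variable names are simplified, not the ones actually used in gfs_rename):

	/* If the new target inode would have to be locked "out of order"
	 * relative to the directory glock we already hold, chasing it
	 * could AB-BA deadlock with a concurrent lookup on another node,
	 * so fall back to the old behaviour instead. */
	if (lock_ordinal(new_ino) < lock_ordinal(dir_ino)) {	/* hypothetical helper */
		error = -ENOENT;
		goto fail;
	}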
Sadly moving this bugzilla into the MODIFIED state. This is about all I can do
in RHEL 4.5.
So in short, after such a long and exhausting effort, we only reduce
the probability of returning this -ENOENT (which violates the rename API) by
roughly 25%, assuming GFS allocates the ino in a truly random manner.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.