Bug 479728 - NFS: unable to unmount file system
Summary: NFS: unable to unmount file system
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6
Hardware: All
OS: Linux
high
high
Target Milestone: beta
: ---
Assignee: Peter Staubach
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-01-12 18:36 UTC by Robert Peterson
Modified: 2009-05-18 19:24 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 19:24:31 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Proposed patch (1.08 KB, patch)
2009-02-17 19:52 UTC, Peter Staubach
no flags Details | Diff
Bob's patch (694 bytes, patch)
2009-02-19 16:09 UTC, Robert Peterson
no flags Details | Diff
proposed patch (1.59 KB, patch)
2009-02-24 15:58 UTC, Josef Bacik
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1024 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update 2009-05-18 14:57:26 UTC

Description Robert Peterson 2009-01-12 18:36:23 UTC
Description of problem:
After I do some NFS operations, I am unable to umount my source
file system.

Version-Release number of selected component (if applicable):
RHEL4.6

The problem does not recreate in the 2.6.9-78.19 kernel, but it
does recreate in the 2.6.9-78.20 thru .25 kernels

How reproducible:
Easily

Steps to Reproduce:
1. mkfs ext3 or gfs file system
2. mount ext3 or gfs file system
3. service nfs start
4. mount nfs over the ext3 or gfs file system
5. do some I/O
6. stop all I/O
7. umount nfs
8. service nfs stop
9. umount ext3 or gfs file system
  
Actual results:
Unable to unmount source file system:
File system busy

Expected results:
Should be able to unmount

Additional info:
lsof shows no files in use.  All tasks relating to the fs have ended.

Comment 1 Robert Peterson 2009-02-05 16:53:08 UTC
Hi Peter,

Can you or Steve D, by any chance, post a link to the patch that
was added between .19 and .20.  I want to try reverting the patch
(by hand if necessary) in a newer (2.6.9-79 or 2.6.9-80) kernel to
see if I can get the problem to go away for some of my bugzilla work.

Comment 2 Peter Staubach 2009-02-05 17:11:43 UTC
You should be able to find the patch by looking at the changelog section
of the spec file?  The bugzilla was 278291.

Comment 3 Robert Peterson 2009-02-05 19:30:59 UTC
I verified that reverting the patch for 278291 allows me to unmount
the file system properly on an otherwise 2.6.9-80 kernel.

Comment 4 RHEL Program Management 2009-02-17 18:54:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Peter Staubach 2009-02-17 19:52:47 UTC
Created attachment 332285 [details]
Proposed patch

Comment 6 Peter Staubach 2009-02-17 19:56:04 UTC
Josef Bacik code reviewed the patch for bz278291 and discovered that a
reference counting problem had been added.  The pathname structs were
getting get()'d without being put()'d in the routine, exp_export().
This was leading to active inodes on the referenced file system(s).

I reviewed the entire routine and corrected two reference counting
problems.  One was the pathname reference counting problem and the
other was a problem with the references to the auth_domain.

With this patch, I am unable to so easily reproduce the busy file
systems that I could see before.

Comment 7 Robert Peterson 2009-02-17 22:08:18 UTC
I tried the attached patch and I hope I did everything correctly.
The patch did not fix my problem.  Not only was I unable to unmount
my underlying gfs file system, my test also took a lot longer to
run for some reason.  The test normally runs in 15 minutes, but took
25 minutes with the patch.  The performance thing may be unrelated,
but it was an oddity.

Comment 8 Robert Peterson 2009-02-17 22:24:53 UTC
AFAICT, I applied and built the patch correctly.

Comment 9 Robert Peterson 2009-02-19 16:09:30 UTC
Created attachment 332577 [details]
Bob's patch

Looking at the patch, it seems to me that the problem might be related
to the switching of tmp with new and the cache_put that happens later
in svc_expkey_lookup().  I tried making the expkey_put conditionally
put tmp or new depending on whether the old switcheroo happened.
It didn't work, but I may have screwed it up.  It caused problems.

Next, I tried this patch, but I still can't unmount after using it.

Comment 10 Josef Bacik 2009-02-20 21:08:57 UTC
finally got down to the basics to try and figure out what was going wrong.  Put some debug printk's in, one for every new export we added in svc_export_lookup and one for every new expkey we added in svc_expkey_lookup.  Also printed out which export we were grabbing a ref for in svc_expkey_update(?), and then when we free the expkey and export.  This is what I got

doing a new export, 00000101be71e600
new expkey 00000101be71e580
getting ref for export 00000101be71e600
new expkey 00000101be71e500
getting ref for export 00000101be71e600
doing a new export, 00000101be71e400
new expkey 00000101be71e380
getting ref for export 00000101be71e400
new expkey 00000101be71e300
getting ref for export 00000101be71e400
doing a new export, 00000101be71e200
new expkey 00000101be71e180
getting ref for export 00000101be71e200
new expkey 00000101be71e100
getting ref for export 00000101be71e200
doing a new export, 00000101be71e280
new expkey 000001021812f400
getting ref for export 00000101be71e280
new expkey 0000010184a81080
getting ref for export 00000101be71e280
nfsd: last server has exited
nfsd: unexporting all filesystems
freeing expkey 00000101be71e500
putting ref for export 00000101be71e600
freeing expkey 00000101be71e100
putting ref for export 00000101be71e200
freeing expkey 0000010184a81080
putting ref for export 00000101be71e280
freeing expkey 00000101be71e380
putting ref for export 00000101be71e400
freeing expkey 00000101be71e180
putting ref for export 00000101be71e200
freeing expkey 00000101be71e580
putting ref for export 00000101be71e600
freeing expkey 00000101be71e300
putting ref for export 00000101be71e400
freeing export 00000101be71e400
freeing export 00000101be71e600
freeing export 00000101be71e200

Exports
00000101be71e600
00000101be71e400
00000101be71e200
00000101be71e280

Expkeys -> export
00000101be71e580 -> 00000101be71e600  freed/cleared
00000101be71e500 -> 00000101be71e600  freed/cleared
00000101be71e380 -> 00000101be71e400  freed/cleared
00000101be71e300 -> 00000101be71e400  freed/cleared
00000101be71e180 -> 00000101be71e200  freed/cleared
00000101be71e100 -> 00000101be71e200  freed/cleared
000001021812f400 -> 00000101be71e280
0000010184a81080 -> 00000101be71e280  freed/cleared

Sooo this is an expkey accounting problem, which is leading to one of the exports not being freed, which is leading to a dangling ref on the vfsmnt, which keeps us from unmounting.  So now to find the dangling expkey ref.

Comment 11 Josef Bacik 2009-02-20 22:37:37 UTC
doing a new export, 0000010215e98580
cache 0000010215e98580 expire_time
new expkey 0000010215e98500
getting ref for export 0000010215e98580
new expkey 0000010215e98480
getting ref for export 0000010215e98580
doing a new export, 0000010215e98380
new expkey 0000010215e98300
getting ref for export 0000010215e98380
new expkey 0000010215e98280
getting ref for export 0000010215e98380
doing a new export, 0000010215e98180
new expkey 0000010215e98100
getting ref for export 0000010215e98180
new expkey 0000010215e98080
getting ref for export 0000010215e98180
cache 0000010215e98200 expire_time
cache 0000010215e98400 expire_time
cache 0000010215e98600 expire_time
cache 0000010215e98080 expire_time
cache 0000010215e98280 expire_time
cache 0000010215e98500 expire_time
cache 0000010215e98300 expire_time
cache 0000010215e98100 expire_time
cache 0000010215e98480 expire_time
cache 0000010215e98180 expire_time
cache 0000010215e98380 expire_time
cache 0000010215e98580 expire_time
cache 0000010215e98200 expire_time
cache 0000010215e98400 expire_time
cache 0000010215e98600 expire_time
cache 0000010215e98080 expire_time
cache 0000010215e98280 expire_time
cache 0000010215e98500 expire_time
cache 0000010215e98300 expire_time
cache 0000010215e98100 expire_time
cache 0000010215e98480 expire_time
cache 0000010215e98180 expire_time
cache 0000010215e98380 expire_time
cache 0000010215e98580 expire_time
cache 0000010217f13e00 expire_time
cache 0000010215e98400 expire_time
cache 0000010215e98600 expire_time
doing a new export, 0000010215e98200
new expkey 0000010217276680
getting ref for export 0000010215e98200
cache 00000102168b4f00 expire_time
cache 0000010215e98400 expire_time
cache 0000010215e98600 expire_time	
new expkey 2 00000101b07b2880
getting ref for export 0000010215e98200
cache 0000010215e98080 expire_time
cache 0000010215e98280 expire_time
cache 00000101b07b2880 expire_time
cache 0000010215e98500 expire_time
cache 0000010215e98300 expire_time
cache 0000010215e98100 expire_time
cache 0000010215e98480 expire_time
nfsd: last server has exited
nfsd: unexporting all filesystems
freeing expkey 0000010215e98080
putting ref for export 0000010215e98180
freeing expkey 0000010215e98280
putting ref for export 0000010215e98380
freeing expkey 00000101b07b2880
putting ref for export 0000010215e98200
freeing expkey 0000010215e98500
putting ref for export 0000010215e98580
freeing expkey 0000010215e98300
putting ref for export 0000010215e98380
freeing expkey 0000010215e98100
putting ref for export 0000010215e98180
freeing expkey 0000010215e98480
putting ref for export 0000010215e98580
freeing export 0000010215e98180
freeing export 0000010215e98380
cache 0000010215e98200 not being flushed b/c of refcnt, which is 2
freeing export 0000010215e98580
rpciod: active tasks at shutdown?!
RPC: failed to contact portmap (errno -5).
cache 0000010215e98200 not being flushed b/c of refcnt, which is 2


Going to have to look at this later, but here's what we know.  Since we cannot free 0000010215e98200, which is an svc_export, and nothing else, we _know_ that the expkey that is holding the ref on the svc_export is not on the list, because we didn't get any other "not being flushed" messages.  We know this expkey was not free'ed

new expkey 2 00000101b07b2880
getting ref for export 0000010215e98200

the new expkey 2 message means it was added to the list here

if (set) {
       if (test_bit(CACHE_VALID, &tmp->h.flags)) { /* need to swap in new */
               struct svc_expkey *t2;

               printk(KERN_ERR "new expkey 2 %p\n", new);
               new->h.next = tmp->h.next;
               *hp = &new->h;
               tmp->h.next = NULL;
               t2 = tmp; tmp = new; new = t2;
       }
       if (test_bit(CACHE_NEGATIVE,  &item->h.flags)) {
               WARN_ON(test_bit(CACHE_HAS_REF,
                                &tmp->h.flags));
               set_bit(CACHE_NEGATIVE, &tmp->h.flags);
       } else {
               WARN_ON(test_bit(CACHE_HAS_REF,
                                &tmp->h.flags));
               svc_expkey_update(tmp, item);
               clear_bit(CACHE_NEGATIVE, &tmp->h.flags);
       }
}

Now I cannot figure out why thats wrong, and I think it has to do with staring at this screen for too long and I'm terribly hungry, so I'm going to take a fresh look at it on monday.

Comment 12 Robert Peterson 2009-02-20 22:50:47 UTC
This is precisely the algorithm I was tweaking on Thursday.  I made
four different attempts to fix it and each of them caused kernel
panic problems in odd places.  :7)  Now I'm leaving it to the experts.

Comment 13 Josef Bacik 2009-02-20 23:23:47 UTC
Now that i've had a moment to think, I'm pretty sure I know the solution.  Since we've removed the old expkey from the cache list, it will never get culled by cache_flush(), so it never gets freed.  What we need to do is after

t2 = tmp; tmp = new; new = t2;

do svc_expkey_put(&new->h, &svc_expkey_cache);

and then at the end it will get freed.  I will test this now.

Comment 14 Josef Bacik 2009-02-21 14:33:04 UTC
ok I tested this morning and it seems that fixed the problem.  I'm going to test some more on monday, but that should fix it.

Comment 15 Josef Bacik 2009-02-23 14:32:29 UTC
Ok that fix isn't right, we should only have one ref on that thing atm, so when we do the put before we return it should be freed, so we need to figure out who is grabbing the extra ref on the expkey and not putting it.

Comment 16 Josef Bacik 2009-02-23 22:08:57 UTC
Ok I've figured out whats going on.  So we have two people do a svc_expkey_lookup at the same time via exp_find_key.  Process A goes through, can't find what we're looking for, creates a new svc_expkey and restarts the loop, Process B does the same thing.  Process A then still doesn't find the svc_expkey and adds the new one in there, and comes back which does a cache_check, which in turn gets another ref on the svc_expkey via cache_make_upcall.  Process B comes through its loop again, and finds the svc_expkey that Process A added.  So it rips the old key out and puts the new one in, and does a put on what Process A did, but unfortunately because of the cache_make_upcall, we are still holding a ref.  The old way this was taken care of was by doing a cache_fresh() on new, which would kill the pending cache update and the ref on the cache, so the final svc_expkey_put would destroy the old expkey, but we don't do that anymore, we just update the expirey time, which doesn't make a difference because we've removed the thing from the hashtable.  So what we need to do is do this

if (test_and_clear_bit(CACHE_PENDING, &new->h.flags))
        queue_loose(&svc_expkey_cache, head);

after the

new->h.next = tmp->h.next;
*hp = &new->h;
tmp->h.next = NULL;
t2 = tmp; tmp = new; new = t2;

part of svc_expkey_lookup, which will clear everything out and clear that pending ref.  I will test with this tomorrow and make sure it works.

Comment 17 Josef Bacik 2009-02-24 15:58:32 UTC
Created attachment 333056 [details]
proposed patch

Ok heres the final patch that fixes the problem.  It is a combination of ps's fix and the stuff needed to kill the extra ref on tmp.  Basically we have to set CACHE_VALID on new so when we call cache_fresh() it doesn't do any of the cache_revisit_request() crap, it just does the queue_loose(), which will free it up.  I'm not an NFS person so theres probably a better way to do this, but this is the best I've come up with.  I tested it and it worked, I'm going to build a scratch build and run that kernel to make doubly sure.

Comment 18 Robert Peterson 2009-02-25 14:34:28 UTC
Regardless of what anyone says about the merits or correctness of
the patch, Josef's version from comment #17 does, in fact, solve
the problem for me.

Comment 19 Vivek Goyal 2009-02-26 16:54:49 UTC
Reverted following patch in 82.EL

"sunrpc: fix kernel crash in sunrpc cache_clean"

This was the patch which caused this regression.

This is an stop gap measure and a good fix (which fixes the reverted patch) should go in later.

Comment 21 Robert Peterson 2009-04-09 21:11:20 UTC
I have a concern over this patch.  I may have found another code path
that makes it impossible to unmount.  This was an NFS error path.
I did this set of commands (by accident):

NFS server:
-----------
mkfs.gfs -O -t "bobs_roth:roth1" -p lock_dlm -j3 /dev/roth_vg/roth1
mount -tgfs /dev/roth_vg/roth1 /mnt/gfs
service nfs start

NFS client:
-----------
mount roth-01:/mnt/gfs /mnt/gfs
cd /mnt/gfs
(a bunch of NFS IO--the "genesis" test)
cd -

NFS Server:
-----------
service nfs stop
(Then I realized I forgot to umount on the client BEFORE I stopped the
NFS server--so I went to do that)

NFS Client:
-----------
umount /mnt/gfs
umount.nfs: roth-01:/mnt/gfs: not found / mounted or server not reachable
umount.nfs: roth-01:/mnt/gfs: not found / mounted or server not reachable
(makes sense--nfs can't contact the NFS server because the server was
stopped.  So I restarted it so that I could cleanly unmount.)

NFS Server:
-----------
service nfs start

NFS Client:
-----------
umount /mnt/gfs
(this time it was successful)

NFS Server:
-----------
service nfs stop
umount /mnt/gfs
/sbin/umount.gfs: /mnt/gfs: device is busy.
/sbin/umount.gfs: /mnt/gfs: device is busy.

(Now there's no way to unmount the source file system).

Comment 26 errata-xmlrpc 2009-05-18 19:24:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html


Note You need to log in before you can comment on or make changes to this bug.