Bug 698806 - [abrt] kernel: BUG: Dentry ffff880137e720c0{i=2d319,n=/} still in use (1) [unmount of autofs autofs]: TAINTED Die
Summary: [abrt] kernel: BUG: Dentry ffff880137e720c0{i=2d319,n=/} still in use (1) [un...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 15
Hardware: x86_64
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Ian Kent
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: abrt_hash:d1cd5d7af45fd157b4679bb0e7d...
Depends On:
Blocks:
Reported: 2011-04-21 20:24 UTC by Marco Hartgring
Modified: 2011-09-26 15:33 UTC
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-09-26 15:33:13 UTC
Type: ---
Embargoed:


Attachments
File: backtrace (2.99 KB, text/plain)
2011-04-21 20:24 UTC, Marco Hartgring

nfsstat (1.94 KB, text/plain)
2011-04-22 13:42 UTC, Marco Hartgring

abrt_log_20100422153545 (3.71 KB, application/octet-stream)
2011-04-22 13:56 UTC, Marco Hartgring

abrt_log_20100422154421 (3.84 KB, application/octet-stream)
2011-04-22 13:57 UTC, Marco Hartgring

crash_reboot_log (4.30 KB, application/octet-stream)
2011-04-22 14:11 UTC, Marco Hartgring

vfs - check non-mountpoint dentry might block in __follow_mount_rcu() (1.94 KB, patch)
2011-04-22 14:27 UTC, Ian Kent

autofs4 - reinstate last used update on access (5.46 KB, patch)
2011-04-22 14:28 UTC, Ian Kent

autofs4 - fix dentry leak in autofs4_expire_direct() (1.17 KB, patch)
2011-04-22 14:29 UTC, Ian Kent

autofs4 - fix autofs4_expire_indirect() traversal (2.01 KB, patch)
2011-04-22 14:30 UTC, Ian Kent

autofs4 - fix d_manage() return on rcu-walk (685 bytes, patch)
2011-04-22 14:32 UTC, Ian Kent

autofs4 - remove autofs4_lock (7.19 KB, patch)
2011-04-22 14:33 UTC, Ian Kent

autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd() (1.18 KB, patch)
2011-04-22 14:35 UTC, Ian Kent

Description Marco Hartgring 2011-04-21 20:24:30 UTC
abrt version: 2.0.1
cmdline: ro root=/dev/mapper/vg_starbuck-lv_root rd_LUKS_UUID=luks-9d65fb86-1ea5-4399-9e03-590df5d86a5a rd_LVM_LV=vg_starbuck/lv_root rd_LVM_LV=vg_starbuck/lv_swap rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us nouveau.modeset=0 rdblacklist=nouveau quiet
component: kernel
kernel_tainted: 129
kernel: 2.6.38.2-9.fc15.x86_64
reason: BUG: Dentry ffff880137e720c0{i=2d319,n=/} still in use (1) [unmount of autofs autofs]
architecture: x86_64
package: kernel
os_release: Fedora release 15 (Lovelock)
time: 1303416690

Text file: backtrace, 3061 bytes

comment
-----
This happens randomly when an nfs4 filesystem is automounted on /net.
From what I've read on various mailing lists, this is an upstream bug.

event_log
-----
2011-04-21-22:12:44> The report was appended to /tmp/abrt.log
2011-04-21-22:24:24> Submitting oops report to http://submit.kerneloops.org/submitoops.php
2011-04-21-22:24:25  Kernel oops report was uploaded

reported_to
-----
file: /tmp/abrt.log
kerneloops: URL=http://submit.kerneloops.org/submitoops.php

Comment 1 Marco Hartgring 2011-04-21 20:24:32 UTC
Created attachment 493982
File: backtrace

Comment 2 Jeff Layton 2011-04-22 11:22:46 UTC
Yep, I've seen this personally too...

My (hand-wavy) suspicion is that it's related to the RCU pathwalk patches that went into 2.6.38, but it may be something else entirely. What we could really use is a reliable reproducer for this.

One thing that might help: when this occurs, collect the output of nfsstat -c. That might allow us to rule out some codepaths.

Comment 3 Ian Kent 2011-04-22 13:29:41 UTC
What do the autofs maps you are using look like?

Comment 4 Ian Kent 2011-04-22 13:31:23 UTC
(In reply to comment #3)
> What do the autofs maps you are using look like?

Hang on, you say you're using the hosts map.
Do the hosts you are using have many exports?

Comment 5 Ian Kent 2011-04-22 13:34:42 UTC
(In reply to comment #2)
> Yep, I've seen this personally too...

I think there is more than one problem causing these.

Comment 6 Marco Hartgring 2011-04-22 13:41:04 UTC
I have two servers that I connect to with autofs. One F14 x86_64 and one F14 i686 install (both always fully updated).
It seems that this occurs more frequently on the i686 install.
As requested, I've now attached the output of nfsstat -c, taken less than a minute after this error occurred.

Comment 7 Marco Hartgring 2011-04-22 13:42:49 UTC
Created attachment 494221
nfsstat

Comment 8 Ian Kent 2011-04-22 13:45:15 UTC
(In reply to comment #5)
> (In reply to comment #2)
> > Yep, I've seen this personally too...
> 
> I think there is more than one problem causing these.

First thing to do is to update with the autofs patches that
went into 2.6.39-rc. I'll work on getting a kernel built
with those.

The downside is that, if this really isn't an autofs problem,
these patches will probably hide the real bug. OTOH, I have
a kernel.org bug, slightly different from this one, that
these patches didn't fix, but they do fix the problem that
my autofs submount-test exposes, which has a backtrace just
like this.

Comment 9 Marco Hartgring 2011-04-22 13:55:55 UTC
I just checked: this nfsstat output relates to an auto-unmount from the F14 i686 install, which is an nfs3 (!) export; only one filesystem is exported from that server.
The other exports, from the F14 x86_64 install which exports multiple filesystems, have almost no problems.

I've attached two other abrt crash logs, both related to the last crash.
abrt_log_20100422153545 is the log of the exact moment of the last crash.
abrt_log_20100422154421 is the log of me doing 'umount -l /net' after a SIGKILL of the remaining automount process. This seems to be the only way to resolve it.

The remaining problem is that I am now no longer able to restart my F15 install; it just stalls. I'll do that now and make a note of what happens.

Comment 10 Marco Hartgring 2011-04-22 13:56:47 UTC
Created attachment 494222
abrt_log_20100422153545

Comment 11 Marco Hartgring 2011-04-22 13:57:10 UTC
Created attachment 494223
abrt_log_20100422154421

Comment 12 Jeff Layton 2011-04-22 13:59:50 UTC
You may be right that this is due to a number of different problems. I didn't notice before that the original problem in this bug was due to unmounting an autofs mount. I've seen similar oopses when unmounting nfs4 mounts too:

    http://www.spinics.net/lists/linux-nfs/msg20232.html

...perhaps these problems are related? I'm still at a bit of a loss as to how best to attack this though.

Comment 13 Marco Hartgring 2011-04-22 14:11:21 UTC
Jeff, that seems very similar to what I'm experiencing.

And sorry for not being clear from the start; I knew it had something to do with a timed unmount. I just wanted to create a placeholder and fill in the details as time went by.

I've attached my /var/log/messages from the reboot. As you can see, nothing special was logged while I was rebooting... I had to power off my system.

Please ignore the SELinux errors. I'm still working on those, and until they are sorted (I might file bugs) I'm running permissive.

Comment 14 Marco Hartgring 2011-04-22 14:11:42 UTC
Created attachment 494228
crash_reboot_log

Comment 15 Ian Kent 2011-04-22 14:18:02 UTC
(In reply to comment #12)
> You may be right that this is due to a number of different problems. I didn't
> notice before that the original problem in this bug was due to unmounting an
> autofs mount. I've seen similar oopses when unmounting nfs4 mounts too:
> 
>     http://www.spinics.net/lists/linux-nfs/msg20232.html
> 
> ...perhaps these problems are related? I'm still at a bit of a loss as to how
> best to attack this though.

There is definitely a possible dentry ref count leak in the
autofs expire code in 2.6.38. The real problem is that it
should be hard to trigger, but from the reports we get it
seems people are able to trigger it easily. Possibly the
reason it is easy to trigger in some cases is that the
rcu-walk series changed one of the traversals in the expire
check code from a directory entry list traversal to a
depth-first tree traversal. Though I couldn't work out why
that would make it happen more easily either.

In any case the autofs patches that are going into 2.6.39
fixed the problems I was able to force in testing.

It will take a little while for me to dig them out and build
a test kernel.

Comment 16 Ian Kent 2011-04-22 14:26:39 UTC
Since I'm not going to get onto this until tomorrow I
can post the patches so you can see what the changes
are, at least.

Comment 17 Ian Kent 2011-04-22 14:27:54 UTC
Created attachment 494237
vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

Comment 18 Ian Kent 2011-04-22 14:28:50 UTC
Created attachment 494238
autofs4 - reinstate last used update on access

Comment 19 Ian Kent 2011-04-22 14:29:52 UTC
Created attachment 494239
autofs4 - fix dentry leak in autofs4_expire_direct()

Comment 20 Ian Kent 2011-04-22 14:30:58 UTC
Created attachment 494240
autofs4 - fix autofs4_expire_indirect() traversal

Comment 21 Ian Kent 2011-04-22 14:32:05 UTC
Created attachment 494241
autofs4 - fix d_manage() return on rcu-walk

Comment 22 Ian Kent 2011-04-22 14:33:03 UTC
Created attachment 494242
autofs4 - remove autofs4_lock

Comment 23 Ian Kent 2011-04-22 14:35:36 UTC
Created attachment 494243
autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()

This patch isn't a result of testing on 2.6.38, it is a
contributed patch I had hanging around. So, for the sake
of completeness wrt. what went into 2.6.39 I have included
it here as well.

Comment 24 Ian Kent 2011-04-22 14:38:58 UTC
Clearly not all these patches are likely related to the
problem here. But this wasn't the only problem that I
found that was related to the unexpected concurrent
merge of the rcu-walk and the vfs-automount series in
2.6.38. Ouch!

Comment 25 Ian Kent 2011-04-28 02:10:50 UTC
Sorry to take so long to build the test kernel.

A kernel with the above patches can be found at:
http://people.redhat.com/~ikent/kernel-2.6.38.3-18.bz698806.1.fc15

Please try this out and let me know how it goes.

Comment 26 Marco Hartgring 2011-04-28 07:48:42 UTC
No worries, I'm installing as I am writing this.

Comment 27 Marco Hartgring 2011-04-28 08:09:50 UTC
Hmm, I forgot that to really test I also need the devel packages for my nvidia driver.

Comment 28 Marco Hartgring 2011-04-28 08:21:12 UTC
I'll use nouveau for now. In general I don't like to change something while testing a problem.

Comment 29 Marco Hartgring 2011-04-28 09:30:28 UTC
Good news, everyone! The patched kernel seems to work for me. I've been testing, and timed auto-unmounts no longer crash the kernel.

Comment 30 Jeff Layton 2011-04-28 12:36:52 UTC
Reassigning to Ian since he's doing all the work here anyway :)

Comment 31 Ian Kent 2011-06-20 03:32:59 UTC
I've added Kyle McMartin to the cc list here.

Kyle, if we aren't going to see 2.6.39 for F15 sometime soon
we really should apply this patch series.

Can you help please?

Comment 32 Mark T. Kennedy 2011-07-06 20:15:06 UTC
any update on this?

Comment 33 Ian Kent 2011-07-08 11:00:06 UTC
(In reply to comment #32)
> any update on this?

I was hoping Kyle would get around to adding these patches,
but we need to wait a little anyway because it looks like
there will be another patch going upstream shortly. See bug
#719607 for more information.

Comment 34 Mark T. Kennedy 2011-08-23 15:39:34 UTC
so this problem is fixed by the #719607 patch and that will be part of some v3.X kernel update?

Comment 35 Chuck Ebbert 2011-08-25 05:11:54 UTC
(In reply to comment #34)
> so this problem is fixed by the #719607 patch and that will be part of some
> v3.X kernel update?

The fix for bug 719607 will be in 2.6.40.3-2

Comment 36 Josh Boyer 2011-09-26 15:33:13 UTC
This should be fixed per comment #35.

