Loop checks added to VFS to prevent kernel crashes
The NFS client was previously failing to detect a directory loop for some NFS server directory structures. This failure could cause NFS inodes to remain referenced after attempting to unmount the file system, leading to a kernel crash. This update adds loop checks to VFS, which effectively prevents this problem from occurring.
Description of problem:
Customer reports multiple machines running 2.6.32-504.3.3.el6 crashing repeatedly with a backtrace similar to the below. There are a few variations of the crash, but all of them are the kswapd doing a shrink_icache_memory and ending up crashing due to:
1. an inode pointer which is allocated and looks valid
2. an inode.i_sb pointer which points at an address in allocated size-NNNN slab (we've seen size-1024 and size-64), but the memory being pointed to is not a valid super_block structure.
3. some field of the super_block structure which is supposed to be a pointer is used / dereferenced (such as s_op->clear_inode), causing the kernel crash
Prior to the crash, we see a lot of these:
VFS: Busy inodes after unmount of 0:8c. Self-destruct in 5 seconds. Have a nice day...
Then the crash comes, with an oops similar to this:
BUG: unable to handle kernel paging request at 0000000000a31658
IP: [<ffffffff811ab95e>] clear_inode+0x9e/0x140
PGD 435baa067 PUD 4373d1067 PMD 435bb3067 PTE 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
Modules linked in: tcp_diag inet_diag nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 vsock(U) ipv6 ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables microcode ppdev vmware_balloon parport_pc parport sg vmci(U) i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom vmw_pvscsi vmxnet(U) pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 40, comm: kswapd0 Not tainted 2.6.32-504.3.3.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffff811ab95e>] [<ffffffff811ab95e>] clear_inode+0x9e/0x140
RSP: 0018:ffff880434f8dbf0 EFLAGS: 00010246
RAX: 0000000000a315f0 RBX: ffff8802beb79408 RCX: 0000000000000034
RDX: 0000000000000000 RSI: ffff88028db29718 RDI: ffff8802beb79408
RBP: ffff880434f8dc00 R08: ffff8802beb79418 R09: fd544e08748abe07
R10: ffff88014a913458 R11: ffff88014a913040 R12: ffff8802beb79540
R13: 0000000000000044 R14: ffffffff81fd5820 R15: ffff8802beb79418
FS: 0000000000000000(0000) GS:ffff880036600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000a31658 CR3: 00000004378eb000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 40, threadinfo ffff880434f8c000, task ffff880434f3aae0)
ffff8802beb79408 ffff880434f8dc60 ffff880434f8dc40 ffffffff811aba40
<d> ffffc90001812000 0000000000000080 ffff880112f1fd38 0000000000000080
<d> ffff880434f8dc60 0000000000000080 ffff880434f8dca0 ffffffff811abd94
[<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81140ab0>] ? kswapd+0x0/0x3b0
[<ffffffff8109e5d0>] ? kthread+0x0/0xc0
[<ffffffff8100c200>] ? child_rip+0x0/0x20
Code: 00 00 b9 02 00 00 00 48 c7 c2 80 ac 1a 81 be 08 00 00 00 e8 25 f3 37 00 48 89 df e8 8d 45 04 00 48 8b 83 08 01 00 00 48 8b 40 38 <48> 8b 40 68 48 85 c0 74 05 48 89 df ff d0 0f b7 83 ae 00 00 00
RIP [<ffffffff811ab95e>] clear_inode+0x9e/0x140
Version-Release number of selected component (if applicable):
One customer has seen it 4 times on the same machine (twice in May, twice in June), and one time on a second machine (Aug 13).
Steps to Reproduce:
All machines which crash are using automount, and it seems the automounter is auto-unmounting NFS shares which have active inodes on them.
We see a lot of these messages:
VFS: Busy inodes after unmount of 0:8c. Self-destruct in 5 seconds. Have a nice day...
Then sometime after the messages (may be hours or days), the kernel crashes
No messages and no kernel crash.
We have vmcores and have given the customer one round of systemtap, which I can attach here for reference, since you will see some output in one of the vmcores if you look at all of them.
I think this looks like some reference counting issue on the super_block structure, i.e. an overput or underget of super_block.
Customer claims they were running fine (on RHEL6.5 kernels) until upgrading to the RHEL6.6 kernel series. As of yet I do not have definitive proof of how long they were running fine on RHEL6.5 kernel but I've no reason to disbelieve their claim, and I could find no similar bugs reported in the RHEL6 kernel series. Their so for now I think we should assume this is a regression.
Customer is running automount but the /etc/auto.master looks straightforward with just an /etc/auto.direct mounting NFSv3 shares.
I'll attach a sosreport and point at all of the vmcores we have so far.
Looking at the commits between RHEL6.5 and RHEL6.6
I wonder if either of these could have possibly introduced a regression:
[fs] vfs: fix ref count leak in path_mountpoint()
[fs] Fix mountpoint reference leakage in linkat
Created attachment 1063571 [details]
bash shell script with systemtap which was used for partial debugging
The attached bash + systemtap script produces output like the following as seen in some of the vmcores kernel logs:
VFS: Busy inodes after unmount of 0:b4. Self-destruct in 5 seconds. Have a nice day...
Mon Jul 6 23:15:18 2015 PDT: automount(2406) => umount(4784) => umount.nfs(4785) (/sbin/umount.nfs /mnt/nfs): invalidate_inodes returning 1 for 'nfs' super_block 0xffff880434d7f000 (0:b3)
(In reply to Dave Wysochanski from comment #4)
> Looking at the commits between RHEL6.5 and RHEL6.6
> I wonder if either of these could have possibly introduced a regression:
> [fs] vfs: fix ref count leak in path_mountpoint()
Maybe not this one specifically.
It applies only when the last component of the umount path
is a symlink and that doesn't appear to be the case here.
That doesn't mean a corner case for the associated series
has been missed due to differences between upstream and
It's worth considering but there was quite a bit of testing
surrounding the series.
> [fs] Fix mountpoint reference leakage in linkat
This looks like a straight forward bug fix.
The get and put on the path, old_path, are entirely local
so it looks like the path_put() was in fact missing on
Created attachment 1064086 [details]
Analysis of code paths which touch super_block.s_active and why we want to do this
Created attachment 1064087 [details]
WIP systemtap script to trace super_block.s_active - bulk still needs filled in and this may not work - may need a kernel patch
Looking through the vfs related changes from rhel-6.5 to
6.6 there aren't many at all.
The most significant change was the addition of the
path_mountpoint() series which is meant to avoid
revalidation of the last path component during path
This was mostly needed to prevent hangs on NFS umount
in cases where the server is unresponsive (I believe).
But it was also required for a similar hang with autofs
direct mounts at startup.
So we need to confirm that the problem hasn't been
seen on kernel revisions prior to revision 451 (ie.
on rhel-6.5 kernels, rev 431).
If we can confirm that then we can focus instrumentation
in the functions that patch series adds or changes and
there are only a few.
As far as s_active goes I think that's a symptom and
not a cause because the code where the VFS message is
issued is invoked when the mount reference count goes
to 0 to cleanup the in memory mount tree and discard
the super block.
If the mount reference count isn't right then all sorts
of bad things happen on the final mntput().
So it's probably better to assume this is a mount
reference count problem.
In rhel-6 the struct vfsmount still holds the mount
reference count so that's where we should look for
reference count accounting problems and specifically
on entry and exit of the functions I mentioned above.
The difficult part will be working out what the counts
should be (there are so many special cases) so we know
a problem when we see it.
I'll have a look at the functions and see if I can spot
anything and try and work out what we should expect.
But without a reproducer we have an uphill battle since
the customer will get sick of testing very quickly.
(In reply to Ian Kent from comment #23)
> Looking through the vfs related changes from rhel-6.5 to
> 6.6 there aren't many at all.
> I'll have a look at the functions and see if I can spot
> anything and try and work out what we should expect.
After looking through the path_mountpoint() change I can't
see a problem with the mount ref count.
It occurs to me that autofs wouldn't be selecting these
mounts to expire unless the mount ref count indicated they
weren't in use so the problem likely occurred before the
path_mountpoint() code was called.
The autofs expire code hasn't changed for a long long time
so I doubt the problem lies there.
There are a couple of calls that lead to path_mountpoint()
due to autofs ioctls but an initial check indicates they
are ok too.
I'll work through this again but I'm not hopeful.
Then it will be back to the changelog.
(In reply to Ian Kent from comment #24)
> After looking through the path_mountpoint() change I can't
> see a problem with the mount ref count.
> It occurs to me that autofs wouldn't be selecting these
> mounts to expire unless the mount ref count indicated they
> weren't in use so the problem likely occurred before the
> path_mountpoint() code was called.
> The autofs expire code hasn't changed for a long long time
> so I doubt the problem lies there.
Actually that's not right.
There was a change that went into rhel-6.6 for expiry of
If there's a symlink present in an autofs file system only
the last access time can be checked and when it expires it
amounts to an unlink() done by the daemon, not a umount.
I don't see how that could be a problem here but a debug
log and some knowledge of what's in the filesystem should
tell us if confusion has crept in somewhere.
Created attachment 1064376 [details]
verbose automount debugging
Here is verbose automount debugging, beginning with the first mount of a filesystem that will result in the 'Busy inodes', until all the filesystems that produce the 'Busy inodes' are unmounted.
This is just one time through, however each cycle of mount/unmount for these 6 particular /users/datadomain mounts looks about the same.
(In reply to Frank Sorenson from comment #28)
> Created attachment 1064376 [details]
> verbose automount debugging
> Here is verbose automount debugging, beginning with the first mount of a
> filesystem that will result in the 'Busy inodes', until all the filesystems
> that produce the 'Busy inodes' are unmounted.
> This is just one time through, however each cycle of mount/unmount for these
> 6 particular /users/datadomain mounts looks about the same.
When you post debug logs you should post the entire log not
just fragments and give any time stamps of likely problem
activity, if possible.
It may be harder for me to work those logs but often something
has happened long before the problem is actually seen.
Created attachment 1068648 [details]
case 01441432 - find plus tcpdump
I can reproduce on RHEL6 2.6.32-504.el6.x86_64:
[root@rhel6 ~]# cd /exports
[root@rhel6 exports]# mkdir exports_loop
[root@rhel6 exports]# mount -o bind ../exports exports_loop
[root@rhel6 exports]# cd
[root@rhel6 ~]# /etc/init.d/nfs start
Starting RPC svcgssd: [ OK ]
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS mountd: [ OK ]
Starting NFS daemon: [ OK ]
[root@rhel6 ~]# exportfs -ua
[root@rhel6 ~]# exportfs -o crossmnt *:/exports
[root@rhel6 ~]# mount rhel6:/exports /manmnt/rhel6
[root@rhel6 ~]# stat /manmnt/rhel6/exports_loop/.
Size: 4096 Blocks: 8 IO Block: 262144 directory
Device: 18h/24d Inode: 275493 Links: 6
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2015-08-31 10:08:20.883717287 -0400
Modify: 2015-08-31 10:08:14.451617433 -0400
Change: 2015-08-31 10:08:14.451617433 -0400
[root@rhel6 ~]# umount /manmnt/rhel6
[root@rhel6 ~]# tail -1 /var/log/messages
Aug 31 10:11:13 rhel6 kernel: VFS: Busy inodes after unmount of 0:18. Self-destruct in 5 seconds. Have a nice day...
I think it's safe to rule out autofs at this point.
Upstream fedora server seems not to create this problem.
Nice work Ben! I tried your simple reproducer on a few NFS client versions. Here are the results. NOTE that "FAIL" means we were able to mount the share, and after umount I saw "VFS: Busy inodes after unmount of 0:17. Self-destruct in 5 seconds." in /var/log/messages
RHEL5u11 (2.6.18-402.el5): PASS - mount fails with "mount.nfs: Stale NFS file handle"
RHEL6u0 (2.6.32-71.40.1.el6): FAIL
RHEL6u1 (2.6.32-131.4.1.el6): FAIL
* NOTE: Interestingly the 'find' returned a lot of these
Aug 31 09:27:18 rhel6u1-node1 kernel: fsid 0:16: expected fileid 0x54fe3, got 0x409d6
Aug 31 09:27:18 rhel6u1-node1 kernel: NFS: server 192.168.122.49 error: fileid changed
RHEL6u2 (2.6.32-220.2.1.el6): FAIL
RHEL6u3 (2.6.32-279.11.1.el6): FAIL
RHEL6u4 (2.6.32-358.el6): FAIL
RHEL6u5 (2.6.32-431.32.1.el6): FAIL
RHEL7u0 (3.10.0-123.el7): PASS - mount fails with "Too many levels of symbolic links"
Assuming Ben's reproducer is the same as the customer's problem, I think we can remove Regression since it doesn't look like RHEL6 has always had this bug.
One small correction - all my tests used the default NFS version for the release. So RHEL6-RHEL7 it was an NFSv4 mount, and RHEL5 was NFSv3. I tried RHEL5 NFSv4 mount and I could mount it, but I don't get the "busy inodes after umount" message.
Generate reproduce case as Comment 65. Thanks Ben!
I'm seeing some strange differences in behavior, depending on what exactly I'm exporting and mounting:
nfs server side:
# mkdir -p /loop_test/dir_loop /exports/loop_test/dir_loop
# mount -o bind /loop_test /loop_test/dir_loop
# mount -o bind /exports/loop_test /exports/loop_test/dir_loop
# echo -e "/loop_test\t*(rw,crossmnt)" >> /etc/exports
# echo -e "/exports/loop_test\t*(rw,crossmnt)" >> /etc/exports
# exportfs -arv
# find /loop_test /exports/loop_test
find: File system loop detected; `/loop_test/dir_loop' is part of the same file system loop as `/loop_test'.
find: File system loop detected; `/exports/loop_test/dir_loop' is part of the same file system loop as `/exports/loop_test'.
nfs client side:
# mkdir -p /mnt/loop_test
# mount vm17:/loop_test /mnt/loop_test/
# find /mnt/loop_test/
find: File system loop detected; `/mnt/loop_test/dir_loop' is part of the same file system loop as `/mnt/loop_test/'.
# umount /mnt/loop_test
# dmesg -c
[47853.323277] VFS: Busy inodes after unmount of 0:19. Self-destruct in 5 seconds. Have a nice day...
# mount vm17:/exports/loop_test /mnt/loop_test/
# find /mnt/loop_test
find: `/mnt/loop_test/dir_loop': Too many levels of symbolic links
# umount /mnt/loop_test/
# dmesg -c
* when mounting server:/export, we get the 'File system loop detected', followed by the 'Busy inodes after unmount'
* when mounting server:/directory/export, we get the 'Too many levels of symbolic links' message instead, and no 'Busy inodes after unmount'
*** Bug 1236695 has been marked as a duplicate of this bug. ***
Patch(es) available on kernel-2.6.32-582.el6
Move to VERIFIED according to Comment 100
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.