Bug 582321
Summary: VFS: Busy inodes after unmount issue.

Product: Red Hat Enterprise Linux 5
Reporter: Sachin Prabhu <sprabhu>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.4
CC: bugzilla, dhoward, dhowells, iannis, ikent, jlayton, jpirko, jwest, lmcilroy, rwheeler, steved, tao, vincent, yugzhang
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
    When an NFS server exported a file system with an explicit "fsid=[file_system_ID]", and an NFS client mounted that file system on one mount point and a subdirectory of it on a separate mount point, then if the server re-exported that file system after un-exporting and unmounting it, the NFS client could receive the following error message when unmounting those mount points:
    "VFS: Busy inodes after unmount..."
    Additionally, it was possible to crash the NFS client's kernel in this situation.
Story Points: ---
Clone Of:
: 587225 (view as bug list)
Environment:
Last Closed: 2011-01-13 21:27:02 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 596384
Attachments:
Description
Sachin Prabhu
2010-04-14 16:11:31 UTC
Event posted on 04-14-2010 05:35pm BRT by fbl

Hi,

More interesting data from this vmcore:
...
Looking at the parent

crash> dentry 0xffff810515f84b70
struct dentry {
  d_count = { counter = 1 },
  d_flags = 20,              --> DCACHE_DISCONNECTED | DCACHE_UNHASHED
  ..
  d_inode = 0xffff810514f57260,
  ...

crash> eval 0xffff810514f57260 - 0x1e0
hexadecimal: ffff810514f57080

struct nfs_inode {
  fileid = 0x2,
  fh = {
    size = 0x8,
    data = "\001\000\001\000,\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  },
  flags = 0xe,
  cache_validity = 0x3b,
  read_cache_jiffies = 0x107ee4cba,
  attrtimeo = 0x7530,
  attrtimeo_timestamp = 0x10851de52,
  change_attr = 0x36827f90,
  last_updated = 0x107ee4cbc,
  cache_change_attribute = 0x107ee4cb7,
  data_updates = { counter = 0x0 },
  access_cache = { rb_node = 0xffff8103c34ca900 },
  access_cache_entry_lru = { next = 0xffff8103c34ca918, prev = 0xffff8103c34ca918 },
  access_cache_inode_lru = { next = 0xffff81035dce1990, prev = 0xffff810274811b50 },
  acl_access = 0xfffffffffffffff5,
  acl_default = 0x0,
  cookieverf = {0x0, 0x0},
  req_lock = { raw_lock = { slock = 0x1 } },
  dirty = { next = 0xffff810514f571a0, prev = 0xffff810514f571a0 },
  commit = { next = 0xffff810514f571b0, prev = 0xffff810514f571b0 },
  nfs_page_tree = { height = 0x0, gfp_mask = 0x220, rnode = 0x0 },
  ndirty = 0x0,
  ncommit = 0x0,
  npages = 0x0,
  open_files = { next = 0xffff810514f571e0, prev = 0xffff810514f571e0 },
  silly_count = { counter = 0x1 },
  silly_list = { first = 0x0 },
  waitqueue = { lock = { raw_lock = { slock = 0x1 } }, task_list = { next = 0xffff810514f57208, prev = 0xffff810514f57208 } },
  nfs4_acl = 0x0,
  open_states = { next = 0xffff810514f57220, prev = 0xffff810514f57220 },
  delegation = 0x0,
  delegation_state = 0x0,
  rwsem = { activity = 0x0, wait_lock = { raw_lock = { slock = 0x1 } }, wait_list = { next = 0xffff810514f57248, prev = 0xffff810514f57248 } },
  fscache = 0x0,
  vfs_inode = {
    i_hash = { next = 0x0, pprev = 0xffff81000b49a460 },
    ...
  <all remaining data matches with previous inode dump>

So, the parent inode fsid is 0x2 and it matches with the following log entry right before the crash:

...
NFS: server fisias-lnx-vcs error: fileid changed fsid 0:19: expected fileid 0x2, got 0x1193a1
NFS: server fisias-lnx-vcs error: fileid changed fsid 0:19: expected fileid 0x2, got 0x1193a1
NFS: server fisias-lnx-vcs error: fileid changed fsid 0:19: expected fileid 0x2, got 0x1193a1
NFS: server fisias-lnx-vcs error: fileid changed
...

and it seems to come from:

/*
 * Many nfs protocol calls return the new file attributes after
 * an operation. Here we update the inode to reflect the state
 * of the server's inode.
 *
 * This is a bit tricky because we have to make sure all dirty pages
 * have been sent off to the server before calling invalidate_inode_pages.
 * To make sure no other process adds more write requests while we try
 * our best to flush them, we make them sleep during the attribute refresh.
 *
 * A very similar scenario holds for the dir cache.
 */
static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)

fbl

This event sent from IssueTracker by fbl
issue 759533
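As an aside, the "crash> eval 0xffff810514f57260 - 0x1e0" step above is just undoing, by hand, the embedding of the VFS inode inside the NFS inode; in the kernel, the NFS_I() helper does the same computation via container_of(). Below is a minimal user-space sketch of the idea with toy stand-in structures. The 0x1e0 subtraction is simply what offsetof() works out to on that particular kernel build, not a fixed constant, so the toy layout here prints a different number.

```c
#include <stdio.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures: struct nfs_inode embeds the
 * generic VFS inode as its "vfs_inode" member. */
struct inode { unsigned long i_ino; char pad[88]; };
struct nfs_inode {
    unsigned long long fileid;
    char fh[128];
    unsigned long flags;
    /* ...many other fields in the real structure... */
    struct inode vfs_inode;       /* embedded VFS inode */
};

/* container_of(): recover a pointer to the outer structure from a pointer
 * to one of its members, which is what the manual "eval" above did. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct nfs_inode *NFS_I(struct inode *inode)
{
    return container_of(inode, struct nfs_inode, vfs_inode);
}

int main(void)
{
    struct nfs_inode nfsi = { .fileid = 0x2 };
    struct inode *vfs = &nfsi.vfs_inode;

    printf("offsetof(struct nfs_inode, vfs_inode) = %#zx\n",
           offsetof(struct nfs_inode, vfs_inode));
    printf("fileid recovered via NFS_I(): %#llx\n", NFS_I(vfs)->fileid);
    return 0;
}
```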
Event posted on 04-14-2010 05:38pm BRT by fbl

Sorry, s/fsid/fileid/ in my previous post.

fbl

This event sent from IssueTracker by fbl
issue 759533

It does look like an NFS inode here... and the root cause of the problem does appear to be the "Busy inodes after umount" issue. Here's basically what happened, I think:

An NFS filesystem was unmounted, but at least one of the inodes in it couldn't be freed. That caused the "Busy inodes after umount" message to pop at umount time. The superblock was freed and the memory was recycled. The inodes and dentries associated with it were still sitting in the cache, however, and when the VM tried to prune the dcache due to memory pressure it ended up dereferencing the now recycled superblock.

The question, of course, is "Why weren't the inodes and dentries pruned at umount?"

The flags field in the root nfs_inode is 0xe, which is:

#define NFS_INO_ADVISE_RDPLUS  (1)  /* advise readdirplus */
#define NFS_INO_STALE          (2)  /* possible stale inode */
#define NFS_INO_ACL_LRU_SET    (3)  /* Inode is on the LRU list */

...which indicates that it was marked as stale at some point (as expected). The i_ino is also "2", which matches these messages:

NFS: server fisias-lnx-vcs error: fileid changed fsid 0:19: expected fileid 0x1193a1, got 0x2

...so I suspect that the problem is likely related to the fact that the fileid for the root inode of the mount changed.
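For reference, a small sketch of how the two flag words in the dump decode. The DCACHE_* masks below are assumed 2.6.18-era values from include/linux/dcache.h (DCACHE_DISCONNECTED = 0x0004, DCACHE_UNHASHED = 0x0010), and the NFS_INO_* values quoted above are bit numbers used with test_bit(), not masks:

```c
#include <stdio.h>

/* dentry->d_flags bit masks (assumed 2.6.18-era values) */
#define DCACHE_DISCONNECTED 0x0004
#define DCACHE_UNHASHED     0x0010

/* nfs_inode.flags bit numbers, as quoted above */
#define NFS_INO_ADVISE_RDPLUS 1
#define NFS_INO_STALE         2
#define NFS_INO_ACL_LRU_SET   3

int main(void)
{
    unsigned long d_flags   = 20;   /* from the dentry dump: 0x14 */
    unsigned long nfs_flags = 0xe;  /* from the nfs_inode dump */

    printf("d_flags 0x%lx: DISCONNECTED=%d UNHASHED=%d\n", d_flags,
           !!(d_flags & DCACHE_DISCONNECTED),
           !!(d_flags & DCACHE_UNHASHED));
    printf("nfs flags 0x%lx: ADVISE_RDPLUS=%d STALE=%d ACL_LRU_SET=%d\n",
           nfs_flags,
           !!(nfs_flags & (1UL << NFS_INO_ADVISE_RDPLUS)),
           !!(nfs_flags & (1UL << NFS_INO_STALE)),
           !!(nfs_flags & (1UL << NFS_INO_ACL_LRU_SET)));
    return 0;
}
```

All three NFS_INO bits being set is consistent with the reading above: a stale, LRU-listed root inode.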
(In reply to comment #0)
> The parent dentry which also appears to be the root dentry is as follows
>
> crash> dentry 0xffff810515f84b70
> struct dentry {
>   d_count = { counter = 1 },
>   d_flags = 20,              --> DCACHE_DISCONNECTED | DCACHE_UNHASHED
>   ..
>   d_inode = 0xffff810514f57260,
>   ..
>   d_parent = 0xffff810515f84b70,
>   ..
>   d_op = 0xffffffff88774ee0,
>   d_sb = 0xffff81051cc3e800,
>   ..
>   d_iname = "\000dmap\000\000\000\000\000\000\000/lib64/libc.so.6\000\000\000\000\000\000\000"
> }
>
> This has a valid inode which probably triggered the VFS: Busy inode message

Maybe my comment will be helpful, not sure, but here goes anyway.

Probably not, unless it has already been successfully umounted. The root dentry should remain valid until it is dput (dentry_iput() does the real business, I think) during umount.

The busy inode message usually results from dentries within the root still being active in some way, but the child above is either well and truly "gone" or is in the process of being created. If it is really "gone" then any process that tries to use it is in error. Except for the dcache garbage collection routines (that we see here), of course.

The other thing that comes to mind is that the VFS will not "see" unhashed dentries during lookups, so they shouldn't be available to modules unless they have socked an address away somewhere, and they shouldn't become negative unless a reference was not taken. Is something playing funny games with dentries here? Is VxFS trying to use an invalid dentry?

> crash> inode.i_count 0xffff810514f57260
> i_count = { counter = 1 },

Yes, as it should be if it hasn't finished being umounted.

Ian

Well... I can make these messages pop with the hypothetical scenario I came up with:

- export a filesystem from a server
- mount it on a client
- cd into it
- do some stuff (ls's and such)
- unexport it from the server
- umount the server filesystem
- export the directory with the same fsid
- stat the root of the share

...and I get this. At that point, I get stale NFS filehandles on the client. Unmounting it doesn't seem to cause problems with busy inodes, however, so perhaps there's another factor that contributes to this?

NFS: server dantu.rdu.redhat.com error: fileid changed fsid 0:18: expected fileid 0x2, got 0x378001

If it's suspected that the hypothetical situation I've laid out is part of the problem here, one thing they may want to consider in their clusters is using the "mp" export option, which will only export a directory if the filesystem is mounted. See the exports(5) manpage for info. That might help prevent the issue from occurring while we try to determine the cause.

(In reply to comment #20)
> (In reply to comment #15)
> >
> > Hypothetical situation:
> >
> > The service failed over and the underlying filesystem wasn't mounted, but the
> > export was set up. Because it was exported with an explicit fsid= value, it
> > looks like the same filesystem but the server ended up exporting the mountpoint
> > of the directory instead.
>
> That would mean that the VCS configuration is grossly wrong
> which is unlikely. The VCS configuration will consist of
> groups of resources that depend on other groups. For example
> the NFS service group will depend on the file systems group
> and will not be able to start until VCS is happy the filesystems
> are successfully setup.

On second thoughts, maybe not. IIRC, the failover activity depends on site-supplied scripts. If the ones responsible for the file system mounts returned an incorrect result, that could cause this. If there is a failure then the whole process stops waiting for admin intervention, so that is another opportunity for out-of-order actions.

What happens if in your test above you mount the filesystem over the top of the mount point after it is exported?
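To make the "mp" suggestion above concrete: the exports(5) mountpoint/mp option only exports a directory if a filesystem is actually mounted on it, which conceptually boils down to a device-number comparison between the directory and its parent. Here is a rough user-space illustration of that check; it is a sketch, not the nfs-utils code.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Rough idea behind the exports(5) "mp"/"mountpoint" option: a directory is
 * a mount point when it lives on a different device than its parent (or is
 * the filesystem root).  Simplified illustration only. */
static int looks_like_mountpoint(const char *path)
{
    char parent[4096];
    struct stat st, pst;

    if (stat(path, &st) != 0)
        return 0;
    snprintf(parent, sizeof(parent), "%s/..", path);
    if (stat(parent, &pst) != 0)
        return 0;
    /* different device, or same inode as parent (i.e. "/") => mount point */
    return st.st_dev != pst.st_dev || st.st_ino == pst.st_ino;
}

int main(int argc, char **argv)
{
    const char *p = argc > 1 ? argv[1] : "/";
    printf("%s %s a mount point\n", p,
           looks_like_mountpoint(p) ? "looks like" : "does not look like");
    return 0;
}
```

In the failover scenario discussed above, a check like this is what would stop the server from exporting an empty mountpoint directory while the underlying filesystem is not mounted.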
Created attachment 407030 [details]
suggested change to prevent dropping a disconnected dentry
The patch seems to fix the problem shown by the reproducer. What is less clear is whether this problem is the actual problem that the customer hit. There's no way to say that it is with 100% certainty. What we can say is that the problem they hit is consistent with the problem that the reproducer shows. That's probably the best we can hope for with this.

At this point, I'm just waiting for Al to push the patch upstream. Once that's done, we can propose it for RHEL5. I also need to see whether the problem is reproducible on rawhide/rhel6 as well.

I'm unable to reproduce this so far on rawhide kernels. This commit seems like it's probably papering over the bug:

commit 4e99a1ff3410c627a428d5ddb6cd2e7bc908a486
Author: Trond Myklebust <Trond.Myklebust>
Date:   Thu Mar 6 12:34:59 2008 -0500

    NFS: Fix dentry revalidation for NFSv4 referrals and mountpoint crossings

...that said, it seems a little fragile to depend on this. The code could change at some point, or there may be some way to avoid this check in the current code (exclusive create?). Either way, I think Al's proposed patch is reasonable.

I'll add my ACK on Al's patch.

in kernel-2.6.18-199.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When an NFS server exported a file system with an explicit "fsid=[file_system_ID]", and an NFS client mounted that file system on one mount point and a subdirectory of it on a separate mount point, then if the server re-exported that file system after un-exporting and unmounting it, the NFS client could receive the following error message when unmounting those mount points:
"VFS: Busy inodes after unmount..."
Additionally, it was possible to crash the NFS client's kernel in this situation.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html