Created attachment 344619 [details] reproducer Description of problem: I hit the following oops when I was trying to get rid of stale nfs4 mount. This can be reproduced using attached reproducer and following steps below. Version-Release number of selected component (if applicable): kernel 2.6.9-89.ELsmp How reproducible: occasionaly Steps to Reproduce: (on terminal 1:) 0. untar 436092.tar.gz; cd 436092 1. make run repeat this step until you see line similar to: ./runtest.sh: line 35: 5030 Terminated ./reproducer /mnt/testarea/nfs_436092/testfile_436092 (on terminal 2:) 2. umount /mnt/testarea/nfs_436092 umount: /mnt/testarea/nfs_436092: device is busy umount: /mnt/testarea/nfs_436092: device is busy 3. ls -l /mnt/testarea/nfs_436092 (hangs) (don't kill this yet) (on terminal 1:) Ctrl-C (kills the rhts test) (on terminal 2:) Ctrl-C (kills the ls command) 4. umount /mnt/testarea/nfs_436092 (hangs) output (device is busy) like in step 2 is NOT shown (on terminal 1:) 5. umount -f /mnt/testarea/nfs_436092 umount2: Invalid argument umount: /mnt/testarea/nfs_436092: not mounted umount2: Invalid argument umount: /mnt/testarea/nfs_436092: not mounted 6. while true; do service nfs restart; done (this will hang on startup in a few secs) repeat this step after you see hang at least 6 times (Ctrl-C) 7. service nfs stop 8. reboot Actual results: VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day... INIT: Sending processes the KILL signal INIT: Pid 4184 [id 1] seems to hang INIT: Pid 4185 [id 2] seems to hang INIT: Pid 4186 [id 3] seems to hang INIT: Pid 4187 [id 4] seems to hang INIT: Pid 4188 [id 5] seems to hang INIT: Pid 4189 [id 6] seems to hang Stopping HAL daemon: [FAILED] Stopping system message bus: [ OK ] Stopping atd: [ OK ] Stopping cups: [ OK ] Shutting down xfs: [ OK ] Shutting down console mouse services: [ OK ] Stopping sshd:[ OK ] Shutting down sm-client: [ OK ] Shutting down sendmail: [ OK ] Shutting down smartd: [FAILED] Stopping xinetd: [ OK ] Stopping acpi daemon: [ OK ] Stopping crond: [ OK ] Stopping autofs: [ OK ] Shutting down ntpd: [ OK ] Stopping NFS statd: [ OK ] Stopping irqbalance: [ OK ] Stopping portmap: [ OK ] Shutting down kernel logger: [ OK ] Shutting down system logger: [ OK ] Shutting down interface eth0: [ OK ] Shutting down loopback interface: [ OK ] Stopping pcmcia: unloading Kernel Card Services [ OK ] Starting killall: [ OK ] Sending all processes the TERM signal... Sending all processes the KILL signal... RPC: error 5 connecting to server localhost RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). Unable to handle kernel NULL pointer dereference at 0000000000000038 RIP: <ffffffffa01e42b2>{:nfsd:nfsd_svc+672} PML4 7726e067 PGD 0 Oops: 0000 [1] SMP CPU 2 Modules linked in: nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 sunrpc cpufreq_powersave loop joydev button battery ac ehci_hcd uhci_hcd i5000_edac edac_mc hw_random bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod Pid: 10268, comm: rpc.nfsd Not tainted 2.6.9-89.ELsmp RIP: 0010:[<ffffffffa01e42b2>] <ffffffffa01e42b2>{:nfsd:nfsd_svc+672} RSP: 0018:0000010076475e68 EFLAGS: 00010246 RAX: 00000000fffffffb RBX: 0000000000000000 RCX: 000001000000d000 RDX: 00000000fffffffb RSI: 00000000fffffffb RDI: 0000000000000000 RBP: 0000000000000008 R08: 00000000fffffffb R09: 0000ffff80411b20 R10: ffffffff80321080 R11: ffffffff802af8d5 R12: 00000000fffffffb R13: 0000000000000002 R14: 0000007fbfffb470 R15: 0000000000002000 FS: 0000002a958a5b00(0000) GS:ffffffff80504600(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000038 CR3: 00000000024fe000 CR4: 00000000000006e0 Process rpc.nfsd (pid: 10268, threadinfo 0000010076474000, task 000001007d055030) Stack: 0000010076fcf008 0000010076475e98 0000000000000008 ffffffffa01e4d29 0000010076475e99 0000010076fcf009 0000007fbfff0038 0000007fbfffb470 0000000000002000 ffffffff8015f36b Call Trace:<ffffffffa01e4d29>{:nfsd:write_threads+131} <ffffffff8015f36b>{get_zeroed_page+107} <ffffffff8019bdb5>{simple_transaction_get+152} <ffffffffa01e48ee>{:nfsd:nfsctl_transaction_write+78} <ffffffff8017c4ce>{vfs_write+207} <ffffffff8017c5b6>{sys_write+69} <ffffffff801102f6>{system_call+126} Code: 83 7f 38 01 0f 94 c3 e8 6c d9 f4 ff 85 db 74 15 48 c7 05 43 RIP <ffffffffa01e42b2>{:nfsd:nfsd_svc+672} RSP <0000010076475e68> CR2: 0000000000000038 <0>Kernel panic - not syncing: Oops Expected results: No oops. Filesystem is unmounted properly. Additional info:
Snippet of disassembly: 2a0: e8 00 00 00 00 callq 2a5 <nfsd_svc+0x293> 2a5: ff c5 inc %ebp 2a7: 78 da js 283 <nfsd_svc+0x271> 2a9: 48 8b 3d 00 00 00 00 mov 0(%rip),%rdi # 2b0 <nfsd_svc+0x29e> 2b0: 31 db xor %ebx,%ebx 2b2: 83 7f 38 01 cmpl $0x1,0x38(%rdi) <<<< crash here 2b6: 0f 94 c3 sete %bl 2b9: e8 00 00 00 00 callq 2be <nfsd_svc+0x2ac> 2be: 85 db test %ebx,%ebx 2c0: 74 15 je 2d7 <nfsd_svc+0x2c5> 2c2: 48 c7 05 00 00 00 00 movq $0x0,0(%rip) # 2cd <nfsd_svc+0x2bb> ....%rdi was 0 so that explains the crash. Looking now to determine where that is in the C sources.
Looks like it crashed here. Maybe it raced with nfsd also setting nfsd_serv to NULL? none_left = (nfsd_serv->sv_nrthreads == 1); <<< crash here (?) svc_destroy(nfsd_serv); /* Release server */ if (none_left) { nfsd_serv = NULL; nfsd_racache_shutdown(); nfs4_state_shutdown(); }
Looks like we need to backport the patches that add the nfsd_mutex. In rhel4 (and rhel5 for that matter) the nfsd_serv pointer and its members are protected by the BKL. That's inadequate since these values need protection across sleeps as well. I'll see what can be done to backport that functionality to RHEL4 though it may take a little while (need to ensure that we're not dependent on other RPC changes before removing the BKL here). Jan, would it be possible for you to also test your reproducer on RHEL5? I suspect it has the same problem.
Hi Jeff. I tried to reproduce on RHEL5.4 GA. No success. It seems that 5.4 is immune to this bug. To be sure I retried on RHEL4.8 GA (-89.EL kernel). This triggered the panic on 1st attempt. In 5.4 I've noticed that forcing umount on the mounted nfs filesystem simply solved the stale mount. I removed the force umount in runtest.sh so it could hang on umount and tried to follow further steps in my reproducer. No success.
Created attachment 359792 [details] patch1 -- unlock_kernel before exiting Not fully related to this problem, but a fix in the same area. knfsd doesn't unlock_kernel before exiting. do_exit() calls schedule() and that papers over this bug, but we might as well fix it while we're in here.
Created attachment 359793 [details] patch2 -- handle nfsd_serv variable more cleanly I think this is the patch that fixed this upsteam and in RHEL5. Upstream added a new callback to the svc_serv struct to do this however. We can't reasonably do that for kabi reasons, but this patch should provide the same effect w/o breaking kabi.
Jan, the kernels on my people.redhat.com page have these patches Could you test them and let me know whether they fix the reproducer for you? http://people.redhat.com/jlayton/
Hi Jeff, tested on your kernel 2.6.9-89.11.EL.jtltest.70smp. Could not reproduce the issue. Filesystem can be umounted without problem.
*** Bug 509127 has been marked as a duplicate of this bug. ***
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.36.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
can reproduced on RHEL4_U8 on i386 and get the same kernel panic: [root@hp-xw9400-01 ~]# uname -a Linux hp-xw9400-01.rhts.eng.bos.redhat.com 2.6.9-89.ELsmp #1 SMP Mon Apr 20 10:34:33 EDT 2009 i686 athlon i386 GNU/Linux [root@hp-xw9400-01 ~]# [root@hp-xw9400-01 ~]# nfsd: last server has exited nfsd: unexporting all filesystems RPC: error 5 connecting to server localhost RPC: error 512 connecting to server localhost VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day... Broadcast message from root (pts/0) (Thu Jan 13 21:54:32 2011): INIT: Sending processes the TERM signal INIT: Sending processes the KILL signal INIT: Pid 4882 [id 1] seems to hang INIT: Pid 4883 [id 2] seems to hang INIT: Pid 4884 [id 3] seems to hang INIT: Pid 4885 [id 4] seems to hang INIT: Pid 4886 [id 5] seems to hang INIT: Pid 4887 [id 6] seems to hang Stopping cups-config-daemon: [ OK ] Stopping HAL daemon: [ OK ] Stopping system message bus: [ OK ] Stopping atd: [ OK ] Stopping cups: [ OK ] Shutting down xfs: [ OK ] Shutting down console mouse services: [ OK ] Stopping sshd:[ OK ] Shutting down sm-client: [ OK ] Shutting down sendmail: [ OK ] Shutting down smartd: [ OK ] Stopping xinetd: [ OK ] Stopping acpi daemon: [ OK ] Stopping crond: [ OK ] Stopping autofs: [ OK ] Shutting down ntpd: [ OK ] Stopping NFS statd: [ OK ] Stopping irqbalance: [ OK ] Stopping portmap: [ OK ] Shutting down kernel logger: [ OK ] Shutting down system logger: [ OK ] Shutting down interface eth0: [ OK ] Shutting down loopback interface: [ OK ] Shutting down anamon: [FAILED] Stopping beah-beaker-backend: [ OK ] Stopping beah-fwd-backend: [ OK ] Stopping beah-srv: [ OK ] Stopping cpuspeed: [ OK ] Starting killall: [ OK ] Sending all processes the TERM signal... RPC: error 101 connecting to server localhost rpciod: active tasks at shutdown?! RPC: error 5 connecting to server localhost RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: sendmsg returned error 101 portmap: RPC call returned error 101 RPC: failed to contact portmap (errno -101). svc_destroy: no threads for serv=f7ee3880! RPC: sendmsg returned error 101 portmap: RPC call returned error 101 RPC: failed to contact portmap (errno -101). RPC: sendmsg returned error 101 portmap: RPC call returned error 101 RPC: failed to contact portmap (errno -101). RPC: sendmsg returned error 101 portmap: RPC call returned error 101 RPC: failed to contact portmap (errno -101). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). Unable to handle kernel NULL pointer dereference at virtual address 00000020 printing eip: f8e66250 *pde = 2fac9001 Oops: 0000 [#1] SMP Modules linked in: nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc powernow_k8 cpufreq_powersave loop button battery ac ohci_hcd ehci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd soundcore forcedeth floppy sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod CPU: 4 EIP: 0060:[<f8e66250>] Not tainted VLI EFLAGS: 00010246 (2.6.9-89.ELsmp) EIP is at nfsd_svc+0x240/0x294 [nfsd] eax: 00000000 ebx: 00000000 ecx: f5b69c80 edx: fffffffb esi: 00000008 edi: fffffffb ebp: ef8c8fac esp: ef8c8f14 ds: 007b es: 007b ss: 0068 Process rpc.nfsd (pid: 8644, threadinfo=ef8c8000 task=f6de12f0) Stack: ef8c8f2c 00000008 efb1f004 f8e66b27 ef8c8f2d efb1f005 f6de0038 00000010 00000000 ef94f474 00000000 00000246 00000000 00000002 00000000 ef8c8f58 00000000 efb1f004 c017a86f efb1f000 00000002 0000000b 0000000b f69ad840 Call Trace: [<f8e66b27>] write_threads+0x6c/0x97 [nfsd] [<c017a86f>] simple_transaction_get+0xbd/0xf0 [<f8e6678d>] nfsctl_transaction_write+0x41/0x6c [nfsd] [<c015d5a7>] vfs_write+0xb6/0xe2 [<c015d671>] sys_write+0x3c/0x62 [<c02dd8e3>] syscall_call+0x7/0xb Code: f6 eb 1f 81 fb a8 f2 e8 f8 74 19 89 d8 b9 01 00 00 00 8b 50 08 b8 01 00 00 00 8b 1b e8 23 6d 2c c7 46 78 df a1 08 15 e9 f8 31 db <83> 78 20 01 0f 94 c3 e8 82 7c bf ff 85 db 74 14 c7 05 08 15 e9 <0>Fatal exception: panic in 5 seconds RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). svc_destroy: no threads for serv=f7eda880! RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: error 5 connecting to server localhost RPC: failed to contact portmap (errno -5). RPC: sendmsg returned error 101 portmap: RPC call returned error 101 RPC: failed to contact portmap (errno -101). Sending all processes the KILL signal... Saving mixer settings Saving random seed: Syncing hardware clock to system time Turning off swap: Turning off quotas: Unmounting pipe file systems: Unmounting file systems: Kernel panic - not syncing: Fatal exception and reproduced on RHEL4_U8 on x86_64 can't get panic, the "make run" can finish with PASS but can get "VFS: Busy inodes after unmount" error on console: [root@hp-xw4200-01 436092]# uname -a Linux hp-xw4200-01.rhts.eng.bos.redhat.com 2.6.9-89.ELsmp #1 SMP Mon Apr 20 10:33:05 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux [root@hp-xw4200-01 436092]# make run chmod a+x ./runtest.sh ./runtest.sh --- Starting test ... Running on kernel: Linux hp-xw4200-01.rhts.eng.bos.redhat.com 2.6.9-89.ELsmp #1 SMP Mon Apr 20 10:33:05 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux --- Setting up the environment ... Shutting down NFS mountd: [FAILED] Shutting down NFS daemon: [FAILED] Shutting down NFS quotas: [FAILED] Shutting down NFS services: [FAILED] Starting NFS services: [ OK ] Starting NFS quotas: [ OK ] Starting NFS daemon: [ OK ] Starting NFS mountd: [ OK ] Starting RPC idmapd: [ OK ] /tmp <world> /dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw) none on /proc type proc (rw) none on /sys type sysfs (rw) none on /dev/pts type devpts (rw,gid=5,mode=620) usbfs on /proc/bus/usb type usbfs (rw) /dev/sda1 on /boot type ext3 (rw) none on /dev/shm type tmpfs (rw) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) nfsd on /proc/fs/nfsd type nfsd (rw) localhost:/ on /mnt/testarea/nfs_436092 type nfs4 (rw,soft,addr=127.0.0.1) localhost:/ /mnt/testarea/nfs_436092 nfs4 rw,soft,addr=127.0.0.1 0 0 --- Checking bug presence ... Shutting down NFS mountd: [ OK ] Shutting down NFS daemon: [ OK ] Shutting down NFS quotas: [ OK ] Shutting down NFS services: [ OK ] Wait one minute for a panic ... If you see this line your system has the bug fixed. --- Cleaning up the system ... Shutting down RPC idmapd: [ OK ] localhost:/ /mnt/testarea/nfs_436092 nfs4 rw,soft,addr=127.0.0.1 0 0 umount: /mnt/testarea/nfs_436092: device is busy umount: /mnt/testarea/nfs_436092: device is busy umount2: Device or resource busy umount: /mnt/testarea/nfs_436092: device is busy ./runtest.sh: line 35: 8605 Terminated ./reproducer /mnt/testarea/nfs_436092/testfile_436092 --- Finished testing: PASS /kernel/errata/4.8/436092 result: 0 Log: /mnt/testarea/tmp.CK8484 DMesg: /mnt/testarea/dmesg.log Info: Searching AVC errors produced since 1294974275 (Thu Jan 13 22:04:35 2011) Searching logs... Info: No AVC messages found. Writing to /mnt/testarea/tmp.if8491 : AvcLog: /mnt/testarea/tmp.if8491 console: # nfsd: last server has exited nfsd: unexporting all filesystems RPC: error 5 connecting to server localhost NFSD: stale stateid (4d2ebf05/00000000/00000000/00000001)! nfsd: last server has exited nfsd: unexporting all filesystems RPC: error 5 connecting to server localhost RPC: error 5 connecting to server localhost nfsd: last server has exited nfsd: unexporting all filesystems RPC: error 5 connecting to server localhost NFSD: stale stateid (4d2ec030/00000004/00000002/00000001)! nfsd: last server has exited nfsd: unexporting all filesystems RPC: error 5 connecting to server localhost RPC: error 5 connecting to server localhost nfsd: last server has exited nfsd: unexporting all filesystems RPC: error 5 connecting to server localhost RPC: error 512 connecting to server localhost VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day... verified on kernel 2.6.9-96.EL on i386 and x86_64, no kernel panic occurs on and no "VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day" shown on, the system can run normal without hang.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0263.html