Description of problem:
The system crashes when trying to umount an unresponsive, interruptible NFS mount that still holds references to silly-renamed files.

Version-Release number of selected component (if applicable):
RHEL 5, kernel 2.6.18-8.1.8.el5

Steps to Reproduce:
1. Mount an NFS share with the -o intr flag.
2. Create a test file "x" on the mounted share.
3. Run "cat > x"; this makes sure the file is in use.
4. Run "rm x"; this silly-renames the file.
5. Stop the NFS server hosting the NFS share, so that the server is not available.
6. Run "kill -9 <pid of above cat process>".
7. Unmount the NFS share.

The system panics in shrink_dcache_for_umount.

NFS in the 2.6.18-8.1.8.el5 kernel does not support force umounts. Combined with bug 218718 (the client does not wait for the async unlink RPC task to complete), this causes the system to panic in "shrink_dcache_for_umount". After applying the patch for bug 218718, the system still reports "Busy inodes after umount", since umount_begin() does not wait for or kill all of its RPC tasks.

Additional info:
Applying the patch to enable forced umounts solves the problem.
Oops message generated for this particular problem on a 2.6.18-38.el5 kernel:

BUG: Dentry f7a8f778{i=dcdc5,n=PWRPNT} still in use (1) [unmount of nfs 0:18]
------------[ cut here ]------------
kernel BUG at fs/dcache.c:615!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /block/ram0/range
Modules linked in: nfs lockd fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_mod video sbs backlight i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport joydev ata_piix libata ide_cd sg bnx2 cdrom serio_raw pcspkr megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    1
EIP:    0060:[<c048354a>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-38.el5 #1)
EIP is at shrink_dcache_for_umount_subtree+0x133/0x1c1
eax: 00000051   ebx: f7a8f778   ecx: c062af58   edx: ea37fef4
esi: 00000001   edi: f7b46d40   ebp: 000dcdc5   esp: ea37fef0
ds: 007b   es: 007b   ss: 0068
Process umount.nfs (pid: 5253, ti=ea37f000 task=f630caa0 task.ti=ea37f000)
Stack: c062af58 f7a8f778 000dcdc5 f7a8f7dc 00000001 f8d25b94 f7b46d40 f7b46c00
       f8d37f80 00000000 00000002 c0483f8b f7b46c00 c0475443 00000018 f8d37f60
       c0475538 ed1f7780 f8d0a469 f7b46c00 c04755c7 cb152b40 f7b46c00 c0488572
Call Trace:
 [<c0483f8b>] shrink_dcache_for_umount+0x2e/0x3a
 [<c0475443>] generic_shutdown_super+0x16/0xd5
 [<c0475538>] kill_anon_super+0x9/0x2f
 [<f8d0a469>] nfs_kill_super+0xc/0x14 [nfs]
 [<c04755c7>] deactivate_super+0x52/0x65
 [<c0488572>] sys_umount+0x1f0/0x218
 [<c04619fb>] unmap_region+0xe1/0xf0
 [<c044aa77>] audit_syscall_entry+0x11c/0x14e
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
Code: ed 8b 53 0c 8b 33 8b 4b 24 8d b8 40 01 00 00 8b 40 1c 85 d2 8b 00 74 03 8b 6a 20 57 50 56 51 55 53 68 58 af 62 c0 e8 9e 32 fa ff <0f> 0b 67 02 4c af 62 c0 83 c4 1c 8b 73 18 39 de 75 04 31 f6 eb
EIP: [<c048354a>] shrink_dcache_for_umount_subtree+0x133/0x1c1 SS:ESP 0068:ea37fef0

This event sent from IssueTracker by sprabhu issue 129861
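When comparing several logs from affected machines, it can help to pull the fields out of the "still in use" line. The following is only a sketch: the function name parse_busy_dentry is made up for illustration, and the line layout is assumed from the single example in this report rather than taken from the kernel source.

```shell
#!/bin/sh
# Extract the fields of a "BUG: Dentry ... still in use" line.
# Reads log text on stdin and prints one summary line per matching line.
# Assumption: the layout matches the single example quoted in this report.
parse_busy_dentry() {
    sed -n 's/.*BUG: Dentry \([0-9a-f]*\){i=\([0-9a-f]*\),n=\([^}]*\)} still in use (\([0-9-]*\)) \[unmount of \([^ ]*\) \([^]]*\)].*/dentry=\1 inode=0x\2 name=\3 count=\4 fstype=\5 dev=\6/p'
}
```

For the oops in this comment, something like "dmesg | parse_busy_dentry" should report inode 0xdcdc5, which is the same value that appears in ebp and on the stack.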
The upstream fix for the issue appears to be http://linux.bkbits.net:8080/linux-2.6/?PAGE=cset&REV=1.6154.41.223 This event sent from IssueTracker by sprabhu issue 129861
The issue can be reproduced quite easily using the reproducer from comment #1.

This event sent from IssueTracker by sprabhu issue 129861
There are two problems:
1. Async unlink problem - bug 218718
2. Enable force umounts

The async unlink problem (bug 218718) has been fixed upstream per comment #3. Force umount support is already present in the 2.6.18-38.el5 kernel; the upstream fix is http://linux.bkbits.net:8080/linux-2.6/?PAGE=cset&REV=1.6048.125.12

Also, a quick patch combining the fix from bug 218718 with http://linux.bkbits.net:8080/linux-2.6/?PAGE=cset&REV=1.6048.125.12 on top of the 2.6.18-8.el5 kernel fixes this problem.

I would be happy to see this one in earlier than 5.2: although the sequence of steps is convoluted, it can happen with good probability. Many people delete files and directories and then umount.
Which nfs-utils are you using? I ask because I'm getting

# umount /mnt/rhelxen/home
umount.nfs: rhelxen:/home: not found / mounted or server not reachable
umount.nfs: rhelxen:/home: not found / mounted or server not reachable

when I do the umount... and doing forced umounts works just fine. I am using nfs-utils-1.0.9-23.el5.
I am using the same version of nfs-utils. Use the -f parameter to unmount.

Steps to recreate (2 terminals to the machine):

From terminal 1
# mount 10.65.6.224:/share /mnt
# cd /mnt
# cat > x

From terminal 2
# rm /mnt/x
rm: remove regular empty file `/mnt/x'? y
//Check for silly renamed file
# ls -la /mnt
..
-rw-r--r-- 1 nfsnobody nfsnobody 0 Aug 21 00:03 .nfs000000000001b8c500000001
..
//Kill the cat command
# killall -9 cat

Terminal 1 again
Terminated
# cd /home
# umount -f /mnt

At this point, the machine crashes.
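The "check for silly renamed file" step can be scripted instead of reading the ls output by eye. This is a minimal sketch: list_silly_renamed is a hypothetical helper name, and it assumes the silly-renamed names look like ".nfs" followed only by hex digits, as in the listing above.

```shell
#!/bin/sh
# List silly-renamed files directly under a directory (default: current dir).
# Assumption: such names are ".nfs" followed only by hex digits, as in the
# ls output quoted in this comment.
list_silly_renamed() {
    dir=${1:-.}
    for f in "$dir"/.nfs*; do
        [ -e "$f" ] || continue          # glob did not match anything
        name=${f##*/}
        case ${name#.nfs} in
            ''|*[!0-9a-fA-F]*) ;;        # empty or non-hex suffix: skip
            *) printf '%s\n' "$name" ;;
        esac
    done
}
```

Running "list_silly_renamed /mnt" after the rm in terminal 2 should print the .nfs name. It has to be run while the server is still reachable, since listing a dead mount will itself block.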
Forgot to mention: before moving back to terminal 1 and unmounting, add an iptables rule on the NFS server to disable the connection.

# iptables -A INPUT -s 10.65.6.39 -j DROP
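The two-terminal steps can also be collapsed into one script. The following is a sketch under stated assumptions, not a tested reproducer: SERVER, EXPORT, MNT and FIFO are placeholder values, the run and repro names are made up, and the FIFO is used to keep the file open from a background job. By default it only prints the commands (DRY_RUN=1), because on an affected kernel the final umount is expected to crash the machine.

```shell
#!/bin/sh
# Sketch of the reproducer as a single script. With DRY_RUN=1 (the default)
# the commands are only printed. SERVER, EXPORT, MNT and FIFO are
# hypothetical placeholders; adjust them for a real test machine.
SERVER=${SERVER:-nfs-server.example}
EXPORT=${EXPORT:-/share}
MNT=${MNT:-/mnt}
FIFO=${FIFO:-/tmp/fifo.test}
DRY_RUN=${DRY_RUN:-1}

# Print a command, and execute it unless this is a dry run.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

repro() {
    run mount -o intr "$SERVER:$EXPORT" "$MNT"
    run mkfifo "$FIFO"
    # Hold $MNT/x open: the shell opens x for writing, then cat blocks
    # waiting for a writer on the FIFO.
    echo "+ cat $FIFO > $MNT/x &"
    pid='<pid>'
    if [ "$DRY_RUN" != 1 ]; then
        cat "$FIFO" > "$MNT/x" &
        pid=$!
        sleep 1
    fi
    run rm -f "$MNT/x"      # file is still open, so it gets silly-renamed
    echo "# make the server unreachable now (stop NFS or drop the client's packets)"
    [ "$DRY_RUN" = 1 ] || { printf 'press Enter when done: '; read -r reply; }
    run kill -9 "$pid"
    run umount -f "$MNT"    # affected kernels may panic here
}
```

Calling repro with the defaults prints the sequence for review; setting DRY_RUN=0 as root on a disposable test machine executes it. Making the server unreachable is left as a manual step because it happens on another host.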
Ok, I was able to reproduce this on earlier RHEL 5 kernels, but with the latest RHEL 5.1 kernel (2.6.18-43.el5) I am no longer able to reproduce it. Unfortunately, the patch in Comment #5 needs several prior patches in order to apply correctly. I think I understand the gist of what may need to happen... So please check whether anyone is still able to reproduce this problem on the latest kernel.
Tested with the 2.6.18-44.el5xen kernel. Could not reproduce the problem. This event sent from IssueTracker by sprabhu issue 129861
Unable to reproduce the problem on the 2.6.18-45.el5 kernel; it seems to be fixed. However, bug 218718 still exists in the 2.6.18-45.el5 kernel.
*** This bug has been marked as a duplicate of 218718 ***
Reopening this bug to use as a tracker for the following patch:

--- linux-2.6.18.noarch/fs/nfs/unlink.c.org 2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18.noarch/fs/nfs/unlink.c 2007-09-17 07:50:13.990779000 -0400
@@ -219,5 +219,6 @@ nfs_complete_unlink(struct dentry *dentr
 	dentry->d_flags &= ~DCACHE_NFSFS_RENAMED;
 	spin_unlock(&dentry->d_lock);
 	rpc_wake_up_task(&data->task);
+	__rpc_wait_for_completion_task(&data->task, NULL);
 	nfs_put_unlinkdata(data);
 }
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** This bug has been marked as a duplicate of 254106 ***
Created attachment 248871
messages showing the crash during reboot.

We have a customer who can reproduce the issue even with the test kernel that was built with the patches posted to rhkernel-list.

This is how the issue was reproduced:
1. Boot 5.1 snapshot 3.
2. Build/test glibc locally from CVS, with the source on an NFS server mounted via autofs. (This was also tested without autofs. The problem still occurs with a plain NFS-mounted share.)
3. Reboot.

The crash happens during reboot, while the share is being unmounted. The first message appears during the xcheck of glibc. The rest, starting with the "INIT: SwitchingINIT:" line, appear on executing 'reboot'. This panic was generated with the regular -53 kernel with autofs stopped. The attached file contains the messages seen.
I reopened this bug because it turns out not to be a dup of 254106, even though the footprint looked similar.
in 2.6.18-61.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 337981 has been marked as a duplicate of this bug. ***
While executing the test in comment #28, the test does not result in a kernel panic but hangs on the last umount command. The umount command stays in uninterruptible sleep and therefore cannot be killed with 'kill -9'. I used kernels 2.6.18-79.el5 and 2.6.18-53.1.14.el5, and both have this problem.

--output of test--
[root@ibm-e326m ~]# ./test.sh
+ service nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]
+ mkdir /mnt/exp /mnt/nfs
+ exportfs -o rw,no_root_squash localhost:/mnt/exp
+ mount -o intr localhost:/mnt/exp /mnt/nfs
+ mkfifo /tmp/fifo.test
+ cat /tmp/fifo.test
+ rm -f /mnt/nfs/x
+ service nfs stop
Shutting down NFS mountd: [ OK ]
Shutting down NFS daemon: [ OK ]
Shutting down NFS services: [ OK ]
Shutting down RPC svcgssd: [FAILED]
+ kill -9 2745
+ sleep 2
./test.sh: line 12: 2745 Killed cat /tmp/fifo.test > /mnt/nfs/x
+ jobs -l
+ umount -f -l /mnt/nfs
-- test hangs here

--dmesg output--
NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
FS-Cache: Loaded
FS-Cache: netfs 'nfs' registered for caching
SELinux: initialized (dev 0:18, type nfs), uses genfs_contexts
nfsd: last server has exited
nfsd: unexporting all filesystems

--process list includes these lines--
root 2788 0.0 0.0 71732 760 pts/0 S+ 11:58 0:00 umount -f -l /mnt/nfs
root 2789 0.0 0.0 3840 504 pts/0 D+ 11:58 0:00 /sbin/umount.nfs /mnt/nfs -l -f
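To spot the stuck process without scanning the whole process list, the D-state entries can be filtered out. This is a small sketch: dstate_procs is a hypothetical name, and it assumes "ps aux"-style columns like the two lines quoted above, with STAT in column 8 and the command starting at column 11.

```shell
#!/bin/sh
# Print "PID command" for processes in uninterruptible sleep (state D).
# Assumption: input is in "ps aux" column layout, as in the lines above:
# STAT is column 8 and the command line starts at column 11.
dstate_procs() {
    awk '$8 ~ /^D/ {
        cmd = $11
        for (i = 12; i <= NF; i++) cmd = cmd " " $i
        print $2, cmd
    }'
}
```

With the process list from this comment, "ps aux | dstate_procs" would single out /sbin/umount.nfs (pid 2789), while the parent umount in state S+ is not listed.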
Adding IT 170449 to this BZ, as Fujitsu Engineering is seeing the same issue with 5.2.
Hi,

Can you provide a status of BZ 253663? I have not seen an update since 02/25. I am asking because of my IT case 170449, which I recently attached to this BZ. Fujitsu engineering is seeing the problem running kernel-2.6.18-84.el5 (rhel-x86_64-server-5-beta). The issue appears to be a regression of linux-2.6-nfs-infrastructure-changes-for-silly-renames.patch.

Thanks in advance for the assistance.
Debbie
SEG

P.S. If I have done this incorrectly please let me know, as I am still learning the procedures/ropes.

Issue escalated to RHEL 5 Kernel by: dejohnso.
Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by dejohnso issue 170449
After further review, I contend there is a slight bug in the script in Comment #28. Changing 'umount -f -l /mnt/nfs' to use only the -f (force) flag makes the script exit as expected, which means the -l flag is causing the umount to hang.

Now, why does the umount hang with the -l flag? Because it is supposed to. The umount cannot complete until the removal of /mnt/nfs/x completes. The removal of the file becomes asynchronous when the file is still open at the time it is removed. If the NFS server is down when the asynchronous removal starts, the client will keep trying (uninterruptibly) to remove the file. This process has to be uninterruptible; otherwise the oops in Comment #2 will happen.

So, since simply using '-f' stops the umount from hanging (by basically putting the asynchronous removal in the background), I would say things are working as expected.
Setting back to ON_QA.
Hi, the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained. a mockup of the RHEL5.2 release notes can be viewed at the following link: http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html please use the aforementioned link to verify if your bugzilla is already in the release notes (if it needs to be). each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number. Cheers, Don
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html
Reminder: This bug includes the 'RHTS' QA Whiteboard Keyword. Don't forget to add 'RHTSdone' to the QA Whiteboard along with a comment describing where the RHTS test can be found once the RHTS test has been written. Otherwise, if an RHTS will not be created, please remove RHTS from the qa whiteboard.