Bug 253663 - NFS: System crashes trying to force umount a unresponsive, interruptible mount, which holds references to silly renamed files.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
: ---
Assignee: Steve Dickson
QA Contact: Martin Jenner
URL:
Whiteboard: GSSApproved
Duplicates: 337981
Depends On:
Blocks: 246139 254106 296411 372911 414041 420521 422431 422441 451649
Reported: 2007-08-21 02:20 UTC by chakri
Modified: 2018-10-19 23:05 UTC (History)
7 users (show)

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 14:53:09 UTC
Target Upstream Version:
Embargoed:


Attachments
messages showing the crash during reboot. (2.70 KB, application/octet-stream), 2007-11-06 08:08 UTC, Sachin Prabhu


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2008:0314 | 0 | normal | SHIPPED_LIVE | Updated kernel packages for Red Hat Enterprise Linux 5.2 | 2008-05-20 18:43:34 UTC

Description chakri 2007-08-21 02:20:27 UTC
Description of problem:
The system crashes when trying to umount an unresponsive, interruptible mount
that holds references to silly-renamed files.

Version-Release number of selected component (if applicable):
RHEL 5, kernel 2.6.18-8.1.8.el5


Steps to Reproduce:
1. Mount an NFS share with the -o intr option.
2. Create a test file "x" on the mounted share.
3. Run "cat > x" so that the file stays in use.
4. Run "rm x"; because the file is still open, the client silly-renames it.
5. Stop the NFS server hosting the share, so that the server is no longer
available.
6. Run "kill -9 <pid of the cat process above>".
7. Unmount the NFS share.

The system panics in shrink_dcache_for_umount.
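
The steps above can be sketched as a script. This is only an illustration: SERVER, EXPORT, MNT and FIFO are placeholder values rather than values from this report, the FIFO is there only so that cat can hold the file open without a terminal, and the final step uses "umount -f" as in the bug title. The script does nothing unless CONFIRM=yes is set, because on an affected kernel the last step may panic or hang the machine.

```shell
#!/bin/sh
# Reproduction sketch. All four values below are placeholders (assumptions).
SERVER=${SERVER:-nfsserver.example.com}
EXPORT=${EXPORT:-/share}
MNT=${MNT:-/mnt/nfs}
FIFO=${FIFO:-/tmp/fifo.test}

repro() {
    mount -o intr "$SERVER:$EXPORT" "$MNT" || return 1
    mkfifo "$FIFO" || return 1
    cat "$FIFO" > "$MNT/x" &       # opens x, then blocks on the FIFO
    holder=$!
    sleep 1                        # give the background job time to open x
    rm -f "$MNT/x"                 # x is still open, so it is silly-renamed
    ls -a "$MNT" | grep '^\.nfs'   # the .nfsXXXX entry should be listed
    printf 'Make the NFS server unreachable, then press Enter: '
    read -r reply
    kill -9 "$holder"
    sleep 2
    umount -f "$MNT"               # affected kernels crash or hang here
}

main() {
    if [ "${CONFIRM:-}" = yes ]; then
        repro
    else
        echo "dry run: set CONFIRM=yes to run against $SERVER:$EXPORT on $MNT"
    fi
}

main
```

Run it as root on a disposable test client only.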

NFS in the 2.6.18-8.1.8.el5 kernel does not support force umounts. In
combination with bug 218718 (the client does not wait for the async unlink RPC
task to complete), this makes the system panic in "shrink_dcache_for_umount".

After applying the patch for bug 218718, the system still reports "Busy inodes
after umount", since umount_begin() does not wait for or kill all of its RPC tasks.

Additional info:
Applying the patch to enable forced umounts solves the problem.

Comment 2 Issue Tracker 2007-08-22 11:11:00 UTC
Oops message generated for this particular problem on a 2.6.18-38.el5 kernel:

BUG: Dentry f7a8f778{i=dcdc5,n=PWRPNT} still in use (1) [unmount of nfs 0:18]
------------[ cut here ]------------
kernel BUG at fs/dcache.c:615!
invalid opcode: 0000 [#1]
SMP 
last sysfs file: /block/ram0/range
Modules linked in: nfs lockd fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 dm_mirror dm_mod video sbs backlight i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport joydev ata_piix libata ide_cd sg bnx2 cdrom serio_raw pcspkr megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    1
EIP:    0060:[<c048354a>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-38.el5 #1) 
EIP is at shrink_dcache_for_umount_subtree+0x133/0x1c1
eax: 00000051   ebx: f7a8f778   ecx: c062af58   edx: ea37fef4
esi: 00000001   edi: f7b46d40   ebp: 000dcdc5   esp: ea37fef0
ds: 007b   es: 007b   ss: 0068
Process umount.nfs (pid: 5253, ti=ea37f000 task=f630caa0 task.ti=ea37f000)
Stack: c062af58 f7a8f778 000dcdc5 f7a8f7dc 00000001 f8d25b94 f7b46d40 f7b46c00
       f8d37f80 00000000 00000002 c0483f8b f7b46c00 c0475443 00000018 f8d37f60
       c0475538 ed1f7780 f8d0a469 f7b46c00 c04755c7 cb152b40 f7b46c00 c0488572
Call Trace:
 [<c0483f8b>] shrink_dcache_for_umount+0x2e/0x3a
 [<c0475443>] generic_shutdown_super+0x16/0xd5
 [<c0475538>] kill_anon_super+0x9/0x2f
 [<f8d0a469>] nfs_kill_super+0xc/0x14 [nfs]
 [<c04755c7>] deactivate_super+0x52/0x65
 [<c0488572>] sys_umount+0x1f0/0x218
 [<c04619fb>] unmap_region+0xe1/0xf0
 [<c044aa77>] audit_syscall_entry+0x11c/0x14e
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
Code: ed 8b 53 0c 8b 33 8b 4b 24 8d b8 40 01 00 00 8b 40 1c 85 d2 8b 00 74 03 8b 6a 20 57 50 56 51 55 53 68 58 af 62 c0 e8 9e 32 fa ff <0f> 0b 67 02 4c af 62 c0 83 c4 1c 8b 73 18 39 de 75 04 31 f6 eb
EIP: [<c048354a>] shrink_dcache_for_umount_subtree+0x133/0x1c1 SS:ESP 0068:ea37fef0
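
When going through logs for this signature, it can help to pull the fields out of the "still in use" line. The helper below is a sketch under one assumption: that the i= field is the inode number printed in hexadecimal and the number in parentheses is the dentry reference count, which is how I read the kernel's message. The function name is made up for illustration.

```shell
# Extract inode (converted to decimal), name and reference count from a
# "BUG: Dentry ... still in use" line. Field meanings are an assumption.
parse_busy_dentry() {
    line=$1
    ino=$(printf '%s\n' "$line" | sed -n 's/.*{i=\([0-9a-f]*\),n=.*/\1/p')
    name=$(printf '%s\n' "$line" | sed -n 's/.*,n=\(.*\)} still in use.*/\1/p')
    count=$(printf '%s\n' "$line" | sed -n 's/.*still in use (\([0-9]*\)).*/\1/p')
    [ -n "$ino" ] || return 1      # not a busy-dentry line
    printf 'inode=%d name=%s refcount=%s\n' "0x$ino" "$name" "$count"
}

parse_busy_dentry 'BUG: Dentry f7a8f778{i=dcdc5,n=PWRPNT} still in use (1) [unmount of nfs 0:18]'
# prints: inode=904645 name=PWRPNT refcount=1
```

The decimal form is the one "ls -i" prints, which makes it easier to match against a directory listing.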



This event sent from IssueTracker by sprabhu 
 issue 129861

Comment 3 Issue Tracker 2007-08-22 11:27:10 UTC
The upstream fix for the issue appears to be 

http://linux.bkbits.net:8080/linux-2.6/?PAGE=cset&REV=1.6154.41.223




This event sent from IssueTracker by sprabhu 
 issue 129861

Comment 4 Issue Tracker 2007-08-22 11:35:22 UTC
The issue can be reproduced quite easily using the reproducer from comment #1


This event sent from IssueTracker by sprabhu 
 issue 129861

Comment 5 chakri 2007-08-23 00:24:54 UTC
There are two problems:
1. The async unlink problem (bug 218718).
2. Enabling force umounts.

The async unlink problem (bug 218718) has been fixed upstream, per comment #3.
Force umount support is already in the 2.6.18-38.el5 kernel; the upstream
fix is
http://linux.bkbits.net:8080/linux-2.6/?PAGE=cset&REV=1.6048.125.12

Also, a quick patch combining the fix from bug 218718 and
http://linux.bkbits.net:8080/linux-2.6/?PAGE=cset&REV=1.6048.125.12
on top of the 2.6.18-8.el5 kernel fixes this problem.

I would be happy to see this fix go in earlier than 5.2: although the sequence
of steps is convoluted, it can happen with good probability.

Many people delete files & directories and umount.

Comment 6 Steve Dickson 2007-08-23 15:37:11 UTC
Which nfs-utils are you using? I'm getting
#umount /mnt/rhelxen/home
umount.nfs: rhelxen:/home: not found / mounted or server not reachable
umount.nfs: rhelxen:/home: not found / mounted or server not reachable

when I do the umount, and doing forced umounts works just fine.

I'm using nfs-utils-1.0.9-23.el5

Comment 7 Sachin Prabhu 2007-08-23 16:03:53 UTC
I am using the same version of nfs-utils.  Use the -f parameter to unmount.

Steps to recreate:

2 terminals to the machine

From terminal 1

# mount 10.65.6.224:/share /mnt
# cd /mnt
#cat > x

From terminal 2

# rm /mnt/x 
rm: remove regular empty file `/mnt/x'? y
//Check for silly renamed file
# ls -la /mnt
..
-rw-r--r--  1 nfsnobody nfsnobody    0 Aug 21 00:03 .nfs000000000001b8c500000001
..

//Kill the cat command
# killall -9 cat

Terminal 1 again

Terminated
# cd /home
# umount -f /mnt

At this point, the machine crashes.

Comment 8 Sachin Prabhu 2007-08-23 16:06:39 UTC
Forgot to mention. Before moving back to terminal 1 and unmounting, add an
iptables rule on the nfs server to disable the connection.

# iptables -A INPUT -s 10.65.6.39 -j DROP

Comment 9 Steve Dickson 2007-08-27 18:21:32 UTC
OK, I was able to reproduce this on earlier RHEL 5 kernels,
but with the latest RHEL 5.1 kernel (2.6.18-43.el5) I am
no longer able to reproduce it.

Unfortunately, the patch in comment #5 needs
several prior patches before it will apply
correctly. I think I understand the gist of what
may need to happen...

So please check whether everyone is still able to reproduce
this problem on the latest kernel.

 

Comment 10 Issue Tracker 2007-08-28 05:51:44 UTC
Tested with the 2.6.18-44.el5xen kernel. Could not reproduce the problem.




This event sent from IssueTracker by sprabhu 
 issue 129861

Comment 11 chakri 2007-09-12 19:15:15 UTC
Unable to reproduce the problem in 2.6.18-45.el5 kernel. It seems to be fixed.

But bug 218718 still exists in the 2.6.18-45.el5 kernel.


Comment 12 Steve Dickson 2007-10-01 07:55:28 UTC

*** This bug has been marked as a duplicate of 218718 ***

Comment 13 Steve Dickson 2007-10-01 08:22:18 UTC
Reopening this bug to use as a tracker for the following patch:
--- linux-2.6.18.noarch/fs/nfs/unlink.c.org	2006-09-19 23:42:06.000000000 -0400
+++ linux-2.6.18.noarch/fs/nfs/unlink.c	2007-09-17 07:50:13.990779000 -0400
@@ -219,5 +219,6 @@ nfs_complete_unlink(struct dentry *dentr
 	dentry->d_flags &= ~DCACHE_NFSFS_RENAMED;
 	spin_unlock(&dentry->d_lock);
 	rpc_wake_up_task(&data->task);
+	__rpc_wait_for_completion_task(&data->task, NULL);
 	nfs_put_unlinkdata(data);
 }

Comment 15 RHEL Program Management 2007-10-01 08:34:19 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 Tom Coughlan 2007-10-25 19:39:26 UTC

*** This bug has been marked as a duplicate of 254106 ***

Comment 17 Sachin Prabhu 2007-11-06 08:08:43 UTC
Created attachment 248871
messages showing the crash during reboot.

We have a customer who can reproduce the issue even with the test kernel that
was built with the patches posted to rhkernel-list.

This is how the issue was reproduced:

1. Boot 5.1 snapshot 3.
2. Build and test glibc from CVS locally, with the source on an NFS server
mounted via autofs. (This was also tested without autofs; the problem still
occurs with a plain NFS-mounted share.)
3. Reboot.

The crash happens during the reboot, while the share is being unmounted.

The first message appears during the xcheck of glibc. The rest, starting with
the INIT: SwitchingINIT: line, appear on executing 'reboot'. This panic was
generated with the regular -53 kernel with autofs stopped. The attached file
contains the messages seen.

Comment 20 Steve Dickson 2007-11-21 15:47:21 UTC
I reopened this bug because it turns out not to be
a dup of 254106, even though the footprint looked
similar.


Comment 27 Don Zickus 2007-12-17 19:36:53 UTC
in 2.6.18-61.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 31 Don Zickus 2008-01-24 20:51:24 UTC
*** Bug 337981 has been marked as a duplicate of this bug. ***

Comment 32 Jan Tluka 2008-02-25 17:16:30 UTC
Executing the test in comment #28 does not result in a kernel panic; instead,
the test hangs on the last umount command. The umount command stays in
uninterruptible sleep and therefore cannot be killed with 'kill -9'.

I used kernels 2.6.18-79.el5 and 2.6.18-53.1.14.el5, and both have this problem.

--output of test--
[root@ibm-e326m ~]# ./test.sh 
+ service nfs start
Starting NFS services:  [  OK  ]
Starting NFS quotas: [  OK  ]
Starting NFS daemon: [  OK  ]
Starting NFS mountd: [  OK  ]
+ mkdir /mnt/exp /mnt/nfs
+ exportfs -o rw,no_root_squash localhost:/mnt/exp
+ mount -o intr localhost:/mnt/exp /mnt/nfs
+ mkfifo /tmp/fifo.test
+ cat /tmp/fifo.test
+ rm -f /mnt/nfs/x
+ service nfs stop
Shutting down NFS mountd: [  OK  ]
Shutting down NFS daemon: [  OK  ]
Shutting down NFS services:  [  OK  ]
Shutting down RPC svcgssd: [FAILED]
+ kill -9 2745
+ sleep 2
./test.sh: line 12:  2745 Killed                  cat /tmp/fifo.test > /mnt/nfs/x
+ jobs -l
+ umount -f -l /mnt/nfs
-- test hangs here

--dmesg output--
NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
FS-Cache: Loaded
FS-Cache: netfs 'nfs' registered for caching
SELinux: initialized (dev 0:18, type nfs), uses genfs_contexts
nfsd: last server has exited
nfsd: unexporting all filesystems

--process list includes these lines--
root      2788  0.0  0.0  71732   760 pts/0    S+   11:58   0:00 umount -f -l /mnt/nfs
root      2789  0.0  0.0   3840   504 pts/0    D+   11:58   0:00 /sbin/umount.nfs /mnt/nfs -l -f
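
To spot processes stuck in uninterruptible sleep (the D state shown for umount.nfs above) without reading the whole process list, a small filter is enough. This is a sketch; the function name d_state is made up, and it assumes input lines of the form "PID STAT COMMAND", such as those produced by procps "ps -eo pid=,stat=,comm=".

```shell
# Keep only lines whose second field (the state) starts with D.
d_state() {
    awk '$2 ~ /^D/'
}

# On a live system, list D-state processes if ps is available:
command -v ps >/dev/null 2>&1 && ps -eo pid=,stat=,comm= | d_state
```

An entry that stays in this list across repeated runs, like the umount.nfs process here, is blocked in the kernel and will not react to 'kill -9'.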


Comment 33 Debbie Johnson 2008-03-28 12:25:11 UTC
Adding IT 170449 to this BZ, as Fujitsu Engineering is seeing the same issue with 5.2.

Comment 34 Issue Tracker 2008-03-31 11:54:33 UTC
Hi,

Can you provide a status of BZ 253663?  I have not seen an update since
02/25.
I am asking because of my IT case 170449 which I recently attached to this
BZ.
Fujitsu-engineering is seeing the problem running kernel-2.6.18-84.el5
(rhel-x86_64-server-5-beta).  The issue appears to be a regression of 
linux-2.6-nfs-infrastructure-changes-for-silly-renames.patch.

Thanks in advance for the assistance.

Debbie 
SEG

P.S.  If I have done this incorrectly please let me know as I am still
learning the procedures/ropes.


Issue escalated to RHEL 5 Kernel by: dejohnso.
Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by dejohnso 
 issue 170449

Comment 36 Steve Dickson 2008-04-01 17:47:36 UTC
After further review, I contend there is a slight bug
in the script in comment #28. After changing
'umount -f -l /mnt/nfs' to use only the -f (force) flag,
the script exits as expected, which means the -l flag
is causing the umount to hang.

Now, why does the umount hang with the -l flag? Because
it is supposed to. The umount cannot complete until
the removal of /mnt/nfs/x completes. The removal
of the file becomes asynchronous when the file
is still open at the time it is removed.

If the NFS server is down when the asynchronous removal
starts, the client will keep trying (uninterruptibly)
to remove the file. This process has to be
uninterruptible; otherwise the oops in comment #2 will happen.

So since simply using '-f' stops the umount
from hanging (by basically putting the asynchronous removal
in the background), I would say things are working as expected.


Comment 37 Tom Coughlan 2008-04-01 17:55:15 UTC
Setting back to ON_QA. 

Comment 38 Don Domingo 2008-04-02 02:11:53 UTC
Hi,
the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at
which point no further additions or revisions will be entertained.

a mockup of the RHEL5.2 release notes can be viewed at the following link:
http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don

Comment 41 errata-xmlrpc 2008-05-21 14:53:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


Comment 48 Chris Ward 2008-06-17 15:48:44 UTC
Reminder: This bug includes the 'RHTS' QA Whiteboard Keyword. Don't forget to add
'RHTSdone' to the QA Whiteboard along with a comment describing where the RHTS
test can be found once the RHTS test has been written. Otherwise, if an RHTS
will not be created, please remove RHTS from the qa whiteboard.

