Bug 501335 - oops in nfs4_put_open_state

Summary:          oops in nfs4_put_open_state
Product:          Red Hat Enterprise Linux 4
Component:        kernel
Version:          4.8
Hardware:         x86_64
OS:               Linux
Status:           CLOSED ERRATA
Severity:         high
Priority:         low
Reporter:         Jan Tluka <jtluka>
Assignee:         Jeff Layton <jlayton>
QA Contact:       yanfu,wang <yanwang>
CC:               dhoward, jlayton, phan, rwheeler, staubach, steved, vmayatsk, yanwang, yugzhang
Target Milestone: rc
Doc Type:         Bug Fix
Last Closed:      2011-02-16 16:04:42 UTC
Bug Blocks:       589293
Description (Jan Tluka, 2009-05-18 15:50:11 UTC):
After several attempts, I haven't been able to reproduce the panic so far. I'm attempting it on a PV xen guest though, so maybe that changes the timing enough that it's not occurring. Jan, you said you can reproduce this pretty much at will? Here's some assembly from the crash site:

    ffffffff801963db: 48 85 c0                test %rax,%rax
    ffffffff801963de: 48 89 02                mov  %rax,(%rdx)
    ffffffff801963e1: 74 04                   je   ffffffff801963e7 <generic_drop_inode+0xd3>
    ffffffff801963e3: 48 89 50 08             mov  %rdx,0x8(%rax)
    ffffffff801963e7: 48 c7 03 00 00 00 00    movq $0x0,(%rbx)
    ffffffff801963ee: 48 c7 43 08 00 00 00    movq $0x0,0x8(%rbx)
    ffffffff801963f5: 00
    ffffffff801963f6: 48 8d 43 10             lea  0x10(%rbx),%rax
    ffffffff801963fa: 48 8b 53 10             mov  0x10(%rbx),%rdx
    ffffffff801963fe: 48 c7 c7 60 c2 3f 80    mov  $0xffffffff803fc260,%rdi
    ffffffff80196405: 48 8b 48 08             mov  0x8(%rax),%rcx
    ffffffff80196409: 48 89 4a 08             mov  %rcx,0x8(%rdx)    <<<<<< CRASH HERE
    ffffffff8019640d: 48 89 11                mov  %rdx,(%rcx)
    ffffffff80196410: 48 89 40 08             mov  %rax,0x8(%rax)
    ffffffff80196414: 48 83 8b 10 02 00 00    orq  $0x10,0x210(%rbx)
    ffffffff8019641b: 10
    ffffffff8019641c: 48 89 43 10             mov  %rax,0x10(%rbx)
    ffffffff80196420: ff 0d 5a 6e 31 00       decl 3239514(%rip)    # ffffffff804ad280 <inodes_stat>
    ffffffff80196426: e8 da ee 17 00          callq ffffffff80315305 <_spin_unlock>
    ffffffff8019642b: 48 83 bb 78 01 00 00    cmpq $0x0,0x178(%rbx)

...so %rdx was bogus at the time of the crash. I think that corresponds with the list_del_init here in generic_forget_inode:

    list_del_init(&inode->i_list);    <<<<< CRASH HERE
    inode->i_state |= I_FREEING;
    inodes_stat.nr_inodes--;

...we may need a core to really determine the cause here. It would be interesting to see whether this inode was on a list at the time. I'll also need to look at what happens during a forced umount -- maybe the kernel is not doing enough to ensure that all of the inode refs are put before tearing down the sb?

Hi Jeff. I was able to reproduce this two times in a row, so I marked it as reproducible at will.
Unfortunately, today I was not able to reproduce it. I did see another panic, however; I'm not sure if it's related, and I'm posting it in another comment.

I hit another oops. You will have to modify the original reproducer runtest.sh and comment out the line containing:

    sleep $(( 2 * 60 ))

(on terminal 1:)
1. make run
   Repeat this step until you see a line similar to:
       ./runtest.sh: line 35: 5030 Terminated ./reproducer /mnt/testarea/nfs_436092/testfile_436092
(on terminal 2:)
2. umount /mnt/testarea/nfs_436092
       umount: /mnt/testarea/nfs_436092: device is busy
       umount: /mnt/testarea/nfs_436092: device is busy
3. ls -l /mnt/testarea/nfs_436092 (hangs)
(on terminal 1:) Ctrl-C
(on terminal 2:)
4. umount /mnt/testarea/nfs_436092 (hangs)
   Output like in step 2 ("device is busy") is NOT shown.
(on terminal 1:)
5. umount -f /mnt/testarea/nfs_436092
       umount2: Invalid argument
       umount: /mnt/testarea/nfs_436092: not mounted
       umount2: Invalid argument
       umount: /mnt/testarea/nfs_436092: not mounted
6. while true; do service nfs restart; done
   (this will hang on startup in a few secs) -- repeat this step after you see the hang at least 6 times (Ctrl-C)
7. service nfs stop
8. reboot

I got the following panic:

    VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
    INIT: Sending processes the KILL signal
    INIT: Pid 4184 [id 1] seems to hang
    INIT: Pid 4185 [id 2] seems to hang
    INIT: Pid 4186 [id 3] seems to hang
    INIT: Pid 4187 [id 4] seems to hang
    INIT: Pid 4188 [id 5] seems to hang
    INIT: Pid 4189 [id 6] seems to hang
    Stopping HAL daemon: [FAILED]
    Stopping system message bus: [ OK ]
    Stopping atd: [ OK ]
    Stopping cups: [ OK ]
    Shutting down xfs: [ OK ]
    Shutting down console mouse services: [ OK ]
    Stopping sshd: [ OK ]
    Shutting down sm-client: [ OK ]
    Shutting down sendmail: [ OK ]
    Shutting down smartd: [FAILED]
    Stopping xinetd: [ OK ]
    Stopping acpi daemon: [ OK ]
    Stopping crond: [ OK ]
    Stopping autofs: [ OK ]
    Shutting down ntpd: [ OK ]
    Stopping NFS statd: [ OK ]
    Stopping irqbalance: [ OK ]
    Stopping portmap: [ OK ]
    Shutting down kernel logger: [ OK ]
    Shutting down system logger: [ OK ]
    Shutting down interface eth0: [ OK ]
    Shutting down loopback interface: [ OK ]
    Stopping pcmcia: unloading Kernel Card Services [ OK ]
    Starting killall: [ OK ]
    Sending all processes the TERM signal...
    Sending all processes the KILL signal...
    RPC: error 5 connecting to server localhost
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    Unable to handle kernel NULL pointer dereference at 0000000000000038
    RIP: <ffffffffa01e42b2>{:nfsd:nfsd_svc+672}
    PML4 7726e067 PGD 0
    Oops: 0000 [1] SMP
    CPU 2
    Modules linked in: nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 sunrpc cpufreq_powersave loop joydev button battery ac ehci_hcd uhci_hcd i5000_edac edac_mc hw_random bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
    Pid: 10268, comm: rpc.nfsd Not tainted 2.6.9-89.ELsmp
    RIP: 0010:[<ffffffffa01e42b2>] <ffffffffa01e42b2>{:nfsd:nfsd_svc+672}
    RSP: 0018:0000010076475e68 EFLAGS: 00010246
    RAX: 00000000fffffffb RBX: 0000000000000000 RCX: 000001000000d000
    RDX: 00000000fffffffb RSI: 00000000fffffffb RDI: 0000000000000000
    RBP: 0000000000000008 R08: 00000000fffffffb R09: 0000ffff80411b20
    R10: ffffffff80321080 R11: ffffffff802af8d5 R12: 00000000fffffffb
    R13: 0000000000000002 R14: 0000007fbfffb470 R15: 0000000000002000
    FS: 0000002a958a5b00(0000) GS:ffffffff80504600(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000
    CR0: 000000008005003b CR2: 0000000000000038 CR3: 00000000024fe000 CR4: 00000000000006e0
    Process rpc.nfsd (pid: 10268, threadinfo 0000010076474000, task 000001007d055030)
    Stack: 0000010076fcf008 0000010076475e98 0000000000000008 ffffffffa01e4d29
           0000010076475e99 0000010076fcf009 0000007fbfff0038 0000007fbfffb470
           0000000000002000 ffffffff8015f36b
    Call Trace: <ffffffffa01e4d29>{:nfsd:write_threads+131} <ffffffff8015f36b>{get_zeroed_page+107}
                <ffffffff8019bdb5>{simple_transaction_get+152} <ffffffffa01e48ee>{:nfsd:nfsctl_transaction_write+78}
                <ffffffff8017c4ce>{vfs_write+207} <ffffffff8017c5b6>{sys_write+69}
                <ffffffff801102f6>{system_call+126}
    Code: 83 7f 38 01 0f 94 c3 e8 6c d9 f4 ff 85 db 74 15 48 c7 05 43
    RIP <ffffffffa01e42b2>{:nfsd:nfsd_svc+672} RSP <0000010076475e68>
    CR2: 0000000000000038
    <0>Kernel panic - not syncing: Oops

Correcting step 3 in comment 4:

3. ls -l /mnt/testarea/nfs_436092 (hangs) (don't kill this yet)
(on terminal 1:) Ctrl-C (kills the rhts test)
(on terminal 2:) Ctrl-C (kills the ls command)

Well, the busy inodes after umount thing might be related. I suspect that the problem may have something to do with that. The actual oops, though, isn't related -- that's a server-side problem. I suggest opening a separate BZ for the server-side issue.

(In reply to comment #7)
> I suggest opening a separate BZ for the server-side issue.

The oops from comment 4 is tracked in bug 501500.

Got it to oops using your reproducer:

    Unable to handle kernel paging request at 0000000000100108
    RIP: <ffffffff80191ad7>{generic_drop_inode+245}
    PML4 1c742067 PGD 0
    Oops: 0002 [1] SMP
    CPU 0
    Modules linked in: nfs nfsd exportfs nfs_acl md5 ipv6 rpcsec_gss_krb5 auth_rpcgss des lockd sunrpc iptable_filter ip_tables xennet dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod scsi_mod
    Pid: 2375, comm: nfsv4-work Not tainted 2.6.9-89.EL.jtltest.60xenU
    RIP: e030:[<ffffffff80191ad7>] <ffffffff80191ad7>{generic_drop_inode+245}
    RSP: e02b:ffffff801af2bdf8 EFLAGS: 00010246
    RAX: ffffff801aecaa60 RBX: ffffff801aecaa50 RCX: 0000000000200200
    RDX: 0000000000100100 RSI: 000000000000004e RDI: ffffffff80345380
    RBP: ffffff801aecaafc R08: ffffffff803db088 R09: ffffffff801580e2
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffff801aecaa50
    R13: ffffff801bc43b00 R14: ffffff801aecaa50 R15: ffffff801d33a840
    FS: 0000002a96a372a0(0000) GS:ffffffff80439b80(0000) knlGS:0000000000000000
    CS: e033 DS: 0000 ES: 0000
    Process nfsv4-work (pid: 2375, threadinfo ffffff801af2a000, task ffffff801b1c47f0)
    Stack: ffffff801d5e7dc0 ffffffffa01c6061 ffffffff801580e2 ffffff801aecaafc
           ffffff801d5e7dc0 ffffff801bc43b00 0000000000000000 ffffffffa01c618b
           ffffff801fc13400 ffffff801ae03500
    Call Trace: <ffffffffa01c6061>{:nfs:nfs4_put_open_state+135} <ffffffff801580e2>{mempool_free_slab+0}
                <ffffffffa01c618b>{:nfs:nfs4_close_state_work+244} <ffffffffa01c6097>{:nfs:nfs4_close_state_work+0}
                <ffffffff80142162>{worker_thread+419} <ffffffff8012c237>{default_wake_function+0}
                <ffffffff8012c287>{__wake_up_common+66} <ffffffff8012c237>{default_wake_function+0}
                <ffffffff80146490>{keventd_create_kthread+0} <ffffffff80141fbf>{worker_thread+0}
                <ffffffff80146490>{keventd_create_kthread+0} <ffffffff80146467>{kthread+200}

...going to see if I can get a core now. The interesting thing is that I tried several times to get it to crash on my test kernels, so this may already be fixed. If it is, it's probably due to the patch for 487699.

I haven't been able to reproduce since the first crash, but I did get this once I tried to unmount after the "reproducer" program finally died:

    VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

It looks like this reproducer path is somehow resulting in an iput not getting done.

Ok, using a variant of the reproducer provided, I can reproduce a busy-inodes-after-umount problem at will. If I wait until the "reproducer" binary dies, then I can unmount the filesystem (no -f flag needed there), and I get a busy inodes after umount message. So there is definitely an inode reference leak in there somewhere. The hard part is tracking it down...

Ok, I am able to reproduce the busy inodes at umount problem at will now, I think. I'm going to proceed and work that issue, since I believe that's probably the root cause here. I modified the reproducer to wait until the "reproducer" program has exited and then to do the unmount (with no -f) immediately after. That causes the "busy inodes at umount" message to pop.

The problem here is that nfs4_close_state was called on the inode as the program was exiting. This queues up an nfs4_close_state_work job, and the close() call returns. At that point there is no active reference to the mount, but there is one to the inode (held by the workqueue job, which does an iput at the end of its work).
I think we need to get a reference to the superblock somehow and hold it during the life of the nfs4_close_state_work job. The usual way to do this is to do a mntget() and hold a reference to the vfsmount instead, but that's not really doable here. At this level all we have is a pointer to an inode -- determining what vfsmount to take the reference on will be ugly.

This is going to be a very difficult problem to solve... The problem is that the close_state operations basically just receive an inode as an arg, and that's not enough information to take a mount reference. One possibility is to change nfs4_close_state to only queue the work to the workqueue if the caller is rpciod. That would alleviate most of these cases and should keep the original deadlock at bay, but I think it's still possible to hit a busy inodes problem even with that.

Actually, there is something else we could do... We could turn nfs4_do_close and nfs4_do_downgrade into async RPC calls (basically, make the workqueue job not wait on their replies). Given that we don't do anything with the return codes from those calls anyway, I think we'd be safe enough to make that change. Then, when unmounting, we could just flush the nfs4-work queue. That should go reasonably quickly and should ensure that all of the open states are closed out before the superblock is torn down. Not a trivial patch, however...

Actually, that won't help -- we still need to wait until the CLOSE or DOWNGRADE call goes over the wire before we can iput, so the main problem will still exist. Flushing the workqueue on umount is probably our only recourse. What we may need to do is turn the nfs4-work thread into a per-mount workqueue. That will still mean a possibly lengthy umount cycle while the RPCs cycle through, but at that point we won't be waiting on RPCs unrelated to the mount being unmounted.

Created attachment 348429 [details]
patch -- fix busy inodes problem
This patchset seems to fix the busy inodes issue for me. It may also help performance when there are a lot of nfs4 mounts (by reducing some serialization).
The only catch here is that umounts have to flush the workqueue before they will proceed. If there are several jobs in it to unresponsive servers, then umounts can take a while, or may even never make progress if they're hard mounts. The good thing is that there shouldn't be nearly as many jobs in the workqueue now since we only queue jobs to it when the calling task is rpciod.
There's not much we can do about that unfortunately other than move to a one-workqueue-per-superblock model. That's doable but I want to think about it a little more.
Please test this patch and let me know if it resolves the issue for you. I've built this patch into my test kernels as well: http://people.redhat.com/jlayton

Hi Jeff, I've tested your test build kernel 2.6.9-89.4.EL.jtltest.62smp. With the patched kernel I cannot see the 'busy inodes' message in the log when using the modified reproducer. With the stock kernel from RHEL4-U8 (-89.EL) I do see these messages.

Created attachment 349963 [details]
busy inodes reproducer

The modified reproducer above was used to check for busy inodes in comment 18.

Created attachment 357449 [details]
patchset 2 -- convert nfsv4-work queue to per-client workqueue

This patchset is based on the last one. It follows a similar approach -- flush the workqueue before unmounting. The big difference is that this one changes the client to spawn a different nfsv4-work queue for each nfs4_client that it spawns. Each individual server address will get its own workqueue. This should fix the bug and minimize disruption to other mounts when a server goes unresponsive. I've added this patchset to my test kernels: http://people.redhat.com/jlayton ...could you test them and confirm whether they also fix the problem?

Hi Jeff, once again sorry for the late reply. I retested with kernel-smp-2.6.9-89.10.EL.jtltest.68.x86_64.rpm and I do not see 'busy inodes' messages. I retested with the RHEL4-U8 stock kernel to check that I'm using the correct steps; with -89.EL I reproduced the 'busy inodes' messages. Here's a detailed reproducer for further use:

1. Take the reproducer from comment 19.
2. When the reproducer outputs 'Shutting down RPC idmapd:' it will stall.
3. From another console run: umount /mnt/testarea/nfs_436092 (or use option -f to force the umount).
4. umount should return with an error because of a busy mount.
5. The test should proceed after the umount.
6. Repeat these steps 2 times and you should get the following message:

    RPC: error 5 connecting to server localhost
    VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

*** Bug 558004 has been marked as a duplicate of this bug. ***

This has also happened on i386 sometimes: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=14573416

Committed in 89.36.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

FYI, this has also happened sometimes on an ia64 machine with kernel-largesmp-2.6.9-89.29.1.EL: https://beaker.engineering.redhat.com/logs/2010/10/218/21875/41402///console.log

The crash message is:

    Unable to handle kernel paging request at virtual address 0000000000200200
    nfsv4-work[10429]: Oops 8804682956800 [1]
    Modules linked in: nls_utf8 cifs nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat loop button uhci_hcd ehci_hcd pciehp e1000 tg3 e100 mii sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod lpfc scsi_transport_fc mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod
    Pid: 10429, CPU 0, comm: nfsv4-work
    psr : 0000101008126010 ifs : 800000000000040a ip : [<a000000100173780>] Not tainted
    ip is at generic_drop_inode+0x260/0x3e0
    unat: 0000000000000000 pfs : 000000000000040a rsc : 0000000000000003
    rnat: 0009804c8a70433f bsps: a000000101901368 pr : 000000000000a641
    ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
    csd : 0000000000000000 ssd : 0000000000000000
    b0 : a0000001001736a0 b6 : a0000001000163e0 b7 : a000000100173520
    f6 : 1003e0000000000000be6 f7 : 1003e8080808080808081
    f8 : 1003e00000000000017b6 f9 : 1003e00000000097cb561
    f10 : 1003e000000002357a3c1 f11 : 1003e44b831eee7285baf
    r1 : a000000100a2fda0 r2 : 000000000000040a r3 : 0000001008126010
    r8 : 0000000000000000 r9 : 00000000000000fd r10 : 0000000000100108
    r11 : 0000000000100100 r12 : e00000010a80fd60 r13 : e00000010a808000
    r14 : 0000000000000000 r15 : 0000000000000000 r16 : 0000000000200200
    r17 : 0000000000000001 r18 : e000000110bb3c48 r19 : e000000110bb3c40
    r20 : a0000001008433fc r21 : 0000000000000000 r22 : 0000000000000090
    r23 : e000000110bb3e20 r24 : 0000000000000000 r25 : 0000000000000004
    r26 : e000000103cd0e10 r27 : 000000000000005e r28 : 000000000000005f
    r29 : 0000000000000010 r30 : 0000000000000090 r31 : ffffffffffffff7f
    Call Trace:
    [<a000000100017200>] show_stack+0x80/0xa0          sp=e00000010a80f8f0 bsp=e00000010a809118
    [<a000000100017b10>] show_regs+0x890/0x8c0         sp=e00000010a80fac0 bsp=e00000010a8090d0
    [<a00000010003f1d0>] die+0x150/0x240               sp=e00000010a80fae0 bsp=e00000010a809090
    [<a000000100065320>] ia64_do_page_fault+0x8e0/0xbe0 sp=e00000010a80fae0 bsp=e00000010a809028
    [<a00000010000f620>] ia64_leave_kernel+0x0/0x260   sp=e00000010a80fb90 bsp=e00000010a809028
    [<a000000100173780>] generic_drop_inode+0x260/0x3e0 sp=e00000010a80fd60 bsp=e00000010a808fd0
    [<a000000100170200>] iput+0x140/0x1e0              sp=e00000010a80fd60 bsp=e00000010a808fa8
    [<a000000200d34d90>] nfs4_put_open_state+0x130/0x200 [nfs] sp=e00000010a80fd60 bsp=e00000010a808f70
    [<a000000200d350d0>] nfs4_close_state_work+0x270/0x480 [nfs] sp=e00000010a80fd60 bsp=e00000010a808f18
    [<a0000001000a7070>] worker_thread+0x430/0x580     sp=e00000010a80fd60 bsp=e00000010a808eb8
    [<a0000001000b25a0>] kthread+0x260/0x2c0           sp=e00000010a80fdd0 bsp=e00000010a808e78
    [<a0000001000190d0>] kernel_thread_helper+0x30/0x60 sp=e00000010a80fe30 bsp=e00000010a808e50
    [<a000000100008c60>] start_kernel_thread+0x20/0x40 sp=e00000010a80fe30 bsp=e00000010a808e50
    Kernel panic - not syncing: Fatal exception
    [-- MARK -- Wed Oct 6 10:05:00 2010]

This can be reproduced on RHEL4-U8 on i386 and x86_64 with the reproducer from comment #21:

    [root@hp-xw9400-01 501335]# uname -a
    Linux hp-xw9400-01.rhts.eng.bos.redhat.com 2.6.9-89.ELsmp #1 SMP Mon Apr 20 10:34:33 EDT 2009 i686 athlon i386
    GNU/Linux

    nfsd: unexporting all filesystems
    RPC: error 5 connecting to server localhost
    RPC: error 5 connecting to server localhost
    nfsd: last server has exited
    nfsd: unexporting all filesystems
    rpciod: active tasks at shutdown?!
    RPC: error 5 connecting to server localhost
    nfsd: last server has exited
    nfsd: unexporting all filesystems
    RPC: error 5 connecting to server localhost
    RPC: error 5 connecting to server localhost
    nfsd: last server has exited
    nfsd: unexporting all filesystems
    RPC: error 5 connecting to server localhost
    RPC: error 5 connecting to server localhost
    VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

Verified on kernel 2.6.9-96.EL on i386 and x86_64 by running the reproducer many times; no problem now. For comment #26 and comment #28, the same testsuite was run on the new kernel and no panic occurs either. Job links:

https://beaker.engineering.redhat.com/jobs/45921
https://beaker.engineering.redhat.com/jobs/45922

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html