Bug 501335 - oops in nfs4_put_open_state

Summary:          oops in nfs4_put_open_state
Product:          Red Hat Enterprise Linux 4
Component:        kernel
Version:          4.8
Hardware:         x86_64
OS:               Linux
Status:           CLOSED ERRATA
Severity:         high
Priority:         low
Reporter:         Jan Tluka <jtluka>
Assignee:         Jeff Layton <jlayton>
QA Contact:       yanfu,wang <yanwang>
CC:               dhoward, jlayton, phan, rwheeler, staubach, steved, vmayatsk, yanwang, yugzhang
Target Milestone: rc
Doc Type:         Bug Fix
Last Closed:      2011-02-16 16:04:42 UTC
Bug Blocks:       589293
Description (Jan Tluka, 2009-05-18 15:50:11 UTC):
After several attempts, I haven't been able to reproduce the panic so far. I'm attempting it on a PV xen guest though, so maybe that changes the timing enough that it's not occurring. Jan, you said you can reproduce this pretty much at will? Here's some assembly from the crash site:

    ffffffff801963db: 48 85 c0                test %rax,%rax
    ffffffff801963de: 48 89 02                mov  %rax,(%rdx)
    ffffffff801963e1: 74 04                   je   ffffffff801963e7 <generic_drop_inode+0xd3>
    ffffffff801963e3: 48 89 50 08             mov  %rdx,0x8(%rax)
    ffffffff801963e7: 48 c7 03 00 00 00 00    movq $0x0,(%rbx)
    ffffffff801963ee: 48 c7 43 08 00 00 00    movq $0x0,0x8(%rbx)
    ffffffff801963f5: 00
    ffffffff801963f6: 48 8d 43 10             lea  0x10(%rbx),%rax
    ffffffff801963fa: 48 8b 53 10             mov  0x10(%rbx),%rdx
    ffffffff801963fe: 48 c7 c7 60 c2 3f 80    mov  $0xffffffff803fc260,%rdi
    ffffffff80196405: 48 8b 48 08             mov  0x8(%rax),%rcx
    ffffffff80196409: 48 89 4a 08             mov  %rcx,0x8(%rdx)    <<<<<< CRASH HERE
    ffffffff8019640d: 48 89 11                mov  %rdx,(%rcx)
    ffffffff80196410: 48 89 40 08             mov  %rax,0x8(%rax)
    ffffffff80196414: 48 83 8b 10 02 00 00    orq  $0x10,0x210(%rbx)
    ffffffff8019641b: 10
    ffffffff8019641c: 48 89 43 10             mov  %rax,0x10(%rbx)
    ffffffff80196420: ff 0d 5a 6e 31 00       decl 3239514(%rip)    # ffffffff804ad280 <inodes_stat>
    ffffffff80196426: e8 da ee 17 00          callq ffffffff80315305 <_spin_unlock>
    ffffffff8019642b: 48 83 bb 78 01 00 00    cmpq $0x0,0x178(%rbx)

...so %rdx was bogus at the time of the crash. I think that corresponds with the list_del_init here in generic_forget_inode:

    list_del_init(&inode->i_list);    <<<<< CRASH HERE
    inode->i_state |= I_FREEING;
    inodes_stat.nr_inodes--;

...we may need a core to really determine the cause here. It would be interesting to see whether this inode was on a list at the time. I'll also need to look at what happens during a forced umount -- maybe the kernel is not doing enough to ensure that all of the inode refs are put before tearing down the sb?

Hi Jeff. I was able to reproduce this two times in a row, so I marked it as reproducible at will.
Unfortunately, today I was not able to reproduce it. I did see another panic, however; I'm not sure if it's related, and I'm posting it in another comment.

I hit another oops. You will have to modify the original reproducer runtest.sh and comment out the line containing:

    sleep $(( 2 * 60 ))

(on terminal 1:)
1. make run
   Repeat this step until you see a line similar to:
       ./runtest.sh: line 35: 5030 Terminated ./reproducer /mnt/testarea/nfs_436092/testfile_436092
(on terminal 2:)
2. umount /mnt/testarea/nfs_436092
       umount: /mnt/testarea/nfs_436092: device is busy
       umount: /mnt/testarea/nfs_436092: device is busy
3. ls -l /mnt/testarea/nfs_436092 (hangs)
(on terminal 1:) Ctrl-C
(on terminal 2:)
4. umount /mnt/testarea/nfs_436092 (hangs)
   Output like in step 2 ("device is busy") is NOT shown.
(on terminal 1:)
5. umount -f /mnt/testarea/nfs_436092
       umount2: Invalid argument
       umount: /mnt/testarea/nfs_436092: not mounted
       umount2: Invalid argument
       umount: /mnt/testarea/nfs_436092: not mounted
6. while true; do service nfs restart; done
   (this will hang on startup in a few secs) -- repeat this step after you see the hang at least 6 times (Ctrl-C)
7. service nfs stop
8. reboot

I got the following panic:

    VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
    INIT: Sending processes the KILL signal
    INIT: Pid 4184 [id 1] seems to hang
    INIT: Pid 4185 [id 2] seems to hang
    INIT: Pid 4186 [id 3] seems to hang
    INIT: Pid 4187 [id 4] seems to hang
    INIT: Pid 4188 [id 5] seems to hang
    INIT: Pid 4189 [id 6] seems to hang
    Stopping HAL daemon: [FAILED]
    Stopping system message bus: [ OK ]
    Stopping atd: [ OK ]
    Stopping cups: [ OK ]
    Shutting down xfs: [ OK ]
    Shutting down console mouse services: [ OK ]
    Stopping sshd: [ OK ]
    Shutting down sm-client: [ OK ]
    Shutting down sendmail: [ OK ]
    Shutting down smartd: [FAILED]
    Stopping xinetd: [ OK ]
    Stopping acpi daemon: [ OK ]
    Stopping crond: [ OK ]
    Stopping autofs: [ OK ]
    Shutting down ntpd: [ OK ]
    Stopping NFS statd: [ OK ]
    Stopping irqbalance: [ OK ]
    Stopping portmap: [ OK ]
    Shutting down kernel logger: [ OK ]
    Shutting down system logger: [ OK ]
    Shutting down interface eth0: [ OK ]
    Shutting down loopback interface: [ OK ]
    Stopping pcmcia: unloading Kernel Card Services [ OK ]
    Starting killall: [ OK ]
    Sending all processes the TERM signal...
    Sending all processes the KILL signal...
    RPC: error 5 connecting to server localhost
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    RPC: error 5 connecting to server localhost
    RPC: failed to contact portmap (errno -5).
    Unable to handle kernel NULL pointer dereference at 0000000000000038
    RIP: <ffffffffa01e42b2>{:nfsd:nfsd_svc+672}
    PML4 7726e067 PGD 0
    Oops: 0000 [1] SMP
    CPU 2
    Modules linked in: nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 sunrpc cpufreq_powersave loop joydev button battery ac ehci_hcd uhci_hcd i5000_edac edac_mc hw_random bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
    Pid: 10268, comm: rpc.nfsd Not tainted 2.6.9-89.ELsmp
    RIP: 0010:[<ffffffffa01e42b2>] <ffffffffa01e42b2>{:nfsd:nfsd_svc+672}
    RSP: 0018:0000010076475e68 EFLAGS: 00010246
    RAX: 00000000fffffffb RBX: 0000000000000000 RCX: 000001000000d000
    RDX: 00000000fffffffb RSI: 00000000fffffffb RDI: 0000000000000000
    RBP: 0000000000000008 R08: 00000000fffffffb R09: 0000ffff80411b20
    R10: ffffffff80321080 R11: ffffffff802af8d5 R12: 00000000fffffffb
    R13: 0000000000000002 R14: 0000007fbfffb470 R15: 0000000000002000
    FS: 0000002a958a5b00(0000) GS:ffffffff80504600(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000
    CR0: 000000008005003b CR2: 0000000000000038 CR3: 00000000024fe000 CR4: 00000000000006e0
    Process rpc.nfsd (pid: 10268, threadinfo 0000010076474000, task 000001007d055030)
    Stack: 0000010076fcf008 0000010076475e98 0000000000000008 ffffffffa01e4d29
           0000010076475e99 0000010076fcf009 0000007fbfff0038 0000007fbfffb470
           0000000000002000 ffffffff8015f36b
    Call Trace: <ffffffffa01e4d29>{:nfsd:write_threads+131} <ffffffff8015f36b>{get_zeroed_page+107}
                <ffffffff8019bdb5>{simple_transaction_get+152} <ffffffffa01e48ee>{:nfsd:nfsctl_transaction_write+78}
                <ffffffff8017c4ce>{vfs_write+207} <ffffffff8017c5b6>{sys_write+69}
                <ffffffff801102f6>{system_call+126}
    Code: 83 7f 38 01 0f 94 c3 e8 6c d9 f4 ff 85 db 74 15 48 c7 05 43
    RIP <ffffffffa01e42b2>{:nfsd:nfsd_svc+672} RSP <0000010076475e68>
    CR2: 0000000000000038
    <0>Kernel panic - not syncing: Oops

Correcting step 3 in comment 4:

3. ls -l /mnt/testarea/nfs_436092 (hangs) (don't kill this yet)
(on terminal 1:) Ctrl-C (kills the rhts test)
(on terminal 2:) Ctrl-C (kills the ls command)

Well, the busy inodes after umount thing might be related. I suspect that the problem may have something to do with that. The actual oops, though, isn't related -- that's a server-side problem. I suggest opening a separate BZ for the server-side issue.

(In reply to comment #7)
> I suggest opening a separate BZ for the server-side issue.

The oops from comment 4 is tracked in bug 501500.

Got it to oops using your reproducer:

    Unable to handle kernel paging request at 0000000000100108
    RIP: <ffffffff80191ad7>{generic_drop_inode+245}
    PML4 1c742067 PGD 0
    Oops: 0002 [1] SMP
    CPU 0
    Modules linked in: nfs nfsd exportfs nfs_acl md5 ipv6 rpcsec_gss_krb5 auth_rpcgss des lockd sunrpc iptable_filter ip_tables xennet dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod scsi_mod
    Pid: 2375, comm: nfsv4-work Not tainted 2.6.9-89.EL.jtltest.60xenU
    RIP: e030:[<ffffffff80191ad7>] <ffffffff80191ad7>{generic_drop_inode+245}
    RSP: e02b:ffffff801af2bdf8 EFLAGS: 00010246
    RAX: ffffff801aecaa60 RBX: ffffff801aecaa50 RCX: 0000000000200200
    RDX: 0000000000100100 RSI: 000000000000004e RDI: ffffffff80345380
    RBP: ffffff801aecaafc R08: ffffffff803db088 R09: ffffffff801580e2
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffff801aecaa50
    R13: ffffff801bc43b00 R14: ffffff801aecaa50 R15: ffffff801d33a840
    FS: 0000002a96a372a0(0000) GS:ffffffff80439b80(0000) knlGS:0000000000000000
    CS: e033 DS: 0000 ES: 0000
    Process nfsv4-work (pid: 2375, threadinfo ffffff801af2a000, task ffffff801b1c47f0)
    Stack: ffffff801d5e7dc0 ffffffffa01c6061 ffffffff801580e2 ffffff801aecaafc
           ffffff801d5e7dc0 ffffff801bc43b00 0000000000000000 ffffffffa01c618b
           ffffff801fc13400 ffffff801ae03500
    Call Trace: <ffffffffa01c6061>{:nfs:nfs4_put_open_state+135} <ffffffff801580e2>{mempool_free_slab+0}
                <ffffffffa01c618b>{:nfs:nfs4_close_state_work+244} <ffffffffa01c6097>{:nfs:nfs4_close_state_work+0}
                <ffffffff80142162>{worker_thread+419} <ffffffff8012c237>{default_wake_function+0}
                <ffffffff8012c287>{__wake_up_common+66} <ffffffff8012c237>{default_wake_function+0}
                <ffffffff80146490>{keventd_create_kthread+0} <ffffffff80141fbf>{worker_thread+0}
                <ffffffff80146490>{keventd_create_kthread+0} <ffffffff80146467>{kthread+200}

...going to see if I can get a core now. The interesting thing is that I tried several times to get it to crash on my test kernels, so this may already be fixed. If it is, it's probably due to the patch for 487699.

I haven't been able to reproduce since the first crash, but I did get this once I tried to unmount after the "reproducer" program finally died:

    VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

It looks like this reproducer path is somehow resulting in an iput not getting done.

Ok, using a variant of the reproducer provided, I can reproduce a busy-inodes-after-umount problem at will. If I wait until the "reproducer" binary dies, then I can unmount the filesystem (no -f flag needed there), and I get a busy inodes after umount message. So there is definitely an inode reference leak in there somewhere. The hard part is tracking it down...

Ok, I am able to reproduce the busy inodes at umount problem at will now, I think. I'm going to proceed and work that issue, since I believe that's probably the root cause here. I modified the reproducer to wait until the "reproducer" program has exited and then to do the unmount (with no -f) immediately after. That causes the "busy inodes at umount" message to pop.

The problem here is that nfs4_close_state was called on the inode as the program was exiting. This queues up an nfs4_close_state_work job, and the close() call returns. At that point there is no active reference to the mount, but there is one to the inode (held by the workqueue job, which does an iput at the end of its work).
I think we need to get a reference to the superblock somehow and hold it during the life of the nfs4_close_state_work job. The usual way to do this is to do a mntget() and hold a reference to the vfsmount instead, but that's not really doable here. At this level all we have is a pointer to an inode -- determining what vfsmount to take the reference on will be ugly.

This is going to be a very difficult problem to solve... The problem is that the close_state operations basically just receive an inode as an arg, and that's not enough information to take a mount reference. One possibility is to change nfs4_close_state to only queue the work to the workqueue if the caller is rpciod. That would alleviate most of these cases and should keep the original deadlock at bay, but I think it's still possible to hit a busy inodes problem even with that.

Actually, there is something else we could do... We could turn nfs4_do_close and nfs4_do_downgrade into async RPC calls (basically, make the workqueue job not wait on their replies). Given that we don't do anything with the return codes from those calls anyway, I think we'd be safe enough to make that change. Then, when unmounting, we could just flush the nfs4-work queue. That should go reasonably quickly and should ensure that all of the open states are closed out before the superblock is torn down. Not a trivial patch, however...

Actually, that won't help -- we still need to wait until the CLOSE or DOWNGRADE call goes over the wire before we can iput, so the main problem will still exist. Flushing the workqueue on umount is probably our only recourse. What we may need to do is turn the nfs4-work thread into a per-mount workqueue. That will still mean a possibly lengthy umount cycle while the RPCs cycle through, but at that point we won't be waiting on RPCs unrelated to the mount being unmounted.

Created attachment 348429 [details]
patch -- fix busy inodes problem
This patchset seems to fix the busy inodes issue for me. It may also help performance when there are a lot of nfs4 mounts (by reducing some serialization).
The only catch here is that umounts have to flush the workqueue before they will proceed. If there are several jobs in it to unresponsive servers, then umounts can take a while, or may even never make progress if they're hard mounts. The good thing is that there shouldn't be nearly as many jobs in the workqueue now since we only queue jobs to it when the calling task is rpciod.
There's not much we can do about that unfortunately other than move to a one-workqueue-per-superblock model. That's doable but I want to think about it a little more.
Please test this patch and let me know if it resolves the issue for you. I've built this patch into my test kernels as well: http://people.redhat.com/jlayton

Hi Jeff, I've tested your test build kernel 2.6.9-89.4.EL.jtltest.62smp. With the patched kernel I cannot see the 'busy inodes' message in the log when using the modified reproducer. With the stock kernel from RHEL4-U8 (-89.EL) I do see these messages.

Created attachment 349963 [details]
busy inodes reproducer

The modified reproducer above was used to check for busy inodes in comment 18.

Created attachment 357449 [details]
patchset 2 -- convert nfsv4-work queue to per-client workqueue

This patchset is based on the last one. It follows a similar approach -- flush the workqueue before unmounting. The big difference is that this one changes the client to spawn a different nfsv4-work queue for each nfs4_client that it spawns. Each individual server address will get its own workqueue. This should fix the bug and minimize disruption to other mounts when a server goes unresponsive. I've added this patchset to my test kernels: http://people.redhat.com/jlayton ...could you test them and confirm whether they also fix the problem?

Hi Jeff, once again sorry for the late reply. I retested with kernel-smp-2.6.9-89.10.EL.jtltest.68.x86_64.rpm and I do not see 'busy inodes' messages. I retested with the RHEL4-U8 stock kernel to check that I'm using the correct steps; with -89.EL I reproduced the 'busy inodes' messages. Here's a detailed reproducer for further use:

1. Take the reproducer from comment 19.
2. When the reproducer outputs 'Shutting down RPC idmapd:' it will stall.
3. From another console run: umount /mnt/testarea/nfs_436092 (or use option -f to force the umount).
4. umount should return with an error because of a busy mount.
5. The test should proceed after the umount.
6. Repeat these steps 2 times and you should get the following message:

    RPC: error 5 connecting to server localhost
    VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

*** Bug 558004 has been marked as a duplicate of this bug. ***

This has also happened on i386 sometimes: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=14573416

Committed in 89.36.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

FYI, this has also happened sometimes on an ia64 machine with kernel-largesmp-2.6.9-89.29.1.EL: https://beaker.engineering.redhat.com/logs/2010/10/218/21875/41402///console.log

The crash message is:

    Unable to handle kernel paging request at virtual address 0000000000200200
    nfsv4-work[10429]: Oops 8804682956800 [1]
    Modules linked in: nls_utf8 cifs nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat loop button uhci_hcd ehci_hcd pciehp e1000 tg3 e100 mii sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod lpfc scsi_transport_fc mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod
    Pid: 10429, CPU 0, comm: nfsv4-work
    psr : 0000101008126010 ifs : 800000000000040a ip : [<a000000100173780>] Not tainted
    ip is at generic_drop_inode+0x260/0x3e0
    unat: 0000000000000000 pfs : 000000000000040a rsc : 0000000000000003
    rnat: 0009804c8a70433f bsps: a000000101901368 pr : 000000000000a641
    ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
    csd : 0000000000000000 ssd : 0000000000000000
    b0 : a0000001001736a0 b6 : a0000001000163e0 b7 : a000000100173520
    f6 : 1003e0000000000000be6 f7 : 1003e8080808080808081
    f8 : 1003e00000000000017b6 f9 : 1003e00000000097cb561
    f10 : 1003e000000002357a3c1 f11 : 1003e44b831eee7285baf
    r1 : a000000100a2fda0 r2 : 000000000000040a r3 : 0000001008126010
    r8 : 0000000000000000 r9 : 00000000000000fd r10 : 0000000000100108
    r11 : 0000000000100100 r12 : e00000010a80fd60 r13 : e00000010a808000
    r14 : 0000000000000000 r15 : 0000000000000000 r16 : 0000000000200200
    r17 : 0000000000000001 r18 : e000000110bb3c48 r19 : e000000110bb3c40
    r20 : a0000001008433fc r21 : 0000000000000000 r22 : 0000000000000090
    r23 : e000000110bb3e20 r24 : 0000000000000000 r25 : 0000000000000004
    r26 : e000000103cd0e10 r27 : 000000000000005e r28 : 000000000000005f
    r29 : 0000000000000010 r30 : 0000000000000090 r31 : ffffffffffffff7f
    Call Trace:
    [<a000000100017200>] show_stack+0x80/0xa0          sp=e00000010a80f8f0 bsp=e00000010a809118
    [<a000000100017b10>] show_regs+0x890/0x8c0         sp=e00000010a80fac0 bsp=e00000010a8090d0
    [<a00000010003f1d0>] die+0x150/0x240               sp=e00000010a80fae0 bsp=e00000010a809090
    [<a000000100065320>] ia64_do_page_fault+0x8e0/0xbe0 sp=e00000010a80fae0 bsp=e00000010a809028
    [<a00000010000f620>] ia64_leave_kernel+0x0/0x260   sp=e00000010a80fb90 bsp=e00000010a809028
    [<a000000100173780>] generic_drop_inode+0x260/0x3e0 sp=e00000010a80fd60 bsp=e00000010a808fd0
    [<a000000100170200>] iput+0x140/0x1e0              sp=e00000010a80fd60 bsp=e00000010a808fa8
    [<a000000200d34d90>] nfs4_put_open_state+0x130/0x200 [nfs] sp=e00000010a80fd60 bsp=e00000010a808f70
    [<a000000200d350d0>] nfs4_close_state_work+0x270/0x480 [nfs] sp=e00000010a80fd60 bsp=e00000010a808f18
    [<a0000001000a7070>] worker_thread+0x430/0x580     sp=e00000010a80fd60 bsp=e00000010a808eb8
    [<a0000001000b25a0>] kthread+0x260/0x2c0           sp=e00000010a80fdd0 bsp=e00000010a808e78
    [<a0000001000190d0>] kernel_thread_helper+0x30/0x60 sp=e00000010a80fe30 bsp=e00000010a808e50
    [<a000000100008c60>] start_kernel_thread+0x20/0x40 sp=e00000010a80fe30 bsp=e00000010a808e50
    Kernel panic - not syncing: Fatal exception
    [-- MARK -- Wed Oct 6 10:05:00 2010]

This can be reproduced on RHEL4-U8 on i386 and x86_64 with the reproducer from comment #21:

    [root@hp-xw9400-01 501335]# uname -a
    Linux hp-xw9400-01.rhts.eng.bos.redhat.com 2.6.9-89.ELsmp #1 SMP Mon Apr 20 10:34:33 EDT 2009 i686 athlon i386
    GNU/Linux

    nfsd: unexporting all filesystems
    RPC: error 5 connecting to server localhost
    RPC: error 5 connecting to server localhost
    nfsd: last server has exited
    nfsd: unexporting all filesystems
    rpciod: active tasks at shutdown?!
    RPC: error 5 connecting to server localhost
    nfsd: last server has exited
    nfsd: unexporting all filesystems
    RPC: error 5 connecting to server localhost
    RPC: error 5 connecting to server localhost
    nfsd: last server has exited
    nfsd: unexporting all filesystems
    RPC: error 5 connecting to server localhost
    RPC: error 5 connecting to server localhost
    VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...

Verified on kernel 2.6.9-96.EL on i386 and x86_64 by running the reproducer many times; no problem now. For comment #26 and comment #28, the same testsuite was run on the new kernel and no panic occurs either. Job links:

https://beaker.engineering.redhat.com/jobs/45921
https://beaker.engineering.redhat.com/jobs/45922

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html