-- Additional comment from jburke on 2007-08-20 19:21 EST --

Jeff, I found out that there was a second issue last week while I was on vacation. Steve D was looking at it. It looks like it may have panicked (but I am not sure). This may or may not be a different issue. Here is a link to the vmcore from when the system had an issue:

ndnc-1.lab.boston.redhat.com:/var/crash/10.12.4.232-2007-08-15-11:36

Thanks,
Jeff

-- Additional comment from jlayton on 2007-08-20 21:35 EST --

This issue looks different, but I suppose there's a chance that they ultimately have the same cause:

kernel BUG at include/asm/spinlock.h:133!

That error means that the spinlock magic was corrupt:

void svc_wake_up(struct svc_serv *serv)
{
        struct svc_rqst *rqstp;

        spin_lock_bh(&serv->sv_lock);

...so the serv->sv_lock was bad (and that likely means that the whole svc_serv struct was bad). The ring buffer in the core didn't have any info before the panic, so it's hard to tell what happened leading up to the crash. Here's the stack:

PID: 3618   TASK: f63deeb0  CPU: 0   COMMAND: "rpciod"
 #0 [caeb6dc0] netpoll_start_netdump at f8a52570
 #1 [caeb6de0] die at c0106045
 #2 [caeb6e14] do_invalid_op at c0106420
 #3 [caeb6ec4] error_code (via invalid_op) at c02d51dd
    EAX: f8aa7c45  EBX: c1afd518  ECX: c02e6fa1  EDX: c02e6fa1  EBP: c28ea808
    DS:  007b      ESI: c1afd500  ES:  007b      EDI: ec2a4a80
    CS:  0060      EIP: c02d33c2  ERR: ffffffff  EFLAGS: 00010216
 #4 [caeb6f00] _spin_lock_bh at c02d33c2
 #5 [caeb6f08] svc_wake_up at f8aa7c40
 #6 [caeb6f14] nlmsvc_grant_callback at f8c41f1d
 #7 [caeb6f2c] __rpc_execute at f8aa50f7
 #8 [caeb6f84] __rpc_schedule at f8aa520c
 #9 [caeb6f90] rpciod_killall at f8aa5ad8
#10 [caeb6f9c] rpciod at f8aa5a52
#11 [caeb6ff0] kernel_thread_helper at c01041f3

rpciod_killall() is in the stack here, which indicates that rpciod was in the process of shutting down at the time.
Looks like lockd was in the process of shutting down too:

crash> bt
PID: 3617   TASK: f63de930  CPU: 1   COMMAND: "lockd"
 #0 [f2c26f14] schedule at c02d25c9
 #1 [f2c26f78] wait_for_completion at c02d26f6
 #2 [f2c26fcc] rpciod_down at f8aa5cca
 #3 [f2c26fd4] lockd at f8c40f31
 #4 [f2c26ff0] kernel_thread_helper at c01041f3

I don't see any NFS filesystems in the mount table, so I'm guessing this happened after unmounting the last NFS filesystem on the box. Is there already a separate BZ for this?

-- Additional comment from jlayton on 2007-08-21 10:30 EST --

Ahh, Jeff pointed out that this is actually an NFS server. So lockd should only be shutting down if nfs is going down. Here's something odd though -- there are 2 lockd threads in the process list:

 3617      1   1  f63de930  UN   0.0       0      0  [lockd]
27857      1   1  c1b46e30  UN   0.0       0      0  [lockd]

The second thread looks like this:

crash> set 27857
    PID: 27857
COMMAND: "lockd"
   TASK: c1b46e30  [THREAD_INFO: f152b000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 27857  TASK: c1b46e30  CPU: 1   COMMAND: "lockd"
 #0 [f152bf20] schedule at c02d25c9
 #1 [f152bf84] __sched_text_start at c02d1bb4
 #2 [f152bfb8] __down_failed at c02d1d2b
 #3 [f152bfc8] .text.lock.sched (via rpciod_up) at f8aa5e92
 #4 [f152bfd4] lockd at f8c40d7f
 #5 [f152bff0] kernel_thread_helper at c01041f3

I think it's blocked on the nlmsvc_sema (which is probably being held by whoever is calling lockd_down). One possibility: lockd was taking an inordinate amount of time to come down, nfs was restarted or something, and a second lockd thread was started...

-- Additional comment from jlayton on 2007-08-21 12:14 EST --

Err... actually, the second lockd thread is blocked on the rpciod_sema from rpciod_up. This makes sense, since rpciod is coming down and the other lockd is holding it in rpciod_down. My guess is that rpciod was taking a long time to come down, and the svc_serv struct that it crashed while accessing was freed and reused before it could get to it.
Either that or we have memory corruption of some sort...
Ok, I think I see what could be the issue here. From lockd():

        /* Exit the RPC thread */
        svc_exit_thread(rqstp);

        /* release rpciod */
        rpciod_down();

...svc_exit_thread does this:

        /* Release the server */
        if (serv)
                svc_destroy(serv);

which eventually kfree's serv. So this codepath seems to pretty much always use the memory after freeing it. This might be reproducible with slab poisoning turned on, presuming you can hit it so that there's an outstanding lock call that needs to be cleaned up when rpciod_down is called. It looks like a recent upstream patch removed the rpciod_up/down calls from lockd, but I don't think that will work for RHEL4.
Created attachment 172425 [details]
untested proposed patch

I think a patch like this would probably fix this particular issue (though it's untested so far, even for compilation). It just separates the freeing of "serv" from everything else and delays it until after rpciod_down returns. The problem is going to be reproducing and testing this, since this race looks like it'll be tricky to hit. The race is (I'm pretty sure) something like this:

lockd
  nlmsvc_retry_blocked
    nlmsvc_grant_blocked
      nlmsvc_async_call

...which queues up nlmsvc_grant_callback for rpciod. Then, before that call actually gets serviced by rpciod, you'll need lockd to begin exiting and call rpciod_down. It's probably reproducible but may take a while.
I'm thinking a reproducer could go something like this:

1) local process locks file on exported filesystem on NFS server
2) nfs client tries to lock same file and blocks waiting for callback from server
3) set up iptables rules to drop traffic between client and server
4) have local process unlock file
5) do service nfs stop on server
6) remove iptables rules to allow communication to continue

The iptables rule would hopefully delay the callback long enough to allow lockd to call svc_exit_thread. We'd also want the server to run a kernel with memory poisoning.
Reproduced using the steps above on a kernel with CONFIG_DEBUG_SLAB (though it actually crashes when you run service nfs stop):

lockd: couldn't shutdown host module!
nfsd: last server has exited
nfsd: unexporting all filesystems
rpciod: active tasks at shutdown?!
RPC: error 5 connecting to server 10.11.231.179
eip: ffffffffa00a7b6c
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at spinlock:119
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs lockd nfs_acl md5 ipv6 rpcsec_gss_krb5 auth_rpcgss des sunrpc xennet dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod scsi_mod
Pid: 1970, comm: rpciod Not tainted 2.6.9-56.EL.bz253754.1xenU
RIP: e030:[<ffffffff80296687>] <ffffffff80296687>{_spin_lock_bh+51}
RSP: e02b:ffffff801ad61da8  EFLAGS: 00010296
RAX: 0000000000000016 RBX: ffffff801b217428 RCX: 00000000000017d2
RDX: 00000000000017d2 RSI: 0000000000000000 RDI: ffffffff8032e960
RBP: ffffff801b2173f8 R08: ffffff800275ba08 R09: ffffff801b217428
R10: 0000000100000000 R11: 0000000000000000 R12: ffffff801b225bb0
R13: 0000000000000000 R14: 00000000fffffff4 R15: ffffff801d30e048
FS:  0000002a958a1b00(0000) GS:ffffffff80421680(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process rpciod (pid: 1970, threadinfo ffffff801ad60000, task ffffff801b2657f0)
Stack: ffffff801b225b90 ffffffffa00a7b6c 0000000000000003 ffffff801b225b90
       ffffff801b224330 ffffffffa013d3ec ffffff801b225ba0 ffffff801d30e048
       0000000000000002 ffffff801b2173f8
Call Trace:<ffffffffa00a7b6c>{:sunrpc:svc_wake_up+16}
       <ffffffffa013d3ec>{:lockd:nlmsvc_grant_callback+375}
       <ffffffffa013c100>{:lockd:lockd+0}
       <ffffffffa00a4ed1>{:sunrpc:__rpc_execute+745}
       <ffffffff8012b0b7>{__wake_up+54}
       <ffffffffa013c100>{:lockd:lockd+0}
       <ffffffffa00a4476>{:sunrpc:__rpc_wake_up_task+368}
       <ffffffffa00a4fa3>{:sunrpc:__rpc_schedule+74}
       <ffffffffa00a52bb>{:sunrpc:rpciod_killall+56}
       <ffffffffa00a558f>{:sunrpc:rpciod+601}
       <ffffffff8012da88>{autoremove_wake_function+0}
       <ffffffff80132e6a>{do_exit+3073}
       <ffffffff8012da88>{autoremove_wake_function+0}
       <ffffffff8010e05a>{child_rip+8}
       <ffffffffa013c100>{:lockd:lockd+0}
       <ffffffffa00a5336>{:sunrpc:rpciod+0}
       <ffffffff8010e052>{child_rip+0}

Code: 0f 0b 87 9a 2a 80 ff ff ff ff 77 00 f0 ff 0b 0f 88 33 03 00
RIP <ffffffff80296687>{_spin_lock_bh+51} RSP <ffffff801ad61da8>
<0>Kernel panic - not syncing: Oops
This may also be an upstream bug. The code for this seems to be quite a bit different now, but I'm thinking it's still possible to hit this problem there. I'll plan to test this on something rawhideish soon...
I don't think the test patch in comment #2 is the correct approach. rpciod_down doesn't guarantee that rpciod actually goes down, so it's possible that even with that patch, nlmsvc_grant_callback would get called after lockd has exited. What's really needed, I think, is a reference counter for the svc_serv struct. We'd increment the counter before the nlmsvc_async_call and decrement it in nlmsvc_grant_callback after the spinlock is unlocked. lockd could check the counter before exiting, and if it's non-zero, call rpciod_wake_up(), schedule(), and loop.
I think I might have reproduced this on rawhide as well, but it took me several tries and I'm not certain it's the exact same problem:

general protection fault: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath video output sbs button dock battery ac floppy rtc_cmos 8139too 8139cp mii sg dm_snapshot dm_zero dm_mirror dm_mod ata_generic ata_piix libata sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Pid: 1564, comm: rpciod/0 Not tainted 2.6.23-0.130.rc3.git5.killattr.1.fc8debug #1
RIP: 0010:[<ffffffff8112e852>]  [<ffffffff8112e852>] _raw_spin_trylock+0x2/0x2a
RSP: 0018:ffff810018a37e38  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b8b RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 6b6b6b6b6b6b6b73
RBP: 6b6b6b6b6b6b6b73 R08: 0000000000000002 R09: 0000000000000000
R10: ffffffff881952a1 R11: 0000000000000005 R12: 0000000000000000
R13: ffff810014cbd248 R14: ffffffff814cdf00 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff813ec000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaab7cc7000 CR3: 000000001f7cc000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rpciod/0 (pid: 1564, threadinfo ffff810018a36000, task ffff810018a34000)
Stack: ffffffff8127301e 000000000000ea60 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b73
       ffffffff881952a1 ffff8100140b48f0 ffff8100140b4800 ffff8100140b4800
       ffff8100140b48f0 ffffffff88190f15 ffffffff88190978 0000000000000000
Call Trace:
 [<ffffffff8127301e>] _spin_lock_bh+0x33/0x58
 [<ffffffff881952a1>] :sunrpc:svc_wake_up+0x2b/0x8b
 [<ffffffff88190f15>] :sunrpc:rpc_async_schedule+0x0/0xc
 [<ffffffff88190978>] :sunrpc:rpc_exit_task+0x33/0x81
 [<ffffffff88190d6b>] :sunrpc:__rpc_execute+0x7d/0x227
 [<ffffffff810466a5>] run_workqueue+0x92/0x15e
 [<ffffffff81047055>] worker_thread+0x0/0xe7
 [<ffffffff81047132>] worker_thread+0xdd/0xe7
 [<ffffffff81049f4a>] autoremove_wake_function+0x0/0x2e
 [<ffffffff81049e08>] kthread+0x47/0x73
 [<ffffffff81272a7b>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff8100cb18>] child_rip+0xa/0x12
 [<ffffffff8100c22c>] restore_args+0x0/0x30
 [<ffffffff81049c91>] kthreadd+0x117/0x13c
 [<ffffffff81049dc1>] kthread+0x0/0x73
 [<ffffffff8100cb0e>] child_rip+0x0/0x12

Code: 87 07 31 d2 85 c0 0f 9f c2 85 d2 74 18 65 8b 04 25 24 00 00
RIP  [<ffffffff8112e852>] _raw_spin_trylock+0x2/0x2a
 RSP <ffff810018a37e38>
Kernel panic - not syncing: Aiee, killing interrupt handler!
I *think* this is the same issue, but it's very hard to tell. The changes to the rpc layer make the stack less clear, but svc_wake_up isn't called from many places -- mainly from NLM functions -- so I think the problem is the same here. The 0x6b pattern in the registers (RBX/RDI/RBP) is the slab-poison byte, which also points at a freed, poisoned object being dereferenced.
Created attachment 191861 [details]
test patch 2

This patch adds a reference count to the svc_serv struct, increments it before doing the async call, and decrements it in the completion function. If the count is non-zero when lockd goes down, it wakes up rpciod and sleeps briefly. The patch seems to work, but I'm not certain it's the right approach: this could keep lockd up indefinitely if the call can't be completed for some reason. A better scheme might be to kill off the async call, but I'm not clear on whether there's a way to actually do that...
Was also able to crash RHEL4 in a similar fashion with the following steps (though it took me a couple of tries -- not sure why...):

1) lock file locally on server
2) lock file on client (and block)
3) service nfs restart on server
4) release lock on server

Stack trace:

eip: ffffffffa00a7b6c
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at spinlock:119
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs lockd(U) nfs_acl md5 ipv6 rpcsec_gss_krb5 auth_rpcgss des sunrpc xennet dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod scsi_mod
Pid: 2132, comm: rpciod Not tainted 2.6.9-58.EL.bz253754.2xenU
RIP: e030:[<ffffffff8029671f>] <ffffffff8029671f>{_spin_lock_bh+51}
RSP: e02b:ffffff801bd19db8  EFLAGS: 00010292
RAX: 0000000000000016 RBX: ffffff801d460130 RCX: 0000000000038bc9
RDX: 0000000000038bc9 RSI: 0000000000000000 RDI: ffffffff8032eae0
RBP: ffffff801d460100 R08: ffffff801e03aa08 R09: ffffff801d460130
R10: 0000000100000000 R11: 0000000000000000 R12: ffffff801d086320
R13: 000000000000000b R14: 000000000000000a R15: ffffff8002a0e2f8
FS:  0000002a958a1b00(0000) GS:ffffffff80421800(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process rpciod (pid: 2132, threadinfo ffffff801bd18000, task ffffff801bc69030)
Stack: ffffff801d086300 ffffffffa00a7b6c ffffff801e03aa08 ffffff801d086300
       ffffff801d0877a8 ffffffffa013d3fa ffffff801d086310 ffffff8002a0e2f8
       0000000000000001 ffffff80030dc490
Call Trace:<ffffffffa00a7b6c>{:sunrpc:svc_wake_up+16}
       <ffffffffa013d3fa>{:lockd:nlmsvc_grant_callback+389}
       <ffffffffa013c100>{:lockd:lockd+0}
       <ffffffffa00a4ed1>{:sunrpc:__rpc_execute+745}
       <ffffffffa013c100>{:lockd:lockd+0}
       <ffffffffa00a4fa3>{:sunrpc:__rpc_schedule+74}
       <ffffffffa00a5463>{:sunrpc:rpciod+301}
       <ffffffff8012dadc>{autoremove_wake_function+0}
       <ffffffff8012dadc>{autoremove_wake_function+0}
       <ffffffff8010e05a>{child_rip+8}
       <ffffffffa013c100>{:lockd:lockd+0}
       <ffffffffa00a5336>{:sunrpc:rpciod+0}
       <ffffffff8010e052>{child_rip+0}
Code: 0f 0b 67 9b 2a 80 ff ff ff ff 77 00 f0 ff 0b 0f 88 33 03 00
RIP <ffffffff8029671f>{_spin_lock_bh+51} RSP <ffffff801bd19db8>
<0>Kernel panic - not syncing: Oops
I played with a new approach on this today... Rather than having the functions call svc_wake_up(block->b_daemon), the patch adds a new lockd_wake_up() function that wakes up whatever the current lockd task is. It seems to prevent the panics I've seen, but the current iteration used a semaphore to do the locking, and that's a no-no in an RPC callback (we don't want sleeping functions holding up rpciod). So I need a new locking scheme for it, but I think that approach might be the best one...
Created attachment 199971 [details]
patch -- add lockd_wake_up() function and call it instead of svc_wake_up()

This patch seems to eliminate the panics. It just changes the svc_wake_up calls to a call that wakes up whatever lockd thread happens to be up (if there is one). I think this is the best thing we can do, but I'm not clear on whether there are any issues with this approach that I'm not considering...
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I've got a patch that I think fixes this, but it may help expose some other problems so I'm not going to propose this for 4.7 after all. We'll have a look again for 4.8...
The problem that I think this might increase exposure to is bug 432867. It's a preexisting bug, and may not be one we can reasonably fix in RHEL without breaking kabi.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
I've backported the patchset that I pushed upstream to fix this. It seems to work for RHEL5, but on RHEL4 I can still make it fail to release a block, leaving a host and call in the queue. There's a difference there of some sort. It seems like the nlm_host being attached to the block is the wrong kind (client rather than server, or vice versa), so we don't end up invalidating the lock.
...added dprintk to RHEL5:

nlmsvc_act_unlock: block->b_host=ffff81000e858b08 host=ffff81000e858b08 block->b_host->h_server=1 host->h_server=1

...similar dprintk for RHEL4 with a similar reproducer:

nlmsvc_traverse_blocks: block->b_host=000001001c7f63c0 host=000001001c7f65c0 block->b_host->h_server=0 host->h_server=1

...so the difference seems to be that the block for RHEL5 is attached to an nlm_host with h_server=1, but RHEL4 attaches it to an h_server=0 host. This makes the lockd shutdown code leave the block on the list.
RHEL4 is missing this patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=686517f1ad1630c11964d668b556aab79b8c942e

...the host attached to the nlm_block is being pulled from the wrong list, so we're not able to properly cancel it when lockd comes down. This patch seems to fix the problem in RHEL4, but I need to look it over a bit more closely and make sure it's OK.
Updating PM score.
Committed in 78.21.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Created attachment 326987 [details]
final patchset -- fix use-after-free problems in lockd

This is the final set of 4 patches.
Patchset is in -89.EL kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html