From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050325 Firefox/1.0.2 (Debian package 1.0.2-1)

Description of problem:
I've run into an important issue running Oracle on RHEL4. Quite important - a deadlock on inode_lock.

I've been running a test workload on RHEL4 and SLES9 SP1 with Oracle and my 2.6 ASMLib driver. We'd run 100 processes, and within half an hour to four hours the box is deadlocked. Stack traces (sysrq-p) on ia64 showed us a process spinning forever in ia64_spinlock_contention() coming out of ilookup5(). That's really the inlined ifind(), and the lock it is spinning on is inode_lock. Moving the test to amd64, we reproduced the hang, this time seeing the deadlock in igrab(); that stack trace had .text.lock.inode in it directly. We were pretty sure inode_lock was the culprit, we just weren't sure how the deadlock was happening, as inode_lock is not exported and is pretty isolated inside inode.c.

I've not located the exact deadlock yet, even with some serious looking. It's a lot easier to reproduce on the SLES9 SP1 kernel (with the i_sb_list changes) than on the SLES9 and RHEL4 kernels. The RHEL4 kernel is the hardest, often taking 4-6 hours to reproduce. For all RHEL4 tests, the kernel was the appropriate 2.6.9-5.0.3.EL kernel.

[Machine Details]
IA64 - HP Itanium 2 (rx4600)
  4 x 1.5GHz Itanium 2 processors, 8GB RAM
  Emulex LP9802 FC to SGI FC array
AMD64 - Sun V20z
  2 x 2.2GHz Opteron 248, 4GB RAM
  QLogic QLA2312 FC to EMC Clariion

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-5.0.3.EL

How reproducible:
Always

Steps to Reproduce:
1. Run the big Oracle load I use for stress testing.
2. Watch the box hang.

Yes, I'm aware that this isn't the best reproduction description. However, I've been unable to distill a smaller test case yet.

Actual Results: Machine deadlocked on inode_lock

Expected Results: It should run fine.

Additional info:
Adding tao to cclist so issue will get updates.
Hi Joel, Do you have a crash dump or at least alt-sysrq backtraces from the deadlocked CPUs? What filesystem are you using?
Ah, I hadn't noticed the IT ticket belonging to this bug. Joel, to debug a spinlock deadlock we really need to see what's happening on _all_ CPUs, not just one. Repeated alt-sysrq-p may eventually show all the CPUs; I'll also have a look to see how easy it would be to port rhel3's alt-sysrq-w "show all CPUs" hotkey to 2.6.
Stephen, don't you think I know this? :-) I can't get it. I've tried alt-sysrq-p till the cows come home, I've even tried kdb. On the 4-way ia64 with kdb, kdb will not allow me to get to two of the cpus at all. On the amd64 I can't get anywhere as the NMI oops immediately panics the system. I've never, ever seen the CPU that is already holding inode_lock. So I'm trying to explore what else I can do to get the info, such as instrumenting inode_lock and other things. I'm hoping you have some ideas :-) I'd love to see if sysrq-w would work where kdb does not.
Even better would be a show_stack() from anyone who spins for too long. But I don't have the asm chops to do that.
netdump ought to capture the backtrace from each CPU, I think. Is it possible to set that up? Can you reproduce on i386, btw?
Well, I run netconsole during all testing, but I wonder if netdump would be smarter. I will try that. Up until recently, I have not tested on i386.
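For reference, a sketch of the two capture setups being discussed, with placeholder addresses throughout (the netconsole parameter format is from the kernel's netconsole documentation; the NETDUMPADDR variable is from memory of the RHEL-4-era netdump tooling, so double-check against the local docs):

```
# netconsole: log kernel messages over UDP to a listening host.
# Format: <local-port>@<local-ip>/<dev>,<remote-port>@<remote-ip>/<remote-mac>
# 192.168.1.10 is the test box, 192.168.1.1 the log host (all placeholders).
#
#   modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.1/00:11:22:33:44:55
#
# netdump: full oops/vmcore capture to a host running netdump-server.
# /etc/sysconfig/netdump on the test box would contain something like:
#
#   NETDUMPADDR=192.168.1.1
#
# followed by "service netdump start" on the test box.
```

The practical difference for this bug: netconsole only relays printk output, while netdump can also push a vmcore to the server, which would let the deadlocked CPUs be inspected after the fact.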
Created attachment 113384 [details] NMI oops from 2005.04.15 This is the output caught via netconsole on Friday: an NMI oops.
Created attachment 113385 [details] NMI oops on 2005.04.19 This is the output of another NMI oops. Same load, but now with netdump instead of netconsole. Note that it still doesn't capture the other CPUs. No vmcore either. This is 6.38.EL, btw.
Created attachment 113386 [details] sysrq output from hanging the ia64 on 2005.04.15 On the ia64 we have no NMI, so we use sysrq. This was before the sysrq-w patch. I captured multiple sysrq-p runs and one sysrq-t. I can only ever seem to get one CPU. This is 6.39.EL.
Created attachment 113393 [details] Serial console sysrq escape patch. I've given Joel a couple of patches to help debug this: they port the RHEL-3 serial-console-sysrq-escape key and alt-sysrq-w "show all CPUs" functionality to RHEL-4. Attaching them here for reference. This is the console escape patch. It lets you set an escape key (in ascii) to trigger alt-sysrq over serial console, rather than the usual "break".
Created attachment 113394 [details] sysrq show all CPUs patch This patch adds alt-sysrq-w to the sysrq repertoire, asking for a CPU backtrace from all CPUs, not just the one that received the sysrq interrupt.
Created attachment 113417 [details] Show all cpus on nmi lockup (x86_64 and i386 only) One other possibly helpful patch, this one just in from Ingo Molnar:

> yes, this is always how it worked. But e.g. in the
> realtime-preempt patchset i have a feature (on x86 and
> x64) that will cause NMI backtraces to be printed on all
> CPUs. I've separated it out into the attached patch
> (it's against 2.6.12-rc2, but should apply to RHEL4
> too), could you give it a try?

The attached patch is against rhel4, and I've checked that it builds at least on x86_64.
I've got a test running with this patch included now. I got a hang on the ia64 earlier with the sysrq-w patch, but the system wasn't allowing any sysrq.
Created attachment 113500 [details] NMI oops on amd64 with Ingo's show-all-cpus patch. The machine hung with an NMI oops, but despite the inclusion of Ingo's show-all-cpus patch, the NMI crash only shows CPU0.
Adding Ingo to CC:; he may have observations on why the NMI oopser patch didn't show all CPUs.
From the last trace in comment #25:

  RIP: 0010:[<ffffffff802f938c>] <ffffffff802f938c>{.text.lock.spinlock+0}
  Call Trace:<IRQ> <ffffffffa021fd7d>{:oracleasm:asm_end_ioc+170}
       <ffffffffa021fe73>{:oracleasm:asm_end_bio_io+22}
       <ffffffffa021fe5d>{:oracleasm:asm_end_bio_io+0}
       <ffffffff8023f792>{__end_that_request_first+238}
       ...
  <EOI> <ffffffffa021f421>{:oracleasm:asmdisk_test+0}
       <ffffffff80156c4e>{mempool_free_slab+0}
       <ffffffff80156c4e>{mempool_free_slab+0}
       <ffffffff80189990>{__iget+11}
       <ffffffff8018aaeb>{ilookup5+124}

Umm, a spinlock deadlock *inside* an IRQ... with __iget() on the stack too. Can you work out exactly which spinlock this is in this case, and where it's being grabbed? Sometimes backtraces can include a lot of noise, especially if you have a few functions with really large stack footprints; but the transition to the IRQ stack is reliable. And __iget+11 doesn't look like a spinlock to me, which rather suggests that it is indeed the IRQ that is deadlocking, not the iget: we're not just deadlocked in the foreground task with the NMI trapper happening to come in during normal IRQ service. So the next step is to find out just what spinlock is being grabbed at that RIP, and check that *everywhere* your code takes and releases it in an irq-safe manner.
Short answer: I'm well aware of the irq-safe requirements here, and I've been over the code twice before checking that. Long answer: this stack sure makes one wonder, doesn't it? I looked at that stack, and checked the spot, and I know what lock it is. I was pretty sure about the irq-safeness. But I'm going to look again, because you never, ever know.
Ok, I'd already determined what spinlock it is. It's the asmfs_file_info->f_lock lock (it hangs off of filp->f_private; we'll call it afi->f_lock from now on). The actual lock location, from the disassembly I did when I first got the crash, is at line 1032 (source is at http://oss.oracle.com/projects/oracleasm/src/trunk/kernel/oracleasm.c). That's in the completion routine asm_finish_io(), and the lock is taken, indeed, as spin_lock_irqsave(). I did some more checking: that lock is always taken with spin_lock_irq(), except here in the interrupt handler, where it is taken with spin_lock_irqsave().

The ilookup5() higher in the stack is from asm_submit_io(), at line 1178. There are no locks held there; in fact, afi->f_lock is dropped three lines above, at 1175. No ilookup5() or iget5_locked() is ever called with afi->f_lock held. Only igrab(). That I can verify with my own eyes.

Given that igrab() is called with afi->f_lock held, it would deadlock if afi->f_lock were ever taken while under inode_lock... Oh, crap. The lesson, as always, is that I'm an idiot. I forgot that inode_lock isn't IRQ safe. Good catch.

  cpu0                          cpu1
  ----                          ----
                                spin_lock(afi->f_lock)
  ilookup5()                    igrab()
    ifind()
      spin_lock(inode_lock)
  IRQ
    spin_lock(afi->f_lock);       spin_lock(inode_lock);

Sorry to bother everyone. This is going to be a bitch to fix, but it's not the VFS' fault (unless you think inode_lock should disable IRQs, and I'm not prepared to make that argument).
OK, glad you found it. This looks like the sort of thing that would have been found a lot more easily if you'd been able to get traces from all CPUs more readily. I'll be proposing the sysrq-w stuff for RHEL4 U2, but it would be good if we could get the NMI oopser doing that too.