Bug 154034 - Deadlock spinning on inode_lock
Summary: Deadlock spinning on inode_lock
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Stephen Tweedie
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-04-06 18:02 UTC by Joel Becker
Modified: 2007-11-30 22:07 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-04-23 02:33:41 UTC
Target Upstream Version:
Embargoed:


Attachments

NMI oops from 2005.04.15 (2.99 KB, text/plain)
2005-04-20 04:13 UTC, Joel Becker
NMI oops on 2005.04.19 (4.50 KB, text/plain)
2005-04-20 04:14 UTC, Joel Becker
sysrq output from hanging the ia64 on 2006.04.15 (240.28 KB, text/plain)
2005-04-20 04:16 UTC, Joel Becker
Serial console sysrq escape patch (1.88 KB, patch)
2005-04-20 08:33 UTC, Stephen Tweedie
sysrq show all CPUs patch (1015 bytes, patch)
2005-04-20 08:34 UTC, Stephen Tweedie
Show all cpus on nmi lockup (x86_64 and i386 only) (2.72 KB, patch)
2005-04-20 15:36 UTC, Stephen Tweedie
NMI oops on amd64 with Ingo's show-all-cpus patch (6.83 KB, text/plain)
2005-04-21 23:54 UTC, Joel Becker

Description Joel Becker 2005-04-06 18:02:38 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050325 Firefox/1.0.2 (Debian package 1.0.2-1)

Description of problem:
        I've run into a serious problem running Oracle on RHEL4: a deadlock on inode_lock.
        I've been running a test workload on RHEL4 and SLES9 SP1 with Oracle and my 2.6 ASMLib driver.  We run 100 processes, and within half an hour to four hours the box is deadlocked.  Stack traces (sysrq-p) on ia64 showed a process spinning forever in ia64_spinlock_contention() coming out of ilookup5().  That's really the inlined ifind(), and the lock it is spinning on is inode_lock.
        Moving the test to amd64, we reproduced the hang, this time seeing the deadlock in igrab().  That stack trace had .text.lock.inode in it directly.  We were pretty sure inode_lock was the culprit; we just weren't sure how the deadlock was being caused, as inode_lock is not exported and is pretty isolated inside inode.c.
        I haven't located the exact deadlock yet, even with some serious looking.  It's a lot easier to reproduce on the SLES9 SP1 kernel (with the i_sb_list changes) than on the SLES9 and RHEL4 kernels.  The RHEL4 kernel is the hardest, often taking 4-6 hours to reproduce.
        For all RHEL4 tests, the kernel was the appropriate 2.6.9-5.0.3.EL kernel.

[Machine Details]
IA64 -
    HP Itanium 2 (rx4600)
    4 x 1.5GHz Itanium 2 processors
    8GB RAM.
    Emulex LP9802 FC to SGI FC array

AMD64 - 
    Sun V20z
    2 x 2.2GHz Opteron 248
    4GB RAM.
    QLogic QLA2312 FC to EMC Clariion


Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-5.0.3.EL

How reproducible:
Always

Steps to Reproduce:
1. Run big Oracle load that I use for stress testing.
2. Watch box hang.

Yes, I'm aware that this isn't the best reproduction description.  However, I've been unable to distill a smaller test case yet.
  

Actual Results:  Machine deadlocked on inode_lock

Expected Results:  It should run fine.

Additional info:

Comment 3 David Lawrence 2005-04-07 16:43:26 UTC
Adding tao to the CC list so the issue will get updates.

Comment 4 Stephen Tweedie 2005-04-12 11:57:24 UTC
Hi Joel,

Do you have a crash dump or at least alt-sysrq backtraces from the deadlocked CPUs?

What filesystem are you using?


Comment 7 Stephen Tweedie 2005-04-12 14:14:53 UTC
Ah, I hadn't noticed the IT ticket belonging to this bug.

Joel, to debug a spinlock deadlock we really need to see what's happening on
_all_ CPUs, not just one.  Repeated alt-sysrq-p may eventually show all the
CPUs; I'll also have a look to see how easy it would be to port rhel3's
alt-sysrq-w "show all CPUs" hotkey to 2.6.


Comment 8 Joel Becker 2005-04-12 20:31:33 UTC
Stephen, don't you think I know this? :-)  I can't get it.  I've tried
alt-sysrq-p till the cows come home, and I've even tried kdb.  On the 4-way ia64
with kdb, kdb will not let me get to two of the CPUs at all.  On the amd64 I
can't get anywhere, as the NMI oops immediately panics the system.  I've never,
ever seen the CPU that is already holding inode_lock.  So I'm trying to explore
what else I can do to get the info, such as instrumenting inode_lock and other
things.  I'm hoping you have some ideas :-)

I'd love to see if sysrq-w would work where kdb does not.

Comment 9 Joel Becker 2005-04-12 23:53:38 UTC
Even better would be a show_stack() from anyone who spins for too long.  But I
don't have the asm chops to do that.

Comment 10 Stephen Tweedie 2005-04-13 09:49:46 UTC
netdump ought to capture the backtrace from each CPU, I think.  Is it possible
to set that up?

Can you reproduce on i386, btw?


Comment 11 Joel Becker 2005-04-13 14:33:37 UTC
Well, I run netconsole during all testing, but I wonder if netdump would be
smarter.  I will try that.

Up until recently, I have not tested on i386.

Comment 16 Joel Becker 2005-04-20 04:13:38 UTC
Created attachment 113384 [details]
NMI oops from 2005.04.15

This is the output caught via netconsole on Friday.  NMI oops.

Comment 17 Joel Becker 2005-04-20 04:14:58 UTC
Created attachment 113385 [details]
NMI oops on 2005.04.19

This is the output of another NMI oops.  Same load, but now with netdump
instead of netconsole.	Note that it still doesn't capture the other CPUs.  No
vmcore either.

This is 6.38.EL, btw.

Comment 18 Joel Becker 2005-04-20 04:16:35 UTC
Created attachment 113386 [details]
sysrq output from hanging the ia64 on 2006.04.15

On the ia64, we have no NMI, so we can use sysrq.  This was before the sysrq-w
patch.	I captured multiple sysrq-p and one sysrq-t.  I can only seem to get
one CPU.

This is 6.39.EL.

Comment 19 Stephen Tweedie 2005-04-20 08:33:00 UTC
Created attachment 113393 [details]
Serial console sysrq escape patch.

I've given Joel a couple of patches to help debug this: they port the RHEL-3
serial-console-sysrq-escape key and alt-sysrq-w "show all CPUs" functionality
to RHEL-4.  Attaching them here for reference.

This is the console escape patch.  It lets you set an escape key (in ASCII) to
trigger alt-sysrq over serial console, rather than the usual "break".

Comment 20 Stephen Tweedie 2005-04-20 08:34:50 UTC
Created attachment 113394 [details]
sysrq show all CPUs patch

This patch adds alt-sysrq-w to the sysrq repertoire, requesting a backtrace
from all CPUs, not just the one that received the sysrq interrupt.
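
For context, a rough sketch of what such a "show all CPUs" sysrq handler can
look like with the 2.6-era sysrq API.  This is illustrative only, not the
attached patch; the key-op name and messages are made up, and an IPI-based dump
is best effort, since it cannot reach a CPU that is spinning with interrupts
disabled:

/* Illustrative sketch only -- not the attached RHEL patch.  Uses 2.6-era
 * kernel APIs; exact handler signatures vary between kernel versions. */
#include <linux/sysrq.h>
#include <linux/smp.h>
#include <linux/kernel.h>

static void showcpus_ipi(void *unused)
{
	dump_stack();		/* backtrace of whatever this CPU is doing */
}

static void sysrq_handle_showcpus(int key, struct pt_regs *pt_regs,
				  struct tty_struct *tty)
{
	dump_stack();		/* the CPU that received the sysrq */
	/* Best effort: smp_call_function() IPIs the other CPUs, but a CPU
	 * spinning with interrupts disabled will never run the callback. */
	smp_call_function(showcpus_ipi, NULL, 0, 0);
}

static struct sysrq_key_op sysrq_showcpus_op = {
	.handler	= sysrq_handle_showcpus,
	.help_msg	= "shoW-all-cpus",
	.action_msg	= "Show backtrace of all active CPUs",
};

/* Registered during init with: register_sysrq_key('w', &sysrq_showcpus_op); */

That IPI caveat is also why the NMI-based approach in comment 22 was worth
trying in addition to sysrq-w.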

Comment 22 Stephen Tweedie 2005-04-20 15:36:36 UTC
Created attachment 113417 [details]
Show all cpus on nmi lockup (x86_64 and i386 only)

One other possibly helpful patch, this one just in from Ingo Molnar:

> yes, this is always how it worked. But e.g. in the
> realtime-preempt patchset i have a feature (on x86 and 
> x64) that will cause NMI backtraces to be printed on all 
> CPUs. I've separated it out into the attached patch
> (it's against 2.6.12-rc2, but should apply to RHEL4
> too), could you give it a try?

The attached patch is against rhel4, and I've checked that it builds at least
on x86_64.

Comment 24 Joel Becker 2005-04-21 04:27:59 UTC
I've got a test running with this patch included now.  I got a hang on the ia64
earlier with the sysrq-w patch, but the system wasn't allowing any sysrq.

Comment 25 Joel Becker 2005-04-21 23:54:40 UTC
Created attachment 113500 [details]
NMI oops on amd64 with Ingo's show-all-cpus patch.

The machine hung with an NMI oops, but despite the inclusion of Ingo's
show-all-cpus patch, the NMI crash only shows CPU0.

Comment 26 Stephen Tweedie 2005-04-22 09:52:41 UTC
Adding Ingo to CC:; he may have observations on why the NMI oopser patch didn't
show all CPUs.


Comment 27 Stephen Tweedie 2005-04-22 10:00:30 UTC
From the last trace in comment #25:

RIP: 0010:[<ffffffff802f938c>] <ffffffff802f938c>{.text.lock.spinlock+0}
Call Trace:<IRQ> <ffffffffa021fd7d>{:oracleasm:asm_end_ioc+170} 
                 <ffffffffa021fe73>{:oracleasm:asm_end_bio_io+22} 
                 <ffffffffa021fe5d>{:oracleasm:asm_end_bio_io+0} 
                 <ffffffff8023f792>{__end_that_request_first+238} 
...
           <EOI> <ffffffffa021f421>{:oracleasm:asmdisk_test+0} 
                 <ffffffff80156c4e>{mempool_free_slab+0} 
                 <ffffffff80156c4e>{mempool_free_slab+0} 
                 <ffffffff80189990>{__iget+11}
                 <ffffffff8018aaeb>{ilookup5+124} 

Umm, spinlock deadlock *inside* an IRQ... with __iget() on the stack too.

Can you work out exactly what spinlock this is in this case, and where it's
being grabbed?  

Backtraces can include a lot of noise, especially when a few functions have
really large stack footprints, but the transition to the IRQ stack is reliable.
And __iget+11 doesn't look like a spinlock to me; that rather suggests it is
indeed the IRQ that is deadlocking, not the iget.  In other words, we're not
just deadlocked in the foreground task with the NMI trapper happening to come
in during a normal IRQ service.

So the next step is to find out just what spinlock is being grabbed at that RIP,
and to check that *everywhere* in your code it is taken and released in an
irq-safe manner.
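
To make the rule concrete, here is a minimal sketch (hypothetical lock and
functions, not code from this bug or from ASMLib): any spinlock that an
interrupt handler can take must be acquired with interrupts disabled in every
other context, otherwise the interrupt can arrive on the same CPU while the
lock is held and spin on it forever.

/* Hedged sketch of the irq-safe locking rule; my_lock and these functions
 * are hypothetical.  Uses 2.6-era interrupt handler signatures. */
#include <linux/spinlock.h>
#include <linux/interrupt.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

/* Process context: must disable interrupts, because the handler below
 * also takes my_lock. */
static void process_context_path(void)
{
	unsigned long flags;

	spin_lock_irqsave(&my_lock, flags);
	/* ... touch data shared with the interrupt handler ... */
	spin_unlock_irqrestore(&my_lock, flags);
}

/* Interrupt context: interrupts are already off on this CPU. */
static irqreturn_t my_irq_handler(int irq, void *dev_id, struct pt_regs *regs)
{
	spin_lock(&my_lock);
	/* ... */
	spin_unlock(&my_lock);
	return IRQ_HANDLED;
}

/* BROKEN variant: a plain spin_lock() in process context.  If the interrupt
 * fires on this CPU while my_lock is held, my_irq_handler() spins on my_lock
 * forever -- a single-CPU deadlock.  The same reasoning extends to any lock
 * (like inode_lock) that can be held when such an interrupt arrives, which is
 * the cross-CPU case analysed in comment 29 below. */
static void broken_process_context_path(void)
{
	spin_lock(&my_lock);
	/* ... */
	spin_unlock(&my_lock);
}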

Comment 28 Joel Becker 2005-04-22 18:02:18 UTC
Short answer: I'm well aware of the irq-safe requirements here, and I've been
over the code twice before checking that.

Long answer: this stack sure makes one wonder, doesn't it?  I looked at that
stack, and checked the spot, and I know what lock it is.  I was pretty sure
about the irq-safeness.  But I'm going to look again, because you never, ever know.

Comment 29 Joel Becker 2005-04-23 02:33:41 UTC
Ok, I'd already determined what spinlock it is.  It's the asmfs_file_info->f_lock
lock (hangs off of filp->f_private, we'll call it afi->f_lock from now on).  The
actual lock location, by the disassembly I did when I first got the crash, is at
line 1032 (source is at
http://oss.oracle.com/projects/oracleasm/src/trunk/kernel/oracleasm.c).  That's
in the completion routine asm_finish_io().  The lock is taken, indeed, as
spin_lock_irqsave().

I did some more checking: that lock is always taken with spin_lock_irq(), except
here in the interrupt, where it is taken with spin_lock_irqsave().

The ilookup5() from the higher stack is from asm_submit_io(), at line 1178. 
There are no locks held there.  In fact, afi->f_lock is dropped three lines
above at 1175.

Neither ilookup5() nor iget5_locked() is called with afi->f_lock held; only
igrab() is.  That I can verify with my own eyes.  Given that igrab() is called
with afi->f_lock held, it would deadlock if afi->f_lock were ever taken while
under inode_lock...

Oh, crap.  The lesson, as always, is that I'm an idiot.  I forgot that
inode_lock isn't IRQ safe.  Good catch.

cpu0                        cpu1
----                        ----
spin_lock(afi->f_lock)      ilookup5()
igrab()                     ifind()
                            spin_lock(inode_lock)
                            IRQ -> asm_finish_io()
                            spin_lock(afi->f_lock)   <- spins: cpu0 holds f_lock
  spin_lock(inode_lock)                              <- spins: cpu1 holds inode_lock

Sorry to bother everyone.  This is going to be a bitch to fix, but it's not the
VFS's fault (unless you think inode_lock should be taken with IRQs disabled, and
I'm not prepared to make that argument).
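
For reference, a minimal sketch of the two paths described above, using the
names from this comment (asm_submit_io, asm_finish_io, afi->f_lock).  The
struct and function bodies are simplified stand-ins, not the actual ASMLib
source:

/* Hedged reconstruction of the deadlock analysed above; bodies are
 * illustrative only. */
#include <linux/fs.h>
#include <linux/spinlock.h>

struct asmfs_file_info {		/* simplified stand-in */
	spinlock_t f_lock;
	/* ... */
};

/* Process context (cpu0 in the diagram): igrab() takes inode_lock with a
 * plain spin_lock(), and here it is called under afi->f_lock. */
static void asm_submit_io_sketch(struct asmfs_file_info *afi,
				 struct inode *inode)
{
	spin_lock_irq(&afi->f_lock);
	igrab(inode);			/* spins on inode_lock if cpu1 holds it */
	spin_unlock_irq(&afi->f_lock);
}

/* I/O completion (IRQ) context (cpu1 in the diagram), which can fire while
 * that CPU already holds inode_lock inside ilookup5()/ifind(): */
static void asm_finish_io_sketch(struct asmfs_file_info *afi)
{
	unsigned long flags;

	/* Spins forever: cpu0 holds afi->f_lock and is itself spinning on
	 * inode_lock, which this CPU holds. */
	spin_lock_irqsave(&afi->f_lock, flags);
	/* ... */
	spin_unlock_irqrestore(&afi->f_lock, flags);
}

Since inode_lock is private to fs/inode.c and is not IRQ safe, the fix has to
land on the driver side, e.g. not calling igrab() while holding a lock that the
completion path also takes.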

Comment 30 Stephen Tweedie 2005-04-25 14:36:55 UTC
OK, glad you found it.

This looks like the sort of thing that would have been found a lot more easily
if you'd been able to get traces from all CPUs more readily.  I'll be proposing
the sysrq-w stuff for RHEL4 U2, but it would be good if we could get the NMI
oopser doing that too.

