Bug 658012 - NMI panic during xfs forced shutdown
Summary: NMI panic during xfs forced shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Dave Chinner
QA Contact: Eryu Guan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-11-29 06:42 UTC by Lachlan McIlroy
Modified: 2018-11-14 16:12 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-21 09:23:37 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
Red Hat Product Errata RHSA-2011:1065 (priority: normal, status: SHIPPED_LIVE): Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update. Last updated 2011-07-21 09:21:37 UTC.

Description Lachlan McIlroy 2010-11-29 06:42:05 UTC
Description of problem:

Customer system panicked with the following:

sd 0:0:11:0: rejecting I/O to device being removed
sd 0:0:11:0: rejecting I/O to device being removed
NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: xfs ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 cxgb3 igb i2c_core cdc_ether usbnet dca sg 8021q pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 17567, comm: xfslogd/1 Not tainted 2.6.18-194.el5 #1
RIP: 0010:[<ffffffff80172fda>]  [<ffffffff80172fda>] vgacon_scroll+0x91/0x23f
RSP: 0018:ffff810365489a80  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff81000905c400 RCX: 0000000000000078
RDX: 000000000000000c RSI: ffff8100000be9c8 RDI: ffff8100090534a8
RBP: 0000000000000000 R08: ffff8100000be9a0 R09: 0000000000000030
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000a0
R13: ffff8100000be9a0 R14: 0000000000000000 R15: 0000000000000032
FS:  0000000000000000(0000) GS:ffff81010c499440(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaabad000 CR3: 0000000363fdc000 CR4: 00000000000006e0
Process xfslogd/1 (pid: 17567, threadinfo ffff810365488000, task ffff81067fc017e0)
Stack:  0000000000000019 ffff81000905c400 0000000000000000 0000000000000000
 0000000000000001 ffffffff801b0906 ffff81000905c400 000000000000000a
 0000000000000000 ffffffff8049b6d4 0000000000000001 ffffffff801b0985
Call Trace:
 [<ffffffff801b0906>] scrup+0x5d/0xc0
 [<ffffffff801b0985>] lf+0x1c/0x3a
 [<ffffffff801b4aae>] vt_console_print+0x15a/0x227
 [<ffffffff8854b560>] :xfs:xfs_buf_iodone_work+0x0/0x6a
 [<ffffffff80092ddf>] __call_console_drivers+0x5b/0x69
 [<ffffffff8001720c>] release_console_sem+0x149/0x20e
 [<ffffffff80093578>] vprintk+0x2b2/0x317
 [<ffffffff8001727d>] release_console_sem+0x1ba/0x20e
 [<ffffffff8009362f>] printk+0x52/0xbd
 [<ffffffff8001a81c>] vsnprintf+0x5e7/0x62f
 [<ffffffff80065b50>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff88552c21>] :xfs:cmn_err+0xc8/0xf3
 [<ffffffff885479fe>] :xfs:xfs_ioerror_alert+0x76/0x83
 [<ffffffff88547b5b>] :xfs:xfs_do_force_shutdown+0x4a/0x13c
 [<ffffffff88535efc>] :xfs:xlog_iodone+0xa0/0xc4
sd 0:0:11:0: rejecting I/O to device being removed
sd 0:0:11:0: rejecting I/O to device being removed
 [<ffffffff88535efc>] :xfs:xlog_iodone+0xa0/0xc4
 [<ffffffff8004dc37>] run_workqueue+0x94/0xe4
 [<ffffffff8004a472>] worker_thread+0x0/0x122
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a562>] worker_thread+0xf0/0x122
 [<ffffffff8008e16d>] default_wake_function+0x0/0xe
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032bdc>] kthread+0xfe/0x132
 [<ffffffff8005efb1>] child_rip+0xa/0x11
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032ade>] kthread+0x0/0x132
 [<ffffffff8005efa7>] child_rip+0x0/0x11

I think the NMI watchdog fired because this process was hogging a spinlock.  There's a spinlock inside cmn_err() protecting the global message buffer.

Version-Release number of selected component (if applicable):
kernel-2.6.18-194.el5 + patch from BZ657166.

How reproducible:
Very frequent.

Steps to Reproduce:
Using the same steps from BZ657166

Comment 1 Lachlan McIlroy 2010-11-29 06:43:58 UTC
There's another process trying to get that same spinlock:

PID: 17566  TASK: ffff81036487a100  CPU: 0   COMMAND: "xfslogd/0"
 #0 [ffffffff8044bf20] crash_nmi_callback at ffffffff8007bf44
 #1 [ffffffff8044bf40] do_nmi at ffffffff8006688a
 #2 [ffffffff8044bf50] nmi at ffffffff80065eef
    [exception RIP: .text.lock.spinlock+12]
    RIP: ffffffff80065c06  RSP: ffff81067c41bcf8  RFLAGS: 00000086
    RAX: 0000000000000297  RBX: ffffffff885546a9  RCX: 0000000000000002
    RDX: ffff81067c41bdf0  RSI: ffffffff885546a9  RDI: ffffffff885706e0
    RBP: 0000000000000001   R8: ffff81038abc1c20   R9: 0000000000000020
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff81038abc1c20
    R13: 0000000000000002  R14: ffff8103642216c0  R15: ffffffff8854b560
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff81067c41bcf8] .text.lock.spinlock at ffffffff80065c06 *
 #4 [ffff81067c41bcf8] cmn_err at ffffffff88552b95
 #5 [ffff81067c41bde8] xfs_buf_iodone_callbacks at ffffffff8851c761
 #6 [ffff81067c41be38] run_workqueue at ffffffff8004dc37
 #7 [ffff81067c41be78] worker_thread at ffffffff8004a562
 #8 [ffff81067c41bee8] kthread at ffffffff80032bdc
 #9 [ffff81067c41bf48] kernel_thread at ffffffff8005efb1

And another process still trying to do writes:

PID: 18393  TASK: ffff81036427e820  CPU: 7   COMMAND: "fio"
 #0 [ffff81038ab04f20] crash_nmi_callback at ffffffff8007bf44
 #1 [ffff81038ab04f40] do_nmi at ffffffff8006688a
 #2 [ffff81038ab04f50] nmi at ffffffff80065eef
    [exception RIP: mpage_writepages+379]
    RIP: ffffffff8001ceae  RSP: ffff8103631cbb38  RFLAGS: 00000202
    RAX: 000000000000086f  RBX: ffff81010be093f8  RCX: 0000000000000034
    RDX: 0007ffffffffffff  RSI: ffff81010be093c0  RDI: ffff81010c4434e8
    RBP: ffff8103631cbc68   R8: ffff81000002c600   R9: ffff81067c4bd9f0
    R10: ffff81010bdaa958  R11: 0000000000000246  R12: ffff8103645acc50
    R13: 0000000000000001  R14: 000000000000000c  R15: ffff8103631cbbd8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff8103631cbb38] mpage_writepages at ffffffff8001ceae
 #4 [ffff8103631cbc50] do_writepages at ffffffff8005b1ea
 #5 [ffff8103631cbc60] __filemap_fdatawrite_range at ffffffff8005000e
 #6 [ffff8103631cbcc0] filemap_write_and_wait at ffffffff800c7325
 #7 [ffff8103631cbce0] xfs_flushinval_pages at ffffffff8854c710
 #8 [ffff8103631cbd10] xfs_write at ffffffff8854fd66
 #9 [ffff8103631cbde0] xfs_file_aio_write at ffffffff8854c590
#10 [ffff8103631cbe00] do_sync_write at ffffffff80018266
#11 [ffff8103631cbf10] vfs_write at ffffffff80016a49
#12 [ffff8103631cbf40] sys_write at ffffffff80017316
#13 [ffff8103631cbf80] tracesys at ffffffff8005e28d (via system_call)
    RIP: 0000003bd9c0d89b  RSP: 000000004295ce00  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: ffffffff8005e28d  RCX: ffffffffffffffff
    RDX: 0000000000100000  RSI: 00002aaaaaaac000  RDI: 0000000000000009
    RBP: 0000000017d7e238   R8: 000000001b91f69c   R9: 000000004cf237aa
    R10: 0000003bd935210c  R11: 0000000000000202  R12: 00002b58a2eb2000
    R13: 000000004295ce40  R14: 0000000017d7e180  R15: 0000000100000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Comment 2 Lachlan McIlroy 2010-11-29 06:45:55 UTC
This is what I think is going on with this panic.

- There's a flood of errors from the scsi device layer like this one: "sd 0:0:11:0: rejecting I/O to device being removed"
- This is because the device has been unplugged and the filesystem is still throwing user data I/O at it.
- The filesystem attempts to write out some log buffers and due to the missing device they also fail.
- This causes xlog_iodone() to call xfs_do_force_shutdown() to initiate a forced shutdown of the filesystem.
- xfs_do_force_shutdown() wants to print an informational message that the filesystem is shutting down.
- This calls cmn_err() (a support routine left over from the IRIX code) which formats its message into a buffer to pass to printk().
- To keep the buffer off the stack it is now a shared global buffer, protected by a spinlock that also disables irqs in case cmn_err() is called from interrupt context.
- Due to the excessive number of messages being logged to the console by the scsi layer, the xfs thread is getting held up in printk() while holding that spinlock.
- This caused the NMI watchdog to think the CPU is stuck (since it cannot receive irqs) and panic the system.
- What usually happens in a forced shutdown scenario is that the xfs mount point is marked with a forced shutdown flag.
- This stops further data I/O and in this case would stop the storm of scsi error console messages and allow the shutdown to proceed.
- But the code that sets this flag is just after the held-up call to cmn_err(), so the system panicked just short of fixing itself.

I think the best way to fix this is to replace the cmn_err() call with a direct call to printk(), as this gets the spinlock and irq disabling out of the way and would prevent an NMI.  Moving the setting of the shutdown flag earlier (before the cmn_err()) may not work if there's still a backlog on the console when cmn_err() calls printk().
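
For reference, the cmn_err() pattern described above looks roughly like the sketch below.  This is a paraphrase of the description, not the verbatim RHEL5 source; the buffer size and the lock/table names are illustrative.

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <stdarg.h>

/* Sketch only - paraphrased from the description above, not the real code. */
static char message[1024];              /* shared global buffer, kept off the stack */
static DEFINE_SPINLOCK(xfs_err_lock);
static const char *err_level[8] = {     /* CE_* level -> KERN_* prefix */
        KERN_EMERG, KERN_ALERT, KERN_CRIT, KERN_ERR,
        KERN_WARNING, KERN_NOTICE, KERN_INFO, KERN_DEBUG,
};

void
cmn_err(int level, const char *fmt, ...)
{
        unsigned long flags;
        va_list ap;

        level &= 7;     /* keep the table index in range */

        /* IRQs are disabled in case cmn_err() is called from interrupt
         * context; the lock protects the shared format buffer. */
        spin_lock_irqsave(&xfs_err_lock, flags);
        va_start(ap, fmt);
        vsnprintf(message, sizeof(message), fmt, ap);
        va_end(ap);

        /* With the console already flooded by the scsi error storm, this
         * printk() can stall for a long time while IRQs are still off,
         * which is what trips the NMI watchdog. */
        printk("%s%s\n", err_level[level], message);

        spin_unlock_irqrestore(&xfs_err_lock, flags);
}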

Comment 3 Dave Chinner 2010-11-30 00:41:58 UTC
Hi Lachlan,

What are all the errors coming out of the scsi layer? If there are huge numbers of identical errors (e.g. all just "sd 0:0:11:0: rejecting I/O to device being removed" errors) then they should probably be rate limited in the scsi layer.  Such a storm could cause problems with error reporting elsewhere in XFS as well as on other filesystems, not to mention pretty much hang any machine with a slow serial console.

As it is, I'm not sure that cmn_err() needs to lock and disable IRQs across the printk call anymore as printk() now disables interrupts and serialises console output itself. I suspect the lock in cmn_err() predated such serialisation, so perhaps the only fix needed for XFS is to remove the lock in cmn_err() altogether.

However, I'm not sure that will fix the problem of the SCSI error messages flooding the system and causing problems for threads trying to run printk. What do you think?

Cheers,

Dave.
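
For illustration, rate limiting of the sort suggested above could be as small as the fragment below.  This is only a sketch, not a proposed patch; it assumes the message is emitted via sdev_printk() at its call site in the scsi midlayer, and uses the generic printk_ratelimit() helper available in 2.6.18-era kernels.

        /* Sketch only: throttle the per-device rejection message so a
         * hot-unplugged device cannot flood the console.  'sdev' is the
         * struct scsi_device whose I/O is being rejected. */
        if (printk_ratelimit())
                sdev_printk(KERN_ERR, sdev,
                            "rejecting I/O to device being removed\n");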

Comment 4 Lachlan McIlroy 2010-11-30 01:33:54 UTC
(In reply to comment #3)
> Hi Lachlan,
> 
> What are all the errors coming out of the scsi layer? If there are huge numbers
> of identical errors, (e.g. all just "sd 0:0:11:0: rejecting I/O to device being
> removed" errors) then they should probably be rate limited in the scsi layer.

That's a good idea.  The errors in this case were all the same (as above) but there's a chance they could be different I suppose.
 
> Such a storm could cause problems with error reporting elsewhere in XFS as well
> as on other filesystems, not to mention pretty much hang any machine with a
> slow serial console.

Yes, and conversely such storms could come from somewhere else too, so just fixing the scsi errors may still leave XFS vulnerable.

> 
> As it is, I'm not sure that cmn_err() needs to lock and disable IRQs across the
> printk call anymore as printk() now disables interrupts and serialises console
> output itself. I suspect the lock in cmn_err() predated such serialisation, so
> perhaps the only fix needed for XFS is to remove the lock in cmn_err()
> altogether.

Just removing the lock is not a good idea - it's used to protect a global message buffer that cmn_err() uses to format its message before passing it to printk().  I believe the buffer used to be allocated on the stack, which didn't need locking, but now it's global.  I don't like the idea of putting it back on the stack - especially when printk() allocates its buffer on the stack too.  If we can be certain that cmn_err() isn't used in interrupt context then we could change the spin_lock_irqsave() to spin_lock() but I think that's a bit risky.

The cmn_err() stuff looks like a glue routine left over from IRIX and the only extra useful feature it offers is that it will trigger a panic on use of CE_PANIC - other than that it's just unnecessary overhead.

I think we should remove all uses of cmn_err() and replace them with printk() - a macro could make that relatively painless.  Or replace the guts of cmn_err() with a call to vprintk().  That still leaves the CE_* -> KERN_* translation to deal with, though.

> 
> However, I'm not sure that will fix the problem of the SCSI error messages
> flooding the system and causing problems for threads trying to run printk,
> though. What do you think?

If there is a flood on the console then printk() will still get held up.  As long as we avoid disabling irqs it won't cause an NMI and the system can eventually recover.  I think that's the best result we can expect.
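
A rough sketch of the "replace the guts with vprintk()" variant mentioned above, with an illustrative CE_*-indexed prefix table.  This is not the patch that was eventually posted, and the table indices assume the existing CE_* values run 0..7, which would need checking against debug.h.

#include <linux/kernel.h>
#include <stdarg.h>

/* Illustrative CE_* -> KERN_* prefix table, sketch only. */
static const char *err_level[8] = {
        KERN_EMERG, KERN_ALERT, KERN_CRIT, KERN_ERR,
        KERN_WARNING, KERN_NOTICE, KERN_INFO, KERN_DEBUG,
};

void
cmn_err(int level, const char *fmt, ...)
{
        va_list ap;

        level &= 7;

        /* No shared buffer and no spinlock: printk()/vprintk() serialise
         * console output themselves, so nothing here runs with IRQs off.
         * Downside: another CPU's printk can interleave between the two
         * calls below, but that is only cosmetic. */
        printk("%s", err_level[level]);
        va_start(ap, fmt);
        vprintk(fmt, ap);
        va_end(ap);

        if (level == 0)         /* CE_PANIC: keep the panic-on-use behaviour */
                panic("cmn_err: CE_PANIC");
}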

Comment 5 Dave Chinner 2010-11-30 04:10:51 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > Such a storm could cause problems with error reporting elsewhere in XFS as well
> > as on other filesystems, not to mention pretty much hang any machine with a
> > slow serial console.
> 
> Yes and conversely such storms could come from somewhere else too so just
> fixing the scsi errors may leave XFS still vulnerable.

True, though in general spamming the logs without rate limiting is considered bad form...

> > As it is, I'm not sure that cmn_err() needs to lock and disable IRQs across the
> > printk call anymore as printk() now disables interrupts and serialises console
> > output itself. I suspect the lock in cmn_err() predated such serialisation, so
> > perhaps the only fix needed for XFS is to remove the lock in cmn_err()
> > altogether.
> 
> Just removing the lock is not a good idea - it's used to protect a global
> message buffer that cmn_err() uses to format it's message before passing it to
> printk().  I believe the buffer used to be allocated on the stack which didn't
> need locking but now it's global.  I don't like the idea of putting it back on
> the stack - especially when printk() allocates it's buffer on the stack too. 
> If we can be certain that cmn_err() isn't used in interrupt context then we
> could change the spin_lock_irqsave() to spin_lock() but I think that's a bit
> risky.

Agreed.

> The cmn_err() stuff looks like a glue routine left over from IRIX and the only
> extra useful feature it offers is that it will trigger a panic on use of
> CE_PANIC - other than that it's just unnecessary overhead.
> 
> I think we should remove all uses of cmn_err() and replace it with printk() - a
> macro could make that relatively painless.  Or replace the guts of cmn_err()
> with a call to vprintk().  That still leaves the CE_* -> KERN_* translation to
> deal with though.

*nod*

If we're going to touch this code, we may as well try to call directly into vprintk(). I think I might be able to do it all with some extra special preprocessor hackery but I need to think about it a bit - I'll see what I can come up with....

Cheers,

Dave.

Comment 6 Lachlan McIlroy 2010-11-30 04:45:50 UTC
Thanks Dave.

Would something like this work as a blanket change?

#define CE_blah KERN_blah
#define cmn_err(a,b, ...) printk(a b, __VA_ARGS__)

Comment 7 Dave Chinner 2010-11-30 05:11:27 UTC
(In reply to comment #6)
> Thanks Dave.
> 
> Would something like this work as a blanket change?
> 
> #define CE_blah KERN_blah
> #define cmn_err(a,b, ...) printk(a b, __VA_ARGS__)

Yes, that's the sort of transformation I was thinking of as the core of the change, but it's not quite that simple - there are places where we do stuff based on CE_blah being integers, e.g. in xfs_cmn_err() where we check the panic mask and change the error level based on both the error level and whether the error type matches the panic mask or not.

That complicates the solution a bit, so a pure cmn_err macro solution would probably end up more like:

#define cmn_err(lvl, fmt, ...)  do { \
    switch (lvl) { \
    case CE_blah:  printk(KERN_blah fmt, __VA_ARGS__); break; \
....

However, that is going to add a fair bit of bloat to the binary because there's a couple of hundred cmn_err() call sites, so it's not really a good solution.

I think with a bit of massaging we can clean up the code that assumes CE_blah are integers first, and then applying your simple blanket macro to cmn_err()
will do the rest of the conversion. Note that we also need to deal with icmn_err(), though it is only called in a couple of places and no longer exists upstream....

Cheers,

Dave.
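
For what it's worth, once the integer assumptions are cleaned up, the blanket macro could end up looking something like the sketch below.  It uses the gcc ##__VA_ARGS__ extension so call sites with no varargs still compile, and it only works where the format is a string literal (which covers the normal cmn_err() call sites).  This is a sketch of the idea, not the patch that was posted.

/* Sketch only: redefine the CE_* levels as KERN_* string literals and let
 * the preprocessor paste them onto the format string at each call site. */
#define CE_DEBUG        KERN_DEBUG
#define CE_CONT         KERN_INFO
#define CE_NOTE         KERN_NOTICE
#define CE_WARN         KERN_WARNING
#define CE_ALERT        KERN_ALERT
#define CE_PANIC        KERN_EMERG

/* ##__VA_ARGS__ swallows the trailing comma when no varargs are passed. */
#define cmn_err(lvl, fmt, ...)  printk(lvl fmt, ##__VA_ARGS__)

/* CE_PANIC call sites would still need the old "panic after printing"
 * behaviour handled separately, e.g. with an explicit BUG() or panic(). */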

Comment 8 Dave Chinner 2010-12-17 06:15:27 UTC
Patch posted on 16/12/10.

Comment 9 RHEL Program Management 2011-02-01 16:59:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 Jarod Wilson 2011-03-10 16:42:29 UTC
in kernel-2.6.18-247.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 18 Chao Ye 2011-06-15 05:19:55 UTC
Confirmed the patch is in the kernel git tree.

Comment 20 errata-xmlrpc 2011-07-21 09:23:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

