Description of problem:

Customer system panicked with the following:

sd 0:0:11:0: rejecting I/O to device being removed
sd 0:0:11:0: rejecting I/O to device being removed
NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: xfs ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap
bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table loop dm_mirror
dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi
button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801
cxgb3 igb i2c_core cdc_ether usbnet dca sg 8021q pcspkr dm_raid45 dm_message
dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata shpchp mptsas
mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd
ehci_hcd
Pid: 17567, comm: xfslogd/1 Not tainted 2.6.18-194.el5 #1
RIP: 0010:[<ffffffff80172fda>]  [<ffffffff80172fda>] vgacon_scroll+0x91/0x23f
RSP: 0018:ffff810365489a80  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff81000905c400 RCX: 0000000000000078
RDX: 000000000000000c RSI: ffff8100000be9c8 RDI: ffff8100090534a8
RBP: 0000000000000000 R08: ffff8100000be9a0 R09: 0000000000000030
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000a0
R13: ffff8100000be9a0 R14: 0000000000000000 R15: 0000000000000032
FS:  0000000000000000(0000) GS:ffff81010c499440(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaabad000 CR3: 0000000363fdc000 CR4: 00000000000006e0
Process xfslogd/1 (pid: 17567, threadinfo ffff810365488000, task ffff81067fc017e0)
Stack:  0000000000000019 ffff81000905c400 0000000000000000 0000000000000000
 0000000000000001 ffffffff801b0906 ffff81000905c400 000000000000000a
 0000000000000000 ffffffff8049b6d4 0000000000000001 ffffffff801b0985
Call Trace:
 [<ffffffff801b0906>] scrup+0x5d/0xc0
 [<ffffffff801b0985>] lf+0x1c/0x3a
 [<ffffffff801b4aae>] vt_console_print+0x15a/0x227
 [<ffffffff8854b560>] :xfs:xfs_buf_iodone_work+0x0/0x6a
 [<ffffffff80092ddf>] __call_console_drivers+0x5b/0x69
 [<ffffffff8001720c>] release_console_sem+0x149/0x20e
 [<ffffffff80093578>] vprintk+0x2b2/0x317
 [<ffffffff8001727d>] release_console_sem+0x1ba/0x20e
 [<ffffffff8009362f>] printk+0x52/0xbd
 [<ffffffff8001a81c>] vsnprintf+0x5e7/0x62f
 [<ffffffff80065b50>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff88552c21>] :xfs:cmn_err+0xc8/0xf3
 [<ffffffff885479fe>] :xfs:xfs_ioerror_alert+0x76/0x83
 [<ffffffff88547b5b>] :xfs:xfs_do_force_shutdown+0x4a/0x13c
 [<ffffffff88535efc>] :xfs:xlog_iodone+0xa0/0xc4
sd 0:0:11:0: rejecting I/O to device being removed
sd 0:0:11:0: rejecting I/O to device being removed
 [<ffffffff88535efc>] :xfs:xlog_iodone+0xa0/0xc4
 [<ffffffff8004dc37>] run_workqueue+0x94/0xe4
 [<ffffffff8004a472>] worker_thread+0x0/0x122
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a562>] worker_thread+0xf0/0x122
 [<ffffffff8008e16d>] default_wake_function+0x0/0xe
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032bdc>] kthread+0xfe/0x132
 [<ffffffff8005efb1>] child_rip+0xa/0x11
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032ade>] kthread+0x0/0x132
 [<ffffffff8005efa7>] child_rip+0x0/0x11

I think the NMI watchdog fired because this process was hogging a spinlock: there's a spinlock inside cmn_err() protecting the global message buffer.

Version-Release number of selected component (if applicable):
kernel-2.6.18-194.el5 + patch from BZ657166.

How reproducible:
Very frequent.

Steps to Reproduce:
Using the same steps from BZ657166
There's another process trying to get that same spinlock:

PID: 17566  TASK: ffff81036487a100  CPU: 0   COMMAND: "xfslogd/0"
 #0 [ffffffff8044bf20] crash_nmi_callback at ffffffff8007bf44
 #1 [ffffffff8044bf40] do_nmi at ffffffff8006688a
 #2 [ffffffff8044bf50] nmi at ffffffff80065eef
    [exception RIP: .text.lock.spinlock+12]
    RIP: ffffffff80065c06  RSP: ffff81067c41bcf8  RFLAGS: 00000086
    RAX: 0000000000000297  RBX: ffffffff885546a9  RCX: 0000000000000002
    RDX: ffff81067c41bdf0  RSI: ffffffff885546a9  RDI: ffffffff885706e0
    RBP: 0000000000000001   R8: ffff81038abc1c20   R9: 0000000000000020
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff81038abc1c20
    R13: 0000000000000002  R14: ffff8103642216c0  R15: ffffffff8854b560
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff81067c41bcf8] .text.lock.spinlock at ffffffff80065c06
 * #4 [ffff81067c41bcf8] cmn_err at ffffffff88552b95
 #5 [ffff81067c41bde8] xfs_buf_iodone_callbacks at ffffffff8851c761
 #6 [ffff81067c41be38] run_workqueue at ffffffff8004dc37
 #7 [ffff81067c41be78] worker_thread at ffffffff8004a562
 #8 [ffff81067c41bee8] kthread at ffffffff80032bdc
 #9 [ffff81067c41bf48] kernel_thread at ffffffff8005efb1

And another process still trying to do writes:

PID: 18393  TASK: ffff81036427e820  CPU: 7   COMMAND: "fio"
 #0 [ffff81038ab04f20] crash_nmi_callback at ffffffff8007bf44
 #1 [ffff81038ab04f40] do_nmi at ffffffff8006688a
 #2 [ffff81038ab04f50] nmi at ffffffff80065eef
    [exception RIP: mpage_writepages+379]
    RIP: ffffffff8001ceae  RSP: ffff8103631cbb38  RFLAGS: 00000202
    RAX: 000000000000086f  RBX: ffff81010be093f8  RCX: 0000000000000034
    RDX: 0007ffffffffffff  RSI: ffff81010be093c0  RDI: ffff81010c4434e8
    RBP: ffff8103631cbc68   R8: ffff81000002c600   R9: ffff81067c4bd9f0
    R10: ffff81010bdaa958  R11: 0000000000000246  R12: ffff8103645acc50
    R13: 0000000000000001  R14: 000000000000000c  R15: ffff8103631cbbd8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #3 [ffff8103631cbb38] mpage_writepages at ffffffff8001ceae
 #4 [ffff8103631cbc50] do_writepages at ffffffff8005b1ea
 #5 [ffff8103631cbc60] __filemap_fdatawrite_range at ffffffff8005000e
 #6 [ffff8103631cbcc0] filemap_write_and_wait at ffffffff800c7325
 #7 [ffff8103631cbce0] xfs_flushinval_pages at ffffffff8854c710
 #8 [ffff8103631cbd10] xfs_write at ffffffff8854fd66
 #9 [ffff8103631cbde0] xfs_file_aio_write at ffffffff8854c590
#10 [ffff8103631cbe00] do_sync_write at ffffffff80018266
#11 [ffff8103631cbf10] vfs_write at ffffffff80016a49
#12 [ffff8103631cbf40] sys_write at ffffffff80017316
#13 [ffff8103631cbf80] tracesys at ffffffff8005e28d (via system_call)
    RIP: 0000003bd9c0d89b  RSP: 000000004295ce00  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: ffffffff8005e28d  RCX: ffffffffffffffff
    RDX: 0000000000100000  RSI: 00002aaaaaaac000  RDI: 0000000000000009
    RBP: 0000000017d7e238   R8: 000000001b91f69c   R9: 000000004cf237aa
    R10: 0000003bd935210c  R11: 0000000000000202  R12: 00002b58a2eb2000
    R13: 000000004295ce40  R14: 0000000017d7e180  R15: 0000000100000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
This is what I think is going on with this panic.

- There's a flood of errors from the scsi device layer like this one: "sd 0:0:11:0: rejecting I/O to device being removed"
- This is because the device has been unplugged and the filesystem is still throwing user data I/O at it.
- The filesystem attempts to write out some log buffers and, due to the missing device, they also fail.
- This causes xlog_iodone() to call xfs_do_force_shutdown() to initiate a forced shutdown of the filesystem.
- xfs_do_force_shutdown() wants to print an informational message that the filesystem is shutting down.
- This calls cmn_err() (a support routine left over from the IRIX code) which formats its message into a buffer to pass to printk().
- In order to keep the buffer off the stack it's now a shared global buffer protected by a spinlock that also disables irqs in case cmn_err() is called from interrupt context.
- Due to the excessive number of messages being logged to the console by the scsi layer, the xfs thread is getting held up in printk() while holding that spinlock.
- This has caused the NMI watchdog to think the CPU is stuck (since it cannot receive irqs) and panic the system.
- What usually happens in a forced shutdown scenario is that the xfs mount point is marked with a forced-shutdown flag.
- This stops further data I/O and, in this case, would stop the storm of scsi error console messages and allow the shutdown to proceed.
- But the code that sets this flag is just after the call to cmn_err() that's held up, so the system panicked just short of fixing itself.

I think the best way to fix this is to replace the cmn_err() call with a direct call to printk() (see the sketch below), as this gets the spinlock and irq disabling out of the way and would prevent an NMI. Moving the setting of the shutdown flag earlier (before the cmn_err()) may not work if there's still a backlog on the console when cmn_err() calls printk().
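To make the proposed change concrete, here is a minimal sketch of the idea. The helper name and message text are hypothetical stand-ins, not the actual RHEL5 xfs_do_force_shutdown() source:

#include <linux/kernel.h>

/*
 * Illustrative only: a stand-in for the message emitted during a
 * forced shutdown.
 */
static void example_shutdown_message(const char *fsname)
{
	/*
	 * Before: cmn_err() formats into a shared global buffer under
	 * spin_lock_irqsave() and then sits in printk() behind the scsi
	 * error storm with irqs disabled -- which is what trips the NMI
	 * watchdog:
	 *
	 *	cmn_err(CE_ALERT, "Filesystem \"%s\": I/O error - "
	 *		"shutting down filesystem", fsname);
	 *
	 * After: a direct printk() takes no XFS lock and leaves irqs
	 * enabled, so a console backlog only delays the message.
	 */
	printk(KERN_ALERT "Filesystem \"%s\": I/O error - shutting down filesystem\n",
	       fsname);
}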
Hi Lachlan,

What are all the errors coming out of the scsi layer? If there are huge numbers of identical errors (e.g. all just "sd 0:0:11:0: rejecting I/O to device being removed" errors) then they should probably be rate limited in the scsi layer. Such a storm could cause problems with error reporting elsewhere in XFS as well as on other filesystems, not to mention pretty much hang any machine with a slow serial console.

As it is, I'm not sure that cmn_err() needs to lock and disable IRQs across the printk call anymore, as printk() now disables interrupts and serialises console output itself. I suspect the lock in cmn_err() predated such serialisation, so perhaps the only fix needed for XFS is to remove the lock in cmn_err() altogether.

However, I'm not sure that will fix the problem of the SCSI error messages flooding the system and causing problems for threads trying to run printk. What do you think?

Cheers,
Dave.
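As an illustration of the kind of rate limiting being suggested (a sketch only: the real message comes from the scsi midlayer, and the helper below is hypothetical):

#include <linux/kernel.h>

/*
 * Hypothetical helper: printk_ratelimit() suppresses repeats of an
 * identical error message once the default rate limit is exceeded,
 * so a storm of them cannot monopolise the console.
 */
static void example_reject_io_message(const char *devname)
{
	if (printk_ratelimit())
		printk(KERN_ERR "%s: rejecting I/O to device being removed\n",
		       devname);
}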
(In reply to comment #3)
> Hi Lachlan,
>
> What are all the errors coming out of the scsi layer? If there are huge numbers
> of identical errors (e.g. all just "sd 0:0:11:0: rejecting I/O to device being
> removed" errors) then they should probably be rate limited in the scsi layer.

That's a good idea. The errors in this case were all the same (as above) but there's a chance they could be different, I suppose.

> Such a storm could cause problems with error reporting elsewhere in XFS as well
> as on other filesystems, not to mention pretty much hang any machine with a
> slow serial console.

Yes, and conversely such storms could come from somewhere else too, so just fixing the scsi errors may leave XFS still vulnerable.

> As it is, I'm not sure that cmn_err() needs to lock and disable IRQs across the
> printk call anymore, as printk() now disables interrupts and serialises console
> output itself. I suspect the lock in cmn_err() predated such serialisation, so
> perhaps the only fix needed for XFS is to remove the lock in cmn_err()
> altogether.

Just removing the lock is not a good idea - it's used to protect a global message buffer that cmn_err() uses to format its message before passing it to printk(). I believe the buffer used to be allocated on the stack, which didn't need locking, but now it's global. I don't like the idea of putting it back on the stack - especially when printk() allocates its buffer on the stack too. If we can be certain that cmn_err() isn't used in interrupt context then we could change the spin_lock_irqsave() to spin_lock(), but I think that's a bit risky.

The cmn_err() stuff looks like a glue routine left over from IRIX and the only extra useful feature it offers is that it will trigger a panic on use of CE_PANIC - other than that it's just unnecessary overhead.

I think we should remove all uses of cmn_err() and replace it with printk() - a macro could make that relatively painless. Or replace the guts of cmn_err() with a call to vprintk(). That still leaves the CE_* -> KERN_* translation to deal with, though.

> However, I'm not sure that will fix the problem of the SCSI error messages
> flooding the system and causing problems for threads trying to run printk.
> What do you think?

If there is a flood on the console then printk() will still get held up. As long as we avoid disabling irqs then it won't cause an NMI and the system can eventually recover. I think that's the best result we can expect.
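A rough sketch of the "replace the guts with vprintk()" option, including the CE_* -> KERN_* translation. The table contents and the panic handling are illustrative rather than a finished patch, it assumes the existing XFS CE_* integer levels, and note that the separate printk()/vprintk() calls could interleave with output from another CPU, unlike a single call:

#include <linux/kernel.h>

/* Assumes the existing XFS CE_* levels (small integers) as indices. */
static const char * const ce_to_kern[] = {
	[CE_PANIC]	= KERN_EMERG,
	[CE_ALERT]	= KERN_ALERT,
	[CE_WARN]	= KERN_WARNING,
	[CE_NOTE]	= KERN_NOTICE,
	[CE_CONT]	= KERN_INFO,
	[CE_DEBUG]	= KERN_DEBUG,
};

void
cmn_err(int level, char *fmt, ...)
{
	va_list ap;

	/*
	 * Emit the level prefix, then hand the caller's format string
	 * straight to vprintk(): no private buffer, no spinlock, no
	 * irq disabling.
	 */
	printk("%s", ce_to_kern[level]);
	va_start(ap, fmt);
	vprintk(fmt, ap);
	va_end(ap);

	if (level == CE_PANIC)
		panic("cmn_err(CE_PANIC)");
}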
(In reply to comment #4)
> (In reply to comment #3)
> > Such a storm could cause problems with error reporting elsewhere in XFS as well
> > as on other filesystems, not to mention pretty much hang any machine with a
> > slow serial console.
>
> Yes, and conversely such storms could come from somewhere else too, so just
> fixing the scsi errors may leave XFS still vulnerable.

True, though in general spamming the logs without rate limiting is considered bad form...

> > As it is, I'm not sure that cmn_err() needs to lock and disable IRQs across the
> > printk call anymore, as printk() now disables interrupts and serialises console
> > output itself. I suspect the lock in cmn_err() predated such serialisation, so
> > perhaps the only fix needed for XFS is to remove the lock in cmn_err()
> > altogether.
>
> Just removing the lock is not a good idea - it's used to protect a global
> message buffer that cmn_err() uses to format its message before passing it to
> printk(). I believe the buffer used to be allocated on the stack, which didn't
> need locking, but now it's global. I don't like the idea of putting it back on
> the stack - especially when printk() allocates its buffer on the stack too.
> If we can be certain that cmn_err() isn't used in interrupt context then we
> could change the spin_lock_irqsave() to spin_lock(), but I think that's a bit
> risky.

Agreed.

> The cmn_err() stuff looks like a glue routine left over from IRIX and the only
> extra useful feature it offers is that it will trigger a panic on use of
> CE_PANIC - other than that it's just unnecessary overhead.
>
> I think we should remove all uses of cmn_err() and replace it with printk() - a
> macro could make that relatively painless. Or replace the guts of cmn_err()
> with a call to vprintk(). That still leaves the CE_* -> KERN_* translation to
> deal with, though.

*nod*

If we're going to touch this code, we may as well try to call directly into vprintk(). I think I might be able to do it all with some extra special preprocessor hackery, but I need to think about it a bit - I'll see what I can come up with...

Cheers,
Dave.
Thanks Dave.

Would something like this work as a blanket change?

#define CE_blah KERN_blah
#define cmn_err(a,b, ...) printk(a b, __VA_ARGS__)
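For context, this is how a call site would expand under that proposal. The call below is hypothetical; it relies on the level macro expanding to a KERN_* string literal so the prefix concatenates with the format string, and call sites with no format arguments would need the ##__VA_ARGS__ GNU extension:

#include <linux/kernel.h>

#define CE_WARN			KERN_WARNING
#define cmn_err(lvl, fmt, ...)	printk(lvl fmt, ##__VA_ARGS__)

static void example_call_site(const char *fsname)
{
	/* this call... */
	cmn_err(CE_WARN, "XFS: metadata I/O error on %s\n", fsname);
	/* ...expands, via string-literal concatenation, to:
	 *	printk(KERN_WARNING "XFS: metadata I/O error on %s\n", fsname);
	 */
}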
(In reply to comment #6)
> Thanks Dave.
>
> Would something like this work as a blanket change?
>
> #define CE_blah KERN_blah
> #define cmn_err(a,b, ...) printk(a b, __VA_ARGS__)

Yes, that's the sort of transformation I was thinking of as the core of the change, but it's not quite that simple - there are places where we do stuff based on CE_blah being integers, e.g. in xfs_cmn_err() where we check the panic mask and change the error level based on both the error level and whether the error type matches the panic mask or not. That complicates the solution a bit, so a pure cmn_err() macro solution would probably end up more like:

#define cmn_err(lvl, fmt, ...) do { \
	switch (lvl) { \
	case CE_blah: printk(KERN_blah fmt, __VA_ARGS__); \
	....

However, that is going to add a fair bit of bloat to the binary because there are a couple of hundred cmn_err() call sites, so it's not really a good solution. I think with a bit of massaging we can clean up the code that assumes the CE_blah values are integers first, and then applying your simple blanket macro to cmn_err() will do the rest of the conversion.

Note that we also need to deal with icmn_err() as well, though it is only called in a couple of places and no longer exists upstream...

Cheers,
Dave.
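A sketch of the direction described above: keep the panic-mask handling in one real function so only the level translation is needed, instead of expanding a switch at every call site. The function name, prototype, and promotion rule are illustrative, and ce_to_kern[] is the hypothetical table from the earlier vprintk() sketch:

#include <linux/kernel.h>

/*
 * Illustrative only: one out-of-line function keeps the panic-mask
 * check and the CE_* -> KERN_* translation in a single place, so the
 * couple of hundred ordinary call sites can use a trivial macro or a
 * plain printk() without per-site switch bloat.
 */
void
xfs_example_cmn_err(int panic_tag, int level, const char *fsname,
		    const char *fmt, ...)
{
	va_list ap;
	int do_panic = 0;

	/* if this error class is in the sysctl panic mask, promote the
	 * message and BUG() once it has been printed */
	if (xfs_panic_mask & panic_tag) {
		level = CE_PANIC;
		do_panic = 1;
	}

	printk("%sFilesystem \"%s\": ", ce_to_kern[level], fsname);
	va_start(ap, fmt);
	vprintk(fmt, ap);
	va_end(ap);

	if (do_panic)
		BUG();
}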
Patch posted on 16/12/10.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
In kernel-2.6.18-247.el5.

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Confirmed: patch is in the kernel git tree.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html