734900 – Panic, NMI Watchdog detected LOCKUP on CPU 6

Summary: Panic, NMI Watchdog detected LOCKUP on CPU 6

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.7
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Don Zickus
QA Contact:	Zhouping Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	744147 758797

Reported:	2011-08-31 19:27 UTC by Mike Stevens
Modified:	2018-11-27 21:20 UTC
CC List:	11 users
Fixed In Version:	kernel-2.6.18-300.el5
Doc Type:	Bug Fix
Doc Text:	In certain circumstances, the evdev_pass_event() function was interrupted while it held a spinlock and was then called again from the interrupt path, resulting in a deadlock. A patch has been provided that disables interrupts while the spinlock is held, which prevents the deadlock from occurring.
Clone Of:
Environment:
Last Closed:	2012-02-21 03:54:09 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
RHEL5 additional fix for this issue (1.47 KB, patch) 2011-12-01 14:59 UTC, Prarit Bhargava

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Legacy)	64161	0	None	None	None	Never
Red Hat Product Errata	RHSA-2012:0150	0	normal	SHIPPED_LIVE	Moderate: Red Hat Enterprise Linux 5.8 kernel update	2012-02-21 07:35:24 UTC

Description Mike Stevens 2011-08-31 19:27:26 UTC

Description of problem:

This system is running an Oracle database.  After some period of inactivity, the system panics and reboots.  This seems to be a sporadic issue.  A vmcore has been produced.

Version-Release number of selected component (if applicable):

Linux 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Running a large Oracle job, and then letting the system sit idle seems to tickle the bug.

Steps to Reproduce:
1.
2.
3.
  
Actual results:

The kernel panics

Expected results:

The system continues running

Additional info:

NMI Watchdog detected LOCKUP on CPU 6
CPU 6 
Modules linked in: ipmi_si mpt2sas scsi_transport_sas mptctl mptbase ipmi_devintf ipmi_msghandler dell_rbu autofs4 hidp nfs nfs_acl rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod cdrom igb 8021q tpm_tis tpm i7core_edac edac_mc serio_raw dca tpm_bios bnx2 pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-274.el5 #1
RIP: 0010:[<ffffffff80064bcc>]  [<ffffffff80064bcc>] .text.lock.spinlock+0x2/0x30
RSP: 0018:ffff810c400a7ab0  EFLAGS: 00000086
RAX: ffff810c3ec9bd08 RBX: ffff810c395cd000 RCX: 0000000011eda412
RDX: ffff810c395cd620 RSI: ffff810c400a7ad8 RDI: ffff810c395cd608
RBP: ffff810c3ec9bc80 R08: ffff810c3fc780c0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000048 R12: ffff810c400a7ad8
R13: 0000000000000000 R14: 0000000000000000 R15: ffff810c3f9dc000
FS:  0000000000000000(0000) GS:ffff810c6a27fe40(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaab8a9b48 CR3: 0000000c3abeb000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffff810c400a2000, task ffff810c3fc780c0)
Stack:  ffffffff802145ba ffff810c3ec9bc80 ffff810c395cd000 0000000000000000
 ffffffff802148b1 000000004e5d70d0 00000000000496f2 0000000000000000
 ffff81183ec50000 ffff81183ec50000 ffff810c3ec9bca0 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff802145ba>] evdev_pass_event+0x19/0x67
 [<ffffffff802148b1>] evdev_event+0x59/0xa6
 [<ffffffff80211ee8>] input_event+0x424/0x44c
 [<ffffffff8020c64f>] hidinput_report_event+0x22/0x4a
 [<ffffffff80207d05>] hid_input_report+0x2f6/0x349
 [<ffffffff8020910e>] hid_irq_in+0x55/0xea
 [<ffffffff801fbc48>] usb_hcd_giveback_urb+0x37/0x65
 [<ffffffff88021714>] :uhci_hcd:uhci_giveback_urb+0x138/0x165
 [<ffffffff88021de9>] :uhci_hcd:uhci_scan_schedule+0x59d/0x880
 [<ffffffff88023be3>] :uhci_hcd:uhci_irq+0x13f/0x15c
 [<ffffffff801fc637>] usb_hcd_irq+0x27/0x55
 [<ffffffff80010d6e>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800bd69a>] __do_IRQ+0xe1/0x140
 [<ffffffff80046c44>] try_to_wake_up+0x472/0x484
 [<ffffffff80211f2e>] input_repeat_key+0x0/0x75
 [<ffffffff8006d4c1>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff802145da>] evdev_pass_event+0x39/0x67
 [<ffffffff802148b1>] evdev_event+0x59/0xa6
 [<ffffffff8009fb5f>] __queue_work+0x49/0x59
 [<ffffffff80211ee8>] input_event+0x424/0x44c
 [<ffffffff80211f54>] input_repeat_key+0x26/0x75
 [<ffffffff80099bac>] run_timer_softirq+0x18d/0x23a
 [<ffffffff80012562>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006d636>] do_softirq+0x2c/0x7d
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI>  [<ffffffff801a246a>] acpi_processor_idle_simple+0x1af/0x31c
 [<ffffffff801a2427>] acpi_processor_idle_simple+0x16c/0x31c
 [<ffffffff801a22bb>] acpi_processor_idle_simple+0x0/0x31c
 [<ffffffff80048fc5>] cpu_idle+0x95/0xb8
 [<ffffffff80078a9a>] start_secondary+0x479/0x488


Code: 83 3f 00 7e f9 e9 7f fe ff ff f3 90 83 3f 00 7e f9 e9 f9 fe

Comment 1 Don Zickus 2011-09-02 16:13:33 UTC

How big is the vmcore file?  Would it be possible for me to download it from somewhere?

This panic seems odd.  The watchdog times out because the system is stuck spinning forever on a spin_lock.  Ok, fine.  Seen that plenty of times.  But the places where that lock is used, the critical region is very small and interrupts are supposed to be disabled.  I am baffled how that lock is stuck spinning forever.

I am hoping the vmcore could give me some answers as to what the other cpus are doing at the time of the crash.  Maybe then it will become obvious what happened with the lock.  Perhaps something else is causing the problem and this lock was just a symptom (IOW interrupts were not disabled like they were supposed to be?).

Cheers,
Don

Comment 2 Don Zickus 2011-09-02 16:30:43 UTC

Actually, re-reading the stack trace shows that what I suspected is correct.

<IRQ>  [<ffffffff802145ba>] evdev_pass_event+0x19/0x67
                            ^^^^^^^^^^^^^
grabs buffer_lock again and deadlocks

 [<ffffffff802148b1>] evdev_event+0x59/0xa6
 [<ffffffff80211ee8>] input_event+0x424/0x44c
 [<ffffffff8020c64f>] hidinput_report_event+0x22/0x4a
 [<ffffffff80207d05>] hid_input_report+0x2f6/0x349
 [<ffffffff8020910e>] hid_irq_in+0x55/0xea
 [<ffffffff801fbc48>] usb_hcd_giveback_urb+0x37/0x65
 [<ffffffff88021714>] :uhci_hcd:uhci_giveback_urb+0x138/0x165
 [<ffffffff88021de9>] :uhci_hcd:uhci_scan_schedule+0x59d/0x880
 [<ffffffff88023be3>] :uhci_hcd:uhci_irq+0x13f/0x15c
 [<ffffffff801fc637>] usb_hcd_irq+0x27/0x55
 [<ffffffff80010d6e>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800bd69a>] __do_IRQ+0xe1/0x140
 [<ffffffff80046c44>] try_to_wake_up+0x472/0x484
 [<ffffffff80211f2e>] input_repeat_key+0x0/0x75
 [<ffffffff8006d4c1>] do_IRQ+0xe9/0xf7
                      ^^^^^^
interrupted

 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff802145da>] evdev_pass_event+0x39/0x67
                      ^^^^^^^^^^^^^^^^^^
originally grabbed buffer_lock here

 [<ffffffff802148b1>] evdev_event+0x59/0xa6
 [<ffffffff8009fb5f>] __queue_work+0x49/0x59
 [<ffffffff80211ee8>] input_event+0x424/0x44c
 [<ffffffff80211f54>] input_repeat_key+0x26/0x75
 [<ffffffff80099bac>] run_timer_softirq+0x18d/0x23a

It looks like upstream commit 8006479c9b75fb6594a7b746af3d7f1fbb68f18f tried to address this.  However, that commit is way too intrusive for RHEL-5.

I think I will just take a shortcut and convert the spin_lock calls in evdev_pass_event to spin_lock_irq, so that interrupts really are disabled while the lock is held and the holder cannot be pre-empted by a real interrupt (the first context was a softIRQ, which does not block hardware interrupts).

Cheers,
Don
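
Editorial sketch of the deadlock described in comment 2. This is a userspace analogy in C, not the RHEL-5 patch and not kernel code. Everything in it is an assumption made for illustration: a POSIX signal stands in for the USB hardware interrupt, blocking that signal with sigprocmask() stands in for disabling interrupts the way spin_lock_irq would, and the names buffer_trylock, fake_irq, install_fake_irq, and pass_event are hypothetical. The lock is taken with a try-lock so that the nested call reports the problem instead of spinning forever as the kernel did.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the evdev client's buffer_lock. */
static atomic_flag buffer_lock = ATOMIC_FLAG_INIT;

/* Set by the nested call when it finds the lock already held. */
volatile sig_atomic_t nested_saw_lock_held;
/* Counts events that were queued under the lock. */
volatile sig_atomic_t events_queued;

/* Try-lock instead of spinning, so the demo reports instead of hanging. */
static bool buffer_trylock(void)
{
    return !atomic_flag_test_and_set(&buffer_lock);
}

static void buffer_unlock(void)
{
    atomic_flag_clear(&buffer_lock);
}

/* Plays the role of the interrupt path that re-enters the event code. */
static void fake_irq(int signo)
{
    (void)signo;
    if (!buffer_trylock()) {
        /* A real spin_lock() would spin here forever on this CPU. */
        nested_saw_lock_held = 1;
        return;
    }
    events_queued++;
    buffer_unlock();
}

/* Install the signal handler that models the interrupt. */
void install_fake_irq(void)
{
    struct sigaction sa = {0};

    sa.sa_handler = fake_irq;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);
}

/*
 * Plays the role of the outer (timer softirq) caller. With mask_irqs set,
 * the signal is blocked for the whole critical section, which is the
 * userspace analogue of taking the lock with interrupts disabled.
 * Returns true if the nested call found the lock already held.
 */
bool pass_event(bool mask_irqs)
{
    sigset_t block, saved;
    bool got;

    nested_saw_lock_held = 0;

    if (mask_irqs) {
        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        sigprocmask(SIG_BLOCK, &block, &saved);
    }

    got = buffer_trylock();
    assert(got);            /* the outer caller always finds the lock free */
    events_queued++;
    raise(SIGUSR1);         /* the "interrupt" fires inside the critical section */
    buffer_unlock();

    if (mask_irqs) {
        /* The pending signal is delivered here, after the unlock. */
        sigprocmask(SIG_SETMASK, &saved, NULL);
    }

    return nested_saw_lock_held != 0;
}
```

After install_fake_irq(), pass_event(false) returns true because the handler runs while the lock is still held, which corresponds to the second evdev_pass_event frame in the trace. pass_event(true) returns false because the handler is deferred until the lock has been released.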

Comment 3 Mike Stevens 2011-09-02 16:37:05 UTC

I do have a vmcore, but it is 96GB.  I'd be happy to make that available if
you'd like, or provide any relevant output.  Let me know your preference.

At the time, I had a KVM dongle plugged in to one of the USB ports.  That
dongle was causing all sorts of problems: getty respawning too rapidly, the
caps lock key typing lower case when it was on, and some other oddities.
The backtrace suggests to me that we were at least somewhere in the
keyboard/mouse modules.

I've removed that dongle, and the system has been stable for a few days. 
Running over the weekend will probably be a better test however.

Comment 4 Don Zickus 2011-09-06 16:23:12 UTC

I have uploaded a kernel with a possible fix here:

http://people.redhat.com/dzickus/el5/.d3ee01aa60a1394a9b165fa1337b1ff8a236b091/kernel-2.6.18-282.el5.dz58.3.x86_64.rpm


Let me know if that fixes your problem.

Cheers,
Don

Comment 5 Mike Stevens 2011-09-08 15:44:38 UTC

I've installed the patched kernel, and run a few test jobs.  The system has been up and stable for about 48 hours now.  Since I can't readily recreate the panic, I'll let this run over the weekend and update early next week.

Comment 6 Don Zickus 2011-09-29 14:09:16 UTC

RHEL-5 deadline is approaching, any feedback?

Cheers,
Don

Comment 7 Mike Stevens 2011-09-29 14:40:07 UTC

The system has been stable since 9/8/2011, so we can close this as fixed.

Comment 8 Don Zickus 2011-09-29 18:11:57 UTC

(In reply to comment #7)
> The system has been stable since 9/8/2011, so we can close this as fixed.

Well, the patch has not been integrated into our kernel yet, so I'll just move it to a POST state for further review and then it will get closed down the road. ;-)

But thanks for the testing feedback!

Cheers,
Don

Comment 10 RHEL Program Management 2011-09-30 14:21:25 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 19 Jarod Wilson 2011-10-12 20:01:43 UTC

Patch(es) available in kernel-2.6.18-288.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 33 Prarit Bhargava 2011-12-01 14:25:51 UTC

Moving back to assigned so we can include another patch.

P.

Comment 34 Prarit Bhargava 2011-12-01 14:59:33 UTC

Created attachment 539239 [details]
RHEL5 additional fix for this issue

Comment 36 Jarod Wilson 2011-12-05 14:48:20 UTC

Patch(es) available in kernel-2.6.18-300.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.

Comment 38 Zhouping Liu 2011-12-09 07:12:00 UTC

Based on comment 15, I set this to Sanity-Only and confirmed the patch is in 2.6.18-300.el5.
Hi Prarit, do you know how to test this bug?

Comment 44 Tomas Capek 2012-01-11 13:28:09 UTC

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
In certain circumstances, the evdev_pass_event() function was interrupted while it held a spinlock and was then called again from the interrupt path, resulting in a deadlock. A patch has been provided that disables interrupts while the spinlock is held, which prevents the deadlock from occurring.

Comment 45 errata-xmlrpc 2012-02-21 03:54:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html
