Bug 734900
| Summary: | Panic, NMI Watchdog detected LOCKUP on CPU 6 | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Mike Stevens <michael_stevens> | ||||
| Component: | kernel | Assignee: | Don Zickus <dzickus> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Zhouping Liu <zliu> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 5.7 | CC: | dhoward, grocha, jarod, jwest, mmilgram, nobody+295318, plougher, qcai, rdassen, sforsber, zliu | ||||
| Target Milestone: | rc | Keywords: | Regression, ZStream | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | kernel-2.6.18-300.el5 | Doc Type: | Bug Fix | ||||
| Doc Text: | In certain circumstances, the evdev_pass_event() function with a spinlock attached was interrupted and called again, eventually resulting in a deadlock. A patch has been provided to address this issue by disabling interrupts when the spinlock is obtained. This prevents the deadlock from occurring. | Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2012-02-21 03:54:09 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 744147, 758797 | ||||||
| Attachments: | |||||||
Description
Mike Stevens
2011-08-31 19:27:26 UTC
How big is the vmcore file? Would it be possible for me to download it from somewhere?

This panic seems odd. The watchdog times out because the system is stuck spinning forever on a spin_lock. Ok, fine. Seen that plenty of times. But in the places where that lock is used, the critical region is very small and interrupts are supposed to be disabled. I am baffled how that lock is stuck spinning forever.

I am hoping the vmcore could give me some answers as to what the other cpus are doing at the time of the crash. Maybe then it will become obvious what happened with the lock. Perhaps something else is causing the problem and this lock was just a symptom (IOW interrupts were not disabled like they were supposed to be?).

Cheers,
Don

Actually, re-reading the stack shows that what I suspected is correct.
<IRQ> [<ffffffff802145ba>] evdev_pass_event+0x19/0x67
^^^^^^^^^^^^^
grabs buffer_lock again and deadlocks
[<ffffffff802148b1>] evdev_event+0x59/0xa6
[<ffffffff80211ee8>] input_event+0x424/0x44c
[<ffffffff8020c64f>] hidinput_report_event+0x22/0x4a
[<ffffffff80207d05>] hid_input_report+0x2f6/0x349
[<ffffffff8020910e>] hid_irq_in+0x55/0xea
[<ffffffff801fbc48>] usb_hcd_giveback_urb+0x37/0x65
[<ffffffff88021714>] :uhci_hcd:uhci_giveback_urb+0x138/0x165
[<ffffffff88021de9>] :uhci_hcd:uhci_scan_schedule+0x59d/0x880
[<ffffffff88023be3>] :uhci_hcd:uhci_irq+0x13f/0x15c
[<ffffffff801fc637>] usb_hcd_irq+0x27/0x55
[<ffffffff80010d6e>] handle_IRQ_event+0x51/0xa6
[<ffffffff800bd69a>] __do_IRQ+0xe1/0x140
[<ffffffff80046c44>] try_to_wake_up+0x472/0x484
[<ffffffff80211f2e>] input_repeat_key+0x0/0x75
[<ffffffff8006d4c1>] do_IRQ+0xe9/0xf7
^^^^^^
interrupted
[<ffffffff8005d615>] ret_from_intr+0x0/0xa
[<ffffffff802145da>] evdev_pass_event+0x39/0x67
^^^^^^^^^^^^^^^^^^
originally grabbed buffer_lock here
[<ffffffff802148b1>] evdev_event+0x59/0xa6
[<ffffffff8009fb5f>] __queue_work+0x49/0x59
[<ffffffff80211ee8>] input_event+0x424/0x44c
[<ffffffff80211f54>] input_repeat_key+0x26/0x75
[<ffffffff80099bac>] run_timer_softirq+0x18d/0x23a
It looks like upstream commit 8006479c9b75fb6594a7b746af3d7f1fbb68f18f tried to address this. However, that commit is waaaaay too intrusive for RHEL-5.
I think I will just take a shortcut and convert the spin_locks in evdev_pass_event to spin_lock_irq, so that interrupts really are disabled while buffer_lock is held and the code cannot be pre-empted by a real interrupt (the first interrupt was a softIRQ).
Cheers,
Don
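
To make the re-entry in the stack trace above concrete, here is a minimal userspace sketch of the same pattern. It is only a model, not the RHEL-5 patch: every name in it (model_trylock, raise_irq, pass_event, run_scenario and the flags) is hypothetical, and the spinlock is replaced by a try-lock so the would-be deadlock is counted instead of hanging. With irq_safe set to false it behaves like a plain spin_lock(): the interrupt handler runs on top of the lock holder and finds the lock taken. With irq_safe set to true it behaves like spin_lock_irq(): the interrupt is deferred until the lock has been released.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only. Every name below is hypothetical, not a kernel API. */
static bool lock_held;           /* stands in for buffer_lock */
static bool irqs_enabled = true; /* stands in for the CPU interrupt flag */
static bool irq_pending;         /* interrupt raised while masked */
static int  deadlocks;           /* re-entries that found the lock already held */
static int  events_delivered;    /* events that made it into the buffer */

/* A real spinlock would spin here forever; the model reports failure instead. */
static bool model_trylock(void)
{
    if (lock_held)
        return false;
    lock_held = true;
    return true;
}

static void model_unlock(void)
{
    assert(lock_held);
    lock_held = false;
}

/* Interrupt-context path: delivers an event under the same lock. */
static void irq_handler(void)
{
    if (!model_trylock()) {
        deadlocks++;             /* this CPU already holds the lock */
        return;
    }
    events_delivered++;
    model_unlock();
}

/* A device interrupt arrives: it runs now if unmasked, otherwise it is deferred. */
static void raise_irq(void)
{
    if (irqs_enabled)
        irq_handler();
    else
        irq_pending = true;
}

/* Timer path: irq_safe=false models spin_lock(), true models spin_lock_irq(). */
static void pass_event(bool irq_safe)
{
    bool got;

    if (irq_safe)
        irqs_enabled = false;    /* mask interrupts before taking the lock */
    got = model_trylock();
    assert(got);
    (void)got;

    raise_irq();                 /* interrupt lands inside the critical section */
    events_delivered++;
    model_unlock();

    if (irq_safe) {
        irqs_enabled = true;     /* unmask after the lock is released */
        if (irq_pending) {
            irq_pending = false;
            irq_handler();       /* deferred interrupt now runs safely */
        }
    }
}

/* Resets the model, runs one event, and returns the number of deadlocks seen. */
int run_scenario(bool irq_safe)
{
    lock_held = false;
    irqs_enabled = true;
    irq_pending = false;
    deadlocks = 0;
    events_delivered = 0;
    pass_event(irq_safe);
    return deadlocks;
}
```

Calling run_scenario(false) reports one would-be deadlock and delivers only the timer event, while run_scenario(true) reports none and delivers both events. The model has a single CPU and a single interrupt source, so it says nothing about contention between CPUs.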
I do have a vmcore, but it is 96GB. I'd be happy to make that available if you'd like, or provide any relevant output. Let me know your preference.

At the time, I had a KVM dongle plugged in to one of the USB ports. That dongle was causing all sorts of problems: the getty respawning too rapidly, the caps lock key typing lower case when it was on, and some other oddities. The bt seems to suggest to me that we were at least somewhere in the keyboard/mouse modules. I've removed that dongle, and the system has been stable for a few days. Running over the weekend will probably be a better test, however.

I have uploaded a kernel with a possible fix here:
http://people.redhat.com/dzickus/el5/.d3ee01aa60a1394a9b165fa1337b1ff8a236b091/kernel-2.6.18-282.el5.dz58.3.x86_64.rpm
Let me know if that fixes your problem.

Cheers,
Don

I've installed the patched kernel, and run a few test jobs. The system has been up and stable for about 48 hours now. Since I can't readily recreate the panic, I'll let this run over the weekend and update early next week.

RHEL-5 deadline is approaching, any feedback?

Cheers,
Don

The system has been stable since 9/8/2011, so we can close this as fixed.

(In reply to comment #7)
> The system has been stable since 9/8/2011, so we can close this as fixed.

Well, the patch has not been integrated into our kernel yet, so I'll just move it to a POST state for further review and then it will get closed down the road. ;-) But thanks for the testing feedback!

Cheers,
Don

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Patch(es) available in kernel-2.6.18-288.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Moving back to assigned so we can include another patch.

P.

Created attachment 539239
RHEL5 additional fix for this issue

Patch(es) available in kernel-2.6.18-300.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5/
Detailed testing feedback is always welcomed. If you require guidance regarding testing, please ask the bug assignee.

Based on comment 15, set Sanity-Only; confirmed the patch is in 2.6.18-300.el5.

Hi Prarit, do you know how to test this bug?

Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
In certain circumstances, the evdev_pass_event() function with a spinlock attached was interrupted and called again, eventually resulting in a deadlock. A patch has been provided to address this issue by disabling interrupts when the spinlock is obtained. This prevents the deadlock from occurring.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html