Description of problem:
This system is running an Oracle database. After some period of inactivity, the system panics and reboots. This seems to be a sporadic issue. A vmcore has been produced.

Version-Release number of selected component (if applicable):
Linux 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Running a large Oracle job, and then letting the system sit idle seems to tickle the bug.

Steps to Reproduce:
1.
2.
3.

Actual results:
The kernel panics

Expected results:
The system continues running

Additional info:
NMI Watchdog detected LOCKUP on CPU 6
CPU 6
Modules linked in: ipmi_si mpt2sas scsi_transport_sas mptctl mptbase ipmi_devintf ipmi_msghandler dell_rbu autofs4 hidp nfs nfs_acl rfcomm l2cap bluetooth lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod cdrom igb 8021q tpm_tis tpm i7core_edac edac_mc serio_raw dca tpm_bios bnx2 pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-274.el5 #1
RIP: 0010:[<ffffffff80064bcc>] [<ffffffff80064bcc>] .text.lock.spinlock+0x2/0x30
RSP: 0018:ffff810c400a7ab0 EFLAGS: 00000086
RAX: ffff810c3ec9bd08 RBX: ffff810c395cd000 RCX: 0000000011eda412
RDX: ffff810c395cd620 RSI: ffff810c400a7ad8 RDI: ffff810c395cd608
RBP: ffff810c3ec9bc80 R08: ffff810c3fc780c0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000048 R12: ffff810c400a7ad8
R13: 0000000000000000 R14: 0000000000000000 R15: ffff810c3f9dc000
FS: 0000000000000000(0000) GS:ffff810c6a27fe40(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaab8a9b48 CR3: 0000000c3abeb000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffff810c400a2000, task ffff810c3fc780c0)
Stack: ffffffff802145ba ffff810c3ec9bc80 ffff810c395cd000 0000000000000000
 ffffffff802148b1 000000004e5d70d0 00000000000496f2 0000000000000000
 ffff81183ec50000 ffff81183ec50000 ffff810c3ec9bca0 0000000000000000
Call Trace:
 <IRQ> [<ffffffff802145ba>] evdev_pass_event+0x19/0x67
 [<ffffffff802148b1>] evdev_event+0x59/0xa6
 [<ffffffff80211ee8>] input_event+0x424/0x44c
 [<ffffffff8020c64f>] hidinput_report_event+0x22/0x4a
 [<ffffffff80207d05>] hid_input_report+0x2f6/0x349
 [<ffffffff8020910e>] hid_irq_in+0x55/0xea
 [<ffffffff801fbc48>] usb_hcd_giveback_urb+0x37/0x65
 [<ffffffff88021714>] :uhci_hcd:uhci_giveback_urb+0x138/0x165
 [<ffffffff88021de9>] :uhci_hcd:uhci_scan_schedule+0x59d/0x880
 [<ffffffff88023be3>] :uhci_hcd:uhci_irq+0x13f/0x15c
 [<ffffffff801fc637>] usb_hcd_irq+0x27/0x55
 [<ffffffff80010d6e>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800bd69a>] __do_IRQ+0xe1/0x140
 [<ffffffff80046c44>] try_to_wake_up+0x472/0x484
 [<ffffffff80211f2e>] input_repeat_key+0x0/0x75
 [<ffffffff8006d4c1>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff802145da>] evdev_pass_event+0x39/0x67
 [<ffffffff802148b1>] evdev_event+0x59/0xa6
 [<ffffffff8009fb5f>] __queue_work+0x49/0x59
 [<ffffffff80211ee8>] input_event+0x424/0x44c
 [<ffffffff80211f54>] input_repeat_key+0x26/0x75
 [<ffffffff80099bac>] run_timer_softirq+0x18d/0x23a
 [<ffffffff80012562>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006d636>] do_softirq+0x2c/0x7d
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI> [<ffffffff801a246a>] acpi_processor_idle_simple+0x1af/0x31c
 [<ffffffff801a2427>] acpi_processor_idle_simple+0x16c/0x31c
 [<ffffffff801a22bb>] acpi_processor_idle_simple+0x0/0x31c
 [<ffffffff80048fc5>] cpu_idle+0x95/0xb8
 [<ffffffff80078a9a>] start_secondary+0x479/0x488

Code: 83 3f 00 7e f9 e9 7f fe ff ff f3 90 83 3f 00 7e f9 e9 f9 fe
How big is the vmcore file? Would it be possible for me to download it from somewhere?

This panic seems odd. The watchdog times out because the system is stuck spinning forever on a spin_lock. Ok, fine. Seen that plenty of times. But in the places where that lock is used, the critical region is very small and interrupts are supposed to be disabled. I am baffled as to how that lock ends up spinning forever.

I am hoping the vmcore can give me some answers about what the other CPUs were doing at the time of the crash. Maybe then it will become obvious what happened with the lock. Perhaps something else is causing the problem and this lock is just a symptom (IOW, interrupts were not disabled like they were supposed to be?).

Cheers, Don
Actually, re-reading the stack confirms what I suspected.

 <IRQ> [<ffffffff802145ba>] evdev_pass_event+0x19/0x67
       ^^^^^^^^^^^^^ grabs buffer_lock again and deadlocks
 [<ffffffff802148b1>] evdev_event+0x59/0xa6
 [<ffffffff80211ee8>] input_event+0x424/0x44c
 [<ffffffff8020c64f>] hidinput_report_event+0x22/0x4a
 [<ffffffff80207d05>] hid_input_report+0x2f6/0x349
 [<ffffffff8020910e>] hid_irq_in+0x55/0xea
 [<ffffffff801fbc48>] usb_hcd_giveback_urb+0x37/0x65
 [<ffffffff88021714>] :uhci_hcd:uhci_giveback_urb+0x138/0x165
 [<ffffffff88021de9>] :uhci_hcd:uhci_scan_schedule+0x59d/0x880
 [<ffffffff88023be3>] :uhci_hcd:uhci_irq+0x13f/0x15c
 [<ffffffff801fc637>] usb_hcd_irq+0x27/0x55
 [<ffffffff80010d6e>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800bd69a>] __do_IRQ+0xe1/0x140
 [<ffffffff80046c44>] try_to_wake_up+0x472/0x484
 [<ffffffff80211f2e>] input_repeat_key+0x0/0x75
 [<ffffffff8006d4c1>] do_IRQ+0xe9/0xf7
       ^^^^^^ interrupted
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff802145da>] evdev_pass_event+0x39/0x67
       ^^^^^^^^^^^^^^^^^^ originally grabbed buffer_lock here
 [<ffffffff802148b1>] evdev_event+0x59/0xa6
 [<ffffffff8009fb5f>] __queue_work+0x49/0x59
 [<ffffffff80211ee8>] input_event+0x424/0x44c
 [<ffffffff80211f54>] input_repeat_key+0x26/0x75
 [<ffffffff80099bac>] run_timer_softirq+0x18d/0x23a

It looks like upstream commit 8006479c9b75fb6594a7b746af3d7f1fbb68f18f tried to address this. However, that commit is waaaaay too intrusive for RHEL-5. I think I will just take a shortcut and convert the spin_lock calls in evdev_pass_event to spin_lock_irq, so that interrupts really are disabled and the lock holder cannot be preempted by a real interrupt (the first interrupt was a softIRQ).

Cheers, Don
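To illustrate the re-entry described above, here is a toy user-space model of a single CPU. It is only a sketch under stated assumptions: every toy_* name is made up for illustration and is not a kernel API, the "lock" is a plain flag standing in for buffer_lock, and interrupt masking is modelled with save/restore semantics (closer in spirit to spin_lock_irqsave than to spin_lock_irq). It is not the RHEL-5 patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model. All toy_* names are hypothetical, not kernel APIs. */
struct toy_cpu {
    bool irqs_enabled;  /* would a hard interrupt be delivered right now? */
    bool irq_pending;   /* a hard interrupt arrived while masked */
    bool lock_held;     /* stands in for the evdev buffer_lock */
    bool deadlocked;    /* a second acquire on the same CPU would spin forever */
    int events_queued;  /* events that made it into the buffer */
};

void toy_pass_event(struct toy_cpu *cpu, bool mask_irqs, bool irq_fires);

/* Hard IRQ path: like hid_irq_in() in the trace, it ends up passing an event. */
void toy_hard_irq(struct toy_cpu *cpu, bool mask_irqs)
{
    toy_pass_event(cpu, mask_irqs, false);
}

/* Hardware raises an interrupt: run it now, or latch it if masked. */
void toy_raise_irq(struct toy_cpu *cpu, bool mask_irqs)
{
    if (cpu->irqs_enabled)
        toy_hard_irq(cpu, mask_irqs);
    else
        cpu->irq_pending = true;
}

/* Models the critical section; irq_fires injects an interrupt inside it. */
void toy_pass_event(struct toy_cpu *cpu, bool mask_irqs, bool irq_fires)
{
    bool saved = cpu->irqs_enabled;

    if (mask_irqs)
        cpu->irqs_enabled = false;

    if (cpu->lock_held) {
        /* Same CPU already owns the lock: a real spinlock spins forever. */
        cpu->deadlocked = true;
        return;
    }
    cpu->lock_held = true;

    if (irq_fires)
        toy_raise_irq(cpu, mask_irqs);
    if (cpu->deadlocked)
        return; /* the interrupted lock holder never gets to run again */

    cpu->events_queued++;
    assert(cpu->lock_held);
    cpu->lock_held = false;

    if (mask_irqs) {
        cpu->irqs_enabled = saved;
        /* Deliver the latched interrupt now that the lock is free. */
        if (cpu->irqs_enabled && cpu->irq_pending) {
            cpu->irq_pending = false;
            toy_hard_irq(cpu, mask_irqs);
        }
    }
}

/* Run one scenario: an interrupt fires while the lock is held. */
struct toy_cpu toy_run(bool mask_irqs)
{
    struct toy_cpu cpu = { .irqs_enabled = true };

    toy_pass_event(&cpu, mask_irqs, true);
    return cpu;
}
```

In this model, toy_run(false) corresponds to the reported behaviour: the interrupt handler re-enters while the lock is held and the CPU is stuck. toy_run(true) corresponds to the proposed approach: the interrupt is held off until the lock is dropped, and both events are queued.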
I do have a vmcore, but it is 96GB. I'd be happy to make that available if you'd like, or provide any relevant output. Let me know your preference.

At the time, I had a KVM dongle plugged in to one of the USB ports. That dongle was causing all sorts of problems: the getty respawning too rapidly, the caps lock key typing lower case when it was on, and some other oddities. The bt seems to suggest to me that we were at least somewhere in the keyboard/mouse modules.

I've removed that dongle, and the system has been stable for a few days. Running over the weekend will probably be a better test, however.
I have uploaded a kernel with a possible fix here: http://people.redhat.com/dzickus/el5/.d3ee01aa60a1394a9b165fa1337b1ff8a236b091/kernel-2.6.18-282.el5.dz58.3.x86_64.rpm Let me know if that fixes your problem. Cheers, Don
I've installed the patched kernel, and run a few test jobs. The system has been up and stable for about 48 hours now. Since I can't readily recreate the panic, I'll let this run over the weekend and update early next week.
RHEL-5 deadline is approaching, any feedback? Cheers, Don
The system has been stable since 9/8/2011, so we can close this as fixed.
(In reply to comment #7)
> The system has been stable since 9/8/2011, so we can close this as fixed.

Well, the patch has not been integrated into our kernel yet, so I'll just move it to a POST state for further review, and then it will get closed down the road. ;-) But thanks for the testing feedback!

Cheers, Don
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patch(es) available in kernel-2.6.18-288.el5.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
Moving back to assigned so we can include another patch. P.
Created attachment 539239
RHEL5 additional fix for this issue
Patch(es) available in kernel-2.6.18-300.el5.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5/
Detailed testing feedback is always welcomed. If you require guidance regarding testing, please ask the bug assignee.
Based on comment 15, I have set this to Sanity-Only and confirmed that the patch is in 2.6.18-300.el5.

Hi Prarit, do you know how to test this bug?
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
In certain circumstances, the evdev_pass_event() function could be interrupted while it was holding a spinlock and then be called again from the interrupt handler, which tried to take the same spinlock and deadlocked. A patch has been provided to address this issue by disabling interrupts when the spinlock is obtained. This prevents the deadlock from occurring.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0150.html