Bug 455747
| Summary: | Oops when running oprofile | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | IBM Bug Proxy <bugproxy> |
| Component: | realtime-kernel | Assignee: | Red Hat Real Time Maintenance <rt-maint> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | beta | CC: | bhu, davids |
| Target Milestone: | 1.0.1 | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-08-26 19:57:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
IBM Bug Proxy
2008-07-17 15:20:23 UTC
------- Comment From ankigarg.com 2008-07-18 04:36 EDT-------
Could not reproduce this issue with the alpha01 - R2 (2.6.24.3-29ibmrt2.0) kernel. So this issue came in at a later point in time.

------- Comment From ankigarg.com 2008-07-18 05:40 EDT-------
Can reproduce the bug on the alpha18 kernel (after alpha14, I tried alpha18). This is the -65 kernel. I get the following oops.

rt-ipe.austin.ibm.com login: Unable to handle kernel paging request at fffffffffffffff8 RIP:
 [<ffffffff81139a82>] rb_insert_color+0x6/0xe3
PGD 203067 PUD 204067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in: oprofile ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl ipv6 autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_multipath dm_mod video output sbs sbshc battery ac parport_pc lp parport sg bnx2 button i5000_edac edac_core pata_acpi pcspkr ata_generic iTCO_wdt iTCO_vendor_support shpchp ata_piix libata mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.24.7-65ibmrt2.4 #1
RIP: 0010:[<ffffffff81139a82>] [<ffffffff81139a82>] rb_insert_color+0x6/0xe3
RSP: 0018:ffffffff81509ee0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff810001008780 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff8100010086a8 RDI: ffff810001008780
RBP: ffffffff81509ee8 R08: 0000000000000010 R09: 0000000000000000
R10: ffff81007fb09000 R11: 0000000000000000 R12: ffff810001008698
R13: ffff81042a8d44f0 R14: ffff81042a8d44e0 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffffffff813f3100(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 000000041956e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff8149e000, task ffffffff813ae7a0)
Stack: 0000000000000001 ffffffff81509f18 ffffffff81053e48 ffff810001008780
 ffff810001008698 ffff810001008640 7fffffffffffffff ffffffff81509f68
 ffffffff81054df7 000000152c077bb2 000000152c077bb2 0000000000000000
Call Trace:
 <IRQ> [<ffffffff81053e48>] enqueue_hrtimer+0xda/0xe8
 [<ffffffff81054df7>] hrtimer_interrupt+0x136/0x1ab
 [<ffffffff8101ff97>] smp_local_timer_interrupt+0x5a/0x5e
 [<ffffffff810205f9>] smp_apic_timer_interrupt+0x3a/0x51
 [<ffffffff8100af61>] ? mwait_idle+0x0/0x73
 [<ffffffff8100ce46>] apic_timer_interrupt+0x66/0x70
 <EOI> [<ffffffff8100afcf>] ? mwait_idle+0x6e/0x73
 [<ffffffff8100abb1>] ? enter_idle+0x22/0x24
 [<ffffffff8100ae90>] ? cpu_idle+0x99/0xf8
 [<ffffffff8128648e>] ? rest_init+0x82/0x84
 [<ffffffff814a8b70>] ? start_kernel+0x31e/0x329
 [<ffffffff814a8119>] ? _sinittext+0x119/0x120
Code: 74 12 49 3b 78 08 75 06 49 89 48 08 eb 09 49 89 48 10 eb 03 48 89 0e 48 8b 07 83 e0 03 48 09 c1 48 89 0f c9 c3 55 48 89 e5 41 57 <49> 89 f7 41 56 41 55 49 89 fd 41 54 53 e9 a1 00 00 00 49 89 c4
RIP [<ffffffff81139a82>] rb_insert_color+0x6/0xe3
 RSP <ffffffff81509ee0>
Initializing cgroup subsys cpuset

------- Comment From ankigarg.com 2008-07-18 06:02 EDT-------
(In reply to comment #11)
> Can reproduce the bug on alpha18 kernel (after alpha14, I tried alpha18). This
> is the -65 kernel. I get the following oops.
>
So luckily this time I was able to capture a dump !! The vmcore can be found here if anyone wants to take a look: http://kernel.beaverton.ibm.com/jtcltc/kdump_cores/bz46482/vmcore.bz2
Will investigate the dump and also try another R2 kernel between alpha14 & 18.

------- Comment From ankigarg.com 2008-07-18 07:59 EDT-------
So it looks like the issue originated in the -65 mrg kernel. On -60, oprofile is working fine. So I am looking for the changelog to see the changes, and I also need to look at the kdump.

------- Comment From ankigarg.com 2008-07-21 07:36 EDT-------
So I have been trying to figure out which of the patches between -61 and -65 could have led to the issue. Looking at the patches, there could be a few candidates. Tried removing several patches, but with no gain! Only to realize that the process I was using to generate kernels with some patches removed was erroneous!!!!!!! *grrr* :-( Kicking it off again! And in this process, rt-ipe too went down!

------- Comment From ankigarg.com 2008-07-22 02:05 EDT-------
I cannot reproduce this issue with the -62 kernel. Also, all the oopses seen are related to "Unable to handle kernel paging request at fffffffffffffff8". Taking the cue from this, I removed the patch slab-fix-rt-v2.patch from the -65 kernel and built a new kernel. With this patch removed, oprofile is working fine on the -65 kernel. So this could point to this particular patch as faulty. I will spend some time looking at the patch and the dump. However, it might be faster if Peter takes a look at it. Attaching the patch here for ref.

Created attachment 312310 [details]
Patch that introduces oops with oprofile
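For readers following the oops text, here is a minimal Python sketch that decodes the two values the comments key on: the `Oops: 0002` error code and the faulting address `fffffffffffffff8` from CR2. The bit meanings are the architectural x86 page-fault error code bits; the helper names `decode_page_fault` and `as_signed64` are hypothetical and chosen only for illustration.

```python
# x86 page-fault error code bits (architectural definition).
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: kernel mode, 1: user mode
PF_RSVD = 1 << 3   # reserved bit set in a page-table entry
PF_INSTR = 1 << 4  # fault happened on an instruction fetch


def decode_page_fault(error_code):
    """Hypothetical helper: turn the number after 'Oops:' into readable fields."""
    return {
        "cause": "protection violation" if error_code & PF_PROT else "page not present",
        "access": "write" if error_code & PF_WRITE else "read",
        "mode": "user" if error_code & PF_USER else "kernel",
        "reserved_bit": bool(error_code & PF_RSVD),
        "instruction_fetch": bool(error_code & PF_INSTR),
    }


def as_signed64(value):
    """Hypothetical helper: reinterpret an unsigned 64-bit value as signed."""
    return value - (1 << 64) if value & (1 << 63) else value


oops = decode_page_fault(0x0002)                 # from "Oops: 0002"
fault_offset = as_signed64(0xfffffffffffffff8)   # from CR2

print(oops)
print(fault_offset)
```

Read this way, the oops is a kernel-mode write to a non-present page located 8 bytes below address zero. That pattern is what a store relative to a zero base pointer would produce, for example a push onto a zero stack pointer, but the oops text alone does not establish which.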
------- Comment From ankigarg.com 2008-07-22 03:51 EDT-------
http://lkml.org/lkml/2008/6/10/133 introduces these patches. This series provides fixes for hotplug; not sure how it is affecting us here.

------- Comment From ankigarg.com 2008-07-22 06:30 EDT-------
Pasting some content from the dump -

# crash /usr/lib/debug/lib/modules/2.6.24.7-65ibmrt2.4/vmlinux /test/ankita/vmcore
crash 4.0-5.0.3
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Enter "help copying" to see the conditions. This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: cpu 0 first exception stack: 0 boot_exception_stacks: ffffffff8150a000
KERNEL: /usr/lib/debug/lib/modules/2.6.24.7-65ibmrt2.4/vmlinux
DUMPFILE: /test/ankita/vmcore
CPUS: 8
DATE: Fri Jul 18 01:39:31 2008
UPTIME: 00:01:30
LOAD AVERAGE: 1.66, 0.69, 0.25
TASKS: 257
NODENAME: rt-ipe.austin.ibm.com
RELEASE: 2.6.24.7-65ibmrt2.4
VERSION: #1 SMP PREEMPT RT Fri Jun 6 20:06:47 EDT 2008
MACHINE: x86_64 (3000 Mhz)
MEMORY: 16 GB
PANIC: "Oops: 0002 [1] PREEMPT SMP " (check log for details)
PID: 0
COMMAND: "swapper"
TASK: ffffffff813ae7a0 (1 of 8) [THREAD_INFO: ffffffff8149e000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0 TASK: ffffffff813ae7a0 CPU: 0 COMMAND: "swapper"
 #0 [ffffffff81509b80] machine_kexec at ffffffff81022dcd
 #1 [ffffffff81509c60] crash_kexec at ffffffff8106abd3
 #2 [ffffffff81509d20] __die at ffffffff8128a48b
 #3 [ffffffff81509d50] do_page_fault at ffffffff8128bca1
 #4 [ffffffff81509e30] error_exit at ffffffff81289e19
 [exception RIP: rb_insert_color+6]
 RIP: ffffffff81139a82 RSP: ffffffff81509ee0 RFLAGS: 00010046
 RAX: 0000000000000000 RBX: ffff810001008780 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: ffff8100010086a8 RDI: ffff810001008780
 RBP: ffffffff81509ee8 R8: 0000000000000010 R9: 0000000000000000
 R10: ffff81007fb09000 R11: 0000000000000000 R12: ffff810001008698
 R13: ffff81042a8d44f0 R14: ffff81042a8d44e0 R15: 0000000000000001
 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #5 [ffffffff81509ef0] enqueue_hrtimer at ffffffff81053e48
 #6 [ffffffff81509f20] hrtimer_interrupt at ffffffff81054df7
 #7 [ffffffff81509f70] smp_local_timer_interrupt at ffffffff8101ff97
 #8 [ffffffff81509f90] smp_apic_timer_interrupt at ffffffff810205f9
 #9 [ffffffff81509fb0] apic_timer_interrupt at ffffffff8100ce46
--- <IRQ stack> ---
#10 [ffffffff8149fe98] apic_timer_interrupt at ffffffff8100ce46
 [exception RIP: mwait_idle+110]
 RIP: ffffffff8100afcf RSP: ffffffff8149ff48 RFLAGS: 00000246
 RAX: 0000000000000000 RBX: ffffffff8149ff48 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffffffff8149e010 RDI: ffffffff813b2230
 RBP: 0000000000000001 R8: 0000000000000000 R9: ffff81042e4a9e60
 R10: 0000000000000000 R11: ffff81042e4a9e90 R12: ffffffff8128bddc
 R13: ffffffff8149fee8 R14: ffff81000100b980 R15: ffff81041954f080
 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#11 [ffffffff8149ff40] enter_idle at ffffffff8100abb1
#12 [ffffffff8149ff50] cpu_idle at ffffffff8100ae90

crash> dis 0xffffffff81139a82
0xffffffff81139a82 <rb_insert_color+6>: mov %rsi,%r15
crash> dis 0xffffffff81509ee0
0xffffffff81509ee0 <boot_cpu_stack+16096>: add %eax,(%rax)
crash> rd boot_cpu_stack
ffffffff81506000: 0000000000000000 ........

Note that the slab fix patch makes changes to the stack allocation code. Since oprofile here is using NMI, the oops happens when the switch to the other stack occurs. While I am not sure what the "first exception stack" is supposed to be, in the WARNING below,
WARNING: cpu 0 first exception stack: 0 boot_exception_stacks: ffffffff8150a000
it appears to be zero. My guess would be that the stack allocation code might need some attention. I can provide more output from crash if required.

Created attachment 312406 [details]
Initial patch from peterz
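The stack addresses quoted from crash can be cross-checked with a little arithmetic. The Python sketch below is illustrative only: it uses the three addresses from the crash output (the RSP at the oops, the `boot_cpu_stack` base printed by `rd`, and `boot_exception_stacks` from the WARNING line), and the variable names are made up for this example.

```python
# Addresses copied from the crash output in this report.
RSP = 0xffffffff81509ee0                    # stack pointer at the oops
BOOT_CPU_STACK = 0xffffffff81506000         # base printed by "rd boot_cpu_stack"
BOOT_EXCEPTION_STACKS = 0xffffffff8150a000  # from the crash WARNING line

# Distance of RSP above the start of boot_cpu_stack.
offset_into_stack = RSP - BOOT_CPU_STACK

# Distance from RSP up to the start of boot_exception_stacks.
gap_to_exception_stacks = BOOT_EXCEPTION_STACKS - RSP

# Distance between the two symbols.
span = BOOT_EXCEPTION_STACKS - BOOT_CPU_STACK

print(offset_into_stack)             # 16096
print(hex(gap_to_exception_stacks))  # 0x120
print(hex(span))                     # 0x4000
```

The first result matches the `boot_cpu_stack+16096` that crash reports for the RSP value, so the stack pointer at the time of the oops sat 0x120 bytes below `boot_exception_stacks`. The 0x4000 distance between the two symbols would be consistent with a 16 KiB stack placed directly before the exception stacks, although the report does not state the stack size.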
I applied this patch and started oprofile as Ankita did in the opening comment
on an LS21. The box panic'd within a few seconds. So this patch doesn't
appear to resolve the issue entirely.
Created attachment 312494 [details]
updated patch
move alloc stacks to trap_init()
------- Comment From dvhltc.com 2008-07-23 15:19 EDT-------
I've built with the patch in comment #27, and kicked off oprofile per the opening comment. Been profiling for several minutes and haven't seen a crash. I think this is fixed.

------- Comment From jstultz.com 2008-07-23 20:35 EDT-------
Issue should be fixed in -74, which is available for testing.

------- Comment From ankigarg.com 2008-07-24 00:42 EDT-------
Yes, I too confirm that the patch fixes this issue. Tested the -74 kernel. So, marking this fixed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2008-0585.html