=Comment: #0================================================= Ankita Garg <ankigarg.com> - 2008-07-17 04:34 EDT Problem description: On starting oprofile on the latest R2 kernel, #opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/2.6.24.7-72ibmrt2.5/vmlinux #opcontrol --start I get a kernel oops. I tried several times, and each time I got a different oops (though, the origin might be the same and only looks diff). Sadly enough, could not capture dump. kdump kernel starts booting and then drops into a shell. Manual copy of the core results in out of memory issue. Pasting a few of the oops here: rt-oak.austin.ibm.com login: Unable to handle kernel paging request at fffffffffffffff8 RIP: [<ffffffff8113be87>] clear_page_c+0x7/0x10 PGD 203067 PUD 204067 PMD 0 Oops: 0002 [1] PREEMPT SMP CPU 0 Modules linked in: oprofile ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl dm_multipath scsi_dh iTCO_wdt pata_acpi ata_generic iTCO_vendor_support i5000_edac edac_core usb_storage serio_raw shpchp bnx2 lp parport_pc parport ac battery button i2c_core sbs sbshc video output dm_mirror dm_mod xt_tcpudp ip6t_REJECT ipv6 nfnetlink nf_conntrack_ipv4 xt_state ipt_REJECT x_tables nf_conntrack_netbios_ns nf_conntrack sunrpc rfcomm hidp l2cap bluetooth autofs4 sg mptsas mptscsih mptbase scsi_transport_sas pcspkr ata_piix libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd Pid: 4311, comm: opcontrol Not tainted 2.6.24.7-72ibmrt2.5 #1 RIP: 0010:[<ffffffff8113be87>] [<ffffffff8113be87>] clear_page_c+0x7/0x10 RSP: 0018:ffff81042b5f5bb0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000018f495c0 RCX: 0000000000000003 RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffff810428c39fe8 RBP: ffff81042b5f5c48 R08: ffff81042fc01d14 R09: ffff81042b5f5bc8 R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001 R13: ffff81000001b100 R14: ffffe20018f49560 R15: 0000000000000000 FS: 00002b65ebd31f00(0000) GS:ffffffff813ef100(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 
CR0: 000000008005003b CR2: fffffffffffffff8 CR3: 0000000428498000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process opcontrol (pid: 4311, threadinfo ffff81042b5f4000, task ffff81042c9fc140) Stack: ffffffff8108a191 ffff81042b5f5bc8 000000448110cd5d ffff81000001d670 00000000810b9191 00000040000284d0 ffff81000001d678 0000000200000000 0000000000000000 000000002b5f5c58 ffffffff00000000 0000000100000000 Call Trace: [<ffffffff8108a191>] ? get_page_from_freelist+0x496/0x60b [<ffffffff8108a3c4>] __alloc_pages+0x68/0x312 [<ffffffff8109188d>] ? zone_statistics+0x64/0x69 [<ffffffff81087fc4>] ? put_zone_pcp+0x1f/0x21 [<ffffffff810a439e>] alloc_pages_current+0xa8/0xb1 [<ffffffff81089cbe>] get_zeroed_page+0x11/0x4e [<ffffffff81092c7d>] __pud_alloc+0x1d/0x7a [<ffffffff81095a78>] copy_page_range+0x152/0x6d9 [<ffffffff81287193>] ? rt_spin_lock+0x9/0xb [<ffffffff810a8afb>] ? _slab_irq_disable+0x37/0x5c [<ffffffff8103b03d>] copy_process+0xcf8/0x1626 [<ffffffff8103bb0a>] do_fork+0x75/0x20e [<ffffffff81077795>] ? audit_syscall_entry+0x148/0x17e [<ffffffff8100c37e>] ? 
traceret+0x0/0x5 [<ffffffff8100a630>] sys_clone+0x23/0x25 [<ffffffff8100c517>] ptregscall_common+0x67/0xb0 <The below if from a diff machine> llm49.in.ibm.com login: Unable to handle kernel paging request at fffffffffffffff8 RIP: [<ffffffff811f598c>] acpi_pm_read+0xc/0x13 PGD 203067 PUD 204067 PMD 0 Oops: 0002 [1] PREEMPT SMP CPU 0 Modules linked in: oprofile usb_storage i2c_amd756 i2c_core tg3 serio_raw amd_rng shpchp joydev lp parport_pc parport ac battery button sbs sbshc video output dm_multipath scsi_dh dm_mirror dm_mod xt_tcpudp ip6t_REJECT ipv6 nfnetlink nf_conntrack_ipv4 xt_state ipt_REJECT x_tables nf_conntrack_netbios_ns nf_conntrack sunrpc rfcomm hidp l2cap bluetooth autofs4 sg mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod pcspkr k8temp hwmon k8_edac edac_core ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd Pid: 0, comm: swapper Not tainted 2.6.24.7-72ibmrt2.5 #1 RIP: 0010:[<ffffffff811f598c>] [<ffffffff811f598c>] acpi_pm_read+0xc/0x13 RSP: 0018:ffffffff81504e88 EFLAGS: 00010086 RAX: 00000000088757e5 RBX: 0000000001b89700 RCX: ffff81010f91dd28 RDX: 0000000000002208 RSI: 0000000000000000 RDI: ffffffff81504ed8 RBP: ffffffff81504e88 R08: 0000000000000010 R09: 0000000000000000 R10: ffff810087b0e000 R11: 0000000000000000 R12: ffffffff81504ed8 R13: 0000000000000000 R14: 000006909ad975c0 R15: 0000000000000002 FS: 00002b90582d3f00(0000) GS:ffffffff813ef100(0000) knlGS:00000000f7f116d0 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: fffffffffffffff8 CR3: 000000010f1f2000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff8149a000, task ffffffff813aa7a0) Stack: ffffffff81504ea8 ffffffff81056a44 0000000001b89700 ffffffff81504ed8 ffffffff81504ec8 ffffffff81054983 000006909ad975c0 ffff8100090045a0 ffffffff81504ee8 ffffffff810549c7 00000000487efa76 00000000363c8f37 Call Trace: <IRQ> [<ffffffff81056a44>] 
getnstimeofday+0x31/0x88 [<ffffffff81054983>] ktime_get_ts+0x18/0x4b [<ffffffff810549c7>] ktime_get+0x11/0x42 [<ffffffff810599cb>] tick_program_event+0x2c/0x5a [<ffffffff81054db9>] hrtimer_interrupt+0x183/0x1aa [<ffffffff8101ffd3>] smp_local_timer_interrupt+0x5a/0x5e [<ffffffff81020635>] smp_apic_timer_interrupt+0x3a/0x51 [<ffffffff8100ada1>] ? default_idle+0x0/0x56 [<ffffffff8100ce66>] apic_timer_interrupt+0x66/0x70 <EOI> [<ffffffff8100ade2>] ? default_idle+0x41/0x56 [<ffffffff8100abb1>] ? enter_idle+0x22/0x24 [<ffffffff8100ae90>] ? cpu_idle+0x99/0xf8 [<ffffffff81283e5e>] ? rest_init+0x82/0x84 [<ffffffff814a4b68>] ? start_kernel+0x31e/0x329 [<ffffffff814a4119>] ? _sinittext+0x119/0x120 Code: 0f 97 c0 39 cf 0f 92 c2 21 d0 a8 01 75 bd c9 89 c8 c3 55 48 89 e5 e8 a6 ff ff ff c9 89 c0 c3 55 0f b7 15 78 0f 2a 00 48 89 e5 ed <c9> 25 ff ff ff 00 c3 83 3d 76 34 3c 00 00 55 48 89 e5 75 29 80 RIP [<ffffffff811f598c>] acpi_pm_read+0xc/0x13 RSP <ffffffff81504e88> RIP: 0010:[<ffffffff810bd2b5>] [<ffffffff810bd2b5>] do_select+0x2af/0x4f2 RSP: 0018:ffff81021003ba68 EFLAGS: 00010202 RAX: 0000000000010092 RBX: 0000000000000300 RCX: 0000000000000304 RDX: ffff81021091e090 RSI: ffff81021003bd44 RDI: 0000000000000012 RBP: ffff81021003bd78 R08: 0000000000020000 R09: ffff81021003ba58 R10: 0000000000000003 R11: 0000000000000003 R12: ffff810208cd4080 R13: 0000000000000012 R14: 0000000000040000 R15: ffff81021003bea8 FS: 00002ac6f7c3c080(0000) GS:ffff810211ae4640(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000100ca CR3: 000000020f5a7000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process ntpd (pid: 2660, threadinfo ffff81021003a000, task ffff8102107194c0) Stack: ffff81021003bad8 ffff81021003bf40 0000000000000000 00000000080e0a80 ffff81021003bdc0 ffff81021003bdc8 ffff81021003bdd0 ffff81021003bda8 ffff81021003bdb0 ffff81021003bdb8 
00000000003f0000 0000000000000000 Call Trace: [<ffffffff810bd972>] ? __pollwait+0x0/0xdf [<ffffffff81033423>] ? default_wake_function+0x0/0x14 [<ffffffff81033423>] ? default_wake_function+0x0/0x14 [<ffffffff81033423>] ? default_wake_function+0x0/0x14 [<ffffffff81033423>] ? default_wake_function+0x0/0x14 [<ffffffff81033423>] ? default_wake_function+0x0/0x14 [<ffffffff81033423>] ? default_wake_function+0x0/0x14 [<ffffffff810a8c60>] ? __cache_free+0x3b/0x203 [<ffffffff81287193>] ? rt_spin_lock+0x9/0xb [<ffffffff81053d8e>] ? enqueue_hrtimer+0xda/0xe8 [<ffffffff81054753>] ? hrtimer_start+0x136/0x17f [<ffffffff810549b1>] ? ktime_get_ts+0x46/0x4b [<ffffffff810bd6be>] core_sys_select+0x1c6/0x275 [<ffffffff81047966>] ? recalc_sigpending+0x12/0x41 [<ffffffff8100bc83>] ? do_notify_resume+0x71e/0x7db [<ffffffff810bdb05>] sys_select+0xb4/0x176 [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb [<ffffffff8100c37e>] traceret+0x0/0x5 Code: 58 fd ff ff 0f 84 b0 00 00 00 48 8d 75 cc 44 89 ef e8 e8 3c ff ff 48 85 c0 49 89 c4 0f 84 93 00 00 00 48 8b 40 68 48 85 c0 74 23 <48> 8b 40 38 48 85 c0 74 1a 31 f6 83 bd 78 fd ff ff 00 4c 89 e7 RIP [<ffffffff810bd2b5>] do_select+0x2af/0x4f2 RSP <ffff81021003ba68> llm49.in.ibm.com login: Unable to handle kernel paging request at fffffffffffffff8 RIP: [<ffffffff81083f18>] find_get_page+0x5a/0x173 PGD 203067 PUD 204067 PMD 0 Oops: 0002 [1] PREEMPT SMP CPU 0 Modules linked in: oprofile usb_storage i2c_amd756 i2c_core tg3 serio_raw amd_rng shpchp joydev lp parport_pc parport ac battery button sbs sbshc video output dm_multipath scsi_dh dm_mirror dm_mod xt_tcpudp ip6t_REJECT ipv6 nfnetlink nf_conntrack_ipv4 xt_state ipt_REJECT x_tables nf_conntrack_netbios_ns nf_conntrack sunrpc rfcomm hidp l2cap bluetooth autofs4 sg mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod pcspkr k8temp hwmon k8_edac edac_core ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd Pid: 3357, comm: opcontrol Not tainted 2.6.24.7-72ibmrt2.5 #1 RIP: 0010:[<ffffffff81083f18>] 
[<ffffffff81083f18>] find_get_page+0x5a/0x173 RSP: 0000:ffff81020f96dc48 EFLAGS: 00010283 RAX: ffffe2000c63ffbf RBX: ffffe2000c63ffc0 RCX: 000000000000002b RDX: ffff810210a84368 RSI: ffffe2000c63ffc8 RDI: 0000000000000000 RBP: ffff81020f96dcb8 R08: ffff81021114b480 R09: 0000000000000000 R10: 00002b2f1344ff90 R11: 0000000000000246 R12: ffffe2000c63ffc0 R13: ffff810210a844f8 R14: 000000000000006f R15: ffff810211437840 FS: 00002b2f1344ff00(0000) GS:ffffffff813ef100(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: fffffffffffffff8 CR3: 000000020f5f5000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process opcontrol (pid: 3357, threadinfo ffff81020f96c000, task ffff81021114b480) Stack: ffff81020f96dc58 0000004481087fc4 ffff810000016231 000000008108a196 00000040001200d2 ffff810000016239 00000002000157f9 0000000000000000 00000000000284d0 ffff81020c0187a0 ffff810211437840 000000000000006f Call Trace: [<ffffffff810843a5>] find_lock_page+0x1e/0x5d [<ffffffff810863e9>] filemap_fault+0x79/0x399 [<ffffffff81092dae>] __do_fault+0x65/0x3b6 [<ffffffff81093564>] ? do_wp_page+0x465/0x4e5 [<ffffffff810949a5>] handle_mm_fault+0x202/0x764 [<ffffffff812895d0>] do_page_fault+0x3ba/0x76d [<ffffffff81077adc>] ? audit_syscall_exit+0x311/0x332 [<ffffffff812879e9>] error_exit+0x0/0x51 Code: 84 25 01 00 00 48 8b 18 49 83 cc ff f6 c3 01 4c 0f 44 e3 49 8d 44 24 ff 4c 89 e3 48 83 f8 fd 77 cc 41 8b 4c 24 08 49 8d 74 24 08 <85> c9 74 be 8d 41 01 48 63 d1 48 63 f8 48 89 d0 f0 0f b1 3e 39 RIP [<ffffffff81083f18>] find_get_page+0x5a/0x173 RSP <ffff81020f96dc48> I am yet to try it on HS21 and on mainline kernel. 
=Comment: #1================================================= Ankita Garg <ankigarg.com> - 2008-07-17 05:15 EDT Could not reproduce the issue with mainline rt kernel 2.6.25.8-rt7 =Comment: #2================================================= Ankita Garg <ankigarg.com> - 2008-07-17 05:27 EDT Will quickly try and verify if the issue is reproducible on MRG kernel and will then mirror bug to RH. =Comment: #3================================================= Ankita Garg <ankigarg.com> - 2008-07-17 05:54 EDT As expected, can reproduce the bug with MRG kernel as well. Here is the oops I got: rt-beech.austin.ibm.com login: Unable to handle kernel paging request at fffffffffffffff8 RIP: [<ffffffff810b3af1>] copy_strings_kernel+0x32/0x3d PGD 203067 PUD 204067 PMD 0 Oops: 0002 [1] PREEMPT SMP CPU 0 Modules linked in: oprofile ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl ipv6 autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_multipath dm_mod video output sbs sbshc battery ac parport_pc lp parport sg bnx2 button serio_raw k8temp k8_edac edac_core hwmon shpchp pcspkr usb_storage mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd Pid: 3416, comm: opcontrol Not tainted 2.6.24.7-72.el5rt #1 RIP: 0010:[<ffffffff810b3af1>] [<ffffffff810b3af1>] copy_strings_kernel+0x32/0x3d RSP: 0018:ffff81021fccfeb8 EFLAGS: 00010292 RAX: 0000000000000000 RBX: ffff810000000000 RCX: 0000000000000000 RDX: ffff81021fccffd8 RSI: ffff81022d033009 RDI: ffffe2000cc253e0 RBP: ffff81021fccfec8 R08: 0000000000000001 R09: ffff810000015670 R10: 0000000000000000 R11: 0000000000000002 R12: ffff81022b98f080 R13: 0000000000000080 R14: ffff81022d033000 R15: 000000000071b370 FS: 00002b1da6050f10(0000) GS:ffffffff813ef100(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: fffffffffffffff8 CR3: 000000022cd72000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 
0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process opcontrol (pid: 3416, threadinfo ffff81021fcce000, task ffff81022ccf0b60) Stack: ffff81021fccfec8 ffff81022c0b1dc0 ffff81021fccff18 ffffffff810b5334 00000000007293a0 ffff81021fccff58 0000000000720ec0 ffff81022d033000 000000000071b370 0000000000720ec0 ffff81022d033000 0000000000000000 Call Trace: [<ffffffff810b5334>] do_execve+0x108/0x205 [<ffffffff8100ac30>] sys_execve+0x36/0x8a [<ffffffff8100c5c7>] stub_execve+0x67/0xb0 Code: ec 08 65 48 8b 04 25 10 00 00 00 48 8b 98 48 e0 ff ff 48 c7 80 48 e0 ff ff ff ff ff ff e8 14 fe ff ff 65 48 8b 14 25 10 00 00 00 <48> 89 9a 48 e0 ff ff 59 5b c9 c3 55 48 89 e5 41 57 41 56 41 55 RIP [<ffffffff810b3af1>] copy_strings_kernel+0x32/0x3d RSP <ffff81021fccfeb8> CR2: fffffffffffffff8 =Comment: #5================================================= Ankita Garg <ankigarg.com> - 2008-07-17 07:09 EDT So this is weird. Every time the machine panics, on the second reboot, it fails to come up, with the following error: Waiting on MM for permission to boot..... CP: CD I9990306 Timed Out Waiting on MM:System Halted in POST This is a known issue, however, due to this, the machine up-time is rather limited !
------- Comment From ankigarg.com 2008-07-18 04:36 EDT------- Could not reproduce this issue with the alpha01 - R2 (2.6.24.3-29ibmrt2.0) kernel. So this issue was introduced at a later point in time.
------- Comment From ankigarg.com 2008-07-18 05:40 EDT------- Can reproduce the bug on alpha18 kernel (after alpha14, I tried alpha18). This is the -65 kernel. I get the following oops. rt-ipe.austin.ibm.com login: Unable to handle kernel paging request at fffffffffffffff8 RIP: [<ffffffff81139a82>] rb_insert_color+0x6/0xe3 PGD 203067 PUD 204067 PMD 0 Oops: 0002 [1] PREEMPT SMP CPU 0 Modules linked in: oprofile ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl ipv6 autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_multipath dm_mod video output sbs sbshc battery ac parport_pc lp parport sg bnx2 button i5000_edac edac_core pata_acpi pcspkr ata_generic iTCO_wdt iTCO_vendor_support shpchp ata_piix libata mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd Pid: 0, comm: swapper Not tainted 2.6.24.7-65ibmrt2.4 #1 RIP: 0010:[<ffffffff81139a82>] [<ffffffff81139a82>] rb_insert_color+0x6/0xe3 RSP: 0018:ffffffff81509ee0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff810001008780 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffff8100010086a8 RDI: ffff810001008780 RBP: ffffffff81509ee8 R08: 0000000000000010 R09: 0000000000000000 R10: ffff81007fb09000 R11: 0000000000000000 R12: ffff810001008698 R13: ffff81042a8d44f0 R14: ffff81042a8d44e0 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffffffff813f3100(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: fffffffffffffff8 CR3: 000000041956e000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff8149e000, task ffffffff813ae7a0) Stack: 0000000000000001 ffffffff81509f18 ffffffff81053e48 ffff810001008780 ffff810001008698 ffff810001008640 7fffffffffffffff ffffffff81509f68 ffffffff81054df7 000000152c077bb2 000000152c077bb2 0000000000000000 Call Trace: <IRQ> [<ffffffff81053e48>] 
enqueue_hrtimer+0xda/0xe8 [<ffffffff81054df7>] hrtimer_interrupt+0x136/0x1ab [<ffffffff8101ff97>] smp_local_timer_interrupt+0x5a/0x5e [<ffffffff810205f9>] smp_apic_timer_interrupt+0x3a/0x51 [<ffffffff8100af61>] ? mwait_idle+0x0/0x73 [<ffffffff8100ce46>] apic_timer_interrupt+0x66/0x70 <EOI> [<ffffffff8100afcf>] ? mwait_idle+0x6e/0x73 [<ffffffff8100abb1>] ? enter_idle+0x22/0x24 [<ffffffff8100ae90>] ? cpu_idle+0x99/0xf8 [<ffffffff8128648e>] ? rest_init+0x82/0x84 [<ffffffff814a8b70>] ? start_kernel+0x31e/0x329 [<ffffffff814a8119>] ? _sinittext+0x119/0x120 Code: 74 12 49 3b 78 08 75 06 49 89 48 08 eb 09 49 89 48 10 eb 03 48 89 0e 48 8b 07 83 e0 03 48 09 c1 48 89 0f c9 c3 55 48 89 e5 41 57 <49> 89 f7 41 56 41 55 49 89 fd 41 54 53 e9 a1 00 00 00 49 89 c4 RIP [<ffffffff81139a82>] rb_insert_color+0x6/0xe3 RSP <ffffffff81509ee0> Initializing cgroup subsys cpuset
------- Comment From ankigarg.com 2008-07-18 06:02 EDT------- (In reply to comment #11) > Can reproduce the bug on alpha18 kernel (after alpha14, I tried alpha18). This > is the -65 kernel. I get the following oops. > So luckily this time I was able to capture a dump !! The vmcore can be found here if anyone wants to take a look: http://kernel.beaverton.ibm.com/jtcltc/kdump_cores/bz46482/vmcore.bz2 Will investigate the dump and also try another R2 kernel between alpha14 & 18
------- Comment From ankigarg.com 2008-07-18 07:59 EDT------- So it looks like the issue originated in the -65 MRG kernel. On -60, oprofile works fine. I am now looking for the changelog to see what changed, and I still need to look at the kdump.
------- Comment From ankigarg.com 2008-07-21 07:36 EDT------- So I have been trying to figure out which of the patches between -61 and -65 could have led to the issue. Looking at the patches, there are a few candidates. I tried removing several of them, but with no gain! Only then did I realize that the process I was using to generate kernels with some patches removed was erroneous! *grrr* :-( Kicking it off again! And in the process, rt-ipe went down too!
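The patch-removal search described in this comment can be sketched as a leave-one-out loop. This is only an illustration in Python, not tooling that was used on this bug: the patch names and the `reproduces` callback are hypothetical stand-ins for "build a kernel from this patch set, boot it, start oprofile, and see whether it oopses".

```python
from typing import Callable, Optional, Sequence


def find_culprit(
    patches: Sequence[str],
    reproduces: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the first patch whose removal makes the bug disappear.

    `reproduces(patch_set)` is assumed to build and boot a kernel with
    exactly `patch_set` applied and report whether the oops still occurs.
    """
    for candidate in patches:
        remaining = [p for p in patches if p != candidate]
        if not reproduces(remaining):
            return candidate
    return None  # no single patch removal makes the bug go away


# Hypothetical example: pretend "patch-c.patch" is the faulty one.
candidates = ["patch-a.patch", "patch-b.patch", "patch-c.patch", "patch-d.patch"]


def fake_reproduces(patch_set: Sequence[str]) -> bool:
    # Stand-in for a real build-and-test cycle.
    return "patch-c.patch" in patch_set


culprit = find_culprit(candidates, fake_reproduces)
print(culprit)
```

A loop like this costs one kernel build and one test boot per candidate, and it only finds a culprit when removing a single patch is enough to make the problem go away.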
------- Comment From ankigarg.com 2008-07-22 02:05 EDT------- I cannot reproduce this issue with the -62 kernel. Also, all the oopses seen are of the form "Unable to handle kernel paging request at fffffffffffffff8". Taking the cue from this, I removed the patch slab-fix-rt-v2.patch from the -65 kernel and built a new kernel. With this patch removed, oprofile works fine on the -65 kernel, which points to this particular patch as faulty. I will spend some time looking at the patch and the dump. However, it might be faster if Peter takes a look at it. Attaching the patch here for reference.
Created attachment 312310 [details] Patch that introduces oops with oprofile
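The common fault address noted above can be pulled out of the console logs mechanically. The following Python sketch is an assumption about how one might do that, not something used in this bug: `summarize_oops` and `to_signed64` are hypothetical helpers, and the regular expression is written only against the oops header format pasted in this report. The sample text is two headers copied from the oopses above. The sketch also shows that fffffffffffffff8, read as a signed 64-bit value, is -8.

```python
import re

# Matches the first line of an oops as pasted in this report, e.g.
# "Unable to handle kernel paging request at <addr> RIP: [<ip>] sym+0x7/0x10"
OOPS_RE = re.compile(
    r"Unable to handle kernel paging request at (?P<addr>[0-9a-f]{16})\s+"
    r"RIP:\s+\[<[0-9a-f]{16}>\]\s+(?P<sym>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def to_signed64(value: int) -> int:
    """Interpret an unsigned 64-bit value as two's-complement signed."""
    return value - (1 << 64) if value >= (1 << 63) else value


def summarize_oops(log: str):
    """Return (faulting symbol, fault address, signed address) per oops."""
    return [
        (m.group("sym"), m.group("addr"), to_signed64(int(m.group("addr"), 16)))
        for m in OOPS_RE.finditer(log)
    ]


sample = (
    "Unable to handle kernel paging request at fffffffffffffff8 "
    "RIP: [<ffffffff8113be87>] clear_page_c+0x7/0x10\n"
    "Unable to handle kernel paging request at fffffffffffffff8 "
    "RIP: [<ffffffff811f598c>] acpi_pm_read+0xc/0x13\n"
)

summary = summarize_oops(sample)
for symbol, address, signed in summary:
    print(symbol, address, signed)
```

Grouping the results by fault address rather than by faulting function makes it easier to see that oopses in unrelated code paths share the same bad address, which is consistent with the reporter's suspicion that they have a single origin.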
------- Comment From ankigarg.com 2008-07-22 03:51 EDT------- http://lkml.org/lkml/2008/6/10/133 introduces these patches. The series provides fixes for hotplug, so I am not sure how it is affecting us here.
------- Comment From ankigarg.com 2008-07-22 06:30 EDT------- Pasting some content from the dump - # crash /usr/lib/debug/lib/modules/2.6.24.7-65ibmrt2.4/vmlinux /test/ankita/vmcore crash 4.0-5.0.3 Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008 Red Hat, Inc. Copyright (C) 2004, 2005, 2006 IBM Corporation Copyright (C) 1999-2006 Hewlett-Packard Co Copyright (C) 2005, 2006 Fujitsu Limited Copyright (C) 2006, 2007 VA Linux Systems Japan K.K. Copyright (C) 2005 NEC Corporation Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc. Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc. This program is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Enter "help copying" to see the conditions. This program has absolutely no warranty. Enter "help warranty" for details. GNU gdb 6.1 Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-unknown-linux-gnu"... 
WARNING: cpu 0 first exception stack: 0 boot_exception_stacks: ffffffff8150a000 KERNEL: /usr/lib/debug/lib/modules/2.6.24.7-65ibmrt2.4/vmlinux DUMPFILE: /test/ankita/vmcore CPUS: 8 DATE: Fri Jul 18 01:39:31 2008 UPTIME: 00:01:30 LOAD AVERAGE: 1.66, 0.69, 0.25 TASKS: 257 NODENAME: rt-ipe.austin.ibm.com RELEASE: 2.6.24.7-65ibmrt2.4 VERSION: #1 SMP PREEMPT RT Fri Jun 6 20:06:47 EDT 2008 MACHINE: x86_64 (3000 Mhz) MEMORY: 16 GB PANIC: "Oops: 0002 [1] PREEMPT SMP " (check log for details) PID: 0 COMMAND: "swapper" TASK: ffffffff813ae7a0 (1 of 8) [THREAD_INFO: ffffffff8149e000] CPU: 0 STATE: TASK_RUNNING (PANIC) crash> bt PID: 0 TASK: ffffffff813ae7a0 CPU: 0 COMMAND: "swapper" #0 [ffffffff81509b80] machine_kexec at ffffffff81022dcd #1 [ffffffff81509c60] crash_kexec at ffffffff8106abd3 #2 [ffffffff81509d20] __die at ffffffff8128a48b #3 [ffffffff81509d50] do_page_fault at ffffffff8128bca1 #4 [ffffffff81509e30] error_exit at ffffffff81289e19 [exception RIP: rb_insert_color+6] RIP: ffffffff81139a82 RSP: ffffffff81509ee0 RFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff810001008780 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffff8100010086a8 RDI: ffff810001008780 RBP: ffffffff81509ee8 R8: 0000000000000010 R9: 0000000000000000 R10: ffff81007fb09000 R11: 0000000000000000 R12: ffff810001008698 R13: ffff81042a8d44f0 R14: ffff81042a8d44e0 R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffffffff81509ef0] enqueue_hrtimer at ffffffff81053e48 #6 [ffffffff81509f20] hrtimer_interrupt at ffffffff81054df7 #7 [ffffffff81509f70] smp_local_timer_interrupt at ffffffff8101ff97 #8 [ffffffff81509f90] smp_apic_timer_interrupt at ffffffff810205f9 #9 [ffffffff81509fb0] apic_timer_interrupt at ffffffff8100ce46 --- <IRQ stack> --- #10 [ffffffff8149fe98] apic_timer_interrupt at ffffffff8100ce46 [exception RIP: mwait_idle+110] RIP: ffffffff8100afcf RSP: ffffffff8149ff48 RFLAGS: 00000246 RAX: 0000000000000000 RBX: ffffffff8149ff48 RCX: 0000000000000000 RDX: 0000000000000000 
RSI: ffffffff8149e010 RDI: ffffffff813b2230 RBP: 0000000000000001 R8: 0000000000000000 R9: ffff81042e4a9e60 R10: 0000000000000000 R11: ffff81042e4a9e90 R12: ffffffff8128bddc R13: ffffffff8149fee8 R14: ffff81000100b980 R15: ffff81041954f080 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #11 [ffffffff8149ff40] enter_idle at ffffffff8100abb1 #12 [ffffffff8149ff50] cpu_idle at ffffffff8100ae90 crash> dis 0xffffffff81139a82 0xffffffff81139a82 <rb_insert_color+6>: mov %rsi,%r15 crash> dis 0xffffffff81509ee0 0xffffffff81509ee0 <boot_cpu_stack+16096>: add %eax,(%rax) crash> rd boot_cpu_stack ffffffff81506000: 0000000000000000 ........ Note that the slab fix patch makes changes to the stack allocation code. Since oprofile here is using NMI, the oops happens when the switch to other stack happens. While I am not sure what the "first exception stack" is supposed to be, in the WARNING below, WARNING: cpu 0 first exception stack: 0 boot_exception_stacks: ffffffff8150a000 it appears to be zero. My guess would be that the stack allocation code might need some attention. I can provide more output from crash if required.
Created attachment 312406 [details] Initial patch from peterz I applied this patch and started oprofile as Ankita did in the opening comment on an LS21. The box panic'd within a few seconds. So this patch doesn't appear to resolve the issue entirely.
Created attachment 312494 [details] updated patch move alloc stacks to trap_init()
------- Comment From dvhltc.com 2008-07-23 15:19 EDT------- I've built with the patch in comment #27, and kicked off oprofile per the opening comment. Been profiling for several minutes and haven't seen a crash. I think this is fixed.
------- Comment From jstultz.com 2008-07-23 20:35 EDT------- Issue should be fixed in -74, which is available for testing.
------- Comment From ankigarg.com 2008-07-24 00:42 EDT------- Yes, I too confirm that the patch fixes this issue. Tested the -74 kernel. So, marking this fixed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0585.html