| Summary: | gdb session is oopsing latest kernel releases (F15, F16, hosts, guests) | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Gilboa Davara <gilboad> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 16 | CC: | fche, gansalmon, gleb, itamar, jonathan, kernel-maint, madhu.chinakonda, onestero |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2012-09-06 12:17:19 UTC | Type: | --- |
|
Description
Gilboa Davara
2012-01-05 10:44:03 UTC
What were you doing in gdb when this oopsed? Also, was the VM you tested in running on a host that was using the nVidia drivers?

My co-workers and I were debugging our proprietary software under gdb. In many cases, a simple "run" was enough to trigger the oops.
We managed to reproduce the issue on guests that were running under nVidia-less hosts.
- Gilboa

Just to be certain, I'll set up a headless server w/ VMs to see if this OOPs reproduces reliably.
- Gilboa

I found a similar report from almost a year ago, also involving ptrace. It doesn't seem to have resolved itself though.
http://marc.info/?l=linux-kernel&m=129198247531612&w=2

(In reply to comment #0)
> [ 7746.777215] [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
> [ 7746.777215] [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
> [ 7746.777215] [<ffffffff8106520d>] update_process_times+0x65/0x76
> [ 7746.777215] [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
> [ 7746.777215] [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
> [ 7746.777215] [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
> [ 7746.777215] [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
> [ 7746.777215] [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
> [ 7746.777215] [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
> [ 7746.777215] <EOI>
> [ 7746.777215] [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8
> [ 7746.777215] [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd
> [ 7746.777215] [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
> [ 7746.777215] [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21
> [ 7746.777215] [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9

Is this trace always the same? Did you use perf?

I can hardly believe this has something to do with ptrace...

Gleb, do you think this can be somehow connected with your fixes in perf_sched_in ?
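One way to check whether two traces are "the same" is to drop the addresses and offsets and compare the symbol sequences. The sketch below is only an illustration: the helper names (`frames`, `common_prefix`) are made up here, the regular expression is an assumption based on the trace lines pasted in this report, and the two samples are abbreviated copies of the trace from comment #0 and of the 2.6.41.7 trace posted later in this thread.

```python
import re

# One call-trace frame as it appears in this report, e.g.
#   [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
# An optional "? " marks a frame the unwinder could not confirm.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frames(log_text, include_unreliable=False):
    """Return the symbol names of a call trace, outermost frame last."""
    names = []
    for unreliable, symbol in FRAME_RE.findall(log_text):
        if unreliable and not include_unreliable:
            continue  # skip "?" frames
        names.append(symbol)
    return names


def common_prefix(a, b):
    """Return the leading frames that two traces share."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


# Abbreviated sample: trace quoted from comment #0.
TRACE_A = """
[<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
[<ffffffff810528d8>] scheduler_tick+0xd0/0x260
[<ffffffff8106520d>] update_process_times+0x65/0x76
[<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
[<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
[<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
[<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
[<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
[<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
[<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
[<ffffffff81061af3>] ptrace_resume+0xbb/0xc9
"""

# Abbreviated sample: 2.6.41.7 trace from later in this thread.
TRACE_B = """
[<ffffffff8110d7a0>] perf_event_task_tick+0x80/0x290
[<ffffffff8106478c>] scheduler_tick+0xdc/0x280
[<ffffffff8107cb0e>] update_process_times+0x6e/0x90
[<ffffffff8109f694>] tick_sched_timer+0x64/0xc0
[<ffffffff810920e0>] __run_hrtimer+0x70/0x1e0
[<ffffffff8109f630>] ? tick_nohz_handler+0x100/0x100
[<ffffffff81092a5b>] hrtimer_interrupt+0xeb/0x210
[<ffffffff815b9f49>] smp_apic_timer_interrupt+0x69/0x99
[<ffffffff815b7e1e>] apic_timer_interrupt+0x6e/0x80
[<ffffffff811808b6>] link_path_walk+0x76/0x880
[<ffffffff81172827>] do_sys_open+0xf7/0x1d0
"""

shared = common_prefix(frames(TRACE_A), frames(TRACE_B))
print(len(shared), "shared frames, ending at", shared[-1])
```

On these two samples the shared part is the timer-interrupt path down to perf_event_task_tick; what differs is the code that was interrupted (ptrace_resume in one case, an open() path in the other), which fits the doubt that ptrace itself is involved.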
(In reply to comment #5)
> (In reply to comment #0)
> > [ 7746.777215] [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
> > [ 7746.777215] [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
> > [ 7746.777215] [<ffffffff8106520d>] update_process_times+0x65/0x76
> > [ 7746.777215] [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
> > [ 7746.777215] [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
> > [ 7746.777215] [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
> > [ 7746.777215] [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
> > [ 7746.777215] [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
> > [ 7746.777215] [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
> > [ 7746.777215] <EOI>
> > [ 7746.777215] [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8
> > [ 7746.777215] [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd
> > [ 7746.777215] [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
> > [ 7746.777215] [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21
> > [ 7746.777215] [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9
>
> Is this trace always the same? Did you use perf?
>
> I can hardly believe this has something to do with ptrace...
>
> Gleb, do you think this can be somehow connected with your
> fixes in perf_sched_in ?

The trace looks similar to traces I got while debugging perf. But to trigger it, perf needs to be running. Does this kernel have a fix for perf_sched_in and jump_label?
Also, here is a bug report with a similar trace where those fixes made it disappear: http://lkml.org/lkml/2011/11/5/101

(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #0)
> > > [ 7746.777215] [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
> > > [ 7746.777215] [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
> > > [ 7746.777215] [<ffffffff8106520d>] update_process_times+0x65/0x76
> > > [ 7746.777215] [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
> > > [ 7746.777215] [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
> > > [ 7746.777215] [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
> > > [ 7746.777215] [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
> > > [ 7746.777215] [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
> > > [ 7746.777215] [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
> > > [ 7746.777215] <EOI>
> > > [ 7746.777215] [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8
> > > [ 7746.777215] [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd
> > > [ 7746.777215] [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
> > > [ 7746.777215] [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21
> > > [ 7746.777215] [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9
> >
> > Is this trace always the same? Did you use perf?
> >
> > I can hardly believe this has something to do with ptrace...
> >
> > Gleb, do you think this can be somehow connected with your
> > fixes in perf_sched_in ?
>
> The trace looks similar to traces I got while debugging perf. But to trigger it
> perf needs to be running. Does this kernel have a fix for perf_sched_in and
> jump_label?

1d5f003f5a964711853514b04ddc872eec0fdc7b and bbbf7af4bf8fc69bc751818cf30521080fa47dcb are in 3.2-rc5 (ish). The latter is in 3.1.5. The former isn't queued for -stable at all from what I can tell.

So, the particular kernel in this bug report doesn't have either of those fixes. The 2.6.41.7 kernel in fedora updates-testing has the jump_label fix, but not the other.
> Also here is bug report with similar trace where those fixes made it disappear:
> http://lkml.org/lkml/2011/11/5/101

Gilboa, if you can recreate this fairly easily please let us know. Also, trying the 2.6.41.7 kernel in updates-testing would be a good idea as well.

Managed to crash 2.6.41.7.
We're not using perf; we simply use gdb to test our proprietary software (run XXX).

Callstack:
[15758.642011] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[15758.642011] IP: [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130
[15758.642011] PGD 181e6d067 PUD 181d04067 PMD 0
[15758.642011] Oops: 0000 [#1] SMP
[15758.642011] CPU 5
[15758.642011] Modules linked in: lockd uinput 8021q garp stp llc sunrpc ipt_LOG xt_state bnep bluetooth iptable_nat nf_nat nf_conntrack_ipv4 rfkill nf_conntrack nf_defrag_ipv4 snd_hda_intel snd_hda_codec snd_hwdep i2c_piix4 snd_seq ppdev snd_seq_device joydev snd_pcm parport_pc i2c_core parport snd_timer snd microcode soundcore snd_page_alloc e1000 8139cp mii ipv6 [last unloaded: scsi_wait_scan]
[15758.642011]
[15758.642011] Pid: 16385, comm: gdb Tainted: G W 2.6.41.7-1.fc15.x86_64 #1 Bochs Bochs
[15758.642011] RIP: 0010:[<ffffffff8110d639>] [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130
[15758.642011] RSP: 0018:ffff88019fd43d88 EFLAGS: 00010086
[15758.642011] RAX: 0000000000000000 RBX: fffffffffffffff0 RCX: 0000000000000000
[15758.642011] RDX: ffff88019fd55fa8 RSI: 00000000000f41a8 RDI: ffff880181f1e300
[15758.642011] RBP: ffff88019fd43db8 R08: 0000000000989680 R09: 0000000000000020
[15758.642011] R10: 0000000000000400 R11: 0000000000000000 R12: ffff880181f1e350
[15758.642011] R13: 00000000000f41a8 R14: ffff88019fd4f930 R15: ffff880195145cc0
[15758.642011] FS: 00007f423267e720(0000) GS:ffff88019fd40000(0000) knlGS:0000000000000000
[15758.642011] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15758.642011] CR2: 0000000000000048 CR3: 000000017fcdc000 CR4: 00000000000006e0
[15758.642011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15758.642011] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[15758.642011] Process gdb (pid: 16385, threadinfo ffff880181cc0000, task ffff880195145cc0)
[15758.642011] Stack:
[15758.642011] ffff88019fd43d98 0000000000000000 ffff88019fd43db8 ffff88019fd55fb0
[15758.642011] ffff88019fd4f860 ffff88019fd56080 ffff88019fd43e18 ffffffff8110d7a0
[15758.642011] ffff88017fcdae00 00000000000f41a8 0000000100000000 ffff880181f1e300
[15758.642011] Call Trace:
[15758.642011] <IRQ>
[15758.642011] [<ffffffff8110d7a0>] perf_event_task_tick+0x80/0x290
[15758.642011] [<ffffffff8106478c>] scheduler_tick+0xdc/0x280
[15758.642011] [<ffffffff8107cb0e>] update_process_times+0x6e/0x90
[15758.642011] [<ffffffff8109f694>] tick_sched_timer+0x64/0xc0
[15758.642011] [<ffffffff810920e0>] __run_hrtimer+0x70/0x1e0
[15758.642011] [<ffffffff8109f630>] ? tick_nohz_handler+0x100/0x100
[15758.642011] [<ffffffff8103ac39>] ? kvm_clock_get_cycles+0x9/0x10
[15758.642011] [<ffffffff81092a5b>] hrtimer_interrupt+0xeb/0x210
[15758.642011] [<ffffffff815b9f49>] smp_apic_timer_interrupt+0x69/0x99
[15758.642011] [<ffffffff815b7e1e>] apic_timer_interrupt+0x6e/0x80
[15758.642011] <EOI>
[15758.642011] [<ffffffff8117eb35>] ? inode_permission+0x25/0x100
[15758.642011] [<ffffffff8115d4b4>] ? kmem_cache_free+0x104/0x110
[15758.642011] [<ffffffff811808b6>] link_path_walk+0x76/0x880
[15758.642011] [<ffffffff8115e58c>] ? kmem_cache_alloc_trace+0x10c/0x140
[15758.642011] [<ffffffff812843da>] ? selinux_file_alloc_security+0x4a/0x80
[15758.642011] [<ffffffff8118076d>] ? path_init+0x2cd/0x3a0
[15758.642011] [<ffffffff811829a8>] path_openat+0xb8/0x3c0
[15758.642011] [<ffffffff8107e56b>] ? recalc_sigpending+0x3b/0x90
[15758.642011] [<ffffffff8107ec97>] ? __set_task_blocked+0x37/0x80
[15758.642011] [<ffffffff81182dd2>] do_filp_open+0x42/0xa0
[15758.642011] [<ffffffff8117e43b>] ? getname_flags+0x3b/0x260
[15758.642011] [<ffffffff8118ea9f>] ? alloc_fd+0x4f/0x150
[15758.642011] [<ffffffff81172827>] do_sys_open+0xf7/0x1d0
[15758.642011] [<ffffffff810ca2e2>] ? audit_syscall_entry+0x242/0x360
[15758.642011] [<ffffffff81172920>] sys_open+0x20/0x30
[15758.642011] [<ffffffff815b7342>] system_call_fastpath+0x16/0x1b
[15758.642011] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25
[15758.642011] RIP [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130
[15758.642011] RSP <ffff88019fd43d88>
[15758.642011] CR2: 0000000000000048
[15758.642011] ---[ end trace bb0af427d549d80a ]---
[15758.642011] Kernel panic - not syncing: Fatal exception in interrupt
[15758.642011] Pid: 16385, comm: gdb Tainted: G D W 2.6.41.7-1.fc15.x86_64 #1
[15758.642011] Call Trace:
[15758.642011] <IRQ> [<ffffffff815a49dc>] panic+0x91/0x1a7
[15758.642011] [<ffffffff815b088a>] oops_end+0xea/0xf0
[15758.642011] [<ffffffff815a42c1>] no_context+0x209/0x218
[15758.642011] [<ffffffff815a4499>] __bad_area_nosemaphore+0x1c9/0x1e8
[15758.642011] [<ffffffff8103ba65>] ? pvclock_clocksource_read+0x55/0xf0
[15758.642011] [<ffffffff8105c76f>] ? update_group_power+0x9f/0x130
[15758.642011] [<ffffffff815a44cb>] bad_area_nosemaphore+0x13/0x15
[15758.642011] [<ffffffff815b3076>] do_page_fault+0x416/0x4f0
[15758.642011] [<ffffffff815b28b5>] do_async_page_fault+0x35/0x80
[15758.642011] [<ffffffff815afbe5>] async_page_fault+0x25/0x30
[15758.642011] [<ffffffff8110d639>] ? perf_ctx_adjust_freq+0x49/0x130
[15758.642011] [<ffffffff8103ac29>] ? kvm_clock_read+0x19/0x20
[15758.642011] [<ffffffff8110d7a0>] perf_event_task_tick+0x80/0x290
[15758.642011] [<ffffffff8106478c>] scheduler_tick+0xdc/0x280
[15758.642011] [<ffffffff8107cb0e>] update_process_times+0x6e/0x90
[15758.642011] [<ffffffff8109f694>] tick_sched_timer+0x64/0xc0
[15758.642011] [<ffffffff810920e0>] __run_hrtimer+0x70/0x1e0
[15758.642011] [<ffffffff8109f630>] ? tick_nohz_handler+0x100/0x100
[15758.642011] [<ffffffff8103ac39>] ? kvm_clock_get_cycles+0x9/0x10
[15758.642011] [<ffffffff81092a5b>] hrtimer_interrupt+0xeb/0x210
[15758.642011] [<ffffffff815b9f49>] smp_apic_timer_interrupt+0x69/0x99
[15758.642011] [<ffffffff815b7e1e>] apic_timer_interrupt+0x6e/0x80
[15758.642011] <EOI> [<ffffffff8117eb35>] ? inode_permission+0x25/0x100
[15758.642011] [<ffffffff8115d4b4>] ? kmem_cache_free+0x104/0x110
[15758.642011] [<ffffffff811808b6>] link_path_walk+0x76/0x880
[15758.642011] [<ffffffff8115e58c>] ? kmem_cache_alloc_trace+0x10c/0x140
[15758.642011] [<ffffffff812843da>] ? selinux_file_alloc_security+0x4a/0x80
[15758.642011] [<ffffffff8118076d>] ? path_init+0x2cd/0x3a0
[15758.642011] [<ffffffff811829a8>] path_openat+0xb8/0x3c0
[15758.642011] [<ffffffff8107e56b>] ? recalc_sigpending+0x3b/0x90
[15758.642011] [<ffffffff8107ec97>] ? __set_task_blocked+0x37/0x80
[15758.642011] [<ffffffff81182dd2>] do_filp_open+0x42/0xa0
[15758.642011] [<ffffffff8117e43b>] ? getname_flags+0x3b/0x260
[15758.642011] [<ffffffff8118ea9f>] ? alloc_fd+0x4f/0x150
[15758.642011] [<ffffffff81172827>] do_sys_open+0xf7/0x1d0
[15758.642011] [<ffffffff810ca2e2>] ? audit_syscall_entry+0x242/0x360
[15758.642011] [<ffffffff81172920>] sys_open+0x20/0x30
[15758.642011] [<ffffffff815b7342>] system_call_fastpath+0x16/0x1b

(In reply to comment #7)
>
> So, the particular kernel in this bug report doesn't have either of those
> fixes. The 2.6.41.7 kernel in fedora updates-testing has the jump_label fix,
> but not the other.
>

The jump_label fix is definitely not enough. I made it first and oops was easily reproducible with it applied.

(In reply to comment #8)
> Managed to crash 2.6.41.7.
> We're not using perf;

Hmm. OK, gdb can use perf events "implicitly", but afaics only if you play with hw breakpoints. But you are saying:

> we simply use gdb to test our proprietary software (run
> XXX).

Strange...
> [15758.642011] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48
> 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b
> 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25

At least this is clear after decodecode ;) perf_ctx_adjust_freq() hits the NULL terminated ctx->event_list. Not that this helps a lot.

Still I hope this _can_ be fixed by Gleb's 1d5f003f + 86b47c25, maybe ->task_ctx wasn't cleared and then this memory was freed/reused.

(In reply to comment #8)
> Managed to crash 2.6.41.7.

Any chance you can try vanilla 3.1 and 3.2 kernels and let us know if you can re-create the problem (_hopefully_ fixed in 3.2) ?

I can either build vanilla from scratch or use the 3.2 RPM and strip a patch (or two).
What do you prefer?

As for perf, we triggered the OOPs w/o any breakpoints.
Simply by:
gdb prog
(gdb) run params
BOOM
- Gilboa

(In reply to comment #12)
> I can either build vanilla from scratch or use the 3.2 RPM and strip a patch
> (or two).
> What do you prefer?

Can you try 3.2 vanilla?

OK. I'll try and free some time to build a vanilla 3.2 kernel on one of the F16 VMs.
- Gilboa

Haven't tried vanilla yet, but the recompiled F16 3.2 kernel oopses like crazy.
I'm building a vanilla 3.2 kernel as I write this.

Thus far, vanilla 3.2 seems stable enough.
I used the same configuration used by an F17 3.2 kernel.

... However, I'm currently testing (c)gdb on a VM running on my Xeon workstation (w/ nVidia binary driver). Most of the previous testing was done on a headless Athlon Phenom (635) server.
Any chance that the trigger for this bug is hardware / CPU related?
- Gilboa

(In reply to comment #16)
> Thus far, vanilla 3.2 seems stable enough.
> I used the same configuration used by a F17 3.2 kernel.

Where can I get the source for F17 3.2 kernel?

>
> ... However, I'm currently testing (c)gdb on a VM running on my Xeon
> workstation (w/ nVidia binary driver). Most of the previous testing was done a
> headless Athlon Phenom (635) server.
> Any chance that the trigger for this bug is hardware / CPU related?
>

Perf subsystem is hardware dependent, so it is possible.

(In reply to comment #17)
> (In reply to comment #16)
> > Thus far, vanilla 3.2 seems stable enough.
> > I used the same configuration used by a F17 3.2 kernel.
> Where can I get the source for F17 3.2 kernel?

http://koji.fedoraproject.org/koji/buildinfo?buildID=281207

That is the main build page for the 3.2 build that has been in the repositories until today. The SRPM for it is here:

http://kojipkgs.fedoraproject.org/packages/kernel/3.2.0/2.fc17/src/kernel-3.2.0-2.fc17.src.rpm

Just to eliminate utrace, I started a scratch build of the 3.2.0-2 kernel with that patch not applied here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910

When it is finished building, could you give it a test?

(In reply to comment #19)
> Just to eliminate utrace, I started a scratch build with that patch not applied
> of the 3.2.0-2 kernel here:
>
> http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910
>
> When it is finished building, could you give it a test?

Oh, thanks Josh.

Yes, I am starting to be afraid I was wrong. I still can't imagine how utrace changes could introduce a problem like this, but given that the vanilla 3.2 kernel works fine...

(In reply to comment #20)
> (In reply to comment #19)
> > Just to eliminate utrace, I started a scratch build with that patch not applied
> > of the 3.2.0-2 kernel here:
> >
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910
> >
> > When it is finished building, could you give it a test?
>
> Oh, thanks Josh.
>
> Yes, I am starting to afraid I was wrong. I still can't imagine
> how utrace changes could introduce the problem like this, but
> given that vanilla 3.2 kernel works fine...

It works on different HW. I still would like Gilboa to try it on AMD.
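The decodecode remark earlier in the thread (perf_ctx_adjust_freq() walking off the end of ctx->event_list) can be cross-checked against the register dump with a little arithmetic. What follows is a sketch, not authoritative analysis: the two instruction decodings are a hand reading of the "Code:" bytes, and the interpretation of the 0x10 and 0x58 displacements as a list-member offset and a field offset in the event structure is an assumption that was not verified against this kernel's struct layout.

```python
MASK64 = (1 << 64) - 1


def signed8(byte):
    """Interpret one byte as a signed 8-bit displacement."""
    return byte - 0x100 if byte & 0x80 else byte


# Values copied from the 2.6.41.7 oops in this thread.
RAX = 0x0000000000000000
RBX = 0xFFFFFFFFFFFFFFF0
CR2 = 0x0000000000000048

# Hand decoding of the bytes around the faulting instruction:
#   48 8d 58 f0     lea  -0x10(%rax),%rbx   (list node -> containing entry,
#                                            assuming the node sits at +0x10)
#   <83> 7b 58 01   cmpl $0x1,0x58(%rbx)    (the load that faults)
LEA_DISP = signed8(0xF0)   # -0x10
CMP_DISP = signed8(0x58)   # +0x58


def fault_address(rax):
    """Replay the two instructions and return (rbx, address read by cmpl)."""
    rbx = (rax + LEA_DISP) & MASK64
    addr = (rbx + CMP_DISP) & MASK64
    return rbx, addr


rbx, addr = fault_address(RAX)
print(f"rbx  = {rbx:#018x}")
print(f"addr = {addr:#018x}")
```

With RAX = 0 this reproduces both RBX = fffffffffffffff0 and the faulting address 0x48 reported in CR2, which is consistent with a NULL "next" pointer being turned into an entry pointer and then dereferenced.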
Hello,

I'm having difficulties reproducing this crash on vanilla 3.2 (good).
... But I'm also having difficulties reproducing this on a fresh 3.1.9 (fc16 rpm from @updates).
I'll try to free a couple of hours tomorrow (Sunday) to try and see which of the possible kernel configurations (3.1.7, 3.1.9, 3.2.1 from koji and vanilla 3.2) is crashing, and when/how.
- Gilboa

OK. Both 2.6.41.9 (F15) and 3.1.9 (F16) are oopsing w/ gdb.
I'll install 3.2.1 from koji and test it.
- Gilboa

3.2.1 from Koji goes up in flames.

[ 772.672021] <IRQ>
[ 772.672021] [<ffffffff81112ed0>] perf_event_task_tick+0x80/0x290
[ 772.672021] [<ffffffff81053329>] ? sched_slice+0x59/0xa0
[ 772.672021] [<ffffffff81066b9c>] scheduler_tick+0xdc/0x300
[ 772.672021] [<ffffffff8107e48e>] update_process_times+0x6e/0x90
[ 772.672021] [<ffffffff810a0d84>] tick_sched_timer+0x64/0xc0
[ 772.672021] [<ffffffff81093dd0>] __run_hrtimer+0x70/0x1e0
[ 772.672021] [<ffffffff810a0d20>] ? tick_nohz_handler+0x100/0x100
[ 772.672021] [<ffffffff8103ccb9>] ? kvm_clock_get_cycles+0x9/0x10
[ 772.672021] [<ffffffff8109474b>] hrtimer_interrupt+0xeb/0x210
[ 772.672021] [<ffffffff815ebb09>] smp_apic_timer_interrupt+0x69/0x99
[ 772.672021] [<ffffffff815e99de>] apic_timer_interrupt+0x6e/0x80
[ 772.672021] <EOI>
[ 772.672021] [<ffffffff815e9034>] ? sysret_audit+0x16/0x20
[ 772.672021] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25
[ 772.672021] RIP [<ffffffff81112d69>] perf_ctx_adjust_freq+0x49/0x130
[ 772.672021] RSP <ffff88011fd83d88>
[ 772.672021] CR2: 0000000000000048
[ 772.672021] ---[ end trace ced3abbf6a7fa769 ]---
[ 772.672021] Kernel panic - not syncing: Fatal exception in interrupt
[ 772.672021] Pid: 10786, comm: gdb Tainted: P D O 3.2.1-1.fc16.x86_64 #1
[ 772.672021] Call Trace:
[ 772.672021] <IRQ> [<ffffffff815d682f>] panic+0x91/0x1a7
[ 772.672021] [<ffffffff815e217a>] oops_end+0xea/0xf0
[ 772.672021] [<ffffffff815d6114>] no_context+0x214/0x223
[ 772.672021] [<ffffffff813d3b7b>] ? ata_scsi_qc_complete+0x6b/0x470
[ 772.672021] [<ffffffff815d62ec>] __bad_area_nosemaphore+0x1c9/0x1e8
[ 772.672021] [<ffffffff8105e8bc>] ? update_group_power+0x9c/0x130
[ 772.672021] [<ffffffff812b5d36>] ? cpumask_next_and+0x36/0x50
[ 772.672021] [<ffffffff815d631e>] bad_area_nosemaphore+0x13/0x15
[ 772.672021] [<ffffffff815e4c26>] do_page_fault+0x416/0x4f0
[ 772.672021] [<ffffffff815e4465>] do_async_page_fault+0x35/0x80
[ 772.672021] [<ffffffff815e1725>] async_page_fault+0x25/0x30
[ 772.672021] [<ffffffff81112d69>] ? perf_ctx_adjust_freq+0x49/0x130
[ 772.672021] [<ffffffff8103cca9>] ? kvm_clock_read+0x19/0x20
[ 772.672021] [<ffffffff81112ed0>] perf_event_task_tick+0x80/0x290
[ 772.672021] [<ffffffff81053329>] ? sched_slice+0x59/0xa0
[ 772.672021] [<ffffffff81066b9c>] scheduler_tick+0xdc/0x300
[ 772.672021] [<ffffffff8107e48e>] update_process_times+0x6e/0x90
[ 772.672021] [<ffffffff810a0d84>] tick_sched_timer+0x64/0xc0
[ 772.672021] [<ffffffff81093dd0>] __run_hrtimer+0x70/0x1e0
[ 772.672021] [<ffffffff810a0d20>] ? tick_nohz_handler+0x100/0x100
[ 772.672021] [<ffffffff8103ccb9>] ? kvm_clock_get_cycles+0x9/0x10
[ 772.672021] [<ffffffff8109474b>] hrtimer_interrupt+0xeb/0x210
[ 772.672021] [<ffffffff815ebb09>] smp_apic_timer_interrupt+0x69/0x99
[ 772.672021] [<ffffffff815e99de>] apic_timer_interrupt+0x6e/0x80
[ 772.672021] <EOI> [<ffffffff815e9034>] ? sysret_audit+0x16/0x20

- Gilboa

This is on the same HW where vanilla 3.2 worked fine?

In theory, yes.
I'll build 3.2.1 vanilla tomorrow (on the same VM) and restart testing.

On the up side, this really feels like a platform (as in CPU) dependent bug.

... Oh, it may or may not be relevant - but the code we're debugging makes heavy use of numa-ctl for memory allocation.

- Gilboa

(In reply to comment #26)
> In theory, yes.
> I'll build 3.2.1 vanilla tomorrow (on the same VM) and restart testing.
>
> On the up side, this really feels like a platform (as in CPU) dependent bug.
>
> ... Oh, it may or may not be relevant - but the code we're debugging makes
> heavy use of numa-ctl for memory allocation.
>
> - Gilboa

Please try the kernel I linked to in comment #19. It has utrace disabled.

The crash on comment 24 came from fc16 koji build :(

Gilboa, first of all thanks a lot for your efforts.

but I got lost,

(In reply to comment #28)
>
> The crash on comment 24 came from fc16 koji build :(

that comment says "3.2.1 from Koji". while the kernel linked to #19 is kernel-3.2.0, or I do what http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910 tells me...

was it really that kernel?

(In reply to comment #29)
> Gilboa, first of all thanks a lot for your efforts.

Yes, thank you.

> was it really that kernel?

[ 772.672021] Pid: 10786, comm: gdb Tainted: P D O 3.2.1-1.fc16.x86_64

That's not the kernel I built without utrace. Also, please, try duplicating this without any proprietary modules loaded.

My mistake, sorry.
Do you want to rebuild the SRPM under F16 or do you want me to use the F17 kernel as-is?
- Gilboa

(In reply to comment #31)
> My mistake, sorry.
> Do you want to rebuild the SRPM under F16 or do you want me to use the F17
> kernel as-is?

No need to rebuild. Installing it directly should work just fine.

Thank you

Finally got around to building a VM to test the koji build, but there's no binary package in http://koji.fedoraproject.org/koji/taskinfo?taskID=3696911
What am I missing?
- Gilboa

(In reply to comment #33)
> Finally got around to build a VM to test the koji build, but there's no binary
> package in http://koji.fedoraproject.org/koji/taskinfo?taskID=3696911
> What am I missing?

You waited too long and koji auto-pruned it because it was a scratch build. I rebuilt it here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3746349

When that completes it should be identical to what was originally posted at the link that no longer works.

No go.
The fc17 kernel died after ~2 "runs". However, there was no callstack / OOPs message in the serial console.
- Gilboa

Short update.
I'm currently installing Ubuntu 11.10 on a VM to see if I can reproduce the crash under a different distro (I already tried the vanilla kernel path).
... While I'm not sure it'll do any good, it might at least confirm that this is not an upstream kernel.org bug.
- Gilboa

3.2.6 triggers warn-on-slowpath (again, by "run <param>" in gdb).
I'll try and reproduce it on a non-tainted host/VM.
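Since the request above was to retest without proprietary modules, it may help to read the "Tainted:" field of each pasted trace mechanically. The sketch below is only an illustration: `taint_flags` is a made-up helper, the regular expression is an assumption based on the "Pid: ... Tainted: ..." lines in this report, and the table covers only the flags that actually appear in this thread.

```python
import re

# Meaning of the taint letters seen in this thread (subset only).
TAINT_MEANING = {
    "P": "a proprietary module was loaded",
    "G": "no proprietary module was loaded",
    "W": "a kernel warning was issued earlier",
    "D": "the kernel already died (oops or BUG)",
    "O": "an out-of-tree module was loaded",
}

# "Tainted:" is followed by space-separated letters, then the kernel version.
TAINT_RE = re.compile(r"Tainted:\s+((?:[A-Z]\s+)+)")


def taint_flags(line):
    """Return the taint letters found on a 'Pid: ... Tainted: ...' line."""
    match = TAINT_RE.search(line)
    if not match:
        return []  # e.g. an untainted kernel
    return match.group(1).split()


# Lines copied from traces in this thread.
LINES = [
    "Pid: 16385, comm: gdb Tainted: G W 2.6.41.7-1.fc15.x86_64 #1 Bochs Bochs",
    "Pid: 10786, comm: gdb Tainted: P D O 3.2.1-1.fc16.x86_64 #1",
    "Pid: 28328, comm: upgen Tainted: P O 3.2.6-3.fc16.x86_64 #1",
]

for line in LINES:
    flags = taint_flags(line)
    print(flags, [TAINT_MEANING[f] for f in flags])
```

Read this way, the 2.6.41.7 guest trace carries no "P" flag, while the 3.2.1 and 3.2.6 traces do, so only the first of the three came from a kernel without a proprietary module loaded.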
[367414.345972] ------------[ cut here ]------------
[367414.345979] WARNING: at kernel/events/core.c:2047 task_ctx_sched_out+0x63/0x70()
[367414.345981] Hardware name: GA-MA785GM-US2H
[367414.345982] Modules linked in: bluetooth rfkill btrfs zlib_deflate libcrc32c ufs hfsplus hfs minix vfat msdos fat jfs xfs reiserfs usb_storage nfs fscache fuse ppdev parport_pc lp parport ipt_LOG ipt_MASQUERADE xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bridge stp llc it87 hwmon_vid xts gf128mul sha256_generic dm_crypt snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel nvidia(P) snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore r8169 snd_page_alloc sp5100_tco k10temp i2c_piix4 i2c_core edac_core edac_mce_amd mii microcode vhost_net macvtap macvlan tun virtio_net kvm_amd kvm nfsd lockd nfs_acl auth_rpcgss sunrpc binfmt_misc uinput pata_acpi ata_generic pata_atiixp wmi [last unloaded: scsi_wait_scan]
[367414.346024] Pid: 28328, comm: upgen Tainted: P O 3.2.6-3.fc16.x86_64 #1
[367414.346026] Call Trace:
[367414.346031] [<ffffffff8106dd4f>] warn_slowpath_common+0x7f/0xc0
[367414.346034] [<ffffffff8106ddaa>] warn_slowpath_null+0x1a/0x20
[367414.346036] [<ffffffff8110fc43>] task_ctx_sched_out+0x63/0x70
[367414.346039] [<ffffffff81113ffa>] perf_event_comm+0x8a/0x330
[367414.346042] [<ffffffff811888e2>] ? do_filp_open+0x42/0xa0
[367414.346045] [<ffffffff8117fff0>] set_task_comm+0x60/0x70
[367414.346047] [<ffffffff811dbe0f>] comm_write+0xdf/0xf0
[367414.346050] [<ffffffff81178fa3>] vfs_write+0xb3/0x180
[367414.346052] [<ffffffff811792ca>] sys_write+0x4a/0x90
[367414.346055] [<ffffffff815e9982>] system_call_fastpath+0x16/0x1b
[367414.346057] ---[ end trace 33dece0994e4edf5 ]---

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository. Please retest with this update.

I'll give it a test.
- Gilboa

We managed to reproduce the gdb kernel crash a couple of times on a 3.3.0 kernel running on a dual-core Atom 330 netbook w/ the nouveau driver.
No crash dump. (We'll try to reproduce this crash on a VM w/ a serial console.)
- Gilboa

We've dropped utrace in the 3.4.2 or newer kernels. Are you still seeing this with the latest F16 kernel update?

Hi,

We moved to F17 across the board, and thus far, I'm happy to say, we haven't managed to reproduce this OOPs.
Bug can be safely closed.

Thanks!
- Gilboa

OK, thank you for letting us know.