Since the last round of kernel releases for F15 and F16, we've been experiencing multiple kernel crashes on both F15 and F16 machines. As far as we could test, dropping back to F15's 2.6.41.1 (as opposed to 2.6.41.4) or F16's 3.1.4 (as opposed to 3.1.6) seems to reduce the chance of crashing the kernel. The type of gdb front-end doesn't seem to make a difference (I'm using the console-based cgdb, my co-workers are using Eclipse, etc.). In order to reproduce it in a controlled environment (the physical machines all use the nVidia binary drivers, and I doubt that a tainted kernel OOPs will be welcome :)), I managed to reproduce the crash in an F15 x86_64 VM running on an F16 x86_64 host. OOPs: gilboa-vmw-probe64 login: [ 7746.776241] general protection fault: 0000 [#1] SMP [ 7746.777215] CPU 1 [ 7746.777215] Modules linked in: nfs fscache auth_rpcgss nfs_acl 8021q garp stp llc lockd sunrpc uinput ipt_LOG xt_state bnep iptable_nat nf_nat bluetooth nf_conntrack_ipv4 rfkill nf_conntrack nf_defrag_ipv4 snd_hda_intel snd_hda_codec snd_hwdep snd_seq ppdev snd_seq_device parport_pc parport snd_pcm joydev snd_timer snd i2c_piix4 soundcore snd_page_alloc i2c_core e1000 8139cp mii ipv6 [last unloaded: scsi_wait_scan] [ 7746.777215] [ 7746.777215] Pid: 6323, comm: gdb Tainted: G W 2.6.41.4-1.fc15.x86_64 #1 Bochs Bochs [ 7746.777215] RIP: 0010:[<ffffffff810d8aad>] [<ffffffff810d8aad>] perf_ctx_adjust_freq+0x29/0xd5 [ 7746.777215] RSP: 0018:ffff88011fc83da8 EFLAGS: 00010003 [ 7746.777215] RAX: 66524153e5894855 RBX: 66524153e5894845 RCX: 0000000000000000 [ 7746.777215] RDX: ffff88011fc95da8 RSI: 00000000000f41a8 RDI: ffff880114b6d3c0 [ 7746.777215] RBP: ffff88011fc83dd8 R08: 000000000000017f R09: 000000000000017f [ 7746.777215] R10: 000000000000017f R11: ffff88011fc92d70 R12: ffff880114b6d410 [ 7746.777215] R13: 00000000000f41a8 R14: ffff88011fc95e80 R15: ffff88003797ae60 [ 7746.777215] FS: 00007fb0f3079720(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000 [ 
7746.777215] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 7746.777215] CR2: 000000000046e190 CR3: 00000000378b3000 CR4: 00000000000006e0 [ 7746.777215] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7746.777215] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 7746.777215] Process gdb (pid: 6323, threadinfo ffff8800378ee000, task ffff88003797ae60) [ 7746.777215] Stack: [ 7746.777215] ffff88011fc83db8 66524153e5894855 ffff88011fc83dd8 ffff88011fc95db0 [ 7746.777215] ffff88011fc8f810 ffff88011fc8f740 ffff88011fc83e38 ffffffff810d8c6c [ 7746.777215] ffff880100000000 00000000000f41a8 0000000116d67400 ffff880114b6d3c0 [ 7746.777215] Call Trace: [ 7746.777215] <IRQ> [ 7746.777215] [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2 [ 7746.777215] [<ffffffff810528d8>] scheduler_tick+0xd0/0x260 [ 7746.777215] [<ffffffff8106520d>] update_process_times+0x65/0x76 [ 7746.777215] [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f [ 7746.777215] [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154 [ 7746.777215] [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde [ 7746.777215] [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d [ 7746.777215] [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a [ 7746.777215] [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80 [ 7746.777215] <EOI> [ 7746.777215] [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8 [ 7746.777215] [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd [ 7746.777215] [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19 [ 7746.777215] [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21 [ 7746.777215] [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9 [ 7746.777215] [<ffffffff81062825>] ptrace_request+0x373/0x410 [ 7746.777215] [<ffffffff8149ca36>] ? _raw_spin_lock+0xe/0x10 [ 7746.777215] [<ffffffff810456e0>] ? task_rq_lock+0x4e/0x87 [ 7746.777215] [<ffffffff8149ca9c>] ? _raw_spin_unlock_irqrestore+0x17/0x19 [ 7746.777215] [<ffffffff810451ca>] ? 
task_rq_unlock+0x1b/0x1d [ 7746.777215] [<ffffffff8104fae6>] ? wait_task_inactive+0xb3/0x129 [ 7746.777215] [<ffffffff81018ffc>] arch_ptrace+0x1aa/0x1bb [ 7746.777215] [<ffffffff81062418>] sys_ptrace+0x97/0xb3 [ 7746.777215] [<ffffffff814a31c2>] system_call_fastpath+0x16/0x1b [ 7746.777215] Code: 5d c3 55 48 89 e5 41 55 49 89 f5 41 54 4c 8d 67 50 53 48 83 ec 18 48 8b 47 50 48 89 45 d8 48 8b 5d d8 48 83 eb 10 e9 94 00 00 00 [ 7746.777215] 7b 58 01 75 7e 48 89 df e8 ae bb ff ff 85 c0 74 72 48 8b 83 [ 7746.777215] RIP [<ffffffff810d8aad>] perf_ctx_adjust_freq+0x29/0xd5 [ 7746.777215] RSP <ffff88011fc83da8> [ 7746.777215] ---[ end trace 672d713ca5841918 ]--- [ 7746.777215] Kernel panic - not syncing: Fatal exception in interrupt [ 7746.777215] Pid: 6323, comm: gdb Tainted: G D W 2.6.41.4-1.fc15.x86_64 #1 [ 7746.777215] Call Trace: [ 7746.777215] <IRQ> [<ffffffff81493847>] panic+0x91/0x1a5 [ 7746.777215] [<ffffffff8149dbe6>] oops_end+0xb4/0xc5 [ 7746.777215] [<ffffffff81011d47>] die+0x5a/0x63 [ 7746.777215] [<ffffffff8149d61f>] do_general_protection+0x128/0x130 [ 7746.777215] [<ffffffff8149d0c5>] general_protection+0x25/0x30 [ 7746.777215] [<ffffffff810d8aad>] ? perf_ctx_adjust_freq+0x29/0xd5 [ 7746.777215] [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2 [ 7746.777215] [<ffffffff810528d8>] scheduler_tick+0xd0/0x260 [ 7746.777215] [<ffffffff8106520d>] update_process_times+0x65/0x76 [ 7746.777215] [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f [ 7746.777215] [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154 [ 7746.777215] [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde [ 7746.777215] [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d [ 7746.777215] [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a [ 7746.777215] [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80 [ 7746.777215] <EOI> [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8 [ 7746.777215] [<ffffffff81085d4e>] ? 
arch_local_irq_restore+0x6/0xd [ 7746.777215] [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19 [ 7746.777215] [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21 [ 7746.777215] [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9 [ 7746.777215] [<ffffffff81062825>] ptrace_request+0x373/0x410 [ 7746.777215] [<ffffffff8149ca36>] ? _raw_spin_lock+0xe/0x10 [ 7746.777215] [<ffffffff810456e0>] ? task_rq_lock+0x4e/0x87 [ 7746.777215] [<ffffffff8149ca9c>] ? _raw_spin_unlock_irqrestore+0x17/0x19 [ 7746.777215] [<ffffffff810451ca>] ? task_rq_unlock+0x1b/0x1d [ 7746.777215] [<ffffffff8104fae6>] ? wait_task_inactive+0xb3/0x129 [ 7746.777215] [<ffffffff81018ffc>] arch_ptrace+0x1aa/0x1bb [ 7746.777215] [<ffffffff81062418>] sys_ptrace+0x97/0xb3 [ 7746.777215] [<ffffffff814a31c2>] system_call_fastpath+0x16/0x1b
What were you doing in gdb when this oopsed? Also, was the VM you tested in running on a host that was using the nVidia drivers?
My co-workers and I were debugging our proprietary software under gdb. In many cases, a simple "run" was enough to trigger the oops. We managed to reproduce the issue on guests that were running under nVidia-less hosts. - Gilboa
Just to be certain, I'll set up a headless server w/ VMs to see if this OOPs reproduces reliably. - Gilboa
I found a similar report from almost a year ago, also involving ptrace. It doesn't seem to have resolved itself though. http://marc.info/?l=linux-kernel&m=129198247531612&w=2
(In reply to comment #0) > [ 7746.777215] [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2 > [ 7746.777215] [<ffffffff810528d8>] scheduler_tick+0xd0/0x260 > [ 7746.777215] [<ffffffff8106520d>] update_process_times+0x65/0x76 > [ 7746.777215] [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f > [ 7746.777215] [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154 > [ 7746.777215] [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde > [ 7746.777215] [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d > [ 7746.777215] [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a > [ 7746.777215] [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80 > [ 7746.777215] <EOI> > [ 7746.777215] [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8 > [ 7746.777215] [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd > [ 7746.777215] [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19 > [ 7746.777215] [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21 > [ 7746.777215] [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9 Is this trace always the same? Did you use perf? I can hardly believe this has something to do with ptrace... Gleb, do you think this can be somehow connected with your fixes in perf_sched_in ?
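One way to answer "is this trace always the same?" without comparing addresses by eye is to reduce each oops to its list of symbol names. The following is a minimal sketch, not a tool used in this report; `frames` and `same_path` are hypothetical helper names, and the regular expression assumes the `[<address>] symbol+0xoff/0xlen` frame layout seen in the logs above.

```python
import re

# One call-trace frame, e.g. "[<ffffffff810528d8>] scheduler_tick+0xd0/0x260".
# A "?" before the symbol marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frames(log_text, include_unreliable=False):
    """Return the symbol names found in an oops call trace, in order."""
    names = []
    for match in FRAME_RE.finditer(log_text):
        unreliable = match.group(1) is not None
        if unreliable and not include_unreliable:
            continue
        names.append(match.group(2))
    return names


def same_path(trace_a, trace_b, depth=5):
    """Compare the first `depth` reliable frames of two traces by name only."""
    return frames(trace_a)[:depth] == frames(trace_b)[:depth]
```

Comparing by name rather than by address means traces from different kernel builds can still be matched, since offsets and load addresses change between builds while the call path may not.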
(In reply to comment #5) > (In reply to comment #0) > > [ 7746.777215] [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2 > > [ 7746.777215] [<ffffffff810528d8>] scheduler_tick+0xd0/0x260 > > [ 7746.777215] [<ffffffff8106520d>] update_process_times+0x65/0x76 > > [ 7746.777215] [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f > > [ 7746.777215] [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154 > > [ 7746.777215] [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde > > [ 7746.777215] [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d > > [ 7746.777215] [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a > > [ 7746.777215] [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80 > > [ 7746.777215] <EOI> > > [ 7746.777215] [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8 > > [ 7746.777215] [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd > > [ 7746.777215] [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19 > > [ 7746.777215] [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21 > > [ 7746.777215] [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9 > > Is this trace always the same? Did you use perf? > > I can hardly believe this has something to do with ptrace... > > Gleb, do you think this can be somehow connected with your > fixes in perf_sched_in ? The trace looks similar to traces I got while debugging perf. But to trigger it, perf needs to be running. Does this kernel have the fixes for perf_sched_in and jump_label? Also, here is a bug report with a similar trace where those fixes made it disappear: http://lkml.org/lkml/2011/11/5/101
(In reply to comment #6) > (In reply to comment #5) > > (In reply to comment #0) > > > [ 7746.777215] [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2 > > > [ 7746.777215] [<ffffffff810528d8>] scheduler_tick+0xd0/0x260 > > > [ 7746.777215] [<ffffffff8106520d>] update_process_times+0x65/0x76 > > > [ 7746.777215] [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f > > > [ 7746.777215] [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154 > > > [ 7746.777215] [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde > > > [ 7746.777215] [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d > > > [ 7746.777215] [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a > > > [ 7746.777215] [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80 > > > [ 7746.777215] <EOI> > > > [ 7746.777215] [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8 > > > [ 7746.777215] [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd > > > [ 7746.777215] [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19 > > > [ 7746.777215] [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21 > > > [ 7746.777215] [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9 > > > > Is this trace always the same? Did you use perf? > > > > I can hardly believe this has something to do with ptrace... > > > > Gleb, do you think this can be somehow connected with your > > fixes in perf_sched_in ? > > The trace looks similar to traces I god while debugging perf. But to trigger it > perf needs to be running. Does this kernel have a fix for perf_sched_in and > jump_label? 1d5f003f5a964711853514b04ddc872eec0fdc7b and bbbf7af4bf8fc69bc751818cf30521080fa47dcb are in 3.2-rc5 (ish). The latter is in 3.1.5. The former isn't queued for -stable at all from what I can tell. So, the particular kernel in this bug report doesn't have either of those fixes. The 2.6.41.7 kernel in fedora updates-testing has the jump_label fix, but not the other. 
> Also here is bug report with similar trace where those fixes made it disappear: > http://lkml.org/lkml/2011/11/5/101 Gilboa, if you can recreate this fairly easily please let us know. Also, trying the 2.6.41.7 kernel in updates-testing would be a good idea as well.
Managed to crash 2.6.41.7. We're not using perf; we simply use gdb to test our proprietary software (run XXX). Callstack: [15758.642011] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [15758.642011] IP: [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130 [15758.642011] PGD 181e6d067 PUD 181d04067 PMD 0 [15758.642011] Oops: 0000 [#1] SMP [15758.642011] CPU 5 [15758.642011] Modules linked in: lockd uinput 8021q garp stp llc sunrpc ipt_LOG xt_state bnep bluetooth iptable_nat nf_nat nf_conntrack_ipv4 rfkill nf_conntrack nf_defrag_ipv4 snd_hda_intel snd_hda_codec snd_hwdep i2c_piix4 snd_seq ppdev snd_seq_device joydev snd_pcm parport_pc i2c_core parport snd_timer snd microc ode soundcore snd_page_alloc e1000 8139cp mii ipv6 [last unloaded: scsi_wait_scan] [15758.642011] [15758.642011] Pid: 16385, comm: gdb Tainted: G W 2.6.41.7-1.fc15.x86_64 #1 Bochs Bochs [15758.642011] RIP: 0010:[<ffffffff8110d639>] [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130 [15758.642011] RSP: 0018:ffff88019fd43d88 EFLAGS: 00010086 [15758.642011] RAX: 0000000000000000 RBX: fffffffffffffff0 RCX: 0000000000000000 [15758.642011] RDX: ffff88019fd55fa8 RSI: 00000000000f41a8 RDI: ffff880181f1e300 [15758.642011] RBP: ffff88019fd43db8 R08: 0000000000989680 R09: 0000000000000020 [15758.642011] R10: 0000000000000400 R11: 0000000000000000 R12: ffff880181f1e350 [15758.642011] R13: 00000000000f41a8 R14: ffff88019fd4f930 R15: ffff880195145cc0 [15758.642011] FS: 00007f423267e720(0000) GS:ffff88019fd40000(0000) knlGS:0000000000000000 [15758.642011] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [15758.642011] CR2: 0000000000000048 CR3: 000000017fcdc000 CR4: 00000000000006e0 [15758.642011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [15758.642011] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [15758.642011] Process gdb (pid: 16385, threadinfo ffff880181cc0000, task ffff880195145cc0) [15758.642011] Stack: [15758.642011] ffff88019fd43d98 
0000000000000000 ffff88019fd43db8 ffff88019fd55fb0 [15758.642011] ffff88019fd4f860 ffff88019fd56080 ffff88019fd43e18 ffffffff8110d7a0 [15758.642011] ffff88017fcdae00 00000000000f41a8 0000000100000000 ffff880181f1e300 [15758.642011] Call Trace: [15758.642011] <IRQ> [15758.642011] [<ffffffff8110d7a0>] perf_event_task_tick+0x80/0x290 [15758.642011] [<ffffffff8106478c>] scheduler_tick+0xdc/0x280 [15758.642011] [<ffffffff8107cb0e>] update_process_times+0x6e/0x90 [15758.642011] [<ffffffff8109f694>] tick_sched_timer+0x64/0xc0 [15758.642011] [<ffffffff810920e0>] __run_hrtimer+0x70/0x1e0 [15758.642011] [<ffffffff8109f630>] ? tick_nohz_handler+0x100/0x100 [15758.642011] [<ffffffff8103ac39>] ? kvm_clock_get_cycles+0x9/0x10 [15758.642011] [<ffffffff81092a5b>] hrtimer_interrupt+0xeb/0x210 [15758.642011] [<ffffffff815b9f49>] smp_apic_timer_interrupt+0x69/0x99 [15758.642011] [<ffffffff815b7e1e>] apic_timer_interrupt+0x6e/0x80 [15758.642011] <EOI> [15758.642011] [<ffffffff8117eb35>] ? inode_permission+0x25/0x100 [15758.642011] [<ffffffff8115d4b4>] ? kmem_cache_free+0x104/0x110 [15758.642011] [<ffffffff811808b6>] link_path_walk+0x76/0x880 [15758.642011] [<ffffffff8115e58c>] ? kmem_cache_alloc_trace+0x10c/0x140 [15758.642011] [<ffffffff812843da>] ? selinux_file_alloc_security+0x4a/0x80 [15758.642011] [<ffffffff8118076d>] ? path_init+0x2cd/0x3a0 [15758.642011] [<ffffffff811829a8>] path_openat+0xb8/0x3c0 [15758.642011] [<ffffffff8107e56b>] ? recalc_sigpending+0x3b/0x90 [15758.642011] [<ffffffff8107ec97>] ? __set_task_blocked+0x37/0x80 [15758.642011] [<ffffffff81182dd2>] do_filp_open+0x42/0xa0 [15758.642011] [<ffffffff8117e43b>] ? getname_flags+0x3b/0x260 [15758.642011] [<ffffffff8118ea9f>] ? alloc_fd+0x4f/0x150 [15758.642011] [<ffffffff81172827>] do_sys_open+0xf7/0x1d0 [15758.642011] [<ffffffff810ca2e2>] ? 
audit_syscall_entry+0x242/0x360 [15758.642011] [<ffffffff81172920>] sys_open+0x20/0x30 [15758.642011] [<ffffffff815b7342>] system_call_fastpath+0x16/0x1b [15758.642011] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25 [15758.642011] RIP [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130 [15758.642011] RSP <ffff88019fd43d88> [15758.642011] CR2: 0000000000000048 [15758.642011] ---[ end trace bb0af427d549d80a ]--- [15758.642011] Kernel panic - not syncing: Fatal exception in interrupt [15758.642011] Pid: 16385, comm: gdb Tainted: G D W 2.6.41.7-1.fc15.x86_64 #1 [15758.642011] Call Trace: [15758.642011] <IRQ> [<ffffffff815a49dc>] panic+0x91/0x1a7 [15758.642011] [<ffffffff815b088a>] oops_end+0xea/0xf0 [15758.642011] [<ffffffff815a42c1>] no_context+0x209/0x218 [15758.642011] [<ffffffff815a4499>] __bad_area_nosemaphore+0x1c9/0x1e8 [15758.642011] [<ffffffff8103ba65>] ? pvclock_clocksource_read+0x55/0xf0 [15758.642011] [<ffffffff8105c76f>] ? update_group_power+0x9f/0x130 [15758.642011] [<ffffffff815a44cb>] bad_area_nosemaphore+0x13/0x15 [15758.642011] [<ffffffff815b3076>] do_page_fault+0x416/0x4f0 [15758.642011] [<ffffffff815b28b5>] do_async_page_fault+0x35/0x80 [15758.642011] [<ffffffff815afbe5>] async_page_fault+0x25/0x30 [15758.642011] [<ffffffff8110d639>] ? perf_ctx_adjust_freq+0x49/0x130 [15758.642011] [<ffffffff8103ac29>] ? kvm_clock_read+0x19/0x20 [15758.642011] [<ffffffff8110d7a0>] perf_event_task_tick+0x80/0x290 [15758.642011] [<ffffffff8106478c>] scheduler_tick+0xdc/0x280 [15758.642011] [<ffffffff8107cb0e>] update_process_times+0x6e/0x90 [15758.642011] [<ffffffff8109f694>] tick_sched_timer+0x64/0xc0 [15758.642011] [<ffffffff810920e0>] __run_hrtimer+0x70/0x1e0 [15758.642011] [<ffffffff8109f630>] ? tick_nohz_handler+0x100/0x100 [15758.642011] [<ffffffff8103ac39>] ? 
kvm_clock_get_cycles+0x9/0x10 [15758.642011] [<ffffffff81092a5b>] hrtimer_interrupt+0xeb/0x210 [15758.642011] [<ffffffff815b9f49>] smp_apic_timer_interrupt+0x69/0x99 [15758.642011] [<ffffffff815b7e1e>] apic_timer_interrupt+0x6e/0x80 [15758.642011] <EOI> [<ffffffff8117eb35>] ? inode_permission+0x25/0x100 [15758.642011] [<ffffffff8115d4b4>] ? kmem_cache_free+0x104/0x110 [15758.642011] [<ffffffff811808b6>] link_path_walk+0x76/0x880 [15758.642011] [<ffffffff8115e58c>] ? kmem_cache_alloc_trace+0x10c/0x140 [15758.642011] [<ffffffff812843da>] ? selinux_file_alloc_security+0x4a/0x80 [15758.642011] [<ffffffff8118076d>] ? path_init+0x2cd/0x3a0 [15758.642011] [<ffffffff811829a8>] path_openat+0xb8/0x3c0 [15758.642011] [<ffffffff8107e56b>] ? recalc_sigpending+0x3b/0x90 [15758.642011] [<ffffffff8107ec97>] ? __set_task_blocked+0x37/0x80 [15758.642011] [<ffffffff81182dd2>] do_filp_open+0x42/0xa0 [15758.642011] [<ffffffff8117e43b>] ? getname_flags+0x3b/0x260 [15758.642011] [<ffffffff8118ea9f>] ? alloc_fd+0x4f/0x150 [15758.642011] [<ffffffff81172827>] do_sys_open+0xf7/0x1d0 [15758.642011] [<ffffffff810ca2e2>] ? audit_syscall_entry+0x242/0x360 [15758.642011] [<ffffffff81172920>] sys_open+0x20/0x30 [15758.642011] [<ffffffff815b7342>] system_call_fastpath+0x16/0x1b
(In reply to comment #7) > > So, the particular kernel in this bug report doesn't have either of those > fixes. The 2.6.41.7 kernel in fedora updates-testing has the jump_label fix, > but not the other. > The jump_label fix is definitely not enough. I made it first and oops was easily reproducible with it applied.
(In reply to comment #8) > Managed to crash 2.6.41.7. > We're not using perf; Hmm. OK, gdb can use perf events "implicitly", but afaics only if you play with hw breakpoints. But you are saying: > we simply use gdb to test our proprietary software (run > XXX). Strange... > [15758.642011] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48 > 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b > 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25 At least this is clear after decodecode ;) perf_ctx_adjust_freq() hits a NULL-terminated ctx->event_list. Not that this helps a lot. Still, I hope this _can_ be fixed by Gleb's 1d5f003f + 86b47c25; maybe ->task_ctx wasn't cleared and then this memory was freed/reused.
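The register dump in comment #8 can be checked against this reading with a little arithmetic. The sketch below is only an illustration: the two offsets are read from the quoted Code bytes (`48 8d 58 f0` decodes to `lea -0x10(%rax),%rbx`, and `83 7b 58 01` to `cmpl $0x1,0x58(%rbx)`), while the names `LIST_ENTRY_OFFSET`, `FIELD_OFFSET` and `container_of` are labels chosen here, not identifiers taken from the kernel source.

```python
MASK64 = (1 << 64) - 1  # registers wrap at 64 bits

LIST_ENTRY_OFFSET = 0x10  # from "lea -0x10(%rax),%rbx"
FIELD_OFFSET = 0x58       # from "cmpl $0x1,0x58(%rbx)", the faulting instruction


def container_of(member_addr, member_offset):
    """Address of the enclosing struct, given a pointer to one of its members."""
    return (member_addr - member_offset) & MASK64


# 2.6.41.7 oops: the list pointer in RAX is NULL.
event = container_of(0x0, LIST_ENTRY_OFFSET)
fault_addr = (event + FIELD_OFFSET) & MASK64
print(hex(event))       # 0xfffffffffffffff0, the RBX value in the oops
print(hex(fault_addr))  # 0x48, the CR2 value in the oops

# 2.6.41.4 oops: RAX holds a garbage pointer instead of NULL.
print(hex(container_of(0x66524153E5894855, LIST_ENTRY_OFFSET)))  # 0x66524153e5894845
```

Both crashes therefore fit the same pattern: the loop follows a bad list pointer, subtracts 0x10 to get the enclosing structure, and faults when it reads the field at offset 0x58. With a NULL pointer that is a NULL pointer dereference at 0x48; with the garbage pointer from the first oops the address is non-canonical, which shows up as a general protection fault.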
(In reply to comment #8) > Managed to crash 2.6.41.7. Any chance you can try vanilla 3.1 and 3.2 kernels and let us know if you can re-create the problem (_hopefully_ fixed in 3.2) ?
I can either build vanilla from scratch or use the 3.2 RPM and strip a patch (or two). What do you prefer? As for perf, we triggered the OOPs w/o any breakpoints, simply by:
gdb prog
(gdb) run params
BOOM
- Gilboa
(In reply to comment #12) > I can either build vanilla from scratch or use the 3.2 RPM and strip a patch > (or two). > What do you prefer? Can you try 3.2 vanilla?
OK. I'll try and free some time to build a vanilla 3.2 kernel on one of the F16 VMs. - Gilboa
Haven't tried vanilla yet, but the recompiled F16 3.2 kernel OOPSes like crazy. I'm building a vanilla 3.2 kernel as I write this.
Thus far, vanilla 3.2 seems stable enough. I used the same configuration used by an F17 3.2 kernel. ... However, I'm currently testing (c)gdb on a VM running on my Xeon workstation (w/ nVidia binary driver). Most of the previous testing was done on a headless Athlon Phenom (635) server. Any chance that the trigger for this bug is hardware / CPU related? - Gilboa
(In reply to comment #16) > Thus far, vanilla 3.2 seems stable enough. > I used the same configuration used by a F17 3.2 kernel. Where can I get the source for F17 3.2 kernel? > > ... However, I'm currently testing (c)gdb on a VM running on my Xeon > workstation (w/ nVidia binary driver). Most of the previous testing was done a > headless Athlon Phenom (635) server. > Any chance that the trigger for this bug is hardware / CPU related? > Perf subsystem is hardware dependent, so it is possible.
(In reply to comment #17) > (In reply to comment #16) > > Thus far, vanilla 3.2 seems stable enough. > > I used the same configuration used by a F17 3.2 kernel. > Where can I get the source for F17 3.2 kernel? http://koji.fedoraproject.org/koji/buildinfo?buildID=281207 is the main build page for the 3.2 build that was in the repositories until today. The SRPM for it is here: http://kojipkgs.fedoraproject.org/packages/kernel/3.2.0/2.fc17/src/kernel-3.2.0-2.fc17.src.rpm
Just to eliminate utrace, I started a scratch build with that patch not applied of the 3.2.0-2 kernel here: http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910 When it is finished building, could you give it a test?
(In reply to comment #19) > Just to eliminate utrace, I started a scratch build with that patch not applied > of the 3.2.0-2 kernel here: > > http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910 > > When it is finished building, could you give it a test? Oh, thanks Josh. Yes, I am starting to be afraid I was wrong. I still can't imagine how the utrace changes could introduce a problem like this, but given that the vanilla 3.2 kernel works fine...
(In reply to comment #20) > (In reply to comment #19) > > Just to eliminate utrace, I started a scratch build with that patch not applied > > of the 3.2.0-2 kernel here: > > > > http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910 > > > > When it is finished building, could you give it a test? > > Oh, thanks Josh. > > Yes, I am starting to afraid I was wrong. I still can't imagine > how utrace changes could introduce the problem like this, but > given that vanilla 3.2 kernel works fine... It works on different HW. I still would like Gilboa to try it on AMD.
Hello, I'm having difficulties reproducing this crash on vanilla 3.2 (good). ... But I'm also having difficulties reproducing it on a fresh 3.1.9 (fc16 rpm from @updates). I'll try to free a couple of hours tomorrow (Sunday) to see which of the possible kernel configurations (3.1.7, 3.1.9, 3.2.1 from koji and vanilla 3.2) is crashing, and when/how. - Gilboa
OK. Both 2.6.41.9 (F15) and 3.1.9 (F16) are oopsing w/ gdb. I'll install 3.2.1 from koji and test it. - Gilboa
3.2.1 from Koji goes up in flames. [ 772.672021] <IRQ> [ 772.672021] [<ffffffff81112ed0>] perf_event_task_tick+0x80/0x290 [ 772.672021] [<ffffffff81053329>] ? sched_slice+0x59/0xa0 [ 772.672021] [<ffffffff81066b9c>] scheduler_tick+0xdc/0x300 [ 772.672021] [<ffffffff8107e48e>] update_process_times+0x6e/0x90 [ 772.672021] [<ffffffff810a0d84>] tick_sched_timer+0x64/0xc0 [ 772.672021] [<ffffffff81093dd0>] __run_hrtimer+0x70/0x1e0 [ 772.672021] [<ffffffff810a0d20>] ? tick_nohz_handler+0x100/0x100 [ 772.672021] [<ffffffff8103ccb9>] ? kvm_clock_get_cycles+0x9/0x10 [ 772.672021] [<ffffffff8109474b>] hrtimer_interrupt+0xeb/0x210 [ 772.672021] [<ffffffff815ebb09>] smp_apic_timer_interrupt+0x69/0x99 [ 772.672021] [<ffffffff815e99de>] apic_timer_interrupt+0x6e/0x80 [ 772.672021] <EOI> [ 772.672021] [<ffffffff815e9034>] ? sysret_audit+0x16/0x20 [ 772.672021] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25 [ 772.672021] RIP [<ffffffff81112d69>] perf_ctx_adjust_freq+0x49/0x130 [ 772.672021] RSP <ffff88011fd83d88> [ 772.672021] CR2: 0000000000000048 [ 772.672021] ---[ end trace ced3abbf6a7fa769 ]--- [ 772.672021] Kernel panic - not syncing: Fatal exception in interrupt [ 772.672021] Pid: 10786, comm: gdb Tainted: P D O 3.2.1-1.fc16.x86_64 #1 [ 772.672021] Call Trace: [ 772.672021] <IRQ> [<ffffffff815d682f>] panic+0x91/0x1a7 [ 772.672021] [<ffffffff815e217a>] oops_end+0xea/0xf0 [ 772.672021] [<ffffffff815d6114>] no_context+0x214/0x223 [ 772.672021] [<ffffffff813d3b7b>] ? ata_scsi_qc_complete+0x6b/0x470 [ 772.672021] [<ffffffff815d62ec>] __bad_area_nosemaphore+0x1c9/0x1e8 [ 772.672021] [<ffffffff8105e8bc>] ? update_group_power+0x9c/0x130 [ 772.672021] [<ffffffff812b5d36>] ? 
cpumask_next_and+0x36/0x50 [ 772.672021] [<ffffffff815d631e>] bad_area_nosemaphore+0x13/0x15 [ 772.672021] [<ffffffff815e4c26>] do_page_fault+0x416/0x4f0 [ 772.672021] [<ffffffff815e4465>] do_async_page_fault+0x35/0x80 [ 772.672021] [<ffffffff815e1725>] async_page_fault+0x25/0x30 [ 772.672021] [<ffffffff81112d69>] ? perf_ctx_adjust_freq+0x49/0x130 [ 772.672021] [<ffffffff8103cca9>] ? kvm_clock_read+0x19/0x20 [ 772.672021] [<ffffffff81112ed0>] perf_event_task_tick+0x80/0x290 [ 772.672021] [<ffffffff81053329>] ? sched_slice+0x59/0xa0 [ 772.672021] [<ffffffff81066b9c>] scheduler_tick+0xdc/0x300 [ 772.672021] [<ffffffff8107e48e>] update_process_times+0x6e/0x90 [ 772.672021] [<ffffffff810a0d84>] tick_sched_timer+0x64/0xc0 [ 772.672021] [<ffffffff81093dd0>] __run_hrtimer+0x70/0x1e0 [ 772.672021] [<ffffffff810a0d20>] ? tick_nohz_handler+0x100/0x100 [ 772.672021] [<ffffffff8103ccb9>] ? kvm_clock_get_cycles+0x9/0x10 [ 772.672021] [<ffffffff8109474b>] hrtimer_interrupt+0xeb/0x210 [ 772.672021] [<ffffffff815ebb09>] smp_apic_timer_interrupt+0x69/0x99 [ 772.672021] [<ffffffff815e99de>] apic_timer_interrupt+0x6e/0x80 [ 772.672021] <EOI> [<ffffffff815e9034>] ? sysret_audit+0x16/0x20 - Gilboa
This is on the same HW where vanilla 3.2 worked fine?
In theory, yes. I'll build 3.2.1 vanilla tomorrow (on the same VM) and restart testing. On the up side, this really feels like a platform (as in CPU) dependent bug. ... Oh, it may or may not be relevant - but the code we're debugging makes heavy use of numa-ctl for memory allocation. - Gilboa
(In reply to comment #26) > In theory, yes. > I'll build 3.2.1 vanilla tomorrow (on the same VM) and restart testing. > > On the up side, this really feels like a platform (as in CPU) dependent bug. > > ... Oh, it may or may not be relevant - but the code we're debugging makes > heavy use of numa-ctl for memory allocation. > > - Gilboa Please try the kernel I linked to in comment #19. It has utrace disabled.
The crash in comment 24 came from the fc16 koji build :(
Gilboa, first of all, thanks a lot for your efforts. But I got lost, (In reply to comment #28) > > The crash on comment 24 came from fc16 koji build :( that comment says "3.2.1 from Koji", while the kernel linked in #19 is kernel-3.2.0, at least as far as I can tell from http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910 ... Was it really that kernel?
(In reply to comment #29) > Gilboa, first of all thanks a lot for your efforts. Yes, thank you. > was it really that kernel? [ 772.672021] Pid: 10786, comm: gdb Tainted: P D O 3.2.1-1.fc16.x86_64 That's not the kernel I built without utrace. Also, please, try duplicating this without any proprietary modules loaded.
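Both points in this comment (which kernel build actually crashed, and whether proprietary modules were loaded) can be read off the "Pid: ..." line of an oops. The following is a rough sketch for doing that mechanically; `parse_pid_line` is a hypothetical helper, the regular expression assumes the line layout seen in this report, and the table covers only the taint letters that appear in these logs.

```python
import re

# Meanings of the taint letters that appear in this report (not a complete list).
TAINT_FLAGS = {
    "P": "a proprietary module was loaded",
    "G": "only GPL-compatible modules were loaded (counterpart of P)",
    "D": "the kernel has already died (an earlier oops or BUG)",
    "W": "a kernel warning was issued earlier",
    "O": "an out-of-tree module was loaded",
}

# Assumed layout: "... comm: <name> Tainted: <letters> <kernel release> ..."
PID_LINE_RE = re.compile(r"comm: (\S+) Tainted: ([A-Z ]+?) (\d\S*)")


def parse_pid_line(line):
    """Return (comm, taint letters, kernel release) from an oops 'Pid:' line."""
    match = PID_LINE_RE.search(line)
    if match is None:
        raise ValueError("no 'comm: ... Tainted: ...' fields found")
    comm, flags, release = match.groups()
    return comm, [c for c in flags if c != " "], release


line = "[ 772.672021] Pid: 10786, comm: gdb Tainted: P D O 3.2.1-1.fc16.x86_64 #1"
comm, flags, release = parse_pid_line(line)
print(comm, release)
for flag in flags:
    print(flag, "=", TAINT_FLAGS.get(flag, "not covered by this sketch"))
```

For the line above, the release string is 3.2.1-1.fc16.x86_64 rather than a 3.2.0-2 scratch build, and the P flag indicates that a proprietary module was loaded when the crash happened.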
My mistake, sorry. Do you want to rebuild the SRPM under F16 or do you want me to use the F17 kernel as-is? - Gilboa
(In reply to comment #31) > My mistake, sorry. > Do you want to rebuild the SRPM under F16 or do you want me to use the F17 > kernel as-is? No need to rebuild. Installing it directly should work just fine. Thank you
Finally got around to building a VM to test the koji build, but there's no binary package in http://koji.fedoraproject.org/koji/taskinfo?taskID=3696911 What am I missing? - Gilboa
(In reply to comment #33) > Finally got around to build a VM to test the koji build, but there's no binary > package in http://koji.fedoraproject.org/koji/taskinfo?taskID=3696911 > What am I missing? Waited too long and koji auto-pruned it because it was a scratch build. I rebuilt it here: http://koji.fedoraproject.org/koji/taskinfo?taskID=3746349 When that completes it should be identical to what was originally posted at the link that no longer works.
No go. fc17 kernel died after ~2 "runs". However, there was no callstack / OOPs message in the serial console. - Gilboa
Short update. I'm currently installing Ubuntu 11.10 on a VM to see if I can reproduce the crash under a different distro (I already tried the vanilla kernel path). ... While I'm not sure it'll do any good, it might at least confirm that this is not an upstream kernel.org bug. - Gilboa
3.2.6 triggers warn-on-slowpath (again, by "run <param>" in gdb). I'll try and reproduce it on non-tainted host/vm. [367414.345972] ------------[ cut here ]------------ [367414.345979] WARNING: at kernel/events/core.c:2047 task_ctx_sched_out+0x63/0x70() [367414.345981] Hardware name: GA-MA785GM-US2H [367414.345982] Modules linked in: bluetooth rfkill btrfs zlib_deflate libcrc32c ufs hfsplus hfs minix vfat msdos fat jfs xfs reiserfs usb_storage nfs fscache fuse ppdev parport_pc lp parport ipt_LOG ipt_MASQUERADE xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bridge stp llc it87 hwmon_vid xts gf128mul sha256_generic dm_crypt snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel nvidia(P) snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore r8169 snd_page_alloc sp5100_tco k10temp i2c_piix4 i2c_core edac_core edac_mce_amd mii microcode vhost_net macvtap macvlan tun virtio_net kvm_amd kvm nfsd lockd nfs_acl auth_rpcgss sunrpc binfmt_misc uinput pata_acpi ata_generic pata_atiixp wmi [last unloaded: scsi_wait_scan] [367414.346024] Pid: 28328, comm: upgen Tainted: P O 3.2.6-3.fc16.x86_64 #1 [367414.346026] Call Trace: [367414.346031] [<ffffffff8106dd4f>] warn_slowpath_common+0x7f/0xc0 [367414.346034] [<ffffffff8106ddaa>] warn_slowpath_null+0x1a/0x20 [367414.346036] [<ffffffff8110fc43>] task_ctx_sched_out+0x63/0x70 [367414.346039] [<ffffffff81113ffa>] perf_event_comm+0x8a/0x330 [367414.346042] [<ffffffff811888e2>] ? do_filp_open+0x42/0xa0 [367414.346045] [<ffffffff8117fff0>] set_task_comm+0x60/0x70 [367414.346047] [<ffffffff811dbe0f>] comm_write+0xdf/0xf0 [367414.346050] [<ffffffff81178fa3>] vfs_write+0xb3/0x180 [367414.346052] [<ffffffff811792ca>] sys_write+0x4a/0x90 [367414.346055] [<ffffffff815e9982>] system_call_fastpath+0x16/0x1b [367414.346057] ---[ end trace 33dece0994e4edf5 ]---
[mass update] kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository. Please retest with this update.
I'll give it a test. - Gilboa
We managed to reproduce the gdb-kernel-crash a couple of times on a 3.3.0 kernel running on a dual core ATOM 330 netbook w/ nouveau driver. No crash dump. (We'll try to reproduce this crash on a VM w/ serial console.) - Gilboa
We've dropped utrace in the 3.4.2 or newer kernels. Are you still seeing this with the latest F16 kernel update?
Hi, We moved to F17 across the board, and thus far, I'm happy to say, we haven't managed to reproduce this OOPs. Bug can be safely closed. Thanks! - Gilboa
OK, thank you for letting us know.