Bug 771894

Summary:	gdb session is oopsing latest kernel releases (F15, F16, hosts, guests)
Product:	[Fedora] Fedora	Reporter:	Gilboa Davara <gilboad>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED NEXTRELEASE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	16	CC:	fche, gansalmon, gleb, itamar, jonathan, kernel-maint, madhu.chinakonda, onestero
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-09-06 12:17:19 UTC	Type:	---

Description Gilboa Davara 2012-01-05 10:44:03 UTC

Since the last round of kernel releases for F15 and F16, we've been experiencing multiple kernel crashes on both F15 and F16 machines.
As far as we could test, dropping back to F15's 2.6.41.1 (as opposed to 2.6.41.4) or F16's 3.1.4 (as opposed to 3.1.6) seems to reduce the chance of crashing the kernel.
The type of gdb front-end doesn't seem to make a difference (I'm using the console-based cgdb, my co-workers are using Eclipse, etc.).

In order to reproduce it in a controlled environment (the physical machines all use the nVidia binary drivers, and I doubt that a tainted kernel OOPs will be welcome :)), I managed to reproduce the crash in an F15 x86_64 VM running on an F16/x86_64 host.

OOPs:
gilboa-vmw-probe64 login: [ 7746.776241] general protection fault: 0000 [#1] SMP
[ 7746.777215] CPU 1
[ 7746.777215] Modules linked in: nfs fscache auth_rpcgss nfs_acl 8021q garp stp llc lockd sunrpc uinput ipt_LOG xt_state bnep iptable_nat nf_nat bluetooth nf_conntrack_ipv4 rfkill nf_conntrack nf_defrag_ipv4 snd_hda_intel snd_hda_codec snd_hwdep snd_seq ppdev snd_seq_device parport_pc parport snd_pcm joydev snd_timer snd i2c_piix4 soundcore snd_page_alloc i2c_core e1000 8139cp mii ipv6 [last unloaded: scsi_wait_scan]
[ 7746.777215]
[ 7746.777215] Pid: 6323, comm: gdb Tainted: G        W   2.6.41.4-1.fc15.x86_64 #1 Bochs Bochs
[ 7746.777215] RIP: 0010:[<ffffffff810d8aad>]  [<ffffffff810d8aad>] perf_ctx_adjust_freq+0x29/0xd5
[ 7746.777215] RSP: 0018:ffff88011fc83da8  EFLAGS: 00010003
[ 7746.777215] RAX: 66524153e5894855 RBX: 66524153e5894845 RCX: 0000000000000000
[ 7746.777215] RDX: ffff88011fc95da8 RSI: 00000000000f41a8 RDI: ffff880114b6d3c0
[ 7746.777215] RBP: ffff88011fc83dd8 R08: 000000000000017f R09: 000000000000017f
[ 7746.777215] R10: 000000000000017f R11: ffff88011fc92d70 R12: ffff880114b6d410
[ 7746.777215] R13: 00000000000f41a8 R14: ffff88011fc95e80 R15: ffff88003797ae60
[ 7746.777215] FS:  00007fb0f3079720(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
[ 7746.777215] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 7746.777215] CR2: 000000000046e190 CR3: 00000000378b3000 CR4: 00000000000006e0
[ 7746.777215] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7746.777215] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 7746.777215] Process gdb (pid: 6323, threadinfo ffff8800378ee000, task ffff88003797ae60)
[ 7746.777215] Stack:
[ 7746.777215]  ffff88011fc83db8 66524153e5894855 ffff88011fc83dd8 ffff88011fc95db0
[ 7746.777215]  ffff88011fc8f810 ffff88011fc8f740 ffff88011fc83e38 ffffffff810d8c6c
[ 7746.777215]  ffff880100000000 00000000000f41a8 0000000116d67400 ffff880114b6d3c0
[ 7746.777215] Call Trace:
[ 7746.777215]  <IRQ>
[ 7746.777215]  [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
[ 7746.777215]  [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
[ 7746.777215]  [<ffffffff8106520d>] update_process_times+0x65/0x76
[ 7746.777215]  [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
[ 7746.777215]  [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
[ 7746.777215]  [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
[ 7746.777215]  [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
[ 7746.777215]  [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
[ 7746.777215]  [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
[ 7746.777215]  <EOI>
[ 7746.777215]  [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8
[ 7746.777215]  [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd
[ 7746.777215]  [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
[ 7746.777215]  [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21
[ 7746.777215]  [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9
[ 7746.777215]  [<ffffffff81062825>] ptrace_request+0x373/0x410
[ 7746.777215]  [<ffffffff8149ca36>] ? _raw_spin_lock+0xe/0x10
[ 7746.777215]  [<ffffffff810456e0>] ? task_rq_lock+0x4e/0x87
[ 7746.777215]  [<ffffffff8149ca9c>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[ 7746.777215]  [<ffffffff810451ca>] ? task_rq_unlock+0x1b/0x1d
[ 7746.777215]  [<ffffffff8104fae6>] ? wait_task_inactive+0xb3/0x129
[ 7746.777215]  [<ffffffff81018ffc>] arch_ptrace+0x1aa/0x1bb
[ 7746.777215]  [<ffffffff81062418>] sys_ptrace+0x97/0xb3
[ 7746.777215]  [<ffffffff814a31c2>] system_call_fastpath+0x16/0x1b
[ 7746.777215] Code: 5d c3 55 48 89 e5 41 55 49 89 f5 41 54 4c 8d 67 50 53 48 83 ec 18 48 8b 47 50 48 89 45 d8 48 8b 5d d8 48 83 eb 10 e9 94 00 00 00
[ 7746.777215]  7b 58 01 75 7e 48 89 df e8 ae bb ff ff 85 c0 74 72 48 8b 83
[ 7746.777215] RIP  [<ffffffff810d8aad>] perf_ctx_adjust_freq+0x29/0xd5
[ 7746.777215]  RSP <ffff88011fc83da8>
[ 7746.777215] ---[ end trace 672d713ca5841918 ]---
[ 7746.777215] Kernel panic - not syncing: Fatal exception in interrupt
[ 7746.777215] Pid: 6323, comm: gdb Tainted: G      D W   2.6.41.4-1.fc15.x86_64 #1
[ 7746.777215] Call Trace:
[ 7746.777215]  <IRQ>  [<ffffffff81493847>] panic+0x91/0x1a5
[ 7746.777215]  [<ffffffff8149dbe6>] oops_end+0xb4/0xc5
[ 7746.777215]  [<ffffffff81011d47>] die+0x5a/0x63
[ 7746.777215]  [<ffffffff8149d61f>] do_general_protection+0x128/0x130
[ 7746.777215]  [<ffffffff8149d0c5>] general_protection+0x25/0x30
[ 7746.777215]  [<ffffffff810d8aad>] ? perf_ctx_adjust_freq+0x29/0xd5
[ 7746.777215]  [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
[ 7746.777215]  [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
[ 7746.777215]  [<ffffffff8106520d>] update_process_times+0x65/0x76
[ 7746.777215]  [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
[ 7746.777215]  [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
[ 7746.777215]  [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
[ 7746.777215]  [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
[ 7746.777215]  [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
[ 7746.777215]  [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
[ 7746.777215]  <EOI>  [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8
[ 7746.777215]  [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd
[ 7746.777215]  [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
[ 7746.777215]  [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21
[ 7746.777215]  [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9
[ 7746.777215]  [<ffffffff81062825>] ptrace_request+0x373/0x410
[ 7746.777215]  [<ffffffff8149ca36>] ? _raw_spin_lock+0xe/0x10
[ 7746.777215]  [<ffffffff810456e0>] ? task_rq_lock+0x4e/0x87
[ 7746.777215]  [<ffffffff8149ca9c>] ? _raw_spin_unlock_irqrestore+0x17/0x19
[ 7746.777215]  [<ffffffff810451ca>] ? task_rq_unlock+0x1b/0x1d
[ 7746.777215]  [<ffffffff8104fae6>] ? wait_task_inactive+0xb3/0x129
[ 7746.777215]  [<ffffffff81018ffc>] arch_ptrace+0x1aa/0x1bb
[ 7746.777215]  [<ffffffff81062418>] sys_ptrace+0x97/0xb3
[ 7746.777215]  [<ffffffff814a31c2>] system_call_fastpath+0x16/0x1b
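
Two details of the register dump above can be checked by hand: RBX equals RAX minus 0x10, and neither value is a canonical x86-64 address, which is consistent with the CPU reporting a general protection fault rather than a page fault. The sketch below assumes 48-bit virtual addressing (an assumption about the CPU, not stated in the report); the helper names are made up for illustration.

```python
# Register values copied from the oops above.
RAX = 0x66524153E5894855
RBX = 0x66524153E5894845
RSP = 0xFFFF88011FC83DA8


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) are all equal (sign-extended address).

    va_bits=48 is an assumption about the CPU's virtual address width.
    """
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def little_endian_bytes(value):
    """Show a 64-bit value as the byte sequence it occupies in memory."""
    return value.to_bytes(8, "little").hex(" ")


print(hex(RAX - RBX))            # 0x10
print(is_canonical(RBX))         # False
print(is_canonical(RSP))         # True
print(little_endian_bytes(RAX))  # 55 48 89 e5 53 41 52 66
```

The in-memory bytes of RAX begin with `55 48 89 e5`, the encoding of the common `push %rbp; mov %rsp,%rbp` function prologue. That may mean the pointer was loaded from memory that no longer holds a list node, but the trace alone does not prove it.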

Comment 1 Josh Boyer 2012-01-05 12:24:15 UTC

What were you doing in gdb when this oopsed?  Also, was the VM you tested in running on a host that was using the nVidia drivers?

Comment 2 Gilboa Davara 2012-01-05 15:32:13 UTC

My co-workers and I were debugging our proprietary software under gdb. In many cases, a simple "run" was enough to trigger the oops.
We managed to reproduce the issue on guests that were running under nVidia-less hosts.

- Gilboa

Comment 3 Gilboa Davara 2012-01-05 15:44:00 UTC

Just to be certain, I'll set up a headless server w/ VMs to see if this OOPs reproduces reliably.

- Gilboa

Comment 4 Josh Boyer 2012-01-05 16:07:25 UTC

I found a similar report from almost a year ago, also involving ptrace.  It doesn't seem to have resolved itself though.

http://marc.info/?l=linux-kernel&m=129198247531612&w=2

Comment 5 Oleg Nesterov 2012-01-05 17:12:52 UTC

(In reply to comment #0)
> [ 7746.777215]  [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
> [ 7746.777215]  [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
> [ 7746.777215]  [<ffffffff8106520d>] update_process_times+0x65/0x76
> [ 7746.777215]  [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
> [ 7746.777215]  [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
> [ 7746.777215]  [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
> [ 7746.777215]  [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
> [ 7746.777215]  [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
> [ 7746.777215]  [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
> [ 7746.777215]  <EOI>
> [ 7746.777215]  [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8
> [ 7746.777215]  [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd
> [ 7746.777215]  [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
> [ 7746.777215]  [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21
> [ 7746.777215]  [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9

Is this trace always the same? Did you use perf?

I can hardly believe this has something to do with ptrace...

Gleb, do you think this can be somehow connected with your
fixes in perf_sched_in ?

Comment 6 Gleb Natapov 2012-01-05 17:53:55 UTC

(In reply to comment #5)
> (In reply to comment #0)
> > [ 7746.777215]  [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
> > [ 7746.777215]  [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
> > [ 7746.777215]  [<ffffffff8106520d>] update_process_times+0x65/0x76
> > [ 7746.777215]  [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
> > [ 7746.777215]  [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
> > [ 7746.777215]  [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
> > [ 7746.777215]  [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
> > [ 7746.777215]  [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
> > [ 7746.777215]  [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
> > [ 7746.777215]  <EOI>
> > [ 7746.777215]  [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8
> > [ 7746.777215]  [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd
> > [ 7746.777215]  [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
> > [ 7746.777215]  [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21
> > [ 7746.777215]  [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9
> 
> Is this trace always the same? Did you use perf?
> 
> I can hardly believe this has something to do with ptrace...
> 
> Gleb, do you think this can be somehow connected with your
> fixes in  perf_sched_in ?

The trace looks similar to traces I got while debugging perf. But to trigger it perf needs to be running. Does this kernel have a fix for perf_sched_in and jump_label?

Also here is a bug report with a similar trace where those fixes made it disappear: http://lkml.org/lkml/2011/11/5/101

Comment 7 Josh Boyer 2012-01-05 18:27:18 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #0)
> > > [ 7746.777215]  [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
> > > [ 7746.777215]  [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
> > > [ 7746.777215]  [<ffffffff8106520d>] update_process_times+0x65/0x76
> > > [ 7746.777215]  [<ffffffff81080c16>] tick_sched_timer+0x75/0x9f
> > > [ 7746.777215]  [<ffffffff8107611c>] __run_hrtimer+0xb0/0x154
> > > [ 7746.777215]  [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
> > > [ 7746.777215]  [<ffffffff81076833>] hrtimer_interrupt+0xe0/0x19d
> > > [ 7746.777215]  [<ffffffff814a5d9c>] smp_apic_timer_interrupt+0x77/0x8a
> > > [ 7746.777215]  [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
> > > [ 7746.777215]  <EOI>
> > > [ 7746.777215]  [<ffffffff8104f9f6>] ? try_to_wake_up+0x1b6/0x1c8
> > > [ 7746.777215]  [<ffffffff81085d4e>] ? arch_local_irq_restore+0x6/0xd
> > > [ 7746.777215]  [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
> > > [ 7746.777215]  [<ffffffff810617ef>] unlock_task_sighand+0x1f/0x21
> > > [ 7746.777215]  [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9
> > 
> > Is this trace always the same? Did you use perf?
> > 
> > I can hardly believe this has something to do with ptrace...
> > 
> > Gleb, do you think this can be somehow connected with your
> > fixes in  perf_sched_in ?
> 
> The trace looks similar to traces I got while debugging perf. But to trigger it
> perf needs to be running. Does this kernel have a fix for perf_sched_in and
> jump_label?

1d5f003f5a964711853514b04ddc872eec0fdc7b and bbbf7af4bf8fc69bc751818cf30521080fa47dcb are in 3.2-rc5 (ish).  The latter is in 3.1.5.  The former isn't queued for -stable at all from what I can tell.

So, the particular kernel in this bug report doesn't have either of those fixes.  The 2.6.41.7 kernel in fedora updates-testing has the jump_label fix, but not the other.

> Also here is bug report with similar trace where those fixes made it disappear:
> http://lkml.org/lkml/2011/11/5/101

Gilboa, if you can recreate this fairly easily please let us know.  Also, trying the 2.6.41.7 kernel in updates-testing would be a good idea as well.

Comment 8 Gilboa Davara 2012-01-08 11:56:27 UTC

Managed to crash 2.6.41.7.
We're not using perf; we simply use gdb to test our proprietary software (run XXX).

Callstack:
[15758.642011] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[15758.642011] IP: [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130
[15758.642011] PGD 181e6d067 PUD 181d04067 PMD 0
[15758.642011] Oops: 0000 [#1] SMP
[15758.642011] CPU 5
[15758.642011] Modules linked in: lockd uinput 8021q garp stp llc sunrpc ipt_LOG xt_state bnep bluetooth iptable_nat nf_nat nf_conntrack_ipv4 rfkill nf_conntrack nf_defrag_ipv4 snd_hda_intel snd_hda_codec snd_hwdep i2c_piix4 snd_seq ppdev snd_seq_device joydev snd_pcm parport_pc i2c_core parport snd_timer snd microcode soundcore snd_page_alloc e1000 8139cp mii ipv6 [last unloaded: scsi_wait_scan]
[15758.642011]
[15758.642011] Pid: 16385, comm: gdb Tainted: G        W   2.6.41.7-1.fc15.x86_64 #1 Bochs Bochs
[15758.642011] RIP: 0010:[<ffffffff8110d639>]  [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130
[15758.642011] RSP: 0018:ffff88019fd43d88  EFLAGS: 00010086
[15758.642011] RAX: 0000000000000000 RBX: fffffffffffffff0 RCX: 0000000000000000
[15758.642011] RDX: ffff88019fd55fa8 RSI: 00000000000f41a8 RDI: ffff880181f1e300
[15758.642011] RBP: ffff88019fd43db8 R08: 0000000000989680 R09: 0000000000000020
[15758.642011] R10: 0000000000000400 R11: 0000000000000000 R12: ffff880181f1e350
[15758.642011] R13: 00000000000f41a8 R14: ffff88019fd4f930 R15: ffff880195145cc0
[15758.642011] FS:  00007f423267e720(0000) GS:ffff88019fd40000(0000) knlGS:0000000000000000
[15758.642011] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15758.642011] CR2: 0000000000000048 CR3: 000000017fcdc000 CR4: 00000000000006e0
[15758.642011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15758.642011] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[15758.642011] Process gdb (pid: 16385, threadinfo ffff880181cc0000, task ffff880195145cc0)
[15758.642011] Stack:
[15758.642011]  ffff88019fd43d98 0000000000000000 ffff88019fd43db8 ffff88019fd55fb0
[15758.642011]  ffff88019fd4f860 ffff88019fd56080 ffff88019fd43e18 ffffffff8110d7a0
[15758.642011]  ffff88017fcdae00 00000000000f41a8 0000000100000000 ffff880181f1e300
[15758.642011] Call Trace:
[15758.642011]  <IRQ>
[15758.642011]  [<ffffffff8110d7a0>] perf_event_task_tick+0x80/0x290
[15758.642011]  [<ffffffff8106478c>] scheduler_tick+0xdc/0x280
[15758.642011]  [<ffffffff8107cb0e>] update_process_times+0x6e/0x90
[15758.642011]  [<ffffffff8109f694>] tick_sched_timer+0x64/0xc0
[15758.642011]  [<ffffffff810920e0>] __run_hrtimer+0x70/0x1e0
[15758.642011]  [<ffffffff8109f630>] ? tick_nohz_handler+0x100/0x100
[15758.642011]  [<ffffffff8103ac39>] ? kvm_clock_get_cycles+0x9/0x10
[15758.642011]  [<ffffffff81092a5b>] hrtimer_interrupt+0xeb/0x210
[15758.642011]  [<ffffffff815b9f49>] smp_apic_timer_interrupt+0x69/0x99
[15758.642011]  [<ffffffff815b7e1e>] apic_timer_interrupt+0x6e/0x80
[15758.642011]  <EOI>
[15758.642011]  [<ffffffff8117eb35>] ? inode_permission+0x25/0x100
[15758.642011]  [<ffffffff8115d4b4>] ? kmem_cache_free+0x104/0x110
[15758.642011]  [<ffffffff811808b6>] link_path_walk+0x76/0x880
[15758.642011]  [<ffffffff8115e58c>] ? kmem_cache_alloc_trace+0x10c/0x140
[15758.642011]  [<ffffffff812843da>] ? selinux_file_alloc_security+0x4a/0x80
[15758.642011]  [<ffffffff8118076d>] ? path_init+0x2cd/0x3a0
[15758.642011]  [<ffffffff811829a8>] path_openat+0xb8/0x3c0
[15758.642011]  [<ffffffff8107e56b>] ? recalc_sigpending+0x3b/0x90
[15758.642011]  [<ffffffff8107ec97>] ? __set_task_blocked+0x37/0x80
[15758.642011]  [<ffffffff81182dd2>] do_filp_open+0x42/0xa0
[15758.642011]  [<ffffffff8117e43b>] ? getname_flags+0x3b/0x260
[15758.642011]  [<ffffffff8118ea9f>] ? alloc_fd+0x4f/0x150
[15758.642011]  [<ffffffff81172827>] do_sys_open+0xf7/0x1d0
[15758.642011]  [<ffffffff810ca2e2>] ? audit_syscall_entry+0x242/0x360
[15758.642011]  [<ffffffff81172920>] sys_open+0x20/0x30
[15758.642011]  [<ffffffff815b7342>] system_call_fastpath+0x16/0x1b
[15758.642011] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25
[15758.642011] RIP  [<ffffffff8110d639>] perf_ctx_adjust_freq+0x49/0x130
[15758.642011]  RSP <ffff88019fd43d88>
[15758.642011] CR2: 0000000000000048
[15758.642011] ---[ end trace bb0af427d549d80a ]---
[15758.642011] Kernel panic - not syncing: Fatal exception in interrupt
[15758.642011] Pid: 16385, comm: gdb Tainted: G      D W   2.6.41.7-1.fc15.x86_64 #1
[15758.642011] Call Trace:
[15758.642011]  <IRQ>  [<ffffffff815a49dc>] panic+0x91/0x1a7
[15758.642011]  [<ffffffff815b088a>] oops_end+0xea/0xf0
[15758.642011]  [<ffffffff815a42c1>] no_context+0x209/0x218
[15758.642011]  [<ffffffff815a4499>] __bad_area_nosemaphore+0x1c9/0x1e8
[15758.642011]  [<ffffffff8103ba65>] ? pvclock_clocksource_read+0x55/0xf0
[15758.642011]  [<ffffffff8105c76f>] ? update_group_power+0x9f/0x130
[15758.642011]  [<ffffffff815a44cb>] bad_area_nosemaphore+0x13/0x15
[15758.642011]  [<ffffffff815b3076>] do_page_fault+0x416/0x4f0
[15758.642011]  [<ffffffff815b28b5>] do_async_page_fault+0x35/0x80
[15758.642011]  [<ffffffff815afbe5>] async_page_fault+0x25/0x30
[15758.642011]  [<ffffffff8110d639>] ? perf_ctx_adjust_freq+0x49/0x130
[15758.642011]  [<ffffffff8103ac29>] ? kvm_clock_read+0x19/0x20
[15758.642011]  [<ffffffff8110d7a0>] perf_event_task_tick+0x80/0x290
[15758.642011]  [<ffffffff8106478c>] scheduler_tick+0xdc/0x280
[15758.642011]  [<ffffffff8107cb0e>] update_process_times+0x6e/0x90
[15758.642011]  [<ffffffff8109f694>] tick_sched_timer+0x64/0xc0
[15758.642011]  [<ffffffff810920e0>] __run_hrtimer+0x70/0x1e0
[15758.642011]  [<ffffffff8109f630>] ? tick_nohz_handler+0x100/0x100
[15758.642011]  [<ffffffff8103ac39>] ? kvm_clock_get_cycles+0x9/0x10
[15758.642011]  [<ffffffff81092a5b>] hrtimer_interrupt+0xeb/0x210
[15758.642011]  [<ffffffff815b9f49>] smp_apic_timer_interrupt+0x69/0x99
[15758.642011]  [<ffffffff815b7e1e>] apic_timer_interrupt+0x6e/0x80
[15758.642011]  <EOI>  [<ffffffff8117eb35>] ? inode_permission+0x25/0x100
[15758.642011]  [<ffffffff8115d4b4>] ? kmem_cache_free+0x104/0x110
[15758.642011]  [<ffffffff811808b6>] link_path_walk+0x76/0x880
[15758.642011]  [<ffffffff8115e58c>] ? kmem_cache_alloc_trace+0x10c/0x140
[15758.642011]  [<ffffffff812843da>] ? selinux_file_alloc_security+0x4a/0x80
[15758.642011]  [<ffffffff8118076d>] ? path_init+0x2cd/0x3a0
[15758.642011]  [<ffffffff811829a8>] path_openat+0xb8/0x3c0
[15758.642011]  [<ffffffff8107e56b>] ? recalc_sigpending+0x3b/0x90
[15758.642011]  [<ffffffff8107ec97>] ? __set_task_blocked+0x37/0x80
[15758.642011]  [<ffffffff81182dd2>] do_filp_open+0x42/0xa0
[15758.642011]  [<ffffffff8117e43b>] ? getname_flags+0x3b/0x260
[15758.642011]  [<ffffffff8118ea9f>] ? alloc_fd+0x4f/0x150
[15758.642011]  [<ffffffff81172827>] do_sys_open+0xf7/0x1d0
[15758.642011]  [<ffffffff810ca2e2>] ? audit_syscall_entry+0x242/0x360
[15758.642011]  [<ffffffff81172920>] sys_open+0x20/0x30
[15758.642011]  [<ffffffff815b7342>] system_call_fastpath+0x16/0x1b
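
To compare this trace with the one in comment 0, it can help to strip the addresses and the "?" (unreliable) frames and look at the common leading frames. The following is a minimal sketch; the two excerpts are abbreviated from the traces in this report, and the function names `reliable_frames` and `common_prefix` are invented for illustration.

```python
import re

# Matches frames such as "[<ffffffff810528d8>] scheduler_tick+0xd0/0x260".
# Group 1 is set when the frame carries the "?" (unreliable) marker.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(call_trace_text):
    """Return function names from Call Trace lines, skipping '?' frames.

    Pass only the Call Trace section; RIP lines would also match.
    """
    frames = []
    for line in call_trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames


def common_prefix(a, b):
    """Longest run of identical leading entries in two lists."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return shared


# Abbreviated excerpts (not the full traces) from comment 0 and comment 8.
COMMENT_0 = """\
[ 7746.777215]  [<ffffffff810d8c6c>] perf_event_task_tick+0x113/0x1d2
[ 7746.777215]  [<ffffffff810528d8>] scheduler_tick+0xd0/0x260
[ 7746.777215]  [<ffffffff81080ba1>] ? tick_nohz_handler+0xde/0xde
[ 7746.777215]  [<ffffffff814a3c9e>] apic_timer_interrupt+0x6e/0x80
[ 7746.777215]  [<ffffffff8149ca9c>] _raw_spin_unlock_irqrestore+0x17/0x19
[ 7746.777215]  [<ffffffff81061af3>] ptrace_resume+0xbb/0xc9
"""
COMMENT_8 = """\
[15758.642011]  [<ffffffff8110d7a0>] perf_event_task_tick+0x80/0x290
[15758.642011]  [<ffffffff8106478c>] scheduler_tick+0xdc/0x280
[15758.642011]  [<ffffffff8103ac39>] ? kvm_clock_get_cycles+0x9/0x10
[15758.642011]  [<ffffffff815b7e1e>] apic_timer_interrupt+0x6e/0x80
[15758.642011]  [<ffffffff811808b6>] link_path_walk+0x76/0x880
[15758.642011]  [<ffffffff81172827>] do_sys_open+0xf7/0x1d0
"""

shared = common_prefix(reliable_frames(COMMENT_0), reliable_frames(COMMENT_8))
print(shared)
# ['perf_event_task_tick', 'scheduler_tick', 'apic_timer_interrupt']
```

In both excerpts the timer-interrupt part of the stack matches, while the interrupted context differs (ptrace in comment 0, open() here).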

Comment 9 Gleb Natapov 2012-01-08 16:29:20 UTC

(In reply to comment #7)
>
> So, the particular kernel in this bug report doesn't have either of those
> fixes.  The 2.6.41.7 kernel in fedora updates-testing has the jump_label fix,
> but not the other.
> 
The jump_label fix is definitely not enough. I made it first, and the oops was still easily reproducible with it applied.

Comment 10 Oleg Nesterov 2012-01-08 17:33:12 UTC

(In reply to comment #8)
> Managed to crash 2.6.41.7.
> We're not using perf;

Hmm. OK, gdb can use perf events "implicitly", but afaics only
if you play with hw breakpoints.

But you are saying:

> we simply use gdb to test our proprietary software (run
> XXX).

Strange... 

> [15758.642011] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48
> 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b
> 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25

at least this is clear after decodecode ;) perf_ctx_adjust_freq()
hits the NULL terminated ctx->event_list. Not that this helps a lot.

Still I hope this _can_ be fixed by Gleb's 1d5f003f + 86b47c25, maybe
->task_ctx wasn't cleared and then this memory was freed/reused.
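
The fault address in comment 8 follows directly from the decoded instructions. In the Code: bytes, `48 8d 58 f0` is `lea -0x10(%rax),%rbx` and `83 7b 58 01` is `cmpl $0x1,0x58(%rbx)`. The sketch below redoes that arithmetic for a NULL list pointer; the constant and function names are invented here, and reading 0x10 as the list-entry offset and 0x58 as the checked field is an interpretation of the disassembly, not something taken from the kernel source.

```python
MASK64 = (1 << 64) - 1

# Offsets read off the disassembly (interpretation, see the note above).
LIST_ENTRY_OFFSET = 0x10    # lea -0x10(%rax),%rbx
CHECKED_FIELD_OFFSET = 0x58  # cmpl $0x1,0x58(%rbx)


def container_of(list_ptr):
    """Step back from the embedded list node to the enclosing object."""
    return (list_ptr - LIST_ENTRY_OFFSET) & MASK64


def fault_address(list_ptr):
    """Address touched by the cmpl for a given 'next' pointer."""
    return (container_of(list_ptr) + CHECKED_FIELD_OFFSET) & MASK64


rbx = container_of(0)        # 'next' pointer was NULL (RAX = 0)
cr2 = fault_address(0)

print(hex(rbx))  # 0xfffffffffffffff0, matches RBX in comment 8
print(hex(cr2))  # 0x48, matches CR2 in comment 8
```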

Comment 11 Oleg Nesterov 2012-01-08 17:40:02 UTC

(In reply to comment #8)
> Managed to crash 2.6.41.7.

Any chance you can try vanilla 3.1 and 3.2 kernels and
let us know if you can re-create the problem (_hopefully_
fixed in 3.2) ?

Comment 12 Gilboa Davara 2012-01-09 04:20:45 UTC

I can either build vanilla from scratch or use the 3.2 RPM and strip a patch (or two).
What do you prefer?

As for perf, we triggered the OOps w/o any breakpoints.
Simply by:
gdb prog
(gdb) run params
BOOM

- Gilboa
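
For repeated testing across kernels, the manual steps above could be scripted. This is only a sketch: `./prog` and `params` stand in for the real (proprietary) program and its arguments, the function names are hypothetical, and the gdb options used (`--batch`, `-ex`, `--args`) are standard gdb command-line options.

```python
import subprocess


def build_gdb_command(program, args):
    """Command line equivalent to: gdb prog, then '(gdb) run params'."""
    return ["gdb", "--batch", "-ex", "run", "--args", program, *args]


def repro_loop(program, args, iterations=20, runner=subprocess.run):
    """Run the program under gdb several times; return gdb's exit codes.

    `runner` is injectable so the loop can be exercised without gdb.
    """
    cmd = build_gdb_command(program, args)
    codes = []
    for _ in range(iterations):
        result = runner(cmd)
        codes.append(result.returncode)
    return codes


# Example (placeholders, not the real program):
#   repro_loop("./prog", ["params"], iterations=50)
```

If the kernel panics, the loop never returns, so the count of completed iterations has to be read from the console or a serial log.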

Comment 13 Gleb Natapov 2012-01-09 09:14:16 UTC

(In reply to comment #12)
> I can either build vanilla from scratch or use the 3.2 RPM and strip a patch
> (or two).
> What do you prefer?

Can you try 3.2 vanilla?

Comment 14 Gilboa Davara 2012-01-09 11:56:20 UTC

OK. I'll try and free some time to build a vanilla 3.2 kernel on one of the F16 VM's.

- Gilboa

Comment 15 Gilboa Davara 2012-01-10 13:01:48 UTC

Haven't tried vanilla yet, but a recompiled F16 3.2 kernel oopses like crazy.
I'm building a vanilla 3.2 kernel as I write this.

Comment 16 Gilboa Davara 2012-01-14 10:38:03 UTC

Thus far, vanilla 3.2 seems stable enough.
I used the same configuration used by a F17 3.2 kernel.

... However, I'm currently testing (c)gdb on a VM running on my Xeon workstation (w/ nVidia binary driver). Most of the previous testing was done on a headless Athlon Phenom (635) server.
Any chance that the trigger for this bug is hardware / CPU related?

- Gilboa

Comment 17 Gleb Natapov 2012-01-14 11:02:39 UTC

(In reply to comment #16)
> Thus far, vanilla 3.2 seems stable enough.
> I used the same configuration used by a F17 3.2 kernel.
Where can I get the source for  F17 3.2 kernel?

> 
> ... However, I'm currently testing (c)gdb on a VM running on my Xeon
> workstation (w/ nVidia binary driver). Most of the previous testing was done on a
> headless Athlon Phenom (635) server.
> Any chance that the trigger for this bug is hardware / CPU related?
> 
Perf subsystem is hardware dependent, so it is possible.

Comment 18 Josh Boyer 2012-01-14 13:30:44 UTC

(In reply to comment #17)
> (In reply to comment #16)
> > Thus far, vanilla 3.2 seems stable enough.
> > I used the same configuration used by a F17 3.2 kernel.
> Where can I get the source for  F17 3.2 kernel?

http://koji.fedoraproject.org/koji/buildinfo?buildID=281207

Is the main build page for the 3.2 build that has been in the repositories until today.  The SRPM for it is here:

http://kojipkgs.fedoraproject.org/packages/kernel/3.2.0/2.fc17/src/kernel-3.2.0-2.fc17.src.rpm

Comment 19 Josh Boyer 2012-01-15 16:09:28 UTC

Just to eliminate utrace, I started a scratch build of the 3.2.0-2 kernel without that patch applied here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910

When it is finished building, could you give it a test?

Comment 20 Oleg Nesterov 2012-01-15 17:48:09 UTC

(In reply to comment #19)
> Just to eliminate utrace, I started a scratch build with that patch not applied
> of the 3.2.0-2 kernel here:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910
> 
> When it is finished building, could you give it a test?

Oh, thanks Josh.

Yes, I am starting to be afraid I was wrong. I still can't imagine
how utrace changes could introduce the problem like this, but
given that vanilla 3.2 kernel works fine...

Comment 21 Gleb Natapov 2012-01-16 08:55:41 UTC

(In reply to comment #20)
> (In reply to comment #19)
> > Just to eliminate utrace, I started a scratch build with that patch not applied
> > of the 3.2.0-2 kernel here:
> > 
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910
> > 
> > When it is finished building, could you give it a test?
> 
> Oh, thanks Josh.
> 
> Yes, I am starting to be afraid I was wrong. I still can't imagine
> how utrace changes could introduce the problem like this, but
> given that vanilla 3.2 kernel works fine...

It works on different HW. I still would like Gilboa to try it on AMD.

Comment 22 Gilboa Davara 2012-01-21 13:37:09 UTC

Hello,

I'm having difficulties reproducing this crash on vanilla 3.2 (good).
... But I'm also having difficulties reproducing it on a fresh 3.1.9 (fc16 rpm from @updates).
I'll try to free a couple of hours tomorrow (Sunday) to see which of the possible kernel configurations (3.1.7, 3.1.9, 3.2.1 from koji and vanilla 3.2) is crashing, and when/how.

- Gilboa

Comment 23 Gilboa Davara 2012-01-22 16:06:30 UTC

OK. Both 2.6.41.9 (F15) and 3.1.9 (F16) are oopsing w/ gdb.
I'll install 3.2.1 from koji and test it.

- Gilboa

Comment 24 Gilboa Davara 2012-01-22 16:50:12 UTC

3.2.1 from Koji goes up in flames.

[  772.672021]  <IRQ> 
[  772.672021]  [<ffffffff81112ed0>] perf_event_task_tick+0x80/0x290
[  772.672021]  [<ffffffff81053329>] ? sched_slice+0x59/0xa0
[  772.672021]  [<ffffffff81066b9c>] scheduler_tick+0xdc/0x300
[  772.672021]  [<ffffffff8107e48e>] update_process_times+0x6e/0x90
[  772.672021]  [<ffffffff810a0d84>] tick_sched_timer+0x64/0xc0
[  772.672021]  [<ffffffff81093dd0>] __run_hrtimer+0x70/0x1e0
[  772.672021]  [<ffffffff810a0d20>] ? tick_nohz_handler+0x100/0x100
[  772.672021]  [<ffffffff8103ccb9>] ? kvm_clock_get_cycles+0x9/0x10
[  772.672021]  [<ffffffff8109474b>] hrtimer_interrupt+0xeb/0x210
[  772.672021]  [<ffffffff815ebb09>] smp_apic_timer_interrupt+0x69/0x99
[  772.672021]  [<ffffffff815e99de>] apic_timer_interrupt+0x6e/0x80
[  772.672021]  <EOI> 
[  772.672021]  [<ffffffff815e9034>] ? sysret_audit+0x16/0x20
[  772.672021] Code: 45 d8 49 39 c4 48 8d 58 f0 75 20 e9 f2 00 00 00 66 90 48 8b 43 10 48 89 45 d8 48 8b 45 d8 49 39 c4 48 8d 58 f0 0f 84 d7 00 00 00 <83> 7b 58 01 75 e1 8b 83 ec 01 00 00 83 f8 ff 74 0c 65 8b 14 25 
[  772.672021] RIP  [<ffffffff81112d69>] perf_ctx_adjust_freq+0x49/0x130
[  772.672021]  RSP <ffff88011fd83d88>
[  772.672021] CR2: 0000000000000048
[  772.672021] ---[ end trace ced3abbf6a7fa769 ]---
[  772.672021] Kernel panic - not syncing: Fatal exception in interrupt
[  772.672021] Pid: 10786, comm: gdb Tainted: P      D    O 3.2.1-1.fc16.x86_64 #1
[  772.672021] Call Trace:
[  772.672021]  <IRQ>  [<ffffffff815d682f>] panic+0x91/0x1a7
[  772.672021]  [<ffffffff815e217a>] oops_end+0xea/0xf0
[  772.672021]  [<ffffffff815d6114>] no_context+0x214/0x223
[  772.672021]  [<ffffffff813d3b7b>] ? ata_scsi_qc_complete+0x6b/0x470
[  772.672021]  [<ffffffff815d62ec>] __bad_area_nosemaphore+0x1c9/0x1e8
[  772.672021]  [<ffffffff8105e8bc>] ? update_group_power+0x9c/0x130
[  772.672021]  [<ffffffff812b5d36>] ? cpumask_next_and+0x36/0x50
[  772.672021]  [<ffffffff815d631e>] bad_area_nosemaphore+0x13/0x15
[  772.672021]  [<ffffffff815e4c26>] do_page_fault+0x416/0x4f0
[  772.672021]  [<ffffffff815e4465>] do_async_page_fault+0x35/0x80
[  772.672021]  [<ffffffff815e1725>] async_page_fault+0x25/0x30
[  772.672021]  [<ffffffff81112d69>] ? perf_ctx_adjust_freq+0x49/0x130
[  772.672021]  [<ffffffff8103cca9>] ? kvm_clock_read+0x19/0x20
[  772.672021]  [<ffffffff81112ed0>] perf_event_task_tick+0x80/0x290
[  772.672021]  [<ffffffff81053329>] ? sched_slice+0x59/0xa0
[  772.672021]  [<ffffffff81066b9c>] scheduler_tick+0xdc/0x300
[  772.672021]  [<ffffffff8107e48e>] update_process_times+0x6e/0x90
[  772.672021]  [<ffffffff810a0d84>] tick_sched_timer+0x64/0xc0
[  772.672021]  [<ffffffff81093dd0>] __run_hrtimer+0x70/0x1e0
[  772.672021]  [<ffffffff810a0d20>] ? tick_nohz_handler+0x100/0x100
[  772.672021]  [<ffffffff8103ccb9>] ? kvm_clock_get_cycles+0x9/0x10
[  772.672021]  [<ffffffff8109474b>] hrtimer_interrupt+0xeb/0x210
[  772.672021]  [<ffffffff815ebb09>] smp_apic_timer_interrupt+0x69/0x99
[  772.672021]  [<ffffffff815e99de>] apic_timer_interrupt+0x6e/0x80
[  772.672021]  <EOI>  [<ffffffff815e9034>] ? sysret_audit+0x16/0x20


- Gilboa

Comment 25 Gleb Natapov 2012-01-22 17:02:16 UTC

This is on the same HW where vanilla 3.2 worked fine?

Comment 26 Gilboa Davara 2012-01-23 03:53:05 UTC

In theory, yes.
I'll build 3.2.1 vanilla tomorrow (on the same VM) and restart testing.

On the up side, this really feels like a platform (as in CPU) dependent bug.

... Oh, it may or may not be relevant - but the code we're debugging makes heavy use of numa-ctl for memory allocation.

- Gilboa

Comment 27 Josh Boyer 2012-01-23 16:04:56 UTC

(In reply to comment #26)
> In theory, yes.
> I'll build 3.2.1 vanilla tomorrow (on the same VM) and restart testing.
> 
> On the up side, this really feels like a platform (as in CPU) dependent bug.
> 
> ... Oh, it may or may not be relevant - but the code we're debugging makes
> heavy use of numa-ctl for memory allocation.
> 
> - Gilboa

Please try the kernel I linked to in comment #19.  It has utrace disabled.

Comment 28 Gilboa Davara 2012-01-23 17:03:29 UTC

The crash on comment 24 came from fc16 koji build :(

Comment 29 Oleg Nesterov 2012-01-23 17:22:26 UTC

Gilboa, first of all thanks a lot for your efforts.

but I got lost,

(In reply to comment #28)
>
> The crash on comment 24 came from fc16 koji build :(

that comment says "3.2.1 from Koji".

while the kernel linked to in #19 is kernel-3.2.0, if I read
http://koji.fedoraproject.org/koji/taskinfo?taskID=3696910
correctly...

was it really that kernel?

Comment 30 Josh Boyer 2012-01-23 18:21:33 UTC

(In reply to comment #29)
> Gilboa, first of all thanks a lot for your efforts.

Yes, thank you.

> was it really that kernel?

[  772.672021] Pid: 10786, comm: gdb Tainted: P      D    O 3.2.1-1.fc16.x86_64

That's not the kernel I built without utrace.  Also, please, try duplicating this without any proprietary modules loaded.
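
The taint letters on that line are what show a proprietary module was loaded. As a small sketch, the decoder below covers only the flags that appear in this report (the meanings are the standard kernel ones; the function name and regular expression are made up for illustration, and other flags exist that are not listed here).

```python
import re

# Only the taint flags seen in this bug report.
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "G": "only GPL-compatible modules loaded",
    "D": "kernel died recently (oops or BUG)",
    "W": "a kernel warning was issued",
    "O": "out-of-tree module loaded",
}


def decode_taint(line):
    """Map each taint letter on an oops 'Tainted:' line to its meaning."""
    match = re.search(r"Tainted: ([A-Z ]+?)\s+\d", line)
    if match is None:
        return {}
    return {
        flag: TAINT_FLAGS.get(flag, "flag not covered by this sketch")
        for flag in match.group(1)
        if flag != " "
    }


line = "[  772.672021] Pid: 10786, comm: gdb Tainted: P      D    O 3.2.1-1.fc16.x86_64"
for flag, meaning in decode_taint(line).items():
    print(flag, "-", meaning)
```

For the comment 24 trace this lists P, D and O, whereas the earlier VM traces show G and W.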

Comment 31 Gilboa Davara 2012-01-24 07:53:05 UTC

My mistake, sorry.
Do you want to rebuild the SRPM under F16 or do you want me to use the F17 kernel as-is?

- Gilboa

Comment 32 Josh Boyer 2012-01-24 12:29:10 UTC

(In reply to comment #31)
> My mistake, sorry.
> Do you want to rebuild the SRPM under F16 or do you want me to use the F17
> kernel as-is?

No need to rebuild.  Installing it directly should work just fine.  Thank you

Comment 33 Gilboa Davara 2012-01-30 15:36:17 UTC

Finally got around to building a VM to test the koji build, but there's no binary package in http://koji.fedoraproject.org/koji/taskinfo?taskID=3696911
What am I missing?

- Gilboa

Comment 34 Josh Boyer 2012-01-30 16:03:21 UTC

(In reply to comment #33)
> Finally got around to building a VM to test the koji build, but there's no binary
> package in http://koji.fedoraproject.org/koji/taskinfo?taskID=3696911
> What am I missing?

Waited too long and koji auto-pruned it because it was a scratch build.

I rebuilt it here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3746349

When that completes it should be identical to what was originally posted at the link that no longer works.

Comment 35 Gilboa Davara 2012-02-01 14:09:51 UTC

No go. fc17 kernel died after ~2 "runs".
However, there was no callstack / OOPs message in the serial console.

- Gilboa

Comment 36 Gilboa Davara 2012-02-02 12:15:06 UTC

Short update. I'm currently installing Ubuntu 11.10 on a VM to see if I can reproduce the crash under a different distro (I already tried the vanilla kernel path).
... While I'm not sure it'll do any good, it might at least confirm that this is not an upstream kernel.org bug.

- Gilboa

Comment 37 Gilboa Davara 2012-02-26 15:20:36 UTC

3.2.6 triggers warn-on-slowpath (again, by "run <param>" in gdb).
I'll try to reproduce it on a non-tainted host/VM.

[367414.345972] ------------[ cut here ]------------
[367414.345979] WARNING: at kernel/events/core.c:2047 task_ctx_sched_out+0x63/0x70()
[367414.345981] Hardware name: GA-MA785GM-US2H
[367414.345982] Modules linked in: bluetooth rfkill btrfs zlib_deflate libcrc32c ufs hfsplus hfs minix vfat msdos fat jfs xfs reiserfs usb_storage nfs fscache fuse ppdev parport_pc lp parport ipt_LOG ipt_MASQUERADE xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bridge stp llc it87 hwmon_vid xts gf128mul sha256_generic dm_crypt snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel nvidia(P) snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore r8169 snd_page_alloc sp5100_tco k10temp i2c_piix4 i2c_core edac_core edac_mce_amd mii microcode vhost_net macvtap macvlan tun virtio_net kvm_amd kvm nfsd lockd nfs_acl auth_rpcgss sunrpc binfmt_misc uinput pata_acpi ata_generic pata_atiixp wmi [last unloaded: scsi_wait_scan]
[367414.346024] Pid: 28328, comm: upgen Tainted: P           O 3.2.6-3.fc16.x86_64 #1
[367414.346026] Call Trace:
[367414.346031]  [<ffffffff8106dd4f>] warn_slowpath_common+0x7f/0xc0
[367414.346034]  [<ffffffff8106ddaa>] warn_slowpath_null+0x1a/0x20
[367414.346036]  [<ffffffff8110fc43>] task_ctx_sched_out+0x63/0x70
[367414.346039]  [<ffffffff81113ffa>] perf_event_comm+0x8a/0x330
[367414.346042]  [<ffffffff811888e2>] ? do_filp_open+0x42/0xa0
[367414.346045]  [<ffffffff8117fff0>] set_task_comm+0x60/0x70
[367414.346047]  [<ffffffff811dbe0f>] comm_write+0xdf/0xf0
[367414.346050]  [<ffffffff81178fa3>] vfs_write+0xb3/0x180
[367414.346052]  [<ffffffff811792ca>] sys_write+0x4a/0x90
[367414.346055]  [<ffffffff815e9982>] system_call_fastpath+0x16/0x1b
[367414.346057] ---[ end trace 33dece0994e4edf5 ]---
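When comparing this warning against the earlier oopses, it can help to reduce each trace to its list of symbols. The following is a rough sketch only; `extract_call_trace` is a hypothetical helper, and the regular expression assumes the `[<address>] symbol+0xoff/0xlen` layout seen in the trace above, where a leading `?` marks an entry the unwinder was not sure about.

```python
import re

# Matches entries such as "[<ffffffff8106dd4f>] warn_slowpath_common+0x7f/0xc0"
# and "[<ffffffff811888e2>] ? do_filp_open+0x42/0xa0".
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def extract_call_trace(text):
    """Return a list of (symbol, reliable) tuples in trace order.

    'reliable' is False for frames the kernel printed with a '?' prefix.
    """
    frames = []
    for match in FRAME_RE.finditer(text):
        unreliable_marker, symbol = match.groups()
        frames.append((symbol, unreliable_marker is None))
    return frames


if __name__ == "__main__":
    sample = """\
[367414.346031]  [<ffffffff8106dd4f>] warn_slowpath_common+0x7f/0xc0
[367414.346042]  [<ffffffff811888e2>] ? do_filp_open+0x42/0xa0
[367414.346045]  [<ffffffff8117fff0>] set_task_comm+0x60/0x70
"""
    for symbol, reliable in extract_call_trace(sample):
        print(symbol if reliable else symbol + " (unreliable)")
```

Two traces that reduce to the same symbol list are likely the same failure path; differing lists suggest separate problems.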

Comment 38 Dave Jones 2012-03-22 16:41:27 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 39 Dave Jones 2012-03-22 16:46:12 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 40 Dave Jones 2012-03-22 16:55:28 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 41 Gilboa Davara 2012-03-26 15:32:01 UTC

I'll give it a test.

- Gilboa

Comment 42 Gilboa Davara 2012-04-02 10:17:43 UTC

We managed to reproduce the gdb kernel crash a couple of times on a 3.3.0 kernel running on a dual-core Atom 330 netbook w/ the nouveau driver.
No crash dump. (We'll try to reproduce this crash on a VM w/ a serial console.)

- Gilboa

Comment 43 Josh Boyer 2012-09-05 14:04:37 UTC

We've dropped utrace in the 3.4.2 and newer kernels.  Are you still seeing this with the latest F16 kernel update?

Comment 44 Gilboa Davara 2012-09-06 04:41:49 UTC

Hi,

We moved to F17 across the board, and thus far, I'm happy to say, we haven't managed to reproduce this OOPs.
Bug can be safely closed.

Thanks!
- Gilboa

Comment 45 Josh Boyer 2012-09-06 12:17:19 UTC

OK, thank you for letting us know.