786984 – Use of kernel perf support by PAPI causes crash

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	15
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-02-02 20:19 UTC by William Cohen
Modified:	2012-02-06 17:33 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-02-06 17:33:04 UTC
Type:	---
Embargoed:
Dependent Products:

Description William Cohen 2012-02-02 20:19:09 UTC

Description of problem:

When running the PAPI tests ("make fulltest") on a Fedora 15 x86-64 box, the tests crash the machine, which then has to be rebooted.


Version-Release number of selected component (if applicable):
kernel-2.6.41.10-3.fc15.x86_64
papi-4.1.3-2.fc15.x86_64

How reproducible:

All the time


Steps to Reproduce:
1. yum install "papi*"
2. yumdownloader --source papi; rpm -Uvh papi*src.rpm; cd ~/rpmbuild/SPECS; rpmbuild -ba papi.spec
3. cd ~/rpmbuild/BUILD/papi-4.1.3/src
4. while true; do make fulltest; done
5. (tests will run a while, but then the machine will crash)
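
Because the machine goes down hard, it can help to know how many iterations completed before the crash. The following is a minimal sketch of the loop in step 4, not part of the original report: it writes the iteration number to a log file and forces it to disk before each run. The helper name `run_until_failure`, the `max_runs` parameter, and the log file name are hypothetical.

```python
import os
import subprocess


def run_until_failure(cmd, log_path, cwd=None, max_runs=None):
    """Run cmd repeatedly until it fails or max_runs is reached.

    The iteration number is written and fsync'ed before each run, so the
    log still shows which run was in progress if the machine crashes.
    Returns the number of runs that completed successfully.
    """
    completed = 0
    with open(log_path, "a") as log:
        while max_runs is None or completed < max_runs:
            log.write(f"starting run {completed + 1}\n")
            log.flush()
            os.fsync(log.fileno())  # push the record to disk before the test starts
            result = subprocess.run(cmd, cwd=cwd)
            if result.returncode != 0:
                break
            completed += 1
    return completed


# Hypothetical usage from the PAPI source directory named in step 3:
# run_until_failure(["make", "fulltest"], "papi-runs.log")
```

After the reboot, the last "starting run" line in the log identifies the iteration that was in progress when the machine crashed.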
  
Actual results:

[  404.654873] ------------[ cut here ]------------                             
[  404.655804] WARNING: at arch/x86/kernel/cpu/perf_event.c:1255 x86_pmu_stop+0)
[  404.655804] Hardware name: MCP55                                             
[  404.655804] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl tun ebt]
[  404.655804] Pid: 2790, comm: multiplex2 Not tainted 2.6.41.10-3.fc15.x86_64 1
[  404.655804] Call Trace:                                                      
[  404.655804]  [<ffffffff8106b7bf>] warn_slowpath_common+0x7f/0xc0             
[  404.655804]  [<ffffffff8106b81a>] warn_slowpath_null+0x1a/0x20               
[  404.655804]  [<ffffffff81024075>] x86_pmu_stop+0xc5/0xe0                     
[  404.655804]  [<ffffffff81026eb5>] x86_pmu_enable+0x95/0x270                  
[  404.655804]  [<ffffffff8110cea6>] __perf_install_in_context+0x166/0x1b0      
[  404.655804]  [<ffffffff81109420>] ? perf_adjust_period+0x1c0/0x1c0           
[  404.655804]  [<ffffffff81109468>] remote_function+0x48/0x60                  
[  404.655804]  [<ffffffff810a5077>] smp_call_function_single+0x147/0x160       
[  404.801032]  [<ffffffff8118ff82>] ? mnt_clone_write+0x12/0x30                
[  404.801032]  [<ffffffff81108274>] task_function_call+0x44/0x50               
[  404.801032]  [<ffffffff8110cd40>] ? perf_event_sched_in+0xa0/0xa0            
[  404.801032]  [<ffffffff8110addd>] perf_install_in_context+0x5d/0xa0          
[  404.801032]  [<ffffffff81110715>] sys_perf_event_open+0x695/0x950            
[  404.801032]  [<ffffffff815b7f82>] system_call_fastpath+0x16/0x1b             
[  404.801032] ---[ end trace ce673c479678ad8b ]---       
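
To compare this trace against the one in the LKML thread, it may be convenient to reduce it to a list of symbols. The sketch below is an illustration only: the function name `parse_call_trace` and the regular expression are assumptions, and the sample lines are copied from the trace above. Frames marked with `?` are addresses the kernel found on the stack but could not confirm as part of the call chain, so they are skipped by default.

```python
import re

# Matches frames such as "[<ffffffff81024075>] x86_pmu_stop+0xc5/0xe0",
# with an optional "? " marker in front of the symbol.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

SAMPLE = """\
[  404.655804]  [<ffffffff81024075>] x86_pmu_stop+0xc5/0xe0
[  404.655804]  [<ffffffff81026eb5>] x86_pmu_enable+0x95/0x270
[  404.655804]  [<ffffffff8110cea6>] __perf_install_in_context+0x166/0x1b0
[  404.655804]  [<ffffffff81109420>] ? perf_adjust_period+0x1c0/0x1c0
"""


def parse_call_trace(log_text, include_unreliable=False):
    """Return (symbol, offset) pairs for each frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a stack frame line
        if match.group("unreliable") and not include_unreliable:
            continue  # skip frames flagged with "?"
        frames.append((match.group("symbol"), int(match.group("offset"), 16)))
    return frames


for symbol, offset in parse_call_trace(SAMPLE):
    print(f"{symbol}+{offset:#x}")
```

Applied to the full trace, the reliable frames run from `system_call_fastpath` and `sys_perf_event_open` at the bottom up to `x86_pmu_stop`, where the warning fires.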
  
Expected results:

The tests run to completion without crashing the machine.


Additional info:

Thread by Vince Weaver describing the same problem:

https://lkml.org/lkml/2011/12/16/463

Comment 1 William Cohen 2012-02-02 20:30:20 UTC

This occurs on an AMD Family 10h machine (not sure whether the bug is processor-specific). It is a dual-socket, 8-core machine. Below is the processor 0 information.

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 2
model name	: Quad-Core AMD Opteron(tm) Processor 2350
stepping	: 3
cpu MHz		: 1000.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips	: 3999.68
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Comment 3 Dave Jones 2012-02-06 14:00:30 UTC

Does it still happen with the 2.6.42 update?
(That kernel is based on upstream 3.2.)
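
The version relationship mentioned here can be sketched in code. This thread confirms only that 2.6.42 corresponds to upstream 3.2; the general rule used below (third component minus 40 gives the 3.x minor version, and any fourth component carries over as the stable release number) is an assumption about Fedora 15's numbering, and the function name `fedora15_to_upstream` is hypothetical.

```python
def fedora15_to_upstream(version):
    """Map a Fedora 15 style "2.6.4x[.y]" kernel version to "3.x[.y]".

    Assumption: 2.6.(40 + n) corresponds to upstream 3.n. Only the
    2.6.42 -> 3.2 case is confirmed in this bug report.
    """
    parts = version.split(".")
    if len(parts) < 3 or parts[:2] != ["2", "6"]:
        raise ValueError(f"not a 2.6.x version: {version!r}")
    minor = int(parts[2]) - 40
    if minor < 0:
        raise ValueError(f"not a Fedora 15 style 2.6.4x version: {version!r}")
    return ".".join(["3", str(minor)] + parts[3:])


print(fedora15_to_upstream("2.6.41.10"))  # kernel from the original report
print(fedora15_to_upstream("2.6.42.3"))   # a 2.6.42 update build
```

Under that assumption, the crashing kernel from the report would correspond to an upstream 3.1 stable release and the 2.6.42 update to a 3.2 stable release.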

Comment 5 William Cohen 2012-02-06 17:16:42 UTC

The kernel-2.6.42.3-1.fc15.x86_64.rpm does not suffer from this problem.  Everything seems to work fine with it.

Comment 6 Dave Jones 2012-02-06 17:33:04 UTC

excellent!