Bug 483957

Summary: [FOCUS] [DESTINY] oprofile does not work on nehalem chips
Product: Red Hat Enterprise MRG
Reporter: IBM Bug Proxy <bugproxy>
Component: realtime-kernel
Assignee: Red Hat Real Time Maintenance <rt-maint>
Status: CLOSED NOTABUG
QA Contact: David Sommerseth <davids>
Severity: high
Priority: low
Version: 1.1
CC: bhu, ovasik, williams
Hardware: x86_64
OS: All
Doc Type: Bug Fix
Last Closed: 2011-09-12 19:21:34 UTC
Attachments:
oprofile: Don't report nehalam as core_2 (flags: none)
oprofile: Implement Intel architectural perfmon support (flags: none)
add Nehalam to list of ppro cores (flags: none)
Console screen capture when the system 'hangs'. (flags: none)

Description IBM Bug Proxy 2009-02-04 11:40:33 UTC
=Comment: #0=================================================
Sripathi Kodi <sripathik.com> - 

Hardware: Victory (elm3a112). 2 sockets.
CPUs: Intel Nehalem (Quad core, 2 threads in each core)
From /proc/cpuinfo:
processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Genuine Intel(R) CPU           @ 0000 @ 2.67GHz

Kernel: 2.6.24.7-100.el5rt
opcontrol: oprofile 0.9.3 compiled on Sep  3 2008 00:41:52 (the version that came with RHEL 5.2)

Starting oprofile goes fine without any errors. When I try to get the report I see this error:

[root@elm3a112 ~]# opreport
opreport error: No sample file found: try running opcontrol --dump
or specify a session containing sample files

I then booted with nmi_watchdog=1. This lets oprofile work, but IIRC this mode uses the timer
interrupt for sampling, so its accuracy is low.

I suspect the problem is that the kernel and the oprofile binary don't support this new CPU type.
We have seen such bugs in the past (LTC bug 43249, RH 438342).

I searched Linus' git tree and found some patches pertaining to this. Some of these may have to be
backported to make oprofile work.
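
For reference, a minimal sketch of how the family/model pair quoted from /proc/cpuinfo above could be checked from a script. The function names and the NEHALEM_IDS table are illustrative assumptions and are not part of oprofile; the table holds only the family 6, model 26 pair reported on this machine.

```python
# Family/model pair taken from the /proc/cpuinfo excerpt in this report.
# Assumption: only this one pair is listed; other Nehalem models are not covered.
NEHALEM_IDS = {(6, 26)}


def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per logical processor."""
    cpus = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current processor block.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def is_nehalem(cpu):
    """Return True if the processor dict matches a listed family/model pair."""
    try:
        ident = (int(cpu["cpu family"]), int(cpu["model"]))
    except (KeyError, ValueError):
        return False
    return cpu.get("vendor_id") == "GenuineIntel" and ident in NEHALEM_IDS
```

Typical use would be `any(is_nehalem(c) for c in parse_cpuinfo(open("/proc/cpuinfo").read()))`.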
=Comment: #1=================================================
Sripathi Kodi <sripathik.com> - 
I tried vanilla kernel 2.6.28.2 and found that the problem is fixed there. We also need to use the
latest oprofile user-space tools from CVS. We need to backport the fixes to the MRG kernel.
=Comment: #2=================================================
Dinakar Guniguntala <dino.com> - 
These are Destiny systems (x3550 M2) and not Victory (x3650 M2)
=Comment: #3=================================================
Sripathi Kodi <sripathik.com> - 
Kiran has been trying to backport some patches from Linus' tree to get oprofile working on the MRG
kernel. At this stage we have something working, but starting and stopping oprofile a few times
leads to a system crash. The patches that she is trying are:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;
=4b9f12a3779c548b68bc9af7d94030868ad3aa1b

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f645f6406463a01869c50844befc76d528971690

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b99170288421c79f0c2efa8b33e26e65f4bb7fb8
=Comment: #4=================================================
Sripathi Kodi <sripathik.com> - 
(In reply to comment #3)
> Kiran has been trying to backport some patches from Linus' tree to get oprofile
> working on the MRG kernel. At this stage we have something working, but starting
> and stopping oprofile a few times leads to a system crash. The patches that she
> is trying are:
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;
> =4b9f12a3779c548b68bc9af7d94030868ad3aa1b

This got mangled. It should be

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4b9f12a3779c548b68bc9af7d94030868ad3aa1b



=Comment: #7=================================================
Kiran Prakash1 <kirpraka.com> - 

add Nehalam to list of ppro cores


=Comment: #8=================================================
Kiran Prakash1 <kirpraka.com> - 

oprofile: Don't report nehalam as core_2


=Comment: #9=================================================
Kiran Prakash1 <kirpraka.com> - 

oprofile: Implement Intel architectural perfmon support


=Comment: #12=================================================
Kiran Prakash1 <kirpraka.com> - 
These 3 patches from Linus' tree fix the oprofile problem, but the system crashes when trying to
stop and restart oprofile.

Comment 1 IBM Bug Proxy 2009-02-04 11:40:38 UTC
Created attachment 330852 [details]
oprofile: Don't report nehalam as core_2

Comment 2 IBM Bug Proxy 2009-02-04 11:40:41 UTC
Created attachment 330853 [details]
oprofile: Implement Intel architectural perfmon support

Comment 3 IBM Bug Proxy 2009-02-04 11:40:44 UTC
Created attachment 330854 [details]
add Nehalam to list of ppro cores

Comment 4 IBM Bug Proxy 2009-02-04 13:01:04 UTC
Note to RH: The three patches attached to this bug are ported from Linus' tree to make oprofile work on MRG on Nehalem chips. We have been able to run oprofile with these patches (using the CVS version of the oprofile user-space tools). However, when we stop and restart oprofile, the system seems to crash or hang. We will try to get more information about this problem.

So the patches above are required, but we probably need some more fixes to have a complete working oprofile on MRG on Nehalem.
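
A rough sketch of a loop that repeats the start/stop cycle described above, which may help when trying to recreate the hang. The helper name cycle_oprofile, its parameters, and the choice of `opcontrol --stop` as the stop command are assumptions; the comments in this bug do not say which exact commands were used. It needs root and should only be run on a test machine, since a hang like the one reported here would leave the loop (and the system) stuck.

```python
import subprocess


def cycle_oprofile(iterations, stop_flag="--stop", run=subprocess.call):
    """Start and stop oprofile repeatedly.

    Returns the number of cycles in which both commands exited with status 0.
    `run` is injectable so the loop can be exercised without opcontrol installed.
    """
    completed = 0
    for _ in range(iterations):
        if run(["opcontrol", "--start"]) != 0:
            break
        if run(["opcontrol", stop_flag]) != 0:
            break
        completed += 1
    return completed
```

For example, `cycle_oprofile(10)` attempts ten cycles and stops early if either command fails.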

Comment 5 IBM Bug Proxy 2009-02-23 06:31:18 UTC
I'm trying this on elm3a112 to find the problem with Kiran's kernel. The Java console was hardly useful. After a lot of juggling with the SOL settings in the BIOS, I got it working. Hopefully I will get much better information now.

Comment 6 IBM Bug Proxy 2009-02-24 08:21:15 UTC
I recreated the problem with oprofile today and tried to grab more information from the panic that happens. However, so far I have not been able to get anything on the SOL. The panic text scrolls by on the Java console and there is no way to scroll back and see what the beginning of oops/panic was. I have kdump enabled, but it doesn't trigger. I have verified that kdump works fine on this hardware when I do "echo c >/proc/sysrq-trigger".

Comment 7 IBM Bug Proxy 2009-02-24 09:41:13 UTC
Okay, I gathered a bit more information about the problem using whatever little text scrolls by on the Java console. I will attach the backtrace screenshot to this bug.

The machine probably goes into a deadlock. The backtrace shows nmi_cpu_setup, trying to hold a spin lock. From the code it is possibly here:

static void nmi_cpu_setup(void *dummy)
{
	int cpu = smp_processor_id();
	struct op_msrs *msrs = &cpu_msrs[cpu];
	spin_lock(&oprofilefs_lock);  <=========
	model->setup_ctrs(msrs);
	spin_unlock(&oprofilefs_lock);

However, the backtrace shows a branch from nmi_cpu_setup+0x0/0x66, which doesn't make sense to me. Also, I can't see the top of the backtrace, so I can't be sure about what the problem is.

Comment 8 IBM Bug Proxy 2009-02-24 09:41:18 UTC
Created attachment 333010 [details]
Console screen capture when the system 'hangs'.

Comment 9 IBM Bug Proxy 2009-02-24 12:11:01 UTC
As an alternative approach, I wanted to see if oprofile works as-is on 2.6.29-rc4-rt2-tip. However, with this kernel I saw the original problem reported in this bug. Also, trying to stop and restart oprofile resulted in a system hang. Hence it looks like the problem is much more deep-rooted. I'll see how this behaves on non-RT kernels on this hardware.

Comment 10 IBM Bug Proxy 2009-02-25 17:11:43 UTC
With the 2.6.29-rc6-tip kernel (non-RT), oprofile worked just fine on this machine. However, with the 2.6.29-rc4-rt2-tip kernel it fails as I reported in my previous comment. Hence the problem seems to be RT-specific.

Comment 11 IBM Bug Proxy 2009-06-30 07:01:24 UTC
------- Comment From kirpraka.com 2009-06-30 02:54 EDT-------
I get the following error when I run oprofile on 2.6.29.4-23.el5rt MRG kernel

[root@elm3a112 kiran]# opcontrol --start
cpu_type 'unset' is not valid
you should upgrade oprofile or force the use of timer mode
cpu_type 'unset' is not valid
you should upgrade oprofile or force the use of timer mode

The oprofile version I'm using is 0.9.3.

I'll try upgrading oprofile to 0.9.4 and check if it works.
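
To see what the kernel side reports before opcontrol rejects it, the cpu_type file in oprofilefs can be read directly. A minimal sketch, assuming oprofilefs is mounted at /dev/oprofile (for example after `opcontrol --init`); the helper name and the returned labels are made up for illustration and are not oprofile terminology.

```python
def describe_cpu_type(path="/dev/oprofile/cpu_type"):
    """Return (raw_value, mode) for the kernel-reported oprofile CPU type.

    mode is "unavailable" if the file cannot be read, "timer" if the kernel
    reports timer-based sampling, and "hardware" for any other value
    (assumed here to be a performance-counter model string).
    """
    try:
        with open(path) as f:
            raw = f.read().strip()
    except OSError:
        return None, "unavailable"
    if raw == "timer":
        return raw, "timer"
    return raw, "hardware"
```

If the value read here is one that the installed user-space tools do not know, that would be consistent with the "upgrade oprofile" hint in the error message above.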

Comment 12 IBM Bug Proxy 2009-07-01 11:41:26 UTC
------- Comment From kirpraka.com 2009-07-01 07:39 EDT-------
Tried with the CVS version of oprofile 0.9.5. The problem no longer occurs.
It worked fine on the 2.6.29-based MRG kernel 2.6.29.4-23.el5rt and on the mainline RT kernel 2.6.29.5-rt22.

Comment 13 IBM Bug Proxy 2009-07-01 18:50:56 UTC
------- Comment From dvhltc.com 2009-07-01 14:49 EDT-------
Let's discuss at our next MRG Interlock (Jul 13).  I plan to look for every rt bug with a RH bug number in the title on that call.  However, it wouldn't hurt to send a pointer to this bug to the rhel-rt-ibm mailing list, Cc'ing Clark Williams, and asking how they would like to proceed.

Comment 14 IBM Bug Proxy 2009-10-23 06:21:06 UTC
------- Comment From kirpraka.com 2009-10-23 02:13 EDT-------
I tested oprofile on MRG 1.2 based on RHEL 5.4 and it works fine without any errors.

Comment 15 IBM Bug Proxy 2009-10-26 15:21:06 UTC
------- Comment From sripathik.com 2009-10-26 11:10 EDT-------
Okay, so we can close this bug.