Bug 536022 (RHQ-416)

Summary: Native system fails to gather cpu metrics for dual core opterons
Product: [Other] RHQ Project
Reporter: Heiko W. Rupp <hrupp>
Component: Plugins
Assignee: Ian Springer <ian.springer>
Status: CLOSED NEXTRELEASE
QA Contact: Jeff Weiss <jweiss>
Severity: medium
Priority: high
Version: 0.1
CC: ccrouch, dajohnso
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
URL: http://jira.rhq-project.org/browse/RHQ-416
Fixed In Version: 1.1
Doc Type: Bug Fix
Environment: Suse Linux 2.6.5-7 AMD Dual-Core Opterons

Description Heiko W. Rupp 2008-05-02 08:05:00 UTC
2008-04-24 10:16:25,648 WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data from: org.rhq.plugins.platform.LinuxPlatformComponent@d006a7
org.rhq.core.system.SystemInfoException: Cannot refresh the native CPU information
       at org.rhq.core.system.CpuInformation.refresh(CpuInformation.java:67)
       at org.rhq.core.system.CpuInformation.<init>(CpuInformation.java:37)
       at org.rhq.core.system.NativeSystemInfo.getCpu(NativeSystemInfo.java:275)
       at org.rhq.plugins.platform.PlatformComponent.getValues(PlatformComponent.java:163)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
       at org.rhq.core.system.CpuInformation.refresh(CpuInformation.java:64)




Comment 1 Charles Crouch 2008-05-07 21:42:34 UTC
From case 176801

Comment 2 Jaroslaw Kijanowski 2008-05-12 06:56:14 UTC
Couldn't reproduce this on RHEL4U6 on Dual-Core AMD Opteron(tm) Processor 2220

Comment 3 Heiko W. Rupp 2008-05-13 17:48:35 UTC
Cpu info:

Here is the output:
------------------------------------------------------
sigar> cpuinfo
2 total CPUs..
Vendor........AMD
Model.........Dual-Core AMD Opteron(tm) Processor 2218
Mhz...........2600
Cache size....1024




Comment 4 Charles Crouch 2008-05-13 20:40:24 UTC
rest of the cpuinfo and some sysinfo

CPU 0.........
User Time.....0.0%
Sys Time......0.0%
Idle Time.....100.0%
Wait Time.....0.0%
Nice Time.....0.0%
Combined......0.0%

CPU 1.........
User Time.....0.0%
Sys Time......0.0%
Idle Time.....100.0%
Wait Time.....0.0%
Nice Time.....0.0%
Combined......0.0%

Totals........
User Time.....2.4%
Sys Time......0.0%
Idle Time.....97.5%
Wait Time.....0.0%
Nice Time.....0.0%
Combined......2.4%


sigar> sysinfo
Sigar version.......java=1.5.0.3, native=1.5.0.1
Build date..........java=04/16/2008 09:44 AM, native=01/31/2008 10:22 PM
Archlib.............libsigar-x86-linux.so
...
Current user........jboss

OS description......SuSE 9
OS name.............Linux
OS arch.............x86_64
OS machine..........x86_64
OS version..........2.6.5-7.282-smp
OS patch level......unknown
OS vendor...........SuSE
OS vendor version...9
OS code name........
OS data model.......32
OS cpu endian.......little
Java vm version.....1.5.0_10-b03
Java vm vendor......Sun Microsystems Inc.
Java home.........../opt/java/jdk1.5.0_10/jre
 9:22 AM  up 83 days, 20:13, load average: 0.36, 0.25, 0.20


Comment 5 Charles Crouch 2008-05-13 20:43:51 UTC
So it looks like Sigar is returning the right information (two sets of CPU info, although the Totals block does not look right), but our native integration is not handling it correctly. As a workaround, can they turn off CPU metrics on the platform?

Comment 6 Heiko W. Rupp 2008-05-14 10:35:51 UTC
What Sigar reports on its command line is one thing; what its methods return is another.

The code in question is:

            Cpu[] cpuList = sigar.getCpuList();
            CpuInfo[] cpuInfoList = sigar.getCpuInfoList();
            CpuPerc[] cpuPercentageList = sigar.getCpuPercList();

            cpu = (cpuList != null) ? cpuList[this.cpuIndex] : null;
            cpuInfo = (cpuInfoList != null) ? cpuInfoList[this.cpuIndex] : null;   // <<<--- HERE

According to the stack trace it is working for the cpuList[] array, but failing for the cpuInfoList[] array,
which means that the array returned by Sigar is too short in this case.

Looking at the Sigar forum and Jira, there are a few issues about the wrong number of CPUs being detected on AMD CPUs, although I did not find the model in question.

To quote Doug: "The sigar cpu_list functions are actually supposed to return the physical number of CPUs (sockets) and in the case of multi-core each socket would have the aggregate times. There's some issues with newer models we're in the process of getting fixed in the 1.5.1 version of Sigar. "

See e.g. 
http://jira.hyperic.com/browse/HHQ-929
http://jira.hyperic.com/browse/HHQ-945 (just as ref, fixed in sigar 1.5)
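
The mismatch described in this comment can be reproduced without Sigar. The following is a minimal sketch, assuming a host where the per-core list has two entries and the per-socket list has only one; the class name, method names, and array contents are hypothetical stand-ins, not Sigar types or RHQ code.

```java
// Hypothetical stand-in for the unguarded lookup; plain String arrays replace Sigar's Cpu[] and CpuInfo[].
class CpuIndexMismatchDemo {

    // Mirrors the unguarded pattern quoted above: a null check, but no length check.
    static String unguardedLookup(String[] list, int cpuIndex) {
        return (list != null) ? list[cpuIndex] : null;
    }

    // Returns true if looking up cpuIndex in the list throws ArrayIndexOutOfBoundsException.
    static boolean lookupFails(String[] list, int cpuIndex) {
        try {
            unguardedLookup(list, cpuIndex);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String[] perCore = {"core0", "core1"}; // assumed: one entry per core
        String[] perSocket = {"socket0"};      // assumed: one entry per socket
        int cpuIndex = 1;                      // index of the second core

        System.out.println("per-core lookup fails:   " + lookupFails(perCore, cpuIndex));
        System.out.println("per-socket lookup fails: " + lookupFails(perSocket, cpuIndex));
    }
}
```

The null check alone protects against a missing list, but not against a list that is shorter than the core index, which is the case the stack trace points at.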


Comment 7 Heiko W. Rupp 2008-05-15 10:02:45 UTC
I am proposing the following patch to CpuInformation.java to avoid the ArrayIndexOutOfBoundsException (AIOOBE):

Index: .
===================================================================
--- .	(revision 826)
+++ .	(working copy)
@@ -60,9 +60,10 @@
             CpuInfo[] cpuInfoList = sigar.getCpuInfoList();
             CpuPerc[] cpuPercentageList = sigar.getCpuPercList();
 
-            cpu = (cpuList != null) ? cpuList[this.cpuIndex] : null;
-            cpuInfo = (cpuInfoList != null) ? cpuInfoList[this.cpuIndex] : null;
-            cpuPercentage = (cpuPercentageList != null) ? cpuPercentageList[this.cpuIndex] : null;
+            cpu = (cpuList != null && cpuIndex < cpuList.length) ? cpuList[this.cpuIndex] : null;
+            cpuInfo = (cpuInfoList != null && cpuIndex < cpuInfoList.length) ? cpuInfoList[this.cpuIndex] : null;
+            cpuPercentage = (cpuPercentageList != null && cpuIndex < cpuPercentageList.length) ? cpuPercentageList[this.cpuIndex]
+                : null;
         } catch (Exception e) {
             throw new SystemInfoException("Cannot refresh the native CPU information", e);
         } finally {


We should still investigate with the Sigar developers what is going on, but this way the user simply gets no information instead of errors.
Should we log this issue nevertheless?
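
The patch repeats the same null-and-length guard three times. As a rough sketch only, the guard could be factored into one generic helper; `BoundedLookup` and `elementOrNull` are hypothetical names that do not exist in RHQ or Sigar, and the extra negative-index check is an assumption added here for completeness.

```java
// Hypothetical helper, not part of RHQ: one place for the guard the patch applies to each list.
class BoundedLookup {

    // Returns list[index] when the array is non-null and the index is in range; otherwise null.
    static <T> T elementOrNull(T[] list, int index) {
        return (list != null && index >= 0 && index < list.length) ? list[index] : null;
    }

    public static void main(String[] args) {
        String[] perSocket = {"socket0"}; // assumed: a list shorter than the core count
        // Index 1 is out of range for a one-element array, so the helper yields null instead of throwing.
        System.out.println(elementOrNull(perSocket, 1));
    }
}
```

With such a helper, each of the three assignments in the patch would reduce to a single call, for example `cpuInfo = elementOrNull(cpuInfoList, this.cpuIndex)`.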

Comment 8 Heiko W. Rupp 2008-05-18 17:56:27 UTC
Sigar forum post http://forums.hyperic.com/jiveforums/thread.jspa?threadID=5272&tstart=0

Comment 9 Heiko W. Rupp 2008-05-27 15:35:08 UTC
The "master" is with ips too ...

Comment 10 Charles Crouch 2008-05-29 16:30:43 UTC
The forum thread needs updating with the latest info from the case

Comment 11 Greg Hinkle 2008-06-05 15:32:22 UTC
Ian, have you committed the workaround? My main concern is that this should not blow up, even if it doesn't accurately collect the data in certain circumstances until Sigar is updated.

Comment 12 Ian Springer 2008-06-05 23:52:51 UTC
Fixed - r941.

The new code is modeled after how SIGAR's cpuinfo command gathers the same information.

Here's the updated code in NativeSystemInfo:

    public int getNumberOfCpus() {
        Sigar sigar = new Sigar();

        try {
            // NOTE: This will return the number of cores, not the number of sockets.
            return sigar.getCpuPercList().length;
        } catch (Exception e) {
            throw new UnsupportedOperationException("Cannot get number of CPUs from native layer", e);
        } finally {
            sigar.close();
        }
    }

and in CpuInformation:

    public void refresh() {
        Sigar sigar = new Sigar();
        try {
            // This is supposed to return one CpuInfo per *socket*, but on some platforms, it will return one per *core*.
            // In either case, all CpuInfo's in the list should be identical.
            // NOTE: The results of getCpuInfoList() should be more consistent in SIGAR 1.5.1 and later
            //       (see http://jira.hyperic.com/browse/SIGAR-71).
            CpuInfo[] cpuInfoList = sigar.getCpuInfoList();
            if (cpuInfoList != null && cpuInfoList.length >= 1) {
                // Since all CpuInfo's in the list should be identical, we can always just grab the first one in the list.
                // We do *not* want to use this.cpuIndex as the index, because that is the *core* index, and this list
                // may be a list of *sockets*.
                this.cpuInfo = cpuInfoList[0];
            }
            else {
                log.error("Sigar.getCpuInfoList() returned null or empty array: "
                        + ((cpuInfoList != null) ? Arrays.asList(cpuInfoList) : cpuInfoList));
                this.cpuInfo = null;
            }

            // This should return one Cpu per *core*.
            Cpu[] cpuList = sigar.getCpuList();
            if (cpuList != null && this.cpuIndex < cpuList.length) {
                this.cpu = cpuList[this.cpuIndex];
            }
            else {
                log.error("Sigar.getCpuList() returned null or array with size smaller than or equal to this CPU's index ("
                        + this.cpuIndex + "): " + ((cpuList != null) ? Arrays.asList(cpuList) : cpuList));
                this.cpu = null;
            }

            // This should return one CpuPerc per *core*.
            CpuPerc[] cpuPercentageList = sigar.getCpuPercList();
            if (cpuPercentageList != null && this.cpuIndex < cpuPercentageList.length) {
                this.cpuPercentage = cpuPercentageList[this.cpuIndex];
            }
            else {
                log.error("Sigar.getCpuPercList() returned null or array with size smaller than or equal to this CPU's index ("
                        + this.cpuIndex + "): " + ((cpuPercentageList != null) ? Arrays.asList(cpuPercentageList) : cpuPercentageList));
                this.cpuPercentage = null;
            }            
        } catch (Exception e) {
            throw new SystemInfoException("Cannot refresh the native CPU information", e);
        } finally {
            sigar.close();
        }
    }

*** NOTE *** This fix should be tested on all of our supported platforms.
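
When testing across platforms, it may help to record which shape of lists each host returns. The sketch below is one possible way to classify that, under the assumptions stated in the code comments above (Cpu and CpuPerc lists are per core, CpuInfo list is per socket or per core). `CpuListShapeCheck`, `classify`, and the label strings are hypothetical and not part of RHQ or Sigar; the lengths are passed as plain ints so the sketch runs without the native library.

```java
// Hypothetical helper for platform testing; takes the lengths of the three Sigar lists as plain ints.
class CpuListShapeCheck {

    // Classifies the combination of list lengths seen on a host.
    static String classify(int cpuListLength, int cpuInfoListLength, int cpuPercListLength) {
        if (cpuListLength <= 0 || cpuInfoListLength <= 0 || cpuPercListLength <= 0) {
            return "EMPTY_LIST";         // at least one list is empty
        }
        if (cpuListLength != cpuPercListLength) {
            return "PER_CORE_MISMATCH";  // the two per-core lists disagree with each other
        }
        if (cpuInfoListLength == cpuListLength) {
            return "INFO_PER_CORE";      // CpuInfo list is as long as the per-core lists
        }
        if (cpuInfoListLength < cpuListLength) {
            return "INFO_PER_SOCKET";    // fewer CpuInfo entries than cores, e.g. one per socket
        }
        return "INFO_LONGER_THAN_CORES"; // unexpected: more CpuInfo entries than cores
    }

    public static void main(String[] args) {
        System.out.println(classify(2, 1, 2)); // e.g. two cores, one CpuInfo entry
        System.out.println(classify(2, 2, 2)); // e.g. two cores, two CpuInfo entries
    }
}
```

On a real host the three arguments would be `sigar.getCpuList().length`, `sigar.getCpuInfoList().length`, and `sigar.getCpuPercList().length`. Note that a single-core host cannot distinguish the per-socket and per-core cases, since both give a CpuInfo list of length one.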


Comment 13 Corey Welton 2008-06-11 17:45:07 UTC
QA -> witte

Comment 14 Jeff Weiss 2008-06-27 20:38:36 UTC
I'm going to close this; we don't have access to hardware to test it. Untested.

Comment 15 Charles Crouch 2008-06-27 20:48:50 UTC
from ips: 
" *** NOTE *** This fix should be tested on all of our supported platforms."

We should test on all platforms not just the one where the problem was first seen.

Comment 16 Ian Springer 2008-07-01 21:14:13 UTC
Moving back to Ready for QA...


Comment 17 Joseph Marques 2008-07-02 11:12:52 UTC
Ian makes a good point about testing. We should at least know all the places where this occurs, even if we don't plan on fixing them immediately thereafter.

Comment 18 Jeff Weiss 2008-10-07 19:37:32 UTC
We haven't seen any problems related to this in a whole release cycle; closing.

Comment 19 Red Hat Bugzilla 2009-11-10 21:08:51 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-416
This bug is related to RHQ-529