Bug 536022 (RHQ-416)
Summary: | Native system fails to gather cpu metrics for dual core opterons | ||
---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | Heiko W. Rupp <hrupp> |
Component: | Plugins | Assignee: | Ian Springer <ian.springer> |
Status: | CLOSED NEXTRELEASE | QA Contact: | Jeff Weiss <jweiss> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 0.1 | CC: | ccrouch, dajohnso |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | All | ||
URL: | http://jira.rhq-project.org/browse/RHQ-416 | ||
Whiteboard: | |||
Fixed In Version: | 1.1 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | Suse Linux 2.6.5-7 AMD Dual-Core Opterons |
Description
Heiko W. Rupp
2008-05-02 08:05:00 UTC
From case 176801

Couldn't reproduce this on RHEL4U6 on Dual-Core AMD Opteron(tm) Processor 2220

Cpu info:

Here is the output:
------------------------------------------------------
    sigar> cpuinfo
    2 total CPUs..
    Vendor........AMD
    Model.........Dual-Core AMD Opteron(tm) Processor 2218
    Mhz...........2600
    Cache size....1024

rest of the cpuinfo and some sysinfo

    CPU 0.........
    User Time.....0.0%
    Sys Time......0.0%
    Idle Time.....100.0%
    Wait Time.....0.0%
    Nice Time.....0.0%
    Combined......0.0%

    CPU 1.........
    User Time.....0.0%
    Sys Time......0.0%
    Idle Time.....100.0%
    Wait Time.....0.0%
    Nice Time.....0.0%
    Combined......0.0%

    Totals........
    User Time.....2.4%
    Sys Time......0.0%
    Idle Time.....97.5%
    Wait Time.....0.0%
    Nice Time.....0.0%
    Combined......2.4%

    sigar> sysinfo
    Sigar version.......java=1.5.0.3, native=1.5.0.1
    Build date..........java=04/16/2008 09:44 AM, native=01/31/2008 10:22 PM
    Archlib.............libsigar-x86-linux.so
    ...
    Current user........jboss
    OS description......SuSE 9
    OS name.............Linux
    OS arch.............x86_64
    OS machine..........x86_64
    OS version..........2.6.5-7.282-smp
    OS patch level......unknown
    OS vendor...........SuSE
    OS vendor version...9
    OS code name........
    OS data model.......32
    OS cpu endian.......little
    Java vm version.....1.5.0_10-b03
    Java vm vendor......Sun Microsystems Inc.
    Java home.........../opt/java/jdk1.5.0_10/jre
    9:22 AM up 83 days, 20:13, load average: 0.36, 0.25, 0.20

So it looks like Sigar is returning us the right information (two sets of CPU info, apart from the Totals one not looking right), but we're not doing the right thing with it in our native integration. As a workaround can they turn off CPU metrics on the platform?

What Sigar is reporting on its command line is one thing. What its methods are returning another. The code in question is:

    Cpu[] cpuList = sigar.getCpuList();
    CpuInfo[] cpuInfoList = sigar.getCpuInfoList();
    CpuPerc[] cpuPercentageList = sigar.getCpuPercList();

    cpu = (cpuList != null) ? cpuList[this.cpuIndex] : null;
    cpuInfo = (cpuInfoList != null) ? cpuInfoList[this.cpuIndex] : null; // <<<--- HERE

According to the stack trace it is working for the cpuList[] array, but failing for the cpuInfoList[] array, which means that the array returned by Sigar is too short in this case.

Looking at Sigar forum and Jira, there are a few issues about wrong number of cpus detected with amd cpus - even if I did not find the model in question.

To quote Doug: "The sigar cpu_list functions are actually supposed to return the physical number of CPUs (sockets) and in the case of multi-core each socket would have the aggregate times. There's some issues with newer models we're in the process of getting fixed in the 1.5.1 version of Sigar. "

See e.g.
http://jira.hyperic.com/browse/HHQ-929
http://jira.hyperic.com/browse/HHQ-945 (just as ref, fixed in sigar 1.5)

I am proposing the following patch to CpuInformation.java in order to catch the AIOOBE

    Index: .
    ===================================================================
    --- . (revision 826)
    +++ . (working copy)
    @@ -60,9 +60,10 @@
                 CpuInfo[] cpuInfoList = sigar.getCpuInfoList();
                 CpuPerc[] cpuPercentageList = sigar.getCpuPercList();

    -            cpu = (cpuList != null) ? cpuList[this.cpuIndex] : null;
    -            cpuInfo = (cpuInfoList != null) ? cpuInfoList[this.cpuIndex] : null;
    -            cpuPercentage = (cpuPercentageList != null) ? cpuPercentageList[this.cpuIndex] : null;
    +            cpu = (cpuList != null && cpuIndex < cpuList.length) ? cpuList[this.cpuIndex] : null;
    +            cpuInfo = (cpuInfoList != null && cpuIndex < cpuInfoList.length) ? cpuInfoList[this.cpuIndex] : null;
    +            cpuPercentage = (cpuPercentageList != null && cpuIndex < cpuPercentageList.length) ? cpuPercentageList[this.cpuIndex]
    +                : null;
             } catch (Exception e) {
                 throw new SystemInfoException("Cannot refresh the native CPU information", e);
             } finally {

We should still investigate with the Sigar guys what is going on, but this way, the user is just getting no information instead of errors.

Should we log this issue nevertheless?

The "master" is with ips too ...

The forum thread needs updating with the latest info from the case

Ian, have you committed the workaround? My main concern is that this should not blow up. Even if it doesn't accurately collect the data in certain circumstances until sigar is updated.

Fixed - r941. The new code is modeled after how SIGAR's cpuinfo command gathers the same information. Here's the updated code in NativeSystemInfo:

    public int getNumberOfCpus() {
        Sigar sigar = new Sigar();
        try {
            // NOTE: This will return the number of cores, not the number of sockets.
            return sigar.getCpuPercList().length;
        } catch (Exception e) {
            throw new UnsupportedOperationException("Cannot get number of CPUs from native layer", e);
        } finally {
            sigar.close();
        }
    }

and in CpuInformation:

    public void refresh() {
        Sigar sigar = new Sigar();
        try {
            // This is supposed to return one CpuInfo per *socket*, but on some platforms, it will return one per *core*.
            // In either case, all CpuInfo's in the list should be identical.
            // NOTE: The results of getCpuInfoList() should be more consistent in SIGAR 1.5.1 and later
            // (see http://jira.hyperic.com/browse/SIGAR-71).
            CpuInfo[] cpuInfoList = sigar.getCpuInfoList();
            if (cpuInfoList != null && cpuInfoList.length >= 1) {
                // Since all CpuInfo's in the list should be identical, we can always just grab the first one in the list.
                // We do *not* want to use this.cpuIndex as the index, because that is the *core* index, and this list
                // may be a list of *sockets*.
                this.cpuInfo = cpuInfoList[0];
            } else {
                log.error("Sigar.getCpuInfoList() returned null or empty array: "
                    + ((cpuInfoList != null) ? Arrays.asList(cpuInfoList) : cpuInfoList));
                this.cpuInfo = null;
            }

            // This should return one Cpu per *core*.
            Cpu[] cpuList = sigar.getCpuList();
            if (cpuList != null && this.cpuIndex < cpuList.length) {
                this.cpu = cpuList[this.cpuIndex];
            } else {
                log.error("Sigar.getCpuList() returned null or array with size smaller than or equal to this CPU's index ("
                    + this.cpuIndex + "): " + ((cpuList != null) ? Arrays.asList(cpuList) : cpuList));
                this.cpu = null;
            }

            // This should return one CpuPerc per *core*.
            CpuPerc[] cpuPercentageList = sigar.getCpuPercList();
            if (cpuPercentageList != null && this.cpuIndex < cpuPercentageList.length) {
                this.cpuPercentage = cpuPercentageList[this.cpuIndex];
            } else {
                log.error("Sigar.getCpuPercList() returned null or array with size smaller than or equal to this CPU's index ("
                    + this.cpuIndex + "): "
                    + ((cpuPercentageList != null) ? Arrays.asList(cpuPercentageList) : cpuPercentageList));
                this.cpuPercentage = null;
            }
        } catch (Exception e) {
            throw new SystemInfoException("Cannot refresh the native CPU information", e);
        } finally {
            sigar.close();
        }

        return;
    }

*** NOTE *** This fix should be tested on all of our supported platforms.

QA -> witte

I'm going to close this, we don't have access to hardware to test this. Untested.

from ips: " *** NOTE *** This fix should be tested on all of our supported platforms."

We should test on all platforms not just the one where the problem was first seen. Moving back to Ready for QA...

ian makes a good point about testing. we should at least know all the places where this occurs, even if we don't plan on fixing them immediately thereafter.

We haven't seen any problems related to this in a whole release cycle, closing.

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-416

This bug is related to RHQ-529
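
For anyone revisiting this report: the failure mode discussed in the thread (a per-core index applied to a list that may hold one entry per socket) can be shown without SIGAR on the classpath. The following is only a minimal sketch under that assumption. The class and method names (CpuIndexGuard, elementOrNull, firstOrNull) are hypothetical and exist in neither RHQ nor SIGAR, and plain String arrays stand in for the Cpu[] and CpuInfo[] arrays.

```java
// Hypothetical helper, not part of RHQ or SIGAR: illustrates the bounds-checked
// lookups discussed in this report, using String arrays as stand-ins.
final class CpuIndexGuard {
    private CpuIndexGuard() {
    }

    /** Returns array[index], or null if the array is null or the index is out of range. */
    static <T> T elementOrNull(T[] array, int index) {
        return (array != null && index >= 0 && index < array.length) ? array[index] : null;
    }

    /** Returns the first element, or null if the array is null or empty. */
    static <T> T firstOrNull(T[] array) {
        return elementOrNull(array, 0);
    }

    public static void main(String[] args) {
        String[] perCore = {"core0", "core1"}; // stand-in for a per-core list
        String[] perSocket = {"socket0"};      // stand-in for a per-socket list
        int cpuIndex = 1;                      // core index of the second core

        System.out.println(elementOrNull(perCore, cpuIndex));   // prints "core1"
        System.out.println(elementOrNull(perSocket, cpuIndex)); // prints "null" instead of throwing
        System.out.println(firstOrNull(perSocket));             // prints "socket0"
    }
}
```

The second lookup is the case that raised the ArrayIndexOutOfBoundsException in the original code. The third mirrors the approach taken in r941, which always reads the first CpuInfo entry because all entries are expected to be identical.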