Bug 747977 - matahari host agent reports wrong cpu core count
Summary: matahari host agent reports wrong cpu core count
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: sigar
Version: 16
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Zane Bitter
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 714249
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-10-21 15:19 UTC by Zane Bitter
Modified: 2011-11-25 01:53 UTC (History)
11 users (show)

Fixed In Version: sigar-1.6.5-0.5.git58097d9.fc16
Clone Of: 714249
Environment:
Last Closed: 2011-11-25 01:53:16 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)

Description Zane Bitter 2011-10-21 15:19:46 UTC
+++ This bug was initially created as a clone of Bug #714249 +++

Description of problem:
matahari host properties cpu count wrong

Version-Release number of selected component (if applicable):
[root@hp-dl360g5-02 matahari]# rpm -qa | grep matahari
matahari-lib-0.4.0-5.el6.x86_64
matahari-service-0.4.0-5.el6.x86_64
matahari-host-0.4.0-5.el6.x86_64
matahari-net-0.4.0-5.el6.x86_64
matahari-broker-0.4.0-5.el6.x86_64
matahari-agent-lib-0.4.0-5.el6.x86_64
matahari-0.4.0-5.el6.x86_64
[root@hp-dl360g5-02 matahari]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.1 (Santiago)
[root@hp-dl360g5-02 matahari]# 


How reproducible:
100%

>>> host1.getProperties()
 (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), 
 (cpu_count, 4), 
 (cpu_cores, 16), 
 (cpu_model, 'Core(TM) i3 CPU         540  @ 3.07GHz'), 

[root@hp-z200-07 init.d]# cat /proc/cpuinfo | egrep 'processor|cores'
processor	: 0
cpu cores	: 2
processor	: 1
cpu cores	: 2
processor	: 2         4 cpus x 2 cores = 8, not 16 as above
cpu cores	: 2
processor	: 3
cpu cores	: 2
[root@hp-z200-07 init.d]# 

 

>>> host2.getProperties()
  (hostname, 'hp-dl360g5-02.rhts.eng.bos.redhat.com'), 
  (cpu_count, 8), 
  (cpu_cores, 64), 
  (cpu_model, 'Xeon'),

[root@hp-dl360g5-02 matahari]# cat /proc/cpuinfo | egrep 'processor|cores'
processor	: 0
cpu cores	: 2
processor	: 1
cpu cores	: 2
processor	: 2
cpu cores	: 2
processor	: 3
cpu cores	: 2    8 cpus x 2 cores = 16 cores, not 64 as above
processor	: 4
cpu cores	: 2
processor	: 5
cpu cores	: 2
processor	: 6
cpu cores	: 2
processor	: 7
cpu cores	: 2
[root@hp-dl360g5-02 matahari]#


>>> host3.getProperties()
  (hostname, 'dell-pe2850-01.rhts.eng.bos.redhat.com'),
  (cpu_count, 4), 
  (cpu_cores, 16), 
  (cpu_model, 'Xeon'), 

[root@dell-pe2850-01 ~]# cat /proc/cpuinfo | egrep 'processor|cores'
processor	: 0
cpu cores	: 1
processor	: 1
cpu cores	: 1
processor	: 2
cpu cores	: 1
processor	: 3
cpu cores	: 1

--- Additional comment from pm-rhel on 2011-06-17 13:22:37 EDT ---

Since this issue was entered in bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from abeekhof on 2011-07-07 23:44:36 EDT ---

The sigar API appears to store "the total number of cores in the system" in a "per cpu" struct.  Highly intuitive.

The correct logic would appear to be:

diff --git a/src/lib/host.c b/src/lib/host.c
index 311149f..2259e2d 100644
--- a/src/lib/host.c
+++ b/src/lib/host.c
@@ -244,12 +244,10 @@ host_get_cpu_details(void)
     sigar_cpu_info_list_get(host_init.sigar, &cpus);
 
     cpuinfo.cpus = cpus.number;
-    for (lpc = 0; lpc < cpus.number; lpc++) {
+    if(cpus.number > 0) {
         sigar_cpu_info_t *cpu = (sigar_cpu_info_t *) cpus.data;
-        if (cpuinfo.model == NULL) {
-            cpuinfo.model = g_strdup(cpu->model);
-        }
-        cpuinfo.cores += cpu->total_cores;
+       cpuinfo.model = g_strdup(cpu->model);
+       cpuinfo.cores = cpu->total_cores;
     }
 
     sigar_cpu_info_list_destroy(host_init.sigar, &cpus);


Adam: can you test and confirm please?

--- Additional comment from astokes on 2011-07-12 10:24:02 EDT ---

Here is test output from my machine with applied patch:

akimbo, /tmp/matahari  $ cat /proc/cpuinfo | egrep 'processor|cores'                          
processor       : 0
cpu cores       : 2
processor       : 1
cpu cores       : 2
processor       : 2
cpu cores       : 2
processor       : 3
cpu cores       : 2

So this looks like 4 cpus x 2 = 8 cores

My test output is as follows

In MhApiHostSuite::testCpuCount:                                                                
mh_api_host.h:111: Trace: "Verify cpu count: 4"                         
.                                                                                               
In MhApiHostSuite::testCpuNumberOfCores:                                                        
/mh_api_host.h:119: Trace: "Verify cpu num of cores: 4"                  
.                                                                                               
In MhApiHostSuite::testCpuWordSize:                                                             
mh_api_host.h:127: Trace: "Verify cpu wordsize: 64"

So our cpu cores output still seems to be incorrect

--- Additional comment from abeekhof on 2011-07-13 23:25:34 EDT ---

Adam, you might need to dig into sigar, perhaps their API is misbehaving.

--- Additional comment from astokes on 2011-08-09 17:01:15 EDT ---

Opened a thread here: 

http://forums.hyperic.com/jiveforums/thread.jspa?threadID=42917

--- Additional comment from zbitter on 2011-08-11 07:02:49 EDT ---

OK, actually this is pretty simple and the problem here is hyperthreads.

There are 3 different levels of things we're counting:

1) Sockets (sockets) -> actually pretty irrelevant; we don't really care how the cores are distributed between sockets.
2) Cores (cores) -> cpu_cores
3) Hyperthreads (processors) -> cpu_count

Remember, to the kernel a "processor" is just "a thing that will run stuff for me" and that means a hyperthread, not a physical core.

So, in the examples given:

>>> host1.getProperties()
 (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), 
 (cpu_count, 4), 
 (cpu_cores, 16), 
 (cpu_model, 'Core(TM) i3 CPU         540  @ 3.07GHz'), 

i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that cpu_cores should be 8, not 16; actually it should be 2.


>>> host2.getProperties()
  (hostname, 'hp-dl360g5-02.rhts.eng.bos.redhat.com'), 
  (cpu_count, 8), 
  (cpu_cores, 64), 
  (cpu_model, 'Xeon'),

This is a Xeon box with 2 sockets, each containing a dual-core CPU, each core having 2 hyperthreads. So the correct cpu_cores is 4, not 16 or 64.


>>> host3.getProperties()
  (hostname, 'dell-pe2850-01.rhts.eng.bos.redhat.com'),
  (cpu_count, 4), 
  (cpu_cores, 16), 
  (cpu_model, 'Xeon'), 

And this is 2 single-core Xeons with 2 hyperthreads each. So the correct cpu_cores here is 2.


Finally, it looks like Adam has a dual-core CPU with 2 hyperthreads per core, so it should be reporting 2 cores, not 4 or 8.


Although a lot of this is unintuitive, I should mention that sigar is being completely consistent with the kernel here; the output is exactly the same as what you see in /proc/cpuinfo. (Having said that, the lscpu output is generally more helpful.)

--- Additional comment from zbitter on 2011-08-11 08:14:48 EDT ---

Actually, I take that last bit back. On my machine with

1 socket
4 cores
8 processors

sigar reports:
8 cpus
8 total_sockets
16 cores_per_socket
8 total_cores

Report 8 *sockets* is completely insane, and there's no obvious way of telling the actual number of cores. I'll have to dig into their code and find out what they're doing.

--- Additional comment from zbitter on 2011-08-11 09:43:28 EDT ---

The sigar implementation was presumably written by someone with an AMD (or otherwise non-hyperthreaded) box.

total_cores = sysconf(_SC_NPROCESSORS)

That's wrong, because it's giving the number of processors, not the number of cores. Clearly. It's right there in the name.


cores_per_socket = logical processor count from cpuid instruction

According to AMD, "LogicalProcessorCount is the number of cores per 
processor." Unfortunately that's only true for AMD, since they don't have hyperthreading. In the presence of hyperthreading, LogicalProcessorCount is actually the logical processor count. (Funnily enough, that one is right there in the name too.) How it's coming up with 16 on my quad-core/8-thread box is something of a mystery.


total_sockets = (total_cores < cores_per_socket) ? total_cores : total_cores / cores_per_socket

Which would actually be correct, were all the input data not complete rubbish.


Options:
1) Write it ourselves
2) Report something other than the number of cores as the number of cores
3) Fix sigar

All of which are horrible.

--- Additional comment from zbitter on 2011-08-12 13:30:12 EDT ---

Created attachment 518067 [details]
sigar (trunk) patch

Attaching a (completely untested) patch for sigar (trunk). I will attempt to test this next week, assuming I can find a way to get sigar compiling on Fedora.

This is relying on /proc/cpuinfo for all of its information, since the kernel appears to be the only reliable source (even the hardware CPUINFO instruction lies, apparently). However, as a result, this may well not work with older versions of the kernel. Even if it does, verifying it on every kernel on every imaginable piece of hardware will be a nightmare. So it may be very difficult to get this patch upstream.

Backporting to sigar 1.6 is also not straightforward.

--- Additional comment from zbitter on 2011-08-12 13:34:27 EDT ---

I've posted a patch very similar to Andrew's (above) to the mailing list. This at least will get us reporting the number of logical cpus as the number of cores (rather than the square of that number).

I've also attached an as-yet untested patch to get sigar to tell us the actual number of cores (see previous comment).

--- Additional comment from abeekhof on 2011-08-15 23:53:06 EDT ---

(In reply to comment #6)
> So, in the examples given:
> 
> >>> host1.getProperties()
>  (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), 
>  (cpu_count, 4), 
>  (cpu_cores, 16), 
>  (cpu_model, 'Core(TM) i3 CPU         540  @ 3.07GHz'), 
> 
> i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that
> cpu_cores should be 8, not 16; actually it should be 2.

I think we want to report the total number of cores rather than the cores/cpu metric that sigar is giving us (if that means we have to do the multiplication ourselves, I'm fine with that).  

So I'd expect 8 in this case too.

--- Additional comment from errata-xmlrpc on 2011-08-16 01:45:11 EDT ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2011:11386-01
http://errata.devel.redhat.com/errata/show/11386

--- Additional comment from zbitter on 2011-08-16 04:04:13 EDT ---

(In reply to comment #11)
> (In reply to comment #6)
> > So, in the examples given:
> > 
> > >>> host1.getProperties()
> >  (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), 
> >  (cpu_count, 4), 
> >  (cpu_cores, 16), 
> >  (cpu_model, 'Core(TM) i3 CPU         540  @ 3.07GHz'), 
> > 
> > i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that
> > cpu_cores should be 8, not 16; actually it should be 2.
> 
> I think we want to report the total number of cores rather than the cores/cpu
> metric that sigar is giving us (if that means we have to do the multiplication
> ourselves, I'm fine with that).  
> 
> So I'd expect 8 in this case too.

The total number of cores is 2. There is only one socket. If you want to count threads instead of cores, there's 4 (but note that the cpu_count is already reporting that). There is in no way, shape or form 8 of anything in this box.

--- Additional comment from zbitter on 2011-08-16 04:24:52 EDT ---

I am confused as to how this is ON_QA when there are no patches upstream or in distcvs to address it?

--- Additional comment from dajohnso on 2011-08-17 10:49:57 EDT ---

Still seeing this as an issue in v0.4.2-6

--- Additional comment from zbitter on 2011-08-17 11:51:47 EDT ---

I have now delivered the matahari patch I sent to the ML to matahari upstream. (However it is not in the v0.4.2-6 build.)

I have also posted a tested version of the sigar patch to the Sigar development forum, as a reply to the thread that Adam started:
http://forums.hyperic.com/jiveforums/thread.jspa?threadID=42917

A version that will apply cleanly to the current version of sigar in Fedora is here:
https://github.com/zaneb/sigar/commit/cae27197baf656ae0d7734913d8714995a787ee4

--- Additional comment from rbryant on 2011-09-02 09:55:33 EDT ---

The matahari patch is included in the latest build, but I'm moving this back to ASSIGNED since we should probably hold off until the sigar patch is available in a build as well.

--- Additional comment from errata-xmlrpc on 2011-09-07 16:57:40 EDT ---

Bug report changed from MODIFIED to ON_QA status by the Errata System: 
Advisory RHBA-2011:11386-01: 
http://errata.devel.redhat.com/errata/stateview/11386

--- Additional comment from dajohnso on 2011-09-09 14:42:00 EDT ---

Still looks like this is broke.  I did find and install sigar-1.6.5-0.3 which had the following comment

   * Wed Sep 07 2011 Zane Bitter <zbitter> - 
   1.6.5-0.3.git58097d9 - Get CPU counts from /proc/cpuinfo Resolves: #714249 


host.properties():
------------------
    hostname            dell-pem600-01.rhts.eng.bos.redhat.com
    last_updated        1315593329000000000
    free_swap           6160376
    sequence            7
    free_mem            2913080
    os                  Linux (2.6.32-193.el6.x86_64)
    process_statistics  {'running': 1, 'sleeping': 232, 'zombie': 0, 'idle': 0, 'stopped': 0, 'total': 233}
    cpu_count           8
    cpu_model           Xeon
    cpu_cores           8           <- should be 8 x 4 = 32
    wordsize            64
    swap                6160376
    custom_uuid         U04YE1SC9Q7CFJUPOQXF
    memory              3921168
    update_interval     5
    cpu_flags           fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dts tpr_shadow vnmi flexpriority
    arch                x86_64
    uuid                ce76da54cbe7c1d9ed1d44ce00000023
qmf: quit


[root@dell-pem600-01 ~]# cat /proc/cpuinfo | egrep 'processor|cores'
processor	: 0
cpu cores	: 4
processor	: 1
cpu cores	: 4
processor	: 2
cpu cores	: 4
processor	: 3
cpu cores	: 4
processor	: 4
cpu cores	: 4
processor	: 5
cpu cores	: 4
processor	: 6
cpu cores	: 4
processor	: 7
cpu cores	: 4


[root@dell-pem600-01 ~]# rpm -qa | egrep 'matahari|sigar' | sort
matahari-0.4.4-2.el6.x86_64
matahari-agent-lib-0.4.4-2.el6.x86_64
matahari-broker-0.4.4-2.el6.x86_64
matahari-consoles-0.4.4-2.el6.x86_64
matahari-debuginfo-0.4.4-2.el6.x86_64
matahari-host-0.4.4-2.el6.x86_64
matahari-lib-0.4.4-2.el6.x86_64
matahari-network-0.4.4-2.el6.x86_64
matahari-service-0.4.4-2.el6.x86_64
matahari-sysconfig-0.4.4-2.el6.x86_64
sigar-1.6.5-0.3.git58097d9.el6.x86_64

--- Additional comment from zbitter on 2011-09-14 09:10:18 EDT ---

That is correct behaviour. That box does not have 32 cores; it has 2 quad-core sockets. 2 x 4 = 8.

Comment 1 Fedora Update System 2011-10-21 15:44:52 UTC
sigar-1.6.5-0.5.git58097d9.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/sigar-1.6.5-0.5.git58097d9.fc16

Comment 2 Fedora Update System 2011-10-22 00:11:32 UTC
Package sigar-1.6.5-0.5.git58097d9.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing sigar-1.6.5-0.5.git58097d9.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14706
then log in and leave karma (feedback).

Comment 3 Fedora Update System 2011-11-25 01:53:16 UTC
sigar-1.6.5-0.5.git58097d9.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.