Bug 714249

Summary: matahari host agent reports wrong cpu core count
Product: Red Hat Enterprise Linux 6 Reporter: Dave Johnson <dajohnso>
Component: matahariAssignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA QA Contact: Dave Johnson <dajohnso>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.1CC: abeekhof, akarol, astokes, ccrouch, matahari-maint, rbryant, whayutin, zbitter
Target Milestone: rc   
Target Release: 6.2   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: matahari-0.4.3-1.el6 Doc Type: Bug Fix
Doc Text:
Cause: requesting the CPU core count from the Matahari host agent Consequence: In some cases, the wrong CPU core count was returned. Change: Matahari and the supporting library, sigar, have been updated to ensure that the core count is not improperly affected by hyperthreading support. Consequence: The expecte CPU core count is returned
Story Points: ---
Clone Of:
: 747977 (view as bug list) Environment:
Last Closed: 2011-12-06 11:39:05 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 743047, 747977, 748554    
Attachments:
Description Flags
sigar (trunk) patch none

Description Dave Johnson 2011-06-17 17:20:30 UTC
Description of problem:
matahari host properties cpu count wrong

Version-Release number of selected component (if applicable):
[root@hp-dl360g5-02 matahari]# rpm -qa | grep matahari
matahari-lib-0.4.0-5.el6.x86_64
matahari-service-0.4.0-5.el6.x86_64
matahari-host-0.4.0-5.el6.x86_64
matahari-net-0.4.0-5.el6.x86_64
matahari-broker-0.4.0-5.el6.x86_64
matahari-agent-lib-0.4.0-5.el6.x86_64
matahari-0.4.0-5.el6.x86_64
[root@hp-dl360g5-02 matahari]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.1 (Santiago)
[root@hp-dl360g5-02 matahari]# 


How reproducible:
100%

>>> host1.getProperties()
 (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), 
 (cpu_count, 4), 
 (cpu_cores, 16), 
 (cpu_model, 'Core(TM) i3 CPU         540  @ 3.07GHz'), 

[root@hp-z200-07 init.d]# cat /proc/cpuinfo | egrep 'processor|cores'
processor	: 0
cpu cores	: 2
processor	: 1
cpu cores	: 2
processor	: 2         4 cpus x 2 cores = 8, not 16 as above
cpu cores	: 2
processor	: 3
cpu cores	: 2
[root@hp-z200-07 init.d]# 

 

>>> host2.getProperties()
  (hostname, 'hp-dl360g5-02.rhts.eng.bos.redhat.com'), 
  (cpu_count, 8), 
  (cpu_cores, 64), 
  (cpu_model, 'Xeon'),

[root@hp-dl360g5-02 matahari]# cat /proc/cpuinfo | egrep 'processor|cores'
processor	: 0
cpu cores	: 2
processor	: 1
cpu cores	: 2
processor	: 2
cpu cores	: 2
processor	: 3
cpu cores	: 2    8 cpus x 2 cores = 16 cores, not 64 as above
processor	: 4
cpu cores	: 2
processor	: 5
cpu cores	: 2
processor	: 6
cpu cores	: 2
processor	: 7
cpu cores	: 2
[root@hp-dl360g5-02 matahari]#


>>> host3.getProperties()
  (hostname, 'dell-pe2850-01.rhts.eng.bos.redhat.com'),
  (cpu_count, 4), 
  (cpu_cores, 16), 
  (cpu_model, 'Xeon'), 

[root@dell-pe2850-01 ~]# cat /proc/cpuinfo | egrep 'processor|cores'
processor	: 0
cpu cores	: 1
processor	: 1
cpu cores	: 1
processor	: 2
cpu cores	: 1
processor	: 3
cpu cores	: 1

Comment 2 Andrew Beekhof 2011-07-08 03:44:36 UTC
The sigar API appears to store "the total number of cores in the system" in a "per cpu" struct.  Highly intuitive.

The correct logic would appear to be:

diff --git a/src/lib/host.c b/src/lib/host.c
index 311149f..2259e2d 100644
--- a/src/lib/host.c
+++ b/src/lib/host.c
@@ -244,12 +244,10 @@ host_get_cpu_details(void)
     sigar_cpu_info_list_get(host_init.sigar, &cpus);
 
     cpuinfo.cpus = cpus.number;
-    for (lpc = 0; lpc < cpus.number; lpc++) {
+    if(cpus.number > 0) {
         sigar_cpu_info_t *cpu = (sigar_cpu_info_t *) cpus.data;
-        if (cpuinfo.model == NULL) {
-            cpuinfo.model = g_strdup(cpu->model);
-        }
-        cpuinfo.cores += cpu->total_cores;
+       cpuinfo.model = g_strdup(cpu->model);
+       cpuinfo.cores = cpu->total_cores;
     }
 
     sigar_cpu_info_list_destroy(host_init.sigar, &cpus);


Adam: can you test and confirm please?

Comment 3 Adam Stokes 2011-07-12 14:24:02 UTC
Here is test output from my machine with applied patch:

akimbo, /tmp/matahari  $ cat /proc/cpuinfo | egrep 'processor|cores'                          
processor       : 0
cpu cores       : 2
processor       : 1
cpu cores       : 2
processor       : 2
cpu cores       : 2
processor       : 3
cpu cores       : 2

So this looks like 4 cpus x 2 = 8 cores

My test output is as follows

In MhApiHostSuite::testCpuCount:                                                                
mh_api_host.h:111: Trace: "Verify cpu count: 4"                         
.                                                                                               
In MhApiHostSuite::testCpuNumberOfCores:                                                        
/mh_api_host.h:119: Trace: "Verify cpu num of cores: 4"                  
.                                                                                               
In MhApiHostSuite::testCpuWordSize:                                                             
mh_api_host.h:127: Trace: "Verify cpu wordsize: 64"

So our cpu cores output still seems to be incorrect

Comment 4 Andrew Beekhof 2011-07-14 03:25:34 UTC
Adam, you might need to dig into sigar, perhaps their API is misbehaving.

Comment 5 Adam Stokes 2011-08-09 21:01:15 UTC
Opened a thread here: 

http://forums.hyperic.com/jiveforums/thread.jspa?threadID=42917

Comment 6 Zane Bitter 2011-08-11 11:02:49 UTC
OK, actually this is pretty simple and the problem here is hyperthreads.

There are 3 different levels of things we're counting:

1) Sockets (sockets) -> actually pretty irrelevant; we don't really care how the cores are distributed between sockets.
2) Cores (cores) -> cpu_cores
3) Hyperthreads (processors) -> cpu_count

Remember, to the kernel a "processor" is just "a thing that will run stuff for me" and that means a hyperthread, not a physical core.

So, in the examples given:

>>> host1.getProperties()
 (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), 
 (cpu_count, 4), 
 (cpu_cores, 16), 
 (cpu_model, 'Core(TM) i3 CPU         540  @ 3.07GHz'), 

i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that cpu_cores should be 8, not 16; actually it should be 2.


>>> host2.getProperties()
  (hostname, 'hp-dl360g5-02.rhts.eng.bos.redhat.com'), 
  (cpu_count, 8), 
  (cpu_cores, 64), 
  (cpu_model, 'Xeon'),

This is a Xeon box with 2 sockets, each containing a dual-core CPU, each core having 2 hyperthreads. So the correct cpu_cores is 4, not 16 or 64.


>>> host3.getProperties()
  (hostname, 'dell-pe2850-01.rhts.eng.bos.redhat.com'),
  (cpu_count, 4), 
  (cpu_cores, 16), 
  (cpu_model, 'Xeon'), 

And this is 2 single-core Xeons with 2 hyperthreads each. So the correct cpu_cores here is 2.


Finally, it looks like Adam has a dual-core CPU with 2 hyperthreads per core, so it should be reporting 2 cores, not 4 or 8.


Although a lot of this is unintuitive, I should mention that sigar is being completely consistent with the kernel here; the output is exactly the same as what you see in /proc/cpuinfo. (Having said that, the lscpu output is generally more helpful.)

Comment 7 Zane Bitter 2011-08-11 12:14:48 UTC
Actually, I take that last bit back. On my machine with

1 socket
4 cores
8 processors

sigar reports:
8 cpus
8 total_sockets
16 cores_per_socket
8 total_cores

Report 8 *sockets* is completely insane, and there's no obvious way of telling the actual number of cores. I'll have to dig into their code and find out what they're doing.

Comment 8 Zane Bitter 2011-08-11 13:43:28 UTC
The sigar implementation was presumably written by someone with an AMD (or otherwise non-hyperthreaded) box.

total_cores = sysconf(_SC_NPROCESSORS)

That's wrong, because it's giving the number of processors, not the number of cores. Clearly. It's right there in the name.


cores_per_socket = logical processor count from cpuid instruction

According to AMD, "LogicalProcessorCount is the number of cores per 
processor." Unfortunately that's only true for AMD, since they don't have hyperthreading. In the presence of hyperthreading, LogicalProcessorCount is actually the logical processor count. (Funnily enough, that one is right there in the name too.) How it's coming up with 16 on my quad-core/8-thread box is something of a mystery.


total_sockets = (total_cores < cores_per_socket) ? total_cores : total_cores / cores_per_socket

Which would actually be correct, were all the input data not complete rubbish.


Options:
1) Write it ourselves
2) Report something other than the number of cores as the number of cores
3) Fix sigar

All of which are horrible.

Comment 9 Zane Bitter 2011-08-12 17:30:12 UTC
Created attachment 518067 [details]
sigar (trunk) patch

Attaching a (completely untested) patch for sigar (trunk). I will attempt to test this next week, assuming I can find a way to get sigar compiling on Fedora.

This is relying on /proc/cpuinfo for all of its information, since the kernel appears to be the only reliable source (even the hardware CPUINFO instruction lies, apparently). However, as a result, this may well not work with older versions of the kernel. Even if it does, verifying it on every kernel on every imaginable piece of hardware will be a nightmare. So it may be very difficult to get this patch upstream.

Backporting to sigar 1.6 is also not straightforward.

Comment 10 Zane Bitter 2011-08-12 17:34:27 UTC
I've posted a patch very similar to Andrew's (above) to the mailing list. This at least will get us reporting the number of logical cpus as the number of cores (rather than the square of that number).

I've also attached an as-yet untested patch to get sigar to tell us the actual number of cores (see previous comment).

Comment 11 Andrew Beekhof 2011-08-16 03:53:06 UTC
(In reply to comment #6)
> So, in the examples given:
> 
> >>> host1.getProperties()
>  (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), 
>  (cpu_count, 4), 
>  (cpu_cores, 16), 
>  (cpu_model, 'Core(TM) i3 CPU         540  @ 3.07GHz'), 
> 
> i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that
> cpu_cores should be 8, not 16; actually it should be 2.

I think we want to report the total number of cores rather than the cores/cpu metric that sigar is giving us (if that means we have to do the multiplication ourselves, I'm fine with that).  

So I'd expect 8 in this case too.

Comment 13 Zane Bitter 2011-08-16 08:04:13 UTC
(In reply to comment #11)
> (In reply to comment #6)
> > So, in the examples given:
> > 
> > >>> host1.getProperties()
> >  (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), 
> >  (cpu_count, 4), 
> >  (cpu_cores, 16), 
> >  (cpu_model, 'Core(TM) i3 CPU         540  @ 3.07GHz'), 
> > 
> > i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that
> > cpu_cores should be 8, not 16; actually it should be 2.
> 
> I think we want to report the total number of cores rather than the cores/cpu
> metric that sigar is giving us (if that means we have to do the multiplication
> ourselves, I'm fine with that).  
> 
> So I'd expect 8 in this case too.

The total number of cores is 2. There is only one socket. If you want to count threads instead of cores, there's 4 (but note that the cpu_count is already reporting that). There is in no way, shape or form 8 of anything in this box.

Comment 15 Dave Johnson 2011-08-17 14:49:57 UTC
Still seeing this as an issue in v0.4.2-6

Comment 16 Zane Bitter 2011-08-17 15:51:47 UTC
I have now delivered the matahari patch I sent to the ML to matahari upstream. (However it is not in the v0.4.2-6 build.)

I have also posted a tested version of the sigar patch to the Sigar development forum, as a reply to the thread that Adam started:
http://forums.hyperic.com/jiveforums/thread.jspa?threadID=42917

A version that will apply cleanly to the current version of sigar in Fedora is here:
https://github.com/zaneb/sigar/commit/cae27197baf656ae0d7734913d8714995a787ee4

Comment 17 Russell Bryant 2011-09-02 13:55:33 UTC
The matahari patch is included in the latest build, but I'm moving this back to ASSIGNED since we should probably hold off until the sigar patch is available in a build as well.

Comment 19 Dave Johnson 2011-09-09 18:42:00 UTC
Still looks like this is broke.  I did find and install sigar-1.6.5-0.3 which had the following comment

   * Wed Sep 07 2011 Zane Bitter <zbitter> - 
   1.6.5-0.3.git58097d9 - Get CPU counts from /proc/cpuinfo Resolves: #714249 


host.properties():
------------------
    hostname            dell-pem600-01.rhts.eng.bos.redhat.com
    last_updated        1315593329000000000
    free_swap           6160376
    sequence            7
    free_mem            2913080
    os                  Linux (2.6.32-193.el6.x86_64)
    process_statistics  {'running': 1, 'sleeping': 232, 'zombie': 0, 'idle': 0, 'stopped': 0, 'total': 233}
    cpu_count           8
    cpu_model           Xeon
    cpu_cores           8           <- should be 8 x 4 = 32
    wordsize            64
    swap                6160376
    custom_uuid         U04YE1SC9Q7CFJUPOQXF
    memory              3921168
    update_interval     5
    cpu_flags           fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dts tpr_shadow vnmi flexpriority
    arch                x86_64
    uuid                ce76da54cbe7c1d9ed1d44ce00000023
qmf: quit


[root@dell-pem600-01 ~]# cat /proc/cpuinfo | egrep 'processor|cores'
processor	: 0
cpu cores	: 4
processor	: 1
cpu cores	: 4
processor	: 2
cpu cores	: 4
processor	: 3
cpu cores	: 4
processor	: 4
cpu cores	: 4
processor	: 5
cpu cores	: 4
processor	: 6
cpu cores	: 4
processor	: 7
cpu cores	: 4


[root@dell-pem600-01 ~]# rpm -qa | egrep 'matahari|sigar' | sort
matahari-0.4.4-2.el6.x86_64
matahari-agent-lib-0.4.4-2.el6.x86_64
matahari-broker-0.4.4-2.el6.x86_64
matahari-consoles-0.4.4-2.el6.x86_64
matahari-debuginfo-0.4.4-2.el6.x86_64
matahari-host-0.4.4-2.el6.x86_64
matahari-lib-0.4.4-2.el6.x86_64
matahari-network-0.4.4-2.el6.x86_64
matahari-service-0.4.4-2.el6.x86_64
matahari-sysconfig-0.4.4-2.el6.x86_64
sigar-1.6.5-0.3.git58097d9.el6.x86_64

Comment 20 Zane Bitter 2011-09-14 13:10:18 UTC
That is correct behaviour. That box does not have 32 cores; it has 2 quad-core sockets. 2 x 4 = 8.

Comment 21 Russell Bryant 2011-11-16 21:17:06 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: requesting the CPU core count from the Matahari host agent
Consequence: In some cases, the wrong CPU core count was returned.
Change: Matahari and the supporting library, sigar, have been updated to ensure that the core count is not improperly affected by hyperthreading support.
Consequence: The expecte CPU core count is returned

Comment 22 errata-xmlrpc 2011-12-06 11:39:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1569.html