Bug 714249
Summary: | matahari host agent reports wrong cpu core count | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Dave Johnson <dajohnso> | ||||
Component: | matahari | Assignee: | Zane Bitter <zbitter> | ||||
Status: | CLOSED ERRATA | QA Contact: | Dave Johnson <dajohnso> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 6.1 | CC: | abeekhof, akarol, astokes, ccrouch, matahari-maint, rbryant, whayutin, zbitter | ||||
Target Milestone: | rc | ||||||
Target Release: | 6.2 | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | matahari-0.4.3-1.el6 | Doc Type: | Bug Fix | ||||
Doc Text: |
Cause: requesting the CPU core count from the Matahari host agent
Consequence: In some cases, the wrong CPU core count was returned.
Change: Matahari and the supporting library, sigar, have been updated to ensure that the core count is not improperly affected by hyperthreading support.
Consequence: The expecte CPU core count is returned
|
Story Points: | --- | ||||
Clone Of: | |||||||
: | 747977 (view as bug list) | Environment: | |||||
Last Closed: | 2011-12-06 11:39:05 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 743047, 747977, 748554 | ||||||
Attachments: |
|
Description
Dave Johnson
2011-06-17 17:20:30 UTC
The sigar API appears to store "the total number of cores in the system" in a "per cpu" struct. Highly intuitive. The correct logic would appear to be: diff --git a/src/lib/host.c b/src/lib/host.c index 311149f..2259e2d 100644 --- a/src/lib/host.c +++ b/src/lib/host.c @@ -244,12 +244,10 @@ host_get_cpu_details(void) sigar_cpu_info_list_get(host_init.sigar, &cpus); cpuinfo.cpus = cpus.number; - for (lpc = 0; lpc < cpus.number; lpc++) { + if(cpus.number > 0) { sigar_cpu_info_t *cpu = (sigar_cpu_info_t *) cpus.data; - if (cpuinfo.model == NULL) { - cpuinfo.model = g_strdup(cpu->model); - } - cpuinfo.cores += cpu->total_cores; + cpuinfo.model = g_strdup(cpu->model); + cpuinfo.cores = cpu->total_cores; } sigar_cpu_info_list_destroy(host_init.sigar, &cpus); Adam: can you test and confirm please? Here is test output from my machine with applied patch: akimbo, /tmp/matahari $ cat /proc/cpuinfo | egrep 'processor|cores' processor : 0 cpu cores : 2 processor : 1 cpu cores : 2 processor : 2 cpu cores : 2 processor : 3 cpu cores : 2 So this looks like 4 cpus x 2 = 8 cores My test output is as follows In MhApiHostSuite::testCpuCount: mh_api_host.h:111: Trace: "Verify cpu count: 4" . In MhApiHostSuite::testCpuNumberOfCores: /mh_api_host.h:119: Trace: "Verify cpu num of cores: 4" . In MhApiHostSuite::testCpuWordSize: mh_api_host.h:127: Trace: "Verify cpu wordsize: 64" So our cpu cores output still seems to be incorrect Adam, you might need to dig into sigar, perhaps their API is misbehaving. Opened a thread here: http://forums.hyperic.com/jiveforums/thread.jspa?threadID=42917 OK, actually this is pretty simple and the problem here is hyperthreads. There are 3 different levels of things we're counting: 1) Sockets (sockets) -> actually pretty irrelevant; we don't really care how the cores are distributed between sockets. 2) Cores (cores) -> cpu_cores 3) Hyperthreads (processors) -> cpu_count Remember, to the kernel a "processor" is just "a thing that will run stuff for me" and that means a hyperthread, not a physical core. So, in the examples given: >>> host1.getProperties() (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), (cpu_count, 4), (cpu_cores, 16), (cpu_model, 'Core(TM) i3 CPU 540 @ 3.07GHz'), i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that cpu_cores should be 8, not 16; actually it should be 2. >>> host2.getProperties() (hostname, 'hp-dl360g5-02.rhts.eng.bos.redhat.com'), (cpu_count, 8), (cpu_cores, 64), (cpu_model, 'Xeon'), This is a Xeon box with 2 sockets, each containing a dual-core CPU, each core having 2 hyperthreads. So the correct cpu_cores is 4, not 16 or 64. >>> host3.getProperties() (hostname, 'dell-pe2850-01.rhts.eng.bos.redhat.com'), (cpu_count, 4), (cpu_cores, 16), (cpu_model, 'Xeon'), And this is 2 single-core Xeons with 2 hyperthreads each. So the correct cpu_cores here is 2. Finally, it looks like Adam has a dual-core CPU with 2 hyperthreads per core, so it should be reporting 2 cores, not 4 or 8. Although a lot of this is unintuitive, I should mention that sigar is being completely consistent with the kernel here; the output is exactly the same as what you see in /proc/cpuinfo. (Having said that, the lscpu output is generally more helpful.) Actually, I take that last bit back. On my machine with 1 socket 4 cores 8 processors sigar reports: 8 cpus 8 total_sockets 16 cores_per_socket 8 total_cores Report 8 *sockets* is completely insane, and there's no obvious way of telling the actual number of cores. I'll have to dig into their code and find out what they're doing. The sigar implementation was presumably written by someone with an AMD (or otherwise non-hyperthreaded) box. total_cores = sysconf(_SC_NPROCESSORS) That's wrong, because it's giving the number of processors, not the number of cores. Clearly. It's right there in the name. cores_per_socket = logical processor count from cpuid instruction According to AMD, "LogicalProcessorCount is the number of cores per processor." Unfortunately that's only true for AMD, since they don't have hyperthreading. In the presence of hyperthreading, LogicalProcessorCount is actually the logical processor count. (Funnily enough, that one is right there in the name too.) How it's coming up with 16 on my quad-core/8-thread box is something of a mystery. total_sockets = (total_cores < cores_per_socket) ? total_cores : total_cores / cores_per_socket Which would actually be correct, were all the input data not complete rubbish. Options: 1) Write it ourselves 2) Report something other than the number of cores as the number of cores 3) Fix sigar All of which are horrible. Created attachment 518067 [details]
sigar (trunk) patch
Attaching a (completely untested) patch for sigar (trunk). I will attempt to test this next week, assuming I can find a way to get sigar compiling on Fedora.
This is relying on /proc/cpuinfo for all of its information, since the kernel appears to be the only reliable source (even the hardware CPUINFO instruction lies, apparently). However, as a result, this may well not work with older versions of the kernel. Even if it does, verifying it on every kernel on every imaginable piece of hardware will be a nightmare. So it may be very difficult to get this patch upstream.
Backporting to sigar 1.6 is also not straightforward.
I've posted a patch very similar to Andrew's (above) to the mailing list. This at least will get us reporting the number of logical cpus as the number of cores (rather than the square of that number). I've also attached an as-yet untested patch to get sigar to tell us the actual number of cores (see previous comment). (In reply to comment #6) > So, in the examples given: > > >>> host1.getProperties() > (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), > (cpu_count, 4), > (cpu_cores, 16), > (cpu_model, 'Core(TM) i3 CPU 540 @ 3.07GHz'), > > i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that > cpu_cores should be 8, not 16; actually it should be 2. I think we want to report the total number of cores rather than the cores/cpu metric that sigar is giving us (if that means we have to do the multiplication ourselves, I'm fine with that). So I'd expect 8 in this case too. (In reply to comment #11) > (In reply to comment #6) > > So, in the examples given: > > > > >>> host1.getProperties() > > (hostname, 'hp-z200-07.rhts.eng.bos.redhat.com'), > > (cpu_count, 4), > > (cpu_cores, 16), > > (cpu_model, 'Core(TM) i3 CPU 540 @ 3.07GHz'), > > > > i3 is a dual-core CPU with 2 hyperthreads per core. Dave suggested that > > cpu_cores should be 8, not 16; actually it should be 2. > > I think we want to report the total number of cores rather than the cores/cpu > metric that sigar is giving us (if that means we have to do the multiplication > ourselves, I'm fine with that). > > So I'd expect 8 in this case too. The total number of cores is 2. There is only one socket. If you want to count threads instead of cores, there's 4 (but note that the cpu_count is already reporting that). There is in no way, shape or form 8 of anything in this box. Still seeing this as an issue in v0.4.2-6 I have now delivered the matahari patch I sent to the ML to matahari upstream. (However it is not in the v0.4.2-6 build.) I have also posted a tested version of the sigar patch to the Sigar development forum, as a reply to the thread that Adam started: http://forums.hyperic.com/jiveforums/thread.jspa?threadID=42917 A version that will apply cleanly to the current version of sigar in Fedora is here: https://github.com/zaneb/sigar/commit/cae27197baf656ae0d7734913d8714995a787ee4 The matahari patch is included in the latest build, but I'm moving this back to ASSIGNED since we should probably hold off until the sigar patch is available in a build as well. Still looks like this is broke. I did find and install sigar-1.6.5-0.3 which had the following comment * Wed Sep 07 2011 Zane Bitter <zbitter> - 1.6.5-0.3.git58097d9 - Get CPU counts from /proc/cpuinfo Resolves: #714249 host.properties(): ------------------ hostname dell-pem600-01.rhts.eng.bos.redhat.com last_updated 1315593329000000000 free_swap 6160376 sequence 7 free_mem 2913080 os Linux (2.6.32-193.el6.x86_64) process_statistics {'running': 1, 'sleeping': 232, 'zombie': 0, 'idle': 0, 'stopped': 0, 'total': 233} cpu_count 8 cpu_model Xeon cpu_cores 8 <- should be 8 x 4 = 32 wordsize 64 swap 6160376 custom_uuid U04YE1SC9Q7CFJUPOQXF memory 3921168 update_interval 5 cpu_flags fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dts tpr_shadow vnmi flexpriority arch x86_64 uuid ce76da54cbe7c1d9ed1d44ce00000023 qmf: quit [root@dell-pem600-01 ~]# cat /proc/cpuinfo | egrep 'processor|cores' processor : 0 cpu cores : 4 processor : 1 cpu cores : 4 processor : 2 cpu cores : 4 processor : 3 cpu cores : 4 processor : 4 cpu cores : 4 processor : 5 cpu cores : 4 processor : 6 cpu cores : 4 processor : 7 cpu cores : 4 [root@dell-pem600-01 ~]# rpm -qa | egrep 'matahari|sigar' | sort matahari-0.4.4-2.el6.x86_64 matahari-agent-lib-0.4.4-2.el6.x86_64 matahari-broker-0.4.4-2.el6.x86_64 matahari-consoles-0.4.4-2.el6.x86_64 matahari-debuginfo-0.4.4-2.el6.x86_64 matahari-host-0.4.4-2.el6.x86_64 matahari-lib-0.4.4-2.el6.x86_64 matahari-network-0.4.4-2.el6.x86_64 matahari-service-0.4.4-2.el6.x86_64 matahari-sysconfig-0.4.4-2.el6.x86_64 sigar-1.6.5-0.3.git58097d9.el6.x86_64 That is correct behaviour. That box does not have 32 cores; it has 2 quad-core sockets. 2 x 4 = 8. Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause: requesting the CPU core count from the Matahari host agent Consequence: In some cases, the wrong CPU core count was returned. Change: Matahari and the supporting library, sigar, have been updated to ensure that the core count is not improperly affected by hyperthreading support. Consequence: The expecte CPU core count is returned Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1569.html |