1468041 – Ceilometer CPU utilization over 100%

Bug 1468041 - Ceilometer CPU utilization over 100%

Summary: Ceilometer CPU utilization over 100%

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-ceilometer
Sub Component:
Version:	10.0 (Newton)
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	z4
Target Release:	10.0 (Newton)
Assignee:	Mehdi ABAAKOUK
QA Contact:	Sasha Smolyak
Docs Contact:
URL:
Whiteboard:
Depends On:	1471913
Blocks:	1472359
TreeView+	depends on / blocked

Reported:	2017-07-05 20:42 UTC by Siggy Sigwald
Modified:	2022-08-16 12:52 UTC (History)
CC List:	8 users (show)
Fixed In Version:	openstack-ceilometer-7.1.0-2.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1472359 (view as bug list)
Environment:
Last Closed:	2017-09-06 17:04:34 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1527620	None	None	None	2017-07-17 08:18:20 UTC
OpenStack gerrit	484274	None	None	None	2017-07-17 07:52:14 UTC
OpenStack gerrit	484640	None	None	None	2017-07-18 06:55:32 UTC
Red Hat Issue Tracker	OSP-4652	None	None	None	2022-08-16 12:52:55 UTC
Red Hat Product Errata	RHBA-2017:2651	normal	SHIPPED_LIVE	openstack-ceilometer bug fix advisory	2017-09-06 20:54:19 UTC

Description Siggy Sigwald 2017-07-05 20:42:45 UTC

Description of problem:
Our numbers for cpu_util are wrong so it doesn't allow us to monitor cpu_util via ceilometer and we have to use other mechanisms (non-OpenStack). It is occurring in production.
In kilo, we're seeing ceilometer metrics that are clearly incorrect, for example, cpu_util > 100%, which may be due to the bug here: https://bugs.launchpad.net/ceilometer/+bug/1527620. 
This doesn't seem to be fixed in newton.


Version-Release number of selected component (if applicable):

openstack-ceilometer-alarm-2015.1.3-3.el7ost.noarch
openstack-ceilometer-api-2015.1.3-3.el7ost.noarch
openstack-ceilometer-central-2015.1.3-3.el7ost.noarch
openstack-ceilometer-collector-2015.1.3-3.el7ost.noarch
openstack-ceilometer-common-2015.1.3-3.el7ost.noarch
openstack-ceilometer-compute-2015.1.3-3.el7ost.noarch
openstack-ceilometer-notification-2015.1.3-3.el7ost.noarch
python-ceilometer-2015.1.3-3.el7ost.noarch
python-ceilometerclient-1.0.14-1.el7ost.noarch

Comment 1 Siggy Sigwald 2017-07-05 20:45:30 UTC

There are several nova instances that are logging > 100% cpu_util as shown by ceilometer.  

See comment #31 and the attachment for an example of cpu_util history for a single instance.  
vcpu's are pinned and are on the same numa node for instance 9831d763-43a5-4d2c-9c0d-4d34bcf32b83: 

grep "vcpu placement='static'" /cases/01868571/sosreport-20170623-183614/nwblwiczvmpn2-c-pr-cmp-31.localdomain/sos_commands/virsh/virsh_-r_dumpxml_instance-00000175   -A 16

  <vcpu placement='static'>12</vcpu>
  <cputune>
    <shares>12288</shares>
    <vcpupin vcpu='0' cpuset='12'/>
    <vcpupin vcpu='1' cpuset='13'/>
    <vcpupin vcpu='2' cpuset='14'/>
    <vcpupin vcpu='3' cpuset='15'/>
    <vcpupin vcpu='4' cpuset='16'/>
    <vcpupin vcpu='5' cpuset='17'/>
    <vcpupin vcpu='6' cpuset='18'/>
    <vcpupin vcpu='7' cpuset='19'/>
    <vcpupin vcpu='8' cpuset='20'/>
    <vcpupin vcpu='9' cpuset='21'/>
    <vcpupin vcpu='10' cpuset='22'/>
    <vcpupin vcpu='11' cpuset='23'/>
    <emulatorpin cpuset='12-23'/>
  </cputune>

cat /cases/01868571/sosreport-20170623-183614/nwblwiczvmpn2-c-pr-cmp-31.localdomain/sos_commands/processor/lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:              2
CPU MHz:               2500.000
BogoMIPS:              5007.93
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

Ceilometer also shows 12 vcpus: 

grep  "'vcpus':" /cases/01868571/sosreport-20170623-183614/nwblwiczvmpn2-c-pr-cmp-31.localdomain/var/log/ceilometer/compute.log  | head -1 | grep vcpus

(partial result shown)
'ram': 131072, 'ephemeral': 0, 'vcpus': 12, 'disk': 2, u'id': u'87dad299-861c-42cd-abac-64cdf6eb6c70'}, u'id': u'9831d763-43a5-4d2c-9c0d-4d34bcf32b83'

See comment #29 for the top 10 highest cpu_util and the 3 instances that logged them.  
Information in file instance-9831d763-43a5-4d2c-9c0d-4d34bcf32b83_cpu_history.txt 

See comment #26 and the attachment that demonstrate how ceilometer data is calculated from raw cpu time collected from libvirt
Ceilometer_Calculations.xlsx 

Assumptions:

1) the scale factor (see comment #3) divides by the number of cpu's so cpu_util should never be > 100%

Observations:

1) Whenever cpu_util is over 100% then the next several samples are 0.0.  How many following samples are 0.0 seems to depend on how much over 100% it is. See comment #31 
2) Instance 9831d763-43a5-4d2c-9c0d-4d34bcf32b83 has 12 vcpus according to libvirt and that is what ceilometer reports also.  Comment #32
3) Instance 9831d763-43a5-4d2c-9c0d-4d34bcf32b83 is the only nova instance on the compute host it is running on. See /cases/01868571/sosreport-20170623-183614/nwblwiczvmpn2-c-pr-cmp-31.localdomain/sos_commands/virsh/virsh_list_--all 
4) vcpus are pinned to cpuset 12 through 23 and are on the same numa node.  See comment #32.  
5) If you dig down deep enough with the grep in comment #26 you will find quite a few more instances logging > 100% cpu_util, just not radically so much as the three shown.  

Conclusion:

It looks to me like ceilometer is correctly reporting the data it is getting from the hypervisor.

Comment 4 Mehdi ABAAKOUK 2017-07-17 07:50:55 UTC

Siggy Sigwald have well described the issue, Ceilometer can't really do anything, the 100% cpu is reported by libvirt itself.

Comment 5 Mehdi ABAAKOUK 2017-07-17 07:56:37 UTC

I propose the backport upstream that cap the cpu_util to 100%

Comment 7 Mehdi ABAAKOUK 2017-07-17 19:04:24 UTC

On the Ceilomete side, cpu_util is computed like this:

ceilometer polls libvirt every 10 minutes and retrieve the cpu_time in nanoseconds. Each cpu_time sample have the polled timestamp, because it's a timestamp it depends on the host clock.

cpu_util is not the true/real cpu utilization in percent, but the rate of change of the cpu_time scaled with the number of vcpu of the instance. Rate of change computed with the diff of timestamps can't be nanoseconds precision like the cpu_time is. That's why the cpu_util is not precise. So, capping it to 100% is ok.


On libvirt side:

Since OSP12, Ceilometer comsumes vcpu.<vcpu_number>.time and vcpu.<vcpu_number>.wait instead of cpu_time (and fallback to cpu_time is libvirt is too old) to build a more accurate cpu_time sample (and then a more accurate cpu_util). Using this new libvirt statistics avoid the 100% pike and then 0 most of the time. But I don't known if the rhel7 libvirt provides them.

If it provides them, I can take a look if it's possible to backport https://review.openstack.org/#/c/451297/

The support can be checked with "virsh nodecpustats <domain_id_or_name>", can you check that ?

Comment 8 Mehdi ABAAKOUK 2017-07-18 06:55:33 UTC

I have proposed upstream a solution to not use timestamp and get a more accurate rate of change: https://review.openstack.org/#/c/484640/

Comment 10 Sasha Smolyak 2017-08-21 12:32:44 UTC

Looks like it's fixed
[stack@undercloud-0 ~]$ gnocchi measures show f91a31d2-0299-4553-bf32-9c9ccd7b4615
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2017-08-21T11:10:00+00:00 |       300.0 | 0.270526577729 |
| 2017-08-21T11:20:00+00:00 |       300.0 | 0.256668196409 |
| 2017-08-21T11:30:00+00:00 |       300.0 | 0.259932436628 |
| 2017-08-21T11:40:00+00:00 |       300.0 |   11.109217152 |
| 2017-08-21T11:50:00+00:00 |       300.0 |  88.5498293714 |
| 2017-08-21T12:00:00+00:00 |       300.0 |          100.0 |
| 2017-08-21T12:10:00+00:00 |       300.0 |  99.9955968962 |
| 2017-08-21T12:20:00+00:00 |       300.0 |  99.9995608443 |
| 2017-08-21T12:30:00+00:00 |       300.0 |  99.3340708628 |
+---------------------------+-------------+----------------+

Comment 12 errata-xmlrpc 2017-09-06 17:04:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2651

Note You need to log in before you can comment on or make changes to this bug.