Bug 1468041 - Ceilometer CPU utilization over 100%
Ceilometer CPU utilization over 100%
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer (Show other bugs)
10.0 (Newton)
x86_64 Linux
unspecified Severity high
: z4
: 10.0 (Newton)
Assigned To: Mehdi ABAAKOUK
Sasha Smolyak
: Triaged, ZStream
Depends On: 1471913
Blocks: 1472359
  Show dependency treegraph
 
Reported: 2017-07-05 16:42 EDT by Siggy Sigwald
Modified: 2018-02-22 03:42 EST (History)
8 users (show)

See Also:
Fixed In Version: openstack-ceilometer-7.1.0-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1472359 (view as bug list)
Environment:
Last Closed: 2017-09-06 13:04:34 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1527620 None None None 2017-07-17 04:18 EDT
OpenStack gerrit 484274 None None None 2017-07-17 03:52 EDT
OpenStack gerrit 484640 None None None 2017-07-18 02:55 EDT

  None (edit)
Description Siggy Sigwald 2017-07-05 16:42:45 EDT
Description of problem:
Our numbers for cpu_util are wrong so it doesn't allow us to monitor cpu_util via ceilometer and we have to use other mechanisms (non-OpenStack). It is occurring in production.
In kilo, we're seeing ceilometer metrics that are clearly incorrect, for example, cpu_util > 100%, which may be due to the bug here: https://bugs.launchpad.net/ceilometer/+bug/1527620. 
This doesn't seem to be fixed in newton.


Version-Release number of selected component (if applicable):

openstack-ceilometer-alarm-2015.1.3-3.el7ost.noarch
openstack-ceilometer-api-2015.1.3-3.el7ost.noarch
openstack-ceilometer-central-2015.1.3-3.el7ost.noarch
openstack-ceilometer-collector-2015.1.3-3.el7ost.noarch
openstack-ceilometer-common-2015.1.3-3.el7ost.noarch
openstack-ceilometer-compute-2015.1.3-3.el7ost.noarch
openstack-ceilometer-notification-2015.1.3-3.el7ost.noarch
python-ceilometer-2015.1.3-3.el7ost.noarch
python-ceilometerclient-1.0.14-1.el7ost.noarch
Comment 1 Siggy Sigwald 2017-07-05 16:45:30 EDT
There are several nova instances that are logging > 100% cpu_util as shown by ceilometer.  

See comment #31 and the attachment for an example of cpu_util history for a single instance.  
vcpu's are pinned and are on the same numa node for instance 9831d763-43a5-4d2c-9c0d-4d34bcf32b83: 

grep "vcpu placement='static'" /cases/01868571/sosreport-20170623-183614/nwblwiczvmpn2-c-pr-cmp-31.localdomain/sos_commands/virsh/virsh_-r_dumpxml_instance-00000175   -A 16

  <vcpu placement='static'>12</vcpu>
  <cputune>
    <shares>12288</shares>
    <vcpupin vcpu='0' cpuset='12'/>
    <vcpupin vcpu='1' cpuset='13'/>
    <vcpupin vcpu='2' cpuset='14'/>
    <vcpupin vcpu='3' cpuset='15'/>
    <vcpupin vcpu='4' cpuset='16'/>
    <vcpupin vcpu='5' cpuset='17'/>
    <vcpupin vcpu='6' cpuset='18'/>
    <vcpupin vcpu='7' cpuset='19'/>
    <vcpupin vcpu='8' cpuset='20'/>
    <vcpupin vcpu='9' cpuset='21'/>
    <vcpupin vcpu='10' cpuset='22'/>
    <vcpupin vcpu='11' cpuset='23'/>
    <emulatorpin cpuset='12-23'/>
  </cputune>

cat /cases/01868571/sosreport-20170623-183614/nwblwiczvmpn2-c-pr-cmp-31.localdomain/sos_commands/processor/lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:              2
CPU MHz:               2500.000
BogoMIPS:              5007.93
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

Ceilometer also shows 12 vcpus: 

grep  "'vcpus':" /cases/01868571/sosreport-20170623-183614/nwblwiczvmpn2-c-pr-cmp-31.localdomain/var/log/ceilometer/compute.log  | head -1 | grep vcpus

(partial result shown)
'ram': 131072, 'ephemeral': 0, 'vcpus': 12, 'disk': 2, u'id': u'87dad299-861c-42cd-abac-64cdf6eb6c70'}, u'id': u'9831d763-43a5-4d2c-9c0d-4d34bcf32b83'

See comment #29 for the top 10 highest cpu_util and the 3 instances that logged them.  
Information in file instance-9831d763-43a5-4d2c-9c0d-4d34bcf32b83_cpu_history.txt 

See comment #26 and the attachment that demonstrate how ceilometer data is calculated from raw cpu time collected from libvirt
Ceilometer_Calculations.xlsx 

Assumptions:

1) the scale factor (see comment #3) divides by the number of cpu's so cpu_util should never be > 100%

Observations:

1) Whenever cpu_util is over 100% then the next several samples are 0.0.  How many following samples are 0.0 seems to depend on how much over 100% it is. See comment #31 
2) Instance 9831d763-43a5-4d2c-9c0d-4d34bcf32b83 has 12 vcpus according to libvirt and that is what ceilometer reports also.  Comment #32
3) Instance 9831d763-43a5-4d2c-9c0d-4d34bcf32b83 is the only nova instance on the compute host it is running on. See /cases/01868571/sosreport-20170623-183614/nwblwiczvmpn2-c-pr-cmp-31.localdomain/sos_commands/virsh/virsh_list_--all 
4) vcpus are pinned to cpuset 12 through 23 and are on the same numa node.  See comment #32.  
5) If you dig down deep enough with the grep in comment #26 you will find quite a few more instances logging > 100% cpu_util, just not radically so much as the three shown.  

Conclusion:

It looks to me like ceilometer is correctly reporting the data it is getting from the hypervisor.
Comment 4 Mehdi ABAAKOUK 2017-07-17 03:50:55 EDT
Siggy Sigwald have well described the issue, Ceilometer can't really do anything, the 100% cpu is reported by libvirt itself.
Comment 5 Mehdi ABAAKOUK 2017-07-17 03:56:37 EDT
I propose the backport upstream that cap the cpu_util to 100%
Comment 7 Mehdi ABAAKOUK 2017-07-17 15:04:24 EDT
On the Ceilomete side, cpu_util is computed like this:

ceilometer polls libvirt every 10 minutes and retrieve the cpu_time in nanoseconds. Each cpu_time sample have the polled timestamp, because it's a timestamp it depends on the host clock.

cpu_util is not the true/real cpu utilization in percent, but the rate of change of the cpu_time scaled with the number of vcpu of the instance. Rate of change computed with the diff of timestamps can't be nanoseconds precision like the cpu_time is. That's why the cpu_util is not precise. So, capping it to 100% is ok.


On libvirt side:

Since OSP12, Ceilometer comsumes vcpu.<vcpu_number>.time and vcpu.<vcpu_number>.wait instead of cpu_time (and fallback to cpu_time is libvirt is too old) to build a more accurate cpu_time sample (and then a more accurate cpu_util). Using this new libvirt statistics avoid the 100% pike and then 0 most of the time. But I don't known if the rhel7 libvirt provides them.

If it provides them, I can take a look if it's possible to backport https://review.openstack.org/#/c/451297/

The support can be checked with "virsh nodecpustats <domain_id_or_name>", can you check that ?
Comment 8 Mehdi ABAAKOUK 2017-07-18 02:55:33 EDT
I have proposed upstream a solution to not use timestamp and get a more accurate rate of change: https://review.openstack.org/#/c/484640/
Comment 10 Sasha Smolyak 2017-08-21 08:32:44 EDT
Looks like it's fixed
[stack@undercloud-0 ~]$ gnocchi measures show f91a31d2-0299-4553-bf32-9c9ccd7b4615
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2017-08-21T11:10:00+00:00 |       300.0 | 0.270526577729 |
| 2017-08-21T11:20:00+00:00 |       300.0 | 0.256668196409 |
| 2017-08-21T11:30:00+00:00 |       300.0 | 0.259932436628 |
| 2017-08-21T11:40:00+00:00 |       300.0 |   11.109217152 |
| 2017-08-21T11:50:00+00:00 |       300.0 |  88.5498293714 |
| 2017-08-21T12:00:00+00:00 |       300.0 |          100.0 |
| 2017-08-21T12:10:00+00:00 |       300.0 |  99.9955968962 |
| 2017-08-21T12:20:00+00:00 |       300.0 |  99.9995608443 |
| 2017-08-21T12:30:00+00:00 |       300.0 |  99.3340708628 |
+---------------------------+-------------+----------------+
Comment 12 errata-xmlrpc 2017-09-06 13:04:34 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2651

Note You need to log in before you can comment on or make changes to this bug.