Bug 1520694 - Unable to calculate rates correctly when sample is handled by another controller
Summary: Unable to calculate rates correctly when sample is handled by another controller
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Upstream M1
Target Release: 14.0 (Rocky)
Assignee: Mehdi ABAAKOUK
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Duplicates: 1525977
Depends On:
Blocks:
Reported: 2017-12-05 00:29 UTC by David Vallee Delisle
Modified: 2021-12-10 15:44 UTC
CC List: 9 users

Fixed In Version: openstack-ceilometer-10.0.1-0.20180530162349.1c02e4b.el7ost
Doc Type: No Doc Update
Doc Text:
-
Clone Of:
Environment:
Last Closed: 2019-01-11 11:48:37 UTC
Target Upstream Version:
Embargoed:




Links
- OpenStack gerrit 530885 (MERGED): gnocchi: configure archive policies on Ceilo side (last updated 2020-09-24 12:32:27 UTC)
- Red Hat Issue Tracker OSP-8676 (last updated 2021-12-10 15:44:13 UTC)
- Red Hat Product Errata RHEA-2019:0045 (last updated 2019-01-11 11:48:50 UTC)

Description David Vallee Delisle 2017-12-05 00:29:52 UTC
Description of problem:
It looks like agent-notification calculates the rate transformation on whichever controller receives the sample, and that controller often has no "previous" value stored in its cache. As a result, the rates are not actually calculated.
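To illustrate why a process-local cache loses rates, here is a simplified model. It is not Ceilometer's rate-of-change transformer: `Sample`, `RateWorker` and `dispatch` are hypothetical names, and the round-robin delivery is an assumption standing in for "whichever controller happens to consume the sample".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    resource_id: str
    timestamp: float  # seconds
    volume: float     # cumulative counter value


class RateWorker:
    """Hypothetical worker with its own, process-local "previous sample" cache."""

    def __init__(self) -> None:
        self.cache: Dict[str, Sample] = {}

    def handle(self, s: Sample) -> Optional[float]:
        prev = self.cache.get(s.resource_id)
        if prev is not None and s.timestamp <= prev.timestamp:
            return None  # out of order: dropped, cache left unchanged
        self.cache[s.resource_id] = s
        if prev is None:
            return None  # no predecessor in *this* process: dropped
        return (s.volume - prev.volume) / (s.timestamp - prev.timestamp)


def dispatch(samples: List[Sample], workers: List[RateWorker]) -> List[Optional[float]]:
    """Deliver samples round-robin across workers and collect the rates."""
    return [workers[i % len(workers)].handle(s) for i, s in enumerate(samples)]


# One counter sampled every 600 s, growing by 600 per interval (1.0 per second).
samples = [Sample("vm1-hda", 600.0 * i, 600.0 * i) for i in range(6)]
print(dispatch(samples, [RateWorker()]))                # [None, 1.0, 1.0, 1.0, 1.0, 1.0]
print(dispatch(samples, [RateWorker(), RateWorker()]))  # [None, None, 1.0, 1.0, 1.0, 1.0]
```

In this model every additional worker adds one more sample with no predecessor, and each surviving rate is computed over a longer window than the polling interval.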

Version-Release number of selected component (if applicable):
openstack-ceilometer-notification-7.1.0-4

How reproducible:
All the time

Actual results:

agent-notification.log-20171129:2017-11-28 15:22:55.626 273838 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:22:55.642 273838 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:22:55.663 273838 DEBUG ceilometer.transformer.conversions [-] converted to: <name: disk.device.allocation, volume: 0.0, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:220

agent-notification.log-20171129:2017-11-28 15:32:55.283 82546 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:32:55.301 82546 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:32:55.318 82546 WARNING ceilometer.transformer.conversions [-] dropping sample with no predecessor: (<name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339>,)

agent-notification.log-20171129:2017-11-28 15:42:55.200 82546 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:42:55.216 82546 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:42:55.232 82546 DEBUG ceilometer.transformer.conversions [-] converted to: <name: disk.device.allocation, volume: 0.0, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:220
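A small helper can make the pattern in these logs easier to spot by extracting the process ID and the transformer outcome for each sample. This is a hypothetical diagnostic sketch: `outcomes` and both regular expressions are assumptions fitted to the lines shown here, not a documented log format.

```python
import re
from typing import List, Tuple

# "<file>:<date> <time> <pid> <LEVEL> <logger> [-] <message>" (fitted to these logs)
LINE_RE = re.compile(r"^[^:]+:(\S+ \S+) (\d+) (\w+) (\S+) \[-\] (.*)$")
SAMPLE_TS_RE = re.compile(r"timestamp: (\S+?)>")


def outcomes(lines: List[str]) -> List[Tuple[str, int, str]]:
    """Return (sample_timestamp, pid, outcome) for each transformer result line."""
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group(4) != "ceilometer.transformer.conversions":
            continue
        message = m.group(5)
        if message.startswith("converted to:"):
            outcome = "converted"
        elif message.startswith("dropping sample with no predecessor"):
            outcome = "dropped"
        else:
            continue  # e.g. "handling sample" debug lines
        ts = SAMPLE_TS_RE.search(message)
        results.append((ts.group(1) if ts else "?", int(m.group(2)), outcome))
    return results


LOG = [
    "agent-notification.log-20171129:2017-11-28 15:22:55.663 273838 DEBUG "
    "ceilometer.transformer.conversions [-] converted to: <name: disk.device.allocation, "
    "volume: 0.0, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, "
    "timestamp: 2017-11-28T15:22:54.105361> handle_sample "
    "/usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:220",
    "agent-notification.log-20171129:2017-11-28 15:32:55.318 82546 WARNING "
    "ceilometer.transformer.conversions [-] dropping sample with no predecessor: "
    "(<name: disk.device.allocation, volume: 643072, "
    "resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, "
    "timestamp: 2017-11-28T15:32:54.093339>,)",
    "agent-notification.log-20171129:2017-11-28 15:42:55.232 82546 DEBUG "
    "ceilometer.transformer.conversions [-] converted to: <name: disk.device.allocation, "
    "volume: 0.0, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, "
    "timestamp: 2017-11-28T15:42:54.125560> handle_sample "
    "/usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:220",
]

for ts, pid, outcome in outcomes(LOG):
    print(ts, pid, outcome)
```

On these lines the drop coincides with the first sample handled by PID 82546, whose cache had no predecessor; the previous sample had been processed by PID 273838.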

Expected results:

Agent Notification should not drop samples. It should be able to get the previous value, or at least have access to it. It looks like it currently takes this value only from its local cache.

Additional info:

Some samples are not plotted in Gnocchi.

Comment 6 Mehdi ABAAKOUK 2017-12-07 15:12:18 UTC
There are two ways to fix this:

- The current way, on the Ceilometer side: set workload_partitioning=True

This creates many new queues on RabbitMQ so that all "cpu" samples of a given resource are routed to the same ceilometer-agent-notification worker.

However, this increases the CPU usage of ceilometer-agent-notification and the load on RabbitMQ, and it adds lag to the processing.
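The routing idea behind workload partitioning can be sketched as a stable hash from a grouping key to a queue index. This is a minimal model, assuming the grouping key is the sample's resource ID and that each queue is consumed by exactly one worker; `queue_for` and `route` are hypothetical, and the sketch does not model how Ceilometer assigns queues to agents.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def queue_for(resource_id: str, num_queues: int) -> int:
    """Map a grouping key to a queue index with a stable hash.

    Python's built-in hash() is randomized per process for strings, so it
    could send the same resource to different queues on different
    controllers; a SHA-256 digest gives the same answer everywhere.
    """
    digest = hashlib.sha256(resource_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


def route(resource_ids: List[str], num_queues: int) -> Dict[int, List[str]]:
    """Group a stream of sample resource IDs by destination queue."""
    routed: Dict[int, List[str]] = defaultdict(list)
    for rid in resource_ids:
        routed[queue_for(rid, num_queues)].append(rid)
    return dict(routed)


stream = ["vm1-hda", "vm2-hda", "vm1-hda", "vm3-vda", "vm2-hda"]
print(route(stream, 4))
```

Every sample of a given resource lands in one queue, so one worker's cache always holds its predecessor. The extra queues and the hashing work on every sample are where the added RabbitMQ load and CPU usage come from.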

This solution is also not perfect, because samples can still arrive out of order: if a received sample is older than the previously kept one, it is dropped. The computed rate of change will be correct, but some points will still be missing, as with workload_partitioning=False.
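The out-of-order case can be shown with a single worker. This is a simplified sketch, not Ceilometer's code: `rates_single_worker` is hypothetical, and it assumes the worker keeps only the newest sample per resource and drops anything older.

```python
from typing import Dict, List, Optional, Tuple

# (resource_id, timestamp_seconds, cumulative_volume), in arrival order.
Point = Tuple[str, float, float]


def rates_single_worker(points: List[Point]) -> List[Optional[float]]:
    """Rate of change on one worker that keeps only the latest sample."""
    cache: Dict[str, Tuple[float, float]] = {}
    rates: List[Optional[float]] = []
    for rid, ts, vol in points:
        prev = cache.get(rid)
        if prev is not None and ts <= prev[0]:
            rates.append(None)  # older than the kept sample: dropped
            continue
        cache[rid] = (ts, vol)
        rates.append(None if prev is None else (vol - prev[1]) / (ts - prev[0]))
    return rates


# The 600 s sample is delayed and arrives after the 1200 s one.
arrival = [("vm1-hda", 0.0, 0.0), ("vm1-hda", 1200.0, 1200.0),
           ("vm1-hda", 600.0, 600.0), ("vm1-hda", 1800.0, 1800.0)]
print(rates_single_worker(arrival))  # [None, 1.0, None, 1.0]
```

The surviving rates are correct (1.0 per second), but the point for the delayed sample is lost, which matches the behaviour described above.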

This feature does not have comprehensive testing, and I have reviewed many upstream fixes for it that were not backported to the stable versions. It also decreases the performance of Ceilometer.

- A better way, on the Gnocchi side:

Create a dedicated archive policy for all rate metrics (cpu_util, network.*rate, disk.*rate, ...) that computes the "rate:last" aggregation.

Better calculation: Gnocchi keeps all the points needed to compute the rate correctly.
No more missing points in the "rate of change" computation.
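Because the points are stored before being aggregated, a late point lands in its own time bucket instead of being discarded (in Gnocchi, late points are only accepted within the policy's back window). The sketch below is our own approximation of a "rate:last" aggregation, not Gnocchi's implementation: it assumes the result is the difference between the last values of consecutive buckets, not divided by the granularity, and `rate_last` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def rate_last(points: List[Tuple[float, float]], granularity: float) -> List[Tuple[float, float]]:
    """Approximate a "rate:last" aggregation over stored points.

    points: (timestamp_seconds, cumulative_value) in any arrival order.
    Returns (bucket_start, delta): the change in each bucket's last value
    relative to the previous non-empty bucket.
    """
    last: Dict[float, Tuple[float, float]] = {}
    for ts, value in points:
        bucket = ts - ts % granularity
        # Keep the latest point of each bucket, regardless of arrival order.
        if bucket not in last or ts >= last[bucket][0]:
            last[bucket] = (ts, value)
    buckets = sorted(last)
    return [(cur, last[cur][1] - last[prev][1])
            for prev, cur in zip(buckets, buckets[1:])]


# Same delayed arrival as a point-by-point transformer would see it.
arrival = [(0.0, 0.0), (1200.0, 1200.0), (600.0, 600.0), (1800.0, 1800.0)]
print(rate_last(arrival, 600.0))
# [(600.0, 600.0), (1200.0, 600.0), (1800.0, 600.0)]
```

Unlike a transformer that only remembers the previous sample, every interval gets a value here, because the delayed point is sorted into place before the differences are taken.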

However, this approach requires Gnocchi 4.x, so it cannot be used before OSP12, and the archive policy has to be created manually.
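Since the archive policy has to be created by hand, here is a hedged sketch of the request. It assumes Gnocchi's v1 REST endpoint `POST /v1/archive_policy` with `name`, `back_window`, `aggregation_methods` and `definition` fields; check these against the Gnocchi version you run. The policy name, endpoint URL and token are placeholders, the 600s granularity only mirrors the 10-minute sample interval seen in the logs, and the request is built but not sent. Attaching metrics to the policy is not covered.

```python
import json
import urllib.request
from typing import Any, Dict, List


def archive_policy_body(name: str, granularity: str, timespan: str,
                        methods: List[str]) -> Dict[str, Any]:
    """JSON body for creating a Gnocchi archive policy (assumed v1 fields)."""
    return {
        "name": name,
        "back_window": 0,
        "aggregation_methods": methods,
        "definition": [{"granularity": granularity, "timespan": timespan}],
    }


def archive_policy_request(endpoint: str, token: str,
                           body: Dict[str, Any]) -> urllib.request.Request:
    """Build, but do not send, the POST request to the Gnocchi API."""
    return urllib.request.Request(
        endpoint.rstrip("/") + "/v1/archive_policy",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


# Placeholders: the policy name, endpoint and token are hypothetical.
body = archive_policy_body("rate-last-example", "600s", "30 days", ["rate:last"])
req = archive_policy_request("http://gnocchi.example.com:8041", "TOKEN", body)
# urllib.request.urlopen(req) would submit it to a real Gnocchi endpoint.
```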

Comment 15 Mehdi ABAAKOUK 2018-01-02 08:53:59 UTC
vcpus, disk.ephemeral.size and disk.root.size are sent by Nova every hour, so it is normal that you did not see them every 10 minutes.

The others are affected by the rate metrics issue I described in comments 6 and 9.

Comment 16 Mehdi ABAAKOUK 2018-01-04 14:39:26 UTC
*** Bug 1525977 has been marked as a duplicate of this bug. ***

Comment 22 Sasha Smolyak 2018-11-05 12:09:13 UTC
Verified, automated

Comment 25 errata-xmlrpc 2019-01-11 11:48:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

