Bug 1520694

Summary: Unable to calculate rates correctly when sample is handled by another controller
Product: Red Hat OpenStack
Reporter: David Vallee Delisle <dvd>
Component: openstack-ceilometer
Assignee: Mehdi ABAAKOUK <mabaakou>
Status: CLOSED ERRATA
QA Contact: Sasha Smolyak <ssmolyak>
Severity: high
Docs Contact:
Priority: high
Version: 10.0 (Newton)
CC: djuran, jdanjou, jruzicka, mabaakou, marjones, pkundal, rlondhe, sacpatil, srevivo
Target Milestone: Upstream M1
Keywords: FutureFeature, Triaged
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-ceilometer-10.0.1-0.20180530162349.1c02e4b.el7ost
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-11 11:48:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description David Vallee Delisle 2017-12-05 00:29:52 UTC
Description of problem:
It looks like agent-notification performs the rate transformation on whichever controller receives the sample, and that controller often has no "previous" value stored in its cache. As a result, the rates are not actually calculated.

Version-Release number of selected component (if applicable):
openstack-ceilometer-notification-7.1.0-4

How reproducible:
All the time

Actual results:

agent-notification.log-20171129:2017-11-28 15:22:55.626 273838 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:22:55.642 273838 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:22:55.663 273838 DEBUG ceilometer.transformer.conversions [-] converted to: <name: disk.device.allocation, volume: 0.0, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:220

agent-notification.log-20171129:2017-11-28 15:32:55.283 82546 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:32:55.301 82546 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:32:55.318 82546 WARNING ceilometer.transformer.conversions [-] dropping sample with no predecessor: (<name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339>,)

agent-notification.log-20171129:2017-11-28 15:42:55.200 82546 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:42:55.216 82546 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:42:55.232 82546 DEBUG ceilometer.transformer.conversions [-] converted to: <name: disk.device.allocation, volume: 0.0, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:220

Expected results:

Agent Notification should not be dropping samples. It should be able to get the previous value, or at least have access to that information. It looks like it's taking this value from the cache.

Additional info:

Some samples are not plotted in gnocchi.
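
To illustrate the suspected mechanism, here is a minimal sketch, assuming each notification worker keeps its own in-memory cache of the last sample per resource. The RateOfChange class, the run helper and the sample values are hypothetical and are not Ceilometer code; they only mimic the "no predecessor" behaviour seen in the logs above.

```python
class RateOfChange:
    """Toy rate-of-change transformer with a per-process cache (hypothetical)."""

    def __init__(self):
        # resource_id -> (timestamp_seconds, volume) of the last sample seen
        self.cache = {}

    def handle_sample(self, resource_id, timestamp, volume):
        prev = self.cache.get(resource_id)
        if prev is not None and timestamp <= prev[0]:
            return None  # stale or duplicate sample: dropped, cache untouched
        self.cache[resource_id] = (timestamp, volume)
        if prev is None:
            return None  # no predecessor in this worker's cache: dropped
        prev_ts, prev_volume = prev
        return (volume - prev_volume) / (timestamp - prev_ts)


def run(samples, workers, pick_worker):
    """Feed each sample to the worker chosen by pick_worker(index, sample)."""
    return [
        workers[pick_worker(i, s)].handle_sample(*s)
        for i, s in enumerate(samples)
    ]


# (resource_id, timestamp in seconds, cumulative volume), made-up values
samples = [("hda", 0, 100), ("hda", 600, 700), ("hda", 1200, 1900), ("hda", 1800, 1900)]

# Samples alternate between two workers, each with its own cache.
round_robin = run(samples, [RateOfChange(), RateOfChange()], lambda i, s: i % 2)

# All samples reach the same worker.
single = run(samples, [RateOfChange()], lambda i, s: 0)

print(round_robin)
print(single)
```

With two workers, each one drops its first sample for the resource and then computes rates over a 20 minute gap instead of 10 minutes. With a single worker only the very first sample lacks a predecessor.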

Comment 6 Mehdi ABAAKOUK 2017-12-07 15:12:18 UTC
We have two ways to do that:

- The current way, on the Ceilometer side, is to set workload_partitioning=True.

This creates many new queues on rabbitmq to ensure that all "cpu" samples are routed to the same ceilometer-agent-notification worker.

But this increases the CPU usage of ceilometer-agent-notification and the load on rabbitmq, and adds lag to the processing.

Also, that solution is not perfect because samples can still arrive out of order. If the received sample is older than the previously kept one, it will be dropped. The rate of change computation will be correct, but some points will be missing, as they are with workload_partitioning=False.

This feature does not have comprehensive testing, and I have reviewed many fixes upstream that are not backported to stable versions. It also decreases the performance of Ceilometer.

- A better way, on the Gnocchi side:

Create a special archive policy for all rate metrics (cpu_util, network.*rate, disk.*rate, ...) that computes the "rate:last" aggregation.

The calculation is better: Gnocchi keeps all the points needed to compute it correctly.
No more missing points in the "rate of change" computation.

But it requires Gnocchi 4.X, so it can't be used before OSP12. And the archive policy needs to be created manually.
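
As a rough illustration of why computing the rate on the storage side avoids the missing points, here is a minimal sketch, assuming all raw points for a metric are kept and can be sorted before the rate is derived. The function rates_from_points and the data are hypothetical; this is not Gnocchi's implementation, and the exact definition Gnocchi uses for its rate aggregations may differ (for example, whether the difference is divided by elapsed time).

```python
def rates_from_points(points):
    """Derive per-second rates from (timestamp_seconds, value) points.

    The points may arrive in any order; they are sorted first, so a late
    sample is still used instead of being dropped.
    """
    ordered = sorted(points)
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(ordered, ordered[1:])
        if t1 > t0  # skip duplicate timestamps to avoid dividing by zero
    ]


# Made-up cumulative values, deliberately delivered out of order.
arrived = [(600, 700), (0, 100), (1800, 1900), (1200, 1900)]
rates = rates_from_points(arrived)
print(rates)
```

Because the rate is derived from the full stored series rather than from one cached predecessor, it does not matter which worker handled a sample or in which order the samples arrived.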

Comment 15 Mehdi ABAAKOUK 2018-01-02 08:53:59 UTC
vcpus, disk.ephemeral.size and disk.root.size are sent by nova every hour, so it is normal that you don't see them every 10 minutes.

The others are affected by the rate metrics issue I described in comments 6 and 9.

Comment 16 Mehdi ABAAKOUK 2018-01-04 14:39:26 UTC
*** Bug 1525977 has been marked as a duplicate of this bug. ***

Comment 22 Sasha Smolyak 2018-11-05 12:09:13 UTC
Verified, automated

Comment 25 errata-xmlrpc 2019-01-11 11:48:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045