1520694 – Unable to calculate rates correctly when sample is handled by another controller

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-ceilometer
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	Upstream M1
Target Release:	14.0 (Rocky)
Assignee:	Mehdi ABAAKOUK
QA Contact:	Sasha Smolyak
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1525977
Depends On:
Blocks:

Reported:	2017-12-05 00:29 UTC by David Vallee Delisle
Modified:	2021-12-10 15:44 UTC
CC List:	9 users
Fixed In Version:	openstack-ceilometer-10.0.1-0.20180530162349.1c02e4b.el7ost
Doc Type:	No Doc Update
Doc Text:	-
Clone Of:
Environment:
Last Closed:	2019-01-11 11:48:37 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	530885	'None'	MERGED	gnocchi: configure archive policies on Ceilo side	2020-09-24 12:32:27 UTC
Red Hat Issue Tracker	OSP-8676	None	None	None	2021-12-10 15:44:13 UTC
Red Hat Product Errata	RHEA-2019:0045	None	None	None	2019-01-11 11:48:50 UTC

Description David Vallee Delisle 2017-12-05 00:29:52 UTC

Description of problem:
It looks like agent-notification runs the rate transformation on whichever controller receives the sample, and that controller often has no "previous" value stored in its cache. As a result, the rates are not really calculated.

Version-Release number of selected component (if applicable):
openstack-ceilometer-notification-7.1.0-4

How reproducible:
All the time

Actual results:

agent-notification.log-20171129:2017-11-28 15:22:55.626 273838 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:22:55.642 273838 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:22:55.663 273838 DEBUG ceilometer.transformer.conversions [-] converted to: <name: disk.device.allocation, volume: 0.0, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:22:54.105361> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:220

agent-notification.log-20171129:2017-11-28 15:32:55.283 82546 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:32:55.301 82546 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:32:55.318 82546 WARNING ceilometer.transformer.conversions [-] dropping sample with no predecessor: (<name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:32:54.093339>,)

agent-notification.log-20171129:2017-11-28 15:42:55.200 82546 DEBUG ceilometer.pipeline [-] Pipeline disk_sink: Transform sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> from 0 transformer _publish_samples /usr/lib/python2.7/site-packages/ceilometer/pipeline.py:486

agent-notification.log-20171129:2017-11-28 15:42:55.216 82546 DEBUG ceilometer.transformer.conversions [-] handling sample <name: disk.device.allocation, volume: 643072, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:186

agent-notification.log-20171129:2017-11-28 15:42:55.232 82546 DEBUG ceilometer.transformer.conversions [-] converted to: <name: disk.device.allocation, volume: 0.0, resource_id: 033f8e35-64b1-4cb3-8dc7-43803c2ce894-hda, timestamp: 2017-11-28T15:42:54.125560> handle_sample /usr/lib/python2.7/site-packages/ceilometer/transformer/conversions.py:220

Expected results:

Agent Notification should not be dropping samples. It should be able to get the previous value, or at least have access to that information. It looks like it's taking this value from the cache.
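
To illustrate the suspected mechanism, here is a minimal Python sketch, not Ceilometer's actual code. It assumes each notification worker keeps its own in-memory cache of the last sample per resource and that samples for one resource alternate between two workers. The class name RateWorker, the worker names and the sample volumes are all hypothetical.

```python
from datetime import datetime


class RateWorker:
    """Hypothetical stand-in for one notification worker with a private cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # resource_id -> (volume, timestamp) of the last sample seen

    def handle(self, resource_id, volume, timestamp):
        prev = self.cache.get(resource_id)
        self.cache[resource_id] = (volume, timestamp)
        if prev is None:
            return None  # no predecessor in this worker's cache: nothing to publish
        prev_volume, prev_timestamp = prev
        seconds = (timestamp - prev_timestamp).total_seconds()
        return (volume - prev_volume) / seconds if seconds > 0 else None


def at(minute):
    # Illustrative 10-minute polling interval.
    return datetime(2017, 11, 28, 15, minute, 54)


# Made-up cumulative volumes, one sample every 10 minutes.
samples = [(22, 1000), (32, 1600), (42, 1600), (52, 2800)]

# Case 1: samples alternate between two workers, each with its own cache.
workers = [RateWorker("controller-0"), RateWorker("controller-1")]
split_results = [
    workers[i % 2].handle("disk-hda", volume, at(minute))
    for i, (minute, volume) in enumerate(samples)
]

# Case 2: a single worker sees every sample.
single = RateWorker("controller-0")
single_results = [single.handle("disk-hda", volume, at(minute)) for minute, volume in samples]

print(split_results)   # the first sample on each worker has no predecessor
print(single_results)  # only the very first sample has no predecessor
```

In the split case each worker loses its first sample and computes later rates over a 20-minute gap instead of 10 minutes, so the published values differ from what a single worker would produce.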

Additional info:

Some samples are not plotted in gnocchi.

Comment 6 Mehdi ABAAKOUK 2017-12-07 15:12:18 UTC

We have two ways to do that:

- The current way to do it, on Ceilometer side, by setting workload_partitioning=True

This creates many new queues on rabbitmq to be able to ensure that all "cpu" samples are routed to the same ceilometer-agent-notification worker.

But this increases the cpu usage of ceilometer-agent-notification, the load on rabbitmq, and adds lag to the processing.

Also, that solution is not perfect because samples can still arrive out of order. If the received sample is older than the previously kept one, it will be dropped. The rate-of-change computation will be correct, but some points will still be missing, as with workload_partitioning=False.

This feature does not have comprehensive testing, and I have reviewed many upstream fixes that are not backported to stable versions. It also decreases the performance of Ceilometer.
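
The out-of-order case can be sketched in Python as follows. This is a simplified illustration, not the transformer's real code: it assumes a single worker that keeps only the last accepted sample and drops anything older. The function name and the sample values are hypothetical.

```python
def rate_of_change(arrivals):
    """Compute per-second rates for one resource, processing samples in arrival order.

    arrivals: list of (timestamp_seconds, cumulative_volume).
    Returns (rates, dropped_timestamps).
    """
    prev = None
    rates = []
    dropped = []
    for ts, volume in arrivals:
        if prev is None:
            prev = (ts, volume)
            dropped.append(ts)  # first sample: no predecessor to compare with
            continue
        prev_ts, prev_volume = prev
        if ts <= prev_ts:
            dropped.append(ts)  # older than the kept sample: dropped
            continue
        rates.append((ts, (volume - prev_volume) / (ts - prev_ts)))
        prev = (ts, volume)
    return rates, dropped


# The sample taken at t=600 arrives after the one taken at t=1200.
arrivals = [(0, 100), (1200, 700), (600, 400), (1800, 1300)]
rates, dropped = rate_of_change(arrivals)
print(rates)    # each rate is correct for its interval, but no point exists at t=600
print(dropped)
```

Every published rate is right for the interval it covers, yet the point at t=600 never appears, which is the gap described above.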

- A better way to do it, on Gnocchi side:

Create a special archive policy for all rated metrics (cpu_util, network.*rate, disk.*rate, ...), that computes the "rate:last" aggregation.

The calculation is better: Gnocchi keeps all the points needed to compute it correctly.
No more missing points in the "rate of change" computation.

But it requires Gnocchi 4.X, so it can't be used before OSP12. And the archive policy needs to be created manually.
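
Why storing every raw point helps can be sketched in Python. This illustrates the idea only and is not Gnocchi's implementation: the function name is hypothetical, and dividing by the time delta to get a per-second rate is an assumption of this sketch rather than a statement about how "rate:last" is defined.

```python
def rates_from_stored_points(points):
    """Compute rates after the fact from all stored (timestamp_seconds, volume) points.

    Because the raw points are kept, they can be sorted by timestamp first,
    so arrival order no longer matters.
    """
    ordered = sorted(points)
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(ordered, ordered[1:])
    ]


# Same kind of out-of-order arrival: t=600 is received after t=1200.
stored = [(0, 100), (1200, 700), (600, 400), (1800, 1300)]
stored_rates = rates_from_stored_points(stored)
print(stored_rates)  # one rate per point except the earliest one
```

Only the earliest point lacks a rate, since it has no predecessor; late arrivals are simply slotted into place.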

Comment 15 Mehdi ABAAKOUK 2018-01-02 08:53:59 UTC

vcpus, disk.ephemeral.size, disk.root.size are sent by nova every hour, so it is normal that you don't see them every 10 minutes.

The others are affected by the rate metrics issue I describe in comments 6 and 9.

Comment 16 Mehdi ABAAKOUK 2018-01-04 14:39:26 UTC

*** Bug 1525977 has been marked as a duplicate of this bug. ***

Comment 22 Sasha Smolyak 2018-11-05 12:09:13 UTC

Verified, automated

Comment 25 errata-xmlrpc 2019-01-11 11:48:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045