Description of problem:
Controller CPU and disk resources are highly utilized with only 200 tiny instances running and ceilometer polling configured to 60s. Compared to having ceilometer store data in MongoDB, there is much more system resource usage at the same scale of instances.

I deployed 3 controllers with 12 computes, reconfigured Gnocchi to run correctly (redis auth, started the gnocchi services, and ran gnocchi-upgrade), then ran a browbeat config to spawn 200 tiny cirros instances.

Version-Release number of selected component (if applicable):
OSP10 deployed by OSPd 10 Build 2016-08-29.1

How reproducible:
I have seen similar resource usage with OSP9 clouds configured for Gnocchi as well.

Steps to Reproduce:
1.
2.
3.

Actual results:
Viewing individual system processes I can see gnocchi-metricd is using ~50% of all available CPU. I would expect a cloud of this size to handle much more than 200 tiny instances before saturating CPU/disk resources. Compared to MongoDB as a backend, I see far less CPU and disk used for the same polling interval and instance count.

Expected results:

Additional info:
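For reference, a minimal sketch of how the per-process CPU usage above can be observed on a controller (assumes the sysstat package is installed; exact process names may vary by deployment):

    # sample gnocchi-metricd CPU usage every 10 seconds, 6 samples
    pidstat -u -C gnocchi-metricd 10 6

    # one-shot view of the busiest Gnocchi/httpd processes
    top -b -n 1 -o %CPU | grep -E 'gnocchi|httpd' | head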
Thanks for the report Alex. I have a few questions.

- How many metricd instances do you have running, on how many computers?
- Can you run "gnocchi status" a few times, with an interval of a few tens of seconds, to see how the backlog is behaving?

We also made great improvements to job dispatching in 2.2, so it would be nice to be able to test that at some point, maybe.
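For reference, a minimal sketch of capturing that (assumes the gnocchiclient is installed and overcloud credentials are sourced; the rc file and log file names are placeholders):

    source ~/overcloudrc
    for i in $(seq 1 20); do date; gnocchi status; sleep 10; done | tee gnocchi-status.log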
Also, can you try increasing metric_processing_delay to 60?
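For anyone reproducing this, a sketch of that change on each controller (assumes crudini is available, that the option lives in the [metricd] section of /etc/gnocchi/gnocchi.conf, and the usual OSP service name; adjust for your deployment):

    crudini --set /etc/gnocchi/gnocchi.conf metricd metric_processing_delay 60
    systemctl restart openstack-gnocchi-metricd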
(In reply to Julien Danjou from comment #2)
> Thanks for the report Alex. I have a few questions.
>
> - How many metricd instances do you have running, on how many computers?

On each controller I have:
24 metricd processing
1 metricd scheduler
1 metricd reporting
1 metricd janitor
24 httpd processes hosting gnocchi-api
1 gnocchi-statsd process

So across the cloud I have a total of:
72 metricd processing processes
3 metricd schedulers
3 metricd reporting processes
3 metricd janitor processes
72 httpd processes hosting gnocchi-api
3 gnocchi-statsd processes

> - Can you run "gnocchi status" a few times, with an interval of a few tens
> of seconds, to see how the backlog is behaving?

This is after 200 instances are booted (I will attach the "dump" of this for
the time frame of 0 instances to 200 instances):

Sat Sep 10 05:04:28 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 700   |
| storage/total number of measures to process         | 700   |
+-----------------------------------------------------+-------+
Sat Sep 10 05:04:41 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 1633  |
| storage/total number of measures to process         | 5909  |
+-----------------------------------------------------+-------+
Sat Sep 10 05:04:57 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 622   |
| storage/total number of measures to process         | 14378 |
+-----------------------------------------------------+-------+
Sat Sep 10 05:05:11 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 1211  |
| storage/total number of measures to process         | 1208  |
+-----------------------------------------------------+-------+
Sat Sep 10 05:05:24 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 2034  |
| storage/total number of measures to process         | 6601  |
+-----------------------------------------------------+-------+
Sat Sep 10 05:05:39 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 989   |
| storage/total number of measures to process         | 1067  |
+-----------------------------------------------------+-------+

> We also made great improvements to job dispatching in 2.2, so it would be
> nice to be able to test that at some point, maybe.
(In reply to Julien Danjou from comment #3)
> Also, can you try increasing metric_processing_delay to 60?

I tuned this across all three controllers and I see less resource usage after restarting metricd. CPU usage across the three controllers has fallen from a steady ~90% to spikes ranging from ~15% to ~70%, averaging 40-50% over longer time spans. Disk IOPS per 10s interval have dropped from 1.16K-1.6K to 60-700 on each controller.

Attached are some usage graphs, with the CPU graph annotated to show when instances were booted and when metric_processing_delay was adjusted.
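For reference, a sketch of one way to collect comparable per-10s disk numbers on a controller (assumes sysstat is installed; sda is a placeholder for the relevant device):

    # r/s and w/s are read/write requests per second, averaged over each 10s interval
    iostat -x -d sda 10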
Created attachment 1200273 [details] Graphs showing CPU/Memory/Disk/Gnocchi-metricd CPU usage when booting instances & gnocchi status during run
Ok, so as far as I can tell, this is totally normal. The more aggressive you are about processing new points (to achieve some kind of real-time processing), the more CPU is going to be used, since almost no batching is going to happen.

Batching would happen more naturally if you had even more measures coming in.

My suggestion would be to change the default upstream to something higher (e.g. 30s), but that's about all I'd suggest.

Just 2 more questions:
- Are you using 2.2 (I read OSP 10, so I imagine yes)?
- Is your coordinator set to redis? (it should be)
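For reference, a sketch of how to check the coordinator setting on a controller (assumes crudini is available and that the option is coordination_url in the [storage] section of /etc/gnocchi/gnocchi.conf; host and password are placeholders):

    crudini --get /etc/gnocchi/gnocchi.conf storage coordination_url
    # expected output along the lines of:
    # redis://:<password>@<redis-vip>:6379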
(In reply to Julien Danjou from comment #7)
> Ok, so as far as I can tell, this is totally normal. The more aggressive you
> are about processing new points (to achieve some kind of real-time
> processing), the more CPU is going to be used, since almost no batching is
> going to happen.
>
> Batching would happen more naturally if you had even more measures coming in.
>
> My suggestion would be to change the default upstream to something higher
> (e.g. 30s), but that's about all I'd suggest.
>
> Just 2 more questions:
> - Are you using 2.2 (I read OSP 10, so I imagine yes)?

Gnocchi 2.2:

[root@overcloud-controller-0 ~]# rpm -qa | grep gnocchi
openstack-gnocchi-common-2.2.1-0.20160821103314.7a40269.el7ost.noarch
openstack-gnocchi-statsd-2.2.1-0.20160821103314.7a40269.el7ost.noarch
python-gnocchiclient-2.5.1-0.20160819083701.b816e3c.el7ost.noarch
puppet-gnocchi-9.1.0-0.20160815142605.7aa0c60.el7ost.noarch
openstack-gnocchi-carbonara-2.2.1-0.20160821103314.7a40269.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-2.2.1-0.20160821103314.7a40269.el7ost.noarch
openstack-gnocchi-metricd-2.2.1-0.20160821103314.7a40269.el7ost.noarch
python-gnocchi-2.2.1-0.20160821103314.7a40269.el7ost.noarch
openstack-gnocchi-api-2.2.1-0.20160821103314.7a40269.el7ost.noarch

> - Is your coordinator set to redis? (it should be)

Yes, the coordination driver is using the redis backend.
Ok, so everything seems pretty normal to me, except that, yes, maybe metricd is being a bit too slow. We have already worked on that in master, especially with the Ceph backend.

The CPU usage is obviously going to be higher than with MongoDB, as the data are being processed and aggregated by metricd, whereas with MongoDB that was done at query time. So metricd does the work up front, but retrieval time is going to be < 1s every time, whereas with MongoDB it would have been > 20 min per query. ;-)

Is there anything else I can see/check with you, or are we good with that? If you want to try our master branch, that would be interesting too, especially for us, but I can understand if it's too much of a stretch. :)
I just realized that you're running more than 2.2; it's master from the end of August. So that's cool. But could you upgrade to the latest version of Gnocchi from master? That would let you leverage the new storage engine, and we'd love to have the feedback! The new puddle should be available tomorrow with a more recent version.
I sent https://review.openstack.org/#/c/371561/ upstream to increase the default delays, so the CPU usage is less aggressive. That will be included in 3.0.

Alex, to reply to your question, the version 0.20160821103314.7a40269 is:
https://github.com/openstack/gnocchi/commit/7a40269
so it's basically between 2.2 and 3.0. :-)
The higher metricd delay values are now the default in Gnocchi 3.0.
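For reference, the defaults in question correspond to the delays reported later in this bug (60s/120s/300s). A sketch of checking the values actually set on a controller (assumes crudini and the [metricd] section; options not set explicitly in /etc/gnocchi/gnocchi.conf fall back to the compiled-in defaults):

    crudini --get /etc/gnocchi/gnocchi.conf metricd metric_processing_delay   # 60 in 3.0
    crudini --get /etc/gnocchi/gnocchi.conf metricd metric_reporting_delay    # 120
    crudini --get /etc/gnocchi/gnocchi.conf metricd metric_cleanup_delay      # 300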
Alex, we don't have enough hardware to test at anything like the scale you did, so we can't validate this as being really fixed. But you should see much improvement with OSP 10.

Can you check that and confirm, so we can move this bug to VERIFIED?
(In reply to Julien Danjou from comment #14)
> Alex, we don't have enough hardware to test at anything like the scale you
> did, so we can't validate this as being really fixed. But you should see
> much improvement with OSP 10.
>
> Can you check that and confirm, so we can move this bug to VERIFIED?

Certainly, I'll re-test with the latest available bits today.

-Alex
I have re-run the same scenario on hardware with the same specs as before, using the latest passed_phase1 build available at the time I built the undercloud (2016-10-18.2). However, several major performance-impacting factors have changed, such as the Gnocchi version, and the Gnocchi backend storage is now Swift rather than File.

During the 200 instance boot + 1 hour of extra quiesce time:
CPU is less utilized by the time the 200 instances have finished booting; however, the disks are >50% utilized after all 200 instances are booted. This version of Gnocchi includes the longer default delays for metric processing, reporting, and cleanup (now 60s, 120s, and 300s respectively). There is far less Gnocchi CPU usage during this timeframe, as shown in the attached graphs. At this point we should see what cloud system resource utilization looks like once more Swift tunings are applied and land in our downstream product; see this bug for further details:
https://bugzilla.redhat.com/show_bug.cgi?id=1381721

200 instances + several days of quiesce time (the bad part):
Allowing the instances to continue running, we see memory growth that does not appear to stop. Extending the test out to 3 days, we see ceilometer-collector eventually consume all memory, resulting in the OOM killer terminating processes, since there is no swap and the collector keeps up this behavior. This is documented in this upstream bug:
https://bugs.launchpad.net/ceilometer/+bug/1551667
It is unclear to me at this time whether the patches for this bug have landed downstream yet.

Gnocchi versions:
openstack-gnocchi-carbonara-3.0.1-0.20161006151137.a852145.el7ost.noarch
python-gnocchiclient-2.6.0-1.el7ost.noarch
openstack-gnocchi-common-3.0.1-0.20161006151137.a852145.el7ost.noarch
openstack-gnocchi-metricd-3.0.1-0.20161006151137.a852145.el7ost.noarch
python-gnocchi-3.0.1-0.20161006151137.a852145.el7ost.noarch
openstack-gnocchi-api-3.0.1-0.20161006151137.a852145.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.0.1-0.20161006151137.a852145.el7ost.noarch
openstack-gnocchi-statsd-3.0.1-0.20161006151137.a852145.el7ost.noarch
puppet-gnocchi-9.4.0-1.el7ost.noarch

I am unsure whether we should close this bug on the basis that Gnocchi is using far less CPU than before, since that could simply be a product of the major backend storage driver change (Swift vs File) rather than of the newer defaults and latest code. Additionally, the cloud will eventually fail anyway, with ceilometer-collector consuming all memory.
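For reference, a sketch of how one might track the ceilometer-collector memory growth described above over several days (the log path and 10-minute sampling interval are arbitrary placeholders):

    while true; do
        date >> /var/tmp/collector-rss.log
        # RSS is in kB; the [c] pattern keeps grep from matching itself
        ps -eo pid,rss,etime,cmd | grep '[c]eilometer-collector' >> /var/tmp/collector-rss.log
        sleep 600
    done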
Created attachment 1213432 [details] Gnocchi-with-Swift-2016-10.18.2
Hi Alex,

Thanks a lot for your report, that is so helpful. With my developer hat on, I can vouch that most of the CPU usage improvement you saw is due to changes we made upstream and to the higher delays between runs.

The Swift issues, as you noticed, are tracked in another report. We also track the Ceilometer memory usage issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=1336664

Since this particular issue is being fixed in this bug, I suggest we follow up on the two other issues separately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html