Description of problem:
Controller CPU and disk resources are highly utilized with only 200 tiny instances running and ceilometer polling configured to 60s. Compared to having ceilometer store data in MongoDB, there is much more system resource usage at the same scale of instances.

I deployed 3 controllers with 12 computes, reconfigured Gnocchi to run correctly (redis auth, started the gnocchi services, and ran gnocchi-upgrade), then ran a browbeat config to spawn 200 tiny cirros instances.

Version-Release number of selected component (if applicable):
OSP10 deployed by OSPd 10 Build 2016-08-29.1

How reproducible:
I have seen similar resource usage with OSP9 clouds configured for Gnocchi as well.

Steps to Reproduce:
1.
2.
3.

Actual results:
Viewing individual system processes I can see gnocchi-metricd is using ~50% of all available CPU. I would expect a cloud of this size to handle much more than 200 tiny instances before saturating CPU/disk resources. Compared to MongoDB as a backend, I see far less CPU and disk used for the same polling interval and instance count.

Expected results:

Additional info:
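For reference, a minimal sketch of how the per-process CPU usage above can be observed on a controller (assumes the sysstat package is installed; exact process names may vary by deployment):

    # sample gnocchi-metricd CPU usage every 10 seconds, 6 samples
    pidstat -u -C gnocchi-metricd 10 6

    # one-shot view of the busiest Gnocchi/httpd processes
    top -b -n 1 -o %CPU | grep -E 'gnocchi|httpd' | head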
Thanks for the report Alex. I have a few questions.

- How many metricd instances do you have running, on how many computers?
- Can you run "gnocchi status" a few times, with an interval of a few tens of seconds, to see how the backlog is behaving?

We also made great improvements to job dispatching in 2.2, so it would be nice to be able to test that at some point, maybe.
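For reference, a minimal sketch of capturing that (assumes the gnocchiclient is installed and overcloud credentials are sourced; the rc file and log file names are placeholders):

    source ~/overcloudrc
    for i in $(seq 1 20); do date; gnocchi status; sleep 10; done | tee gnocchi-status.log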
Also, can you try increasing metric_processing_delay to 60?
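For anyone reproducing this, a sketch of that change on each controller (assumes crudini is available, that the option lives in the [metricd] section of /etc/gnocchi/gnocchi.conf, and the usual OSP service name; adjust for your deployment):

    crudini --set /etc/gnocchi/gnocchi.conf metricd metric_processing_delay 60
    systemctl restart openstack-gnocchi-metricd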
(In reply to Julien Danjou from comment #2)
> Thanks for the report Alex. I have a few questions.
>
> - How many metricd instances do you have running, on how many computers?

On each controller I have:
24 metricd processing
1 metricd scheduler
1 metricd reporting
1 metricd janitor
24 httpd processes hosting gnocchi-api
1 gnocchi-statsd process

So across the cloud I have a total of:
72 metricd processing processes
3 metricd schedulers
3 metricd reporting processes
3 metricd janitor processes
72 httpd processes hosting gnocchi-api
3 gnocchi-statsd processes

> - Can you run "gnocchi status" a few times, with an interval of a few tens
> of seconds, to see how the backlog is behaving?

This is after 200 instances are booted (I will attach the "dump" of this for
the time frame of 0 instances to 200 instances):

Sat Sep 10 05:04:28 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 700   |
| storage/total number of measures to process         | 700   |
+-----------------------------------------------------+-------+
Sat Sep 10 05:04:41 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 1633  |
| storage/total number of measures to process         | 5909  |
+-----------------------------------------------------+-------+
Sat Sep 10 05:04:57 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 622   |
| storage/total number of measures to process         | 14378 |
+-----------------------------------------------------+-------+
Sat Sep 10 05:05:11 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 1211  |
| storage/total number of measures to process         | 1208  |
+-----------------------------------------------------+-------+
Sat Sep 10 05:05:24 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 2034  |
| storage/total number of measures to process         | 6601  |
+-----------------------------------------------------+-------+
Sat Sep 10 05:05:39 UTC 2016
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 989   |
| storage/total number of measures to process         | 1067  |
+-----------------------------------------------------+-------+

> We also made great improvements to job dispatching in 2.2, so it would be
> nice to be able to test that at some point, maybe.
(In reply to Julien Danjou from comment #3)
> Also, can you try increasing metric_processing_delay to 60?

I tuned this across all three controllers and I see less resource usage after restarting metricd. CPU usage across the three controllers has fallen from a steady ~90% to spikes ranging from ~15% to ~70%, averaging 40-50% over longer time spans. Disk IOPS per 10s interval have dropped from 1.16K-1.6K to 60-700 on each controller.

Attached are some usage graphs, with the CPU graph annotated to show when instances were booted and when metric_processing_delay was adjusted.
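For reference, a sketch of one way to collect comparable per-10s disk numbers on a controller (assumes sysstat is installed; sda is a placeholder for the relevant device):

    # r/s and w/s are read/write requests per second, averaged over each 10s interval
    iostat -x -d sda 10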
Created attachment 1200273 [details] Graphs showing CPU/Memory/Disk/Gnocchi-metricd CPU usage when booting instances & gnocchi status during run
Ok, so as far as I can tell, this is totally normal. The more aggressive you are about processing new points (to achieve some kind of real-time processing), the more CPU is going to be used, since almost no batching is going to happen.

Batching would happen more naturally if you had even more measures coming in.

My suggestion would be to change the default upstream to something higher (e.g. 30s), but that's about all I'd suggest.

Just 2 more questions:
- Are you using 2.2 (I read OSP 10, so I imagine yes)?
- Is your coordinator set to redis? (it should be)
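For reference, a sketch of how to check the coordinator setting on a controller (assumes crudini is available and that the option is coordination_url in the [storage] section of /etc/gnocchi/gnocchi.conf; host and password are placeholders):

    crudini --get /etc/gnocchi/gnocchi.conf storage coordination_url
    # expected output along the lines of:
    # redis://:<password>@<redis-vip>:6379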
(In reply to Julien Danjou from comment #7)
> Ok, so as far as I can tell, this is totally normal. The more aggressive you
> are about processing new points (to achieve some kind of real-time
> processing), the more CPU is going to be used, since almost no batching is
> going to happen.
>
> Batching would happen more naturally if you had even more measures coming in.
>
> My suggestion would be to change the default upstream to something higher
> (e.g. 30s), but that's about all I'd suggest.
>
> Just 2 more questions:
> - Are you using 2.2 (I read OSP 10, so I imagine yes)?

Gnocchi 2.2:

[root@overcloud-controller-0 ~]# rpm -qa | grep gnocchi
openstack-gnocchi-common-2.2.1-0.20160821103314.7a40269.el7ost.noarch
openstack-gnocchi-statsd-2.2.1-0.20160821103314.7a40269.el7ost.noarch
python-gnocchiclient-2.5.1-0.20160819083701.b816e3c.el7ost.noarch
puppet-gnocchi-9.1.0-0.20160815142605.7aa0c60.el7ost.noarch
openstack-gnocchi-carbonara-2.2.1-0.20160821103314.7a40269.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-2.2.1-0.20160821103314.7a40269.el7ost.noarch
openstack-gnocchi-metricd-2.2.1-0.20160821103314.7a40269.el7ost.noarch
python-gnocchi-2.2.1-0.20160821103314.7a40269.el7ost.noarch
openstack-gnocchi-api-2.2.1-0.20160821103314.7a40269.el7ost.noarch

> - Is your coordinator set to redis? (it should be)

Yes, the coordination driver is using the redis backend.
Ok, so everything seems pretty normal to me, except that, yes, maybe metricd is being a bit too slow. We have already worked on that in master, especially with the Ceph backend.

The CPU usage is obviously going to be higher than with MongoDB, as the data are being processed and aggregated by metricd, whereas with MongoDB that was done at query time. So metricd does the work up front, but retrieval time is going to be < 1s every time, whereas with MongoDB it would have been > 20 min per query. ;-)

Is there anything else I can see/check with you, or are we good with that? If you want to try our master branch, that would be interesting too, especially for us, but I can understand if it's too much of a stretch. :)
I just realized that you're running more than 2.2; it's master from the end of August. So that's cool. But could you upgrade to the latest version of Gnocchi from master? That would let you leverage the new storage engine, and we'd love to have the feedback! The new puddle should be available tomorrow with a more recent version.
I sent https://review.openstack.org/#/c/371561/ upstream to increase the default delays, so the CPU usage is less aggressive. That will be included in 3.0.

Alex, to reply to your question, the version 0.20160821103314.7a40269 is:
https://github.com/openstack/gnocchi/commit/7a40269
so it's basically between 2.2 and 3.0. :-)
The higher metricd delay values are now the default in Gnocchi 3.0.
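For reference, the defaults in question correspond to the delays reported later in this bug (60s/120s/300s). A sketch of checking the values actually set on a controller (assumes crudini and the [metricd] section; options not set explicitly in /etc/gnocchi/gnocchi.conf fall back to the compiled-in defaults):

    crudini --get /etc/gnocchi/gnocchi.conf metricd metric_processing_delay   # 60 in 3.0
    crudini --get /etc/gnocchi/gnocchi.conf metricd metric_reporting_delay    # 120
    crudini --get /etc/gnocchi/gnocchi.conf metricd metric_cleanup_delay      # 300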
Alex, we don't have enough hardware to test at anything like the scale you did, so we can't validate this as being really fixed. But you should see much improvement with OSP 10.

Can you check that and confirm, so we can move this bug to VERIFIED?
(In reply to Julien Danjou from comment #14)
> Alex, we don't have enough hardware to test at anything like the scale you
> did, so we can't validate this as being really fixed. But you should see
> much improvement with OSP 10.
>
> Can you check that and confirm, so we can move this bug to VERIFIED?

Certainly, I'll re-test with the latest available bits today.

-Alex
I have re-run the same scenario on hardware with the same specs as before, using the latest passed_phase1 build available at the time I built the undercloud (2016-10-18.2). However, several major performance-impacting factors have changed, such as the Gnocchi version, and the Gnocchi backend storage is now Swift rather than File.

During the 200 instance boot + 1 hour of extra quiesce time:
CPU is less utilized by the time the 200 instances have finished booting; however, the disks are >50% utilized after all 200 instances are booted. This version of Gnocchi includes the longer default delays for metric processing, reporting, and cleanup (now 60s, 120s, and 300s respectively). There is far less Gnocchi CPU usage during this timeframe, as shown in the attached graphs. At this point we should see what cloud system resource utilization looks like once more Swift tunings are applied and land in our downstream product; see this bug for further details:
https://bugzilla.redhat.com/show_bug.cgi?id=1381721

200 instances + several days of quiesce time (the bad part):
Allowing the instances to continue running, we see memory growth that does not appear to stop. Extending the test out to 3 days, we see ceilometer-collector eventually consume all memory, resulting in the OOM killer terminating processes, since there is no swap and the collector keeps up this behavior. This is documented in this upstream bug:
https://bugs.launchpad.net/ceilometer/+bug/1551667
It is unclear to me at this time whether the patches for this bug have landed downstream yet.

Gnocchi versions:
openstack-gnocchi-carbonara-3.0.1-0.20161006151137.a852145.el7ost.noarch
python-gnocchiclient-2.6.0-1.el7ost.noarch
openstack-gnocchi-common-3.0.1-0.20161006151137.a852145.el7ost.noarch
openstack-gnocchi-metricd-3.0.1-0.20161006151137.a852145.el7ost.noarch
python-gnocchi-3.0.1-0.20161006151137.a852145.el7ost.noarch
openstack-gnocchi-api-3.0.1-0.20161006151137.a852145.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.0.1-0.20161006151137.a852145.el7ost.noarch
openstack-gnocchi-statsd-3.0.1-0.20161006151137.a852145.el7ost.noarch
puppet-gnocchi-9.4.0-1.el7ost.noarch

I am unsure whether we should close this bug on the basis that Gnocchi is using far less CPU than before, since that could simply be a product of the major backend storage driver change (Swift vs File) rather than of the newer defaults and latest code. Additionally, the cloud will eventually fail anyway, with ceilometer-collector consuming all memory.
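For reference, a sketch of how one might track the ceilometer-collector memory growth described above over several days (the log path and 10-minute sampling interval are arbitrary placeholders):

    while true; do
        date >> /var/tmp/collector-rss.log
        # RSS is in kB; the [c] pattern keeps grep from matching itself
        ps -eo pid,rss,etime,cmd | grep '[c]eilometer-collector' >> /var/tmp/collector-rss.log
        sleep 600
    done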
Created attachment 1213432 [details] Gnocchi-with-Swift-2016-10.18.2
Hi Alex,

Thanks a lot for your report, that is so helpful. With my developer hat on, I can vouch that most of the CPU usage improvement you saw is due to changes we made upstream and to the higher delays between runs.

The Swift issues, as you noticed, are tracked in another report. We also track the Ceilometer memory usage issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=1336664

Since this particular issue is being fixed in this bug, I suggest we follow up on the two other issues separately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html