Bug 1432909 - Ceilometer Collector is a bottleneck for large scale clouds with Telemetry
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 12.0 (Pike)
Assigned To: Julien Danjou
QA Contact: Sasha Smolyak
Whiteboard: scale_lab
Keywords: Triaged, ZStream
Reported: 2017-03-16 07:40 EDT by Alex Krzos
Modified: 2018-02-05 14:04 EST
Fixed In Version: openstack-ceilometer-9.0.2-0.20170925173740.1057885.el7ost
Last Closed: 2017-12-13 16:17:04 EST
Type: Bug

External Trackers
Tracker: Red Hat Product Errata
ID: RHEA-2017:3462
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenStack Platform 12.0 Enhancement Advisory
Last Updated: 2018-02-15 20:43:25 EST

Description Alex Krzos 2017-03-16 07:40:34 EDT
Description of problem:
Recent scale lab testing found that ceilometer-collector is the bottleneck when taking messages off the message bus and posting them to the Gnocchi API, where they are then processed by the Gnocchi metricd daemons.

For the hardware tested, going above 2,500 instances started to show symptoms of ceilometer-collector lagging in processing messages off the queue: typically unbounded memory growth if prefetch is set to 0 (unlimited), or a Gnocchi backlog that never reaches 0 backlogged work, even though metricd can handle the capacity given to it.
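
The symptom above is what any queue shows once messages arrive faster than a consumer drains them. A minimal sketch of that behaviour follows; the function name and all rates are hypothetical and were not measured in the scale lab.

```python
def backlog_over_time(arrival_rate, service_rate, steps, initial=0):
    """Return the queue depth after each step, given constant per-step rates.

    arrival_rate: messages added to the queue per step (hypothetical)
    service_rate: messages the consumer can remove per step (hypothetical)
    """
    depth = initial
    history = []
    for _ in range(steps):
        # The queue cannot go negative: an idle consumer just waits.
        depth = max(0, depth + arrival_rate - service_rate)
        history.append(depth)
    return history


# Consumer slower than producer: the backlog grows every step.
falling_behind = backlog_over_time(arrival_rate=120, service_rate=100, steps=5)

# Consumer faster than producer: an existing backlog drains to zero.
keeping_up = backlog_over_time(arrival_rate=90, service_rate=100, steps=5, initial=50)

print(falling_behind)  # [20, 40, 60, 80, 100]
print(keeping_up)      # [40, 30, 20, 10, 0]
```

With an unlimited prefetch, a growing backlog of this kind would sit in the consumer's memory rather than on the broker, which is consistent with the memory growth described above.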

Version-Release number of selected component (if applicable):
Newton GA (OSP10)

How reproducible:
Always with enough hardware to host instances

Steps to Reproduce:
1. Deploy Cloud with Telemetry Services
2. Deploy many instances in the cloud

Actual results:
Going above 2,500 instances can show "lag" in processed data although Metricd is handling the capacity. 

Expected results:
The deployment scales above 2,500 instances without lag in processed data.

Additional info:
Ceilometer-collector is removed in OSP Pike (OSP12). We need to test agent-notification to see whether this bottleneck has moved further back or whether a new bottleneck exists.

A combination of more ceilometer-collector workers, rabbit_qos_prefetch_count, and executor_thread_pool_size might squeeze more scale out of the setup, but time ran short before these options could be tuned.
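
As a rough sketch of what such a tuning attempt could look like, the snippet below renders a ceilometer.conf override fragment for the three options named above. The section placement ([collector], [oslo_messaging_rabbit], [DEFAULT]) is an assumption based on the usual oslo.config layout, and every value is a placeholder that was not tested in this report.

```python
import configparser
import io


def render_collector_overrides(workers, prefetch_count, thread_pool_size):
    """Render an INI fragment with the three tuning options as a string.

    Section names are assumed, not confirmed by this bug report.
    """
    cfg = configparser.ConfigParser()
    cfg["DEFAULT"] = {"executor_thread_pool_size": str(thread_pool_size)}
    cfg["collector"] = {"workers": str(workers)}
    cfg["oslo_messaging_rabbit"] = {
        "rabbit_qos_prefetch_count": str(prefetch_count),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Placeholder values for illustration only.
overrides = render_collector_overrides(
    workers=8, prefetch_count=32, thread_pool_size=64
)
print(overrides)
```

A bounded, non-zero prefetch keeps unprocessed messages on the broker instead of in collector memory, so it would need to be tuned together with the worker and thread counts rather than in isolation.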
Comment 1 Julien Danjou 2017-09-14 09:22:34 EDT
The collector has been deprecated in OSP12 and is not installed anymore.
Comment 5 Julien Danjou 2017-11-15 10:15:10 EST
Ceilometer collector is no longer deployed or installed in OSP12. The bottleneck is gone.
Comment 8 errata-xmlrpc 2017-12-13 16:17:04 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462