Bug 1444541

Summary: Slow Performance with Gnocchi Posting new Measures via Ceilometer Notification Agent into Ceph Storage
Product: Red Hat OpenStack
Reporter: Alex Krzos <akrzos>
Component: openstack-gnocchi
Assignee: Julien Danjou <jdanjou>
Status: CLOSED ERRATA
QA Contact: Sasha Smolyak <ssmolyak>
Severity: high
Priority: high
Version: 11.0 (Ocata)
CC: apevec, ggillies, jdanjou, jschluet, lhh, nchandek, pkilambi, tvignaud
Target Milestone: Upstream M2
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard: scale_lab
Fixed In Version: openstack-gnocchi-3.1.6-1.el7ost
Last Closed: 2017-12-13 21:23:38 UTC
Type: Bug

Description Alex Krzos 2017-04-22 03:44:52 UTC
Description of problem:
Scale testing with the Ceilometer Notification Agent publishing directly to Gnocchi has revealed that spawning threads to create measure objects, each of which also adds a key to the omap object "measure" that tracks the Gnocchi backlog, causes performance issues. The current Gnocchi 3.1 code creates N threads (based on core count) [0] to create the Ceph backlog measures, but every thread writes a key to the same single object. When measuring Gnocchi API request times (Apache log %D), we see POST requests taking more than 1 min, much longer at large scale, and eventually exceeding the httpd timeout (120 s). HTTP gateway timeout errors in the Ceilometer Notification Agent logs are a good indicator that this problem is occurring.
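
One way to spot the symptom is to scan the Apache access log for slow POSTs. The sketch below is illustrative only: it assumes a hypothetical LogFormat of `"%h %l %u %t \"%r\" %>s %b %D"` (adjust the regex to your httpd configuration), and the log lines in the usage are synthetic, not taken from this bug. `%D` is the request service time in microseconds, so it is divided by 1,000,000 to get seconds.

```python
import re

# Hypothetical LogFormat assumed here: "%h %l %u %t \"%r\" %>s %b %D"
# %D = time taken to serve the request, in microseconds.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) (?P<usec>\d+)$'
)

HTTPD_TIMEOUT_S = 120.0  # httpd timeout mentioned in this report


def slow_posts(lines, threshold_s=60.0):
    """Yield (path, seconds, status) for POST requests slower than threshold_s."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("method") != "POST":
            continue  # skip unparsable lines and non-POST requests
        seconds = int(m.group("usec")) / 1_000_000
        if seconds > threshold_s:
            yield m.group("path"), seconds, int(m.group("status"))


if __name__ == "__main__":
    # Synthetic example lines (made up for illustration).
    sample = [
        '10.0.0.5 - - [22/Apr/2017:03:00:00 +0000] "POST /v1/metric/abc/measures HTTP/1.1" 202 - 75000000',
        '10.0.0.5 - - [22/Apr/2017:03:00:01 +0000] "POST /v1/metric/def/measures HTTP/1.1" 202 - 3000000',
    ]
    for path, seconds, status in slow_posts(sample):
        flag = "over httpd timeout" if seconds > HTTPD_TIMEOUT_S else "slow"
        print(f"{path}: {seconds:.1f}s status={status} ({flag})")
```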

By batching the creation of all new measure objects together with the addition of their keys to the omap object, we reduce the request time from 1min-30s or more to about 2-5 s on the same hardware. The example patch [1] reduces the overall time required to move data from the Ceilometer Notification Agent into Gnocchi's Ceph storage for processing.
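
The difference between the two write patterns can be sketched with a toy model. This is not Gnocchi's code or the patch in [1]: `FakeObjectStore` is a hypothetical in-memory stand-in (not the python-rados API) that serializes every call with a lock and counts omap updates, and the `measure_` object-name prefix is made up. It only shows that the threaded pattern issues one omap update on the shared "measure" object per new measure, while the batched pattern issues one in total. With real Ceph the batched step would presumably go through a single librados write operation, but that mapping is an assumption here.

```python
import os
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor


class FakeObjectStore:
    """Hypothetical in-memory stand-in for a Ceph pool (not the rados API).

    All calls are serialized by one lock, loosely mimicking contention
    when many writers target the same "measure" object.
    """

    def __init__(self):
        self.objects = {}
        self.omaps = {}
        self.omap_updates = 0
        self._lock = threading.Lock()

    def write_object(self, name, data):
        with self._lock:
            self.objects[name] = data

    def set_omap(self, obj, keys):
        with self._lock:
            self.omap_updates += 1  # one update (round trip) per call
            self.omaps.setdefault(obj, set()).update(keys)


def store_unbatched(store, measures):
    """Threaded pattern: one worker per core, each adds its own omap key."""
    def _one(payload):
        name = "measure_" + uuid.uuid4().hex  # hypothetical naming scheme
        store.write_object(name, payload)
        store.set_omap("measure", [name])  # every task hits the same object

    with ThreadPoolExecutor(max_workers=os.cpu_count() or 1) as pool:
        list(pool.map(_one, measures))  # consume results to surface errors


def store_batched(store, measures):
    """Batched pattern: write all objects, then add all keys in one update."""
    names = []
    for payload in measures:
        name = "measure_" + uuid.uuid4().hex
        store.write_object(name, payload)
        names.append(name)
    store.set_omap("measure", names)  # single update on the shared object


if __name__ == "__main__":
    payloads = [b"x"] * 1000
    a, b = FakeObjectStore(), FakeObjectStore()
    store_unbatched(a, payloads)
    store_batched(b, payloads)
    print(a.omap_updates, b.omap_updates)  # prints: 1000 1
```

Both functions leave the same set of keys in the "measure" omap; only the number of updates against that single hot object changes.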



Version-Release number of selected component (if applicable):
Ocata Beta (OSP11)
Build 2017-04-06.4

openstack-gnocchi-api-3.1.2-3.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.1.2-3.el7ost.noarch
python-gnocchiclient-3.1.0-1.el7ost.noarch
openstack-gnocchi-common-3.1.2-3.el7ost.noarch
openstack-gnocchi-metricd-3.1.2-3.el7ost.noarch
puppet-gnocchi-10.3.0-2.el7ost.noarch
python-gnocchi-3.1.2-3.el7ost.noarch
openstack-gnocchi-statsd-3.1.2-3.el7ost.noarch


How reproducible:
The issue appears once the deployment reaches a large enough scale.

Additional info:
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1430588

[0] https://github.com/openstack/gnocchi/blob/stable/3.1/gnocchi/rest/__init__.py#L1460-L1463
[1] https://gist.github.com/akrzos/9d841feff51050c12913faf39634383a#file-gistfile1-txt-L1427-L1440

Comment 1 Julien Danjou 2017-06-22 18:33:48 UTC
This has even been backported to OSP 10, so I am marking it as done.

Comment 4 Julien Danjou 2017-11-15 14:50:16 UTC
This is a performance issue that is not verifiable by the standard QE process. What Alex described has been solved in Gnocchi 4 because data writes are now batched. The patch has already been backported to OSP 10 and OSP 11 and shipped to customers.

Comment 8 Julien Danjou 2017-12-04 12:49:23 UTC
This was merged in Gnocchi 3.1.5. I realize that the Fixed In Version field is wrong here.

Comment 16 errata-xmlrpc 2017-12-13 21:23:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462