Bug 1444541

Summary:	Slow Performance with Gnocchi Posting new Measures via Ceilometer Notification Agent into Ceph Storage
Product:	Red Hat OpenStack	Reporter:	Alex Krzos <akrzos>
Component:	openstack-gnocchi	Assignee:	Julien Danjou <jdanjou>
Status:	CLOSED ERRATA	QA Contact:	Sasha Smolyak <ssmolyak>
Severity:	high	Docs Contact:
Priority:	high
Version:	11.0 (Ocata)	CC:	apevec, ggillies, jdanjou, jschluet, lhh, nchandek, pkilambi, tvignaud
Target Milestone:	Upstream M2	Keywords:	Triaged
Target Release:	12.0 (Pike)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	scale_lab
Fixed In Version:	openstack-gnocchi-3.1.6-1.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-12-13 21:23:38 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Alex Krzos 2017-04-22 03:44:52 UTC

Description of problem:
Scale testing with the Ceilometer Notification Agent publishing directly to Gnocchi has revealed a performance problem. Gnocchi spawns threads to create measure objects and to add a key to the omap object "measure" for the Gnocchi backlog, and this causes performance issues. The current code for Gnocchi 3.1 creates N threads (based on core count) [0] to create the Ceph backlog measures, but all of them write a key to a single object. When measuring Gnocchi API request times (Apache log %D), we can see POST requests taking more than 1 min. At large scale they take much longer and eventually exceed the httpd timeout (120 s). HTTP gateway timeout error messages in the Ceilometer Notification Agent logs are a good indicator that this problem is occurring.
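The following is a minimal sketch for spotting this symptom in the Gnocchi httpd access log. It assumes a LogFormat of '%h %l %u %t "%r" %>s %b %D', with %D (time taken to serve the request, in microseconds) as the last field. That format and the 60 s threshold are assumptions, so adjust the regex to your actual LogFormat.

```python
import re
from typing import Iterable, List, Tuple

# Assumed LogFormat: '%h %l %u %t "%r" %>s %b %D'
# %D is the time taken to serve the request, in microseconds.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ (?P<usec>\d+)$'
)

HTTPD_TIMEOUT_S = 120  # httpd timeout mentioned in this report


def slow_posts(lines: Iterable[str], threshold_s: float = 60.0) -> List[Tuple[str, float, bool]]:
    """Return (path, seconds, reached_timeout) for POSTs slower than threshold_s."""
    results = []
    for line in lines:
        m = LINE_RE.search(line.rstrip())
        if not m or m.group("method") != "POST":
            continue  # unparseable line or not a POST
        seconds = int(m.group("usec")) / 1_000_000  # microseconds -> seconds
        if seconds > threshold_s:
            results.append((m.group("path"), seconds, seconds >= HTTPD_TIMEOUT_S))
    return results
```

Feed it the access log lines, for example `slow_posts(open(path))`. Any entry whose third element is True reached the httpd timeout, which is when the Notification Agent starts logging gateway timeout errors.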

Batching the creation of all new measure objects, together with the addition of their keys to the omap object, reduces the request time from 30 s to over 1 min (or longer) down to ~2-5 s on the same hardware. The example patch [1] reduces the overall time required to move data from the Ceilometer Notification Agent into Gnocchi's Ceph storage for processing.
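The sketch below contrasts the two write patterns described above. It does not use the real Ceph/rados client, and it does not reproduce the patch in [1]. Instead, a hypothetical in-memory `FakeBacklog` stands in for the backlog: it holds the measure objects plus one shared omap object, and it counts how many write operations hit that shared object. The per-metric path mirrors the threaded 3.1 behaviour, where each thread adds its own key. The batched path adds all keys in one operation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class FakeBacklog:
    """Hypothetical stand-in for the Ceph backlog: measure objects + one shared omap object."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.omap: Dict[str, bytes] = {}
        self.omap_ops = 0  # number of write operations against the single omap object
        self._lock = threading.Lock()  # models serialised updates to that one object

    def write_objects(self, objs: Dict[str, bytes]) -> None:
        with self._lock:
            self.objects.update(objs)

    def omap_set(self, keys: List[str]) -> None:
        # Each call is one write op on the same object, so concurrent callers queue here.
        with self._lock:
            self.omap_ops += 1
            for key in keys:
                self.omap[key] = b""


def store_per_metric(backlog: FakeBacklog, measures: Dict[str, bytes], workers: int = 4) -> None:
    """Threaded pattern: one object write and one omap write per measure."""
    def _one(item):
        name, data = item
        backlog.write_objects({name: data})
        backlog.omap_set([name])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(_one, measures.items()))


def store_batched(backlog: FakeBacklog, measures: Dict[str, bytes]) -> None:
    """Batched pattern: write all objects, then add every key in a single omap write."""
    backlog.write_objects(dict(measures))
    backlog.omap_set(list(measures))
```

With 100 new measures, the per-metric pattern issues 100 writes against the shared omap object, while the batched pattern issues one. Both end with the same keys. Reducing contention on that single object is the reason batching helps.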

Version-Release number of selected component (if applicable):
Ocata Beta (OSP11)
Build 2017-04-06.4

openstack-gnocchi-api-3.1.2-3.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.1.2-3.el7ost.noarch
python-gnocchiclient-3.1.0-1.el7ost.noarch
openstack-gnocchi-common-3.1.2-3.el7ost.noarch
openstack-gnocchi-metricd-3.1.2-3.el7ost.noarch
puppet-gnocchi-10.3.0-2.el7ost.noarch
python-gnocchi-3.1.2-3.el7ost.noarch
openstack-gnocchi-statsd-3.1.2-3.el7ost.noarch

How reproducible:
At a large enough scale, you will see this issue.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1430588

[0] https://github.com/openstack/gnocchi/blob/stable/3.1/gnocchi/rest/__init__.py#L1460-L1463
[1] https://gist.github.com/akrzos/9d841feff51050c12913faf39634383a#file-gistfile1-txt-L1427-L1440

Comment 1 Julien Danjou 2017-06-22 18:33:48 UTC

This has even been backported to OSP 10, so marking as done.

Comment 4 Julien Danjou 2017-11-15 14:50:16 UTC

This is a performance issue that is not verifiable by the standard QE process. What Alex described has been solved in Gnocchi 4 because data writes are now batched. The patch has already been backported to OSP 10 and OSP 11 and shipped to customers.

Comment 8 Julien Danjou 2017-12-04 12:49:23 UTC

This has been merged in Gnocchi 3.1.5. I realize that the Fixed in Version is wrong here.

Comment 16 errata-xmlrpc 2017-12-13 21:23:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462