1478555 – Prevent starved metrics from growing measures unbounded (Ceph and Swift)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-gnocchi
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	z5
Target Release:	10.0 (Newton)
Assignee:	Julien Danjou
QA Contact:	Sasha Smolyak
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1426552 1481327
Depends On:
Blocks:	1389374 1481327 1481328

Reported:	2017-08-04 19:18 UTC by Alex Krzos
Modified:	2021-03-11 15:32 UTC
CC List:	9 users
Fixed In Version:	openstack-gnocchi-3.0.14-1.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1481327 1481328
Environment:
Last Closed:	2017-09-28 16:37:35 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:2824	0	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 Bug Fix and Enhancement Advisory	2017-09-28 20:34:18 UTC

Description Alex Krzos 2017-08-04 19:18:40 UTC

Description of problem:
It is possible to exceed the capacity of a TripleO-deployed cloud's Telemetry service such that some metrics will never be processed and thus have their measures grow unbounded. When this occurs, there are several major failure conditions that will occur depending on the speed of the unbounded growth. The purpose of this bug is to support changing the Gnocchi v3 scheduling of unprocessed data to a "round-robin" function, which would reduce the likelihood of a metric being starved and thus growing unbounded.
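To illustrate why round-robin scheduling reduces starvation, here is a minimal sketch. It is not Gnocchi's scheduler: the function name `round_robin_batches`, the `batch_size` parameter, and the metric names are hypothetical, and it assumes a worker can only process a fixed number of metrics per pass. Each pass resumes where the previous one stopped, so metrics at the end of the list are eventually reached instead of being skipped forever.

```python
from collections import deque


def round_robin_batches(metric_ids, batch_size):
    """Yield batches of metric IDs, each resuming where the last one ended.

    Hypothetical helper for illustration only; not part of Gnocchi.
    The generator is endless as long as metric_ids is non-empty.
    """
    queue = deque(metric_ids)
    while queue:
        count = min(batch_size, len(queue))
        batch = [queue[i] for i in range(count)]
        # Move the metrics just served to the back of the queue.
        queue.rotate(-count)
        yield batch
```

For example, `itertools.islice(round_robin_batches(["a", "b", "c", "d", "e"], 2), 3)` visits all five metrics within three passes, whereas always taking the first two entries of the list would never reach `c`, `d`, or `e`.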

Version-Release number of selected component (if applicable):
Gnocchi/3.0 - Newton
Gnocchi/3.1 - Ocata

How reproducible:
With ~100 instances, a 3-controller cloud with 24 logical cores per machine (thus 6 workers per controller, 18 metricd processing workers in total) will exceed capacity and eventually fail an OSD if the storage backend is Ceph. Long-term failure conditions for Swift are not well known, but it is suspected that Swift will eventually slow down when a single container is overloaded with many small objects.

Additional info:

One potential fix [0] that has been proposed is to sort the Gnocchi backlog from oldest to newest and process the oldest backlog first. This prevents some metrics from being starved and accumulating an ever-growing number of measures, which are stored as 16-byte objects in the Ceph omap and the Ceph metrics pool. It may be possible for Ceph to do the sorting beforehand, though that route needs to be investigated.

[0] https://github.com/gnocchixyz/gnocchi/pull/266
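
The oldest-first ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the code from [0]: the function name `oldest_first` and the shape of `backlog` (a mapping from metric ID to the arrival timestamps of its unprocessed measures) are hypothetical.

```python
def oldest_first(backlog):
    """Return metric IDs ordered by their oldest unprocessed measure.

    backlog: dict mapping a metric ID to a list of arrival timestamps
    (hypothetical structure, for illustration only). Metrics with no
    pending measures are skipped.
    """
    pending = [metric for metric, stamps in backlog.items() if stamps]
    # The metric that has waited longest comes first.
    return sorted(pending, key=lambda metric: min(backlog[metric]))
```

With `{"cpu": [30, 40], "disk": [10, 50], "net": [20], "idle": []}` as input, `disk` is scheduled first because its oldest measure has waited the longest, followed by `net` and then `cpu`.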

Comment 4 Pradeep Kilambi 2017-08-29 14:27:02 UTC

*** Bug 1481327 has been marked as a duplicate of this bug. ***

Comment 5 Pradeep Kilambi 2017-08-30 13:02:18 UTC

Rebased RDO to 3.0.14:

 https://review.rdoproject.org/r/9055

Comment 6 Julien Danjou 2017-09-04 07:30:13 UTC

*** Bug 1426552 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2017-09-28 16:37:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2824
