Bug 1481327

Summary: Prevent starved metrics from growing measures unbounded (Ceph and Swift)
Product: Red Hat OpenStack Reporter: Pradeep Kilambi <pkilambi>
Component: openstack-gnocchi Assignee: Pradeep Kilambi <pkilambi>
Status: CLOSED DUPLICATE QA Contact: Sasha Smolyak <ssmolyak>
Severity: urgent Docs Contact:
Priority: high    
Version: 10.0 (Newton) CC: akrzos, apevec, jdanjou, jschluet, jtaleric, lhh, racedoro, ssmolyak
Target Milestone: zstream Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1478555 Environment:
Last Closed: 2017-08-29 14:27:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1478555, 1481328    
Bug Blocks:    

Description Pradeep Kilambi 2017-08-14 15:30:55 UTC
+++ This bug was initially created as a clone of Bug #1478555 +++

Description of problem:
It is possible to exceed the capacity of a TripleO-deployed cloud's Telemetry service such that some metrics are never processed and their measures grow unbounded.  When this occurs, several major failure conditions can follow, depending on the speed of the unbounded growth.  The purpose of this bug is to support changing the Gnocchi v3 scheduling of unprocessed data to a "round-robin" function, which would reduce the likelihood of a metric being starved and thus growing unbounded.
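
To illustrate the starvation described above, here is a toy simulation. It is a simplified model and not Gnocchi's actual scheduler: the function name `simulate`, the metric names, and the assumptions that every metric receives one new measure per round and that processing a metric drains all of its pending measures are hypothetical, chosen only for illustration.

```python
from collections import deque
from typing import Dict, List


def simulate(metrics: List[str], capacity: int, rounds: int,
             round_robin: bool) -> Dict[str, int]:
    """Return the number of pending measures per metric after `rounds` rounds.

    Toy model (an assumption, not Gnocchi code): `capacity` is how many
    metrics the workers can process per round, which is less than the
    number of metrics when the service is over capacity.
    """
    pending = {m: 0 for m in metrics}
    queue = deque(metrics)
    for _ in range(rounds):
        # Every metric receives one new unprocessed measure per round.
        for m in metrics:
            pending[m] += 1
        # Workers process the metrics at the head of the queue.
        for _ in range(capacity):
            m = queue[0]
            pending[m] = 0      # processing drains the metric's backlog
            queue.rotate(-1)    # move the processed metric to the tail
        if not round_robin:
            # Fixed order: start from the same head again every round,
            # so metrics beyond `capacity` are never reached.
            queue = deque(metrics)
    return pending


metrics = ["a", "b", "c", "d"]
fixed = simulate(metrics, capacity=2, rounds=6, round_robin=False)
rotating = simulate(metrics, capacity=2, rounds=6, round_robin=True)
print("fixed order:", fixed)
print("round-robin:", rotating)
```

In this model the fixed order leaves the backlog of the unreached metrics growing by one measure every round, while the rotating order keeps every metric's backlog bounded.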

Version-Release number of selected component (if applicable):
Gnocchi/3.0 - Newton 
Gnocchi/3.1 - Ocata

How reproducible:
With ~100 instances, a 3-controller cloud with 24 logical cores per machine (thus 6 workers per controller, 18 metricd processing workers in total) will exceed capacity and, if Ceph is the storage backend, eventually fail an OSD.  Long-term failure conditions with Swift are not well known, but it is suspected that Swift will eventually slow down when a single container is overloaded with many small objects.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

One potential fix [0] that has been proposed is to sort the Gnocchi backlog from oldest to newest and process the oldest backlog first.  This prevents some metrics from being starved and accumulating measures as 16-byte objects in the Ceph omap and Ceph metrics pool.  It may be possible for Ceph to do the sorting beforehand, though that route needs to be investigated.


[0] https://github.com/gnocchixyz/gnocchi/pull/266
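
The oldest-first idea can be sketched roughly as follows. This is a minimal illustration and not the code from the pull request: the function name `pick_oldest_first`, the in-memory `backlog` mapping of metric id to unprocessed measure timestamps, and the `batch_size` parameter are all assumptions made for the example.

```python
import heapq
from typing import Dict, List


def pick_oldest_first(backlog: Dict[str, List[float]],
                      batch_size: int) -> List[str]:
    """Return up to `batch_size` metric ids, oldest backlog first.

    `backlog` maps a metric id to the timestamps of its unprocessed
    measures (a hypothetical in-memory stand-in for the real storage).
    """
    # Pair each metric that has a backlog with its oldest timestamp.
    candidates = [(min(stamps), metric_id)
                  for metric_id, stamps in backlog.items() if stamps]
    # Take the metrics whose oldest unprocessed measure is the oldest.
    return [metric_id
            for _, metric_id in heapq.nsmallest(batch_size, candidates)]


backlog = {
    "cpu": [30.0, 40.0],
    "disk": [5.0, 50.0],
    "net": [],
    "mem": [10.0],
}
selected = pick_oldest_first(backlog, batch_size=2)
print(selected)
```

Because a metric's oldest timestamp only gets older while it waits, a metric that is skipped in one pass moves toward the front of the next one, which is what keeps it from being starved indefinitely.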

Comment 1 Pradeep Kilambi 2017-08-29 14:27:02 UTC

*** This bug has been marked as a duplicate of bug 1478555 ***