Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1047872

Summary: [ceilometer] mongo aggregation pipeline for resource retrieval fails with excessive memory use
Product: Red Hat OpenStack
Reporter: Eoghan Glynn <eglynn>
Component: openstack-ceilometer
Assignee: Eoghan Glynn <eglynn>
Status: CLOSED ERRATA
QA Contact: Kevin Whitney <kwhitney>
Severity: high
Docs Contact:
Priority: high
Version: 4.0
CC: ajeain, breeler, dsulliva, jruzicka, pbrady, srevivo, yeylon
Target Milestone: z1
Keywords: ZStream
Target Release: 4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-ceilometer-2013.2.1-2.el6ost
Doc Type: Bug Fix
Doc Text:
In the Telemetry service, in-memory sorting used by the mongodb aggregation framework when deriving resource listings from the metering-store caused access to resources to fail with excessive memory use. This was fixed by constructing the resource list via map-reduce instead of an aggregation pipeline. Now the metering-store size is no longer bounded by the size of available memory.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-23 14:22:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Eoghan Glynn 2014-01-02 12:23:54 UTC
Description of problem:

The mongodb storage driver currently uses an aggregation pipeline over the meter collection in order to construct a list of resources adorned with first & last sample timestamps etc.

The problem with this approach is that the mongodb aggregation framework performs sorting in-memory, in this case over a potentially very large collection (particularly if the GET /v2/resources request was not constrained with query params, e.g. to limit it to a single tenant).

It turns out the mongodb innards are hardcoded to abort any sorts in an aggregation pipeline that will consume more than 10% of physical memory.
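
The Doc Text above notes that the fix builds the resource list via map-reduce rather than a sorted aggregation. The following is a minimal pure-Python sketch of why that shape of computation does not need a sort: each sample is mapped to a partial summary keyed by resource, and partial summaries are merged with min/max and set union, which give the same result in any order. This is an illustration only, not the ceilometer driver code; the function names and the summary layout are hypothetical.

```python
def map_sample(sample):
    """Map one meter sample to a (resource_id, partial summary) pair."""
    ts = sample["timestamp"]
    meter = (sample["counter_name"], sample["counter_type"], sample["counter_unit"])
    summary = {
        "first_sample_timestamp": ts,
        "last_sample_timestamp": ts,
        "meters": {meter},
    }
    return sample["resource_id"], summary


def reduce_summaries(a, b):
    """Merge two partial summaries for the same resource (order-independent)."""
    return {
        "first_sample_timestamp": min(a["first_sample_timestamp"],
                                      b["first_sample_timestamp"]),
        "last_sample_timestamp": max(a["last_sample_timestamp"],
                                     b["last_sample_timestamp"]),
        "meters": a["meters"] | b["meters"],
    }


def summarize_resources(samples):
    """Single pass over the samples; no sort is required at any point."""
    resources = {}
    for sample in samples:
        resource_id, partial = map_sample(sample)
        if resource_id in resources:
            resources[resource_id] = reduce_summaries(resources[resource_id], partial)
        else:
            resources[resource_id] = partial
    return resources
```

In this sketch the working state holds one summary per distinct resource (plus its distinct meters), not one entry per sample.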


Version-Release number of selected component (if applicable):

mongodb-server-2.4.6-1.el6.x86_64
openstack-ceilometer-api-2013.2-4.el6ost.noarch
openstack-ceilometer-central-2013.2-4.el6ost.noarch
openstack-ceilometer-collector-2013.2-4.el6ost.noarch
openstack-ceilometer-common-2013.2-4.el6ost.noarch


How reproducible:

100% if the meter collection is sufficiently large.
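
As a rough aid for reasoning about what "sufficiently large" might mean, the sketch below turns the 10% limit into a back-of-the-envelope document count. It rests on a loud assumption: that the sort needs roughly one average-sized document of heap per matched sample. MongoDB's real heap accounting is not described in this report and will differ, so treat the result as an order-of-magnitude guide only. The function and parameter names are hypothetical.

```python
def max_docs_before_abort(physical_ram_bytes, avg_doc_bytes):
    """Estimate how many documents fit under 10% of physical RAM.

    ASSUMPTION: heap use of the sort ~= number of documents * average size.
    """
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    budget_bytes = physical_ram_bytes // 10  # the 10% limit from the error text
    return budget_bytes // avg_doc_bytes
```

The average document size would have to be measured on the actual meter collection; no value for it is given here.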


Steps to Reproduce:

1. Allow the meter collection to grow to at least X elements (actual value of X to be filled in by Gilles, who has observed this issue in production with the new internal lab).

Note that the meter collection size can be retrieved via:

  $ mongo ceilometer
  > db.meter.count()


2. Attempt to list resources with an unconstrained query:

  $ ceilometer resource-list


Actual results:

The resource listing fails:

  $ ceilometer resource-list
  WARNING (http:172) Request returned failure status.
  HTTPInternalServerError (HTTP 500)

with an error similar to the following observed in the API logfile /var/log/ceilometer/api.log:

2013-12-17 03:56:57.516 21917 ERROR wsme.api [-] Server-side error:
"command SON([('aggregate', u'meter'), ('pipeline', [{'$match': {}},
{'$sort': {'timestamp': -1, 'project_id': -1, 'user_id': -1}},
{'$group': {'meters_unit': {'$push': '$counter_unit'}, 'source':
{'$first': '$source'}, 'project_id': {'$first': '$project_id'},
'user_id': {'$first': '$user_id'}, 'last_sample_timestamp': {'$max':
'$timestamp'}, 'meters_name': {'$push': '$counter_name'},
'first_sample_timestamp': {'$min': '$timestamp'}, 'meters_type':
{'$push': '$counter_type'}, '_id': '$resource_id', 'metadata':
{'$first': '$resource_metadata'}}}])]) failed: exception: terminating
request:  request heap use exceeded 10% of physical RAM".
Detail:<TRUNCATED>
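
For readability, the pipeline from the logged command can be written out as a plain Python structure. The stages below are copied from the error message above; the helper name and its optional project_id parameter are hypothetical, and filtering on project_id in the $match stage is an assumption about how a per-tenant query would narrow the input, not something the log confirms.

```python
def build_resource_pipeline(project_id=None):
    """Rebuild the aggregation pipeline seen in the API log.

    With project_id=None the $match stage is empty, so every sample in the
    meter collection reaches the $sort stage.
    """
    match = {} if project_id is None else {"project_id": project_id}
    return [
        {"$match": match},
        {"$sort": {"timestamp": -1, "project_id": -1, "user_id": -1}},
        {"$group": {
            "_id": "$resource_id",
            "source": {"$first": "$source"},
            "project_id": {"$first": "$project_id"},
            "user_id": {"$first": "$user_id"},
            "metadata": {"$first": "$resource_metadata"},
            "first_sample_timestamp": {"$min": "$timestamp"},
            "last_sample_timestamp": {"$max": "$timestamp"},
            # $push appends one value per input sample
            "meters_name": {"$push": "$counter_name"},
            "meters_type": {"$push": "$counter_type"},
            "meters_unit": {"$push": "$counter_unit"},
        }},
    ]
```

Written this way, it is easier to see that the unconstrained case sorts the whole collection before grouping, and that the three $push accumulators collect one value per sample rather than one per distinct meter.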


Expected results:

The resource list should display all known resources.


Additional info:

The issue can be worked around by partitioning the resource query per-tenant, e.g.:

  for project in $(keystone tenant-list | awk '/ True / {print $2}')
  do
    ceilometer resource-list -q project=$project | grep -vE '(\+-|Resource ID)'
  done

Comment 1 Eoghan Glynn 2014-01-10 15:36:52 UTC
Fix part 1 proposed on master upstream:

  https://review.openstack.org/65671

and duly landed:

  https://github.com/openstack/ceilometer/commit/7c4c0e8f

Backport proposed on stable/havana upstream:

  https://review.openstack.org/65947

Comment 2 Eoghan Glynn 2014-01-10 15:38:07 UTC
Fix part 1 proposed on master upstream:

  https://review.openstack.org/65962

Comment 3 Eoghan Glynn 2014-01-10 16:29:06 UTC
Typo in Comment 2 above:

  s/part 1/part 2/

Comment 4 Eoghan Glynn 2014-01-10 16:30:32 UTC
Internal backports for both fixes:

  https://code.engineering.redhat.com/gerrit/18265
  https://code.engineering.redhat.com/gerrit/18266

Comment 5 Eoghan Glynn 2014-01-10 16:46:14 UTC
Internal backports have landed.

Comment 7 Eoghan Glynn 2014-01-15 14:48:22 UTC
Fix part 1 landed on stable/havana upstream:

  https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=ef71dc6a11

Fix part 2 landed on master upstream:

  https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=ba6641af

Comment 8 Eoghan Glynn 2014-01-15 14:52:07 UTC
Backport of fix part 2 proposed to stable/havana upstream:

   https://review.openstack.org/66861

Comment 9 Kevin Whitney 2014-01-15 20:06:25 UTC
Verified

1) Create a meter (dummy) and resource (resdummy)
2) Add 15 samples to meter dummy

mongo ceilometer

db.resource.find({ "meter.counter_name": "dummy", "_id" : "resdummy"}).count()
1

db.meter.find({"counter_name":"dummy"}).count()
15

Examine the resource document and verify that it does not contain an entry for each data sample.

db.resource.find({ "meter.counter_name": "dummy", "_id" : "resdummy"})
{ "_id" : "res45000", "metadata" : { }, "meter" : [ { "counter_name" : "dummy", "counter_unit" : "something", "counter_type" : "cumulative" } ], "project_id" : "e97a90c759f64dfaadf319cf08cb1ab2", "source" : "e97a90c759f64dfaadf319cf08cb1ab2:openstack", "user_id" : "e1aa40339c9d45a582b4a13640ae3eab" }
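
The manual check above (one entry per meter rather than one per sample) could be expressed as a small predicate over a resource document that has already been loaded as a Python dict. This is a sketch for illustration; the helper names are hypothetical, and the "meter" array layout is taken from the document shown above.

```python
def meter_entries(resource_doc):
    """Return the (name, type, unit) triple of each entry in the "meter" array."""
    return [
        (m["counter_name"], m["counter_type"], m["counter_unit"])
        for m in resource_doc.get("meter", [])
    ]


def has_one_entry_per_meter(resource_doc):
    """True if no meter appears more than once in the resource document."""
    entries = meter_entries(resource_doc)
    return len(entries) == len(set(entries))
```

For the verification scenario, a resource with 15 samples of a single meter should yield exactly one entry from meter_entries.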

3) create 25,000 resources 
 ceilometer resource-list 
    <works as expected>

Comment 12 Lon Hohberger 2014-02-04 17:19:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2014-0046.html

Comment 13 Eoghan Glynn 2014-02-17 19:23:16 UTC
*** Bug 1065420 has been marked as a duplicate of this bug. ***