Bug 1047872 - [ceilometer] mongo aggregation pipeline for resource retrieval fails with excessive memory use
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 4.0
Assignee: Eoghan Glynn
QA Contact: Kevin Whitney
URL:
Whiteboard:
Duplicates: 1065420
Depends On:
Blocks:
Reported: 2014-01-02 12:23 UTC by Eoghan Glynn
Modified: 2018-12-04 16:48 UTC

Fixed In Version: openstack-ceilometer-2013.2.1-2.el6ost
Doc Type: Bug Fix
Doc Text:
In the Telemetry service, in-memory sorting used by the mongodb aggregation framework when deriving resource listings from the metering-store caused access to resources to fail with excessive memory use. This was fixed by constructing the resource list via map-reduce instead of an aggregation pipeline. Now the metering-store size is no longer bounded by the size of available memory.
Clone Of:
Environment:
Last Closed: 2014-01-23 14:22:48 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1262571 | 0 | None | None | None | Never
Launchpad | 1267162 | 0 | None | None | None | Never
OpenStack gerrit | 65671 | 0 | None | None | None | Never
OpenStack gerrit | 65947 | 0 | None | None | None | Never
OpenStack gerrit | 65962 | 0 | None | None | None | Never
OpenStack gerrit | 66861 | 0 | None | None | None | Never
Red Hat Product Errata | RHBA-2014:0046 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise Linux OpenStack Platform 4 Bug Fix and Enhancement Advisory | 2014-01-23 00:51:59 UTC

Description Eoghan Glynn 2014-01-02 12:23:54 UTC
Description of problem:

The mongodb storage driver currently uses an aggregation pipeline over the meter collection in order to construct a list of resources adorned with first & last sample timestamps etc.

The problem with this approach is that the mongodb aggregation framework performs sorting in-memory, in this case operating over a potentially very large collection (particularly if the GET /v2/resources request was not constrained with query params, e.g. to limit it to a single tenant).

It turns out the mongodb innards are hardcoded to abort any sorts in an aggregation pipeline that will consume more than 10% of physical memory.
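
For illustration only: the per-resource summary that the pipeline computes (first and last sample timestamps, newest metadata, set of meters) does not inherently need a global sort. The sketch below is not the ceilometer driver code and not the map-reduce job from the fix; it is a pure-Python stand-in, with a hypothetical function name and plain dicts in place of meter documents, showing the same fields derived in a single pass with memory proportional to the number of resources rather than the number of samples.

```python
def fold_samples(samples):
    """Summarise samples per resource in one pass, without sorting.

    `samples` is any iterable of dicts shaped like meter documents.
    Timestamps may be any mutually comparable values.
    """
    resources = {}
    for s in samples:
        r = resources.get(s["resource_id"])
        if r is None:
            r = resources[s["resource_id"]] = {
                "project_id": s["project_id"],
                "user_id": s["user_id"],
                "source": s["source"],
                "first_sample_timestamp": s["timestamp"],
                "last_sample_timestamp": s["timestamp"],
                "metadata": s["resource_metadata"],
                "meters": set(),
            }
        # Track min/max incrementally instead of sorting the collection.
        if s["timestamp"] < r["first_sample_timestamp"]:
            r["first_sample_timestamp"] = s["timestamp"]
        if s["timestamp"] >= r["last_sample_timestamp"]:
            r["last_sample_timestamp"] = s["timestamp"]
            # Keep the metadata of the newest sample seen so far.
            r["metadata"] = s["resource_metadata"]
        # A set keeps one entry per distinct meter, not one per sample.
        r["meters"].add(
            (s["counter_name"], s["counter_type"], s["counter_unit"])
        )
    return resources
```

The point of the sketch is only that min, max and "newest wins" are all foldable, so no step requires the whole collection to be held and ordered in memory.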


Version-Release number of selected component (if applicable):

mongodb-server-2.4.6-1.el6.x86_64
openstack-ceilometer-api-2013.2-4.el6ost.noarch
openstack-ceilometer-central-2013.2-4.el6ost.noarch
openstack-ceilometer-collector-2013.2-4.el6ost.noarch
openstack-ceilometer-common-2013.2-4.el6ost.noarch


How reproducible:

100% if the meter collection is sufficiently large.


Steps to Reproduce:

1. Allow meter collection to grow to at least X elements (actual value of X to be filled in by gilles, who has observed this issue in production with the new internal lab).

Note that the meter collection size can be retrieved via:

  $ mongo ceilometer
  > db.meter.count()


2. Attempt to list resources with an unconstrained query:

  $ ceilometer resource-list


Actual results:

The resource listing fails:

  $ ceilometer resource-list
  WARNING (http:172) Request returned failure status.
  HTTPInternalServerError (HTTP 500)

with an error similar to the following observed in the API logfile /var/log/ceilometer/api.log:

2013-12-17 03:56:57.516 21917 ERROR wsme.api [-] Server-side error:
"command SON([('aggregate', u'meter'), ('pipeline', [{'$match': {}},
{'$sort': {'timestamp': -1, 'project_id': -1, 'user_id': -1}},
{'$group': {'meters_unit': {'$push': '$counter_unit'}, 'source':
{'$first': '$source'}, 'project_id': {'$first': '$project_id'},
'user_id': {'$first': '$user_id'}, 'last_sample_timestamp': {'$max':
'$timestamp'}, 'meters_name': {'$push': '$counter_name'},
'first_sample_timestamp': {'$min': '$timestamp'}, 'meters_type':
{'$push': '$counter_type'}, '_id': '$resource_id', 'metadata':
{'$first': '$resource_metadata'}}}])]) failed: exception: terminating
request:  request heap use exceeded 10% of physical RAM".
Detail:<TRUNCATED>


Expected results:

The resource list should display all known resources.


Additional info:

The issue can be worked around by partitioning the resource query per-tenant, e.g.:

  for project in $(keystone tenant-list | awk '/ True / {print $2}')
  do
    ceilometer resource-list -q project=$project | grep -vE '(\+-|Resource ID)'
  done
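
The same partitioning idea can be sketched in Python for anyone scripting against the API rather than the CLI. This is a rough sketch under assumptions: list_resources is a hypothetical callable supplied by the caller (for example a thin wrapper around whatever client is in use) that returns the resources for one project as dicts with a "resource_id" key; nothing here is taken from the ceilometer client itself.

```python
def list_resources_per_project(project_ids, list_resources):
    """Issue one constrained query per project and merge the results.

    `list_resources` is a hypothetical callable: project_id -> iterable
    of resource dicts. Each call stays small enough to avoid the
    unconstrained sort over the whole meter collection.
    """
    merged = {}
    for project_id in project_ids:
        for resource in list_resources(project_id):
            # Key on resource_id so a resource is reported only once.
            merged[resource["resource_id"]] = resource
    return list(merged.values())
```

As with the shell loop, this only helps while each individual tenant's share of the meter collection stays under the limit.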

Comment 1 Eoghan Glynn 2014-01-10 15:36:52 UTC
Fix part 1 proposed on master upstream:

  https://review.openstack.org/65671

and duly landed:

  https://github.com/openstack/ceilometer/commit/7c4c0e8f

Backport proposed on stable/havana upstream:

  https://review.openstack.org/65947

Comment 2 Eoghan Glynn 2014-01-10 15:38:07 UTC
Fix part 1 proposed on master upstream:

  https://review.openstack.org/65962

Comment 3 Eoghan Glynn 2014-01-10 16:29:06 UTC
Typo in Comment 2 above:

  s/part 1/part 2/

Comment 4 Eoghan Glynn 2014-01-10 16:30:32 UTC
Internal backports for both fixes:

  https://code.engineering.redhat.com/gerrit/18265
  https://code.engineering.redhat.com/gerrit/18266

Comment 5 Eoghan Glynn 2014-01-10 16:46:14 UTC
Internal backports have landed.

Comment 7 Eoghan Glynn 2014-01-15 14:48:22 UTC
Fix part 1 landed on stable/havana upstream:

  https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=ef71dc6a11

Fix part 2 landed on master upstream:

  https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=ba6641af

Comment 8 Eoghan Glynn 2014-01-15 14:52:07 UTC
Backport of fix part 2 proposed to stable/havana upstream:

   https://review.openstack.org/66861

Comment 9 Kevin Whitney 2014-01-15 20:06:25 UTC
Verified

1) Create a meter (dummy) and resource (resdummy)
2) Add 15 samples to meter dummy

mongo ceilometer

db.resource.find({ "meter.counter_name": "dummy", "_id" : "resdummy"}).count()
1

db.meter.find({"counter_name":"dummy"}).count()
15

Examine the resource document and verify it does not contain an entry for each data sample.

db.resource.find({"meter.counter_name": "dummy", "_id" : "resdummy"})
{ "_id" : "res45000", "metadata" : { }, "meter" : [ 	{ 	"counter_name" : "dummy", 	"counter_unit" : "something", 	"counter_type" : "cumulative" } ], "project_id" : "e97a90c759f64dfaadf319cf08cb1ab2", "source" : "e97a90c759f64dfaadf319cf08cb1ab2:openstack", "user_id" : "e1aa40339c9d45a582b4a13640ae3eab" }

3) create 25,000 resources 
 ceilometer resource-list 
    <works as expected>
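
The manual check of the resource document in step 2 could be automated along these lines. This is a hedged sketch, not part of the verification that was actually run: the function name is made up, and it assumes the document has the shape shown in the find() output above, with a "meter" list of counter_name/counter_type/counter_unit entries.

```python
def meter_entries_are_unique(resource_doc):
    """Return True if each distinct meter appears at most once.

    A resource document that grew one "meter" entry per sample would
    contain repeated (name, type, unit) triples and fail this check.
    """
    keys = [
        (m["counter_name"], m["counter_type"], m["counter_unit"])
        for m in resource_doc.get("meter", [])
    ]
    return len(keys) == len(set(keys))
```

With 15 samples stored for meter dummy, a passing result together with a "meter" list of length 1 corresponds to the outcome described in step 2.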

Comment 12 Lon Hohberger 2014-02-04 17:19:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2014-0046.html

Comment 13 Eoghan Glynn 2014-02-17 19:23:16 UTC
*** Bug 1065420 has been marked as a duplicate of this bug. ***

