Bug 1252486

Summary:	Ceilometer taking over 20 minutes to return queries
Product:	Red Hat OpenStack	Reporter:	Chris Henderson <chenders>
Component:	openstack-ceilometer	Assignee:	Pradeep Kilambi <pkilambi>
Status:	CLOSED NOTABUG	QA Contact:	Yurii Prokulevych <yprokule>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	5.0 (RHEL 7)	CC:	jruzicka, pkilambi, yeylon
Target Milestone:	---
Target Release:	8.0 (Liberty)
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-01-04 17:31:34 UTC	Type:	Bug

Description Chris Henderson 2015-08-11 14:25:08 UTC

Customer environment:
=====================
Cisco UCS B200 blades for controllers and computes
Cisco UCS C240 rack servers for Ceph storage
RHEL 7.1
RHEL-OSP5 A4 (openstack-ceilometer-api-2014.1.4-1.el7ost.noarch)
RH CEPH 1.3

Symptoms:
=========
ceilometer commands (see below) slow down quickly, after only about 1-2 days of gathering metering samples.
The customer's DEV OpenStack environment is quite active, with ~1000 instances across 30 nova compute nodes.
The slowness gets worse as we approach the TTL expiration (set to 5 days).  Targeted queries (-q resource=<UUID>) work acceptably at first, but also become sluggish as more samples are gathered.
We checked disk I/O on the storage behind /var/lib/mongodb, but it does not appear to be the source of the problem.  We initially used Ceph RBD, then moved back to local disk (300GB SAS RAID1), but the slowness has persisted.

# time ceilometer resource-list|wc -l
5302

real        0m12.273s
user        0m2.075s
sys        0m0.194s

# time ceilometer meter-list|wc -l
40392

real        0m25.125s
user        0m15.027s
sys        0m0.398s

# time ceilometer sample-list -m volume|wc -l
2132

real        0m25.073s
user        0m0.979s
sys        0m0.096s


# time ceilometer sample-list -m image|wc -l
38055

real        1m7.497s
user        0m13.377s
sys        0m0.706s

# time ceilometer sample-list -m instance|wc -l
Error communicating with http://10.63.168.100:8777 timed out
0

real        10m0.583s
user        0m0.271s
sys        0m0.056s
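
As a rough reading of the timings above, subtracting client CPU time (user + sys) from wall-clock time (real) gives the time each command spent waiting rather than computing locally. The sketch below uses only the numbers reported above. Treating that remainder as time spent waiting on ceilometer-api and mongod is an assumption, and the function names are illustrative.

```python
# Figures copied from the `time` output above:
# (lines counted by wc -l, real, user, sys), times in seconds.
timings = {
    "resource-list": (5302, 12.273, 2.075, 0.194),
    "meter-list": (40392, 25.125, 15.027, 0.398),
    "sample-list -m volume": (2132, 25.073, 0.979, 0.096),
    "sample-list -m image": (38055, 67.497, 13.377, 0.706),
    "sample-list -m instance": (0, 600.583, 0.271, 0.056),  # timed out
}


def wait_seconds(real_s, user_s, sys_s):
    """Wall-clock time not accounted for by client-side CPU.

    Assumption: this remainder is mostly time spent waiting on the API.
    """
    return real_s - user_s - sys_s


def wait_fraction(real_s, user_s, sys_s):
    """Share of wall-clock time spent waiting (0.0 to 1.0)."""
    return wait_seconds(real_s, user_s, sys_s) / real_s


for name, (lines, real_s, user_s, sys_s) in timings.items():
    waited = wait_seconds(real_s, user_s, sys_s)
    share = wait_fraction(real_s, user_s, sys_s)
    print(f"{name:24s} lines={lines:6d} wait={waited:8.3f}s ({share:.0%} of real)")
```

Read this way, meter-list spends a large part of its run time in the client itself, while the sample-list calls are dominated by waiting.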

# from ceilometer db
> db.meter.find().count()
15582761
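
For scale, the sample count above can be turned into an approximate ingest rate. The sketch below assumes, hypothetically, that the meter collection had already reached steady state under the 5-day TTL when the count was taken. The report does not say how long samples had been accumulating, so treat the result as an order-of-magnitude figure only.

```python
SAMPLE_COUNT = 15_582_761  # db.meter.find().count() from the report
TTL_DAYS = 5               # configured TTL from the report

SECONDS_PER_DAY = 24 * 3600


def samples_per_day(count, days):
    """Average samples stored per day, assuming `count` covers exactly `days` days."""
    return count / days


def samples_per_second(count, days):
    """Average samples stored per second under the same steady-state assumption."""
    return count / (days * SECONDS_PER_DAY)


per_day = samples_per_day(SAMPLE_COUNT, TTL_DAYS)
per_second = samples_per_second(SAMPLE_COUNT, TTL_DAYS)
print(f"~{per_day:,.0f} samples/day, ~{per_second:.1f} samples/s")
```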


NOTE:  Open-ended queries on the 'instance' meter consistently hang and eventually time out after 10 minutes. During these queries, the ceilometer-api and mongod processes spike on CPU.
The issue also seems to occur only in the customer's DEV environment, where activity is high.  The other environments (PROD, TEST) have much less activity and do not exhibit this slowness.
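
Since the open-ended queries on the 'instance' meter are the ones that time out, one way to narrow the problem down is to compare them against time-bounded queries. The sketch below only builds the command line and does not run it. It assumes the installed python-ceilometerclient accepts `timestamp` filters in `-q` and a `--limit` option, which should be checked with `ceilometer help sample-list` first. The helper name and the example time window are hypothetical.

```python
import shlex
from datetime import datetime, timedelta


def bounded_sample_list(meter, end, hours, limit=None, resource=None):
    """Build argv for a time-bounded `ceilometer sample-list` call (not executed)."""
    start = end - timedelta(hours=hours)
    filters = [
        f"timestamp>{start.isoformat()}",
        f"timestamp<={end.isoformat()}",
    ]
    if resource is not None:
        # Same filter form as the targeted queries mentioned in the report.
        filters.append(f"resource={resource}")
    argv = ["ceilometer", "sample-list", "-m", meter, "-q", ";".join(filters)]
    if limit is not None:
        # Assumption: this client version supports --limit.
        argv += ["--limit", str(limit)]
    return argv


# Example window: one hour ending 2015-08-11 14:00, capped at 100 samples.
cmd = bounded_sample_list("instance", datetime(2015, 8, 11, 14, 0, 0), hours=1, limit=100)
print(shlex.join(cmd))  # shell-quoted so the ';' and '>' in the query are safe
```

Timing the same meter over increasingly wide windows would show whether response time grows with the number of samples scanned.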