Customer environment:
=====================
Cisco UCS B200 blades for controller and compute nodes
Cisco UCS C240 rack servers for Ceph storage
RHEL 7.1
RHEL-OSP5 A4 (openstack-ceilometer-api-2014.1.4-1.el7ost.noarch)
RH Ceph 1.3

Symptoms:
=========
ceilometer commands (see below) get bogged down fairly quickly, after only about 1-2 days of gathering metering samples. The client's DEV OpenStack environment is quite active, with ~1000 instances across 30 nova compute nodes. The slowness worsens as we approach the TTL expiration (set to 5 days). Targeted queries (-q resource=<UUID>) seem to work OK at first, but they also get sluggish as more samples are gathered (a narrowed query is sketched at the end of this report).

We've checked disk I/O (behind /var/lib/mongodb), but it does not appear to be the source of the problem. We initially used Ceph RBD, then switched back to local disk (300GB SAS RAID1), but the slowness has persisted.

# time ceilometer resource-list|wc -l
5302

real    0m12.273s
user    0m2.075s
sys     0m0.194s

# time ceilometer meter-list|wc -l
40392

real    0m25.125s
user    0m15.027s
sys     0m0.398s

# time ceilometer sample-list -m volume|wc -l
2132

real    0m25.073s
user    0m0.979s
sys     0m0.096s

# time ceilometer sample-list -m image|wc -l
38055

real    1m7.497s
user    0m13.377s
sys     0m0.706s

# time ceilometer sample-list -m instance|wc -l
Error communicating with http://10.63.168.100:8777 timed out
0

real    10m0.583s
user    0m0.271s
sys     0m0.056s

# from the ceilometer db
> db.meter.find().count()
15582761

NOTE: Open-ended queries on the 'instance' meter consistently hang and eventually time out after 10 minutes. During these queries, the ceilometer-api and mongod processes spike in CPU usage. The issue only appears in the customer's DEV environment, where activity is high; the other environments (PROD, TEST) have much less activity and do not exhibit this slowness.
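
To help confirm whether the slowdown is on the MongoDB side, the meter collection's indexes and a sample query plan could be inspected from the mongo shell. This is only a sketch: the field names (counter_name, timestamp) follow the default ceilometer MongoDB schema for this release, but the exact index set depends on how the storage driver was initialized, so the output should be checked against the deployed version.

# mongo ceilometer
> db.meter.getIndexes()
> db.meter.stats()
> db.meter.find({counter_name: "instance"}).sort({timestamp: -1}).limit(10).explain()

getIndexes() lists the existing indexes, stats() reports collection and index sizes, and explain() shows whether an open-ended 'instance' query ends up scanning most of the ~15.5M documents instead of using an index.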
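
For comparison, a sample query narrowed by resource and time window, with an explicit limit, may stay responsive even when the open-ended form times out. The values in angle brackets are placeholders; the ';'-separated -q syntax and the -l/--limit option are standard python-ceilometerclient usage, but the exact query field names should be verified against the installed client version.

# time ceilometer sample-list -m instance -q 'resource=<UUID>;timestamp><ISO8601-start>' -l 100 | wc -l

If this narrowed form stays fast while the unscoped query times out, that points at the open-ended scan over the full meter collection rather than at the API service itself.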