Customer environment:
=====================
Cisco UCS B200 blades for controller and compute nodes
Cisco UCS C240 rack servers for Ceph storage
RHEL 7.1
RHEL-OSP5 A4 (openstack-ceilometer-api-2014.1.4-1.el7ost.noarch)
RH Ceph 1.3

Symptoms:
=========
ceilometer commands (see below) get bogged down fairly quickly, after only about 1-2 days of gathering metering samples. The client's DEV OpenStack environment is quite active, with ~1000 instances across 30 nova compute nodes. The slowness worsens as we approach the TTL expiration (set to 5 days). Targeted queries (-q resource=<UUID>) seem to work OK at first, but they also get sluggish as more samples are gathered (a narrowed query is sketched at the end of this report).

We've checked disk I/O (behind /var/lib/mongodb), but it does not appear to be the source of the problem. We initially used Ceph RBD, then switched back to local disk (300GB SAS RAID1), but the slowness has persisted.

# time ceilometer resource-list|wc -l
5302

real    0m12.273s
user    0m2.075s
sys     0m0.194s

# time ceilometer meter-list|wc -l
40392

real    0m25.125s
user    0m15.027s
sys     0m0.398s

# time ceilometer sample-list -m volume|wc -l
2132

real    0m25.073s
user    0m0.979s
sys     0m0.096s

# time ceilometer sample-list -m image|wc -l
38055

real    1m7.497s
user    0m13.377s
sys     0m0.706s

# time ceilometer sample-list -m instance|wc -l
Error communicating with http://10.63.168.100:8777 timed out
0

real    10m0.583s
user    0m0.271s
sys     0m0.056s

# from the ceilometer db
> db.meter.find().count()
15582761

NOTE: Open-ended queries on the 'instance' meter consistently hang and eventually time out after 10 minutes. During these queries, the ceilometer-api and mongod processes spike in CPU usage. The issue only appears in the customer's DEV environment, where activity is high; the other environments (PROD, TEST) have much less activity and do not exhibit this slowness.
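
To help confirm whether the slowdown is on the MongoDB side, the meter collection's indexes and a sample query plan could be inspected from the mongo shell. This is only a sketch: the field names (counter_name, timestamp) follow the default ceilometer MongoDB schema for this release, but the exact index set depends on how the storage driver was initialized, so the output should be checked against the deployed version.

# mongo ceilometer
> db.meter.getIndexes()
> db.meter.stats()
> db.meter.find({counter_name: "instance"}).sort({timestamp: -1}).limit(10).explain()

getIndexes() lists the existing indexes, stats() reports collection and index sizes, and explain() shows whether an open-ended 'instance' query ends up scanning most of the ~15.5M documents instead of using an index.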
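
For comparison, a sample query narrowed by resource and time window, with an explicit limit, may stay responsive even when the open-ended form times out. The values in angle brackets are placeholders; the ';'-separated -q syntax and the -l/--limit option are standard python-ceilometerclient usage, but the exact query field names should be verified against the installed client version.

# time ceilometer sample-list -m instance -q 'resource=<UUID>;timestamp><ISO8601-start>' -l 100 | wc -l

If this narrowed form stays fast while the unscoped query times out, that points at the open-ended scan over the full meter collection rather than at the API service itself.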