Bug 1337655

Summary: Ceilometer exceptions filling up log
Product: Red Hat OpenStack Reporter: Chris Dearborn <christopher_dearborn>
Component: rhosp-directorAssignee: Eoghan Glynn <eglynn>
Status: CLOSED WORKSFORME QA Contact: Arik Chernetsky <achernet>
Severity: high Docs Contact:
Priority: high    
Version: 8.0 (Liberty)CC: arkady_kanevsky, cdevine, christopher_dearborn, dbecker, eglynn, gael_rehault, John_walsh, kasmith, kurt_hey, mburns, morazi, pkilambi, randy_perryman, rhel-osp-director-maint, rsussman, smerrow, sreichar, wayne_allen
Target Milestone: ---Keywords: Triaged
Target Release: 8.0 (Liberty)   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-12 14:29:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1261979    

Description Chris Dearborn 2016-05-19 18:18:18 UTC
Description of problem:
After an overcloud deployment there are ceilometer exceptions that are filling up the logs.  These exceptions are occurring at a rapid rate.

Version-Release number of selected component (if applicable):
OSP 8

How reproducible:
Deploy an overcloud.

Steps to Reproduce:
1. After overcloud deployment, ssh into a controller node and:
journalctl | grep ceilometer.coordination | grep "Error connecting to coordination backend." | wc -l

Actual results:
Ceilometer exceptions are spewing into the logs on the controller nodes.

Expected results:
Should see no ceilometer exceptions in the logs.

Additional info:
These commands were run back to back, and you can see that 13 identical exceptions were logged in the course of a few seconds.
[heat-admin@r13-controller-0 ~]$ journalctl | grep ceilometer.coordination | grep "Error connecting to coordination backend." | wc -l
13877
[heat-admin@r13-controller-0 ~]$ journalctl | grep ceilometer.coordination | grep "Error connecting to coordination backend." | wc -l
13890

Comment 2 Chris Dearborn 2016-05-19 18:22:11 UTC
The exception:

May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination [-] Error connecting to coordination backend.
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination Traceback (most recent call last):
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/ceilometer/coordination.py", line 73, in start
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     self._coordinator.start()
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 197, in start
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     self._start()
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 415, in _start
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     self._server_info = self._client.info()
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     self.gen.throw(type, value, traceback)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 50, in _translate_failures
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     cause=e)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 527, in raise_with_cause
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/six.py", line 692, in raise_from
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     raise value
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination [-] Error sending a heartbeat to coordination backend.
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination Traceback (most recent call last):
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/ceilometer/coordination.py", line 101, in heartbeat
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     self._coordinator.heartbeat()
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 479, in heartbeat
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     value=self.STILL_ALIVE)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     self.gen.throw(type, value, traceback)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 50, in _translate_failures
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     cause=e)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 527, in raise_with_cause
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/six.py", line 692, in raise_from
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     raise value
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination

Comment 3 Mike Burns 2016-06-02 18:32:18 UTC
Eoghan,  do you have someone who can look at this?

Comment 4 Mike Burns 2016-07-28 18:14:55 UTC
Prad, any thoughts?

Comment 5 Pradeep Kilambi 2016-07-28 18:26:45 UTC
hmm looks like the redis coordination backend is not up or reachable? 

Few things to check:

* Make sure redis coordination url is configured correctly in ceilometer.conf

* the ip for the redis backend is reachable from the controller node. 

Also make sure python-redis is installed. We had a bug iirc where it was not installed automatically.

Comment 6 Sean Merrow 2016-11-10 14:50:27 UTC
Chris: Is this still an issue? Were you ever able to follow Pradeep's suggestions for debugging the issue?

Comment 7 Chris Dearborn 2016-11-15 16:15:47 UTC
Not sure if this is still an issue or not.

I saw the tips above, but thought that eglynn or pkilambi was trying them since the needinfo was set to them.  As a result, I didn't try them.

I'm working on deploying OSP10 now, so I will check that deployment when it finishes.

Comment 8 Chris Dearborn 2016-12-12 14:29:03 UTC
Just checked an OSP10 GA2 deployment, and this issue is no longer occurring.

Closing.