Bug 1337655 - Ceilometer exceptions filling up log
Summary: Ceilometer exceptions filling up log
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: All
OS: Linux
high
high
Target Milestone: ---
: 8.0 (Liberty)
Assignee: Eoghan Glynn
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks: 1261979
TreeView+ depends on / blocked
 
Reported: 2016-05-19 18:18 UTC by Chris Dearborn
Modified: 2016-12-12 14:29 UTC (History)
18 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-12 14:29:03 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Chris Dearborn 2016-05-19 18:18:18 UTC
Description of problem:
After an overcloud deployment there are ceilometer exceptions that are filling up the logs.  These exceptions are occurring at a rapid rate.

Version-Release number of selected component (if applicable):
OSP 8

How reproducible:
Deploy an overcloud.

Steps to Reproduce:
1. After overcloud deployment, ssh into a controller node and:
journalctl | grep ceilometer.coordination | grep "Error connecting to coordination backend." | wc -l

Actual results:
Ceilometer exceptions are spewing into the logs on the controller nodes.

Expected results:
Should see no ceilometer exceptions in the logs.

Additional info:
These commands were run back to back, and you can see that 13 identical exceptions were logged in the course of a few seconds.
[heat-admin@r13-controller-0 ~]$ journalctl | grep ceilometer.coordination | grep "Error connecting to coordination backend." | wc -l
13877
[heat-admin@r13-controller-0 ~]$ journalctl | grep ceilometer.coordination | grep "Error connecting to coordination backend." | wc -l
13890

Comment 2 Chris Dearborn 2016-05-19 18:22:11 UTC
The exception:

May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination [-] Error connecting to coordination backend.
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination Traceback (most recent call last):
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/ceilometer/coordination.py", line 73, in start
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     self._coordinator.start()
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 197, in start
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     self._start()
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 415, in _start
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     self._server_info = self._client.info()
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     self.gen.throw(type, value, traceback)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 50, in _translate_failures
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     cause=e)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 527, in raise_with_cause
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/six.py", line 692, in raise_from
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination     raise value
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.870 39518 ERROR ceilometer.coordination
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination [-] Error sending a heartbeat to coordination backend.
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination Traceback (most recent call last):
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/ceilometer/coordination.py", line 101, in heartbeat
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     self._coordinator.heartbeat()
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 479, in heartbeat
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     value=self.STILL_ALIVE)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     self.gen.throw(type, value, traceback)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 50, in _translate_failures
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     cause=e)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 527, in raise_with_cause
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination   File "/usr/lib/python2.7/site-packages/six.py", line 692, in raise_from
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination     raise value
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
May 18 13:39:51 r13-controller-0.localdomain ceilometer-alarm-evaluator[39518]: 2016-05-18 13:39:51.873 39518 ERROR ceilometer.coordination

Comment 3 Mike Burns 2016-06-02 18:32:18 UTC
Eoghan,  do you have someone who can look at this?

Comment 4 Mike Burns 2016-07-28 18:14:55 UTC
Prad, any thoughts?

Comment 5 Pradeep Kilambi 2016-07-28 18:26:45 UTC
hmm looks like the redis coordination backend is not up or reachable? 

Few things to check:

* Make sure redis coordination url is configured correctly in ceilometer.conf

* the ip for the redis backend is reachable from the controller node. 

Also make sure python-redis is installed. We had a bug iirc where it was not installed automatically.

Comment 6 Sean Merrow 2016-11-10 14:50:27 UTC
Chris: Is this still an issue? Were you ever able to follow Pradeep's suggestions for debugging the issue?

Comment 7 Chris Dearborn 2016-11-15 16:15:47 UTC
Not sure if this is still an issue or not.

I saw the tips above, but thought that eglynn or pkilambi was trying them since the needinfo was set to them.  As a result, I didn't try them.

I'm working on deploying OSP10 now, so I will check that deployment when it finishes.

Comment 8 Chris Dearborn 2016-12-12 14:29:03 UTC
Just checked an OSP10 GA2 deployment, and this issue is no longer occurring.

Closing.


Note You need to log in before you can comment on or make changes to this bug.