Description of problem:
I deployed OpenStack with RDO and then created a Heat template to do autoscaling. After stressing the VM, nothing happens, and after restarting all the ceilometer services I get ERROR ceilometer.alarm.service in alarm-evaluator.log.

Version-Release number of selected component (if applicable):
openstack-ceilometer-common-2013.2-1.el6.noarch
openstack-ceilometer-collector-2013.2-1.el6.noarch
python-ceilometerclient-1.0.6-1.el6.noarch
python-ceilometer-2013.2-1.el6.noarch
openstack-ceilometer-central-2013.2-1.el6.noarch
openstack-ceilometer-api-2013.2-1.el6.noarch
openstack-ceilometer-alarm-2013.2-1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create a simple autoscaling template and run it in Heat (linked below).
2. Stress the VM.
3. Restart all ceilometer services.

Actual results:
Nothing happens.

Expected results:
Heat should create a new instance.

Additional info:
The log information in alarm-evaluator.log is:

2013-12-12 07:53:28.361 31932 ERROR ceilometer.alarm.service [-] alarm evaluation cycle failed
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service Traceback (most recent call last):
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometer/alarm/service.py", line 95, in _evaluate_assigned_alarms
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     alarms = self._assigned_alarms()
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometer/alarm/service.py", line 138, in _assigned_alarms
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     'value': True}])
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometerclient/v2/alarms.py", line 70, in list
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     return self._list(options.build_url(self._path(), q))
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometerclient/common/base.py", line 57, in _list
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     resp, body = self.api.json_request('GET', url)
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometerclient/common/http.py", line 186, in json_request
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     resp, body_iter = self._http_request(url, method, **kwargs)
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometerclient/common/http.py", line 155, in _http_request
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     raise exc.CommunicationError(message=message)
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service CommunicationError: Error communicating with http://7.7.7.7:8777 [Errno 111] ECONNREFUSED
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service

My template is at: http://paste.openstack.org/show/54865/

There is a similar bug, but it was only fixed in devstack: https://bugs.launchpad.net/ceilometer/+bug/1243249
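The ECONNREFUSED above suggests nothing was listening on the API port at the time of the evaluation cycle. A quick way to check (assuming the default port 8777 and the 7.7.7.7 address shown in the traceback) would be:

$ sudo netstat -tlnp | grep 8777                                   # should show the ceilometer-api process, or nothing if it isn't up
$ curl -s -o /dev/null -w '%{http_code}\n' http://7.7.7.7:8777/    # prints 000 if the connection is refused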
I have changed the interval from 300 to 30 in /etc/ceilometer/pipeline.yaml.
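A rough way to confirm the edit, and to make it take effect (assuming the agents only read pipeline.yaml at startup, so a restart is needed):

$ grep -n 'interval:' /etc/ceilometer/pipeline.yaml
$ sudo service openstack-ceilometer-central restart    # and openstack-ceilometer-compute on the compute nodes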
So this is a general issue with service startup interdependencies. Something similar was handled for keystone with systemd notifiers, but on el6 we used an internal wait_until_keystone_available() routine in the keystone init script. Eoghan, what would be the best way to determine that the ceilometer API is available: some ceilometer (client) command, or a more generic curl call (which would have awkwardness with parsing and configurable port numbers)?
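For reference, a minimal sketch of what a wait_until_ceilometer_api_available() helper might look like in the el6 init script, modelled on the keystone approach mentioned above (the localhost endpoint, the 8777 port and the ~30s timeout are assumptions here, not what the packaging actually uses):

wait_until_ceilometer_api_available() {
    # poll the API port until it accepts connections, giving up after ~30 seconds
    local retries=30
    while [ $retries -gt 0 ]; do
        curl -s -o /dev/null http://127.0.0.1:8777/ && return 0
        sleep 1
        retries=$((retries - 1))
    done
    return 1
}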
If it's simply a matter of the alarm-evaluator hitting the API service before it's fully ready, then I'd have expected only the first evaluation cycle or so to fail, but for alarm evaluation to subsequently get back on track once the API service is fully up, i.e. the problem would be purely transient, impacting the first one or two evaluation cycles at most (which occur by default at 60s intervals). Whereas the fact the autoscaling actions *never* occur appears to suggest something else.

Two possibilities come to mind:

1. The problem with alarm evaluation is actually resolved quickly, but the lack of repeat_actions in your template results in Heat autoscaling not receiving the continuous notifications it needs in order to implement the configured cooldown period.

To check if this is the case:

$ function get_alarm_id { ceilometer alarm-list | awk -F\| "/$1/ {print \$2}"; }
$ ceilometer alarm-show -a $(get_alarm_id MEMAlarmHigh) | grep -E 'state|repeat'

If the state has indeed transitioned to alarm, then the problem is the lack of repeat_actions. To resolve:

$ for a in $(get_alarm_id MEMAlarmHigh) $(get_alarm_id MEMAlarmLow); do ceilometer alarm-update -a $a --repeat-actions True ; done

Then in the future make sure you add the line:

  repeat_actions: True

to every OS::Metering::Alarm resource in your template.

2. Are you running the ceilometer-alarm-evaluator service on a different host to the ceilometer-api service?

If so, you've possibly hit this issue (now fixed) with the iptables rule installed by packstack for the ceilometer-api:

  https://bugzilla.redhat.com/1040404

or this similar issue (not yet fixed):

  https://bugzilla.redhat.com/1041560

with the iptables rules for keystone.

To check your iptables rules, on the controller host run:

$ sudo iptables -L | grep -E 'ceilometer|keystone'

looking for the fourth field (source) to be "anywhere" if the rule is sufficiently open, or the controller IP address if not.

To work around these issues:

$ function get_iptables_index { sudo iptables -L | grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n $1 | cut -f1 -d:; }

# if the ceilometer rule is not open
$ sudo iptables -I INPUT $(get_iptables_index ceilometer) -p tcp --dport 8777 -j ACCEPT

# if the keystone rule is not open
$ sudo iptables -I INPUT $(get_iptables_index keystone) -p tcp --dport 35357 -j ACCEPT

$ sudo service iptables save
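As a follow-up usage example, re-using the get_alarm_id helper above (MEMAlarmHigh/MEMAlarmLow being the alarm names from the template), both alarms' state and repeat_actions can be inspected in one go:

$ for a in MEMAlarmHigh MEMAlarmLow; do echo "== $a =="; ceilometer alarm-show -a $(get_alarm_id $a) | grep -E 'state|repeat'; done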
(In reply to Eoghan Glynn from comment #3)
> 2. Are you running the ceilometer-alarm-evaluator service on a different
> host to the ceilometer-api service?

Hi Eoghan, thanks for your reply. I run the ceilometer-alarm-evaluator service and the ceilometer-api service on the same node, and the state of the alarm is "insufficient data". This issue has blocked me for more than a week, and after I run the template I can't find any errors in the log files (debug is set to "True"). Can you help me figure this out?
Hi Alienyyg,

Is the state of the alarm *continually* in insufficient_data, or does it flip from transiently being in the alarm state then back into insufficient_data?

If the former, the issue is likely to be:

  https://bugzilla.redhat.com/1032070

You can work around it with:

$ sudo openstack-config --del /etc/ceilometer/ceilometer.conf DEFAULT host
$ sudo service openstack-ceilometer-compute restart

If the latter, the issue is likely a mismatch between the cadence of the compute agent polling interval (default: 600s) and the alarm period (60s).

You can work around it with:

$ sudo sed -i '/^ *name: .*_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' /etc/ceilometer/pipeline.yaml
$ sudo service openstack-ceilometer-compute restart

However, I noticed something else wrong with your template - I had been thrown by the description:

  Scale-up if the average CPU > 50% for 1 minute

which implies that the alarm is based on CPU utilization. However I see now that the actual meter name is:

  meter_name: memory

Is that intentional? Note that this meter's value is the *total* amount of allocated memory, which will be static for an instance group in the absence of scaling actions, not the memory utilization (i.e. 100%*(MemTotal-MemFree)/MemTotal), which would be more variable.

The threshold and description imply that what was really meant here was:

  meter_name: cpu_util

i.e. to scale up based on CPU utilization %.
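As a sanity check, something like the following would show whether cpu_util samples are being gathered at all for one of the group's instances (a sketch only: the nova list parsing is an assumption, and the statistics call is the same one used later in this report):

$ INSTANCE_ID=$(nova list | awk -F\| '/ACTIVE/ {print $2; exit}' | tr -d ' ')
$ ceilometer statistics -m cpu_util -q "resource_id=$INSTANCE_ID"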
(In reply to Eoghan Glynn from comment #5)
> Is the state of the alarm *continually* in insufficient_data, or does it
> flip from transiently being in the alarm state then back into
> insufficient_data?

Hi Eoghan:

Sorry for the mistake about "meter_name: memory" - I originally wanted to do autoscaling based on memory utilization, but I have now changed it to "meter_name: cpu_util". The status of each alarm is always "insufficient data". I have changed the interval to 60 in /etc/ceilometer/pipeline.yaml. DEFAULT.host is not set in either nova.conf or ceilometer.conf. I am running a multi-node OpenStack: one controller and 2 compute nodes; all services except nova-compute run on the controller, and the 2 compute nodes are responsible for nova-compute. So I don't get any response from:

  nova list --all-tenants --host $(python -c "import socket ; print socket.gethostname()")

but I think that is not the key problem.
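For a continually insufficient_data alarm on a multi-node setup, it may help to confirm which compute node an instance actually landed on and whether any meters exist for it at all; a rough sketch ($INSTANCE_ID is a placeholder, and the OS-EXT-SRV-ATTR:host field requires admin credentials):

$ nova show $INSTANCE_ID | grep -E 'OS-EXT-SRV-ATTR:host|hypervisor_hostname'
$ ceilometer meter-list -q "resource_id=$INSTANCE_ID"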
After I deployed OpenStack, I did not get any ceilometer-compute service; all I got was:

openstack-ceilometer-common-2013.2-1.el6.noarch
openstack-ceilometer-collector-2013.2-1.el6.noarch
python-ceilometerclient-1.0.6-1.el6.noarch
python-ceilometer-2013.2-1.el6.noarch
openstack-ceilometer-central-2013.2-1.el6.noarch
openstack-ceilometer-api-2013.2-1.el6.noarch
openstack-ceilometer-alarm-2013.2-1.el6.noarch

Then I followed https://fedoraproject.org/wiki/QA:Testcase_OpenStack_ceilometer_install to install the ceilometer service. After all that, I can't get any response from:

  ceilometer statistics -m cpu_util -q "resource_id=$INSTANCE_ID"

and an error is found in collector.log (it appears every time I launch the stack):

2013-12-17 11:59:13.261 2504 ERROR ceilometer.collector.dispatcher.database [req-e197639d-7dab-40fb-8f21-37fd434737a1 None None] Failed to record metering data: not okForStorage
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database Traceback (most recent call last):
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib/python2.6/site-packages/ceilometer/collector/dispatcher/database.py", line 65, in record_metering_data
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     self.storage_conn.record_metering_data(meter)
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib/python2.6/site-packages/ceilometer/storage/impl_mongodb.py", line 416, in record_metering_data
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     upsert=True,
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib64/python2.6/site-packages/pymongo/collection.py", line 479, in update
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     check_keys, self.__uuid_subtype), safe)
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib64/python2.6/site-packages/pymongo/mongo_client.py", line 920, in _send_message
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     rv = self.__check_response_to_last_error(response)
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib64/python2.6/site-packages/pymongo/mongo_client.py", line 863, in __check_response_to_last_error
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     raise OperationFailure(details["err"], details["code"])
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database OperationFailure: not okForStorage
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database

Maybe this is the key to solving this issue?

Best Regards
Alienyyg
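To see whether any samples at all are making it past the 'not okForStorage' failures into the metering store, one could also query mongodb directly; a sketch, assuming mongodb is local and the default 'ceilometer' database with the driver's 'meter' collection is in use:

$ mongo ceilometer --eval 'db.meter.find({counter_name: "cpu_util"}).count()'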
(In reply to Eoghan Glynn from comment #5)
> Note that this meter's value is the *total* amount of allocated memory,
> which will be static for an instance group in the absence of scaling
> actions

And I can get the memory of each instance via:

  ceilometer statistics -m memory -q "resource_id=$ID"
Hi Alienyyg,

Can you paste the latest version of your Heat template?

Also, we need to get to the bottom of why the mongodb storage driver is rejecting the attempt to store the sample data.

Can you ensure that debug logging is enabled:

$ sudo openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT debug True
$ sudo service openstack-ceilometer-collector restart

Then attach a fresh excerpt from the collector log containing the 'not okForStorage' failure.

One known case where this may occur is where some user metadata for an instance contains an embedded period ('.') in the user metadata key. Can you check whether this is the case with:

$ for i in $(nova list | grep -vE '(\+--|Task State)' | awk '{print $2}'); do nova show $i | grep meta; done

I suspect what's happening in this case is that the 'metering.' prefix used by Heat is causing those mongodb failures. Can you check which event types cause the problem with something like:

$ for r in $(grep 'not okForStorage' /var/log/ceilometer/collector.log | grep req- | sed 's/^.*req-/req-/' | sed 's/ None .*$//'); do grep $r /var/log/ceilometer/collector.log | grep received | sed 's/^.*event_type.: u.//' | sed 's/. .*$//'; done

Can you run this and get back to me with the problematic event types?

I don't actually think the 'not okForStorage' is the cause of the cpu_util not showing up in the metering store, as the ceilometer compute agent handles user metadata differently.

(See https://bugs.launchpad.net/ceilometer/+bug/1262190 for some of the metadata handling inconsistencies I discovered when looking into your issue.)
One other point, in terms of getting-started docco, you'd be much better off looking at: http://openstack.redhat.com/CeilometerQuickStart as opposed to: https://fedoraproject.org/wiki/QA:Testcase_OpenStack_ceilometer_install as the latter is quite old at this stage.
(In reply to Eoghan Glynn from comment #9)
> One other point, in terms of getting-started docco, you'd be much better off
> looking at:
>
> http://openstack.redhat.com/CeilometerQuickStart

Hi Eoghan:

I am sorry, but my OpenStack crashed yesterday due to a disk error on my controller (some remapping errors, and I cannot start the controller), so I deployed OpenStack once again with RDO. I found that after the deployment finished, no ceilometer-compute service was present on the controller. The deployment is the same as I used before (with GRE networking, one controller and two compute nodes). I think this is the key, because when I try an allinone deployment everything works well; it might be some bug in RDO.

My OpenStack crashed before I got your message, so I'm not able to get the error logs again :(

BTW, thanks for the link about ceilometer.

Best Regards
Alienyyg
> no ceilometer-compute service is found on controller

Did you mean that the compute agent is running on the controller, or that it's not even installed on the controller?

$ sudo service openstack-ceilometer-compute status
$ sudo rpm -qa | grep ceilometer
(In reply to Eoghan Glynn from comment #11)
> Did you mean that the compute agent is running on the controller, or that
> it's not even installed on the controller?

Hi Eoghan:

Really sorry for the delay - I spent this weekend deploying OpenStack many times to get familiar with RDO. I think I misunderstood it before: openstack-ceilometer-compute only exists on nodes where the nova-compute service runs. When I asked the last question I found no openstack-ceilometer-compute on the controller, which is as it should be.

I also found it is better to run OpenStack in VirtualBox or VMware, just because it is more flexible and easier to back up the OpenStack environment. I discovered this after the physical machine failure and many OpenStack deployments :( Maybe you could add some simple advice on this to the RDO quickstart document. I just finished the new OpenStack deployment tonight, and I will test autoscaling tomorrow.

BTW, thanks for your reply again.

Best Regards
Alienyyg
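Since openstack-ceilometer-compute only lives where nova-compute runs, the checks from comment #11 are worth repeating on each compute node rather than on the controller; a sketch (the host names here are placeholders):

$ for node in compute-1 compute-2; do ssh -t $node 'rpm -qa | grep ceilometer-compute; sudo service openstack-ceilometer-compute status'; done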
Eoghan is correct: "openstack-ceilometer-alarm-evaluator should wait for the openstack-ceilometer-api service" is not what made autoscaling unavailable; something else was wrong with my OpenStack. I deployed OpenStack again and it seems I can now monitor the status of the guest machines, but I have run into something new, so I think I need to raise a new bug.

regards
alienyyg