Description of problem:
I deployed OpenStack with RDO and then created a Heat template to do autoscaling. After stressing the VM, nothing happens, and after restarting all the ceilometer services I get ERROR ceilometer.alarm.service in alarm-evaluator.log.

Version-Release number of selected component (if applicable):
openstack-ceilometer-common-2013.2-1.el6.noarch
openstack-ceilometer-collector-2013.2-1.el6.noarch
python-ceilometerclient-1.0.6-1.el6.noarch
python-ceilometer-2013.2-1.el6.noarch
openstack-ceilometer-central-2013.2-1.el6.noarch
openstack-ceilometer-api-2013.2-1.el6.noarch
openstack-ceilometer-alarm-2013.2-1.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create a simple autoscaling template and run it in Heat (linked below).
2. Stress the VM.
3. Restart all ceilometer services.

Actual results:
Nothing happens.

Expected results:
Heat should create a new instance.

Additional info:
The log information in alarm-evaluator.log is:

2013-12-12 07:53:28.361 31932 ERROR ceilometer.alarm.service [-] alarm evaluation cycle failed
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service Traceback (most recent call last):
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometer/alarm/service.py", line 95, in _evaluate_assigned_alarms
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     alarms = self._assigned_alarms()
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometer/alarm/service.py", line 138, in _assigned_alarms
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     'value': True}])
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometerclient/v2/alarms.py", line 70, in list
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     return self._list(options.build_url(self._path(), q))
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometerclient/common/base.py", line 57, in _list
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     resp, body = self.api.json_request('GET', url)
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometerclient/common/http.py", line 186, in json_request
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     resp, body_iter = self._http_request(url, method, **kwargs)
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service   File "/usr/lib/python2.6/site-packages/ceilometerclient/common/http.py", line 155, in _http_request
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service     raise exc.CommunicationError(message=message)
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service CommunicationError: Error communicating with http://7.7.7.7:8777 [Errno 111] ECONNREFUSED
2013-12-12 07:53:28.361 31932 TRACE ceilometer.alarm.service

My template is at: http://paste.openstack.org/show/54865/

There is a similar bug, but it was only fixed in devstack: https://bugs.launchpad.net/ceilometer/+bug/1243249
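The ECONNREFUSED above suggests nothing was listening on the API port at the time of the evaluation cycle. A quick way to check (assuming the default port 8777 and the 7.7.7.7 address shown in the traceback) would be:

$ sudo netstat -tlnp | grep 8777                                   # should show the ceilometer-api process, or nothing if it isn't up
$ curl -s -o /dev/null -w '%{http_code}\n' http://7.7.7.7:8777/    # prints 000 if the connection is refused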
I have changed the interval from 300 to 30 in /etc/ceilometer/pipeline.yaml.
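A rough way to confirm the edit, and to make it take effect (assuming the agents only read pipeline.yaml at startup, so a restart is needed):

$ grep -n 'interval:' /etc/ceilometer/pipeline.yaml
$ sudo service openstack-ceilometer-central restart    # and openstack-ceilometer-compute on the compute nodes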
So this is a general issue with service startup interdependencies. Something similar was handled for keystone with systemd notifiers, but on el6 we used an internal wait_until_keystone_available() routine in the keystone init script. Eoghan, what would be the best way to determine that the ceilometer API is available: some ceilometer (client) command, or a more generic curl call (which would have awkwardness with parsing and configurable port numbers)?
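For reference, a minimal sketch of what a wait_until_ceilometer_api_available() helper might look like in the el6 init script, modelled on the keystone approach mentioned above (the localhost endpoint, the 8777 port and the ~30s timeout are assumptions here, not what the packaging actually uses):

wait_until_ceilometer_api_available() {
    # poll the API port until it accepts connections, giving up after ~30 seconds
    local retries=30
    while [ $retries -gt 0 ]; do
        curl -s -o /dev/null http://127.0.0.1:8777/ && return 0
        sleep 1
        retries=$((retries - 1))
    done
    return 1
}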
If it's simply a matter of the alarm-evaluator hitting the API service before it's fully ready, then I'd have expected only the first evaluation cycle or so to fail, but for alarm evaluation to subsequently get back on track once the API service is fully up, i.e. the problem would be purely transient, impacting the first one or two evaluation cycles at most (which occur by default at 60s intervals). Whereas the fact the autoscaling actions *never* occur appears to suggest something else.

Two possibilities come to mind:

1. The problem with alarm evaluation is actually resolved quickly, but the lack of repeat_actions in your template results in Heat autoscaling not receiving the continuous notifications it needs in order to implement the configured cooldown period.

To check if this is the case:

$ function get_alarm_id { ceilometer alarm-list | awk -F\| "/$1/ {print \$2}"; }
$ ceilometer alarm-show -a $(get_alarm_id MEMAlarmHigh) | grep -E 'state|repeat'

If the state has indeed transitioned to alarm, then the problem is the lack of repeat_actions. To resolve:

$ for a in $(get_alarm_id MEMAlarmHigh) $(get_alarm_id MEMAlarmLow); do ceilometer alarm-update -a $a --repeat-actions True ; done

Then in the future make sure you add the line:

  repeat_actions: True

to every OS::Metering::Alarm resource in your template.

2. Are you running the ceilometer-alarm-evaluator service on a different host to the ceilometer-api service?

If so, you've possibly hit this issue (now fixed) with the iptables rule installed by packstack for the ceilometer-api:

  https://bugzilla.redhat.com/1040404

or this similar issue (not yet fixed):

  https://bugzilla.redhat.com/1041560

with the iptables rules for keystone.

To check your iptables rules, on the controller host run:

$ sudo iptables -L | grep -E 'ceilometer|keystone'

looking for the fourth field (source) to be "anywhere" if the rule is sufficiently open, or the controller IP address if not.

To work around these issues:

$ function get_iptables_index { sudo iptables -L | grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n $1 | cut -f1 -d:; }

# if the ceilometer rule is not open
$ sudo iptables -I INPUT $(get_iptables_index ceilometer) -p tcp --dport 8777 -j ACCEPT

# if the keystone rule is not open
$ sudo iptables -I INPUT $(get_iptables_index keystone) -p tcp --dport 35357 -j ACCEPT

$ sudo service iptables save
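As a follow-up usage example, re-using the get_alarm_id helper above (MEMAlarmHigh/MEMAlarmLow being the alarm names from the template), both alarms' state and repeat_actions can be inspected in one go:

$ for a in MEMAlarmHigh MEMAlarmLow; do echo "== $a =="; ceilometer alarm-show -a $(get_alarm_id $a) | grep -E 'state|repeat'; done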
(In reply to Eoghan Glynn from comment #3)
> 2. Are you running the ceilometer-alarm-evaluator service on a different
> host to the ceilometer-api service?

Hi Eoghan, thanks for your reply. I run the ceilometer-alarm-evaluator service and the ceilometer-api service on the same node, and the state of the alarm is "insufficient data". This issue has blocked me for more than a week, and after I run the template I can't find any errors in the log files (debug is set to "True"). Can you help me figure this out?
Hi Alienyyg,

Is the state of the alarm *continually* in insufficient_data, or does it flip from transiently being in the alarm state then back into insufficient_data?

If the former, the issue is likely to be:

  https://bugzilla.redhat.com/1032070

You can work around it with:

$ sudo openstack-config --del /etc/ceilometer/ceilometer.conf DEFAULT host
$ sudo service openstack-ceilometer-compute restart

If the latter, the issue is likely a mismatch between the cadence of the compute agent polling interval (default: 600s) and the alarm period (60s).

You can work around it with:

$ sudo sed -i '/^ *name: .*_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' /etc/ceilometer/pipeline.yaml
$ sudo service openstack-ceilometer-compute restart

However, I noticed something else wrong with your template - I had been thrown by the description:

  Scale-up if the average CPU > 50% for 1 minute

which implies that the alarm is based on CPU utilization. However I see now that the actual meter name is:

  meter_name: memory

Is that intentional? Note that this meter's value is the *total* amount of allocated memory, which will be static for an instance group in the absence of scaling actions, not the memory utilization (i.e. 100%*(MemTotal-MemFree)/MemTotal), which would be more variable.

The threshold and description imply that what was really meant here was:

  meter_name: cpu_util

i.e. to scale up based on CPU utilization %.
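As a sanity check, something like the following would show whether cpu_util samples are being gathered at all for one of the group's instances (a sketch only: the nova list parsing is an assumption, and the statistics call is the same one used later in this report):

$ INSTANCE_ID=$(nova list | awk -F\| '/ACTIVE/ {print $2; exit}' | tr -d ' ')
$ ceilometer statistics -m cpu_util -q "resource_id=$INSTANCE_ID"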
(In reply to Eoghan Glynn from comment #5)
> Is the state of the alarm *continually* in insufficient_data, or does it
> flip from transiently being in the alarm state then back into
> insufficient_data?

Hi Eoghan:

Sorry for the mistake about "meter_name: memory" - I originally wanted to do autoscaling based on memory utilization, but I have now changed it to "meter_name: cpu_util". The status of each alarm is always "insufficient data". I have changed the interval to 60 in /etc/ceilometer/pipeline.yaml. DEFAULT.host is not set in either nova.conf or ceilometer.conf. I am running a multi-node OpenStack: one controller and 2 compute nodes; all services except nova-compute run on the controller, and the 2 compute nodes are responsible for nova-compute. So I don't get any response from:

  nova list --all-tenants --host $(python -c "import socket ; print socket.gethostname()")

but I think that is not the key problem.
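For a continually insufficient_data alarm on a multi-node setup, it may help to confirm which compute node an instance actually landed on and whether any meters exist for it at all; a rough sketch ($INSTANCE_ID is a placeholder, and the OS-EXT-SRV-ATTR:host field requires admin credentials):

$ nova show $INSTANCE_ID | grep -E 'OS-EXT-SRV-ATTR:host|hypervisor_hostname'
$ ceilometer meter-list -q "resource_id=$INSTANCE_ID"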
After I deployed OpenStack, I did not get any ceilometer-compute service; all I got was:

openstack-ceilometer-common-2013.2-1.el6.noarch
openstack-ceilometer-collector-2013.2-1.el6.noarch
python-ceilometerclient-1.0.6-1.el6.noarch
python-ceilometer-2013.2-1.el6.noarch
openstack-ceilometer-central-2013.2-1.el6.noarch
openstack-ceilometer-api-2013.2-1.el6.noarch
openstack-ceilometer-alarm-2013.2-1.el6.noarch

Then I followed https://fedoraproject.org/wiki/QA:Testcase_OpenStack_ceilometer_install to install the ceilometer service. After all that, I can't get any response from:

  ceilometer statistics -m cpu_util -q "resource_id=$INSTANCE_ID"

and an error is found in collector.log (it appears every time I launch the stack):

2013-12-17 11:59:13.261 2504 ERROR ceilometer.collector.dispatcher.database [req-e197639d-7dab-40fb-8f21-37fd434737a1 None None] Failed to record metering data: not okForStorage
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database Traceback (most recent call last):
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib/python2.6/site-packages/ceilometer/collector/dispatcher/database.py", line 65, in record_metering_data
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     self.storage_conn.record_metering_data(meter)
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib/python2.6/site-packages/ceilometer/storage/impl_mongodb.py", line 416, in record_metering_data
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     upsert=True,
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib64/python2.6/site-packages/pymongo/collection.py", line 479, in update
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     check_keys, self.__uuid_subtype), safe)
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib64/python2.6/site-packages/pymongo/mongo_client.py", line 920, in _send_message
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     rv = self.__check_response_to_last_error(response)
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database   File "/usr/lib64/python2.6/site-packages/pymongo/mongo_client.py", line 863, in __check_response_to_last_error
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database     raise OperationFailure(details["err"], details["code"])
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database OperationFailure: not okForStorage
2013-12-17 11:59:13.261 2504 TRACE ceilometer.collector.dispatcher.database

Maybe this is the key to solving this issue?

Best Regards
Alienyyg
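To see whether any samples at all are making it past the 'not okForStorage' failures into the metering store, one could also query mongodb directly; a sketch, assuming mongodb is local and the default 'ceilometer' database with the driver's 'meter' collection is in use:

$ mongo ceilometer --eval 'db.meter.find({counter_name: "cpu_util"}).count()'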
(In reply to Eoghan Glynn from comment #5)
> Note that this meter's value is the *total* amount of allocated memory,
> which will be static for an instance group in the absence of scaling
> actions

And I can get the memory of each instance via:

  ceilometer statistics -m memory -q "resource_id=$ID"
Hi Alienyyg,

Can you paste the latest version of your Heat template?

Also, we need to get to the bottom of why the mongodb storage driver is rejecting the attempt to store the sample data.

Can you ensure that debug logging is enabled:

$ sudo openstack-config --set /etc/ceilometer/ceilometer.conf DEFAULT debug True
$ sudo service openstack-ceilometer-collector restart

Then attach a fresh excerpt from the collector log containing the 'not okForStorage' failure.

One known case where this may occur is where some user metadata for an instance contains an embedded period ('.') in the user metadata key. Can you check whether this is the case with:

$ for i in $(nova list | grep -vE '(\+--|Task State)' | awk '{print $2}'); do nova show $i | grep meta; done

I suspect what's happening in this case is that the 'metering.' prefix used by Heat is causing those mongodb failures. Can you check which event types cause the problem with something like:

$ for r in $(grep 'not okForStorage' /var/log/ceilometer/collector.log | grep req- | sed 's/^.*req-/req-/' | sed 's/ None .*$//'); do grep $r /var/log/ceilometer/collector.log | grep received | sed 's/^.*event_type.: u.//' | sed 's/. .*$//'; done

Can you run this and get back to me with the problematic event types?

I don't actually think the 'not okForStorage' is the cause of the cpu_util not showing up in the metering store, as the ceilometer compute agent handles user metadata differently.

(See https://bugs.launchpad.net/ceilometer/+bug/1262190 for some of the metadata handling inconsistencies I discovered when looking into your issue.)
One other point, in terms of getting-started docco, you'd be much better off looking at: http://openstack.redhat.com/CeilometerQuickStart as opposed to: https://fedoraproject.org/wiki/QA:Testcase_OpenStack_ceilometer_install as the latter is quite old at this stage.
(In reply to Eoghan Glynn from comment #9)
> One other point, in terms of getting-started docco, you'd be much better off
> looking at:
>
> http://openstack.redhat.com/CeilometerQuickStart

Hi Eoghan:

I am sorry, but my OpenStack crashed yesterday due to a disk error on my controller (some remapping errors, and I cannot start the controller), so I deployed OpenStack once again with RDO. I found that after the deployment finished, no ceilometer-compute service was present on the controller. The deployment is the same as I used before (with GRE networking, one controller and two compute nodes). I think this is the key, because when I try an allinone deployment everything works well; it might be some bug in RDO.

My OpenStack crashed before I got your message, so I'm not able to get the error logs again :(

BTW, thanks for the link about ceilometer.

Best Regards
Alienyyg
> no ceilometer-compute service is found on controller

Did you mean that the compute agent is running on the controller, or that it's not even installed on the controller?

$ sudo service openstack-ceilometer-compute status
$ sudo rpm -qa | grep ceilometer
(In reply to Eoghan Glynn from comment #11)
> Did you mean that the compute agent is running on the controller, or that
> it's not even installed on the controller?

Hi Eoghan:

Really sorry for the delay - I spent this weekend deploying OpenStack many times to get familiar with RDO. I think I misunderstood it before: openstack-ceilometer-compute only exists on nodes where the nova-compute service runs. When I asked the last question I found no openstack-ceilometer-compute on the controller, which is as it should be.

I also found it is better to run OpenStack in VirtualBox or VMware, just because it is more flexible and easier to back up the OpenStack environment. I discovered this after the physical machine failure and many OpenStack deployments :( Maybe you could add some simple advice on this to the RDO quickstart document. I just finished the new OpenStack deployment tonight, and I will test autoscaling tomorrow.

BTW, thanks for your reply again.

Best Regards
Alienyyg
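Since openstack-ceilometer-compute only lives where nova-compute runs, the checks from comment #11 are worth repeating on each compute node rather than on the controller; a sketch (the host names here are placeholders):

$ for node in compute-1 compute-2; do ssh -t $node 'rpm -qa | grep ceilometer-compute; sudo service openstack-ceilometer-compute status'; done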
Eoghan is correct: "openstack-ceilometer-alarm-evaluator should wait for the openstack-ceilometer-api service" is not what made autoscaling unavailable; something else was wrong with my OpenStack. I deployed OpenStack again and it seems I can now monitor the status of the guest machines, but I have run into something new, so I think I need to raise a new bug.

regards
alienyyg