986378 – [RFE] Alarm threshold evaluation logic capable of wide scaling


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-ceilometer
Sub Component:
Version:	4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	Upstream M2
Target Release:	4.0
Assignee:	Eoghan Glynn
QA Contact:	Kevin Whitney
Docs Contact:
URL:	https://blueprints.launchpad.net/ceil...
Whiteboard:
Depends On:	986381
Blocks:	973191 RHOS40RFE 1055813

Reported:	2013-07-19 15:42 UTC by Eoghan Glynn
Modified:	2016-04-26 19:34 UTC
CC List:	8 users
Fixed In Version:	openstack-ceilometer-2013.2-0.2.b1.el6ost
Doc Type:	Enhancement
Doc Text:	Feature: Evaluation of alarms based on comparison of static thresholds against statistics aggregated by Ceilometer. Reason: This enhancement allows users to be notified when the performance of their cloud resources crosses certain thresholds, and also allows automated workflows such as Heat autoscaling to be triggered. Result: New ceilometer-alarm-evaluator and ceilometer-alarm-notifier services are provided.
Clone Of:
Environment:
Last Closed:	2013-12-20 00:14:07 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenStack gerrit	34468	0	None	None	None	Never
Red Hat Product Errata	RHEA-2013:1859	0	normal	SHIPPED_LIVE	Red Hat Enterprise Linux OpenStack Platform Enhancement Advisory	2013-12-21 00:01:48 UTC

Description Eoghan Glynn 2013-07-19 15:42:01 UTC

We need logic to compare observed sample datapoints against alarm thresholds, and for this to be capable of being scaled either narrowly, when hosted by a singleton service, or very widely in order to trigger timely notifications on a very large population of alarms.

The threshold evaluator should encapsulate all the logic required to manage the dynamic state of a constrained set of alarms: polling for the required statistics over an appropriate time window, correcting for metric lag, handling sparse metrics, and initiating state transitions & notification when threshold crossing is detected.
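As a rough illustration of the comparison step only (not the Ceilometer implementation), the sketch below maps one aggregated statistic and a threshold to an alarm state. The function name evaluate_threshold is hypothetical. The operator names follow the --comparison-operator values accepted by the CLI, and an empty statistic stands in for a sparse metric with no datapoints in the window.

```shell
# Hypothetical helper: compare one aggregated statistic against a threshold.
# Usage: evaluate_threshold STAT OPERATOR THRESHOLD
# Prints 'alarm', 'ok', or 'insufficient data' (when STAT is empty).
evaluate_threshold() {
  stat=$1 op=$2 threshold=$3
  if [ -z "$stat" ]; then
    echo 'insufficient data'
    return 0
  fi
  # awk does the floating-point comparison that plain sh cannot.
  awk -v s="$stat" -v o="$op" -v t="$threshold" 'BEGIN {
    s += 0; t += 0
    if      (o == "gt") hit = (s >  t)
    else if (o == "ge") hit = (s >= t)
    else if (o == "lt") hit = (s <  t)
    else if (o == "le") hit = (s <= t)
    else if (o == "eq") hit = (s == t)
    else if (o == "ne") hit = (s != t)
    else { print "unknown operator: " o > "/dev/stderr"; exit 2 }
    print (hit ? "alarm" : "ok")
  }'
}
```

For example, evaluate_threshold 5.2 gt 0.01 reports 'alarm', which is the transition a deliberately low threshold is meant to provoke.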

Upstream blueprint: https://blueprints.launchpad.net/ceilometer/+spec/alarm-distributed-threshold-evaluation

Comment 2 Ami Jeain 2013-07-31 09:35:37 UTC

Hi Eoghan,
Will this be part of RHOS 4.0?
Currently its "Blocks" field does not list the Havana tracker bug.

Comment 5 Eoghan Glynn 2013-09-23 13:15:28 UTC

Merged upstream as an FFE for RC1, so it was not in the packages based on havana-3 but will be in the packages rebuilt for Havana RC1:

  https://github.com/openstack/ceilometer/commit/ede2329e

Comment 7 Eoghan Glynn 2013-10-21 15:14:11 UTC

How To Test
===========

Similar to the procedure for https://bugzilla.redhat.com/986381:

0. Install packstack allinone, and also deploy an additional compute node.

Ensure the compute agent is gathering metrics at a reasonable cadence (for example every 60s, instead of the default of every 10 minutes):

  sudo sed -i '/^ *name: cpu_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' /etc/ceilometer/pipeline.yaml
  sudo service openstack-ceilometer-compute restart
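To see what the sed expression changes before touching the live file, it can be run against a scratch copy. The YAML below is an assumed, cut-down stand-in and not the shipped pipeline.yaml. The point is that only the interval on the line directly after 'name: cpu_pipeline' is rewritten.

```shell
# Scratch copy with an assumed, cut-down layout (not the shipped file).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
-
    name: meter_pipeline
    interval: 600
-
    name: cpu_pipeline
    interval: 600
EOF

# Same expression as above, but without -i, so the result goes to stdout
# and the scratch file itself is left unmodified.
result=$(sed '/^ *name: cpu_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' "$tmp")
echo "$result"
rm -f "$tmp"
```

If the real file lays out the cpu_pipeline entry differently (for example with the interval not on the very next line), the expression would leave it unchanged, so it is worth checking the result with grep after the edit.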


1. Ensure the ceilometer-alarm-evaluator and ceilometer-alarm-notifier services are running on the controller node:

  sudo yum install -y openstack-ceilometer-alarm
  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.PartitionedAlarmService 
  export CEILO_ALARM_SVCS='evaluator notifier'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc restart; done


2. Ensure a second ceilometer-alarm-evaluator service is running on the compute node:

  sudo yum install -y openstack-ceilometer-alarm
  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.PartitionedAlarmService
  export CEILO_ALARM_SVCS='evaluator'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc start; done


3. Spin up an instance in the usual way:

  nova boot --image $IMAGE_ID --flavor 1 test_instance
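The later steps use $INSTANCE_ID, which these steps never set. One possible way to capture it is sketched below, assuming nova show prints its usual two-column Property/Value ASCII table. The helper name extract_id is hypothetical.

```shell
# Hypothetical helper: pull the value of the 'id' row out of a
# Property/Value ASCII table read on stdin.
extract_id() {
  awk -F'|' '$2 ~ /^ *id *$/ { gsub(/ /, "", $3); print $3; exit }'
}

# Intended use (needs a live cloud, so left commented out):
#   INSTANCE_ID=$(nova show test_instance | extract_id)
```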


4. Create multiple alarms with thresholds sufficiently low that they are guaranteed to go into alarm:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name high_cpu_alarm_${i} --description 'instance running hot'  \
     --meter-name cpu_util  --threshold 0.01 --comparison-operator gt  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done


5. Ensure that the alarms are partitioned over the multiple evaluators:

  tail -f /var/log/ceilometer/alarm-evaluator.log | grep 'initiating evaluation cycle'
  
On each host, expect approximately half the alarms to be evaluated, i.e.

  '... initiating evaluation cycle on 5 alarms'
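Rather than reading the counts by eye, the number can be pulled out of the most recent matching log line on each host. This is a sketch that assumes the message text is exactly as quoted above. The helper name last_cycle_count and the $EVALUATOR_LOG variable are hypothetical.

```shell
# Hypothetical helper: print the alarm count from the most recent
# 'initiating evaluation cycle on N alarms' line read on stdin.
last_cycle_count() {
  sed -n 's/.*initiating evaluation cycle on \([0-9][0-9]*\) alarms.*/\1/p' | tail -n 1
}

# Intended use on each host, with EVALUATOR_LOG set to the evaluator log path:
#   last_cycle_count < "$EVALUATOR_LOG"
```

With 10 alarms and two evaluators, the values reported on the two hosts should sum to 10 and be close to 5 each.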


6. Ensure all alarms have transitioned to the 'alarm' state:

  ceilometer alarm-list
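The check can also be scripted. The sketch below assumes alarm-list prints an ASCII table with a column headed 'State', and it locates that column by name rather than by position. The helper name count_not_in_alarm is hypothetical.

```shell
# Hypothetical helper: read an ASCII table on stdin, find the column headed
# 'State', and print the number of data rows whose state is not 'alarm'.
count_not_in_alarm() {
  awk -F'|' '
    function trim(s) { gsub(/^ +| +$/, "", s); return s }
    col == 0 {
      # Still looking for the header row; border lines have no State cell.
      for (i = 1; i <= NF; i++) if (trim($i) == "State") col = i
      next
    }
    NF > 1 && trim($col) != "alarm" { n++ }
    END { print n + 0 }
  '
}

# Intended use (needs a live cloud, so left commented out):
#   ceilometer alarm-list | count_not_in_alarm
```

A result of 0 means every listed alarm has reached the 'alarm' state.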


7. Create some more alarms:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name low_cpu_alarm_${i} --description 'instance running cold'  \
     --meter-name cpu_util  --threshold 99.9 --comparison-operator le  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done

and also delete a few alarms:

  ceilometer alarm-delete -a $ALARM_ID

and ensure that the alarm allocation is still roughly even between the evaluation services: 

  tail -f /var/log/ceilometer/alarm-evaluator.log | grep 'initiating evaluation cycle'


8. Shut down the partitioned ceilometer alarm service on each host:

  sudo service openstack-ceilometer-alarm-evaluator stop

then restart on the controller host *only* with the singleton evaluator:

  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.SingletonAlarmService 
  sudo service openstack-ceilometer-alarm-evaluator start


9. Reset all alarms to the 'ok' state and ensure that they flip back to 'alarm':

  for a in $(ceilometer alarm-list | grep _cpu_alarm_ | awk -F\| '{print $2}')
  do
    ceilometer alarm-update --state ok -a $a
  done
  
  sleep 60
  ceilometer alarm-list
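A fixed 60 second sleep may be too short if an evaluation cycle has only just passed. A small retry loop is one alternative. The sketch below is generic shell, not part of Ceilometer, and the helper name wait_until is hypothetical.

```shell
# Hypothetical helper: re-run a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  attempts=$1 delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    "$@" && return 0
    attempts=$((attempts - 1))
    # Only pause if another attempt is still to come.
    if [ "$attempts" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Intended use, with check_alarms standing in for whatever check you choose:
#   wait_until 12 10 check_alarms
```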

Comment 11 Eoghan Glynn 2013-12-11 13:03:23 UTC

Until the following bug is fixed:

  https://bugzilla.redhat.com/1040404

testing this requires adding a less restrictive firewall rule for the ceilometer-api service:

  $ INDEX=$(sudo iptables -L | grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n ceilometer-api | cut -f1 -d:)
  $ sudo iptables -I INPUT $INDEX -p tcp --dport 8777 -j ACCEPT

Comment 15 errata-xmlrpc 2013-12-20 00:14:07 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1859.html
