Bug 986378 - [RFE] Alarm threshold evaluation logic capable of wide scaling
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Upstream M2
Target Release: 4.0
Assigned To: Eoghan Glynn
QA Contact: Kevin Whitney
URL: https://blueprints.launchpad.net/ceil...
Keywords: FutureFeature, OtherQA
Depends On: 986381
Blocks: 973191 RHOS40RFE 1055813
 
Reported: 2013-07-19 11:42 EDT by Eoghan Glynn
Modified: 2016-04-26 15:34 EDT

Fixed In Version: openstack-ceilometer-2013.2-0.2.b1.el6ost
Doc Type: Enhancement
Doc Text:
Feature: Alarms are evaluated by comparing static thresholds against statistics aggregated by Ceilometer.
Reason: This enhancement allows users to be notified when the performance of their cloud resources crosses certain thresholds, and also allows automated workflows such as Heat autoscaling to be triggered.
Result: New ceilometer-alarm-evaluator and ceilometer-alarm-notifier services are provided.
Last Closed: 2013-12-19 19:14:07 EST
Type: Bug

External Trackers
Tracker           ID     Priority  Status  Summary  Last Updated
OpenStack gerrit  34468  None      None    None     Never

Description Eoghan Glynn 2013-07-19 11:42:01 EDT
We need logic to compare observed sample datapoints against alarm thresholds, and for this to be capable of being scaled either narrowly, when hosted by a singleton service, or very widely in order to trigger timely notifications on a very large population of alarms.

The threshold evaluator should encapsulate all the logic required to manage the dynamic state of a constrained set of alarms: polling for the required statistics over an appropriate time window, correcting for metric lag, handling sparse metrics, and initiating state transitions & notification when threshold crossing is detected.
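
As a rough illustration of the threshold-crossing decision described above, here is a minimal shell sketch. It is not the Ceilometer implementation: the function name evaluate_alarm and its argument order are hypothetical, and the transition rule (all periods breach means 'alarm', none breach means 'ok', a mix keeps the previous state, too few datapoints means 'insufficient data') is a simplifying assumption.

```shell
# Hypothetical helper, NOT Ceilometer code.
# Usage: evaluate_alarm PREV_STATE OP THRESHOLD PERIODS STAT...
# STAT... are the per-period statistics (e.g. avg cpu_util), oldest first.
evaluate_alarm() {
    prev=$1 op=$2 threshold=$3 periods=$4
    shift 4
    # Sparse metric: fewer datapoints than evaluation periods.
    if [ "$#" -lt "$periods" ]; then
        echo 'insufficient data'
        return 0
    fi
    printf '%s\n' "$@" | awk -v op="$op" -v t="$threshold" -v prev="$prev" '
        function breached(v) {
            if (op == "gt") return v >  t
            if (op == "ge") return v >= t
            if (op == "lt") return v <  t
            if (op == "le") return v <= t
            if (op == "eq") return v == t
            return v != t                  # anything else is treated as "ne"
        }
        { if (breached($1 + 0)) hits++; total++ }
        END {
            if (hits == total)  print "alarm"   # every period breached
            else if (hits == 0) print "ok"      # no period breached
            else                print prev      # mixed: keep previous state
        }'
}

evaluate_alarm ok gt 0.01 1 2.5
```

The last line mirrors the high_cpu alarms in the test plan: one evaluation period, threshold 0.01, operator gt.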

Upstream blueprint: https://blueprints.launchpad.net/ceilometer/+spec/alarm-distributed-threshold-evaluation
Comment 2 Ami Jeain 2013-07-31 05:35:37 EDT
Hi Eoghan,
Will this be part of RHOS 4.0?
Currently it is not marked as blocking the Havana tracker bug.
Comment 5 Eoghan Glynn 2013-09-23 09:15:28 EDT
Merged upstream as an FFE for RC1, so was not in the packages based on havana-3 but will be in the packages rebuilt for Havana RC1:

  https://github.com/openstack/ceilometer/commit/ede2329e
Comment 7 Eoghan Glynn 2013-10-21 11:14:11 EDT
How To Test
===========

Similarly to https://bugzilla.redhat.com/986381

0. Install an all-in-one deployment with packstack, plus an additional compute node.

Ensure the compute agent is gathering metrics at a reasonable cadence (for example every 60s, instead of the default of every 10 minutes):

  sudo sed -i '/^ *name: cpu_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' /etc/ceilometer/pipeline.yaml
  sudo service openstack-ceilometer-compute restart
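
To see what the sed expression changes before editing the live file, it can be run without -i against a scratch copy. The sketch below uses an abbreviated sample whose layout is an assumption about pipeline.yaml, not a copy of the shipped file.

```shell
# Abbreviated sample; the real /etc/ceilometer/pipeline.yaml may differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-
    name: meter_pipeline
    interval: 600
    meters:
        - "*"
-
    name: cpu_pipeline
    interval: 600
    meters:
        - "cpu"
EOF

# Same expression as in the step above, minus -i, so output goes to stdout.
result=$(sed '/^ *name: cpu_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Only the interval on the line directly after "name: cpu_pipeline" is rewritten; other pipelines keep their 600s interval.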


1. Ensure the ceilometer-alarm-evaluator and ceilometer-alarm-notifier services are running on the controller node:

  sudo yum install -y openstack-ceilometer-alarm
  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.PartitionedAlarmService 
  export CEILO_ALARM_SVCS='evaluator notifier'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc restart; done
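
To confirm which evaluation service is configured before restarting, the [alarm] section can be read back. This is a minimal sketch: get_ini_value is a hypothetical helper (not part of openstack-config), and the sample file contents are illustrative.

```shell
# Hypothetical helper: print the value of KEY in [SECTION] of an INI file.
# Usage: get_ini_value FILE SECTION KEY
get_ini_value() {
    awk -v section="$2" -v key="$3" '
        /^[ \t]*\[/ {                        # section header, e.g. [alarm]
            sub(/^[ \t]*\[/, ""); sub(/\].*$/, "")
            current = $0
            next
        }
        current == section && $0 !~ /^[ \t]*[#;]/ && index($0, "=") {
            k = $0; sub(/[ \t]*=.*$/, "", k); sub(/^[ \t]+/, "", k)
            if (k == key) {
                v = $0; sub(/^[^=]*=[ \t]*/, "", v)
                print v
                exit
            }
        }' "$1"
}

# Demo against an illustrative file rather than the live configuration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[DEFAULT]
debug = False
[alarm]
#evaluation_service = ceilometer.alarm.service.SingletonAlarmService
evaluation_service = ceilometer.alarm.service.PartitionedAlarmService
EOF
get_ini_value "$conf" alarm evaluation_service
```

On a real host the first argument would be /etc/ceilometer/ceilometer.conf.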


2. Ensure a second ceilometer-alarm-evaluator service is running on the compute node:

  sudo yum install -y openstack-ceilometer-alarm
  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.PartitionedAlarmService
  export CEILO_ALARM_SVCS='evaluator'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc start; done


3. Spin up an instance in the usual way:

  nova boot --image $IMAGE_ID --flavor 1 test_instance
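
The alarm commands in the later steps filter on $INSTANCE_ID, which these steps never set. One way to capture it is to parse the table printed by nova list. The helper name instance_id_by_name is hypothetical, and the sketch assumes the ID is the first column and the name the second.

```shell
# Hypothetical helper: read a nova-style ASCII table on stdin and print the
# ID of the first row whose Name column equals $1.
# Assumes column order "| ID | Name | ..." (an assumption about the output).
instance_id_by_name() {
    awk -F'|' -v name="$1" '
        {
            id = $2; n = $3
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", n)
            if (n == name) { print id; exit }
        }'
}

# Intended use on a live system:
#   INSTANCE_ID=$(nova list | instance_id_by_name test_instance)
```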


4. Create multiple alarms with thresholds sufficiently low that they are guaranteed to go into alarm:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name high_cpu_alarm_${i} --description 'instance running hot'  \
     --meter-name cpu_util  --threshold 0.01 --comparison-operator gt  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done


5. Ensure that the alarms are partitioned over the multiple evaluators:

  tail -f /var/log/ceilometer/alarm-evaluator.log | grep 'initiating evaluation cycle'
  
On each host, expect approximately half the alarms to be evaluated, i.e.

  '... initiating evaluation cycle on 5 alarms'
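
To compare the two hosts without watching the logs by eye, the alarm count can be pulled out of the most recent cycle message. A minimal sketch, assuming the message ends in "initiating evaluation cycle on N alarms" as quoted above; last_cycle_size is a hypothetical helper name.

```shell
# Hypothetical helper: read evaluator log text on stdin and print the alarm
# count from the last "initiating evaluation cycle on N alarms" line.
last_cycle_size() {
    sed -n 's/.*initiating evaluation cycle on \([0-9][0-9]*\) alarms.*/\1/p' | tail -n 1
}

# Intended use on each host (LOG is the evaluator log file):
#   last_cycle_size < "$LOG"
```

With 10 alarms and two evaluators, the numbers printed on the two hosts should add up to 10.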


6. Ensure all alarms have transitioned to the 'alarm' state:

  ceilometer alarm-list
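
With many alarms it is easier to tally the states than to scan the table. The sketch below locates the "State" column from the header row instead of hard-coding its position; count_alarm_states is a hypothetical helper, and the header text "State" is an assumption about the alarm-list output.

```shell
# Hypothetical helper: read an alarm-list style ASCII table on stdin and
# print "<state> <count>" lines, sorted by state.
count_alarm_states() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF < 3 { next }                    # skip +----+ border lines
        !col {                             # header row: find the State column
            for (i = 1; i <= NF; i++) if (trim($i) == "State") col = i
            next
        }
        { count[trim($col)]++ }
        END { for (s in count) print s, count[s] }' | sort
}

# Intended use on a live system:
#   ceilometer alarm-list | count_alarm_states
```

After step 6, the only line printed should be "alarm" followed by the total number of alarms.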


7. Create some more alarms:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name low_cpu_alarm_${i} --description 'instance running cold'  \
     --meter-name cpu_util  --threshold 99.9 --comparison-operator le  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done

and also delete a few alarms:

  ceilometer alarm-delete -a $ALARM_ID

and ensure that the alarm allocation is still roughly even between the evaluation services: 

  tail -f /var/log/ceilometer/alarm-evaluator.log | grep 'initiating evaluation cycle'


8. Shut down the partitioned ceilometer alarm service on each host:

  sudo service openstack-ceilometer-alarm-evaluator stop

then restart on the controller host *only* with the singleton evaluator:

  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.SingletonAlarmService 
  sudo service openstack-ceilometer-alarm-evaluator start


9. Reset all alarms to the 'ok' state and ensure that they flip back to 'alarm':

  for a in $(ceilometer alarm-list | grep _cpu_alarm_ | awk -F\| '{print $2}')
  do
    ceilometer alarm-update --state ok -a $a
  done
  
  sleep 60
  ceilometer alarm-list
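
A fixed sleep can finish just before the next evaluation cycle does, so polling with a timeout may be more robust. This is a minimal sketch of such a loop; wait_until is a hypothetical helper, and the predicate passed to it is left to the tester.

```shell
# Hypothetical helper: retry COMMAND until it succeeds or TIMEOUT_SECS pass.
# Usage: wait_until TIMEOUT_SECS INTERVAL_SECS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 once the timeout is reached.
wait_until() {
    timeout=$1 interval=$2
    shift 2
    waited=0
    until "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Intended use, with a tester-supplied predicate (hypothetical name):
#   wait_until 180 10 all_alarms_in_alarm_state && ceilometer alarm-list
```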
Comment 11 Eoghan Glynn 2013-12-11 08:03:23 EST
Pending the fix for:

  https://bugzilla.redhat.com/1040404

testing this requires that a less constrained firewall rule is added for the ceilometer-api service:

  $ INDEX=$(sudo iptables -L | grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n ceilometer-api | cut -f1 -d:)
  $ sudo iptables -I INPUT $INDEX -p tcp --dport 8777 -j ACCEPT
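
For anyone unsure what INDEX ends up holding, the same pipeline can be run against captured iptables -L output. In this sketch the function name input_rule_index and the sample listing are illustrative assumptions; the pipeline itself is the one used above.

```shell
# Hypothetical wrapper around the pipeline above: read `iptables -L` output
# on stdin and print the position, within the INPUT chain, of the first rule
# mentioning $1.
input_rule_index() {
    grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n "$1" | cut -f1 -d: | head -n 1
}

# Illustrative listing; real output will differ.
input_rule_index ceilometer-api <<'EOF'
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     tcp  --  192.168.0.10         anywhere            multiport dports 8777 /* 001 ceilometer-api incoming */
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
EOF
```

The `grep -- --` stage keeps only rule lines (each has "--" in the opt column), so the line number reported by `grep -n` is the rule's position in the chain, and `iptables -I INPUT $INDEX` places the broader ACCEPT rule directly ahead of the restrictive one.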
Comment 15 errata-xmlrpc 2013-12-19 19:14:07 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1859.html
