Bug 986378

Summary: [RFE] Alarm threshold evaluation logic capable of wide scaling
Product: Red Hat OpenStack Reporter: Eoghan Glynn <eglynn>
Component: openstack-ceilometer Assignee: Eoghan Glynn <eglynn>
Status: CLOSED ERRATA QA Contact: Kevin Whitney <kwhitney>
Severity: high Docs Contact:
Priority: high    
Version: 4.0 CC: ajeain, eglynn, jruzicka, mlopes, pbrady, sgordon, sradvan, srevivo
Target Milestone: Upstream M2 Keywords: FutureFeature, OtherQA
Target Release: 4.0   
Hardware: Unspecified   
OS: Unspecified   
URL: https://blueprints.launchpad.net/ceilometer/+spec/alarm-distributed-threshold-evaluation
Whiteboard:
Fixed In Version: openstack-ceilometer-2013.2-0.2.b1.el6ost Doc Type: Enhancement
Doc Text:
Feature: Evaluation of alarms based on comparison of static thresholds against statistics aggregated by Ceilometer. Reason: This enhancement allows users to be notified when the performance of their cloud resources crosses certain thresholds, and also allows automated workflows such as Heat autoscaling to be triggered. Result: New ceilometer-alarm-evaluator and ceilometer-alarm-notifier services are provided.
Last Closed: 2013-12-20 00:14:07 UTC Type: Bug
Bug Depends On: 986381    
Bug Blocks: 973191, 975499, 1055813    

Description Eoghan Glynn 2013-07-19 15:42:01 UTC
We need logic to compare observed sample datapoints against alarm thresholds. This logic must be able to scale either narrowly, when hosted by a singleton service, or very widely, in order to trigger timely notifications for a very large population of alarms.

The threshold evaluator should encapsulate all the logic required to manage the dynamic state of a constrained set of alarms: polling for the required statistics over an appropriate time window, correcting for metric lag, handling sparse metrics, and initiating state transitions & notification when threshold crossing is detected.
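
As a rough illustration of the comparison step only, here is a minimal shell sketch. The function name evaluate_threshold is hypothetical and this is not the Ceilometer implementation. It assumes one aggregated statistic per evaluation period is already in hand, and it assumes that a mixed result (some periods crossing, some not) leaves the state unchanged.

```shell
# Hypothetical helper, not Ceilometer code: derive an alarm state from the
# per-period statistics of one evaluation window.
# usage: evaluate_threshold <gt|ge|lt|le|eq|ne> <threshold> [statistic ...]
evaluate_threshold() {
  local op=$1 threshold=$2
  shift 2
  # No datapoints at all (sparse metric): the state cannot be determined.
  if [ "$#" -eq 0 ]; then
    echo 'insufficient data'
    return 0
  fi
  printf '%s\n' "$@" | awk -v op="$op" -v t="$threshold" '
    function crossed(v) {
      if (op == "gt") return (v >  t)
      if (op == "ge") return (v >= t)
      if (op == "lt") return (v <  t)
      if (op == "le") return (v <= t)
      if (op == "eq") return (v == t)
      return (v != t)
    }
    BEGIN { t += 0 }                      # force numeric comparison
    { n++; if (crossed($1 + 0)) hits++ }  # count periods that cross
    END {
      if (hits == n)      print "alarm"      # every period crossed
      else if (hits == 0) print "ok"         # no period crossed
      else                print "unchanged"  # assumption: mixed result
    }
  '
}
```

For example, evaluate_threshold gt 0.01 5.2 prints alarm, which matches the deliberately low threshold used in the test steps further down.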

Upstream blueprint: https://blueprints.launchpad.net/ceilometer/+spec/alarm-distributed-threshold-evaluation

Comment 2 Ami Jeain 2013-07-31 09:35:37 UTC
Hi Eoghan,
Will this be part of RHOS 4.0?
Currently its "Blocks" field doesn't include the Havana tracker bug.

Comment 5 Eoghan Glynn 2013-09-23 13:15:28 UTC
Merged upstream as a feature freeze exception (FFE) for RC1, so it was not in the packages based on havana-3 but will be in the packages rebuilt for Havana RC1:

  https://github.com/openstack/ceilometer/commit/ede2329e

Comment 7 Eoghan Glynn 2013-10-21 15:14:11 UTC
How To Test
===========

Similar to the test procedure for https://bugzilla.redhat.com/986381

0. Install with packstack in all-in-one mode, and also install an additional compute node.

Ensure the compute agent is gathering metrics at a reasonable cadence (for example every 60 seconds instead of the default of every 10 minutes):

  sudo sed -i '/^ *name: cpu_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' /etc/ceilometer/pipeline.yaml
  sudo service openstack-ceilometer-compute restart
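
To confirm that the sed edit took effect, a small check along these lines may help. It is only a sketch: the function name cpu_pipeline_interval is made up, and it assumes (as the sed command above does) that the interval line directly follows the 'name: cpu_pipeline' line.

```shell
# Hypothetical helper: print the interval configured for cpu_pipeline.
# Assumes the "interval:" line immediately follows "name: cpu_pipeline",
# which is the same layout the sed command relies on.
# usage: cpu_pipeline_interval <path-to-pipeline.yaml>
cpu_pipeline_interval() {
  awk '/^ *name: cpu_pipeline$/ {
         getline                              # move to the following line
         if ($1 == "interval:") print $2
       }' "$1"
}
```

Running cpu_pipeline_interval /etc/ceilometer/pipeline.yaml should then print 60 after the change.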


1. Ensure the ceilometer-alarm-evaluator and ceilometer-alarm-notifier services are running on the controller node:

  sudo yum install -y openstack-ceilometer-alarm
  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.PartitionedAlarmService 
  export CEILO_ALARM_SVCS='evaluator notifier'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc restart; done


2. Ensure a second ceilometer-alarm-evaluator service is running on the compute node:

  sudo yum install -y openstack-ceilometer-alarm
  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.PartitionedAlarmService
  export CEILO_ALARM_SVCS='evaluator'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc start; done


3. Spin up an instance in the usual way:

  nova boot --image $IMAGE_ID --flavor 1 test_instance


4. Create multiple alarms with thresholds sufficiently low that they are guaranteed to go into alarm:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name high_cpu_alarm_${i} --description 'instance running hot'  \
     --meter-name cpu_util  --threshold 0.01 --comparison-operator gt  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done


5. Ensure that the alarms are partitioned over the multiple evaluators:

  tail -f /var/log/alarm-evaluator.log | grep 'initiating evaluation cycle'
  
On each host, expect approximately half the alarms to be evaluated, for example:

  '... initiating evaluation cycle on 5 alarms'
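
If eyeballing the tail output gets tedious, the count can be pulled out of the log and compared across hosts. This is a sketch with hypothetical function names (latest_cycle_size, roughly_even); it assumes the log line has exactly the wording quoted above, and the default tolerance of 1 is an arbitrary choice.

```shell
# Hypothetical helper: print the alarm count from the most recent
# "initiating evaluation cycle" line read on stdin.
latest_cycle_size() {
  grep -o 'initiating evaluation cycle on [0-9][0-9]* alarms' |
    tail -n 1 |
    awk '{ print $5 }'    # fifth word of the matched text is the count
}

# Hypothetical helper: succeed when two counts differ by at most
# <tolerance> (default 1, an arbitrary choice for this sketch).
# usage: roughly_even <count-a> <count-b> [tolerance]
roughly_even() {
  local tolerance=${3:-1} diff
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  [ "$diff" -le "$tolerance" ]
}
```

For instance, run latest_cycle_size < /var/log/alarm-evaluator.log on each host and pass the two numbers to roughly_even.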


6. Ensure all alarms have transitioned to the 'alarm' state:

  ceilometer alarm-list
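
The alarm-list table can also be checked mechanically. The sketch below uses a hypothetical function name (alarms_not_in_state) and assumes the table has a header column literally named "State" and the alarm ID in the first column; adjust if your client version prints something different.

```shell
# Hypothetical helper: read an ASCII table on stdin and print the IDs of
# alarms whose state differs from the wanted one.
# Assumes a header cell named "State" and the alarm ID in the first column.
# usage: ceilometer alarm-list | alarms_not_in_state <state>
alarms_not_in_state() {
  local want=$1
  awk -F'|' -v want="$want" '
    function trim(s) { gsub(/^ +| +$/, "", s); return s }
    NF < 3 { next }                 # skip +----+ border lines
    !col {                          # first real row is the header
      for (i = 2; i < NF; i++) if (trim($i) == "State") col = i
      next
    }
    trim($col) != want { print trim($2) }
    END { if (!col) exit 2 }        # no "State" column found
  '
}
```

An empty result from ceilometer alarm-list | alarms_not_in_state alarm means every listed alarm is in the 'alarm' state.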


7. Create some more alarms:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name low_cpu_alarm_${i} --description 'instance running cold'  \
     --meter-name cpu_util  --threshold 99.9 --comparison-operator le  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done

and also delete a few alarms:

  ceilometer alarm-delete -a $ALARM_ID

and ensure that the alarm allocation is still roughly even between the evaluation services: 

  tail -f /var/log/alarm-evaluator.log | grep 'initiating evaluation cycle'


8. Shut down the partitioned ceilometer alarm service on each host:

  sudo service openstack-ceilometer-alarm-evaluator stop

then restart on the controller host *only* with the singleton evaluator:

  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.SingletonAlarmService 
  sudo service openstack-ceilometer-alarm-evaluator start
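
Before restarting, it can be worth confirming which evaluation_service the config file actually holds. The following is a sketch only: ini_get is a hypothetical helper, it assumes plain "key = value" or "key=value" lines with no leading whitespace, and it would truncate a value that itself contains '='.

```shell
# Hypothetical helper: print the value of <key> in [<section>] of an
# INI-style file. Assumes unindented "key = value" or "key=value" lines.
# usage: ini_get <file> <section> <key>
ini_get() {
  awk -F' *= *' -v section="[$2]" -v key="$3" '
    /^\[/ { in_section = ($0 == section); next }   # track current section
    in_section && $1 == key { print $2; exit }
  ' "$1"
}
```

For example, ini_get /etc/ceilometer/ceilometer.conf alarm evaluation_service should print ceilometer.alarm.service.SingletonAlarmService on the controller at this point.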


9. Reset all alarms to the 'ok' state and ensure that they flip back to 'alarm':

  for a in $(ceilometer alarm-list | grep _cpu_alarm_ | awk -F\| '{print $2}')
  do
    ceilometer alarm-update --state ok -a $a
  done
  
  sleep 60
  ceilometer alarm-list
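
A fixed sleep can be too short if an evaluation cycle has only just started. One possible alternative is to poll until a check passes. The sketch below is generic; retry_until is a hypothetical helper, and the check command is whatever you choose to supply.

```shell
# Hypothetical helper: run a command repeatedly until it succeeds or the
# attempts are used up.
# usage: retry_until <attempts> <delay-seconds> <command> [args ...]
retry_until() {
  local attempts=$1 delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@"; then
      return 0                      # check passed
    fi
    attempts=$(( attempts - 1 ))
    if [ "$attempts" -gt 0 ]; then
      sleep "$delay"                # wait before the next attempt
    fi
  done
  return 1                          # gave up
}
```

For example, retry_until 12 10 my_check would try a check function of your own (my_check is a placeholder) every 10 seconds for up to roughly two minutes.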

Comment 11 Eoghan Glynn 2013-12-11 13:03:23 UTC
Pending the fix for:

  https://bugzilla.redhat.com/1040404

testing this requires adding a less restrictive firewall rule for the ceilometer-api service:

  $ INDEX=$(sudo iptables -L | grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n ceilometer-api | cut -f1 -d:)
  $ sudo iptables -I INPUT $INDEX -p tcp --dport 8777 -j ACCEPT
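
An alternative way to find the rule position, offered only as a sketch, is to let iptables number the rules itself with --line-numbers and pick out the first rule that mentions ceilometer-api. The function name first_rule_number is hypothetical, and it assumes the existing rule is identifiable by that string in the listing, as the grep above also assumes.

```shell
# Hypothetical helper: read "iptables -L INPUT --line-numbers" output on
# stdin and print the number of the first rule matching <pattern>.
# usage: sudo iptables -L INPUT --line-numbers | first_rule_number <pattern>
first_rule_number() {
  awk -v pat="$1" '
    $1 ~ /^[0-9]+$/ && $0 ~ pat { print $1; exit }  # numbered rule lines only
  '
}
```

The printed number can then be used in place of $INDEX in the iptables -I INPUT command.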

Comment 15 errata-xmlrpc 2013-12-20 00:14:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1859.html