Bug 986381

Summary: [RFE] Alarm partitioning over multiple threshold evaluators
Product: Red Hat OpenStack
Reporter: Eoghan Glynn <eglynn>
Component: openstack-ceilometer
Assignee: Eoghan Glynn <eglynn>
Status: CLOSED ERRATA
QA Contact: Kevin Whitney <kwhitney>
Severity: medium
Priority: high
Version: 4.0
CC: ajeain, eglynn, jruzicka, mlopes, pbrady, sgordon, sradvan, srevivo
Target Milestone: rc
Keywords: FutureFeature, OtherQA
Target Release: 4.0
Hardware: Unspecified
OS: Unspecified
URL: https://blueprints.launchpad.net/ceilometer/+spec/alarm-service-partitioner
Fixed In Version: openstack-ceilometer-2013.2-0.12.rc2.el6ost
Doc Type: Enhancement
Doc Text:
Feature: Partitioning of alarm evaluation over a horizontally scaled-out, dynamic pool of workers.
Reason: This enhancement allows the evaluation workload to scale up to encompass many alarms, and also avoids a singleton evaluator becoming a single point of failure.
Result: The alarm.evaluation_service configuration option may be set to ceilometer.alarm.service.PartitionedAlarmService, in which case multiple ceilometer-alarm-evaluator service instances can be started on different hosts. These replicas self-organize and divide the evaluation workload among themselves via a group co-ordination protocol based on fanout RPC.
Last Closed: 2013-12-20 00:14:17 UTC
Type: Bug
Bug Blocks: 973191, 975499, 986378, 1055813

Description Eoghan Glynn 2013-07-19 15:58:06 UTC
We need a mechanism to split and balance the alarm threshold evaluation workload among workers.

This should allow the pool of workers to be dynamically resized, as the set of alarms to be evaluated grows or shrinks, with periodic re-balancing to account for obsoleted alarms.

Upstream blueprint: https://blueprints.launchpad.net/ceilometer/+spec/alarm-service-partitioner
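The intended division of labor can be illustrated with a toy sketch. This is not the actual ceilometer implementation (which coordinates a dynamic group over fanout RPC); the cksum-based slotting below is purely illustrative of how hashing alarm names over N evaluator slots spreads the workload roughly evenly:

```shell
# Toy illustration only: deterministically assign 10 alarm names to 2
# evaluator "slots" by hashing each name with cksum.  The real service
# negotiates assignments dynamically via a fanout-RPC coordination protocol.
EVALUATORS=2
: > /tmp/partition_demo.txt
for i in $(seq 10); do
    alarm="high_cpu_alarm_${i}"
    slot=$(( $(printf '%s' "$alarm" | cksum | cut -f1 -d' ') % EVALUATORS ))
    echo "$alarm -> evaluator $slot" | tee -a /tmp/partition_demo.txt
done
```

Raising EVALUATORS changes the slot arithmetic and so reshuffles the assignments, which is the toy analogue of re-balancing when the worker pool is resized; obsoleted alarms simply disappear from the assignment on the next pass.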

Comment 3 Eoghan Glynn 2013-09-23 13:25:55 UTC
Merged upstream as an FFE for RC1, so it was not in the packages based on havana-3, but it will be in the packages rebuilt for Havana RC1:

  https://github.com/openstack/ceilometer/commit/ede2329e

(Note that the above logic also enables the widely scaled threshold evaluation required in BZ 986378).

Comment 7 Eoghan Glynn 2013-10-02 11:33:15 UTC
I've been waiting for RC1 to be cut upstream, and then for the follow-on changes to the rebuilt openstack-ceilometer-* packages, before writing up a test approach in the BZ.

The reason upstream RC1 is a blocker is that an integral part of the mechanism to be tested here landed as an FFE post Havana-3, so it is not present in our RPMs as things stand.

Now, the upstream RC1 was due to be cut late last week, but it was delayed until today (Oct 2nd) by a couple of laggard bug fixes and by problems in the tempest gate that added hugely to the gerrit turnaround time.

However, everything has now landed as of late yesterday, and the release candidate will be cut shortly. I have the packaging folks on notice about the changes that will be required, so we should have rebuilt RPMs by the end of the week, at which point testing can commence.

Further information to follow once the new packages are available.

Comment 8 Pádraig Brady 2013-10-03 22:57:52 UTC
New puddle contains the required version of ceilometer:
http://download.lab.bos.redhat.com/rel-eng/OpenStack/4.0/2013-10-03.3

Comment 9 Eoghan Glynn 2013-10-21 14:55:23 UTC
How To Test
===========

0. Install a packstack all-in-one deployment, plus an additional compute node.

Ensure the compute agent is gathering metrics at a reasonable cadence (for example every 60s, instead of the default of every 10 minutes):

  sudo sed -i '/^ *name: cpu_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' /etc/ceilometer/pipeline.yaml
  sudo service openstack-ceilometer-compute restart


1. Ensure the ceilometer-alarm-evaluator and ceilometer-alarm-notifier services are running on the controller node:

  sudo yum install -y openstack-ceilometer-alarm
  export CEILO_ALARM_SVCS='evaluator notifier'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc start; done


2. Ensure a second ceilometer-alarm-evaluator service is running on the compute node:

  sudo yum install -y openstack-ceilometer-alarm
  export CEILO_ALARM_SVCS='evaluator'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc start; done


3. Spin up an instance in the usual way:

  nova boot --image $IMAGE_ID --flavor 1 test_instance
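The later steps reference $INSTANCE_ID, which can be captured by parsing the `nova list` table. The sample output below is hypothetical (invented UUID, columns trimmed) so the parsing can be shown end to end; against a live deployment, substitute `nova list` for the canned file:

```shell
# Hypothetical, trimmed `nova list` output standing in for the real command.
cat > /tmp/nova_list_sample.txt <<'EOF'
+--------------------------------------+---------------+--------+
| ID                                   | Name          | Status |
+--------------------------------------+---------------+--------+
| 4e5d0d3a-91bf-4f27-b99c-8a30b0b8ef66 | test_instance | ACTIVE |
+--------------------------------------+---------------+--------+
EOF
# Pick the test_instance row and extract the ID column.
INSTANCE_ID=$(grep ' test_instance ' /tmp/nova_list_sample.txt | awk -F'|' '{print $2}' | tr -d ' ')
echo "$INSTANCE_ID"
```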


4. Create multiple alarms with thresholds sufficiently low that they are guaranteed to go into alarm:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name high_cpu_alarm_${i} --description 'instance running hot'  \
     --meter-name cpu_util  --threshold 0.01 --comparison-operator gt  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done


5. Ensure that the alarms are partitioned over the multiple evaluators:

  tail -f /var/log/ceilometer/alarm-evaluator.log | grep 'initiating evaluation cycle'
  
On each host, expect approximately half the alarms to be evaluated, i.e.

  '... initiating evaluation cycle on 5 alarms'


6. Ensure all alarms have transitioned to the 'alarm' state:

  ceilometer alarm-list


7. Create some more alarms:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name low_cpu_alarm_${i} --description 'instance running cold'  \
     --meter-name cpu_util  --threshold 99.9 --comparison-operator le  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done

and also delete a few alarms:

  ceilometer alarm-delete -a $ALARM_ID

and ensure that the alarm allocation is still roughly even between the evaluation services: 

  tail -f /var/log/ceilometer/alarm-evaluator.log | grep 'initiating evaluation cycle'

Comment 10 Eoghan Glynn 2013-10-21 15:16:21 UTC
Addition to steps #1 & #2 above:

*Before* restarting the ceilometer-alarm-evaluator service, ensure that the partitioned evaluation service is configured:

  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.PartitionedAlarmService
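For reference, the resulting section of /etc/ceilometer/ceilometer.conf should then look like the fragment below (assuming no other alarm options have been customized):

```ini
[alarm]
evaluation_service = ceilometer.alarm.service.PartitionedAlarmService
```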

Comment 14 Eoghan Glynn 2013-12-11 13:04:54 UTC
Pending the fix for:

  https://bugzilla.redhat.com/1040404

testing this requires that a less constrained firewall rule is added for the ceilometer-api service:

  $ INDEX=$(sudo iptables -L | grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n ceilometer-api | cut -f1 -d:)
  $ sudo iptables -I INPUT $INDEX -p tcp --dport 8777 -j ACCEPT
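The INDEX pipeline above can be exercised against canned output to see what it computes: the first grep grabs the INPUT chain, `grep -- --` keeps only rule lines (each carries `--` in the opt column, skipping the header lines), and the line number of the ceilometer-api match becomes the rule position for `iptables -I`. The table below is a hypothetical, heavily trimmed rendering of `sudo iptables -L`:

```shell
# Hypothetical, trimmed `iptables -L` output; real output has more chains and rules.
cat > /tmp/iptables_sample.txt <<'EOF'
Chain INPUT (policy ACCEPT)
target     prot opt source       destination
ACCEPT     tcp  --  anywhere     anywhere     multiport dports 5672 /* 001 amqp */
ACCEPT     tcp  --  192.0.2.10   anywhere     multiport dports 8777 /* 002 ceilometer-api */
EOF
# Same pipeline as above, with the canned file standing in for `sudo iptables -L`.
INDEX=$(cat /tmp/iptables_sample.txt | grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n ceilometer-api | cut -f1 -d:)
echo "INDEX=$INDEX"
```

Here the ceilometer-api rule is the second rule line in the INPUT chain, so the new ACCEPT rule is inserted at position 2, ahead of the more restrictive packstack-generated rule.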

Comment 16 Eoghan Glynn 2013-12-12 13:48:36 UTC
This bug can now transition to VERIFIED as the iptables rule workaround is no longer required since openstack-packstack-2013.2.1-0.18.dev934.el6ost was built.

Comment 18 errata-xmlrpc 2013-12-20 00:14:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1859.html