Bug 986381 - [RFE] Alarm partitioning over multiple threshold evaluators
Summary: [RFE] Alarm partitioning over multiple threshold evaluators
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 4.0
Assignee: Eoghan Glynn
QA Contact: Kevin Whitney
URL: https://blueprints.launchpad.net/ceil...
Whiteboard:
Depends On:
Blocks: 973191 RHOS40RFE 986378 1055813
 
Reported: 2013-07-19 15:58 UTC by Eoghan Glynn
Modified: 2016-04-26 18:04 UTC
CC List: 8 users

Fixed In Version: openstack-ceilometer-2013.2-0.12.rc2.el6ost
Doc Type: Enhancement
Doc Text:
Feature: Partitioning of alarm evaluation over a horizontally scaled-out, dynamic pool of workers.
Reason: This enhancement allows the evaluation workload to scale up to encompass many alarms, and also avoids a singleton evaluator becoming a single point of failure.
Result: The alarm.evaluation_service configuration option may be set to ceilometer.alarm.service.PartitionedAlarmService, in which case multiple ceilometer-alarm-evaluator service instances can be started on different hosts. These replicas self-organize and divide the evaluation workload among themselves via a group co-ordination protocol based on fanout RPC.
Clone Of:
Environment:
Last Closed: 2013-12-20 00:14:17 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHEA-2013:1859 | Private: no | Priority: normal | Status: SHIPPED_LIVE | Summary: Red Hat Enterprise Linux OpenStack Platform Enhancement Advisory | Last Updated: 2013-12-21 00:01:48 UTC

Description Eoghan Glynn 2013-07-19 15:58:06 UTC
We need a mechanism to split and balance the alarm threshold evaluation workload among workers.

This should allow the pool of workers to be dynamically resized, as the set of alarms to be evaluated grows or shrinks, with periodic re-balancing to account for obsoleted alarms.

Upstream blueprint: https://blueprints.launchpad.net/ceilometer/+spec/alarm-service-partitioner

Comment 3 Eoghan Glynn 2013-09-23 13:25:55 UTC
Merged upstream as an FFE for RC1, so was not in the packages based on havana-3 but will be in the packages rebuilt for Havana RC1:

  https://github.com/openstack/ceilometer/commit/ede2329e

(Note that the above logic also enables the widely scaled threshold evaluation required in BZ 986378).

Comment 7 Eoghan Glynn 2013-10-02 11:33:15 UTC
I've been waiting for RC1 to be cut upstream, and then for the follow-on rebuild of the openstack-ceilometer-* packages, before writing up a test approach in this BZ.

The reason upstream RC1 is a blocker is that an integral part of the mechanism to be tested here landed as an FFE post Havana-3, so is not present in our RPMs as things stand.

The upstream RC1 was due to be cut late last week, but was delayed until today (Oct 2nd) by a couple of laggard bug fixes and by problems in the tempest gate that added hugely to the gerrit turnaround time.

However, everything has now landed as of late yesterday, and the release candidate will be cut shortly. I have the packaging folks on notice as to the changes that'll be required, so we should have rebuilt RPMs by the end of the week, at which point testing can commence.

Further information to follow once the new packages are available.

Comment 8 Pádraig Brady 2013-10-03 22:57:52 UTC
New puddle contains the required version of ceilometer:
http://download.lab.bos.redhat.com/rel-eng/OpenStack/4.0/2013-10-03.3

Comment 9 Eoghan Glynn 2013-10-21 14:55:23 UTC
How To Test
===========

0. Install a packstack allinone deployment, and also deploy an additional compute node.

Ensure the compute agent is gathering metrics at a reasonable cadence (for example every 60s instead of the default 10 minutes):

  sudo sed -i '/^ *name: cpu_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' /etc/ceilometer/pipeline.yaml
  sudo service openstack-ceilometer-compute restart
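
After the edit, the cpu_pipeline entry in /etc/ceilometer/pipeline.yaml should carry the shorter interval; the exact surrounding fields vary by release, but the relevant fragment should look roughly like:

  - name: cpu_pipeline
    interval: 60
    meters:
        - "cpu"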


1. Ensure the ceilometer-alarm-evaluator and ceilometer-alarm-notifier services are running on the controller node:

  sudo yum install -y openstack-ceilometer-alarm
  export CEILO_ALARM_SVCS='evaluator notifier'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc start; done


2. Ensure a second ceilometer-alarm-evaluator service is running on the compute node:

  sudo yum install -y openstack-ceilometer-alarm
  export CEILO_ALARM_SVCS='evaluator'
  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc start; done
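
To confirm the services came up on each node, the same loop can be re-run with 'status' in place of 'start':

  for svc in $CEILO_ALARM_SVCS; do sudo service openstack-ceilometer-alarm-$svc status; done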


3. Spin up an instance in the usual way:

  nova boot --image $IMAGE_ID --flavor 1 test_instance
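
($IMAGE_ID above, and the $INSTANCE_ID used in the alarm queries below, are assumed to be set beforehand; one illustrative way to capture them, assuming a glance image named 'cirros' is available:)

  IMAGE_ID=$(glance image-list | awk '/ cirros / {print $2}')
  INSTANCE_ID=$(nova show test_instance | awk '$2 == "id" {print $4}')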


4. Create multiple alarms with thresholds sufficiently low that they are guaranteed to go into alarm:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name high_cpu_alarm_${i} --description 'instance running hot'  \
     --meter-name cpu_util  --threshold 0.01 --comparison-operator gt  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done
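
A quick way to confirm all ten alarms were created:

  ceilometer alarm-list | grep -c high_cpu_alarm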


5. Ensure that the alarms are partitioned over the multiple evaluators:

  tail -f /var/log/alarm-evaluator.log | grep 'initiating evaluation cycle'
  
On each host, expect approximately half the alarms to be evaluated, i.e.

  '... initiating evaluation cycle on 5 alarms'
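
The alarm count in the most recent cycle message on each host gives the split directly, e.g.:

  sudo grep 'initiating evaluation cycle' /var/log/alarm-evaluator.log | tail -n 1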


6. Ensure all alarms have transitioned to the 'alarm' state:

  ceilometer alarm-list


7. Create some more alarms:

  for i in $(seq 10)
  do
    ceilometer alarm-threshold-create --name low_cpu_alarm_${i} --description 'instance running cold'  \
     --meter-name cpu_util  --threshold 99.9 --comparison-operator le  --statistic avg \
     --period 60 --evaluation-periods 1 \
     --alarm-action 'log://' \
     --query resource_id=$INSTANCE_ID
  done

and also delete a few alarms:

  ceilometer alarm-delete -a $ALARM_ID
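
($ALARM_ID refers to the id of any existing alarm; one illustrative way to pick one, assuming the alarms created in step 4 exist:)

  ALARM_ID=$(ceilometer alarm-list | awk '/ high_cpu_alarm_1 / {print $2}')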

and ensure that the alarm allocation is still roughly even between the evaluation services: 

  tail -f /var/log/alarm-evaluator.log | grep 'initiating evaluation cycle'

Comment 10 Eoghan Glynn 2013-10-21 15:16:21 UTC
Addition to steps #1 & #2 above:

*Before* starting (or restarting) the ceilometer-alarm-evaluator services in steps #1 and #2, ensure that the partitioned evaluation service is configured:

  sudo openstack-config --set /etc/ceilometer/ceilometer.conf alarm evaluation_service ceilometer.alarm.service.PartitionedAlarmService
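
After running the command above, /etc/ceilometer/ceilometer.conf should contain a section like:

  [alarm]
  evaluation_service = ceilometer.alarm.service.PartitionedAlarmService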

Comment 14 Eoghan Glynn 2013-12-11 13:04:54 UTC
Pending the fix for:

  https://bugzilla.redhat.com/1040404

testing this requires that a less constrained firewall rule be added for the ceilometer-api service:

  $ INDEX=$(sudo iptables -L | grep -A 20 'INPUT.*policy ACCEPT' | grep -- -- | grep -n ceilometer-api | cut -f1 -d:)
  $ sudo iptables -I INPUT $INDEX -p tcp --dport 8777 -j ACCEPT
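
(The first command locates the position of the existing ceilometer-api rule within the INPUT chain; the second inserts an unrestricted ACCEPT rule for TCP port 8777 at that position, ahead of the more restrictive rule.)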

Comment 16 Eoghan Glynn 2013-12-12 13:48:36 UTC
This bug can now transition to VERIFIED as the iptables rule workaround is no longer required since openstack-packstack-2013.2.1-0.18.dev934.el6ost was built.

Comment 18 errata-xmlrpc 2013-12-20 00:14:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1859.html

