Bug 1034684

Summary: [heat] potential autoscaling headroom remains unused
Product: Red Hat OpenStack Reporter: Eoghan Glynn <eglynn>
Component: openstack-heat Assignee: Eoghan Glynn <eglynn>
Status: CLOSED ERRATA QA Contact: Kevin Whitney <kwhitney>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0 CC: breeler, ddomingo, eglynn, hateya, sbaker, sdake, shardy, srevivo, yeylon
Target Milestone: rc Keywords: OtherQA, Triaged
Target Release: 4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-heat-engine-2013.2-2.0.el6ost Doc Type: Bug Fix
Doc Text:
The Orchestration engine applied autoscaling adjustments to server group sizes without truncating them to the configured bounds. Specifically, autoscaling always applied the full configured scaling increment, regardless of the configured maximum or minimum group size, and dropped any adjustment that would overshoot a bound. This allowed certain scaling increment settings to prevent the autoscaling feature from ever reaching the minimum or maximum group size. For example, with a scale-up increment of 2, the only reachable maximum would be 4 if the configured maximum group size is 5. With this release, the autoscaling feature truncates scaling adjustments to the upper or lower bound in the case of an overshoot. This allows the Orchestration engine to automatically scale to the maximum and minimum group sizes, regardless of the configured scaling increments.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-12-20 00:39:08 UTC Type: Bug
Attachments:
Description Flags
Heat template that reproduces the issue. none

Description Eoghan Glynn 2013-11-26 10:51:00 UTC
Description of problem:

Potential headroom for autoscaling group growth or shrinkage remains unused if the adjustment doesn't *exactly* hit the max or min size, respectively.

Take for example an instance group with:

 * MinSize=1
 * MaxSize=6
 * ScaleupPolicy ScalingAdjustment=2
 * ScaledownPolicy ScalingAdjustment=-3

When the under-scaled alarm fires, the group will grow in increments of 2 from 1->3->5 and then grow no further, even if the under-scaled alarm condition persists. So the max group size is never reached.

Then if the over-scaled alarm fires subsequently, the group will shrink in one decrement of 3 from 5->2 and then shrink no further, even if the over-scaled alarm condition persists. So the group never returns to the min size.
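The stuck sequences above can be checked with a small simulation of the old, reject-on-overshoot behavior (a sketch of the semantics described here, not the actual Heat code):

```python
def adjust_no_truncation(size, delta, min_size, max_size):
    """Old behavior: drop any adjustment that would overshoot a bound."""
    new = size + delta
    if new < min_size or new > max_size:
        return size  # adjustment rejected entirely
    return new

# MinSize=1, MaxSize=6, scale-up +2, scale-down -3
size = 1
for _ in range(5):                      # under-scaled alarm keeps firing
    size = adjust_no_truncation(size, +2, 1, 6)
print(size)  # 5 -- stuck below MaxSize=6

for _ in range(5):                      # over-scaled alarm keeps firing
    size = adjust_no_truncation(size, -3, 1, 6)
print(size)  # 2 -- stuck above MinSize=1
```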

This may seem like an edge case, but it is actually quite likely to be hit, especially when the adjustment type is set to PercentChangeInCapacity: in that case it's non-trivial to choose min and max sizes such that compounded applications of the percentage delta always land exactly on the upper and lower bounds.
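To illustrate the PercentChangeInCapacity point, here is a small sketch with hypothetical numbers (not taken from the attached template): a group of 4 scaling up by 50% toward a maximum of 10 gets stuck at 9 under the old reject-on-overshoot behavior.

```python
def percent_step(size, percent):
    """Apply a PercentChangeInCapacity-style adjustment (rounded)."""
    return size + round(size * percent / 100)

size, max_size = 4, 10
sizes = [size]
while True:
    new = percent_step(size, 50)
    if new > max_size:      # old behavior: overshooting adjustment is rejected
        break
    size = new
    sizes.append(size)

print(sizes)  # [4, 6, 9] -- 9 + 50% overshoots, so MaxSize=10 is never reached
```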

More intuitive behavior would be to truncate the adjustment to the upper or lower bound in the case of an overshoot.
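A minimal sketch of that truncating behavior, assuming clamp-to-bound semantics (not the actual Heat implementation):

```python
def adjust(size, delta, min_size, max_size):
    """Truncate an overshooting adjustment to the nearest bound."""
    new = size + delta
    return max(min_size, min(new, max_size))

# Group from the example: MinSize=1, MaxSize=6, scale-up +2, scale-down -3
size = 1
ups = []
while size < 6:
    size = adjust(size, +2, 1, 6)
    ups.append(size)
print(ups)    # [3, 5, 6] -- MaxSize is now reachable

downs = []
while size > 1:
    size = adjust(size, -3, 1, 6)
    downs.append(size)
print(downs)  # [3, 1] -- MinSize is now reachable
```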

Version-Release number of selected component (if applicable):

openstack-heat-engine-2013.2-1.0.el6ost.noarch


How reproducible:

100%


Steps to Reproduce:

0. Install OpenStack, including heat & ceilometer. Ensure that the ceilometer compute agent is measuring cpu_util at a reasonable frequency (every minute, as opposed to the default 10 minutes):

  sudo sed -i '/^ *name: cpu_pipeline$/ { n ; s/interval: 600$/interval: 60/ }' /etc/ceilometer/pipeline.yaml
  sudo service openstack-ceilometer-compute restart


1. Upload the cirros images to glance, if not already present:

  sudo yum install -y wget
  wget http://launchpad.net/cirros/trunk/0.3.0/+download/cirros-0.3.0-x86_64-uec.tar.gz
  tar zxvf cirros-0.3.0-x86_64-uec.tar.gz 
  glance add name=cirros-aki is_public=true container_format=aki disk_format=aki < cirros-0.3.0-x86_64-vmlinuz
  glance add name=cirros-ari is_public=true container_format=ari disk_format=ari < cirros-0.3.0-x86_64-initrd 
  glance add name=cirros-ami is_public=true container_format=ami disk_format=ami \
     "kernel_id=$(glance index | awk '/cirros-aki/ {print $1}')" \
     "ramdisk_id=$(glance index | awk '/cirros-ari/ {print $1}')" < cirros-0.3.0-x86_64-blank.img  


2. Add a UserKey if not already present in nova:

  nova keypair-add --pub_key ~/.ssh/id_rsa.pub userkey


3. Create stack with the attached template:

  heat stack-create test_stack --template-file=template.yaml --parameters="KeyName=userkey;InstanceType=m1.tiny;ImageId=$CIRROS_AMI_IMAGE"


4. Wait for the stack creation to complete:

  watch "heat stack-show test_stack | grep status"


5. Check that the high and low CPU alarms transition into the alarm and ok states respectively within a couple of minutes:

  watch "ceilometer alarm-list | grep test_stack"


6. Verify that the peak number of servers never goes beyond 5 (whereas the declared MaxSize is 6):

  watch "nova list | grep ServerGroup | wc -l" 



Actual results:

The autoscaling group will max out at 5 instances, regardless of how long the under-scaled alarm persists.



Expected results:

The autoscaling group should max out at 6 instances.


Additional info:

This issue would also occur with native CloudWatch-style alarming, as opposed to ceilometer alarming.

Comment 1 Eoghan Glynn 2013-11-26 10:52:50 UTC
Created attachment 829191 [details]
Heat template that reproduces the issue.

Comment 2 Eoghan Glynn 2013-11-26 10:54:51 UTC
Fix proposed to master upstream:

  https://review.openstack.org/58343

Comment 4 Eoghan Glynn 2013-11-26 15:36:36 UTC
Fix landed on master upstream:

  http://github.com/openstack/heat/commit/2c25616e

Comment 5 Eoghan Glynn 2013-11-26 15:46:47 UTC
Fix proposed to stable/havana upstream:

  https://review.openstack.org/58552

Comment 6 Eoghan Glynn 2013-11-27 12:34:02 UTC
Fix landed on stable/havana upstream:

  https://github.com/openstack/heat/commit/a8c0b110

Comment 7 Eoghan Glynn 2013-11-27 12:43:21 UTC
Fix backported to internal rhos-4.0-rhel-6-patches branch:

  https://code.engineering.redhat.com/gerrit/16394

Comment 14 errata-xmlrpc 2013-12-20 00:39:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1859.html