Bug 1430753 - Heat not able to maintain loadbalancer minimum member count
Summary: Heat not able to maintain loadbalancer minimum member count
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 12.0 (Pike)
Assignee: Zane Bitter
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-03-09 13:56 UTC by VIKRANT
Modified: 2020-07-16 09:17 UTC
CC List: 10 users

Fixed In Version: openstack-heat-9.0.0-0.20170728194225.cc4fdce.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 21:13:42 UTC
Target Upstream Version:
Embargoed:
tvignaud: needinfo+


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 422983 0 None None None 2017-08-04 14:11:04 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description VIKRANT 2017-03-09 13:56:06 UTC
Description of problem:

Heat not able to maintain loadbalancer minimum member count.

The customer is using two templates, one without the load-balancer configuration and another with the load-balancer configuration.

When a stack is created from the template *without* the load balancer and a minimum count of 2 nodes, deleting an instance manually with the "nova delete <instanceid>" command causes Heat to spin up a new instance and maintain the minimum instance count. However, the same does not happen when the stack is created from the template *with* the load balancer.

Here are the heartbeat alarm definitions used in the templates:

From the template without the load balancer:

~~~
    heartbeat_alarm:
        type: OS::Ceilometer::Alarm
        properties:
          comparison_operator: lt
          evaluation_periods: '1'
          meter_name: instance
          period: '60'
          statistic: count
          threshold: {get_param: desired_capacity}
          alarm_actions:
                - {get_attr: [java_server_scaleup_policy, alarm_url]}
          matching_metadata: {'metadata.user_metadata.stack': {get_param: "OS::stack_id"}}
~~~

From the template with the load balancer:

~~~
  heartbeat_alarm:
        type: OS::Ceilometer::Alarm
        properties:
          comparison_operator: lt
          evaluation_periods: '2'
          meter_name: instance
          period: '60'
          statistic: count
          threshold: {get_param: desired_capacity}
          alarm_actions:
                - {get_attr: [web_server_scaleup_policy, alarm_url]}
          matching_metadata: {'metadata.user_metadata.stack': {get_param: "OS::stack_id"}}
~~~



Version-Release number of selected component (if applicable):
RHEL OSP 9

# awk '/heat/ {print $1}' installed-rpms 
openstack-heat-api-6.1.0-1.el7.noarch
openstack-heat-api-cfn-6.1.0-1.el7.noarch
openstack-heat-api-cloudwatch-6.1.0-1.el7.noarch
openstack-heat-common-6.1.0-1.el7.noarch
openstack-heat-engine-6.1.0-1.el7.noarch
python-heatclient-1.2.0-1.el7ost.noarch


How reproducible:
Every time for the customer.

Steps to Reproduce:
1. Create a Heat stack with the load-balancer configuration and a minimum of two members.
2. Delete one of the member instances manually.
3. Heat does not spawn a new instance to maintain the minimum count defined in the template (see the command sketch below).
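
A minimal sketch of the reproduction from the CLI, assuming a parent stack named WebServer-Stack built from the load-balancer template; the template and instance names are placeholders:

~~~
# Create the stack whose autoscaling group has a minimum of two load-balanced members
openstack stack create -t webserver-lb.yaml WebServer-Stack

# Delete one group member outside of Heat
openstack server list
nova delete <instance-id>

# After the heartbeat alarm fires, the group should scale back to the minimum
# count; with the load-balancer template it does not.
openstack server list
~~~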

Actual results:
Heat does not spawn a new instance to maintain the minimum count.

Expected results:
Heat should spawn a new instance to maintain the minimum count.

Additional info:

Comment 4 Zane Bitter 2017-03-16 00:33:32 UTC
stack-show on the nested (autoscaling group) stack will tell you why it failed.

You can get the UUID of the nested stack by running "openstack stack resource show WebServer-Stack WebServerASG"; it is listed as the physical_resource_id.
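
A sketch of that lookup, using the stack and resource names above:

~~~
openstack stack resource show WebServer-Stack WebServerASG -c physical_resource_id
openstack stack show <physical_resource_id>   # stack_status_reason explains the failure
~~~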

Comment 6 Zane Bitter 2017-03-16 13:46:34 UTC
So the error is:

resources.tdhf2cznzqnd: StackValidationFailed: resources.member: Property error: member.Properties.address: Error validating value ''

So it looks like the scaling unit is a nested stack, and that it contains a resource named 'member' with a property called 'address', and the address is resolving to an empty string when it needs to be a valid IP address.

This could be a problem with the template, or it could be a bug in Heat. (At the validation stage, intrinsic functions like {get_attr: } don't return valid values, but Heat ought to cope with that gracefully.) Could you attach the lb_server.yaml template?

Comment 7 VIKRANT 2017-03-16 14:37:45 UTC
Here is the content of the lb_server.yaml template.

~~~
heat_template_version: 2013-05-23
description: A load-balancer server
parameters:
  image:
    type: string
    description: Image used for servers
  key_name:
    type: string
    description: SSH key to connect to the servers
  flavor:
    type: string
    description: flavor used by the servers
  pool_id:
    type: string
    description: Pool to contact
  user_data:
    type: string
    description: Server user_data
  metadata:
    type: json
  network:
    type: string
    description: Network used by the server

resources:
  server:
    type: OS::Nova::Server
    properties:
      flavor: {get_param: flavor}
      image: {get_param: image}
      key_name: {get_param: key_name}
      metadata: {get_param: metadata}
      user_data: {get_param: user_data}
      user_data_format: RAW
      networks: [{network: {get_param: network} }]
  member:
    type: OS::Neutron::PoolMember
    properties:
      pool_id: {get_param: pool_id}
      address: {get_attr: [server, first_address]}
      protocol_port: 80

outputs:
  server_ip:
    description: IP Address of the load-balanced server.
    value: { get_attr: [server, first_address] }
  lb_member:
    description: LB member details.
    value: { get_attr: [member, show] }
~~~

Comment 8 Zane Bitter 2017-03-16 16:39:33 UTC
OK, that looks like a Heat bug then... {get_attr: [server, first_address]} probably returns an empty string during validation, and the validation ought to be able to handle that but apparently it's complaining.

I wonder how it managed to create the autoscaling group in the first place without running into this issue for the initial members...

Comment 9 Zane Bitter 2017-03-16 19:02:05 UTC
I suspect it may be failing to get the IP addresses of the _existing_ members. For resources that aren't created yet, we should always get None returned for their attribute values without even asking the resource. An empty string (which is returned by the resource itself) suggests that the resource was in the created state but getting the server's address failed for some reason. That also explains how the group could be created initially but updating it fails.

The first_address attribute is deprecated. You should probably replace that line with:

  address: {get_attr: [server, networks, {get_param: network}, 0]}

That might actually resolve the problem.

Comment 10 VIKRANT 2017-03-17 09:54:35 UTC
Thanks, Zane. Suggested the same to the customer. Awaiting the customer's response.

Comment 11 VIKRANT 2017-03-20 10:00:07 UTC
Zane, as per the latest update from the customer, they are still hitting the issue after making the suggested change.

Comment 12 Punit Kundal 2017-03-22 13:21:19 UTC
Hello Zane,

Here's the updated template received from the customer:

heat_template_version: 2013-05-23
description: A load-balancer instance
parameters:
  image:
    type: string
    description: Image used for instances
  key_name:
    type: string
    description: SSH key to connect to the instances
  flavor:
    type: string
    description: flavor used by the instances
  pool_id:
    type: string
    description: Pool to contact
  user_data:
    type: string
    description: Server user_data
  metadata:
    type: json
  network:
    type: string
    description: Network used by the instance
  #security_groups:
  #  type: string
  #  description: Webinstance Security group
   
resources:
  server:
    type: OS::Nova::Server
    properties:
      flavor: {get_param: flavor}
      image: {get_param: image}
      key_name: {get_param: key_name}
      metadata: {get_param: metadata}
      user_data: {get_param: user_data}
      #security_groups: webserverSG
      #security_groups: [{security_groups: {get_param: security_groups}}]
      user_data_format: RAW
      networks: [{network: {get_param: network} }]
  member:
    type: OS::Neutron::PoolMember
    properties:
      pool_id: {get_param: pool_id}
      address: {get_attr: [server, networks, {get_param: network}, 0]}
      #address: {get_attr: [instance, first_address]}
      protocol_port: 80

outputs:
 # instance_ip:
  #  description: IP Address of the load-balanced instance.
    #value: { get_attr: [instance, first_address] }
  lb_member:
    description: LB member details.
    value: { get_attr: [member, show] }

As mentioned by Vikrant, the customer is still hitting the same issue.

Thanks.

Comment 13 Zane Bitter 2017-03-22 23:35:44 UTC
OK, after reading more carefully here, I see the cause of the problem. You're deleting a server from Nova manually, but Heat doesn't know that it's missing, so when it comes to validate the template the server is not found. This causes it to return a default value for the IP address (an empty string, as it happens), and that value is rejected because it is not a valid IP address for the pool member.

I'm not sure why this would have worked in Liberty but not Mitaka. Possibly the validation became more robust in Mitaka.

One thing you can do is mark the server resource that you've deleted as FAILED using the 'resource mark-unhealthy' command. That should convince the autoscaling template generator to remove that resource from the template.
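
For reference, a sketch of the workaround; <nested-stack-id> is the autoscaling group's nested stack (see comment 4) and <member-resource-name> is the group member whose server was deleted, both placeholders:

~~~
# Find the group member whose server was deleted outside of Heat
openstack stack resource list <nested-stack-id>

# Mark that member as unhealthy so the next stack update replaces it
openstack stack resource mark unhealthy <nested-stack-id> <member-resource-name> "server deleted via nova delete"

# Re-run the update with the existing template and parameters
openstack stack update --existing WebServer-Stack
~~~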

I'll continue investigating to see if there's a way we can avoid the error in this case.

Comment 14 Punit Kundal 2017-03-27 05:57:40 UTC
Hello Zane,

Many thanks for suggesting the resource mark-unhealthy command.

This suggestion has worked for the customer. After marking the deleted resource as unhealthy, a new instance was spawned automatically.

Could you please advise whether there is a way in which this can be incorporated into the Heat template itself?

Thanks and Regards,
Punit

Comment 15 Zane Bitter 2017-05-02 21:25:53 UTC
This should get fixed in Pike by https://review.openstack.org/#/c/422983/ when it merges.

For current releases... we _could_ fix the {get_attr: [instance, first_address]} attribute by changing the default value that it returns to something that will pass the IP address constraint, i.e. '0.0.0.0' instead of ''. But this attribute is already deprecated.

There's no sane way to make {get_attr: [server, networks, {get_param: network}, 0]} not return None. I haven't seen the error message for this case, but I suspect it's failing in a different spot - it's a required property with a value of None (which reads as nothing specified) at a time when Heat is expecting to have resolved the real value. I can't think of anything we can do to resolve that.

Comment 20 Ronnie Rasouli 2017-12-05 12:19:24 UTC
Configured the following stack templates:
heat_template_version: pike
resources:
  server:
    type: OS::Nova::Server
    properties:
      image:  cirros-0.3.5-x86_64-disk
      flavor: m1.nano
      networks:
      - network: heat-net
      - subnet: heat-subnet
  value:
    type: OS::Heat::Value
    properties:
      value: {get_attr: [server, first_address]} 
      type: string 

heat_template_version: pike
resources:
  asg:
    type: OS::Heat::AutoScalingGroup
    properties:
      resource:
        type: server.yaml
      min_size: 2
      desired_capacity: 3
      max_size: 5

  scale_up_policy:
    type: OS::Heat::ScalingPolicy
    properties:
      adjustment_type: change_in_capacity
      auto_scaling_group_id: {get_resource: asg}
      cooldown: 60
      scaling_adjustment: 1

  scale_dn_policy:
    type: OS::Heat::ScalingPolicy
    properties:
      adjustment_type: change_in_capacity
      auto_scaling_group_id: {get_resource: asg}
      cooldown: 60
      scaling_adjustment: '-1'

outputs:
  scale_up_url:
    value: {get_attr: [scale_up_policy, alarm_url]}
  scale_dn_url:
    value: {get_attr: [scale_dn_policy, alarm_url]}


A server was deleted, and by creating this stack, three new servers were scaled with the server stack attributes.
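
For reference, a rough sketch of how this verification could be driven from the CLI; only server.yaml is taken from the templates above, while the group template file name and stack name are assumptions:

~~~
# Assumes the first template above is saved as server.yaml and the group
# template as asg.yaml (asg.yaml and asg-test are placeholder names).
openstack stack create -t asg.yaml asg-test
openstack server list                        # three group members expected
nova delete <one-member-id>                  # delete one member outside of Heat
openstack stack update --existing asg-test   # with the fix, the update no longer fails
                                             # validating the missing server's address
~~~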

Comment 23 errata-xmlrpc 2017-12-13 21:13:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

