Bug 1430753
Summary: | Heat not able to maintain loadbalancer minimum member count | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | VIKRANT <vaggarwa>
Component: | openstack-heat | Assignee: | Zane Bitter <zbitter>
Status: | CLOSED ERRATA | QA Contact: | Ronnie Rasouli <rrasouli>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 9.0 (Mitaka) | CC: | aarapov, mburns, pkundal, rhel-osp-director-maint, sbaker, shardy, srevivo, therve, tvignaud, zbitter
Target Milestone: | rc | Keywords: | Triaged
Target Release: | 12.0 (Pike) | Flags: | tvignaud: needinfo+
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | openstack-heat-9.0.0-0.20170728194225.cc4fdce.el7ost | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-12-13 21:13:42 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
VIKRANT
2017-03-09 13:56:06 UTC
stack-show on the nested (autoscaling group) stack will tell you why it failed. You can get the UUID of the nested stack by running "openstack stack resource show WebServer-Stack WebServerASG"; it's listed as the physical_resource_id. So the error is:

resources.tdhf2cznzqnd: StackValidationFailed: resources.member: Property error: member.Properties.address: Error validating value ''

So it looks like the scaling unit is a nested stack, and that it contains a resource named 'member' with a property called 'address', and the address is resolving to an empty string when it needs to be a valid IP address. This could be a problem with the template, or it could be a bug in Heat. (At the validation stage, intrinsic functions like {get_attr: } don't return valid values, but Heat ought to cope with that gracefully.) Could you attach the lb_server.yaml template?

Here is the content of the lb_server.yaml template:

~~~
heat_template_version: 2013-05-23
description: A load-balancer server
parameters:
  image:
    type: string
    description: Image used for servers
  key_name:
    type: string
    description: SSH key to connect to the servers
  flavor:
    type: string
    description: flavor used by the servers
  pool_id:
    type: string
    description: Pool to contact
  user_data:
    type: string
    description: Server user_data
  metadata:
    type: json
  network:
    type: string
    description: Network used by the server
resources:
  server:
    type: OS::Nova::Server
    properties:
      flavor: {get_param: flavor}
      image: {get_param: image}
      key_name: {get_param: key_name}
      metadata: {get_param: metadata}
      user_data: {get_param: user_data}
      user_data_format: RAW
      networks: [{network: {get_param: network} }]
  member:
    type: OS::Neutron::PoolMember
    properties:
      pool_id: {get_param: pool_id}
      address: {get_attr: [server, first_address]}
      protocol_port: 80
outputs:
  server_ip:
    description: IP Address of the load-balanced server.
    value: { get_attr: [server, first_address] }
  lb_member:
    description: LB member details.
    value: { get_attr: [member, show] }
~~~

OK, that looks like a Heat bug then... {get_attr: [server, first_address]} probably returns an empty string during validation, and the validation ought to be able to handle that, but apparently it's complaining. I wonder how it managed to create the autoscaling group in the first place without running into this issue for the initial members...

I suspect it may be failing to get the IP addresses of the _existing_ members. For resources that aren't created yet, we should always get None returned for their attribute values without even asking the resource. An empty string (which is returned by the resource itself) suggests that the resource was in the created state but getting the server's address failed for some reason. That also explains how the group could be created initially but updating it fails.

The first_address attribute is deprecated. You should probably replace that line with:

address: {get_attr: [server, networks, {get_param: network}, 0]}

That might actually resolve the problem.

Thanks Zane. Suggested the same to the customer. Awaiting the customer's response.

Zane, as per the latest update from the customer, they are still hitting the issue after making the suggested change.
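The distinction drawn above between None (a resource not yet created) and an empty string (a created resource whose address lookup failed) can be modelled with a minimal Python sketch. This is not Heat's actual code; the class and names are made up purely to illustrate the reasoning.

```python
# Minimal sketch (NOT Heat's real implementation) of the attribute
# resolution behaviour described above: a resource that has not been
# created yet resolves every attribute to None without being asked,
# while a created resource whose backing server is gone returns its
# own default value, here an empty string.

class Resource:
    def __init__(self, created, address=None):
        self.created = created
        self._address = address

    def get_attr(self, name):
        if not self.created:
            # Not created yet: placeholder value, resource never consulted.
            return None
        if name == 'first_address':
            # Created in Heat, but the address lookup failed (e.g. the
            # server was deleted out-of-band): fall back to ''.
            return self._address if self._address is not None else ''
        raise KeyError(name)

new_server = Resource(created=False)
deleted_server = Resource(created=True)  # exists in Heat, gone from Nova

assert new_server.get_attr('first_address') is None
assert deleted_server.get_attr('first_address') == ''  # rejected as an IP
```

The empty string is what then trips the pool member's address validation, which is consistent with the group creating fine initially but failing on update.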
Hello Zane, here's the updated template received from the customer:

~~~
heat_template_version: 2013-05-23
description: A load-balancer instance
parameters:
  image:
    type: string
    description: Image used for instances
  key_name:
    type: string
    description: SSH key to connect to the instances
  flavor:
    type: string
    description: flavor used by the instances
  pool_id:
    type: string
    description: Pool to contact
  user_data:
    type: string
    description: Server user_data
  metadata:
    type: json
  network:
    type: string
    description: Network used by the instance
  #security_groups:
  #  type: string
  #  description: Webinstance Security group
resources:
  server:
    type: OS::Nova::Server
    properties:
      flavor: {get_param: flavor}
      image: {get_param: image}
      key_name: {get_param: key_name}
      metadata: {get_param: metadata}
      user_data: {get_param: user_data}
      #security_groups: webserverSG
      #security_groups: [{security_groups: {get_param: security_groups}}]
      user_data_format: RAW
      networks: [{network: {get_param: network} }]
  member:
    type: OS::Neutron::PoolMember
    properties:
      pool_id: {get_param: pool_id}
      address: {get_attr: [server, networks, {get_param: network}, 0]}
      #address: {get_attr: [instance, first_address]}
      protocol_port: 80
outputs:
  # instance_ip:
  #   description: IP Address of the load-balanced instance.
  #   value: { get_attr: [instance, first_address] }
  lb_member:
    description: LB member details.
    value: { get_attr: [member, show] }
~~~

As mentioned by Vikrant, the customer is still hitting the same issue. Thanks.

OK, after reading more carefully here, I see the cause of the problem. You're deleting a server from Nova manually, but Heat doesn't know that it's missing, so when it comes to validate the template the server is not found. This causes it to return a default value for the IP address (an empty string, as it happens), and that is being rejected as a valid IP address by the pool member. I'm not sure why this would have worked in Liberty but not Mitaka. Possibly the validation became more robust in Mitaka.
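The updated template's address lookup walks a path through the server's resolved attributes: [server, networks, {get_param: network}, 0]. A small Python sketch of that kind of path traversal may help; the network name and address below are invented for illustration and do not come from the bug report.

```python
# Hypothetical sketch of resolving a get_attr path like
# [server, networks, <network-name>, 0] against a server's attribute
# data. The structure and values here are illustrative only.
attributes = {
    'networks': {
        'private-net': ['10.0.0.5'],  # list of addresses on that network
    },
    'first_address': '10.0.0.5',      # the deprecated flat attribute
}

def resolve_attr_path(attrs, path):
    """Follow each key/index in `path` into the nested attribute data."""
    value = attrs
    for key in path:
        value = value[key]
    return value

assert resolve_attr_path(attributes, ['networks', 'private-net', 0]) == '10.0.0.5'
```

If the server is missing, there is no such nested structure to walk, so the whole expression resolves to a placeholder instead of an address, which matches the failure described above.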
One thing you can do is mark the server resource that you've deleted as FAILED using the 'resource mark-unhealthy' command. That should convince the autoscaling template generator to remove that resource from the template. I'll continue investigating to see if there's a way we can avoid the error in this case.

Hello Zane, many thanks for suggesting the resource mark-unhealthy command. This suggestion has worked for the customer. After marking the deleted resource as unhealthy, a new instance was spawned automatically. Could you please advise if there is a way in which this can be incorporated in the Heat template itself? Thanks and Regards, Punit

This should get fixed in Pike by https://review.openstack.org/#/c/422983/ when it merges. For current releases... we _could_ fix the {get_attr: [instance, first_address]} attribute by changing the default value that it returns to something that will pass the IP address constraint, i.e. '0.0.0.0' instead of ''. But this attribute is already deprecated. There's no sane way to make {get_attr: [server, networks, {get_param: network}, 0]} not return None. I haven't seen the error message for this case, but I suspect it's failing in a different spot: it's a required property with a value of None (which reads as nothing specified) at a time when Heat is expecting to have resolved the real value. I can't think of anything we can do to resolve that.
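The suggestion above, returning '0.0.0.0' instead of '' as the default, works because '0.0.0.0' satisfies an IP address constraint while the empty string does not. A hedged illustration using Python's standard ipaddress module (not Heat's own constraint code; the 192.0.2.10 address is a made-up example):

```python
# Illustration of why '' fails address validation while a fallback of
# '0.0.0.0' would pass. This uses the stdlib ipaddress module as a
# stand-in for Heat's IP address constraint, which it approximates.
import ipaddress

def passes_ip_constraint(value):
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

assert not passes_ip_constraint('')        # the StackValidationFailed case
assert passes_ip_constraint('0.0.0.0')     # proposed fallback would validate
assert passes_ip_constraint('192.0.2.10')  # an ordinary member address
```

This only helps the deprecated first_address attribute, though; as noted above, the networks-path form resolves to None, which fails for a different reason (a required property with no value at all).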
Configured the following stack templates:

~~~
heat_template_version: pike
resources:
  server:
    type: OS::Nova::Server
    properties:
      image: cirros-0.3.5-x86_64-disk
      flavor: m1.nano
      networks:
        - network: heat-net
        - subnet: heat-subnet
  value:
    type: OS::Heat::Value
    properties:
      value: {get_attr: [server, first_address]}
      type: string
~~~

~~~
heat_template_version: pike
resources:
  asg:
    type: OS::Heat::AutoScalingGroup
    properties:
      resource:
        type: server.yaml
      min_size: 2
      desired_capacity: 3
      max_size: 5
  scale_up_policy:
    type: OS::Heat::ScalingPolicy
    properties:
      adjustment_type: change_in_capacity
      auto_scaling_group_id: {get_resource: asg}
      cooldown: 60
      scaling_adjustment: 1
  scale_dn_policy:
    type: OS::Heat::ScalingPolicy
    properties:
      adjustment_type: change_in_capacity
      auto_scaling_group_id: {get_resource: asg}
      cooldown: 60
      scaling_adjustment: '-1'
outputs:
  scale_up_url:
    value: {get_attr: [scale_up_policy, alarm_url]}
  scale_dn_url:
    value: {get_attr: [scale_dn_policy, alarm_url]}
~~~

A server was deleted, and by creating this stack three new servers were scaled with the server stack attributes.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462