Bug 1517066 - [RFE] Allow reusing blacklisted node indexes when a scale-out fails
Summary: [RFE] Allow reusing blacklisted node indexes when a scale-out fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 10.0 (Newton)
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: Upstream M3
Target Release: 13.0 (Queens)
Assignee: Zane Bitter
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks: 1717932
 
Reported: 2017-11-24 06:42 UTC by Anand Nande
Modified: 2024-06-13 20:50 UTC
CC List: 12 users

Fixed In Version: openstack-heat-10.0.1-0.20180314232330.c2a66b1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1717932
Environment:
Last Closed: 2018-06-27 13:39:31 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1741053 0 None None None 2018-01-04 13:51:25 UTC
OpenStack gerrit 530948 0 'None' MERGED Add removal_policies_mode to ResourceGroup 2021-02-01 04:31:11 UTC
Red Hat Issue Tracker OSP-7835 0 None None None 2022-03-13 14:59:06 UTC
Red Hat Knowledge Base (Solution) 2780661 0 None None Problem with node index when redeploying a compute node with predictable hostnames 2019-05-29 10:11:54 UTC
Red Hat Product Errata RHEA-2018:2086 0 None None None 2018-06-27 13:40:23 UTC

Description Anand Nande 2017-11-24 06:42:15 UTC
Description of problem:

Currently, Heat blacklists the node indices of overcloud nodes that fail to become part of the overcloud stack and are later removed manually by the user. These node indices remain permanently blacklisted:

MariaDB [heat]> select resource_data.value,resource.name from resource,resource_data where resource_data.`key`="name_blacklist" and resource_data.resource_id=resource.id;
+-------+---------+
| value | name    |
+-------+---------+
| 5     | Compute |
+-------+---------+
1 row in set (0.00 sec)

There should be a way for Heat to remove these indices from the blacklist and assign them to the same or different overcloud nodes during the next scale-out operation.

This is frustrating when there is a large number of failed scale-out nodes and the indices end up sparsely assigned to the overcloud nodes seen in 'nova list':
1. It confuses novice users as to why the numbering is not sequential.
2. When comparing the output of 'nova list' against 'ironic node-list', one cannot correlate the assigned node hostnames, due to the blacklisted indices.
3. Some customers want to maintain strict operational standards based on (2).


Version-Release number of selected component (if applicable): 
# OSP-10 : openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
         : openstack-heat-engine-7.0.6-1.el7ost.noarch

Comment 3 Zane Bitter 2017-12-05 20:32:07 UTC
The way that blacklists work in Heat - where they're sticky, and nodes remain blacklisted even if they're removed from the blacklist - was designed by and for TripleO. The reason is that blacklisting a node is not generally something you want to do in your templates, it's something you decide on the fly. So if Heat didn't maintain the state of the blacklist outside of the property value then you'd need to maintain that state externally. (If you don't remember the blacklist then you end up deleting healthy members at higher indices to fill in the gaps at lower indices.) In any event, it's impossible to change this behaviour without breaking existing templates anyway.
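
For context, here is a minimal sketch (the member type, names and property values are illustrative, not taken from this bug) of how a template blacklists a ResourceGroup member via removal_policies:

heat_template_version: 2016-10-14   # Newton

resources:
  compute_group:
    type: OS::Heat::ResourceGroup
    properties:
      count: 4
      # The member with index 2 is deleted on the next stack update, and its
      # index stays blacklisted even if this entry is later removed from the
      # template (the "sticky" behaviour described above).
      removal_policies:
        - resource_list: ['2']
      resource_def:
        type: OS::Nova::Server
        properties:
          name: compute-%index%        # illustrative naming scheme
          image: overcloud-full        # illustrative image
          flavor: baremetal            # illustrative flavor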

This isn't a great design, and in fact ResourceGroups and predictable indices aren't great ideas in general IMHO. For that reason heat introduced the "resource mark unhealthy" command, which aims to achieve the same purpose out of band of the templates, which more closely matches how it's actually used. Unfortunately because of TripleO's heavy reliance on predictable indices being matched up with predefined data such as hostnames and IP addresses, it is unable to make use of mark-unhealthy because Heat creates the replacement resource before removing the unhealthy one, and conflicts ensue.

One thing that could be improved in Heat is that if the ResourceGroup is scaled down below the level of a blacklisted index, that index could become available again on the next scale up. So if e.g. you had members 0, 1 & 3 with 2 blacklisted and you scaled down to size 2 (deleting 3), then when you scaled up again the next member would have index 2 instead of 3 unless 2 was _currently_ blacklisted in the template. This would improve usability slightly and probably shouldn't break most existing users, although technically it would be a change in behaviour. I'm not sure how big a difference it would make on the TripleO use case, though... OpenStack clouds don't tend to get smaller very often.

Comment 4 Zane Bitter 2017-12-07 15:32:33 UTC
Another possibility is that we could add a whitelist property to remove stuff from the blacklist. Obviously support for using that would then have to be added to TripleO.

It sounds like the problem is being exacerbated by the fact that TripleO no longer has a scale-down command but always uses node-delete (which permanently blacklists an index) for scaling down.

Comment 11 Zane Bitter 2018-03-19 16:56:17 UTC
The feature that merged upstream was a new 'removal_policies_mode' property on OS::Heat::ResourceGroup. The default value of this property is 'append', which maintains the previous behaviour: once a resource is blacklisted, it is blacklisted permanently; adding additional members to removal_policies adds them to the blacklist, but removing entries from removal_policies has no effect.

If the 'removal_policies_mode' is changed to 'update', the blacklist is set to the *current* contents of the removal policies, so if the user has removed any entries from there they will no longer be blacklisted and those members will get created (if their indices are smaller than the size of the group).
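
For illustration, a minimal sketch of how the new property might be used (the member type is a placeholder, not from this bug); with 'update' mode, deleting the resource_list entry on a later stack update un-blacklists index 1 and recreates that member:

heat_template_version: 2018-03-02   # Queens; the engine must be new enough to have removal_policies_mode

resources:
  my_group:
    type: OS::Heat::ResourceGroup
    properties:
      count: 3
      # 'update' makes the stored blacklist equal to the current contents of
      # removal_policies on each stack update; the default 'append' only ever
      # adds to it.
      removal_policies_mode: update
      removal_policies:
        - resource_list: ['1']
      resource_def:
        type: OS::Heat::RandomString   # placeholder member type for illustration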

To test this out you could do something like the following (a sketch of the group definition at each step follows the list):

* Create a ResourceGroup
* Update with removal_policies; check that the resource(s) are deleted
* Update with no/fewer removal_policies; check that nothing changes
* Update removal_policies_mode to 'update'; check that new resources are created
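
A rough sketch of what the group definition could look like over that sequence, with the steps mapped onto the template as comments (OS::Heat::RandomString is used as a stand-in member type, not taken from this bug):

heat_template_version: 2018-03-02   # Queens

resources:
  test_group:
    type: OS::Heat::ResourceGroup
    properties:
      # Step 1 - create with count: 3 and no removal_policies/removal_policies_mode;
      #          members 0, 1 and 2 exist.
      # Step 2 - update adding removal_policies: [{resource_list: ['1']}];
      #          member 1 is deleted and index 1 is blacklisted (default 'append' mode).
      # Step 3 - update with removal_policies: []; nothing changes, because
      #          'append' never drops entries from the stored blacklist.
      # Step 4 - update to the state shown here ('update' mode plus an empty
      #          removal_policies); the blacklist is cleared and member 1 is recreated.
      count: 3
      removal_policies_mode: update
      removal_policies: []
      resource_def:
        type: OS::Heat::RandomString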

Comment 16 Ronnie Rasouli 2018-04-09 08:07:30 UTC
Tested manually according to the test case; verified.

Comment 18 errata-xmlrpc 2018-06-27 13:39:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

