Bug 1415017 - Can't scale up 2 overcloud computes
Summary: Can't scale up 2 overcloud computes
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Zane Bitter
QA Contact: Amit Ugol
Depends On:
Reported: 2017-01-20 00:26 UTC by David Hill
Modified: 2020-02-14 18:29 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2017-02-08 21:09:16 UTC
Target Upstream Version:


External references:
Red Hat Knowledge Base (Solution) 2260561 (last updated 2017-01-25 19:05:56 UTC)
Red Hat Knowledge Base (Solution) 2882651 (last updated 2017-01-20 07:40:03 UTC)

Description David Hill 2017-01-20 00:26:14 UTC
Description of problem:
We currently can't scale up by 2 compute nodes (from 3 to 5). What happened is that we originally tried scaling up to 10 computes and that failed. We deleted the nodes that failed to deploy, along with the working ones that had just been deployed, cleaned up the database, and retried with 5 computes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Scale up to --computescale 10.
2. The scale-up fails.
3. Delete some of the nodes with ironic node-delete.
4. Reset stack states directly in the Heat database:
   update stack set status='COMPLETE' where status='FAILED';
   update resource set status='COMPLETE' where action='DELETE';
5. Scale up to --computescale 5.
6. The stack update fails while adding the 2 new computes, and they do not appear to be getting IPs the way a working node does.
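The database reset in step 4 above can be illustrated with a self-contained sketch. This uses an in-memory SQLite database as a stand-in for the undercloud's real MariaDB Heat database, with heavily simplified `stack` and `resource` tables (the real Heat schema has many more columns); the table names and the uppercase status values ('FAILED', 'COMPLETE', 'DELETE') do follow Heat's conventions, but everything else here is illustrative only. Hand-editing Heat's database is unsupported and is exactly the kind of step that can leave the stack in the inconsistent state described below.

```python
# Illustration of the step-4 state reset, using an in-memory SQLite DB as a
# stand-in for Heat's MariaDB database. Table/column layout is simplified;
# the sample rows are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE stack    (id INTEGER PRIMARY KEY, name TEXT, action TEXT, status TEXT);
CREATE TABLE resource (id INTEGER PRIMARY KEY, name TEXT, action TEXT, status TEXT);
INSERT INTO stack VALUES
    (1, 'overcloud',             'UPDATE', 'FAILED'),
    (2, 'overcloud-Compute-abc', 'UPDATE', 'COMPLETE');
INSERT INTO resource VALUES
    (1, 'NovaCompute', 'DELETE', 'FAILED'),
    (2, 'NovaCompute', 'CREATE', 'COMPLETE');
""")

# The two updates from step 4, with Heat's uppercase status values:
db.execute("UPDATE stack SET status='COMPLETE' WHERE status='FAILED'")
db.execute("UPDATE resource SET status='COMPLETE' WHERE action='DELETE'")
db.commit()

# Every stack now reads COMPLETE, including ones that genuinely failed,
# which is why Heat can no longer tell which resources need replacing.
print(list(db.execute("SELECT name, status FROM stack")))
```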

Actual results:

Expected results:
We need to recover.

Additional info:

Comment 1 Zane Bitter 2017-01-20 16:04:07 UTC
I'm not clear on exactly what state the stack has ended up in as a result of the database editing.

Moving a nested stack from UPDATE_FAILED to UPDATE_COMPLETE should have no effect, since nested stacks are always updated, so I have no concerns about that one.

Moving a resource from DELETE_FAILED to DELETE_COMPLETE is an unnatural action for Heat (it normally removes the resource from the DB altogether after deleting). It may be that resources that were in DELETE_FAILED before are not getting replaced correctly.

The first thing to try would be to get back to a normal baseline by scaling back down to 3 nodes, so that all of the extra stacks and whatever bogus data they contain get deleted. Once that delete is succeeding, try scaling up to 5 nodes again.
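The recovery path suggested above can be sketched as two deploy runs. This is a hedged example: it assumes the OSP 8 `openstack overcloud deploy` syntax with the `--compute-scale` flag, and it omits any `-e` environment files that the original deployment used, which must all be repeated on every subsequent deploy.

```shell
# Step 1: scale back down to the known-good baseline of 3 computes, so the
# extra nested stacks (and any bogus data left by the hand-edited database)
# get deleted. Repeat every -e environment file from the original deployment.
openstack overcloud deploy --templates --compute-scale 3

# Step 2: once the scale-down completes cleanly, retry the scale-up to 5.
openstack overcloud deploy --templates --compute-scale 5
```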
