Description of problem:
We currently can't scale up by 2 compute nodes (from 3 to 5). We first tried scaling up to 10 computes, which failed. We then deleted the nodes that failed to deploy as well as the working ones that had just been deployed, cleaned up the database, and retried with 5 computes ...
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Scale up to --computescale 10
2. The scale-up fails
3. Delete some nodes with ironic delete
4. Reset the stack states in the Heat database by running: update stack set status='COMPLETE' where status='failed'; followed by: update resource set status='COMPLETE' where action='DELETE';
5. Scale up to --computescale 5
6. The stack update fails while adding the 2 new compute nodes; they do not seem to get IPs the way a working node does.
Actual results:
Fails
Expected results:
We need to recover.
Additional info:
I'm not clear on exactly what state the stack has ended up in as a result of the database editing.
Moving a nested stack from UPDATE_FAILED to UPDATE_COMPLETE should have no effect, since nested stacks are always updated, so I have no concerns about that one.
Moving a resource from DELETE_FAILED to DELETE_COMPLETE is an unnatural action for Heat (it normally removes the resource from the DB altogether after deleting). It may be that resources that were in DELETE_FAILED before are not getting replaced correctly.
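To illustrate the concern, here is a minimal sketch using an in-memory SQLite table. The three-column schema and the resource names are hypothetical and far simpler than Heat's real database; only the UPDATE statement is taken from step 4. It shows that the manual edit leaves a DELETE/COMPLETE row behind, whereas a normal Heat delete would have removed the row entirely.

```python
import sqlite3


def simulate_manual_reset(rows):
    """Apply the step-4 resource UPDATE to a toy table; return the rows afterwards."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified stand-in for Heat's resource table.
    conn.execute("CREATE TABLE resource (name TEXT, action TEXT, status TEXT)")
    conn.executemany("INSERT INTO resource VALUES (?, ?, ?)", rows)
    # The statement from step 4 of the reproduction steps.
    conn.execute("UPDATE resource SET status='COMPLETE' WHERE action='DELETE'")
    result = conn.execute(
        "SELECT name, action, status FROM resource ORDER BY name"
    ).fetchall()
    conn.close()
    return result


# Made-up example data: one healthy resource, one stuck in DELETE_FAILED.
rows = [
    ("compute-3", "CREATE", "COMPLETE"),
    ("compute-4", "DELETE", "FAILED"),
]
after = simulate_manual_reset(rows)

# Rows that now claim DELETE_COMPLETE but still exist in the table.
leftover = [r for r in after if r[1] == "DELETE" and r[2] == "COMPLETE"]
print(leftover)
```

Any row reported in `leftover` is in a state Heat would not normally leave in the database, which is why such resources are suspects when the later scale-up misbehaves.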
The first thing to try would be to get back to a normal baseline by scaling back down to 3 nodes, so that all of the extra stacks and whatever bogus data they contain get deleted. Once that delete succeeds, try scaling up to 5 nodes again.