Bug 1415017

Summary: Can't scale up 2 overcloud computes
Product: Red Hat OpenStack
Reporter: David Hill <dhill>
Component: rhosp-director
Assignee: Zane Bitter <zbitter>
Status: CLOSED NOTABUG
QA Contact: Amit Ugol <augol>
Severity: high
Priority: unspecified
Version: 8.0 (Liberty)
CC: achernet, asoni, athomas, byount, dbecker, jcoufal, mburns, mcornea, morazi, rhel-osp-director-maint, sbaker, shardy, srevivo, zbitter
Target Milestone: ---
Target Release: ---
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-02-08 21:09:16 UTC

Description David Hill 2017-01-20 00:26:14 UTC
Description of problem:
We currently can't scale up by 2 compute nodes (from 3 to 5). What happened is that we first tried scaling up to 10 computes and that failed. We deleted the nodes that failed to deploy as well as the working ones that had just been deployed, cleaned up the database, and retried with 5 computes.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Scale up to --computescale 10
2. The scale-up fails
3. Delete some nodes with ironic delete
4. Reset the stack states directly in Heat's database (see the sketch after this list) with: update stack set status='COMPLETE' where status='failed'; and update resource set status='COMPLETE' where action='DELETE';
5. Scale up to --computescale 5
6. The stack update fails while adding the 2 new computes, and it seems they are not getting IPs the way a working node does.
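
For reference, the manual reset from step 4 amounts to something like the following, run on the undercloud against Heat's database. This is only a rough sketch of what was reportedly done, not a recommended procedure: the database name 'heat' is the default, credentials may differ, and the lowercase 'failed' literal is reproduced exactly as reported even though Heat normally stores statuses in upper case (e.g. UPDATE_FAILED is stored as action='UPDATE', status='FAILED').

  # Manual state reset as described in step 4; run on the undercloud,
  # adjust credentials/host for your environment.
  mysql heat -e "UPDATE stack SET status='COMPLETE' WHERE status='failed';"
  mysql heat -e "UPDATE resource SET status='COMPLETE' WHERE action='DELETE';"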

Actual results:
Fails

Expected results:
The scale-up to 5 computes succeeds; we need a way to recover the environment.

Additional info:

Comment 1 Zane Bitter 2017-01-20 16:04:07 UTC
I'm not clear on exactly what state the stack has ended up in as a result of the database editing.

Moving a nested stack from UPDATE_FAILED to UPDATE_COMPLETE should have no effect, since nested stacks are always updated, so I have no concerns about that one.

Moving a resource from DELETE_FAILED to DELETE_COMPLETE is an unnatural action for Heat (it normally removes the resource from the DB altogether after deleting). It may be that resources that were in DELETE_FAILED before are not getting replaced correctly.
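
Before retrying, it would help to see which resources, if any, are still stuck in an odd state. Something like the following should show them, assuming the default 'overcloud' stack name and the Liberty-era heat client on the undercloud:

  # List all resources, including those in nested stacks, and filter for
  # anything FAILED or left over from a DELETE. Sketch only.
  source ~/stackrc
  heat resource-list -n 5 overcloud | grep -Ei 'FAILED|DELETE'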

The first thing to try would be to get back to a normal baseline by scaling back down to 3 nodes, so that all of the extra stacks and whatever bogus data they contain get deleted. Once that delete succeeds, try scaling up to 5 nodes again.
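
Roughly, and assuming the original deploy command is re-run unchanged apart from the scale value (--compute-scale is the flag in this release; "--computescale" above is taken to mean the same thing, and "<same options and env files as before>" is a placeholder for the rest of the original command), the recovery would look like:

  # Step 1: scale back down to the known-good baseline of 3 computes so the
  # extra nested stacks and any bogus data get deleted.
  source ~/stackrc
  openstack overcloud deploy --templates <same options and env files as before> --compute-scale 3

  # Step 2: once the scale-down completes cleanly, retry the scale-up to 5.
  openstack overcloud deploy --templates <same options and env files as before> --compute-scale 5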