Bug 1331587

Summary: Unable to replace controller node because stack is in failed state
Product: Red Hat OpenStack Reporter: David Hill <dhill>
Component: python-rdomanager-oscplugin Assignee: RHOS Maint <rhos-maint>
Status: CLOSED NOTABUG QA Contact: Shai Revivo <srevivo>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.0 (Kilo) CC: akaris, dhill, hbrock, jraju, jslagle, mburns, mcornea, rhel-osp-director-maint
Target Milestone: async Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-03 16:05:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description David Hill 2016-04-28 21:40:44 UTC
What problem/issue/behavior are you having trouble with?  What do you expect to see?

This morning we tried to replace a controller in CTMM1 (one of the environments), and the process removed 10 compute nodes from the Director database.

Because we had a backup of the database, we were able to restore it, and we then noticed that the stack was actually in the UPDATE_FAILED state. The stack update that failed had been run only to add compute nodes to the environment. Although the stack is in a failed state, the overcloud has been working for more than 3 weeks.

To recover from this situation, we tried updating the heat database on the director, changing the status in the stack and resource tables from FAILED to COMPLETE. However, running a new deployment just tries to remove the compute nodes from the database again.
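For readers unfamiliar with this kind of manual change, the status edit described above can be sketched roughly as follows. This is only an illustration under assumptions: it runs against an in-memory SQLite stand-in rather than the real heat database, the table layout (name, action, status columns) and the sample rows are hypothetical, and the helper name mark_failed_as_complete is made up for this example. It is not the exact statement that was executed.

```python
import sqlite3


def mark_failed_as_complete(conn):
    """Flip status FAILED -> COMPLETE in the stack and resource tables.

    Returns the number of rows changed per table. Table and column names
    are assumptions made for this sketch.
    """
    changed = {}
    for table in ("stack", "resource"):
        cur = conn.execute(
            f"UPDATE {table} SET status = 'COMPLETE' WHERE status = 'FAILED'"
        )
        changed[table] = cur.rowcount
    conn.commit()
    return changed


# Hypothetical stand-in schema and rows, NOT the real heat schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stack (name TEXT, action TEXT, status TEXT)")
conn.execute("CREATE TABLE resource (name TEXT, action TEXT, status TEXT)")
conn.execute("INSERT INTO stack VALUES ('overcloud', 'UPDATE', 'FAILED')")
conn.executemany(
    "INSERT INTO resource VALUES (?, ?, ?)",
    [("Compute", "UPDATE", "FAILED"), ("Controller", "UPDATE", "COMPLETE")],
)

changed = mark_failed_as_complete(conn)
print(changed)
```

An edit like this changes only the recorded status. It does not change which nodes the stack definition contains, which may be why a new deployment still tries to remove the compute nodes.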

It also seems that overcloud deploy is trying to do something on controller-0, but its disk was wiped, so the controller replacement procedure has to be run; just updating the stack to reach a clean state doesn't seem to be possible.

We need guidance on how to complete the controller replacement process.

What information can you provide around timeframes and urgency?

This should be solved today.