Bug 1331587 - Unable to replace controller node because stack is in failed state
Summary: Unable to replace controller node because stack is in failed state
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-rdomanager-oscplugin
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: ---
Assignee: RHOS Maint
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-04-28 21:40 UTC by David Hill
Modified: 2019-10-10 12:01 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-03 16:05:14 UTC
Target Upstream Version:



Description David Hill 2016-04-28 21:40:44 UTC
What problem/issue/behavior are you having trouble with?  What do you expect to see?

This morning we tried to replace a controller in CTMM1 (one of the environments), and the process removed 10 compute nodes from the Director database.

Because we had a backup of the database, we were able to recover it, and we then noticed that the stack was actually in UPDATE_FAILED state. The stack update that ended in the failed state had been run only to add compute nodes to the environment. Although the stack is in a failed state, the overcloud has been working for more than 3 weeks.
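
For illustration only, a minimal sketch of the kind of pre-flight check that would flag a stack in this condition before a node replacement is attempted. It assumes the status string has already been obtained from Heat in its usual ACTION_STATUS form (for example UPDATE_FAILED); the function name `stack_is_clean` is hypothetical and is not part of any Director tooling.

```python
def stack_is_clean(stack_status: str) -> bool:
    """Return True only for a status of the form <ACTION>_COMPLETE.

    Hypothetical helper: the caller is assumed to pass in the status
    string reported for the overcloud stack, e.g. "UPDATE_FAILED".
    """
    # Split on the last underscore: "UPDATE_FAILED" -> ("UPDATE", "_", "FAILED").
    action, _, state = stack_status.strip().upper().rpartition("_")
    # Require both an action prefix and a COMPLETE state; anything else
    # (FAILED, IN_PROGRESS, malformed input) is treated as not clean.
    return bool(action) and state == "COMPLETE"


if __name__ == "__main__":
    for status in ("CREATE_COMPLETE", "UPDATE_FAILED", "UPDATE_IN_PROGRESS"):
        verdict = "ok to proceed" if stack_is_clean(status) else "fix the stack first"
        print(f"{status}: {verdict}")
```

The check is deliberately strict: it only reports on the state and does not attempt to change it.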

To recover from this situation, we tried updating the Director database directly, changing the status in the stack and resource tables of the heat database from FAILED to COMPLETE. However, running a new deployment just tries to remove the compute nodes from the database again.

It also seems that overcloud deploy is trying to do something on controller-0, but its disk was wiped, so the controller replacement has to be executed; simply updating the stack to reach a clean state does not seem to be possible.

We need guidance on how to complete the replace controller process.

What information can you provide around timeframes and urgency?

This needs to be solved today.

