What problem/issue/behavior are you having trouble with? What do you expect to see?

This morning we tried to replace a controller in CTMM1 (one of the environments), and the process removed 10 compute nodes from the Director database. Because we had a backup of the database, we were able to recover it, and we then noticed that the stack was actually in UPDATE_FAILED state. The stack update that ended in the failed state had been run only to add compute nodes to the environment. Although the stack is in a failed state, the overcloud has been working for more than 3 weeks.

To recover from this situation, we tried updating the Director database, changing the status in the stack and resource tables of the heat database from FAILED to "COMPLETE". However, running a new deployment just tries to remove the compute nodes from the database again. It also seems that the overcloud deploy is trying to do something on controller-0, but its disk was wiped, so the controller replacement has to be executed. Simply updating the stack to arrive at a clean state does not seem to be possible.

We need guidance on how to complete the controller replacement process.

What information can you provide around timeframes and urgency?

This should be solved today.