Created attachment 1211843
output of heat stack-show overcloud

Description of problem:
A failure happened when scaling up an established overcloud. I tried to clean up the heat stack and re-deploy. After deleting the nova instances and ironic nodes on the undercloud, I tried to deploy the compute nodes again, but the deployment fails immediately with "ERROR: Stack overcloud already has an action (ROLLBACK) in progress". I cannot cancel the update because of "ERROR: Cancelling update when stack is (u'ROLLBACK', u'IN_PROGRESS') is not supported". The in-progress rollback eventually moved to the ROLLBACK_FAILED state, but I still cannot run another update. I am now stuck on this error and cannot add new compute nodes to the overcloud.

To prevent this from happening, users should not be allowed to delete undercloud resources such as nova instances or ironic nodes directly, or heat should force a cleanup of the failed update.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Cannot cancel the failed undercloud heat stack update.
Cannot start a new undercloud heat stack update.

Expected results:
Requests to delete undercloud resources directly are rejected if heat is the owner of those resources.
The user is able to delete or cancel any failed action, such as a heat update or rollback.

Additional info:
Cancelling a stack update is not recommended for TripleO, because it causes a rollback and TripleO is not designed to deal with rollbacks. The best approach is to allow the stack and all of its nested stacks to either succeed, fail or time out. It is possible to accelerate this process by restarting heat-engine on the undercloud (after which all IN_PROGRESS stacks should move to the FAILED state), and this usually works fine now (although in earlier versions of OSPd it was often problematic).

Once the stack has reached the ROLLBACK_FAILED state you should be able to update it again, although you may have to wait for some nested stacks to time out also (this latter part is fixed in Newton). If you're still having issues, I would suggest restarting heat-engine to ensure all stacks are reset.
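As a rough illustration of the rule described above, the following Python sketch decides whether a new update is likely to be accepted, given the status strings of the top-level stack and its nested stacks. It assumes statuses are written in the combined ACTION_STATUS form (for example ROLLBACK_FAILED). The helper names `split_stack_status` and `can_start_update` are hypothetical and are not part of heat or its client; the check reflects the pre-Newton behaviour mentioned in the comment, where nested stacks still IN_PROGRESS can block an update.

```python
# Hypothetical helpers, not part of heat or python-heatclient.
# Assumption: a status string is "<ACTION>_<STATUS>", where STATUS is one of
# the three values below.
STATUSES = ("IN_PROGRESS", "COMPLETE", "FAILED")


def split_stack_status(stack_status):
    """Split e.g. 'ROLLBACK_IN_PROGRESS' into ('ROLLBACK', 'IN_PROGRESS')."""
    for status in STATUSES:
        suffix = "_" + status
        if stack_status.endswith(suffix):
            return stack_status[: -len(suffix)], status
    raise ValueError("unrecognised stack status: %r" % stack_status)


def can_start_update(stack_statuses):
    """Return True when no stack (top-level or nested) is still IN_PROGRESS.

    Assumption: this mirrors the advice in the comment, i.e. wait until every
    stack has succeeded, failed or timed out before updating again.
    """
    return all(
        split_stack_status(s)[1] != "IN_PROGRESS" for s in stack_statuses
    )


if __name__ == "__main__":
    # Top-level stack has failed its rollback, but one nested stack is still
    # running: keep waiting (or restart heat-engine on the undercloud).
    print(can_start_update(["ROLLBACK_FAILED", "UPDATE_IN_PROGRESS"]))
    # Everything has settled: a new update should be possible.
    print(can_start_update(["ROLLBACK_FAILED", "UPDATE_FAILED"]))
```

The status strings would have to come from listing the stack together with its nested stacks on the undercloud; how they are collected is left out here on purpose.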
(In reply to Zane Bitter from comment #1)

Zane,

Thanks a lot for the prompt response. After restarting heat-engine I can continue my operation, e.g. adding compute nodes to the existing overcloud. I think we can close the bug as no-problem-found. Should I change the state to closed?

Thanks.
Bin
Great! Thanks for the update.