Red Hat Bugzilla – Bug 1301511
Updating a failed stack fails with Stack already has an action (UPDATE) in progress.
Last modified: 2016-04-26 16:56:35 EDT
We were updating the OpenStack overcloud from Director 7.0 to 7.2. The update failed, but some of the resources stayed in UPDATE_IN_PROGRESS.
We waited for the UPDATE_IN_PROGRESS resources to time out, but that never happened.
Per our understanding of Bug #1292212, resources should be configured to time out after 4 hours. However, that was not the case for us.
We waited longer than that and were left in a state where the stack was reported as UPDATE_FAILED while the resources beneath it were still reported as UPDATE_IN_PROGRESS.
When checking on the resources, we could also see that the overcloud update failed with "Stack already has an action (UPDATE) in progress."
We restarted the OpenStack services on the undercloud to force the UPDATE_IN_PROGRESS resources over to UPDATE_FAILED.
Could you look into it? We think the bug might still be there.
The fix for bug 1280094 is supposed to prevent this from happening, so it would be good to know exactly what version of Heat you're running in the undercloud.
As mentioned in the bug you linked, there were previously some issues with the workaround of restarting heat-engine, but we believe those are fixed. It's possible that if we left resources IN_PROGRESS but not their containing stacks, they wouldn't get reset at startup; however, that ought not to happen, and the fact that you're getting the failure "Stack already has an action (UPDATE) in progress." suggests that the stack itself is also IN_PROGRESS (which means it should get moved to FAILED when heat-engine starts up).
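The startup reset described above can be sketched roughly as follows. This is a minimal illustration of the idea, not Heat's actual code; the record structure and function name are assumptions:

```python
# Sketch (not Heat's real implementation) of the startup behaviour described
# above: when heat-engine restarts, no worker can still be acting on a stack,
# so any stack still IN_PROGRESS is moved to FAILED.

def reset_stale_stacks(stacks):
    """Mark IN_PROGRESS stacks as FAILED after an engine restart.

    `stacks` is a list of dicts with 'action' and 'status' keys; this
    structure is hypothetical, for illustration only.
    """
    for stack in stacks:
        if stack['status'] == 'IN_PROGRESS':
            stack['status'] = 'FAILED'
            stack['status_reason'] = (
                'Engine went down during stack %s' % stack['action'])
    return stacks
```

Note that in this sketch only the stack records are touched, which is why resources left IN_PROGRESS without an IN_PROGRESS parent stack would not be reset, as discussed above.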
Useful information would be:
- A list of all resources and stacks that are still IN_PROGRESS after 4 hours
- Log of the initial update that put this stuff into the IN_PROGRESS state
- Journal output (`journalctl -u openstack-heat-engine`) from that same update
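To build the first item of that list, something like the following filter over the stack and resource listings would do. This is a hypothetical helper, not part of Heat; the record layout and the 4-hour timeout value are assumptions based on the discussion above:

```python
from datetime import datetime, timedelta

# Assumed resource timeout from the earlier discussion (Bug #1292212).
TIMEOUT = timedelta(hours=4)

def stale_in_progress(records, now):
    """Return names of entries still IN_PROGRESS past the timeout.

    `records` is a list of (name, status, updated_at) tuples, e.g. as
    collected from `heat stack-list --show-nested` and `heat resource-list`.
    The tuple layout is an assumption for this sketch.
    """
    return [name for name, status, updated_at in records
            if status.endswith('IN_PROGRESS') and now - updated_at > TIMEOUT]
```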
So far we've established that, as far as moving stacks to FAILED goes, everything is working as expected. However, updates are timing out for unknown reasons: the Compute and Controller resource groups never complete.
One possible issue found in the journal is this:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 300, in wrapped
    return func(self, ctx, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 1526, in signal_software_deployment
  File "/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py", line 178, in signal_software_deployment
    raise ValueError(_('deployment_id must be specified'))
ValueError: deployment_id must be specified
This happens when an empty deployment_id is supplied with a signal. There should definitely be better logging of these kinds of errors, but we have no idea yet whether this is the cause of the problem, or why we're receiving a signal of this type.
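The guard that produces the traceback boils down to the check below. This reimplements just that validation for illustration; it is not the actual Heat source, and the simplified signature is an assumption:

```python
# Sketch of the validation behind the traceback above: a software-deployment
# signal with an empty deployment_id is rejected before any lookup happens.

def signal_software_deployment(deployment_id, details):
    if not deployment_id:
        # Matches the error seen in the journal output.
        raise ValueError('deployment_id must be specified')
    # The real service would look up the deployment and apply the signal
    # details here; this sketch just echoes them back.
    return {'deployment_id': deployment_id, 'details': details}
```

So any caller sending a signal without a deployment_id, e.g. a malformed request from an overcloud node, would surface as exactly this traceback in the heat-engine journal.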
It doesn't appear that those tracebacks were related either.
Created attachment 1118273
[Split] Full heat log 2
The only Heat bug found on this deployment was bug 1302828. With respect to resources being left in UPDATE_IN_PROGRESS, Heat was found to be behaving as expected.