Bug 1301511

Summary: Updating a failed stack fails with Stack already has an action (UPDATE) in progress.
Product: Red Hat OpenStack
Reporter: Robin Cernin <rcernin>
Component: openstack-heat
Assignee: Zane Bitter <zbitter>
Status: CLOSED NOTABUG
QA Contact: Amit Ugol <augol>
Severity: high
Priority: high
Version: 7.0 (Kilo)
CC: bschmaus, calfonso, dmaley, ggillies, gkeegan, mburns, mcornea, morazi, nalmond, rcernin, rhel-osp-director-maint, sbaker, shardy, skinjo, yeylon, zbitter
Target Milestone: async
Keywords: ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clone Of: 1292212
Last Closed: 2016-01-28 18:17:02 UTC
Type: Bug

Description Robin Cernin 2016-01-25 09:24:30 UTC
Hello

We were updating the OpenStack overcloud from Director 7.0 to 7.2. The update failed, but some of the resources stayed in UPDATE_IN_PROGRESS.

We waited for the UPDATE_IN_PROGRESS state to time out, but that never happened.

As we understand it from bug #1292212, resources should time out after 4 hours. However, that was not the case for us.

We waited longer than that and ended up in a state where the stack was reported as UPDATE_FAILED while the resources beneath it were still reported as UPDATE_IN_PROGRESS.

Also, when checking on the resources, we could see the overcloud update had failed with "Stack already has an action (UPDATE) in progress."

We restarted the OpenStack services on the undercloud to force the UPDATE_IN_PROGRESS resources over to UPDATE_FAILED.
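
Roughly, what we ran on the undercloud was something like the following (from memory, so the exact sequence may have differed slightly):

  # restart heat-engine so it re-checks stack state on startup
  sudo systemctl restart openstack-heat-engine
  # confirm the overall stack state afterwards
  heat stack-list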

Could you look into this? We think the bug might still be there.

Thank you,
Kind Regards,
Robin Černín

Comment 1 Zane Bitter 2016-01-25 13:56:03 UTC
The fix for bug 1280094 is supposed to prevent this from happening, so it would be good to know exactly what version of Heat you're running in the undercloud.

As mentioned in the bug you linked, there were previously some issues with the workaround of restarting heat-engine, but we believe those are fixed. It's possible that if we were leaving resources IN_PROGRESS but not their containing stacks, then they wouldn't get reset at startup. However, that ought not to happen, and the fact that you're getting the failure "Stack already has an action (UPDATE) in progress." suggests that the stack is also IN_PROGRESS anyway (which means it should get moved to FAILED when heat-engine starts up).
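
For context, the idea behind that startup reset is roughly the following (a simplified sketch only, not the actual Heat source; the helper names are illustrative):

  # Sketch: on heat-engine startup, any stack still IN_PROGRESS whose
  # owning engine is no longer running gets flipped to FAILED.
  for stack in get_in_progress_stacks():        # illustrative helper
      if not engine_is_alive(stack.engine_id):  # illustrative helper
          stack.state_set(stack.action, stack.FAILED,
                          'Engine went down during stack %s' % stack.action)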

Useful information would be:
- A list of all resources and stacks that are still IN_PROGRESS after 4 hours (e.g. gathered with the commands sketched below)
- Log of the initial update that put this stuff into the IN_PROGRESS state
- Journal output (`journalctl -u openstack-heat-engine`) from that same update
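
For the first item, something along these lines should do (heat CLI options as of Kilo; please double-check against the client version on the undercloud):

  # stacks (including nested ones) that are still IN_PROGRESS
  heat stack-list --show-nested | grep IN_PROGRESS
  # overcloud resources a few nesting levels deep that are still IN_PROGRESS
  heat resource-list --nested-depth 5 overcloud | grep IN_PROGRESS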

Comment 6 Zane Bitter 2016-01-25 19:53:53 UTC
So far we've established that, as far as moving stacks to FAILED goes, everything is working as expected. However, updates are timing out for unknown reasons: the Compute and Controller resource groups are never completing.

One possible issue found in the journal is this:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
    executor_callback))
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
    executor_callback)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 300, in wrapped
    return func(self, ctx, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 1526, in signal_software_deployment
    updated_at=updated_at)
  File "/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py", line 178, in signal_software_deployment
    raise ValueError(_('deployment_id must be specified'))
ValueError: deployment_id must be specified

This happens when an empty deployment_id is supplied with a signal. There should definitely be better logging of these kinds of errors, but we have no idea yet whether this is the cause of the problem, or why we're receiving a signal of this type.
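
For context, the check that produces this error is essentially the following guard (paraphrased from heat/engine/service_software_config.py rather than quoted verbatim):

  # signals that arrive without a deployment id are rejected up front
  if not deployment_id:
      raise ValueError(_('deployment_id must be specified'))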

Comment 7 Zane Bitter 2016-01-25 23:08:33 UTC
It doesn't appear that those tracebacks were related either.

Comment 11 Shinobu KINJO 2016-01-25 23:58:42 UTC
Created attachment 1118273 [details]
[Split] Full heat log 2

Comment 29 Zane Bitter 2016-01-28 18:17:02 UTC
The only Heat bug found on this deployment was bug 1302828. In respect of resources being UPDATE_IN_PROGRESS, Heat was found to be behaving as expected.