Bug 1278975

Summary: StackValidationFailed: Unknown resource Type : OS::TripleO::AllNodes::Validation while updating stack in UPDATE_FAILED
Product: Red Hat OpenStack Reporter: James Slagle <jslagle>
Component: openstack-heat Assignee: Steve Baker <sbaker>
Status: CLOSED ERRATA QA Contact: Amit Ugol <augol>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.0 (Kilo) CC: calfonso, jcoufal, jslagle, mburns, rbiba, rhel-osp-director-maint, sasha, sbaker, shardy, yeylon, zbitter
Target Milestone: z3 Keywords: ZStream
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-heat-2015.1.2-4.el7ost Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-21 17:03:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1278544    
Bug Blocks:    

Description James Slagle 2015-11-06 22:04:35 UTC
If you have a stack in UPDATE_FAILED (for whatever reason, such as misconfigured DNS on the overcloud nodes) and you try to start another update after fixing the issue, heat-engine throws the following traceback:

Nov 06 16:43:50 instack.localdomain heat-engine[4501]: Traceback (most recent call last):
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: timer()
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: cb(*args, **kw)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: result = function(*args, **kwargs)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 112, in _start_with_trace
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: return func(*args, **kwargs)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: return f(*args, **kwargs)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 865, in update
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: updater()
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 174, in __call__
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: self.start(timeout=timeout)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 200, in start
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: self.step()
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 223, in step
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: next(self._runner)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 289, in wrapper
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: subtask = next(parent)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 918, in update_task
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: updater.start(timeout=self.timeout_secs())
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 200, in start
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: self.step()
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 223, in step
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: next(self._runner)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 289, in wrapper
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: subtask = next(parent)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/update.py", line 55, in __call__
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: self.previous_stack.dependencies,
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 238, in dependencies
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: self.resources.itervalues())
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 201, in resources
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: self.t.resource_definitions(self).items())
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 200, in <genexpr>
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: for (name, data) in
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 141, in __new__
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: resource_name=name)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: File "/usr/lib/python2.7/site-packages/heat/engine/environment.py", line 416, in get_class
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: raise exception.StackValidationFailed(message=msg)
Nov 06 16:43:50 instack.localdomain heat-engine[4501]: StackValidationFailed: Unknown resource Type : OS::TripleO::AllNodes::Validation

I'm filing this bug against python-rdomanager-oscplugin, because I suspect the problem is caused by a bug there in relation to not sending the correct environment files to the Heat API in this scenario.

Note however that I am specifying the environment file on the cli that should define this resource type:

openstack overcloud update stack overcloud -i --templates templates-y1 -e templates-y1/overcloud-resource-registry-puppet.yaml -e templates-y1/environments/network-isolation.yaml -e templates-y1/environments/net-single-nic-with-vlans.yaml -e custom-environment-7.1.yaml -e /home/stack/update.yaml

I've double-checked, and OS::TripleO::AllNodes::Validation is indeed defined in templates-y1/overcloud-resource-registry-puppet.yaml, where templates-y1 is just a one-for-one copy of the templates from the latest openstack-tripleo-heat-templates package.

So I suspect the client is not sending what I've asked it to send to the Heat API. I'll look into it a bit more and change the bug to Heat or tripleo-heat-templates if I discover otherwise.

Comment 2 James Slagle 2015-11-06 22:07:46 UTC
Note that this traceback causes https://bugzilla.redhat.com/show_bug.cgi?id=1278544

which means the stack is stuck in UPDATE_IN_PROGRESS forever, with no way to recover

Comment 3 Zane Bitter 2015-11-06 22:23:42 UTC
Nope, this is a Heat bug - it's trying to load the *previous* stack and not finding a type for one of the resources in the environment. This is likely because we don't write the new environment until after a stack update has succeeded, so the previous stack may contain a mixture of old and new resources, but with the old environment.

I thought we had a bug for this already, but I don't see it at the moment.
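
The failure mode described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions: the function, exception, resource and file names below are hypothetical and are not Heat's real API. It shows how a stored stack that already contains a new-style resource fails to resolve when it is loaded against the old environment's type mappings.

```python
# Toy model only: every name here is hypothetical, not Heat's real API.

class UnknownResourceType(Exception):
    """Stand-in for the StackValidationFailed error seen in the traceback."""


def resolve_types(resources, registry):
    """Map each resource name to its implementation using one registry."""
    resolved = {}
    for name, type_name in resources.items():
        if type_name not in registry:
            raise UnknownResourceType("Unknown resource Type : %s" % type_name)
        resolved[name] = registry[type_name]
    return resolved


# Environment stored with the last successful update (no Validation mapping).
old_registry = {"OS::TripleO::Controller": "controller.yaml"}

# Environment sent with the failed update. In this model it is never saved,
# because the update did not succeed.
new_registry = dict(old_registry)
new_registry["OS::TripleO::AllNodes::Validation"] = "all-nodes-validation.yaml"

# After a partial update, the stored stack already holds a new-style resource.
previous_stack_resources = {
    "Controller": "OS::TripleO::Controller",
    "AllNodesValidationConfig": "OS::TripleO::AllNodes::Validation",
}

# Loading the previous stack with the old environment reproduces the error.
try:
    resolve_types(previous_stack_resources, old_registry)
    outcome = "resolved"
except UnknownResourceType as exc:
    outcome = str(exc)

print(outcome)
```

In this model the same resources resolve without error against new_registry, which matches the observation that the environment passed on the command line did define the type.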

Comment 4 Zane Bitter 2015-11-06 22:30:40 UTC
https://bugs.launchpad.net/heat/+bug/1477812 was a similar problem involving parameters, but the patch would not have fixed this issue with resource type mappings.

Comment 5 Zane Bitter 2015-11-06 22:37:13 UTC
Ah, found the other report of this: https://bugs.launchpad.net/heat/+bug/1508096 (from jprovazn, via me).

Now we know how to reproduce it.

Comment 6 Steve Baker 2015-11-09 03:18:35 UTC
Regarding StackValidationFailed: Unknown resource Type : OS::TripleO::AllNodes::Validation

A backport of https://review.openstack.org/#/c/176324 would be a prerequisite for diagnosing this further (and it may even fix the problem).

I currently have a stack which is similarly wedged because Step4 went to UPDATE_FAILED after pacemaker failed to bring galera back up after the yum update.

Comment 7 James Slagle 2015-11-09 21:35:47 UTC
Not sure if it helps any, but I tried a patched Heat build with https://review.openstack.org/#/c/176324 applied, and I get exactly the same behavior as before.

Comment 8 Steve Baker 2015-11-09 22:38:32 UTC
https://review.openstack.org/#/c/176324 results in the correct exceptions being raised, but Resource needs to fall back to TemplateResource for both TemplateNotFound and ResourceTypeNotFound.

I'll be coming up with a fix for this soon.

http://git.openstack.org/cgit/openstack/heat/tree/heat/engine/resource.py#n141
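
The fallback described in this comment might look roughly like the sketch below. It is not the actual Heat patch: the exception and class names echo the ones mentioned above, but they are redefined locally as stand-ins, and lookup_class, resolve_class and the registry layout are hypothetical. The only point is that both lookup failures are caught and lead to the same generic class.

```python
# Hypothetical sketch: local stand-ins, not Heat's real classes or functions.

class TemplateNotFound(Exception):
    """Raised here when a mapped template file is not available."""


class ResourceTypeNotFound(Exception):
    """Raised here when a type name has no mapping at all."""


class TemplateResource(object):
    """Generic stand-in used for template-backed or unresolvable types."""


class ServerResource(object):
    """Example of a type that maps directly to a Python class."""


def lookup_class(type_name, registry, available_templates):
    """Return the class for type_name, or raise one of the lookup errors."""
    if type_name not in registry:
        raise ResourceTypeNotFound(type_name)
    target = registry[type_name]
    if isinstance(target, str):
        # A string target is treated as a template path in this model.
        if target not in available_templates:
            raise TemplateNotFound(target)
        return TemplateResource
    return target


def resolve_class(type_name, registry, available_templates):
    """Fall back to TemplateResource for either kind of lookup failure."""
    try:
        return lookup_class(type_name, registry, available_templates)
    except (TemplateNotFound, ResourceTypeNotFound):
        return TemplateResource


registry = {
    "OS::Nova::Server": ServerResource,
    "OS::TripleO::Controller": "controller.yaml",
}
available_templates = set()  # pretend no template files can be fetched

for name in ("OS::Nova::Server", "OS::TripleO::Controller",
             "OS::TripleO::AllNodes::Validation"):
    print(name, resolve_class(name, registry, available_templates).__name__)
```

With a fallback of this shape, loading the previous stack no longer aborts on a type that the old environment does not know about.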

Comment 10 Zane Bitter 2015-11-18 22:57:03 UTC
Since the fixes for this are the same as the fixes for bug 1278544, I'm marking this one as TestOnly.

Comment 12 Amit Ugol 2015-11-23 16:10:44 UTC
Updates pass CI, so this is verified.

Comment 13 Amit Ugol 2015-11-23 16:21:37 UTC
There is no longer a way to recreate the type of failure that causes these errors while trying to recover from a previous failed update (I hope that makes sense).

Comment 14 Steve Baker 2015-12-03 23:52:11 UTC
There was an error in the backport due to the different thread_lock arguments on Kilo, which leads to this error on engine start:

2015-12-03 18:12:24.663 14246 TRACE heat.engine.service Traceback (most recent call last):
2015-12-03 18:12:24.663 14246 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 1627, in reset_stack_status
2015-12-03 18:12:24.663 14246 TRACE heat.engine.service     with lock.thread_lock(retry=False):
2015-12-03 18:12:24.663 14246 TRACE heat.engine.service   File "/usr/lib64/python2.7/contextlib.py", line 84, in helper
2015-12-03 18:12:24.663 14246 TRACE heat.engine.service     return GeneratorContextManager(func(*args, **kwds))
2015-12-03 18:12:24.663 14246 TRACE heat.engine.service TypeError: thread_lock() takes at least 2 arguments (2 given)
2015-12-03 18:12:24.663 14246 TRACE heat.engine.service

which is fixed by this patch

diff --git a/heat/engine/service.py b/heat/engine/service.py
index ac85fdf..ea99fff 100644
--- a/heat/engine/service.py
+++ b/heat/engine/service.py
@@ -1624,7 +1624,7 @@ class EngineService(service.Service):
             lock = stack_lock.StackLock(cnxt, stk, self.engine_id)
             engine_id = lock.get_engine_id()
             try:
-                with lock.thread_lock(retry=False):
+                with lock.thread_lock(stack_id, retry=False):
 
                     # refetch stack and confirm it is still I
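
The signature mismatch behind this traceback can be reproduced with a small stand-in. This is a sketch under assumptions: the StackLock class below is a simplified mock rather than Heat's implementation, and the thread_lock(stack_id, retry=True) signature is inferred from the patch above. It shows that omitting the required stack_id raises TypeError as soon as the context manager is created, before the with block is entered. On Python 2 the message appears to count self and the retry keyword as the two arguments given, which would explain the "(2 given)" wording in the log.

```python
import contextlib


class StackLock(object):
    """Simplified mock; only the thread_lock signature matters here."""

    def __init__(self):
        self.held = []

    @contextlib.contextmanager
    def thread_lock(self, stack_id, retry=True):
        # Assumed Kilo-style signature: stack_id is a required argument.
        self.held.append(stack_id)
        try:
            yield
        finally:
            self.held.remove(stack_id)


lock = StackLock()

# Call as written in the faulty backport: stack_id is missing.
try:
    with lock.thread_lock(retry=False):
        pass
    backport_error = None
except TypeError as exc:
    backport_error = exc

# Call as written in the corrected patch: stack_id is passed explicitly.
with lock.thread_lock("stack-1", retry=False):
    held_inside = list(lock.held)

print(type(backport_error).__name__, held_inside, lock.held)
```

The exact TypeError text differs between Python versions, so the sketch checks only the exception type.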

Comment 16 Amit Ugol 2015-12-14 08:06:18 UTC
Tried the update again. The original issue still cannot be reproduced, and the engine start error from comment 14 no longer appears. Re-verifying.

Comment 18 errata-xmlrpc 2015-12-21 17:03:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2680