Bug 1290950 - rhel-osp-director: update 7.0->7.2 and 7.2->7.3 fails StackValidationFailed: Unknown resource Type : OS::TripleO::NodeTLSCAData (*include-password or export HEAT_INCLUDE_PASSWORD=1) [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z4
Target Release: 7.0 (Kilo)
Assigned To: Zane Bitter
QA Contact: Amit Ugol
Keywords: Reopened, ZStream
Duplicates: 1306502
Depends On:
Blocks: 1298589
Reported: 2015-12-11 20:32 EST by Alexander Chuzhoy
Modified: 2018-02-08 06:06 EST

See Also:
Fixed In Version: openstack-heat-2015.1.2-9.el7ost
Doc Type: Bug Fix
Doc Text:
When a stack update fails, Heat stores a merged environment file containing the previous and new environments with the stack. Previously, however, if the update was interrupted prematurely (for instance, by restarting heat-engine), the merged environment file was not written. As a result, any resources already created that had new type aliases in the environment could not have their types resolved, and the failed stack could no longer be updated. This patch catches any exceptions (including exit exceptions) that occur while updating a stack and ensures that the merged environment is written. A stack whose update was interrupted after resources with new type aliases were created can now be updated again.
Last Closed: 2016-02-18 11:42:12 EST
Type: Bug
zbitter: needinfo? (athomas)


Attachments
messages from one controller. (1.18 MB, application/x-gzip)
2015-12-11 20:38 EST, Alexander Chuzhoy


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1526944 None None None 2015-12-16 16:07 EST
Launchpad 1526951 None None None 2015-12-16 16:07 EST
Launchpad 1544348 None None None 2016-02-10 18:21 EST
Red Hat Product Errata RHSA-2016:0266 normal SHIPPED_LIVE Moderate: openstack-heat bug fix and security advisory 2016-02-18 16:41:02 EST

Description Alexander Chuzhoy 2015-12-11 20:32:21 EST
rhel-osp-director: update 7.0->7.2 fails ERROR: openstack ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1 Authentication required


Environment:
instack-undercloud-2.1.2-36.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch


Steps to reproduce:
1. Deploy HA overcloud 7.0 with network isolation.
2. Attempt to update to 7.2.

Result:
The hosts get updated pretty quickly, then after a long time:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
ERROR: openstack ERROR: Authentication failed. Please try again with option --include-password or
export HEAT_INCLUDE_PASSWORD=1
Authentication required


Expected result:
Successful update.


Notes:
Seems like puppet issues:

heat resource-list -n 5 overcloud|grep -v COMPLETE
| ControllerNodesPostDeployment               | 7d89fd52-d552-46c9-9e4d-feeb8a100827 | OS::TripleO::ControllerPostDeployment | UPDATE_FAILED      | 2015-12-11T22:26:48Z |                                             |
| ControllerOvercloudServicesDeployment_Step4 | 9e1059b6-34ed-467b-a90e-46d1771ee31c | OS::Heat::StructuredDeployments       | UPDATE_IN_PROGRESS | 2015-12-11T22:32:07Z | ControllerNodesPostDeployment               |
| 2                                           | 82be7957-7b81-4ff0-a006-0364cdf65aa7 | OS::Heat::StructuredDeployment        | UPDATE_IN_PROGRESS | 2015-12-11T22:32:24Z | ControllerOvercloudServicesDeployment_Step4 |
| 0                                           | ddd5489d-efff-4184-b727-428b37188032 | OS::Heat::StructuredDeployment        | UPDATE_IN_PROGRESS | 2015-12-11T22:32:28Z | ControllerOvercloudServicesDeployment_Step4 |
| 1                                           | dfd7ed81-18cd-4997-8e44-035075b4c30d | OS::Heat::StructuredDeployment        | UPDATE_IN_PROGRESS | 2015-12-11T22:32:29Z | ControllerOvercloudServicesDeployment_Step4 |


from pcs status:
 ip-192.0.2.12  (ocf::heartbeat:IPaddr2):       Stopped (unmanaged)
Comment 2 Alexander Chuzhoy 2015-12-11 20:38 EST
Created attachment 1104930
messages from one controller.
Comment 3 Alexander Chuzhoy 2015-12-11 20:38:51 EST
The logs are too big, so I attached just the messages file.
Comment 7 Steve Baker 2015-12-14 15:56:31 EST
I think this is likely to have the same root cause and workaround as bug 1290949

https://bugzilla.redhat.com/show_bug.cgi?id=1290949#c8
Comment 8 Jaromir Coufal 2015-12-14 18:42:42 EST
Doc/workaround works for me.
Comment 9 chris alfonso 2015-12-15 03:44:54 EST

*** This bug has been marked as a duplicate of bug 1290949 ***
Comment 10 Alexander Chuzhoy 2015-12-15 05:28:56 EST
The error reproduced on a VM setup with 2 vCPUs.
Comment 18 Steve Baker 2015-12-15 16:25:33 EST
Doing some basic diagnosis on a failed stack, the resource UpdateDeployment on controller 0 failed due to a timeout. sshing into controller 0 and doing the following:

   journalctl -l -u os-collect-config
   systemctl status os-collect-config

shows that the puppet-apply is still running, and apparently stalled on a call to nova-manage db sync

   CGroup: /system.slice/os-collect-config.service
           ├─ 2417 /usr/bin/systemctl restart openstack-nova-scheduler
           ├─12684 /sbin/dhclient -H overcloud-controller-0 -1 -q -lf /var/lib/dhc...
           ├─14016 /usr/bin/python /usr/bin/os-collect-config
           ├─19509 /usr/bin/python /usr/bin/os-refresh-config
           ├─19644 /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-...
           ├─22133 python /usr/libexec/os-refresh-config/configure.d/55-heat-confi...
           ├─22139 python /var/lib/heat-config/hooks/puppet
           ├─22141 /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/l...
           └─31139 /usr/bin/python /usr/bin/nova-manage db sync

ControllerOvercloudServicesDeployment_Step4 is also in a failed state on all 3 controllers. Looking at os-collect-config on controller 1 shows that the puppet-apply is wedged on restarting openstack-nova-scheduler:

   CGroup: /system.slice/os-collect-config.service
           ├─ 7366 /usr/bin/python /usr/bin/os-refresh-config
           ├─ 7504 /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-...
           ├─10047 python /usr/libexec/os-refresh-config/configure.d/55-heat-confi...
           ├─10052 python /var/lib/heat-config/hooks/puppet
           ├─10054 /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/l...
           ├─12693 /sbin/dhclient -H overcloud-controller-1 -1 -q -lf /var/lib/dhc...
           ├─13680 /usr/bin/systemctl restart openstack-nova-scheduler
           └─15004 /usr/bin/python /usr/bin/os-collect-config

Controller 2 is also wedged while running puppet, but there doesn't seem to be a long-running subprocess causing this, so I'm not sure what the cause of this wedge is:

   CGroup: /system.slice/os-collect-config.service
           ├─  811 /usr/bin/systemctl restart openstack-nova-scheduler
           ├─ 2973 /usr/bin/python /usr/bin/os-collect-config
           ├─12685 /sbin/dhclient -H overcloud-controller-2 -1 -q -lf /var/lib/dhclient/dhclient--br-ex.lease -pf /var/run/dhclient-br-ex.pid br-ex
           ├─27005 /usr/bin/python /usr/bin/os-refresh-config
           ├─27155 /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d
           ├─29697 python /usr/libexec/os-refresh-config/configure.d/55-heat-config
           ├─29708 python /var/lib/heat-config/hooks/puppet
           └─29710 /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/lib/heat-config/heat-config-puppet/8827922e-444e-4ec3-acd8-53ece6c6127e.pp

Any heat stack-update is going to time out on unrelated resources while these puppet-apply runs are wedged. I think the "Authentication failed." message is actually the token timing out, so apart from the misleading error message, heat is working correctly in this instance.

I'm unassigning myself from this bug so someone else can diagnose these puppet issues.
Comment 19 Jiri Stransky 2015-12-16 05:19:21 EST
This looks like a false alarm to me. I suspect an inadvertent change in network isolation settings between the initial deployment and the update.

On controller-0 a puppet run is stuck restarting nova-scheduler and running nova db sync. In nova-scheduler.log I see that it cannot connect to the database:

2015-12-16 04:46:08.457 2426 WARNING oslo_db.sqlalchemy.session [req-a2dc7d9a-1ec5-4cba-941e-0ec36dd878fc - - - - -] SQL connection failed. -10696 attempts left.
2015-12-16 04:46:18.467 2426 WARNING oslo_db.sqlalchemy.session [req-a2dc7d9a-1ec5-4cba-941e-0ec36dd878fc - - - - -] SQL connection failed. -10697 attempts left.
2015-12-16 04:46:28.474 2426 WARNING oslo_db.sqlalchemy.session [req-a2dc7d9a-1ec5-4cba-941e-0ec36dd878fc - - - - -] SQL connection failed. -10698 attempts left.

The `connection` option in nova.conf points to 192.0.2.6:

connection=mysql://nova:<password-removed-for-BZ>@192.0.2.6/nova

The 192.0.2.6 address is pingable and assigned to controller-0:

[root@overcloud-controller-0 ~]# ip a | grep 192.0.2.6
    inet 192.0.2.6/32 brd 192.0.2.255 scope global br-ex

The reason MariaDB isn't reachable on this IP is that haproxy listens on a different one. Mysql section from haproxy.cfg:

listen mysql
  bind 192.168.100.11:3306 
  option httpchk
  stick on dst
  stick-table type ip size 1000
  timeout client 0
  timeout server 0
  server overcloud-controller-0 192.168.100.15:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-1 192.168.100.12:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
  server overcloud-controller-2 192.168.100.14:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
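
The mismatch above can also be checked mechanically. A minimal Python sketch, assuming the `connection` value from nova.conf and the `listen mysql` section from haproxy.cfg have already been read into strings. The helper names `db_host` and `haproxy_binds` are hypothetical and not part of any OpenStack tool, and the password is a placeholder:

```python
from urllib.parse import urlsplit


def db_host(connection_url):
    """Return the host part of a database connection URL."""
    return urlsplit(connection_url).hostname


def haproxy_binds(section_text):
    """Return the set of IPs that a haproxy section binds to."""
    ips = set()
    for line in section_text.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == "bind":
            # "bind 192.168.100.11:3306" -> "192.168.100.11"
            ips.add(words[1].rsplit(":", 1)[0])
    return ips


# Values copied from the nova.conf and haproxy.cfg excerpts in this comment;
# "secret" stands in for the removed password.
nova_connection = "mysql://nova:secret@192.0.2.6/nova"
mysql_section = """\
listen mysql
  bind 192.168.100.11:3306
  option httpchk
"""

host = db_host(nova_connection)
binds = haproxy_binds(mysql_section)
if host not in binds:
    print("nova.conf points at %s but haproxy binds %s" % (host, sorted(binds)))
```

The check only compares addresses as text; it does not try to connect to anything.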


On the undercloud machine I don't see what command was used to deploy, but I see two network-environment files: network-environment.yaml.org and network-environment.yaml. By the look of it, I guess network-environment.yaml.org was used in the initial deployment, and network-environment.yaml for the update.

There are quite a few differences between the two files, e.g. in network-environment.yaml.org I see:

InternalApiNetCidr: 192.168.100.0/24

which corresponds to the haproxy.cfg setting above. In network-environment.yaml I don't see any InternalApiNetCidr line, though. Also, in the .org variant of the file I see OS::TripleO::Network::InternalApi assigned in the resource registry, but not in the one used for the update. This probably caused "un-isolation" of the InternalApi network and a fallback to ctlplane, causing the desync between nova.conf and haproxy.cfg (haproxy didn't get restarted by puppet, as it's managed by pacemaker).

I think unchanged environment files should be passed to the update CLI call, except for cases where the update doc says that some modifications are needed.
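
One way to catch this kind of drift before running an update is to compare the two environment files. A rough Python sketch, assuming both files have already been parsed into dictionaries (for example with a YAML loader). The sample data only mirrors the two keys mentioned above; the section name `parameter_defaults`, the template path, and the helper `dropped_keys` are assumptions made for illustration:

```python
def dropped_keys(old_env, new_env,
                 sections=("parameter_defaults", "resource_registry")):
    """List keys that the old environment sets but the new one omits."""
    missing = {}
    for section in sections:
        old_keys = set(old_env.get(section) or {})
        new_keys = set(new_env.get(section) or {})
        lost = sorted(old_keys - new_keys)
        if lost:
            missing[section] = lost
    return missing


# Stand-in for network-environment.yaml.org (used for the deployment).
deploy_env = {
    "parameter_defaults": {"InternalApiNetCidr": "192.168.100.0/24"},
    "resource_registry": {
        # Placeholder path; the real mapping is not shown in this bug.
        "OS::TripleO::Network::InternalApi": "path/to/internal_api.yaml",
    },
}
# Stand-in for network-environment.yaml (used for the update).
update_env = {"parameter_defaults": {}, "resource_registry": {}}

for section, keys in dropped_keys(deploy_env, update_env).items():
    print(section, "lost:", ", ".join(keys))
```

Any key reported here is a setting that the update would silently fall back to defaults for, unless the update documentation calls for the change.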
Comment 21 Steve Baker 2015-12-16 16:07:41 EST
I've attached the upstream bugs for the cosmetic aspects of this issue.
Comment 23 Jiri Stransky 2016-01-06 05:20:04 EST
As explained in https://bugzilla.redhat.com/show_bug.cgi?id=1290950#c19, the problem with the investigated environment was network misconfiguration in the environment files. If the same symptoms appear for a different reason, we should probably open a new BZ.
Comment 24 Alexander Chuzhoy 2016-02-01 21:00:58 EST
Reproduced this issue on a 7.2->7.3 update.


Deployed 7.2 GA with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1   --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server x.x.x.x --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml

Update command:
openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-vip.yaml  -e network-environment.yaml
Comment 25 Alexander Chuzhoy 2016-02-01 21:01:28 EST
The environment files were modified.
Comment 27 Omri Hochman 2016-02-01 21:18:06 EST
Adding blocker? for 7.3 GA.
This bug was initially opened when we tested an update from 7.0 to 7.2.
It has now reproduced when attempting to update a clean 7.2 deployment to the latest 7.3 bits.

environment: 
--------------
openstack-heat-api-2015.1.2-7.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-112.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.2-7.el7ost.noarch
openstack-heat-engine-2015.1.2-7.el7ost.noarch
Comment 28 Alexander Chuzhoy 2016-02-01 21:58:29 EST
And reproduced on 7.0 -> 7.3 update:
Deployment command:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1   --neutron-network-type gre --neutron-tunnel-types gre  --ntp-server x.x.x.x --timeout 90 -e network-environment.yaml


Update command:
openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-keystone-admin-internal-api.yaml  -e network-environment.yaml
Comment 29 Jiri Stransky 2016-02-02 10:57:26 EST
The root cause on the system I got to investigate looks like something related to SSL support, possibly the OS::TripleO::NodeTLSCAData missing from the resource registry.


[stack@instack ~]$ heat resource-list overcloud -n5 | grep -vi complete
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                               |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| Controller                                    | e3e5d008-201a-4a0e-9196-338186d78233          | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2016-02-01T21:17:26Z |                                               |
| 2                                             | b7ac5241-4e77-4b38-bb6d-5bd82908dabe          | OS::TripleO::Controller                           | UPDATE_FAILED   | 2016-02-01T21:17:49Z | Controller                                    |
| 1                                             | a5c81a21-2113-4d56-aeeb-12c0f8449359          | OS::TripleO::Controller                           | UPDATE_FAILED   | 2016-02-01T21:19:51Z | Controller                                    |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
[stack@instack ~]$ heat stack-show a5c81a21-2113-4d56-aeeb-12c0f8449359 | grep reason
| stack_status_reason   | Unknown resource Type : OS::TripleO::NodeTLSCAData |
[stack@instack ~]$ heat stack-show b7ac5241-4e77-4b38-bb6d-5bd82908dabe | grep reason
| stack_status_reason   | Unknown resource Type : OS::TripleO::NodeTLSCAData |
[stack@instack ~]$ heat stack-show e3e5d008-201a-4a0e-9196-338186d78233 | grep reason
| stack_status_reason   | Timed out |
Comment 30 Jiri Stransky 2016-02-02 11:02:50 EST
Pasting from IRC -- the update command was missing the resource registry:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1   --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network
Comment 31 Jiri Stransky 2016-02-02 11:05:53 EST
Sorry, that's ^ the deploy cmd, the update cmd was:

openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-vip
Comment 32 Jiri Stransky 2016-02-02 11:46:25 EST
[stack@instack openstack-tripleo-heat-templates]$ grep -ri OS::TripleO::NodeTLSCAData
overcloud-resource-registry-puppet.yaml:  OS::TripleO::NodeTLSCAData: puppet/extraconfig/tls/no-ca.yaml
puppet/controller-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
puppet/swift-storage-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
puppet/ceph-storage-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
puppet/cinder-storage-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
puppet/compute-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
environments/inject-trust-anchor.yaml:  OS::TripleO::NodeTLSCAData: ../puppet/extraconfig/tls/ca-inject.yaml

I see OS::TripleO::NodeTLSCAData mapped to no-ca.yaml in the resource registry ^^, so I wonder whether getting "Unknown resource Type : OS::TripleO::NodeTLSCAData" even though we do pass the resource registry could be a symptom of something else, possibly a Heat bug of some sort.
Comment 33 Zane Bitter 2016-02-03 10:42:27 EST
This looks similar to the error in bug 1298589 (which is for 8.0)
Comment 34 Zane Bitter 2016-02-03 10:58:36 EST
Here is the stack trace, very similar to bug 1298589:

2016-02-01 16:03:11.033 7460 ERROR heat.engine.service [-] Unhandled error in asynchronous task
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service Traceback (most recent call last):
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 123, in log_exceptions
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     gt.wait()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     return self._exit_event.wait()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 125, in wait
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     current.throw(*self._exc)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     result = function(*args, **kwargs)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 112, in _start_with_trace
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     return func(*args, **kwargs)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     return f(*args, **kwargs)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 76, in handle_exceptions
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     {'func': func.__name__, 'msg': errmsg})
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     six.reraise(self.type_, self.value, self.tb)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 71, in handle_exceptions
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     return func(stack, *args, **kwargs)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 893, in update
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     updater()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 174, in __call__
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.start(timeout=timeout)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 200, in start
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.step()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 223, in step
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     next(self._runner)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 289, in wrapper
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     subtask = next(parent)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 946, in update_task
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     updater.start(timeout=self.timeout_secs())
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 200, in start
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.step()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 223, in step
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     next(self._runner)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 289, in wrapper
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     subtask = next(parent)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/update.py", line 55, in __call__
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.previous_stack.dependencies,
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 260, in dependencies
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.resources.itervalues())
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 223, in resources
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.t.resource_definitions(self).items())
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 222, in <genexpr>
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     for (name, data) in
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 143, in __new__
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     resource_name=name)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/environment.py", line 439, in get_class_to_instantiate
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     raise exception.StackValidationFailed(message=msg)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service StackValidationFailed: Unknown resource Type : OS::TripleO::NodeTLSCAData

Crucially, the error appears to be in getting the dependencies for the *previous* stack.
Comment 35 Zane Bitter 2016-02-03 11:10:12 EST
The log also shows that heat-engine was restarted with a stack IN_PROGRESS prior to this failure occurring, and also that the stack experienced a failure due to a message timeout (bug 1290949) about 20 minutes prior to the error above. So it's likely that we have some partially-created resource that is then being resolved against the old environment, which hasn't defined a type for it, i.e. the same problem that should have been fixed as bug 1278975.
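
To make the failure mode concrete, here is a toy Python model of resolving a type alias against a stored registry. It is only a sketch: `resolve` and `UnknownResourceType` are hypothetical stand-ins rather than Heat's real code, and the OS::TripleO::Controller mapping is made up for illustration (the NodeTLSCAData mapping is the one from overcloud-resource-registry-puppet.yaml quoted earlier).

```python
class UnknownResourceType(Exception):
    """Toy counterpart of the StackValidationFailed error in the trace."""


def resolve(registry, type_name):
    """Look up the template that a type alias maps to."""
    try:
        return registry[type_name]
    except KeyError:
        raise UnknownResourceType("Unknown resource Type : %s" % type_name)


# Registry stored with the stack before the update (illustrative mapping).
old_registry = {"OS::TripleO::Controller": "puppet/controller-puppet.yaml"}
# Registry passed with the update, which introduces the new alias.
new_registry = dict(old_registry)
new_registry["OS::TripleO::NodeTLSCAData"] = "puppet/extraconfig/tls/no-ca.yaml"

# A resource of the new type was created before the update was interrupted,
# but only the old environment is stored with the stack.
try:
    resolve(old_registry, "OS::TripleO::NodeTLSCAData")
except UnknownResourceType as exc:
    print(exc)

# With the merged (old + new) environment stored, the lookup succeeds.
merged = dict(old_registry)
merged.update(new_registry)
print(resolve(merged, "OS::TripleO::NodeTLSCAData"))
```

The point of the model is that the existing stack, not the new template, is what fails to load: the resource exists, but the environment stored alongside it no longer explains its type.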
Comment 36 Zane Bitter 2016-02-03 11:25:38 EST
Current best guess is that the fix for bug 1278975 doesn't get the chance to run when the stack operation ends abnormally (e.g. by restarting heat-engine). It's not inside an exception handler, so even assuming things proceed in a somewhat orderly manner (i.e. not kill -9) it doesn't stand a chance.

This can probably be worked around by not restarting heat-engine while it's still working on a stack. A likely fix would be along the lines of storing the updated environment in advance, rather than waiting until after something has failed.
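
For illustration, the approach described in this bug's Doc Text (catch everything, including exit exceptions, and still write the merged environment) might look roughly like the Python sketch below. This is not the actual Heat patch: `FakeStack`, `store_env`, and `update` are hypothetical names, the registry values are placeholders, and storing only the new environment on success is an assumption of the sketch.

```python
class FakeStack(object):
    """Stand-in for a stored stack record; not Heat's real Stack class."""

    def __init__(self, env):
        self.stored_env = dict(env)

    def store_env(self, env):
        self.stored_env = dict(env)


def update(stack, new_env, do_update):
    """Run do_update(), persisting the merged environment on any failure."""
    merged = dict(stack.stored_env)
    merged.update(new_env)
    try:
        do_update()
    except BaseException:
        # BaseException also covers SystemExit and KeyboardInterrupt,
        # which a plain "except Exception" would let slip past.
        stack.store_env(merged)
        raise
    else:
        stack.store_env(new_env)


def interrupted():
    # Simulates the engine being asked to exit in the middle of an update.
    raise SystemExit("heat-engine is shutting down")


stack = FakeStack({"OS::TripleO::Controller": "controller.yaml"})
try:
    update(stack, {"OS::TripleO::NodeTLSCAData": "no-ca.yaml"}, interrupted)
except SystemExit:
    pass
print(sorted(stack.stored_env))
```

As noted above, nothing of this kind can help with kill -9, since no handler gets to run in that case.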
Comment 37 Mike Burns 2016-02-08 16:29:51 EST
Per Zane, this bug does not have a fix yet.  Most of the issues that were considered blocking in this bug are really due to bug 1304878 which is already fixed.  

This bug only shows up if heat-engine is restarted mid-update, so I'd propose dropping blocker.
Comment 38 Zane Bitter 2016-02-11 00:09:41 EST
*** Bug 1306502 has been marked as a duplicate of this bug. ***
Comment 39 Angus Thomas 2016-02-11 06:16:07 EST
I agree with Mike in Comment 37. This isn't a release blocker. It's a consequence of restarting the heat-engine in mid-update.
Comment 40 Zane Bitter 2016-02-11 13:32:09 EST
(In reply to Angus Thomas from comment #39)
> I agree with Mike in Comment 37. This isn't a release blocker. It's a
> consequence of restarting the heat-engine in mid-update.

It is, but people do that all the time and we've been recommending that as our workaround for waiting 4 hours for the timeout on a child stack if something fails. On the happy path you never hit it, but that doesn't mean you won't be hitting it regularly. And it's really hard to recover from because it makes the data about the *existing* state inconsistent.

We have a patch and can avoid a lot of needless suffering.
Comment 43 Amit Ugol 2016-02-18 06:11:31 EST
Haven't been able to reproduce the original issue, not that it's an easy thing to do.
Comment 46 errata-xmlrpc 2016-02-18 11:42:12 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0266.html
