Description of problem:
A recent stack update failed with the following:

overcloud.AllNodesDeploySteps.ComputeDeployment_Step1.50:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: d3e320e6-989d-4ccc-9632-f95a6432025a
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
  deploy_stdout: |
    Notice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.
    Notice: Compiled catalog for overcloud-compute-50.localdomain in environment production in 1.75 seconds
    Notice: /Stage[main]/Tripleo::Profile::Base::Kernel/Kmod::Load[nf_conntrack_proto_sctp]/Exec[modprobe nf_conntrack_proto_sctp]/returns: executed successfully
    Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[002 accept all to lo interface]/Firewall[002 accept all to lo interface ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[001 accept all icmp]/Firewall[001 accept all icmp ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[000 accept related established rules]/Firewall[000 accept related established rules ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[003 accept ssh]/Firewall[003 accept ssh ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[neutron_ovs_agent]/Tripleo::Firewall::Rule[136 neutron gre networks]/Firewall[136 neutron gre networks ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[neutron_ovs_agent]/Tripleo::Firewall::Rule[118 neutron vxlan networks]/Firewall[118 neutron vxlan networks ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[ntp]/Tripleo::Firewall::Rule[105 ntp]/Firewall[105 ntp ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[nova_libvirt]/Tripleo::Firewall::Rule[200 nova_libvirt]/Firewall[200 nova_libvirt ipv6]/ensure: created
    Notice: /Stage[main]/Ssh::Server::Config/Concat[/etc/ssh/sshd_config]/File[/etc/ssh/sshd_config]/content: content changed '{md5}baedf1034830191df5b3d75cb450ea2f' to '{md5}c3c5d6edffcb177a423195170f9277ab'
    Notice: /Stage[main]/Ssh::Server::Service/Service[sshd]: Triggered 'refresh' from 2 events
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[998 log all]/Firewall[998 log all ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Exec[nonpersistent_v6_rules_cleanup]/returns: executed successfully
    Notice: Finished catalog run in 29.99 seconds
  deploy_stderr: |
    exception: connect failed
    error: Could not connect to cluster (is it running?)
overcloud.AllNodesDeploySteps.ControllerDeployment_Step1:
  resource_type: OS::Heat::StructuredDeploymentGroup
  physical_resource_id: 889e66e6-bf9b-47ca-ab5d-ae44c0ce3015
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted

This is repeated for many, if not all, compute nodes.

From the heat-engine.log on the director node:

2018-12-15 15:41:29.319 14507 INFO heat.engine.resource [req-d8622bbf-4583-4d31-84a5-1baf8182599f e73382d0112c4ceaa8cb7363146b81cc 9f22a19a5a034d6fbf57537b8b246750 - - -] UPDATE: StructuredDeployment "54" [cc51baf1-c939-4081-9a25-b98b9434b9f1] Stack "overcloud-AllNodesDeploySteps-cg7re45aekzh-ComputeDeployment_Step1-fp4ayqsy76mt" [8f13d2c2-2089-47ab-929b-7ea00217df7f]
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource Traceback (most recent call last):
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 787, in _action_recorder
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     yield
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1366, in update
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     prop_diff])
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     step = next(subtask)
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 840, in action_handler_task
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     done = check(handler_data)
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 466, in check_update_complete
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     return self._check_complete()
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 301, in _check_complete
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     raise exception.Error(message)
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource
2018-12-15 15:41:31.517 14507 INFO heat.engine.stack [req-d8622bbf-4583-4d31-84a5-1baf8182599f e73382d0112c4ceaa8cb7363146b81cc 9f22a19a5a034d6fbf57537b8b246750 - - -] Stack UPDATE FAILED (overcloud-AllNodesDeploySteps-cg7re45aekzh-ComputeDeployment_Step1-fp4ayqsy76mt): Error: resources[54]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2018-12-15 15:41:32.151 14500 INFO heat.engine.update [req-1e901ae9-30a3-48c0-a5dc-bf0c482f751d - - - - -] Resource 0 for stack overcloud-AllNodesDeploySteps-cg7re45aekzh-ControllerDeployment_Step1-f3ffknlb5nb5 updated
2018-12-15 15:41:32.161 14495 INFO heat.engine.resource [req-b7a3958d-5d51-402a-869f-8228bedb01b3 - - - - -] UPDATE: StructuredDeploymentGroup "ComputeDeployment_Step1" [8f13d2c2-2089-47ab-929b-7ea00217df7f] Stack "overcloud-AllNodesDeploySteps-cg7re45aekzh" [7b59c9fd-c305-4274-b241-f13561505bf7]
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource Traceback (most recent call last):
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 787, in _action_recorder
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     yield
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1366, in update
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     prop_diff])
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     step = next(subtask)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 840, in action_handler_task
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     done = check(handler_data)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 397, in check_update_complete
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     if not checker.step():
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 219, in step
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     poll_period = next(self._runner)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 385, in _run_to_completion
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     self).check_update_complete(updater):
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 513, in check_update_complete
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     cookie=cookie)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 415, in _check_status_complete
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     action=action)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource ResourceFailure: resources.ComputeDeployment_Step1: Error: resources[54]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource

I have SOS reports from the director and a compute node if any additional logs are of interest in this case.

Version-Release number of selected component (if applicable):
openstack-heat-common-7.0.6-4.el7ost

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
If useful, here is the content of one of the .pp files that failed:

[root@overcloud-compute-57:~]# cat /var/lib/heat-config/heat-config-puppet/2ee89657-9906-45c7-827a-29cf70ce6cee.pp
# Copyright 2014 Red Hat, Inc.
# All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

# The content of this file will be used to generate
# the puppet manifests for all roles, the placeholder
# compute will be replaced by 'controller', 'blockstorage',
# 'cephstorage' and all the deployed roles.

if hiera('step') >= 4 {
  hiera_include('compute_classes', [])
}

$package_manifest_name = join(['/var/lib/tripleo/installed-packages/overcloud_compute', hiera('step')])
package_manifest{$package_manifest_name: ensure => present}

include ::tripleo::trusted_cas
include ::tripleo::profile::base::ceph::client
include ::timezone
include ::tripleo::profile::base::time::ntp
include ::tripleo::profile::base::snmp
# TODO(emilien): figure how to deal with libvirt profile.
# We'll probably treat it like we do with Neutron plugins.
# Until then, just include it in the default nova-compute role.
include tripleo::profile::base::nova::compute::libvirt
include tripleo::profile::base::nova::libvirt
include ::tripleo::profile::base::kernel
include ::tripleo::profile::base::neutron::plugins::ml2
include ::tripleo::profile::base::neutron::ovs
include ::tripleo::profile::base::ceilometer::agent::compute
include ::tripleo::packages
include ::tripleo::firewall
include ::kmod
kmod::load { 'kvm_intel': }
kmod::option { 'kvm_intel': option => 'nested', value => '1' }
include ::tripleo::vip_hosts
include ::tripleo::profile::base::sshd

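Since the failure is reported for many compute nodes, it may help triage to list which resource indexes the heat-engine.log actually flags as failed, as opposed to those that were only aborted. Below is a minimal sketch, not part of the original report: the function name `failed_deployments` is hypothetical, and the pattern is derived only from the "Deployment exited with non-zero status code" messages quoted above, so it may need adjusting for other log variants.

```python
import re
from collections import Counter

# Pattern built from the failure message seen in the heat-engine.log excerpt
# in this report; other Heat versions may word the message differently.
FAILURE_RE = re.compile(
    r"resources\[(?P<index>\d+)\]: Deployment to server failed: "
    r"deploy_status_code\s*: Deployment exited with non-zero status code: "
    r"(?P<code>\d+)"
)


def failed_deployments(lines):
    """Count log lines per (resource index, exit code) pair.

    `lines` is any iterable of strings, for example an open log file.
    """
    hits = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            key = (int(match.group("index")), int(match.group("code")))
            hits[key] += 1
    return hits
```

Passing an open file handle for the director's heat-engine.log to `failed_deployments` returns the counts; for the excerpt above it would report only resource index 54 with status code 6.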
Hello, Yep, the sosreport and logs would be nice to have in order to triage this issue :). Thank you in advance! Cheers, C.
SOS reports and logs have been downloaded to the new collab-shell server. I will provide the link in the next comment. Thank you!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0922