Bug 1660567 - [OSP10] Overcloud update: Could not connect to cluster (is it running?)
Summary: [OSP10] Overcloud update: Could not connect to cluster (is it running?)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-firewall
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Alex Schultz
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-18 16:19 UTC by ggrimaux
Modified: 2019-04-30 16:59 UTC
CC List: 12 users

Fixed In Version: puppet-firewall-1.8.1-4.e70157egit.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-30 16:59:39 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:0922 (last updated 2019-04-30 16:59:48 UTC)

Description ggrimaux 2018-12-18 16:19:42 UTC
Description of problem:
On a recent stack update it failed with the following:

overcloud.AllNodesDeploySteps.ComputeDeployment_Step1.50:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: d3e320e6-989d-4ccc-9632-f95a6432025a
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
  deploy_stdout: |
    Notice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.
    Notice: Compiled catalog for overcloud-compute-50.localdomain in environment production in 1.75 seconds
    Notice: /Stage[main]/Tripleo::Profile::Base::Kernel/Kmod::Load[nf_conntrack_proto_sctp]/Exec[modprobe nf_conntrack_proto_sctp]/returns: executed successfully
    Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[002 accept all to lo interface]/Firewall[002 accept all to lo interface ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[001 accept all icmp]/Firewall[001 accept all icmp ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[000 accept related established rules]/Firewall[000 accept related established rules ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[003 accept ssh]/Firewall[003 accept ssh ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[neutron_ovs_agent]/Tripleo::Firewall::Rule[136 neutron gre networks]/Firewall[136 neutron gre networks ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[neutron_ovs_agent]/Tripleo::Firewall::Rule[118 neutron vxlan networks]/Firewall[118 neutron vxlan networks ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[ntp]/Tripleo::Firewall::Rule[105 ntp]/Firewall[105 ntp ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Tripleo::Firewall::Service_rules[nova_libvirt]/Tripleo::Firewall::Rule[200 nova_libvirt]/Firewall[200 nova_libvirt ipv6]/ensure: created
    Notice: /Stage[main]/Ssh::Server::Config/Concat[/etc/ssh/sshd_config]/File[/etc/ssh/sshd_config]/content: content changed '{md5}baedf1034830191df5b3d75cb450ea2f' to '{md5}c3c5d6edffcb177a423195170f9277ab'
    Notice: /Stage[main]/Ssh::Server::Service/Service[sshd]: Triggered 'refresh' from 2 events
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[998 log all]/Firewall[998 log all ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]/ensure: created
    Notice: /Stage[main]/Tripleo::Firewall/Exec[nonpersistent_v6_rules_cleanup]/returns: executed successfully
    Notice: Finished catalog run in 29.99 seconds
  deploy_stderr: |
    exception: connect failed
    error: Could not connect to cluster (is it running?)
overcloud.AllNodesDeploySteps.ControllerDeployment_Step1:
  resource_type: OS::Heat::StructuredDeploymentGroup
  physical_resource_id: 889e66e6-bf9b-47ca-ab5d-ae44c0ce3015
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted

This output is repeated for many, if not all, of the compute nodes.
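
For triage, a minimal sketch of how the failed deployments could be enumerated from the undercloud; the stack name 'overcloud' and the nested depth are assumptions based on the resource paths above:

# On the undercloud, with the stack user's credentials (default director layout assumed).
source ~/stackrc

# Walk the nested AllNodesDeploySteps sub-stacks and keep only the failed resources.
openstack stack resource list --nested-depth 5 overcloud | grep -i 'FAILED'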

From the heat-engine.log on the director node:

2018-12-15 15:41:29.319 14507 INFO heat.engine.resource [req-d8622bbf-4583-4d31-84a5-1baf8182599f e73382d0112c4ceaa8cb7363146b81cc 9f22a19a5a034d6fbf57537b8b246750 - - -] UPDATE: StructuredDeployment "54" [cc51baf1-c939-4081-9a25-b98b9434b9f1] Stack "overcloud-AllNodesDeploySteps-cg7re45aekzh-ComputeDeployment_Step1-fp4ayqsy76mt" [8f13d2c2-2089-47ab-929b-7ea00217df7f]
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource Traceback (most recent call last):
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 787, in _action_recorder
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     yield
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1366, in update
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     prop_diff])
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     step = next(subtask)
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 840, in action_handler_task
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     done = check(handler_data)
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 466, in check_update_complete
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     return self._check_complete()
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 301, in _check_complete
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource     raise exception.Error(message)
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2018-12-15 15:41:29.319 14507 ERROR heat.engine.resource
2018-12-15 15:41:31.517 14507 INFO heat.engine.stack [req-d8622bbf-4583-4d31-84a5-1baf8182599f e73382d0112c4ceaa8cb7363146b81cc 9f22a19a5a034d6fbf57537b8b246750 - - -] Stack UPDATE FAILED (overcloud-AllNodesDeploySteps-cg7re45aekzh-ComputeDeployment_Step1-fp4ayqsy76mt): Error: resources[54]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2018-12-15 15:41:32.151 14500 INFO heat.engine.update [req-1e901ae9-30a3-48c0-a5dc-bf0c482f751d - - - - -] Resource 0 for stack overcloud-AllNodesDeploySteps-cg7re45aekzh-ControllerDeployment_Step1-f3ffknlb5nb5 updated
2018-12-15 15:41:32.161 14495 INFO heat.engine.resource [req-b7a3958d-5d51-402a-869f-8228bedb01b3 - - - - -] UPDATE: StructuredDeploymentGroup "ComputeDeployment_Step1" [8f13d2c2-2089-47ab-929b-7ea00217df7f] Stack "overcloud-AllNodesDeploySteps-cg7re45aekzh" [7b59c9fd-c305-4274-b241-f13561505bf7]
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource Traceback (most recent call last):
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 787, in _action_recorder
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     yield
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1366, in update
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     prop_diff])
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     step = next(subtask)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 840, in action_handler_task
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     done = check(handler_data)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 397, in check_update_complete
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     if not checker.step():
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 219, in step
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     poll_period = next(self._runner)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 385, in _run_to_completion
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     self).check_update_complete(updater):
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 513, in check_update_complete
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     cookie=cookie)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 415, in _check_status_complete
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource     action=action)
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource ResourceFailure: resources.ComputeDeployment_Step1: Error: resources[54]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2018-12-15 15:41:32.161 14495 ERROR heat.engine.resource

I have SOS reports from the director and a compute node if any additional logs would be useful in this case.
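
For context, the "non-zero status code: 6" above matches Puppet's --detailed-exitcodes convention (2 = changes applied, 4 = failures, 6 = both), which the heat-config puppet hook appears to use. A hedged sketch for reproducing the failure directly on an affected compute node, using the manifest path shown in the Additional info below:

# Re-run the exact manifest that the deployment applied and echo Puppet's
# detailed exit code (0 = clean, 2 = changes, 4 = failures, 6 = changes + failures).
puppet apply --detailed-exitcodes --verbose \
    /var/lib/heat-config/heat-config-puppet/2ee89657-9906-45c7-827a-29cf70ce6cee.pp
echo "puppet exit code: $?"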

Version-Release number of selected component (if applicable):
openstack-heat-common-7.0.6-4.el7ost

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
If useful, here is the content of one of the .pp files that failed:

[root@overcloud-compute-57:~]# cat /var/lib/heat-config/heat-config-puppet/2ee89657-9906-45c7-827a-29cf70ce6cee.pp
# Copyright 2014 Red Hat, Inc.
# All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

# The content of this file will be used to generate
# the puppet manifests for all roles, the placeholder
# compute will be replaced by 'controller', 'blockstorage',
# 'cephstorage' and all the deployed roles.

if hiera('step') >= 4 {
  hiera_include('compute_classes', [])
}

$package_manifest_name = join(['/var/lib/tripleo/installed-packages/overcloud_compute', hiera('step')])
package_manifest{$package_manifest_name: ensure => present}
include ::tripleo::trusted_cas

include ::tripleo::profile::base::ceph::client
include ::timezone
include ::tripleo::profile::base::time::ntp
include ::tripleo::profile::base::snmp
# TODO(emilien): figure how to deal with libvirt profile.
# We'll probably treat it like we do with Neutron plugins.
# Until then, just include it in the default nova-compute role.
include tripleo::profile::base::nova::compute::libvirt
include tripleo::profile::base::nova::libvirt
include ::tripleo::profile::base::kernel
include ::tripleo::profile::base::neutron::plugins::ml2

include ::tripleo::profile::base::neutron::ovs
include ::tripleo::profile::base::ceilometer::agent::compute


include ::tripleo::packages
include ::tripleo::firewall


include ::kmod
kmod::load { 'kvm_intel': }
kmod::option { 'kvm_intel':
  option => 'nested',
  value  => '1'
}


include ::tripleo::vip_hosts
include ::tripleo::profile::base::sshd
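
Since this manifest pulls in ::tripleo::firewall, and the deploy_stdout above shows the IPv6 "998 log all" / "999 drop all" post rules being created immediately before the failure, the following generic commands (a sketch, to be run on an affected compute node) would capture the rule set that was actually live at that point:

# Dump the live IPv4 and IPv6 rule sets to see what ended up ahead of the
# 998 log all / 999 drop all rules.
iptables -S
ip6tables -S

# Persisted rules, if any, for comparison with the live state.
cat /etc/sysconfig/iptables /etc/sysconfig/ip6tables 2>/dev/null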

Comment 1 Cédric Jeanneret 2018-12-19 15:02:17 UTC
Hello,

Yep, the sosreport and logs would be nice to have in order to triage this issue :).

Thank you in advance!

Cheers,

C.

Comment 2 ggrimaux 2018-12-19 15:11:23 UTC
SOS reports and logs have been uploaded to the new collab-shell server.

Will provide link in the next comment.

Thank you!

Comment 15 errata-xmlrpc 2019-04-30 16:59:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0922

