rhel-osp-director: re-ran the deployment command: "Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR: openstack Heat Stack update failed."

Environment:

Steps to reproduce:
1. Successfully deploy an HA overcloud with network isolation.
2. Re-run the same deployment command.

Result:
Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64
ERROR: openstack Heat Stack update failed.

heat resource-list -n5 overcloud | grep FAIL

SecurityWarning
+---------------+--------------------------------------+--------------------------+-----------------+----------------------+-----------------+
| resource_name | physical_resource_id                 | resource_type            | resource_status | updated_time         | parent_resource |
+---------------+--------------------------------------+--------------------------+-----------------+----------------------+-----------------+
| Controller    | 541369ef-9c8d-436d-960e-57fd8f36a8ba | OS::Heat::ResourceGroup  | UPDATE_FAILED   | 2015-12-12T00:59:39Z |                 |
| 0             | 2290a771-4206-423f-b002-035208185892 | OS::TripleO::Controller  | UPDATE_FAILED   | 2015-12-12T01:00:32Z | Controller      |
+---------------+--------------------------------------+--------------------------+-----------------+----------------------+-----------------+

Seeing these errors in the logs - not sure if related:
Dec 11 19:39:18 localhost os-collect-config: 2015-12-11 19:39:18.042 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:21 localhost os-collect-config: 2015-12-11 19:39:21.052 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:54 localhost os-collect-config: 2015-12-11 19:39:54.096 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:57 localhost os-collect-config: 2015-12-11 19:39:57.100 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:40:30 localhost os-collect-config: 2015-12-11 19:40:30.145 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:40:33 localhost os-collect-config: 2015-12-11 19:40:33.150 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:41:03 localhost os-collect-config: 2015-12-11 19:41:03.188 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:41:03 localhost os-collect-config: 2015-12-11 19:41:03.197 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:41:33 localhost os-collect-config: 2015-12-11 19:41:33.234 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:42:35 localhost os-collect-config: 2015-12-11 19:42:35.612 2369 WARNING os_collect_config.cfn [-] 500 Server Error: InternalFailure

Expected result:
Successful completion.
Created attachment 1104929 [details] logs from one controller.
Environment:
instack-undercloud-2.1.2-36.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch
openstack-dashboard-theme-2015.1.2-4.el7ost.noarch
openstack-nova-common-2015.1.2-7.el7ost.noarch
openstack-ceilometer-alarm-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-3.el7ost.noarch
openstack-swift-proxy-2.3.0-2.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-ceilometer-collector-2015.1.2-1.el7ost.noarch
openstack-nova-scheduler-2015.1.2-7.el7ost.noarch
openstack-keystone-2015.1.2-2.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-2015.1.2-3.el7ost.noarch
openstack-nova-compute-2015.1.2-7.el7ost.noarch
openstack-ceilometer-central-2015.1.2-1.el7ost.noarch
openstack-nova-conductor-2015.1.2-7.el7ost.noarch
openstack-nova-cert-2015.1.2-7.el7ost.noarch
openstack-heat-engine-2015.1.2-4.el7ost.noarch
openstack-glance-2015.1.2-1.el7ost.noarch
openstack-swift-account-2.3.0-2.el7ost.noarch
openstack-selinux-0.6.46-1.el7ost.noarch
openstack-neutron-common-2015.1.2-3.el7ost.noarch
openstack-ceilometer-common-2015.1.2-1.el7ost.noarch
openstack-ceilometer-api-2015.1.2-1.el7ost.noarch
openstack-nova-console-2015.1.2-7.el7ost.noarch
openstack-nova-novncproxy-2015.1.2-7.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.2-4.el7ost.noarch
openstack-swift-container-2.3.0-2.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-3.el7ost.noarch
openstack-neutron-metering-agent-2015.1.2-3.el7ost.noarch
openstack-puppet-modules-2015.1.8-32.el7ost.noarch
openstack-swift-2.3.0-2.el7ost.noarch
openstack-heat-common-2015.1.2-4.el7ost.noarch
openstack-ceilometer-notification-2015.1.2-1.el7ost.noarch
openstack-cinder-2015.1.2-5.el7ost.noarch
openstack-nova-api-2015.1.2-7.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-4.el7ost.noarch
openstack-dashboard-2015.1.2-4.el7ost.noarch
openstack-ceilometer-compute-2015.1.2-1.el7ost.noarch
openstack-heat-api-2015.1.2-4.el7ost.noarch
openstack-swift-object-2.3.0-2.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
(In reply to Alexander Chuzhoy from comment #0)
> rhel-osp-director: re-ran the deployment command: "Stack failed with
> status: resources.Controller: MessagingTimeout: resources[0]: Timed out
> waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64
> ERROR: openstack Heat Stack update failed."
>
> Steps to reproduce:
> 1. Successfully deploy HA overcloud with network isolation.
> 2. Re-run the same deployment command.
>
> heat resource-list -n5 overcloud | grep FAIL
>
> | Controller | 541369ef-9c8d-436d-960e-57fd8f36a8ba | OS::Heat::ResourceGroup | UPDATE_FAILED | 2015-12-12T00:59:39Z | |
> | 0 | 2290a771-4206-423f-b002-035208185892 | OS::TripleO::Controller | UPDATE_FAILED | 2015-12-12T01:00:32Z | Controller |

I found this in heat-engine.log:

2015-12-11 17:48:52.198 10503 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue:

> Seeing these errors in the logs - not sure if related:
> Dec 11 19:41:33 localhost os-collect-config: 2015-12-11 19:41:33.234 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111, 'Connection refused'))
> Dec 11 19:42:35 localhost os-collect-config: 2015-12-11 19:42:35.612 2369 WARNING os_collect_config.cfn [-] 500 Server Error: InternalFailure

The attached heat-api-cfn.log is empty, so it's hard to tell what that "500 Server Error: InternalFailure" is. Also, there is a ~6 hour difference between the os-collect-config errors and the UPDATE_FAILED timestamp, so it is hard to tell whether they are related.

Could you check that rabbitmq is running and healthy? Also, can you check that all heat-engine processes are running and connected to rabbit?
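For reference, the checks asked for above could be run on the undercloud roughly as follows (a sketch for illustration; these are not commands taken from the original report):

# Is rabbitmq up and healthy?
sudo systemctl status rabbitmq-server
sudo rabbitmqctl status

# Are the heat-engine processes running?
sudo systemctl status openstack-heat-engine
ps -ef | grep '[h]eat-engine'

# Are they connected to rabbit? (look for connections from the heat user)
sudo rabbitmqctl list_connections user peer_host state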
Oh, the attached logs were from a controller node. The error was from heat on the undercloud, so you'll need to attach the undercloud logs.
Sasha, do you have the undercloud logs to attach?
It looks like the root cause of this was that the undercloud only had one vCPU, so heat's default of one engine worker per core resulted in only a single heat-engine process being spawned. A single heat-engine worker is not enough to launch an overcloud stack.

Workarounds for this issue:
- The undercloud VM needs at least 2 (virtual) cores. This needs to be standard for test environments.
- *or* manually uncomment [DEFAULT]num_engine_workers in /etc/heat/heat.conf and restart openstack-heat-engine.

This bz can be targeted at 8.0 so that it can track the upstream bug, which will set workers to max(cores, 4).
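For anyone triaging a similar environment, the condition described above can be confirmed with a few commands on the undercloud (a sketch for illustration; the paths and service names are the stock ones mentioned in this bug):

# How many vCPUs does the undercloud have?
grep -c ^processor /proc/cpuinfo

# How many heat-engine worker processes are actually running?
ps -ef | grep '[h]eat-engine' | wc -l

# Is num_engine_workers still commented out (the default)?
grep num_engine_workers /etc/heat/heat.conf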
Doc/workaround works for me.
*** Bug 1290950 has been marked as a duplicate of this bug. ***
Can we please verify this impacts only virt environments and that the workaround works? Thanks
Was able to successfully re-run the deployment command on a setup with 2 vCPUs.
Failed to scale up the overcloud with this error message; attempted to add one compute node on a bare metal setup.

Environment:
openstack-tripleo-heat-templates-0.8.6-94.el7ost.noarch
instack-undercloud-2.1.2-36.el7ost.noarch

The executed command:
openstack overcloud deploy --templates --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --control-scale 3 --ceph-storage-scale 3 --compute-scale 2 --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph
W/A that did the trick:
1. Edit the file /etc/heat/heat.conf on the undercloud and uncomment the line:
   #num_engine_workers = 4
2. Restart openstack-heat-engine.
3. Re-run the scale up command.
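The same workaround can be applied non-interactively; this is a sketch assuming the openstack-config helper from openstack-utils is available on the undercloud (editing /etc/heat/heat.conf by hand as described above works just as well):

# Set the heat-engine worker count explicitly instead of uncommenting the sample line
sudo openstack-config --set /etc/heat/heat.conf DEFAULT num_engine_workers 4

# Restart heat-engine so the new worker count takes effect
sudo systemctl restart openstack-heat-engine

# Confirm that multiple heat-engine processes are running before re-running the scale up
ps -ef | grep '[h]eat-engine'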
OK, I'm going to push ahead with the upstream fix to pin the minimum number of workers to 4.
I'm trying to gather all the undercloud tuning bits into one BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1288153
*** Bug 1289648 has been marked as a duplicate of this bug. ***
Steve, are you making sure this will be part of 7.0.4?
*** Bug 1305557 has been marked as a duplicate of this bug. ***
My customer is also hitting this issue. This is on a physical machine with 8 cores; the workers in heat have already been adjusted to 8.

[stack@blkcclu001 ~]$ heat resource-show overcloud Compute
| Property               | Value |
| attributes             | { "attributes": null, "refs": null } |
| description            | |
| links                  | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b/resources/Compute (self) |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b (stack) |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud-Compute-vyutisc7pljo/b4ac65d4-8b33-4337-859a-1a453b8f3034 (nested) |
| logical_resource_id    | Compute |
| physical_resource_id   | b4ac65d4-8b33-4337-859a-1a453b8f3034 |
| required_by            | AllNodesExtraConfig |
|                        | ComputeCephDeployment |
|                        | allNodesConfig |
|                        | ComputeAllNodesDeployment |
|                        | ComputeNodesPostDeployment |
|                        | ComputeAllNodesValidationDeployment |
| resource_name          | Compute |
| resource_status        | UPDATE_FAILED |
| resource_status_reason | resources.Compute: MessagingTimeout: resources[8]: Timed out waiting for a reply to message ID eabc9302615648ab8b29adc361b4bfda |
| resource_type          | OS::Heat::ResourceGroup |
| updated_time           | 2016-02-05T15:19:19Z |

From the os-collect-config logs:
Feb 04 22:05:59 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:05:59.057 4736 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Feb 04 22:05:59 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:05:59.057 4736 WARNING os-collect-config [-] Source [ec2] Unavailable.
Feb 04 22:06:02 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:06:02.063 4736 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Feb 04 22:06:02 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:06:02.063 4736 WARNING os-collect-config [-] Source [cfn] Unavailable.

The customer is trying to deploy 7.2.
Verified:

Environment:
openstack-tripleo-heat-templates-0.8.6-117.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch
openstack-heat-common-2015.1.2-8.el7ost.noarch

Successfully deployed the overcloud with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server x.x.x.x --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e ~/ssl-heat-templates/environments/enable-tls.yaml -e ~/ssl-heat-templates/environments/inject-trust-anchor.yaml

Populated the overcloud with objects.
Reran the deployment command - completed successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-0266.html
*** Bug 1263345 has been marked as a duplicate of this bug. ***