1290949 – rhel-osp-director: re-ran the deployment command: "Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR: openstack Heat Stack update failed."

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-heat
Sub Component:
Version:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	unspecified
Target Milestone:	z4
Target Release:	7.0 (Kilo)
Assignee:	Steve Baker
QA Contact:	Alexander Chuzhoy
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1263345 1289648
Depends On:	1305947
Blocks:	1275439

Reported:	2015-12-12 01:16 UTC by Alexander Chuzhoy
Modified:	2022-07-09 08:11 UTC
CC List:	15 users
Fixed In Version:	openstack-heat-2015.1.2-8.el7ost
Doc Type:	Known Issue
Doc Text:	By default, the number of heat-engine workers matches the number of cores on the undercloud. Previously, an undercloud with only one core therefore ran a single heat-engine worker, which is not enough to launch an overcloud stack and caused deadlocks during stack creation. To avoid this, the undercloud should have at least two (virtual) cores. For virtual deployments this means two vCPUs, regardless of the number of cores on the baremetal host. If this is not possible, uncomment the num_engine_workers line in /etc/heat/heat.conf and restart openstack-heat-engine.
Clone Of:
Environment:
Last Closed:	2016-02-18 16:42:01 UTC
Target Upstream Version:
Embargoed:

Attachments
logs from one controller. (5.48 MB, application/x-gzip) 2015-12-12 01:17 UTC, Alexander Chuzhoy

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1526045	None	None	None	Never
OpenStack gerrit	259172	None	MERGED	Make minimum default num_engine_workers>=4	2020-08-28 02:24:36 UTC
Red Hat Issue Tracker	OSP-16722	None	None	None	2022-07-09 08:11:42 UTC
Red Hat Product Errata	RHSA-2016:0266	normal	SHIPPED_LIVE	Moderate: openstack-heat bug fix and security advisory	2016-02-18 21:41:02 UTC

Description Alexander Chuzhoy 2015-12-12 01:16:47 UTC

rhel-osp-director: re-ran the deployment command:  "Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR: openstack Heat Stack update failed."


Environment:


Steps to reproduce:
1. Successfully deploy HA overcloud with network isolation.
2. Re-run the same deployment command.


Result:
Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64
ERROR: openstack Heat Stack update failed.


 heat resource-list -n5 overcloud|grep FAIL

  SecurityWarning
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                               |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| Controller                                    | 541369ef-9c8d-436d-960e-57fd8f36a8ba          | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2015-12-12T00:59:39Z |                                               |
| 0                                             | 2290a771-4206-423f-b002-035208185892          | OS::TripleO::Controller                           | UPDATE_FAILED   | 2015-12-12T01:00:32Z | Controller                                    |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+

Some of these errors appear in the logs; not sure if they are related:
Dec 11 19:39:18 localhost os-collect-config: 2015-12-11 19:39:18.042 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:21 localhost os-collect-config: 2015-12-11 19:39:21.052 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:54 localhost os-collect-config: 2015-12-11 19:39:54.096 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:57 localhost os-collect-config: 2015-12-11 19:39:57.100 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:40:30 localhost os-collect-config: 2015-12-11 19:40:30.145 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:40:33 localhost os-collect-config: 2015-12-11 19:40:33.150 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:41:03 localhost os-collect-config: 2015-12-11 19:41:03.188 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:41:03 localhost os-collect-config: 2015-12-11 19:41:03.197 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:41:33 localhost os-collect-config: 2015-12-11 19:41:33.234 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:42:35 localhost os-collect-config: 2015-12-11 19:42:35.612 2369 WARNING os_collect_config.cfn [-] 500 Server Error: InternalFailure


Expected result:
Successful completion.

Comment 2 Alexander Chuzhoy 2015-12-12 01:17:19 UTC

Created attachment 1104929
logs from one controller.

Comment 3 Alexander Chuzhoy 2015-12-12 01:27:16 UTC

Environment:

instack-undercloud-2.1.2-36.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch
openstack-dashboard-theme-2015.1.2-4.el7ost.noarch
openstack-nova-common-2015.1.2-7.el7ost.noarch
openstack-ceilometer-alarm-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-3.el7ost.noarch
openstack-swift-proxy-2.3.0-2.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-ceilometer-collector-2015.1.2-1.el7ost.noarch
openstack-nova-scheduler-2015.1.2-7.el7ost.noarch
openstack-keystone-2015.1.2-2.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-2015.1.2-3.el7ost.noarch
openstack-nova-compute-2015.1.2-7.el7ost.noarch
openstack-ceilometer-central-2015.1.2-1.el7ost.noarch
openstack-nova-conductor-2015.1.2-7.el7ost.noarch
openstack-nova-cert-2015.1.2-7.el7ost.noarch
openstack-heat-engine-2015.1.2-4.el7ost.noarch
openstack-glance-2015.1.2-1.el7ost.noarch
openstack-swift-account-2.3.0-2.el7ost.noarch
openstack-selinux-0.6.46-1.el7ost.noarch
openstack-neutron-common-2015.1.2-3.el7ost.noarch
openstack-ceilometer-common-2015.1.2-1.el7ost.noarch
openstack-ceilometer-api-2015.1.2-1.el7ost.noarch
openstack-nova-console-2015.1.2-7.el7ost.noarch
openstack-nova-novncproxy-2015.1.2-7.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.2-4.el7ost.noarch
openstack-swift-container-2.3.0-2.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-3.el7ost.noarch
openstack-neutron-metering-agent-2015.1.2-3.el7ost.noarch
openstack-puppet-modules-2015.1.8-32.el7ost.noarch
openstack-swift-2.3.0-2.el7ost.noarch
openstack-heat-common-2015.1.2-4.el7ost.noarch
openstack-ceilometer-notification-2015.1.2-1.el7ost.noarch
openstack-cinder-2015.1.2-5.el7ost.noarch
openstack-nova-api-2015.1.2-7.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-4.el7ost.noarch
openstack-dashboard-2015.1.2-4.el7ost.noarch
openstack-ceilometer-compute-2015.1.2-1.el7ost.noarch
openstack-heat-api-2015.1.2-4.el7ost.noarch
openstack-swift-object-2.3.0-2.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch

Comment 4 Steve Baker 2015-12-13 19:51:23 UTC

(In reply to Alexander Chuzhoy from comment #0)
> rhel-osp-director: re-ran the deployment command:  "Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR: openstack Heat Stack update failed."
>
>
> Environment:
>
>
> Steps to reproduce:
> 1. Successfully deploy HA overcloud with network isolation.
> 2. Re-run the same deployment command.
>
>
> Result:
> Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64
> ERROR: openstack Heat Stack update failed.
>
>
>  heat resource-list -n5 overcloud|grep FAIL
>
>   SecurityWarning
> +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
> | resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time         | parent_resource |
> +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
> | Controller    | 541369ef-9c8d-436d-960e-57fd8f36a8ba | OS::Heat::ResourceGroup | UPDATE_FAILED   | 2015-12-12T00:59:39Z |                 |
> | 0             | 2290a771-4206-423f-b002-035208185892 | OS::TripleO::Controller | UPDATE_FAILED   | 2015-12-12T01:00:32Z | Controller      |
> +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
>

I found this in heat-engine.log

2015-12-11 17:48:52.198 10503 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue:



> Some these errors in the logs - not sure if related:
> Dec 11 19:41:33 localhost os-collect-config: 2015-12-11 19:41:33.234 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111, 'Connection refused'))
> Dec 11 19:42:35 localhost os-collect-config: 2015-12-11 19:42:35.612 2369 WARNING os_collect_config.cfn [-] 500 Server Error: InternalFailure

The attached heat-api-cfn.log is empty, so it's hard to tell what that 500 Server Error: InternalFailure is. Also, there is a ~6 hour difference between the os-collect-config errors and the UPDATE_FAILED timestamp, so it is hard to tell if they are related.

Could you check that rabbitmq is running and healthy? Also can you check that all heat-engine processes are running and are connected to rabbit?

Comment 5 Steve Baker 2015-12-13 19:57:31 UTC

Oh, the attached logs were from a controller node. The error was from heat on the undercloud, so you'll need to attach the undercloud logs.

Comment 6 chris alfonso 2015-12-14 10:48:30 UTC

Sasha, do you have the undercloud logs to attach?

Comment 8 Steve Baker 2015-12-14 20:37:22 UTC

It looks like the root cause of this was that the undercloud only had one vCPU, so heat's default of one worker per core resulted in only a single heat-engine process being spawned. A single heat-engine worker is not enough to launch an overcloud stack.

Workarounds for this issue:
- undercloud vm needs at least 2 (virtual) cores. This needs to be standard for test environments
- *or* manually uncomment /etc/heat/heat.conf [DEFAULT]num_engine_workers and restart openstack-heat-engine

This bz can be targeted at 8.0 so that it can track the upstream bug which will set workers to max(cores, 4)
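
The worker-count rule discussed above can be sketched as follows. This is only an illustration of the max(cores, 4) idea, not Heat's actual code; the function names default_engine_workers and legacy_engine_workers and the constant MIN_WORKERS are hypothetical.

```python
import os

# Floor proposed for the upstream change tracked by this bug (assumption:
# taken from the "max(cores, 4)" wording in this comment).
MIN_WORKERS = 4


def legacy_engine_workers(cores):
    """Old behaviour as described in this bug: one worker per core."""
    return max(cores, 1)


def default_engine_workers(cores=None, minimum=MIN_WORKERS):
    """Proposed behaviour: one worker per core, but never fewer than `minimum`."""
    if cores is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cores = os.cpu_count() or 1
    return max(cores, minimum)


if __name__ == "__main__":
    for cores in (1, 2, 8):
        print(cores, legacy_engine_workers(cores), default_engine_workers(cores))
```

With a single vCPU the old rule yields one worker, which is the case that hit the timeout here, while the floored rule yields four.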

Comment 9 Jaromir Coufal 2015-12-14 23:41:07 UTC

Doc/workaround works for me.

Comment 11 chris alfonso 2015-12-15 08:44:54 UTC

*** Bug 1290950 has been marked as a duplicate of this bug. ***

Comment 12 Jaromir Coufal 2015-12-16 12:46:28 UTC

Can we please verify this impacts only virt environments and that the workaround works? Thanks

Comment 13 Alexander Chuzhoy 2015-12-16 15:17:11 UTC

Was able to successfully re-run the deployment command on a setup with 2 vCPUs.

Comment 14 Alexander Chuzhoy 2015-12-17 17:25:20 UTC

Failed to scale up the overcloud with the same message when attempting to add one compute node on a bare metal setup.
Environment:
openstack-tripleo-heat-templates-0.8.6-94.el7ost.noarch
instack-undercloud-2.1.2-36.el7ost.noarch


The executed command:
openstack overcloud deploy --templates --ntp-server 10.5.26.10 --timeout 90  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml  -e network-environment.yaml  --control-scale 3 --ceph-storage-scale 3 --compute-scale 2 --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph

Comment 15 Alexander Chuzhoy 2015-12-17 19:58:54 UTC

Workaround that did the trick:
1. edit the file /etc/heat/heat.conf on the undercloud and uncomment the line:
#num_engine_workers = 4

2. restart openstack-heat-engine

3. re-run the scale up command.
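
Step 1 of the workaround above could be scripted roughly as below. This is a minimal sketch that works on the text of heat.conf rather than on the file itself; the helper name enable_engine_workers is hypothetical, and it assumes the file contains a single (possibly commented-out) num_engine_workers line.

```python
import re

# Matches "num_engine_workers = N" with or without a leading "#".
# [ \t] is used instead of \s so the match never spans a line break.
_WORKERS_LINE = re.compile(
    r"^[ \t]*#?[ \t]*num_engine_workers[ \t]*=.*$", re.MULTILINE
)


def enable_engine_workers(conf_text, workers=4):
    """Return conf_text with num_engine_workers uncommented and set to `workers`."""
    replacement = "num_engine_workers = {}".format(workers)
    new_text, count = _WORKERS_LINE.subn(replacement, conf_text, count=1)
    if count == 0:
        raise ValueError("no num_engine_workers line found")
    return new_text


if __name__ == "__main__":
    sample = "[DEFAULT]\n#num_engine_workers = 4\n"
    print(enable_engine_workers(sample))
```

After writing the result back to /etc/heat/heat.conf on the undercloud, openstack-heat-engine still has to be restarted (step 2) before the deployment command is re-run.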

Comment 16 Steve Baker 2015-12-17 20:03:02 UTC

OK, I'm going to push ahead with the upstream fix to pin the minimum number of workers to 4.

Comment 17 Dan Yocum 2015-12-28 16:11:35 UTC

I'm trying to gather all the undercloud tuning bits into one BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1288153

Comment 18 Steven Hardy 2016-01-07 11:51:11 UTC

*** Bug 1289648 has been marked as a duplicate of this bug. ***

Comment 19 Jaromir Coufal 2016-01-20 17:28:54 UTC

Steve, are you making sure this will be part of 7.0.4?

Comment 21 Zane Bitter 2016-02-08 15:14:46 UTC

*** Bug 1305557 has been marked as a duplicate of this bug. ***

Comment 22 Jack Waterworth 2016-02-08 15:26:18 UTC

My customer is also hitting this issue. This is on a physical machine with 8 cores. The heat workers have already been adjusted to 8.


[stack@blkcclu001 ~]$ heat resource-show overcloud Compute
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Property               | Value                                                                                                                                            |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes             | {                                                                                                                                                |
|                        |   "attributes": null,                                                                                                                            |
|                        |   "refs": null                                                                                                                                   |
|                        | }                                                                                                                                                |
| description            |                                                                                                                                                  |
| links                  | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b/resources/Compute (self)      |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b (stack)                       |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud-Compute-vyutisc7pljo/b4ac65d4-8b33-4337-859a-1a453b8f3034 (nested) |
| logical_resource_id    | Compute                                                                                                                                          |
| physical_resource_id   | b4ac65d4-8b33-4337-859a-1a453b8f3034                                                                                                             |
| required_by            | AllNodesExtraConfig                                                                                                                              |
|                        | ComputeCephDeployment                                                                                                                            |
|                        | allNodesConfig                                                                                                                                   |
|                        | ComputeAllNodesDeployment                                                                                                                        |
|                        | ComputeNodesPostDeployment                                                                                                                       |
|                        | ComputeAllNodesValidationDeployment                                                                                                              |
| resource_name          | Compute                                                                                                                                          |
| resource_status        | UPDATE_FAILED                                                                                                                                    |
| resource_status_reason | resources.Compute: MessagingTimeout: resources[8]: Timed out waiting for a reply to message ID eabc9302615648ab8b29adc361b4bfda                  |
| resource_type          | OS::Heat::ResourceGroup                                                                                                                          |
| updated_time           | 2016-02-05T15:19:19Z                                                                                                                             |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+

From the os-collect-config logs:

Feb 04 22:05:59 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:05:59.057 4736 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Feb 04 22:05:59 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:05:59.057 4736 WARNING os-collect-config [-] Source [ec2] Unavailable.
Feb 04 22:06:02 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:06:02.063 4736 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Feb 04 22:06:02 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:06:02.063 4736 WARNING os-collect-config [-] Source [cfn] Unavailable.

The customer is trying to deploy 7.2.

Comment 26 Alexander Chuzhoy 2016-02-10 19:09:56 UTC

Verified:
Environment:
openstack-tripleo-heat-templates-0.8.6-117.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch
openstack-heat-common-2015.1.2-8.el7ost.noarch


Successfully deployed overcloud with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 1   --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server x.x.x.x --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e ~/ssl-heat-templates/environments/enable-tls.yaml -e ~/ssl-heat-templates/environments/inject-trust-anchor.yaml

Populated the overcloud with objects.

Reran the deployment command - completed successfully.

Comment 28 errata-xmlrpc 2016-02-18 16:42:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0266.html

Comment 29 Zane Bitter 2016-03-22 22:20:38 UTC

*** Bug 1263345 has been marked as a duplicate of this bug. ***
