Bug 1290949

Summary: rhel-osp-director: re-ran the deployment command: "Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR: openstack Heat Stack update failed."
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: openstack-heat
Assignee: Steve Baker <sbaker>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: unspecified
Docs Contact:
Priority: urgent
Version: 7.0 (Kilo)
CC: dyocum, hbrock, jcoufal, jwaterwo, mburns, opavlenk, rhel-osp-director-maint, rybrown, sasha, sbaker, shardy, sputhenp, ssainkar, yeylon, zbitter
Target Milestone: z4
Keywords: Triaged, ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-heat-2015.1.2-8.el7ost
Doc Type: Known Issue
Doc Text:
By default, the number of heat-engine workers matches the number of cores on the undercloud. Previously, an undercloud with only one core spawned only one heat-engine worker, which caused deadlocks when creating the overcloud stack; a single heat-engine worker is not enough to launch an overcloud stack. To avoid this, the undercloud should have at least two (virtual) cores. For virtual deployments this means two vCPUs, regardless of the number of cores on the bare-metal host. If that is not possible, uncommenting the num_engine_workers line in /etc/heat/heat.conf and restarting openstack-heat-engine works around the issue.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 16:42:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1305947    
Bug Blocks: 1275439    
Attachments:
logs from one controller.

Description Alexander Chuzhoy 2015-12-12 01:16:47 UTC
rhel-osp-director: re-ran the deployment command:  "Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR: openstack Heat Stack update failed."


Environment:


Steps to reproduce:
1. Successfully deploy HA overcloud with network isolation.
2. Re-run the same deployment command.


Result:
Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64
ERROR: openstack Heat Stack update failed.


 heat resource-list -n5 overcloud|grep FAIL

  SecurityWarning
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                               |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| Controller                                    | 541369ef-9c8d-436d-960e-57fd8f36a8ba          | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2015-12-12T00:59:39Z |                                               |
| 0                                             | 2290a771-4206-423f-b002-035208185892          | OS::TripleO::Controller                           | UPDATE_FAILED   | 2015-12-12T01:00:32Z | Controller                                    |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+

Some of these errors appear in the logs - not sure if they are related:
Dec 11 19:39:18 localhost os-collect-config: 2015-12-11 19:39:18.042 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:21 localhost os-collect-config: 2015-12-11 19:39:21.052 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:54 localhost os-collect-config: 2015-12-11 19:39:54.096 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:39:57 localhost os-collect-config: 2015-12-11 19:39:57.100 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:40:30 localhost os-collect-config: 2015-12-11 19:40:30.145 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:40:33 localhost os-collect-config: 2015-12-11 19:40:33.150 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Dec 11 19:41:03 localhost os-collect-config: 2015-12-11 19:41:03.188 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:41:03 localhost os-collect-config: 2015-12-11 19:41:03.197 2369 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:41:33 localhost os-collect-config: 2015-12-11 19:41:33.234 2369 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111, 'Connection refused'))
Dec 11 19:42:35 localhost os-collect-config: 2015-12-11 19:42:35.612 2369 WARNING os_collect_config.cfn [-] 500 Server Error: InternalFailure


Expected result:
Successful completion.

Comment 2 Alexander Chuzhoy 2015-12-12 01:17:19 UTC
Created attachment 1104929 [details]
logs from one controller.

Comment 3 Alexander Chuzhoy 2015-12-12 01:27:16 UTC
Environment:

instack-undercloud-2.1.2-36.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch
openstack-dashboard-theme-2015.1.2-4.el7ost.noarch
openstack-nova-common-2015.1.2-7.el7ost.noarch
openstack-ceilometer-alarm-2015.1.2-1.el7ost.noarch
openstack-neutron-ml2-2015.1.2-3.el7ost.noarch
openstack-swift-proxy-2.3.0-2.el7ost.noarch
openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-ceilometer-collector-2015.1.2-1.el7ost.noarch
openstack-nova-scheduler-2015.1.2-7.el7ost.noarch
openstack-keystone-2015.1.2-2.el7ost.noarch
openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch
openstack-neutron-2015.1.2-3.el7ost.noarch
openstack-nova-compute-2015.1.2-7.el7ost.noarch
openstack-ceilometer-central-2015.1.2-1.el7ost.noarch
openstack-nova-conductor-2015.1.2-7.el7ost.noarch
openstack-nova-cert-2015.1.2-7.el7ost.noarch
openstack-heat-engine-2015.1.2-4.el7ost.noarch
openstack-glance-2015.1.2-1.el7ost.noarch
openstack-swift-account-2.3.0-2.el7ost.noarch
openstack-selinux-0.6.46-1.el7ost.noarch
openstack-neutron-common-2015.1.2-3.el7ost.noarch
openstack-ceilometer-common-2015.1.2-1.el7ost.noarch
openstack-ceilometer-api-2015.1.2-1.el7ost.noarch
openstack-nova-console-2015.1.2-7.el7ost.noarch
openstack-nova-novncproxy-2015.1.2-7.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.2-4.el7ost.noarch
openstack-swift-container-2.3.0-2.el7ost.noarch
openstack-neutron-openvswitch-2015.1.2-3.el7ost.noarch
openstack-neutron-metering-agent-2015.1.2-3.el7ost.noarch
openstack-puppet-modules-2015.1.8-32.el7ost.noarch
openstack-swift-2.3.0-2.el7ost.noarch
openstack-heat-common-2015.1.2-4.el7ost.noarch
openstack-ceilometer-notification-2015.1.2-1.el7ost.noarch
openstack-cinder-2015.1.2-5.el7ost.noarch
openstack-nova-api-2015.1.2-7.el7ost.noarch
openstack-heat-api-cfn-2015.1.2-4.el7ost.noarch
openstack-dashboard-2015.1.2-4.el7ost.noarch
openstack-ceilometer-compute-2015.1.2-1.el7ost.noarch
openstack-heat-api-2015.1.2-4.el7ost.noarch
openstack-swift-object-2.3.0-2.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch

Comment 4 Steve Baker 2015-12-13 19:51:23 UTC
(In reply to Alexander Chuzhoy from comment #0)
> rhel-osp-director: re-ran the deployment command: "Stack failed with
> status: resources.Controller: MessagingTimeout: resources[0]: Timed out
> waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR:
> openstack Heat Stack update failed."
> [snip: steps to reproduce and heat resource-list output quoted from comment 0]

I found this in heat-engine.log

2015-12-11 17:48:52.198 10503 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue:



> Some these errors in the logs - not sure if related:
> Dec 11 19:41:33 localhost os-collect-config: 2015-12-11 19:41:33.234 2369
> WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111,
> 'Connection refused'))
> Dec 11 19:42:35 localhost os-collect-config: 2015-12-11 19:42:35.612 2369
> WARNING os_collect_config.cfn [-] 500 Server Error: InternalFailure

The attached heat-api-cfn.log is empty, so it's hard to tell what that 500 Server Error: InternalFailure is. Also, there is a ~6 hour difference between the os-collect-config errors and the UPDATE_FAILED timestamp, so it is hard to tell whether they are related.

Could you check that rabbitmq is running and healthy? Also can you check that all heat-engine processes are running and are connected to rabbit?

Comment 5 Steve Baker 2015-12-13 19:57:31 UTC
Oh, the attached logs were from a controller node. The error was from heat on the undercloud, so you'll need to attach the undercloud logs.

Comment 6 chris alfonso 2015-12-14 10:48:30 UTC
Sasha, do you have the undercloud logs to attach?

Comment 8 Steve Baker 2015-12-14 20:37:22 UTC
It looks like the root cause of this was that the undercloud had only one vCPU, so heat's default of one worker per core resulted in only a single heat-engine process being spawned. A single heat-engine worker is not enough to launch an overcloud stack.

Workarounds for this issue:
- The undercloud VM needs at least 2 (virtual) cores. This should be standard for test environments.
- *or* manually uncomment num_engine_workers in the [DEFAULT] section of /etc/heat/heat.conf and restart openstack-heat-engine
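For reference, the stanza involved looks roughly like this in /etc/heat/heat.conf (option name, section, and value are taken from this report; the comment text is illustrative):

```ini
[DEFAULT]
# Number of heat-engine worker processes to fork.
# Ships commented out; uncomment it and restart openstack-heat-engine
# so a single-core undercloud spawns more than one worker:
num_engine_workers = 4
```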

This bz can be targeted at 8.0 so that it can track the upstream bug which will set workers to max(cores, 4)
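The fixed default described above can be sketched in shell. This is not Heat's actual code; `getconf` stands in for however Heat detects the core count, and the variable names are illustrative:

```shell
# Sketch of the upstream default: num_engine_workers = max(cores, 4).
cores=$(getconf _NPROCESSORS_ONLN)   # core count of the undercloud host
if [ "$cores" -lt 4 ]; then
    workers=4        # pinned minimum: a 1-vCPU undercloud no longer deadlocks
else
    workers=$cores   # larger hosts keep one worker per core
fi
echo "num_engine_workers = $workers"
```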

Comment 9 Jaromir Coufal 2015-12-14 23:41:07 UTC
Doc/workaround works for me.

Comment 11 chris alfonso 2015-12-15 08:44:54 UTC
*** Bug 1290950 has been marked as a duplicate of this bug. ***

Comment 12 Jaromir Coufal 2015-12-16 12:46:28 UTC
Can we please verify this impacts only virt environments and that the workaround works? Thanks

Comment 13 Alexander Chuzhoy 2015-12-16 15:17:11 UTC
Was able to successfully re-run the deployment command on a setup with 2 vCPUs.

Comment 14 Alexander Chuzhoy 2015-12-17 17:25:20 UTC
Failed to scale up the overcloud with this error message; attempted to add one compute node on a bare-metal setup.
Environment:
openstack-tripleo-heat-templates-0.8.6-94.el7ost.noarch
instack-undercloud-2.1.2-36.el7ost.noarch


The executed command:
openstack overcloud deploy --templates --ntp-server 10.5.26.10 --timeout 90  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml  -e network-environment.yaml  --control-scale 3 --ceph-storage-scale 3 --compute-scale 2 --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph

Comment 15 Alexander Chuzhoy 2015-12-17 19:58:54 UTC
Workaround that did the trick:
1. edit the file /etc/heat/heat.conf on the undercloud and uncomment the line:
#num_engine_workers = 4

2. restart openstack-heat-engine

3. re-run the scale up command.
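The edit in step 1 can be scripted. This sketch applies the sed expression to a scratch copy so it can run anywhere; on the undercloud you would target /etc/heat/heat.conf directly and then restart openstack-heat-engine as in step 2:

```shell
# Demonstrate uncommenting num_engine_workers on a scratch copy of the config.
conf=$(mktemp)
printf '[DEFAULT]\n#num_engine_workers = 4\n' > "$conf"
# Strip the leading '#' from the num_engine_workers line in place:
sed -i 's/^#\(num_engine_workers\)/\1/' "$conf"
grep '^num_engine_workers' "$conf"   # prints: num_engine_workers = 4
rm -f "$conf"
```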

Comment 16 Steve Baker 2015-12-17 20:03:02 UTC
OK, I'm going to push ahead with the upstream fix to pin the minimum number of workers to 4.

Comment 17 Dan Yocum 2015-12-28 16:11:35 UTC
I'm trying to gather all the undercloud tuning bits into one BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1288153

Comment 18 Steven Hardy 2016-01-07 11:51:11 UTC
*** Bug 1289648 has been marked as a duplicate of this bug. ***

Comment 19 Jaromir Coufal 2016-01-20 17:28:54 UTC
Steve are you making sure this will be part of 7.0.4?

Comment 21 Zane Bitter 2016-02-08 15:14:46 UTC
*** Bug 1305557 has been marked as a duplicate of this bug. ***

Comment 22 Jack Waterworth 2016-02-08 15:26:18 UTC
My customer is also hitting this issue. This is on a physical machine with 8 cores; the heat workers have already been adjusted to 8.


[stack@blkcclu001 ~]$ heat resource-show overcloud Compute
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Property               | Value                                                                                                                                            |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes             | {                                                                                                                                                |
|                        |   "attributes": null,                                                                                                                            |
|                        |   "refs": null                                                                                                                                   |
|                        | }                                                                                                                                                |
| description            |                                                                                                                                                  |
| links                  | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b/resources/Compute (self)      |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b (stack)                       |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud-Compute-vyutisc7pljo/b4ac65d4-8b33-4337-859a-1a453b8f3034 (nested) |
| logical_resource_id    | Compute                                                                                                                                          |
| physical_resource_id   | b4ac65d4-8b33-4337-859a-1a453b8f3034                                                                                                             |
| required_by            | AllNodesExtraConfig                                                                                                                              |
|                        | ComputeCephDeployment                                                                                                                            |
|                        | allNodesConfig                                                                                                                                   |
|                        | ComputeAllNodesDeployment                                                                                                                        |
|                        | ComputeNodesPostDeployment                                                                                                                       |
|                        | ComputeAllNodesValidationDeployment                                                                                                              |
| resource_name          | Compute                                                                                                                                          |
| resource_status        | UPDATE_FAILED                                                                                                                                    |
| resource_status_reason | resources.Compute: MessagingTimeout: resources[8]: Timed out waiting for a reply to message ID eabc9302615648ab8b29adc361b4bfda                  |
| resource_type          | OS::Heat::ResourceGroup                                                                                                                          |
| updated_time           | 2016-02-05T15:19:19Z                                                                                                                             |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+

From the os-collect-config logs:

Feb 04 22:05:59 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:05:59.057 4736 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Feb 04 22:05:59 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:05:59.057 4736 WARNING os-collect-config [-] Source [ec2] Unavailable.
Feb 04 22:06:02 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:06:02.063 4736 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Feb 04 22:06:02 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:06:02.063 4736 WARNING os-collect-config [-] Source [cfn] Unavailable.

The customer is trying to deploy 7.2.

Comment 26 Alexander Chuzhoy 2016-02-10 19:09:56 UTC
Verified:
Environment:
openstack-tripleo-heat-templates-0.8.6-117.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch
openstack-heat-common-2015.1.2-8.el7ost.noarch


Successfully deployed overcloud with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 1   --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server x.x.x.x --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e ~/ssl-heat-templates/environments/enable-tls.yaml -e ~/ssl-heat-templates/environments/inject-trust-anchor.yaml

Populated the overcloud with objects.

Reran the deployment command - completed successfully.

Comment 28 errata-xmlrpc 2016-02-18 16:42:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0266.html

Comment 29 Zane Bitter 2016-03-22 22:20:38 UTC
*** Bug 1263345 has been marked as a duplicate of this bug. ***