Bug 1290949
Summary: | rhel-osp-director: re-ran the deployment command: "Stack failed with status: resources.Controller: MessagingTimeout: resources[0]: Timed out waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR: openstack Heat Stack update failed." | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> | ||||
Component: | openstack-heat | Assignee: | Steve Baker <sbaker> | ||||
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha> | ||||
Severity: | unspecified | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 7.0 (Kilo) | CC: | dyocum, hbrock, jcoufal, jwaterwo, mburns, opavlenk, rhel-osp-director-maint, rybrown, sasha, sbaker, shardy, sputhenp, ssainkar, yeylon, zbitter | ||||
Target Milestone: | z4 | Keywords: | Triaged, ZStream | ||||
Target Release: | 7.0 (Kilo) | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | openstack-heat-2015.1.2-8.el7ost | Doc Type: | Known Issue | ||||
Doc Text: |
By default, the number of heat-engine workers created matches the number of cores on the undercloud. Previously, an undercloud with only one core therefore ran only one heat-engine worker, which is not enough to launch an overcloud stack and caused deadlocks during stack creation.
To avoid this, the undercloud should have at least two (virtual) cores. For virtual deployments this means two vCPUs, regardless of the number of cores on the bare-metal host. If this is not possible, uncomment the num_engine_workers line in /etc/heat/heat.conf and restart openstack-heat-engine. Either workaround resolves the issue.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2016-02-18 16:42:01 UTC | Type: | Bug | ||||
Bug Depends On: | 1305947 | ||||||
Bug Blocks: | 1275439 | ||||||
Attachments: |
|
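The Doc Text above recommends at least two (virtual) cores on the undercloud. A minimal sketch of that check follows. The helper name `has_enough_cores` is hypothetical and is not part of heat or the director tooling; `nproc` is the GNU coreutils command that prints the number of available processing units.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of heat or director): succeeds only when the
# given core count meets the two-core minimum recommended in the Doc Text.
has_enough_cores() {
    local cores="$1"
    [ "$cores" -ge 2 ]
}

# nproc prints the number of processing units available to this process.
cores="$(nproc)"
if has_enough_cores "$cores"; then
    echo "OK: $cores cores available"
else
    echo "WARNING: only $cores core(s); heat-engine would default to a single worker"
fi
```

Run on the undercloud itself, this reports whether the default worker count would fall below the recommended minimum.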
Description
Alexander Chuzhoy
2015-12-12 01:16:47 UTC
Created attachment 1104929 [details]
logs from one controller.
Environment: instack-undercloud-2.1.2-36.el7ost.noarch openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch openstack-dashboard-theme-2015.1.2-4.el7ost.noarch openstack-nova-common-2015.1.2-7.el7ost.noarch openstack-ceilometer-alarm-2015.1.2-1.el7ost.noarch openstack-neutron-ml2-2015.1.2-3.el7ost.noarch openstack-swift-proxy-2.3.0-2.el7ost.noarch openstack-neutron-bigswitch-lldp-2015.1.38-1.el7ost.noarch openstack-swift-plugin-swift3-1.7-3.el7ost.noarch openstack-ceilometer-collector-2015.1.2-1.el7ost.noarch openstack-nova-scheduler-2015.1.2-7.el7ost.noarch openstack-keystone-2015.1.2-2.el7ost.noarch openstack-neutron-lbaas-2015.1.2-1.el7ost.noarch openstack-neutron-2015.1.2-3.el7ost.noarch openstack-nova-compute-2015.1.2-7.el7ost.noarch openstack-ceilometer-central-2015.1.2-1.el7ost.noarch openstack-nova-conductor-2015.1.2-7.el7ost.noarch openstack-nova-cert-2015.1.2-7.el7ost.noarch openstack-heat-engine-2015.1.2-4.el7ost.noarch openstack-glance-2015.1.2-1.el7ost.noarch openstack-swift-account-2.3.0-2.el7ost.noarch openstack-selinux-0.6.46-1.el7ost.noarch openstack-neutron-common-2015.1.2-3.el7ost.noarch openstack-ceilometer-common-2015.1.2-1.el7ost.noarch openstack-ceilometer-api-2015.1.2-1.el7ost.noarch openstack-nova-console-2015.1.2-7.el7ost.noarch openstack-nova-novncproxy-2015.1.2-7.el7ost.noarch openstack-heat-api-cloudwatch-2015.1.2-4.el7ost.noarch openstack-swift-container-2.3.0-2.el7ost.noarch openstack-neutron-openvswitch-2015.1.2-3.el7ost.noarch openstack-neutron-metering-agent-2015.1.2-3.el7ost.noarch openstack-puppet-modules-2015.1.8-32.el7ost.noarch openstack-swift-2.3.0-2.el7ost.noarch openstack-heat-common-2015.1.2-4.el7ost.noarch openstack-ceilometer-notification-2015.1.2-1.el7ost.noarch openstack-cinder-2015.1.2-5.el7ost.noarch openstack-nova-api-2015.1.2-7.el7ost.noarch openstack-heat-api-cfn-2015.1.2-4.el7ost.noarch openstack-dashboard-2015.1.2-4.el7ost.noarch openstack-ceilometer-compute-2015.1.2-1.el7ost.noarch 
openstack-heat-api-2015.1.2-4.el7ost.noarch openstack-swift-object-2.3.0-2.el7ost.noarch openstack-utils-2014.2-1.el7ost.noarch

(In reply to Alexander Chuzhoy from comment #0)
> rhel-osp-director: re-ran the deployment command: "Stack failed with
> status: resources.Controller: MessagingTimeout: resources[0]: Timed out
> waiting for a reply to message ID 863d0fbc6ce24cd288074d901d1a6e64 ERROR:
> openstack Heat Stack update failed."
>
> Environment:
>
> Steps to reproduce:
> 1. Successfully deploy HA overcloud with network isolation.
> 2. Re-run the same deployment command.
>
> Result:
> Stack failed with status: resources.Controller: MessagingTimeout:
> resources[0]: Timed out waiting for a reply to message ID
> 863d0fbc6ce24cd288074d901d1a6e64
> ERROR: openstack Heat Stack update failed.
>
> heat resource-list -n5 overcloud|grep FAIL
>
> SecurityWarning
> +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
> | resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time         | parent_resource |
> +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
> | Controller    | 541369ef-9c8d-436d-960e-57fd8f36a8ba | OS::Heat::ResourceGroup | UPDATE_FAILED   | 2015-12-12T00:59:39Z |                 |
> | 0             | 2290a771-4206-423f-b002-035208185892 | OS::TripleO::Controller | UPDATE_FAILED   | 2015-12-12T01:00:32Z | Controller      |
> +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
>

I found this in heat-engine.log
2015-12-11 17:48:52.198 10503 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue:

> Some these errors in the logs - not sure if related:
> Dec 11 19:41:33 localhost os-collect-config: 2015-12-11 19:41:33.234 2369
> WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(111,
> 'Connection refused'))
> Dec 11 19:42:35 localhost os-collect-config: 2015-12-11 19:42:35.612 2369
> WARNING os_collect_config.cfn [-] 500 Server Error: InternalFailure

The attached heat-api-cfn.log is empty, so it's hard to tell what that 500 Server Error: InternalFailure is. Also there is a ~6 hour difference between the os-collect-config errors and the UPDATE_FAILED timestamp so it is hard to tell if they are related.

Could you check that rabbitmq is running and healthy? Also can you check that all heat-engine processes are running and are connected to rabbit?

Oh, the attached logs were from a controller node. The error was from heat on the undercloud, so you'll need to attach the undercloud logs.

Sasha, do you have the undercloud logs to attach?

It looks like the root cause of this was that the undercloud only had one vCPU, so heat's default of one worker per core resulted in only a single heat-engine process being spawned. A single heat-engine worker is not enough to launch an overcloud stack.

Workarounds for this issue:
- undercloud vm needs at least 2 (virtual) cores. This needs to be standard for test environments
- *or* manually uncomment /etc/heat/heat.conf [DEFAULT]num_engine_workers and restart openstack-heat-engine

This bz can be targeted at 8.0 so that it can track the upstream bug which will set workers to max(cores, 4)

Doc/workaround works for me.

*** Bug 1290950 has been marked as a duplicate of this bug. ***

Can we please verify this impacts only virt environments and that the workaround works? Thanks

Was able to successfully re-run the deployment command on a setup with 2 vCPUs.
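The workaround described above (uncomment num_engine_workers in /etc/heat/heat.conf, then restart openstack-heat-engine) could be scripted roughly as follows. This is a sketch under stated assumptions, not the fix shipped in the erratum: the function names `uncomment_engine_workers` and `min_engine_workers` are hypothetical, the `sed` call assumes GNU sed, and the second helper only illustrates the max(cores, 4) floor mentioned for the upstream change.

```shell
#!/usr/bin/env bash
# Hypothetical helper: uncomment the num_engine_workers line in a
# heat.conf-style file. Assumes GNU sed (-i edits in place, -E enables ERE).
uncomment_engine_workers() {
    local conf="$1"
    # Strip a leading '#' (and optional whitespace) from that one setting only.
    sed -i -E 's/^#[[:space:]]*(num_engine_workers[[:space:]]*=.*)$/\1/' "$conf"
}

# Hypothetical helper illustrating the max(cores, 4) floor discussed for the
# upstream fix: never report fewer than 4 workers.
min_engine_workers() {
    local cores="$1"
    if [ "$cores" -gt 4 ]; then
        echo "$cores"
    else
        echo 4
    fi
}

# On the undercloud (as root), the manual workaround would then be:
#   uncomment_engine_workers /etc/heat/heat.conf
#   systemctl restart openstack-heat-engine
```

The helpers take their input as arguments so they can be tried against a copy of the configuration file before touching the live one.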
Failed to scale up overcloud with this message: attempted to add one compute on bare metal setup.

Environment:
openstack-tripleo-heat-templates-0.8.6-94.el7ost.noarch
instack-undercloud-2.1.2-36.el7ost.noarch

The executed command:
openstack overcloud deploy --templates --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --control-scale 3 --ceph-storage-scale 3 --compute-scale 2 --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph

W/A that did the trick:
1. edit the file /etc/heat/heat.conf on the undercloud and uncomment the line: #num_engine_workers = 4
2. restart openstack-heat-engine
3. re-run the scale up command.

OK, I'm going to push ahead with the upstream fix to pin the minimum number of workers to 4.

I'm trying to gather all the undercloud tuning bits into one BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1288153

*** Bug 1289648 has been marked as a duplicate of this bug. ***

Steve are you making sure this will be part of 7.0.4?

*** Bug 1305557 has been marked as a duplicate of this bug. ***

my customer is also hitting this issue. this is on a physical machine with 8 cores. the workers in heat have already been adjusted to 8.
[stack@blkcclu001 ~]$ heat resource-show overcloud Compute
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Property               | Value                                                                                                                                            |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes             | {                                                                                                                                                |
|                        |   "attributes": null,                                                                                                                            |
|                        |   "refs": null                                                                                                                                   |
|                        | }                                                                                                                                                |
| description            |                                                                                                                                                  |
| links                  | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b/resources/Compute (self)      |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b (stack)                       |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud-Compute-vyutisc7pljo/b4ac65d4-8b33-4337-859a-1a453b8f3034 (nested) |
| logical_resource_id    | Compute                                                                                                                                          |
| physical_resource_id   | b4ac65d4-8b33-4337-859a-1a453b8f3034                                                                                                             |
| required_by            | AllNodesExtraConfig                                                                                                                              |
|                        | ComputeCephDeployment                                                                                                                            |
|                        | allNodesConfig                                                                                                                                   |
|                        | ComputeAllNodesDeployment                                                                                                                        |
|                        | ComputeNodesPostDeployment                                                                                                                       |
|                        | ComputeAllNodesValidationDeployment                                                                                                              |
| resource_name          | Compute                                                                                                                                          |
| resource_status        | UPDATE_FAILED                                                                                                                                    |
| resource_status_reason | resources.Compute: MessagingTimeout: resources[8]: Timed out waiting for a reply to message ID eabc9302615648ab8b29adc361b4bfda                  |
| resource_type          | OS::Heat::ResourceGroup                                                                                                                          |
| updated_time           | 2016-02-05T15:19:19Z                                                                                                                             |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+

from the os-collect-config logs
Feb 04 22:05:59 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:05:59.057 4736 WARNING os_collect_config.ec2 [-] ('Connection aborted.', error(113, 'No route to host'))
Feb 04 22:05:59 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:05:59.057 4736 WARNING os-collect-config [-] Source [ec2] Unavailable.
Feb 04 22:06:02 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:06:02.063 4736 WARNING os_collect_config.cfn [-] ('Connection aborted.', error(113, 'No route to host'))
Feb 04 22:06:02 blkcclc009.na.blkint.com os-collect-config[4736]: 2016-02-04 22:06:02.063 4736 WARNING os-collect-config [-] Source [cfn] Unavailable.

customer is trying to deploy 7.2

Verified:
Environment:
openstack-tripleo-heat-templates-0.8.6-117.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch
openstack-heat-common-2015.1.2-8.el7ost.noarch

Successfully deployed overcloud with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --ceph-storage-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server x.x.x.x --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e ~/ssl-heat-templates/environments/enable-tls.yaml -e ~/ssl-heat-templates/environments/inject-trust-anchor.yaml

Populated the overcloud with objects.
Reran the deployment command - completed successfully.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHSA-2016-0266.html

*** Bug 1263345 has been marked as a duplicate of this bug. *** |