Bug 1471988 - Scaling up and down ends with UPDATE_FAILED Heat status
Summary: Scaling up and down ends with UPDATE_FAILED Heat status
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Alex Schultz
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks: 1487396
 
Reported: 2017-07-17 21:08 UTC by Richard Su
Modified: 2018-06-27 20:08 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1487396 (view as bug list)
Environment:
Last Closed: 2017-08-30 14:31:35 UTC
Target Upstream Version:
Embargoed:


Attachments
stdout from openstack overcloud deploy -e ~/environment.yaml --templates (82.75 KB, text/plain)
2017-07-17 21:08 UTC, Richard Su
heat, nova, ironic, and mistral service logs (1.32 MB, application/x-gzip)
2017-07-17 21:09 UTC, Richard Su
sosreport from undercloud node (13.67 MB, application/x-xz)
2017-07-28 19:45 UTC, Richard Su
full stack failures list (110.27 KB, text/plain)
2017-07-28 19:51 UTC, Richard Su

Description Richard Su 2017-07-17 21:08:34 UTC
Created attachment 1300111 [details]
stdout from openstack overcloud deploy -e ~/environment.yaml --templates

Description of problem:
With the Ocata release, scaling compute nodes up (adding nodes) or down (deleting nodes) ends with UPDATE_FAILED as the Heat stack status. Although UPDATE_FAILED is reported, the new node is actually brought up on scale-up, and the node is deleted successfully on scale-down.

This problem doesn't happen in Newton.

Version-Release number of selected component (if applicable):
openstack-tripleo-common-6.1.0-2.el7ost.noarch
openstack-heat-api-8.0.2-2.el7ost.noarch
openstack-heat-api-cfn-8.0.2-2.el7ost.noarch
openstack-heat-common-8.0.2-2.el7ost.noarch
openstack-heat-engine-8.0.2-2.el7ost.noarch
openstack-tripleo-heat-templates-6.1.0-1.el7ost.noarch
openstack-mistral-api-4.0.2-1.el7ost.noarch
openstack-mistral-common-4.0.2-1.el7ost.noarch
openstack-mistral-engine-4.0.2-1.el7ost.noarch
openstack-mistral-executor-4.0.2-1.el7ost.noarch
openstack-nova-api-15.0.6-3.el7ost.noarch
openstack-nova-cert-15.0.6-3.el7ost.noarch
openstack-nova-common-15.0.6-3.el7ost.noarch
openstack-nova-compute-15.0.6-3.el7ost.noarch
openstack-nova-conductor-15.0.6-3.el7ost.noarch
openstack-nova-placement-api-15.0.6-3.el7ost.noarch
openstack-nova-scheduler-15.0.6-3.el7ost.noarch


How reproducible:
Always

Steps to Reproduce:
1. Deploy the undercloud and overcloud using the instack-virt-setup environment. 1 controller and 1 compute node are brought up in the overcloud.
2. Scale up overcloud using:
[stack@instack ~]$ cat environment.yml 
parameter_defaults:
  ComputeCount: 2

openstack overcloud deploy -e ~/environment.yml --templates
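
For the scale-down half mentioned in the summary, a compute node is typically removed with the TripleO node delete command. A minimal sketch, assuming the same templates and environment file; the Nova server UUID of the node being deleted is a hypothetical placeholder:

openstack overcloud node delete --stack overcloud --templates -e ~/environment.yml <nova-server-uuid>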

Actual results:
A new compute node is brought up, but the CLI command and the Heat stack end with UPDATE_FAILED status.


Expected results:
The new compute node is brought up and the stack update completes successfully (UPDATE_COMPLETE).


Additional info:
See attachments for stdout and openstack service logs.

Comment 1 Richard Su 2017-07-17 21:09:53 UTC
Created attachment 1300112 [details]
heat, nova, ironic, and mistral service logs

Comment 2 Carlos Camacho 2017-07-24 09:14:07 UTC
Hey folks, is this related to scaling itself or to a problem when doing upgrades/updates? It seems to be a stack update problem when scaling up.

Comment 3 Alex Schultz 2017-07-24 14:56:34 UTC
It looks like a puppet error, but we would need the logs from the host where the error occurred. Please provide a sosreport from the controller it failed on. Alternatively, the information may show up in the output of 'openstack stack failures list overcloud'.

Comment 4 Richard Su 2017-07-28 19:40:19 UTC
@Carlos, the issue occurs during scale up.

Comment 5 Richard Su 2017-07-28 19:41:12 UTC
@Alex, here is the stack failures list:

[stack@instack ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 6ffcbd26-d156-4698-9670-1e47daec0717
  status: UPDATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]: Dependency Package[swift-account] has failures: true
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]: Dependency Package[swift-container] has failures: true
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]: Dependency Package[swift-object] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]: Dependency Package[swift-account] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]: Dependency Package[swift-container] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]: Dependency Package[swift-object] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/ip6tables]: Dependency Package[swift-account] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/ip6tables]: Dependency Package[swift-container] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/ip6tables]: Dependency Package[swift-object] has failures: true
    Notice: Applied catalog in 75.06 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    Warning: /Stage[main]/Gnocchi::Deps/Anchor[gnocchi::service::begin]: Skipping because of failed dependencies
    Warning: /Stage[main]/Gnocchi::Api/Service[gnocchi-api]: Skipping because of failed dependencies
    Warning: /Stage[main]/Apache::Service/Service[httpd]: Skipping because of failed dependencies
    Warning: /Stage[main]/Keystone::Deps/Anchor[keystone::service::end]: Skipping because of failed dependencies
    Warning: /Stage[main]/Gnocchi::Deps/Anchor[gnocchi::service::end]: Skipping because of failed dependencies
    Warning: /Stage[main]/Tripleo::Firewall::Post/Firewall[998 log all]: Skipping because of failed dependencies
    Warning: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv4]: Skipping because of failed dependencies
    Warning: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]: Skipping because of failed dependencies
    Warning: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]: Skipping because of failed dependencies
    Warning: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/ip6tables]: Skipping because of failed dependencies
    (truncated, view all with --long)
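
As the truncation notices say, the untruncated output can be viewed by adding the --long flag; the full failures list is also attached below in comment 7:

openstack stack failures list overcloud --long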

Comment 6 Richard Su 2017-07-28 19:45:38 UTC
Created attachment 1306024 [details]
sosreport from undercloud node

Comment 7 Richard Su 2017-07-28 19:51:18 UTC
Created attachment 1306025 [details]
full stack failures list

Looks like an out-of-memory issue. I have these memory settings in my dev environment.

export UNDERCLOUD_NODE_MEM=12288
export NODE_MEM=8192

Comment 8 Alex Schultz 2017-07-28 22:35:48 UTC
Correct, the error is not enough memory. There are not enough resources on the node being deployed. Is there actually 8G of memory available on your overcloud nodes?

    Error: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy/Pacemaker::Resource::Service[haproxy]/Pacemaker::Resource::Systemd[haproxy]/Pcmk_resource[haproxy]: Could not evaluate: Cannot allocate memory - /usr/sbin/pcs
    Error: /Stage[main]/Swift::Storage::Account/Swift::Storage::Generic[account]/Package[swift-account]: Could not evaluate: Cannot allocate memory - fork(2)
    Error: /Stage[main]/Swift::Storage::Container/Swift::Storage::Generic[container]/Package[swift-container]: Could not evaluate: Cannot allocate memory - fork(2)
    Error: /Stage[main]/Swift::Storage::Object/Swift::Storage::Generic[object]/Package[swift-object]: Could not evaluate: Cannot allocate memory - fork(2)
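
A quick way to confirm how much memory is actually free on the failing controller is to run free from the undercloud; a sketch, assuming the default TripleO heat-admin user and a hypothetical controller IP:

ssh heat-admin@192.168.24.15 free -m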

Additionally, which services are you deploying? That can also affect the memory footprint.

If this is a development environment, there are two basic options available to you right now (a sketch of the resulting deploy command follows the list):
You may try enabling swap via -e /usr/share/openstack-tripleo-heat-templates/environments/enable-swap.yaml
You may try using -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml
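
For reference, a sketch of what the redeploy would look like with one of these environments appended, reusing the environment.yml from the reproduction steps:

openstack overcloud deploy --templates \
  -e ~/environment.yml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml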

Comment 9 Alex Schultz 2017-08-30 14:31:35 UTC
Closing due to lack of updates. Feel free to reopen if this continues to be a problem after enabling swap or using low-memory-usage.yaml.

