Bug 1471988 - Scaling up and down ends with UPDATE_FAILED Heat status
Summary: Scaling up and down ends with UPDATE_FAILED Heat status
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Alex Schultz
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks: 1487396
 
Reported: 2017-07-17 21:08 UTC by Richard Su
Modified: 2018-06-27 20:08 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1487396 (view as bug list)
Environment:
Last Closed: 2017-08-30 14:31:35 UTC
Target Upstream Version:
Embargoed:


Attachments
stdout from openstack overcloud deploy -e ~/environment.yaml --templates (82.75 KB, text/plain)
2017-07-17 21:08 UTC, Richard Su
heat, nova, ironic, and mistral service logs (1.32 MB, application/x-gzip)
2017-07-17 21:09 UTC, Richard Su
sosreport from undercloud node (13.67 MB, application/x-xz)
2017-07-28 19:45 UTC, Richard Su
full stack failures list (110.27 KB, text/plain)
2017-07-28 19:51 UTC, Richard Su

Description Richard Su 2017-07-17 21:08:34 UTC
Created attachment 1300111 [details]
stdout from openstack overcloud deploy -e ~/environment.yaml --templates

Description of problem:
With the Ocata release, scaling compute nodes up (adding nodes) or down (deleting nodes) ends with UPDATE_FAILED as the Heat stack status. Although UPDATE_FAILED is reported, the new node is actually brought up on scale-up, and the node is deleted successfully on scale-down.

This problem doesn't happen in Newton.

Version-Release number of selected component (if applicable):
openstack-tripleo-common-6.1.0-2.el7ost.noarch
openstack-heat-api-8.0.2-2.el7ost.noarch
openstack-heat-api-cfn-8.0.2-2.el7ost.noarch
openstack-heat-common-8.0.2-2.el7ost.noarch
openstack-heat-engine-8.0.2-2.el7ost.noarch
openstack-tripleo-heat-templates-6.1.0-1.el7ost.noarch
openstack-mistral-api-4.0.2-1.el7ost.noarch
openstack-mistral-common-4.0.2-1.el7ost.noarch
openstack-mistral-engine-4.0.2-1.el7ost.noarch
openstack-mistral-executor-4.0.2-1.el7ost.noarch
openstack-nova-api-15.0.6-3.el7ost.noarch
openstack-nova-cert-15.0.6-3.el7ost.noarch
openstack-nova-common-15.0.6-3.el7ost.noarch
openstack-nova-compute-15.0.6-3.el7ost.noarch
openstack-nova-conductor-15.0.6-3.el7ost.noarch
openstack-nova-placement-api-15.0.6-3.el7ost.noarch
openstack-nova-scheduler-15.0.6-3.el7ost.noarch


How reproducible:
Always

Steps to Reproduce:
1. Deploy the undercloud and overcloud using the instack-virt-setup environment. 1 controller and 1 compute node are brought up in the overcloud.
2. Scale up overcloud using:
[stack@instack ~]$ cat environment.yml 
parameter_defaults:
  ComputeCount: 2

openstack overcloud deploy -e ~/environment.yml --templates
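
For the scale-down half mentioned in the summary, a compute node is typically removed with the TripleO node delete command. A minimal sketch, assuming the same templates and environment file; the Nova server UUID of the node being deleted is a hypothetical placeholder:

openstack overcloud node delete --stack overcloud --templates -e ~/environment.yml <nova-server-uuid>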

Actual results:
A new compute node is brought up, but the CLI command and the Heat stack end with UPDATE_FAILED status.


Expected results:
The new compute node is brought up and the stack update completes successfully (UPDATE_COMPLETE).


Additional info:
See attachments for stdout and openstack service logs.

Comment 1 Richard Su 2017-07-17 21:09:53 UTC
Created attachment 1300112 [details]
heat, nova, ironic, and mistral service logs

Comment 2 Carlos Camacho 2017-07-24 09:14:07 UTC
Hey folks, is this related to scaling itself or to a problem when doing upgrades/updates? It seems to be a stack update problem when scaling up.

Comment 3 Alex Schultz 2017-07-24 14:56:34 UTC
It looks like a puppet error, but we would need the logs from the host where the error occurred. Please provide a sosreport from the controller it failed on. Alternatively, the information may show up in the output of 'openstack stack failures list overcloud'.

Comment 4 Richard Su 2017-07-28 19:40:19 UTC
@Carlos, the issue occurs during scale up.

Comment 5 Richard Su 2017-07-28 19:41:12 UTC
@Alex, here is the stack failures list:

[stack@instack ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 6ffcbd26-d156-4698-9670-1e47daec0717
  status: UPDATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]: Dependency Package[swift-account] has failures: true
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]: Dependency Package[swift-container] has failures: true
    Notice: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]: Dependency Package[swift-object] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]: Dependency Package[swift-account] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]: Dependency Package[swift-container] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]: Dependency Package[swift-object] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/ip6tables]: Dependency Package[swift-account] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/ip6tables]: Dependency Package[swift-container] has failures: true
    Notice: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/ip6tables]: Dependency Package[swift-object] has failures: true
    Notice: Applied catalog in 75.06 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    Warning: /Stage[main]/Gnocchi::Deps/Anchor[gnocchi::service::begin]: Skipping because of failed dependencies
    Warning: /Stage[main]/Gnocchi::Api/Service[gnocchi-api]: Skipping because of failed dependencies
    Warning: /Stage[main]/Apache::Service/Service[httpd]: Skipping because of failed dependencies
    Warning: /Stage[main]/Keystone::Deps/Anchor[keystone::service::end]: Skipping because of failed dependencies
    Warning: /Stage[main]/Gnocchi::Deps/Anchor[gnocchi::service::end]: Skipping because of failed dependencies
    Warning: /Stage[main]/Tripleo::Firewall::Post/Firewall[998 log all]: Skipping because of failed dependencies
    Warning: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv4]: Skipping because of failed dependencies
    Warning: /Stage[main]/Tripleo::Firewall::Post/Tripleo::Firewall::Rule[999 drop all]/Firewall[999 drop all ipv6]: Skipping because of failed dependencies
    Warning: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]: Skipping because of failed dependencies
    Warning: /Stage[main]/Firewall::Linux::Redhat/File[/etc/sysconfig/ip6tables]: Skipping because of failed dependencies
    (truncated, view all with --long)
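
As the truncation notices say, the untruncated output can be viewed by adding the --long flag; the full failures list is also attached below in comment 7:

openstack stack failures list overcloud --long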

Comment 6 Richard Su 2017-07-28 19:45:38 UTC
Created attachment 1306024 [details]
sosreport from undercloud node

Comment 7 Richard Su 2017-07-28 19:51:18 UTC
Created attachment 1306025 [details]
full stack failures list

Looks like an out-of-memory issue. I have these memory settings in my dev environment.

export UNDERCLOUD_NODE_MEM=12288
export NODE_MEM=8192

Comment 8 Alex Schultz 2017-07-28 22:35:48 UTC
Correct, the error is not enough memory. There are not enough resources on the node being deployed. Is there actually 8G of memory available on your overcloud nodes?

    Error: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy/Pacemaker::Resource::Service[haproxy]/Pacemaker::Resource::Systemd[haproxy]/Pcmk_resource[haproxy]: Could not evaluate: Cannot allocate memory - /usr/sbin/pcs
    Error: /Stage[main]/Swift::Storage::Account/Swift::Storage::Generic[account]/Package[swift-account]: Could not evaluate: Cannot allocate memory - fork(2)
    Error: /Stage[main]/Swift::Storage::Container/Swift::Storage::Generic[container]/Package[swift-container]: Could not evaluate: Cannot allocate memory - fork(2)
    Error: /Stage[main]/Swift::Storage::Object/Swift::Storage::Generic[object]/Package[swift-object]: Could not evaluate: Cannot allocate memory - fork(2)
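
A quick way to confirm how much memory is actually free on the failing controller is to run free from the undercloud; a sketch, assuming the default TripleO heat-admin user and a hypothetical controller IP:

ssh heat-admin@192.168.24.15 free -m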

Additionally, which services are you deploying? That can also affect the memory footprint.

If this is a development environment, there are two basic options available to you right now (a sketch of the resulting deploy command follows the list):
You may try enabling swap via -e /usr/share/openstack-tripleo-heat-templates/environments/enable-swap.yaml
You may try using -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml
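
For reference, a sketch of what the redeploy would look like with one of these environments appended, reusing the environment.yml from the reproduction steps:

openstack overcloud deploy --templates \
  -e ~/environment.yml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml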

Comment 9 Alex Schultz 2017-08-30 14:31:35 UTC
Closing due to lack of updates. Feel free to reopen if this continues to be a problem after enabling swap or using low-memory-usage.yaml.

