Bug 1493263 - tripleo.deployment.overcloudrc workflow fails on clean deployment and during upgrade with: Internal Server Error (HTTP 500)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 12.0 (Pike)
Assignee: Zane Bitter
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-19 18:38 UTC by Marius Cornea
Modified: 2018-02-05 19:15 UTC
CC List: 18 users

Fixed In Version: openstack-heat-9.0.1-0.20170925083033.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 22:10:20 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1719333 0 None None None 2017-09-25 14:53:05 UTC
OpenStack gerrit 507930 0 None MERGED Speed up show-stack with outputs in legacy path 2020-12-27 10:55:51 UTC
OpenStack gerrit 507931 0 None MERGED Use show_output in TemplateResource.get_reference_id() 2020-12-27 10:55:51 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Marius Cornea 2017-09-19 18:38:10 UTC
Description of problem:
OSP11 -> OSP12 upgrade: major-upgrade-composable-steps-docker.yaml fails on a deployment with standalone nodes with: Internal Server Error (HTTP 500)
We can see that the stack itself gets updated, but the Mistral workflow that runs afterwards (I believe the one which creates overcloudrc) fails with an Internal Server Error (HTTP 500):

[...]
2017-09-19 18:11:47Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.CephStoragePostUpgradeConfig_Deployment5]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-09-19 18:11:48Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.CephStoragePostUpgradeConfig_Deployment5]: CREATE_COMPLETE  state changed
2017-09-19 18:11:50Z [0]: SIGNAL_IN_PROGRESS  Signal: deployment cf2c2862-3b85-4758-911f-b94097c93ff1 succeeded
2017-09-19 18:11:51Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.ControllerPostUpgradeConfig_Deployment5.2]: SIGNAL_IN_PROGRESS  Signal: deployment 1a84fb43-101f-42ae-9b96-bec3398742d4 succeeded
2017-09-19 18:11:51Z [0]: CREATE_COMPLETE  state changed
2017-09-19 18:11:51Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.NetworkerPostUpgradeConfig_Deployment5]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-09-19 18:11:51Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.ControllerPostUpgradeConfig_Deployment5.2]: CREATE_COMPLETE  state changed
2017-09-19 18:11:51Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.ControllerPostUpgradeConfig_Deployment5]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-09-19 18:11:52Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.NetworkerPostUpgradeConfig_Deployment5]: CREATE_COMPLETE  state changed
2017-09-19 18:11:52Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.ControllerPostUpgradeConfig_Deployment5]: CREATE_COMPLETE  state changed
2017-09-19 18:12:00Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4]: UPDATE_COMPLETE  Stack UPDATE completed successfully
2017-09-19 18:12:00Z [AllNodesDeploySteps]: UPDATE_COMPLETE  state changed
2017-09-19 18:12:19Z [overcloud]: UPDATE_COMPLETE  Stack UPDATE completed successfully

 Stack overcloud UPDATE_COMPLETE 

Internal Server Error (HTTP 500)


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170913050523.0rc2.el7ost.noarch
python-mistral-lib-0.2.0-0.20170821165722.bb1b87b.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistral-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistralclient-3.1.3-0.20170913011357.c33d39c.el7ost.noarch
puppet-mistral-11.3.1-0.20170825184651.cf2e493.el7ost.noarch
openstack-mistral-common-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-api-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-executor-5.1.1-0.20170909041831.a8e648c.el7ost.noarch

How reproducible:
100% on deployments with standalone nodes

Steps to Reproduce:
1. Deploy OSP11 with 3 controllers, 3 database, 3 messaging, 2 networker, 2 computes, 3 ceph nodes

2. Upgrade to OSP12:

#!/bin/bash

timeout 100m openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates \
  -r /home/stack/composable_roles/roles/roles_data.yaml \
  -e /home/stack/composable_roles/roles/nodes.yaml \
  -e /home/stack/composable_roles/roles/network-config.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/composable_roles/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/composable_roles/network/network-environment.yaml \
  -e /home/stack/composable_roles/enable-tls.yaml \
  -e /home/stack/composable_roles/inject-trust-anchor.yaml \
  -e /home/stack/composable_roles/public_vip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
  -e /home/stack/composable_roles/debug.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
  -e /home/stack/ceph-ansible-env.yaml \
  -e /home/stack/docker-osp12.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml


Actual results:
The stack gets updated, but the Mistral workflow at the end fails with Internal Server Error (HTTP 500).

Expected results:
The upgrade completes and overcloudrc is generated without error.

Additional info:
Checking /var/log/mistral/engine.log we see the following traceback:

2017-09-19 14:14:36.485 15678 INFO mistral.engine.engine_server [req-accc913a-07e8-4cbc-8e7a-9eaafb8d97d7 334c5c67b5944522889fea53c8ccc901 eb6d19788a5346a691da6aea743fb13d - default default] Received RPC request 'start_action'[name=tripleo.deployment.overcloudrc, input={no_proxy: , container: overcloud}, description=None, params={u'save_result': True, u'run_sync': True}]
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server [req-accc913a-07e8-4cbc-8e7a-9eaafb8d97d7 334c5c67b5944522889fea53c8ccc901 eb6d19788a5346a691da6aea743fb13d - default default] Exception during message handling: MessagingTimeout: Timed out waiting for a reply to message ID 8cb9986d7c8346069c58cf87c5f23447
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/engine_server.py", line 135, in start_action
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     **params
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/action_queue.py", line 74, in decorate
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     res = func(*args, **kw)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 78, in start_action
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     output = action.run(action_input, target, save=False)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/actions.py", line 286, in run
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     async_=False
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/rpc/clients.py", line 331, in run_action
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     return rpc_client_method(auth_ctx.ctx(), 'run_action', **rpc_kwargs)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     **kwargs
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     retry=self.retry)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     timeout=timeout, retry=retry)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     retry=retry)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     result = self._waiter.wait(msg_id, timeout)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     message = self.waiters.get(msg_id, timeout=timeout)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     'to message ID %s' % msg_id)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server MessagingTimeout: Timed out waiting for a reply to message ID 8cb9986d7c8346069c58cf87c5f23447
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server 
2017-09-19 14:15:39.681 15678 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 8cb9986d7c8346069c58cf87c5f23447
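
Note: the engine gives up exactly 60 seconds after receiving the start_action request (14:14:36 -> 14:15:36), which matches the default oslo.messaging rpc_response_timeout of 60s. As a diagnostic only, raising that timeout should show whether the action merely runs long rather than hanging. A hedged sketch, assuming crudini is available and the el7ost service names; this is not proposed as a fix:

sudo crudini --set /etc/mistral/mistral.conf DEFAULT rpc_response_timeout 300
sudo systemctl restart openstack-mistral-api openstack-mistral-engine openstack-mistral-executor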

Comment 2 Marius Cornea 2017-09-19 19:27:04 UTC
Trace with debug enabled:

REQ: curl -g -i -X POST https://192.168.24.2:13989/v2/action_executions -H "User-Agent: osc-lib/1.7.0 keystoneauth1/3.1.0 python-requests/2.11.1 CPython/2.7.5" -H "content-type: application/json" -H "X-Auth-Token: {SHA1}87997e684e702a66de0ea16cba6c0aed9305b348" -d '{"input": "{\"no_proxy\": \"\", \"container\": \"overcloud\"}", "params": "{\"save_result\": true, \"run_sync\": true}", "name": "tripleo.deployment.overcloudrc"}'
Resetting dropped connection: 192.168.24.2
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
"POST /v2/action_executions HTTP/1.1" 500 155
RESP: [500] Content-Length: 155 Content-Type: application/json Date: Tue, 19 Sep 2017 19:26:27 GMT Connection: keep-alive 
RESP BODY: {"debuginfo": null, "faultcode": "Server", "faultstring": "MessagingTimeout: Timed out waiting for a reply to message ID af1fed7d76dd4b168a8da568f72ff7f5"}

Request returned failure status: 500
Internal Server Error (HTTP 500)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 137, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 924, in take_action
    no_proxy=parsed_args.no_proxy)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 96, in overcloudrc
    **input_)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/base.py", line 25, in call_action
    save_result=True, run_sync=True)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/v2/action_executions.py", line 43, in create
    json.dumps(data)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/httpclient.py", line 52, in decorator
    resp = func(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/httpclient.py", line 123, in post
    data=body, **options)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 853, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/site-packages/osc_lib/session.py", line 40, in request
    resp = super(TimingSession, self).request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 101, in inner
    return wrapped(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 742, in request
    raise exceptions.from_response(resp, method, url)
InternalServerError: Internal Server Error (HTTP 500)
clean_up DeployOvercloud: Internal Server Error (HTTP 500)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 134, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 279, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 169, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 137, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 924, in take_action
    no_proxy=parsed_args.no_proxy)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 96, in overcloudrc
    **input_)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/base.py", line 25, in call_action
    save_result=True, run_sync=True)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/v2/action_executions.py", line 43, in create
    json.dumps(data)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/httpclient.py", line 52, in decorator
    resp = func(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/httpclient.py", line 123, in post
    data=body, **options)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 853, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/site-packages/osc_lib/session.py", line 40, in request
    resp = super(TimingSession, self).request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 101, in inner
    return wrapped(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 742, in request
    raise exceptions.from_response(resp, method, url)
InternalServerError: Internal Server Error (HTTP 500)

END return value: 1

Comment 3 Marius Cornea 2017-09-19 19:32:35 UTC
Running the curl command manually, we can see that the server returns a 500 error:


curl -g -i -X POST https://192.168.24.2:13989/v2/action_executions -H "User-Agent: osc-lib/1.7.0 keystoneauth1/3.1.0 python-requests/2.11.1 CPython/2.7.5" -H "content-type: application/json" -H "X-Auth-Token: gAAAAABZwW-akpwj0GZ_9NIhDRBi3Ww7_qEMGZbodU0gxfg2b_URCGBcfowFl5s8ruE8Pedeq4s0c422UGS56Yw4Hv-QQT1CSydvx5qqWXnEkE2_cl6du2NbLaVwNnbDsmtqvI-HV7Yo895VQWBhddSGDoaJWnEdJJeECsnEkXK0UitPI4npwD0" -d '{"input": "{\"no_proxy\": \"\", \"container\": \"overcloud\"}", "params": "{\"save_result\": true, \"run_sync\": true}", "name": "tripleo.deployment.overcloudrc"}'
HTTP/1.1 500 Internal Server Error
Content-Length: 155
Content-Type: application/json
Date: Tue, 19 Sep 2017 19:28:59 GMT

{"debuginfo": null, "faultcode": "Server", "faultstring": "MessagingTimeout: Timed out waiting for a reply to message ID 741f869995a344b988b5f0c25fe77af0"}


In /var/log/mistral/api.log:

2017-09-19 15:28:59.849 15773 ERROR wsme.api [req-ac2f9e10-0601-4613-bdb1-b4361aa9e0d8 334c5c67b5944522889fea53c8ccc901 eb6d19788a5346a691da6aea743fb13d - default default] Server-side error: "MessagingTimeout: Timed out waiting for a reply to message ID 741f869995a344b988b5f0c25fe77af0". Detail: 
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib/python2.7/site-packages/mistral/api/controllers/v2/action_execution.py", line 166, in post
    **params

  File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 109, in decorator
    _wrap_exception_and_reraise(e)

  File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 84, in _wrap_exception_and_reraise
    raise exc.MistralException(message)

MistralException: MessagingTimeout: Timed out waiting for a reply to message ID 741f869995a344b988b5f0c25fe77af0
: MistralException: MessagingTimeout: Timed out waiting for a reply to message ID 741f869995a344b988b5f0c25fe77af0

Comment 4 Omri Hochman 2017-09-21 15:48:15 UTC
Reproduced on my env, but with a **clean deployment** of OSP12 puddle 2017-09-20.1.

 u'container': u'overcloud'}', output='Unset', created_at='Unset', updated_at='Unset', params='{u'save_result': True, u'run_sync': True}']]
2017-09-02 09:28:59.822 6703 ERROR wsme.api [req-fc948c8d-5cc0-44a6-b09c-b9059970eaf2 01de881368404747819390895f034903 64c497d4a44e47b58b0650e6ce7a6000 - default default] Serv
er-side error: "MessagingTimeout: Timed out waiting for a reply to message ID 3209824abd514b48a07f65d62d5db68a". Detail:
Traceback (most recent call last):

My case was a bare-metal environment. Adding Test-Blocker to the BZ.

Comment 5 Dougal Matthews 2017-09-21 16:08:48 UTC
I spotted this error when looking into the env. I am not sure if it is related, but I wanted to log it somewhere handy as I've not seen it before.


2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor Traceback (most recent call last):
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 109, in run_action
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor     result = action.run(context.ctx())
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 409, in run
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor     log_errors=processutils.LogErrors.ALL)
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 400, in execute
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor     cmd=sanitized_cmd)
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor ProcessExecutionError: Unexpected error while running command.

Comment 6 Steven Hardy 2017-09-22 14:52:09 UTC
Note that as a temporary workaround it should be possible to create the overcloudrc manually, e.g.:

(undercloud) [stack@undercloud ~]$ cat overcloudrc
# Clear any old environment that may conflict.
for key in $( set | awk '{FS="="}  /^OS_/ {print $1}' ); do unset $key ; done
export OS_USERNAME=admin
export OS_BAREMETAL_API_VERSION=1.34
export NOVA_VERSION=1.1
export OS_PROJECT_NAME=admin
export OS_PASSWORD=theadminpassword
export OS_NO_CACHE=True
export COMPUTE_API_VERSION=1.1
export no_proxy=,10.0.0.9,192.168.24.9
export OS_VOLUME_API_VERSION=3
export OS_CLOUDNAME=overcloud
export OS_AUTH_URL=http://10.0.0.9:5000/v2.0
export IRONIC_API_VERSION=1.34
export OS_IMAGE_API_VERSION=2
export OS_AUTH_TYPE=password
export PYTHONWARNINGS="ignore:Certificate has no, ignore:A true SSLContext object is not available"

# Add OS_CLOUDNAME to PS1
if [ -z "${CLOUDPROMPT_ENABLED:-}" ]; then
    export PS1=${PS1:-""}
    export PS1=\${OS_CLOUDNAME:+"(\$OS_CLOUDNAME)"}\ $PS1
    export CLOUDPROMPT_ENABLED=1
fi

(undercloud) [stack@undercloud ~]$ openstack stack environment show overcloud | grep "AdminPassword:"
  AdminPassword: theadminpassword


(undercloud) [stack@undercloud ~]$ openstack stack output show overcloud KeystoneURL
+--------------+----------------------------------------+
| Field        | Value                                  |
+--------------+----------------------------------------+
| description  | URL for the Overcloud Keystone service |
| output_key   | KeystoneURL                            |
| output_value | http://10.0.0.9:5000/v2.0              |
+--------------+----------------------------------------+

(undercloud) [stack@undercloud ~]$ openstack stack output show overcloud VipMap
+--------------+------------------------------------------------------------------------+
| Field        | Value                                                                  |
+--------------+------------------------------------------------------------------------+
| description  | Mapping of each network to VIP addresses. Also includes the Redis VIP. |
| output_key   | VipMap                                                                 |
| output_value | {                                                                      |
|              |   "storage": "172.16.1.8",                                             |
|              |   "management_uri": "",                                                |
|              |   "internal_api_subnet": "",                                           |
|              |   "ctlplane": "192.168.24.9",                                          |
|              |   "storage_uri": "172.16.1.8",                                         |
|              |   "management": "",                                                    |
|              |   "management_subnet": "",                                             |
|              |   "redis": "172.16.2.13",                                              |
|              |   "storage_subnet": "",                                                |
|              |   "storage_mgmt_uri": "172.16.3.12",                                   |
|              |   "tenant_uri": "",                                                    |
|              |   "external": "10.0.0.9",                                              |
|              |   "storage_mgmt": "172.16.3.12",                                       |
|              |   "tenant": "",                                                        |
|              |   "tenant_subnet": "",                                                 |
|              |   "ctlplane_uri": "192.168.24.9",                                      |
|              |   "external_subnet": "",                                               |
|              |   "storage_mgmt_subnet": "",                                           |
|              |   "internal_api": "172.16.2.10",                                       |
|              |   "internal_api_uri": "172.16.2.10",                                   |
|              |   "external_uri": "10.0.0.9",                                          |
|              |   "ctlplane_subnet": "192.168.24.9/24"                                 |
|              | }                                                                      |


As you can see, we just need the AdminPassword, the Keystone endpoint, and the ctlplane/external VIP IPs for no_proxy.
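
Putting those three lookups together, a minimal sketch for scripting the manual overcloudrc (untested; assumes the stack is named "overcloud" and that the awk/JSON parsing matches the output formats shown above):

#!/bin/bash
# Hedged sketch of a manual overcloudrc generator based on the commands above.
ADMIN_PASSWORD=$(openstack stack environment show overcloud | awk '/AdminPassword:/ {print $2}')
KEYSTONE_URL=$(openstack stack output show overcloud KeystoneURL -f value -c output_value)
VIP_JSON=$(openstack stack output show overcloud VipMap -f json -c output_value)
EXTERNAL_VIP=$(echo "$VIP_JSON" | python -c 'import json,sys; print(json.load(sys.stdin)["output_value"]["external"])')
CTLPLANE_VIP=$(echo "$VIP_JSON" | python -c 'import json,sys; print(json.load(sys.stdin)["output_value"]["ctlplane"])')

cat > ~/overcloudrc <<EOF
# Clear any old environment that may conflict.
for key in \$( set | awk '{FS="="}  /^OS_/ {print \$1}' ); do unset \$key ; done
export OS_USERNAME=admin
export OS_PROJECT_NAME=admin
export OS_PASSWORD=$ADMIN_PASSWORD
export OS_AUTH_URL=$KEYSTONE_URL
export OS_AUTH_TYPE=password
export OS_CLOUDNAME=overcloud
export OS_NO_CACHE=True
export no_proxy=,$EXTERNAL_VIP,$CTLPLANE_VIP
EOF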

Comment 7 Jose Luis Franco 2017-09-22 15:18:12 UTC
From what I could observe on both systems where the issue reproduced (after a clean deployment and after an upgrade), the action called to generate the overcloudrc times out because the stack output calls take too long.

On both, a simple "openstack stack show overcloud" took almost 2 minutes:

(undercloud) [root@undercloud74 ~]# time openstack stack show overcloud
.....
real	1m48.796s
user	0m4.857s
sys	0m0.560s

(undercloud) [stack@undercloud ~]$ time openstack stack show overcloud
.....
real	1m54.717s
user	0m7.980s
sys	0m1.864s
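
If the python-heatclient on the undercloud is new enough to support the flag (an assumption for this build), comparing against --no-resolve-outputs should confirm that output resolution is the slow path:

(undercloud) [stack@undercloud ~]$ time openstack stack show overcloud --no-resolve-outputs >/dev/null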

Comment 8 Jose Luis Franco 2017-09-22 15:20:06 UTC
Heat and mistral versions in the env:

(undercloud) [root@undercloud74 ~]# rpm -qa | grep heat
python-heatclient-1.11.0-0.20170814165725.804ab3e.el7ost.noarch
openstack-heat-engine-9.0.1-0.20170911115334.0c64134.el7ost.noarch
openstack-heat-common-9.0.1-0.20170911115334.0c64134.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-cfn-9.0.1-0.20170911115334.0c64134.el7ost.noarch
python-heat-agent-1.4.1-0.20170824121708.ff31eb2.el7ost.noarch
puppet-heat-11.3.1-0.20170825182847.1700009.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170913050524.0rc2.el7ost.noarch
openstack-heat-api-9.0.1-0.20170911115334.0c64134.el7ost.noarch

(undercloud) [root@undercloud74 ~]# rpm -qa | grep mistral
python-mistral-lib-0.2.0-0.20170821165722.bb1b87b.el7ost.noarch
python-mistral-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-api-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistralclient-3.1.3-0.20170913011357.c33d39c.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
puppet-mistral-11.3.1-0.20170825184651.cf2e493.el7ost.noarch
openstack-mistral-executor-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-common-5.1.1-0.20170909041831.a8e648c.el7ost.noarch

Comment 9 Steven Hardy 2017-09-25 11:27:47 UTC
Ok so I did some comparative testing with a small script that times the heat output-show command:

(undercloud) [stack@undercloud74 shardy_test]$ bash shardy_time_outputs.sh | tee shardy_time_outputs.out
DeployedServerEnvironment - 0m58.723s
EnabledServices - 0m54.666s
EndpointMap - 0m55.209s
HostsEntry - 0m57.220s
KeystoneAdminVip - 0m57.462s
KeystoneURL - 0m56.010s
ManagedEndpoints - 0m57.980s
RoleConfig - 0m58.881s
RoleData - 0m56.486s
RoleNetHostnameMap - 0m58.121s
RoleNetIpMap - 0m56.799s
ServerIdData - 0m59.442s
ServerOsCollectConfigData - 0m59.973s
VipMap - 1m0.919s

This is using the environment from comment #8.

With my local dev env it's much faster (a VM setup running on a desktop machine, though with only two nodes deployed vs. the 7 in the undercloud74 environment):

(undercloud) [stack@undercloud shardy_test]$ bash time_outputs.sh | tee time_outputs.out
DeployedServerEnvironment - 0m11.750s
EnabledServices - 0m11.678s
EndpointMap - 0m11.882s
HostsEntry - 0m11.770s
KeystoneAdminVip - 0m11.870s
KeystoneURL - 0m11.567s
ManagedEndpoints - 0m12.028s
RoleConfig - 0m11.722s
RoleData - 0m12.177s
RoleNetHostnameMap - 0m11.876s
RoleNetIpMap - 0m11.556s
ServerIdData - 0m11.631s
ServerOsCollectConfigData - 0m12.219s
VipMap - 0m12.014s

I think the timings should be more comparable than this, so I wonder if we're missing a heat patch on stable/pike, but I've not spotted any relevant ones yet.

This is the script for anyone who wants to replicate this test:

(undercloud) [stack@undercloud shardy_test]$ cat time_outputs.sh 
#!/bin/bash
# Time 'heat output-show' for each top-level output of the overcloud stack.

OUTPUTS="DeployedServerEnvironment \
EnabledServices \
EndpointMap \
HostsEntry \
KeystoneAdminVip \
KeystoneURL \
ManagedEndpoints \
RoleConfig \
RoleData \
RoleNetHostnameMap \
RoleNetIpMap \
ServerIdData \
ServerOsCollectConfigData \
VipMap"

for o in $OUTPUTS;
do
  echo -n "$o - "
  { time heat output-show overcloud $o >/dev/null; } 2>&1 | grep real | awk '{print $2}'
done

Comment 10 Joe Talerico 2017-09-25 14:25:43 UTC
With larger deployed clouds I am seeing this timeout as well. Here is what I am seeing in Mistral:
https://gist.github.com/jtaleric/08d8eb2e2b54ed4922e4b93d1f316ec1

Comment 11 Joe Talerico 2017-09-25 14:49:13 UTC
From a decently sized bare-metal cloud (81 nodes):

(undercloud) [stack@b08-h01-1028r ~]$ ./shardy-test.sh 
DeployedServerEnvironment - 0m19.823s
EnabledServices - 0m20.853s
EndpointMap - 0m19.664s
HostsEntry - 0m20.131s
KeystoneAdminVip - 0m21.174s
KeystoneURL - 0m19.730s
ManagedEndpoints - 0m20.082s
RoleConfig - 0m21.217s
RoleData - 0m20.243s
RoleNetHostnameMap - 0m19.658s
RoleNetIpMap - 0m20.927s
ServerIdData - 0m20.414s
ServerOsCollectConfigData - 0m20.354s
VipMap - 0m21.538s

Comment 12 Zane Bitter 2017-09-25 14:53:05 UTC
Opened a Launchpad bug to track this upstream.

Comment 13 Marius Cornea 2017-09-25 17:13:27 UTC
Adding a note with my measurements from an environment which hasn't manifested this issue:

(undercloud) [stack@undercloud-0 ~]$ time openstack stack show overcloud >/dev/null

real	0m52.694s
user	0m2.734s
sys	0m0.440s

(undercloud) [stack@undercloud-0 ~]$ bash time_outputs.sh 
DeployedServerEnvironment - 0m28.699s
EnabledServices - 0m29.205s
EndpointMap - 0m27.473s
HostsEntry - 0m28.427s
KeystoneAdminVip - 0m27.812s
KeystoneURL - 0m28.718s
ManagedEndpoints - 0m29.427s
RoleConfig - 0m28.113s
RoleData - 0m29.859s
RoleNetHostnameMap - 0m28.073s
RoleNetIpMap - 0m28.305s
ServerIdData - 0m27.675s
ServerOsCollectConfigData - 0m28.109s
VipMap - 0m28.335s

Comment 14 Zane Bitter 2017-09-25 20:01:35 UTC
I submitted a couple of patches upstream that get the times back into sane ranges.

(undercloud) [root@undercloud74 ~]# time openstack stack show overcloud
...
real	0m30.415s
user	0m6.398s
sys	0m0.649s


(undercloud) [root@undercloud74 ~]# bash time_outputs.sh 
DeployedServerEnvironment - 0m8.726s
EnabledServices - 0m7.239s
EndpointMap - 0m10.393s
HostsEntry - 0m6.987s
KeystoneAdminVip - 0m10.154s
KeystoneURL - 0m7.156s
ManagedEndpoints - 0m9.395s
RoleConfig - 0m10.800s
RoleData - 0m12.631s
RoleNetHostnameMap - 0m7.225s
RoleNetIpMap - 0m10.686s
ServerIdData - 0m9.228s
ServerOsCollectConfigData - 0m10.131s
VipMap - 0m10.862s

Comment 15 Omri Hochman 2017-09-26 14:07:07 UTC
Going to apply the patches and test them on my env:
http://pastebin.test.redhat.com/519409

Comment 16 Omri Hochman 2017-09-26 15:37:05 UTC
(In reply to Omri Hochman from comment #15)
> going to apply the patch and test them on my env: 
> http://pastebin.test.redhat.com/519409


After applying the patches I still get "Stack overcloud CREATE_COMPLETE"
with a missing overcloudrc file.

Comment 18 Omri Hochman 2017-10-02 16:51:09 UTC
(In reply to Omri Hochman from comment #16)
> (In reply to Omri Hochman from comment #15)
> > going to apply the patch and test them on my env: 
> > http://pastebin.test.redhat.com/519409
> 
> 
> After applying the patches I still get:  Stack overcloud CREATE_COMPLETE
> with missing overcloudrc file.

 I verified this patch works for me --> https://code.engineering.redhat.com/gerrit/#/c/119233/1


Got a successful deployment + the overcloudrc file exists.

Comment 19 Zane Bitter 2017-10-04 01:51:50 UTC
Both patches are now merged to upstream stable/pike, so we could pick them up by rebasing from RDO.
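
For anyone backporting by hand before the rebase lands, the standard Gerrit cherry-pick flow for the two changes in the Links section would be (sketch only; the /1 patchset suffixes are placeholders, pick the current patchset):

git fetch https://review.openstack.org/openstack/heat refs/changes/30/507930/1 && git cherry-pick FETCH_HEAD
git fetch https://review.openstack.org/openstack/heat refs/changes/31/507931/1 && git cherry-pick FETCH_HEAD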

Comment 27 errata-xmlrpc 2017-12-13 22:10:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

