Description of problem:

OSP11 -> OSP12 upgrade: major-upgrade-composable-steps-docker.yaml fails on deployments with standalone nodes with: Internal Server Error (HTTP 500)

We can actually see that the stack got updated, but the mistral workflow (I believe the one which creates overcloudrc) fails with an Internal Server Error (HTTP 500):

[...]
2017-09-19 18:11:47Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.CephStoragePostUpgradeConfig_Deployment5]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-09-19 18:11:48Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.CephStoragePostUpgradeConfig_Deployment5]: CREATE_COMPLETE  state changed
2017-09-19 18:11:50Z [0]: SIGNAL_IN_PROGRESS  Signal: deployment cf2c2862-3b85-4758-911f-b94097c93ff1 succeeded
2017-09-19 18:11:51Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.ControllerPostUpgradeConfig_Deployment5.2]: SIGNAL_IN_PROGRESS  Signal: deployment 1a84fb43-101f-42ae-9b96-bec3398742d4 succeeded
2017-09-19 18:11:51Z [0]: CREATE_COMPLETE  state changed
2017-09-19 18:11:51Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.NetworkerPostUpgradeConfig_Deployment5]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-09-19 18:11:51Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.ControllerPostUpgradeConfig_Deployment5.2]: CREATE_COMPLETE  state changed
2017-09-19 18:11:51Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.ControllerPostUpgradeConfig_Deployment5]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-09-19 18:11:52Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.NetworkerPostUpgradeConfig_Deployment5]: CREATE_COMPLETE  state changed
2017-09-19 18:11:52Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4.ControllerPostUpgradeConfig_Deployment5]: CREATE_COMPLETE  state changed
2017-09-19 18:12:00Z [overcloud-AllNodesDeploySteps-5bnoiyot2ph4]: UPDATE_COMPLETE  Stack UPDATE completed successfully
2017-09-19 18:12:00Z [AllNodesDeploySteps]: UPDATE_COMPLETE  state changed
2017-09-19 18:12:19Z [overcloud]: UPDATE_COMPLETE  Stack UPDATE completed successfully

 Stack overcloud UPDATE_COMPLETE
Internal Server Error (HTTP 500)

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170913050523.0rc2.el7ost.noarch
python-mistral-lib-0.2.0-0.20170821165722.bb1b87b.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistral-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistralclient-3.1.3-0.20170913011357.c33d39c.el7ost.noarch
puppet-mistral-11.3.1-0.20170825184651.cf2e493.el7ost.noarch
openstack-mistral-common-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-api-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-executor-5.1.1-0.20170909041831.a8e648c.el7ost.noarch

How reproducible:
100% on deployments with standalone nodes

Steps to Reproduce:
1. Deploy OSP11 with 3 controllers, 3 database, 3 messaging, 2 networker, 2 computes, 3 ceph nodes
2. Upgrade to OSP12:

#!/bin/bash
timeout 100m openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates \
  -r /home/stack/composable_roles/roles/roles_data.yaml \
  -e /home/stack/composable_roles/roles/nodes.yaml \
  -e /home/stack/composable_roles/roles/network-config.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/composable_roles/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/composable_roles/network/network-environment.yaml \
  -e /home/stack/composable_roles/enable-tls.yaml \
  -e /home/stack/composable_roles/inject-trust-anchor.yaml \
  -e /home/stack/composable_roles/public_vip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
  -e /home/stack/composable_roles/debug.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
  -e /home/stack/ceph-ansible-env.yaml \
  -e /home/stack/docker-osp12.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml

Actual results:
The stack gets updated but the mistral workflow at the end fails with Internal Server Error (HTTP 500).

Expected results:
No error in the process.

Additional info:
Checking /var/log/mistral/engine.log we can see the following trace:

2017-09-19 14:14:36.485 15678 INFO mistral.engine.engine_server [req-accc913a-07e8-4cbc-8e7a-9eaafb8d97d7 334c5c67b5944522889fea53c8ccc901 eb6d19788a5346a691da6aea743fb13d - default default] Received RPC request 'start_action'[name=tripleo.deployment.overcloudrc, input={no_proxy: , container: overcloud}, description=None, params={u'save_result': True, u'run_sync': True}]
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server [req-accc913a-07e8-4cbc-8e7a-9eaafb8d97d7 334c5c67b5944522889fea53c8ccc901 eb6d19788a5346a691da6aea743fb13d - default default] Exception during message handling: MessagingTimeout: Timed out waiting for a reply to message ID 8cb9986d7c8346069c58cf87c5f23447
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/engine_server.py", line 135, in start_action
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     **params
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/action_queue.py", line 74, in decorate
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     res = func(*args, **kw)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 78, in start_action
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     output = action.run(action_input, target, save=False)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/engine/actions.py", line 286, in run
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     async_=False
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/rpc/clients.py", line 331, in run_action
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     return rpc_client_method(auth_ctx.ctx(), 'run_action', **rpc_kwargs)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     **kwargs
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     retry=self.retry)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     timeout=timeout, retry=retry)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     retry=retry)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in _send
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     result = self._waiter.wait(msg_id, timeout)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     message = self.waiters.get(msg_id, timeout=timeout)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 347, in get
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server     'to message ID %s' % msg_id)
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server MessagingTimeout: Timed out waiting for a reply to message ID 8cb9986d7c8346069c58cf87c5f23447
2017-09-19 14:15:36.497 15678 ERROR oslo_messaging.rpc.server
2017-09-19 14:15:39.681 15678 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 8cb9986d7c8346069c58cf87c5f23447
Trace with debug enabled:

REQ: curl -g -i -X POST https://192.168.24.2:13989/v2/action_executions -H "User-Agent: osc-lib/1.7.0 keystoneauth1/3.1.0 python-requests/2.11.1 CPython/2.7.5" -H "content-type: application/json" -H "X-Auth-Token: {SHA1}87997e684e702a66de0ea16cba6c0aed9305b348" -d '{"input": "{\"no_proxy\": \"\", \"container\": \"overcloud\"}", "params": "{\"save_result\": true, \"run_sync\": true}", "name": "tripleo.deployment.overcloudrc"}'
Resetting dropped connection: 192.168.24.2
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:303: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
"POST /v2/action_executions HTTP/1.1" 500 155
RESP: [500] Content-Length: 155 Content-Type: application/json Date: Tue, 19 Sep 2017 19:26:27 GMT Connection: keep-alive
RESP BODY: {"debuginfo": null, "faultcode": "Server", "faultstring": "MessagingTimeout: Timed out waiting for a reply to message ID af1fed7d76dd4b168a8da568f72ff7f5"}
Request returned failure status: 500
Internal Server Error (HTTP 500)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 137, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 924, in take_action
    no_proxy=parsed_args.no_proxy)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 96, in overcloudrc
    **input_)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/base.py", line 25, in call_action
    save_result=True, run_sync=True)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/v2/action_executions.py", line 43, in create
    json.dumps(data)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/httpclient.py", line 52, in decorator
    resp = func(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/httpclient.py", line 123, in post
    data=body, **options)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 853, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/site-packages/osc_lib/session.py", line 40, in request
    resp = super(TimingSession, self).request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 101, in inner
    return wrapped(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 742, in request
    raise exceptions.from_response(resp, method, url)
InternalServerError: Internal Server Error (HTTP 500)
clean_up DeployOvercloud: Internal Server Error (HTTP 500)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 134, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 279, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 169, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 137, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 924, in take_action
    no_proxy=parsed_args.no_proxy)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 96, in overcloudrc
    **input_)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/base.py", line 25, in call_action
    save_result=True, run_sync=True)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/v2/action_executions.py", line 43, in create
    json.dumps(data)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/httpclient.py", line 52, in decorator
    resp = func(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistralclient/api/httpclient.py", line 123, in post
    data=body, **options)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 853, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/site-packages/osc_lib/session.py", line 40, in request
    resp = super(TimingSession, self).request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/positional/__init__.py", line 101, in inner
    return wrapped(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/keystoneauth1/session.py", line 742, in request
    raise exceptions.from_response(resp, method, url)
InternalServerError: Internal Server Error (HTTP 500)
END return value: 1
Trying to run the curl command manually, we can see that the server returns a 500 error:

curl -g -i -X POST https://192.168.24.2:13989/v2/action_executions -H "User-Agent: osc-lib/1.7.0 keystoneauth1/3.1.0 python-requests/2.11.1 CPython/2.7.5" -H "content-type: application/json" -H "X-Auth-Token: gAAAAABZwW-akpwj0GZ_9NIhDRBi3Ww7_qEMGZbodU0gxfg2b_URCGBcfowFl5s8ruE8Pedeq4s0c422UGS56Yw4Hv-QQT1CSydvx5qqWXnEkE2_cl6du2NbLaVwNnbDsmtqvI-HV7Yo895VQWBhddSGDoaJWnEdJJeECsnEkXK0UitPI4npwD0" -d '{"input": "{\"no_proxy\": \"\", \"container\": \"overcloud\"}", "params": "{\"save_result\": true, \"run_sync\": true}", "name": "tripleo.deployment.overcloudrc"}'

HTTP/1.1 500 Internal Server Error
Content-Length: 155
Content-Type: application/json
Date: Tue, 19 Sep 2017 19:28:59 GMT

{"debuginfo": null, "faultcode": "Server", "faultstring": "MessagingTimeout: Timed out waiting for a reply to message ID 741f869995a344b988b5f0c25fe77af0"}

In /var/log/mistral/api.log:

2017-09-19 15:28:59.849 15773 ERROR wsme.api [req-ac2f9e10-0601-4613-bdb1-b4361aa9e0d8 334c5c67b5944522889fea53c8ccc901 eb6d19788a5346a691da6aea743fb13d - default default] Server-side error: "MessagingTimeout: Timed out waiting for a reply to message ID 741f869995a344b988b5f0c25fe77af0". Detail:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/api/controllers/v2/action_execution.py", line 166, in post
    **params
  File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 109, in decorator
    _wrap_exception_and_reraise(e)
  File "/usr/lib/python2.7/site-packages/mistral/rpc/base.py", line 84, in _wrap_exception_and_reraise
    raise exc.MistralException(message)
MistralException: MessagingTimeout: Timed out waiting for a reply to message ID 741f869995a344b988b5f0c25fe77af0
: MistralException: MessagingTimeout: Timed out waiting for a reply to message ID 741f869995a344b988b5f0c25fe77af0
Reproduced on my env, but with a **clean deployment** of the OSP12 puddle 2017-09-20.1:

[...] u'container': u'overcloud'}', output='Unset', created_at='Unset', updated_at='Unset', params='{u'save_result': True, u'run_sync': True}']]
2017-09-02 09:28:59.822 6703 ERROR wsme.api [req-fc948c8d-5cc0-44a6-b09c-b9059970eaf2 01de881368404747819390895f034903 64c497d4a44e47b58b0650e6ce7a6000 - default default] Server-side error: "MessagingTimeout: Timed out waiting for a reply to message ID 3209824abd514b48a07f65d62d5db68a". Detail: Traceback (most recent call last):

My case was a bare-metal environment. Adding TestBlocker to the BZ.
I spotted this error when looking into the env. I am not sure if it is related, but I wanted to log it somewhere handy as I've not seen it before.

2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor Traceback (most recent call last):
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/mistral/executors/default_executor.py", line 109, in run_action
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor     result = action.run(context.ctx())
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/ansible.py", line 409, in run
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor     log_errors=processutils.LogErrors.ALL)
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor   File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 400, in execute
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor     cmd=sanitized_cmd)
2017-09-01 14:17:06.247 6618 ERROR mistral.executors.default_executor ProcessExecutionError: Unexpected error while running command.
Note that as a temporary workaround it should be possible to create the overcloudrc manually, e.g.:

(undercloud) [stack@undercloud ~]$ cat overcloudrc
# Clear any old environment that may conflict.
for key in $( set | awk '{FS="="}  /^OS_/ {print $1}' ); do unset $key ; done
export OS_USERNAME=admin
export OS_BAREMETAL_API_VERSION=1.34
export NOVA_VERSION=1.1
export OS_PROJECT_NAME=admin
export OS_PASSWORD=theadminpassword
export OS_NO_CACHE=True
export COMPUTE_API_VERSION=1.1
export no_proxy=,10.0.0.9,192.168.24.9
export OS_VOLUME_API_VERSION=3
export OS_CLOUDNAME=overcloud
export OS_AUTH_URL=http://10.0.0.9:5000/v2.0
export IRONIC_API_VERSION=1.34
export OS_IMAGE_API_VERSION=2
export OS_AUTH_TYPE=password
export PYTHONWARNINGS="ignore:Certificate has no, ignore:A true SSLContext object is not available"
# Add OS_CLOUDNAME to PS1
if [ -z "${CLOUDPROMPT_ENABLED:-}" ]; then
    export PS1=${PS1:-""}
    export PS1=\${OS_CLOUDNAME:+"(\$OS_CLOUDNAME)"}\ $PS1
    export CLOUDPROMPT_ENABLED=1
fi

(undercloud) [stack@undercloud ~]$ openstack stack environment show overcloud | grep "AdminPassword:"
  AdminPassword: theadminpassword

(undercloud) [stack@undercloud ~]$ openstack stack output show overcloud KeystoneURL
+--------------+----------------------------------------+
| Field        | Value                                  |
+--------------+----------------------------------------+
| description  | URL for the Overcloud Keystone service |
| output_key   | KeystoneURL                            |
| output_value | http://10.0.0.9:5000/v2.0              |
+--------------+----------------------------------------+

(undercloud) [stack@undercloud ~]$ openstack stack output show overcloud VipMap
+--------------+-------------------------------------------------------------------------+
| Field        | Value                                                                   |
+--------------+-------------------------------------------------------------------------+
| description  | Mapping of each network to VIP addresses. Also includes the Redis VIP. |
| output_key   | VipMap                                                                  |
| output_value | {                                                                       |
|              |   "storage": "172.16.1.8",                                              |
|              |   "management_uri": "",                                                 |
|              |   "internal_api_subnet": "",                                            |
|              |   "ctlplane": "192.168.24.9",                                           |
|              |   "storage_uri": "172.16.1.8",                                          |
|              |   "management": "",                                                     |
|              |   "management_subnet": "",                                              |
|              |   "redis": "172.16.2.13",                                               |
|              |   "storage_subnet": "",                                                 |
|              |   "storage_mgmt_uri": "172.16.3.12",                                    |
|              |   "tenant_uri": "",                                                     |
|              |   "external": "10.0.0.9",                                               |
|              |   "storage_mgmt": "172.16.3.12",                                        |
|              |   "tenant": "",                                                         |
|              |   "tenant_subnet": "",                                                  |
|              |   "ctlplane_uri": "192.168.24.9",                                       |
|              |   "external_subnet": "",                                                |
|              |   "storage_mgmt_subnet": "",                                            |
|              |   "internal_api": "172.16.2.10",                                        |
|              |   "internal_api_uri": "172.16.2.10",                                    |
|              |   "external_uri": "10.0.0.9",                                           |
|              |   "ctlplane_subnet": "192.168.24.9/24"                                  |
|              | }                                                                       |
+--------------+-------------------------------------------------------------------------+

As you can see, we just need the AdminPassword, the Keystone endpoint, and the ctlplane/external VIP IPs for no_proxy.
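For convenience, the lookups above can also be scripted. The following is a minimal, untested sketch of that manual workaround, not the logic tripleoclient itself runs to generate overcloudrc; the awk parsing, the "-f value -c output_value" flags and the hard-coded VIP variables are illustrative assumptions (the VIPs should be copied from the VipMap output shown above):

#!/bin/bash
# Hypothetical helper for the manual workaround -- adjust to your environment.

# Admin password and Keystone endpoint, read the same way as in the commands above.
ADMIN_PASSWORD=$(openstack stack environment show overcloud | awk '/AdminPassword:/ {print $2}')
KEYSTONE_URL=$(openstack stack output show overcloud KeystoneURL -f value -c output_value)

# Fill these in by hand from "openstack stack output show overcloud VipMap"
# (the "ctlplane" and "external" keys).
CTLPLANE_VIP=192.168.24.9
EXTERNAL_VIP=10.0.0.9

cat > ~/overcloudrc <<EOF
export OS_USERNAME=admin
export OS_PROJECT_NAME=admin
export OS_PASSWORD=$ADMIN_PASSWORD
export OS_AUTH_URL=$KEYSTONE_URL
export OS_CLOUDNAME=overcloud
export no_proxy=,$EXTERNAL_VIP,$CTLPLANE_VIP
EOF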
From what I could observe in both systems where the issue was reproduced (after a clean deployment and after an upgrade), the action called to generate the overcloudrc times out because one of the stack output calls takes too long. In both environments, even a simple "openstack stack show overcloud" took almost 2 minutes:

(undercloud) [root@undercloud74 ~]# time openstack stack show overcloud
.....
real 1m48.796s
user 0m4.857s
sys 0m0.560s

(undercloud) [stack@undercloud ~]$ time openstack stack show overcloud
.....
real 1m54.717s
user 0m7.980s
sys 0m1.864s
Heat and mistral versions in the env:

(undercloud) [root@undercloud74 ~]# rpm -qa | grep heat
python-heatclient-1.11.0-0.20170814165725.804ab3e.el7ost.noarch
openstack-heat-engine-9.0.1-0.20170911115334.0c64134.el7ost.noarch
openstack-heat-common-9.0.1-0.20170911115334.0c64134.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-cfn-9.0.1-0.20170911115334.0c64134.el7ost.noarch
python-heat-agent-1.4.1-0.20170824121708.ff31eb2.el7ost.noarch
puppet-heat-11.3.1-0.20170825182847.1700009.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170913050524.0rc2.el7ost.noarch
openstack-heat-api-9.0.1-0.20170911115334.0c64134.el7ost.noarch

(undercloud) [root@undercloud74 ~]# rpm -qa | grep mistral
python-mistral-lib-0.2.0-0.20170821165722.bb1b87b.el7ost.noarch
python-mistral-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-api-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistralclient-3.1.3-0.20170913011357.c33d39c.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
puppet-mistral-11.3.1-0.20170825184651.cf2e493.el7ost.noarch
openstack-mistral-executor-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-common-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
Ok, so I did some comparative testing with a small script that times the heat output-show command:

(undercloud) [stack@undercloud74 shardy_test]$ bash shardy_time_outputs.sh | tee shardy_time_outputs.out
DeployedServerEnvironment - 0m58.723s
EnabledServices - 0m54.666s
EndpointMap - 0m55.209s
HostsEntry - 0m57.220s
KeystoneAdminVip - 0m57.462s
KeystoneURL - 0m56.010s
ManagedEndpoints - 0m57.980s
RoleConfig - 0m58.881s
RoleData - 0m56.486s
RoleNetHostnameMap - 0m58.121s
RoleNetIpMap - 0m56.799s
ServerIdData - 0m59.442s
ServerOsCollectConfigData - 0m59.973s
VipMap - 1m0.919s

This is using the environment from comment #8.

With my local dev env it's much faster (this is a VM setup running on a desktop machine, with only two nodes deployed vs the 7 in the undercloud74 environment):

(undercloud) [stack@undercloud shardy_test]$ bash time_outputs.sh | tee time_outputs.out
DeployedServerEnvironment - 0m11.750s
EnabledServices - 0m11.678s
EndpointMap - 0m11.882s
HostsEntry - 0m11.770s
KeystoneAdminVip - 0m11.870s
KeystoneURL - 0m11.567s
ManagedEndpoints - 0m12.028s
RoleConfig - 0m11.722s
RoleData - 0m12.177s
RoleNetHostnameMap - 0m11.876s
RoleNetIpMap - 0m11.556s
ServerIdData - 0m11.631s
ServerOsCollectConfigData - 0m12.219s
VipMap - 0m12.014s

I think the timings should be more comparable than this, so I wonder if we're missing a heat patch on stable/pike, but I've not spotted any relevant ones yet.

This is the script for anyone who wants to replicate this test:

(undercloud) [stack@undercloud shardy_test]$ cat time_outputs.sh
#!/bin/bash
OUTPUTS="DeployedServerEnvironment \
EnabledServices \
EndpointMap \
HostsEntry \
KeystoneAdminVip \
KeystoneURL \
ManagedEndpoints \
RoleConfig \
RoleData \
RoleNetHostnameMap \
RoleNetIpMap \
ServerIdData \
ServerOsCollectConfigData \
VipMap"

for o in $OUTPUTS; do
  echo -n "$o - "
  { time heat output-show overcloud $o >/dev/null; } 2>&1 | grep real | awk '{print $2}'
done
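Note that the script above uses the legacy heat CLI. If that client isn't installed on the undercloud, the same per-output timing can presumably be gathered with the unified openstack client already used elsewhere in this bug; this variant of the loop is my own assumption, not part of the original test:

for o in $OUTPUTS; do
  echo -n "$o - "
  # Same measurement, but via "openstack stack output show" instead of "heat output-show".
  { time openstack stack output show overcloud $o >/dev/null; } 2>&1 | grep real | awk '{print $2}'
done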
With larger deployed clouds I am seeing this timeout as well. Here is what I am seeing in Mistral : https://gist.github.com/jtaleric/08d8eb2e2b54ed4922e4b93d1f316ec1
From a decently sized baremetal cloud (81 nodes):

(undercloud) [stack@b08-h01-1028r ~]$ ./shardy-test.sh
DeployedServerEnvironment - 0m19.823s
EnabledServices - 0m20.853s
EndpointMap - 0m19.664s
HostsEntry - 0m20.131s
KeystoneAdminVip - 0m21.174s
KeystoneURL - 0m19.730s
ManagedEndpoints - 0m20.082s
RoleConfig - 0m21.217s
RoleData - 0m20.243s
RoleNetHostnameMap - 0m19.658s
RoleNetIpMap - 0m20.927s
ServerIdData - 0m20.414s
ServerOsCollectConfigData - 0m20.354s
VipMap - 0m21.538s
Opened a Launchpad bug to track this upstream.
Adding a note with my measurements from an environment which hasn't manifested this issue:

(undercloud) [stack@undercloud-0 ~]$ time openstack stack show overcloud >/dev/null
real 0m52.694s
user 0m2.734s
sys 0m0.440s

(undercloud) [stack@undercloud-0 ~]$ bash time_outputs.sh
DeployedServerEnvironment - 0m28.699s
EnabledServices - 0m29.205s
EndpointMap - 0m27.473s
HostsEntry - 0m28.427s
KeystoneAdminVip - 0m27.812s
KeystoneURL - 0m28.718s
ManagedEndpoints - 0m29.427s
RoleConfig - 0m28.113s
RoleData - 0m29.859s
RoleNetHostnameMap - 0m28.073s
RoleNetIpMap - 0m28.305s
ServerIdData - 0m27.675s
ServerOsCollectConfigData - 0m28.109s
VipMap - 0m28.335s
I submitted a couple of patches upstream that get the times back into sane ranges:

(undercloud) [root@undercloud74 ~]# time openstack stack show overcloud
...
real 0m30.415s
user 0m6.398s
sys 0m0.649s

(undercloud) [root@undercloud74 ~]# bash time_outputs.sh
DeployedServerEnvironment - 0m8.726s
EnabledServices - 0m7.239s
EndpointMap - 0m10.393s
HostsEntry - 0m6.987s
KeystoneAdminVip - 0m10.154s
KeystoneURL - 0m7.156s
ManagedEndpoints - 0m9.395s
RoleConfig - 0m10.800s
RoleData - 0m12.631s
RoleNetHostnameMap - 0m7.225s
RoleNetIpMap - 0m10.686s
ServerIdData - 0m9.228s
ServerOsCollectConfigData - 0m10.131s
VipMap - 0m10.862s
going to apply the patch and test them on my env: http://pastebin.test.redhat.com/519409
(In reply to Omri Hochman from comment #15)
> going to apply the patch and test them on my env:
> http://pastebin.test.redhat.com/519409

After applying the patches I still get: Stack overcloud CREATE_COMPLETE, but with a missing overcloudrc file.
(In reply to Omri Hochman from comment #16)
> (In reply to Omri Hochman from comment #15)
> > going to apply the patch and test them on my env:
> > http://pastebin.test.redhat.com/519409
>
> After applying the patches I still get: Stack overcloud CREATE_COMPLETE,
> but with a missing overcloudrc file.

I verified that this patch works for me --> https://code.engineering.redhat.com/gerrit/#/c/119233/1
Got a successful deployment + the overcloudrc file exists.
Both patches are now merged to upstream stable/pike, so we could pick them up by rebasing from RDO.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462