Bug 1705694

Summary: openstack overcloud deploy command fails with socket.timeout: timed out
Product: Red Hat OpenStack
Component: rhosp-director
Version: 15.0 (Stein)
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Reporter: Sai Sindhur Malleni <smalleni>
Assignee: RHOS Maint <rhos-maint>
QA Contact: Sasha Smolyak <ssmolyak>
CC: aschultz, dbecker, mburns, morazi
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-05-06 20:24:23 UTC

Description Sai Sindhur Malleni 2019-05-02 19:06:39 UTC
Description of problem:
I am trying to deploy an overcloud with the following command:
time openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e ~/templates/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml -e ~/containers-prepare-parameters.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e ~/templates/osp15.yaml --ntp-server clock.redhat.com


After about 20 minutes it fails with the following error:
Creating Swift container to store the plan
Creating plan from template files in: /tmp/tripleoclient-9lci373i/tripleo-heat-templates
Timed out waiting for messages from Execution (ID: 75f7f917-67ab-4b9c-8ad8-210f16660c99, State: ERROR). The Workflow errored and no messages were received.
Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/websocket/_socket.py", line 81, in recv
    bytes_ = sock.recv(bufsize)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 153, in wait_for_messages
    message = self.recv()
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 131, in recv
    return json.loads(self._ws.recv())
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 310, in recv
    opcode, data = self.recv_data()
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 327, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 340, in recv_data_frame
    frame = self.recv_frame()
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 374, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/usr/lib/python3.6/site-packages/websocket/_abnf.py", line 361, in recv_frame
    self.recv_header()
  File "/usr/lib/python3.6/site-packages/websocket/_abnf.py", line 309, in recv_header
    header = self.recv_strict(2)
  File "/usr/lib/python3.6/site-packages/websocket/_abnf.py", line 396, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "/usr/lib/python3.6/site-packages/websocket/_core.py", line 449, in _recv
    return recv(self.sock, bufsize)
  File "/usr/lib/python3.6/site-packages/websocket/_socket.py", line 84, in recv
    raise WebSocketTimeoutException(message)
websocket._exceptions.WebSocketTimeoutException: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 30, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 919, in take_action
    self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 374, in _deploy_tripleo_heat_templates_tmpdir
    new_tht_root, tht_root)
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 407, in _deploy_tripleo_heat_templates
    validate_stack=False)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/plan_management.py", line 174, in create_plan_from_templates
    validate_stack=validate_stack)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/plan_management.py", line 87, in create_deployment_plan
    **workflow_input)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/plan_management.py", line 77, in _create_update_deployment_plan
    _WORKFLOW_TIMEOUT):
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/base.py", line 61, in wait_for_messages
    for payload in websocket.wait_for_messages(timeout=timeout):
  File "/usr/lib/python3.6/site-packages/tripleoclient/plugin.py", line 158, in wait_for_messages
    raise exceptions.WebSocketTimeout()
tripleoclient.exceptions.WebSocketTimeout
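
The three stacked tracebacks above describe one failure seen at three layers: the raw socket timeout, the websocket library's wrapper, and the client's own timeout error. The following is a minimal Python sketch of that chaining pattern. The class and function names are hypothetical stand-ins chosen for illustration; this is not the actual websocket or tripleoclient code.

```python
import socket


class TransportTimeout(Exception):
    """Hypothetical stand-in for the websocket library's timeout wrapper."""


class ClientTimeout(Exception):
    """Hypothetical stand-in for the client's own timeout error."""


def low_level_recv():
    # Simulate a socket read that receives no data before the deadline.
    raise socket.timeout("timed out")


def transport_recv():
    try:
        return low_level_recv()
    except socket.timeout as exc:
        # Raising inside an except block stores the original error in
        # __context__; Python prints that link as "During handling of the
        # above exception, another exception occurred".
        raise TransportTimeout(str(exc))


def wait_for_messages():
    try:
        return transport_recv()
    except TransportTimeout:
        raise ClientTimeout()


def exception_chain(exc):
    """Return class names from the outermost error down to the root cause."""
    names = []
    while exc is not None:
        names.append(type(exc).__name__)
        exc = exc.__context__
    return names


try:
    wait_for_messages()
except ClientTimeout as err:
    chain = exception_chain(err)
    print(" <- ".join(chain))
```

In a chained traceback the root cause is printed first and the error that reached the caller is printed last. Here the chain only says that the client stopped waiting; since the output above reports that the workflow went to ERROR without sending a message, the underlying cause has to be looked up in the workflow service logs.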


Version-Release number of selected component (if applicable):
OSP15
(undercloud) [stack@f16-h10-000-1029p ~]$ sudo rpm -qa | grep tripleo
openstack-tripleo-puppet-elements-10.3.1-0.20190420090433.9ba1438.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190420043237.7d6edd9.el8ost.noarch
python3-tripleoclient-heat-installer-11.4.1-0.20190423085110.290ac95.el8ost.noarch
openstack-tripleo-validations-10.4.1-0.20190420030347.9d08e89.el8ost.noarch
python3-tripleo-common-10.7.1-0.20190423083511.2199eeb.el8ost.noarch
python3-tripleoclient-11.4.1-0.20190423085110.290ac95.el8ost.noarch
ansible-tripleo-ipsec-9.1.1-0.20190422122014.8c1fdab.el8ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190422122515.f1dfdc6.el8ost.noarch
openstack-tripleo-common-10.7.1-0.20190423083511.2199eeb.el8ost.noarch
openstack-tripleo-heat-templates-10.5.1-0.20190423085106.3f148c4.el8ost.noarch
openstack-tripleo-common-containers-10.7.1-0.20190423083511.2199eeb.el8ost.noarch
puppet-tripleo-10.4.1-0.20190420063733.7fc5500.el8ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy undercloud and introspect overcloud nodes
2. Run the overcloud deploy command

Actual results:
Command exits with failure

Expected results:
Deploy should succeed

Additional info:
Looking at the Mistral engine logs on the undercloud, I see:
2019-05-02 18:40:15.028 1 ERROR mistral.engine.task_handler [req-6a5a1e26-0287-4424-b5b0-9485fc25152e a76551fbe21c42dd8ea80ac74eeedd76 5018fa8b4e8144dc901c4e04cd0a624b - default default] Failed to run task [error=Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=['validate']], wf=tripleo.plan_management.v1.create_deployment_plan, task=add_root_stack_name]:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/mistral/engine/task_handler.py", line 63, in run_task
    task.run()
  File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 453, in run
    self._run_new()
  File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 485, in _run_new
    self._schedule_actions()
  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 563, in _schedule_actions
    action.validate_input(input_dict)
  File "/usr/lib/python3.6/site-packages/mistral/engine/actions.py", line 336, in validate_input
    self.action_def.action_class
  File "/usr/lib/python3.6/site-packages/mistral/engine/utils.py", line 66, in validate_input
    raise exc.InputException(msg % tuple(msg_props))
mistral.exceptions.InputException: Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=['validate']]
: mistral.exceptions.InputException: Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=['validate']]
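
The InputException above says that the workflow task passed an input named validate that the loaded UpdateParametersAction class does not accept. A rough Python sketch of this kind of check follows, assuming the validator compares the supplied input keys with the constructor arguments of the action class. The two classes and their parameter lists are hypothetical, and this is not Mistral's actual implementation.

```python
import inspect


class UpdateParametersOld:
    """Hypothetical older action: the constructor has no 'validate' argument."""

    def __init__(self, parameters, container="overcloud"):
        self.parameters = parameters
        self.container = container


class UpdateParametersNew:
    """Hypothetical newer action that accepts 'validate'."""

    def __init__(self, parameters, container="overcloud", validate=True):
        self.parameters = parameters
        self.container = container
        self.validate = validate


def unexpected_inputs(action_class, input_dict):
    """List the input keys that the action's constructor does not accept."""
    params = inspect.signature(action_class.__init__).parameters
    # A **kwargs constructor accepts any key, so nothing is unexpected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    accepted = set(params) - {"self"}
    return sorted(set(input_dict) - accepted)


# Input as a newer workflow definition might pass it (illustrative values).
workflow_input = {"parameters": {}, "container": "overcloud", "validate": False}

old_result = unexpected_inputs(UpdateParametersOld, workflow_input)
new_result = unexpected_inputs(UpdateParametersNew, workflow_input)
print(old_result)
print(new_result)
```

Under this reading, the error appears when the workflow definition and the action code come from different builds: the workflow already sends validate, while the action class that is loaded predates that argument.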

Comment 1 Alex Schultz 2019-05-06 20:24:23 UTC
I believe this is a duplicate of Bug 1700044. Please let us know if it's still occurring after the fix for 1700044 has been applied.

*** This bug has been marked as a duplicate of bug 1700044 ***

Comment 2 Sai Sindhur Malleni 2019-05-06 21:37:35 UTC
Hi Alex,

Could you confirm whether the following two steps are enough to apply the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1700044?

1. On the undercloud, install python3-oslo-rootwrap with dnf install python3-oslo-rootwrap
2. Patch tripleo-common on the undercloud at /usr/lib/python3.6/site-packages/tripleo_common/actions/ansible.py

Comment 3 Alex Schultz 2019-05-06 22:47:52 UTC
No, you have to patch the mistral container. The file needs to be updated inside the mistral-engine container, and then the container needs to be restarted.
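
As an illustration of the two steps described here, the Python sketch below only builds and prints the commands; it does not run them. It assumes that the container is named mistral_engine (the name used later in this report), that the file lives at the same path inside the container as the host path quoted in Comment 2, that podman cp is available in the installed podman version, and that ./ansible.py is a hypothetical local copy of the patched file.

```python
import shlex

# Assumption: container name as used later in this report.
CONTAINER = "mistral_engine"
# Assumption: same path inside the container as on the undercloud host.
TARGET = "/usr/lib/python3.6/site-packages/tripleo_common/actions/ansible.py"


def patch_commands(patched_file, container=CONTAINER, target=TARGET):
    """Build the commands that copy a patched file in and restart the container."""
    return [
        ["podman", "cp", patched_file, f"{container}:{target}"],
        ["podman", "restart", container],
    ]


# "./ansible.py" is a hypothetical local copy of the patched file.
commands = patch_commands("./ansible.py")
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

A file copied in this way lives in the container's writable layer, so it would likely be lost if the container is recreated from its image; elevated privileges may also be required depending on how the container was started.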

Comment 4 Sai Sindhur Malleni 2019-05-10 18:12:04 UTC
Hi Alex,

I patched the mistral container with https://review.opendev.org/#/c/657090/1/tripleo_common/actions/ansible.py and ran podman restart mistral_engine.

The overcloud deploy still fails, but much faster:

(undercloud) [stack@f16-h10-000-1029p ~]$ time openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e ~/templates/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-ovn-ha.yaml -e ~/containers-prepare-parameters.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e ~/templates/osp15.yaml --ntp-server clock.redhat.com
Removing the current plan files
Uploading new plan files
{'result': 'Failed to run task [error=Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=[\'validate\']], wf=tripleo.swift_backup.v1.create_swift_backup_container_plan, task=set_tempurl]:\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/mistral/engine/task_handler.py", line 63, in run_task\n    task.run()\n  File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper\n    result = f(*args, **kwargs)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 453, in run\n    self._run_new()\n  File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper\n    result = f(*args, **kwargs)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 485, in _run_new\n    self._schedule_actions()\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 563, in _schedule_actions\n    action.validate_input(input_dict)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/actions.py", line 336, in validate_input\n    self.action_def.action_class\n  File "/usr/lib/python3.6/site-packages/mistral/engine/utils.py", line 66, in validate_input\n    raise exc.InputException(msg % tuple(msg_props))\nmistral.exceptions.InputException: Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=[\'validate\']]\n'}
Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 30, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 919, in take_action
    self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 374, in _deploy_tripleo_heat_templates_tmpdir
    new_tht_root, tht_root)
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 400, in _deploy_tripleo_heat_templates
    validate_stack=False)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/plan_management.py", line 238, in update_plan_from_templates
    validate_stack=validate_stack)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/plan_management.py", line 122, in update_deployment_plan
    'Exception updating plan: {}'.format(payload['message']))
tripleoclient.exceptions.WorkflowServiceError: Exception updating plan: {'result': 'Failed to run task [error=Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=[\'validate\']], wf=tripleo.swift_backup.v1.create_swift_backup_container_plan, task=set_tempurl]:\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/mistral/engine/task_handler.py", line 63, in run_task\n    task.run()\n  File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper\n    result = f(*args, **kwargs)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 453, in run\n    self._run_new()\n  File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper\n    result = f(*args, **kwargs)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 485, in _run_new\n    self._schedule_actions()\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 563, in _schedule_actions\n    action.validate_input(input_dict)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/actions.py", line 336, in validate_input\n    self.action_def.action_class\n  File "/usr/lib/python3.6/site-packages/mistral/engine/utils.py", line 66, in validate_input\n    raise exc.InputException(msg % tuple(msg_props))\nmistral.exceptions.InputException: Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=[\'validate\']]\n'}
Exception updating plan: {'result': 'Failed to run task [error=Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=[\'validate\']], wf=tripleo.swift_backup.v1.create_swift_backup_container_plan, task=set_tempurl]:\nTraceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/mistral/engine/task_handler.py", line 63, in run_task\n    task.run()\n  File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper\n    result = f(*args, **kwargs)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 453, in run\n    self._run_new()\n  File "/usr/lib/python3.6/site-packages/osprofiler/profiler.py", line 160, in wrapper\n    result = f(*args, **kwargs)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 485, in _run_new\n    self._schedule_actions()\n  File "/usr/lib/python3.6/site-packages/mistral/engine/tasks.py", line 563, in _schedule_actions\n    action.validate_input(input_dict)\n  File "/usr/lib/python3.6/site-packages/mistral/engine/actions.py", line 336, in validate_input\n    self.action_def.action_class\n  File "/usr/lib/python3.6/site-packages/mistral/engine/utils.py", line 66, in validate_input\n    raise exc.InputException(msg % tuple(msg_props))\nmistral.exceptions.InputException: Invalid input [name=tripleo.parameters.update, class=tripleo_common.actions.parameters.UpdateParametersAction, unexpected=[\'validate\']]\n'}

real	0m24.002s
user	0m4.088s
sys	0m6.133s

Comment 5 Alex Schultz 2019-05-10 19:36:17 UTC
That error points to a mismatch between the containers and tripleo-common on the undercloud. Which containers are you using? See Bug 1700096.

*** This bug has been marked as a duplicate of bug 1700096 ***

Comment 6 Sai Sindhur Malleni 2019-05-10 20:20:35 UTC
Tag is 20190306.1 (passed_phase1)

Comment 8 Alex Schultz 2019-05-10 20:50:04 UTC
That's way too old. You need to use a newer version of the containers that matches the tripleo-common you have installed. Containers from at least May 9th (the most recent pass of phase1) should be available.
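
To make the age gap concrete, the Python sketch below compares the date stamp in the container tag from Comment 6 with the date stamp in the python3-tripleo-common package name from the description. It assumes that both strings embed a build date in YYYYMMDD form; that reading of the naming convention is an assumption, and the comparison is a rough heuristic, not an official compatibility check.

```python
import re
from datetime import datetime


def package_build_date(nvr):
    """Extract the YYYYMMDD stamp from a release such as 0.20190423083511.2199eeb."""
    match = re.search(r"\.(\d{8})\d{6}\.", nvr)
    if match is None:
        raise ValueError(f"no build timestamp found in {nvr!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def container_tag_date(tag):
    """Extract the YYYYMMDD stamp from a tag such as 20190306.1."""
    match = re.fullmatch(r"(\d{8})(?:\.\d+)?", tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


# Both values are copied from this report.
package = "python3-tripleo-common-10.7.1-0.20190423083511.2199eeb.el8ost.noarch"
tag = "20190306.1"

pkg_date = package_build_date(package)
tag_date = container_tag_date(tag)
lag_days = (pkg_date - tag_date).days
print(f"package built {pkg_date}, containers tagged {tag_date}, lag {lag_days} days")
```

With the values from this report, the containers are 48 days older than the installed tripleo-common package, which fits the mismatch described in Comment 5.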