Bug 1747126 - SSH enablement workflow timeout during deploy of a large overcloud using config-download
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: James Slagle
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Duplicates: 1703618 1746953 1780687 1786063
Depends On:
Blocks:
Reported: 2019-08-29 17:48 UTC by Sai Sindhur Malleni
Modified: 2023-10-06 18:31 UTC

Fixed In Version: openstack-tripleo-common-8.6.8-15.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-04 10:47:02 UTC
Target Upstream Version:
Embargoed:


Attachments
ansible-log (1.82 MB, text/plain)
2019-08-29 18:10 UTC, Sai Sindhur Malleni


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1842102 0 None None None 2019-08-30 16:17:01 UTC
OpenStack gerrit 679475 0 'None' MERGED Honor trash_output when not using queue 2021-01-29 03:33:19 UTC
OpenStack gerrit 679476 0 'None' MERGED Use trash_output in create_admin_via_ssh workflow 2021-01-29 03:33:20 UTC
OpenStack gerrit 679482 0 'None' MERGED Don't accumulate ansible output uselessly 2021-01-29 03:33:20 UTC
Red Hat Issue Tracker OSP-28260 0 None None None 2023-09-07 20:33:52 UTC
Red Hat Knowledge Base (Solution) 4091811 0 None None None 2019-09-19 18:42:46 UTC

Description Sai Sindhur Malleni 2019-08-29 17:48:35 UTC
Description of problem:
While trying to scale up a 103-node overcloud to 207 nodes, on multiple attempts we saw the heat stack update finish successfully, but the deploy failed because the ssh enablement workflow timed out.

ssh admin enablement workflow - TIMED OUT.

We saw the same result even after raising ENABLE_SSH_ADMIN_TIMEOUT and ENABLE_SSH_ADMIN_SSH_PORT_TIMEOUT from 300 to 600.


James Slagle looked at the ansible log file, and it appears that ansible itself succeeded, but in the logs we see:

2019-08-29 16:40:59.867 409069 ERROR mistral.db.utils [req-704d2539-505b-4259-b8a3-9f82c1ffe4da 2a6d10bddc274e00b00ad4d4adeffda5 c67ce78faf0643708bc7b067eb7525bd - default default] DB error detected, operation will be retried: <function on_action_complete at 0x7f1332047140>: DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (error(32, 'Broken pipe'))") [SQL: u'UPDATE action_executions_v2 SET updated_at=%(updated_at)s, state=%(state)s, accepted=%(accepted)s, output=%(output)s WHERE action_executions_v2.id = %(action_executions_v2_id)s'] [parameters: {'output': '{"result": {"log_path": "/tmp/ansible-mistral-actionwGCOhN/ansible.log", "stderr": "ansible-playbook 2.6.11\\n  config file = /tmp/ansible-mistral-ac ... (24673890 characters truncated) ... +0000 (0:00:20.791)       0:02:14.533 ******* \\n=============================================================================== \\n", "stdout": ""}}', 'state': 'SUCCESS', 'accepted': 1, 'updated_at': datetime.datetime(2019, 8, 29, 16, 40, 59), 'action_executions_v2_id': u'6712d0f7-0c20-4239-b03d-b4560193bf46'}] (Background on this error at: http://sqlalche.me/e/e3q8)


James feels this could be related to the stdout generated by the command.
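To illustrate why the size of the captured output matters, here is a minimal sketch. It is not the actual tripleo-common or mistral code: the helper names, the `trash_output` parameter as used here, and the log path are assumptions made for illustration. It only shows how much smaller the serialized action result becomes when the ansible output is discarded and just the log path is kept, which is roughly the idea suggested by the titles of the linked gerrit changes.

```python
import json


def build_action_result(log_path, stdout, stderr, trash_output=False):
    """Hypothetical helper: build the result dict that would be persisted.

    When trash_output is True, the captured ansible output is dropped and
    only the path to the on-disk log file is kept.
    """
    if trash_output:
        stdout, stderr = "", ""
    return {"log_path": log_path, "stdout": stdout, "stderr": stderr}


def serialized_size(result):
    """Return the number of characters in the JSON form of the result."""
    return len(json.dumps({"result": result}))


# Stand-in for the very large stderr seen in the error above
# (the log reports about 24.6 million truncated characters).
big_stderr = "x" * 24_673_890

# Hypothetical log path, used only for this example.
full = build_action_result("/tmp/example/ansible.log", "", big_stderr)
trimmed = build_action_result(
    "/tmp/example/ansible.log", "", big_stderr, trash_output=True
)

print("with output:    %d characters" % serialized_size(full))
print("without output: %d characters" % serialized_size(trimmed))
```

With the output kept, the value written to the `output` column is tens of megabytes; with it discarded, it is well under a kilobyte, and the full ansible log remains available at `log_path`.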



Version-Release number of selected component (if applicable):
13

How reproducible:
100% on an overcloud of this size

Steps to Reproduce:
1. Deploy a large overcloud using config-download

Actual results:
The deploy fails during the ssh enablement workflow, after a successful heat stack create/update.

 

Expected results:
SSH enablement should succeed, as should the overcloud deployment.

Additional info:

Comment 1 Sai Sindhur Malleni 2019-08-29 18:10:16 UTC
Created attachment 1609627
ansible-log

Comment 2 James Slagle 2019-08-30 16:41:53 UTC
https://review.opendev.org/#/c/679481/ should also be backported to queens

Comment 3 James Slagle 2019-08-30 17:17:11 UTC
*** Bug 1746953 has been marked as a duplicate of this bug. ***

Comment 4 John Fulton 2019-09-19 18:42:46 UTC
*** Bug 1703618 has been marked as a duplicate of this bug. ***

Comment 5 Lon Hohberger 2019-09-25 10:44:45 UTC
According to our records, this should be resolved by openstack-tripleo-common-8.6.8-16.el7ost.  This build is available now.

Comment 7 Giulio Fidente 2019-12-09 12:41:22 UTC
*** Bug 1780687 has been marked as a duplicate of this bug. ***

Comment 8 John Fulton 2020-01-20 15:25:34 UTC
*** Bug 1786063 has been marked as a duplicate of this bug. ***

