Bug 1747126

Summary: SSH enablement workflow timeout during deploy of a large overcloud using config-download
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: openstack-tripleo-common
Assignee: James Slagle <jslagle>
Status: CLOSED CURRENTRELEASE
QA Contact: Alexander Chuzhoy <sasha>
Severity: high
Docs Contact:
Priority: high
Version: 13.0 (Queens)
CC: aschultz, dbecker, emacchi, gfidente, jhajyahy, johfulto, jslagle, ltamagno, mburns, morazi, pkundal, slinaber
Target Milestone: ---
Keywords: TestOnly, Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-common-8.6.8-15.el7ost
Doc Type: If docs needed, set a value
Last Closed: 2019-10-04 10:47:02 UTC
Type: Bug
Attachments:
ansible-log (flags: none)

Description Sai Sindhur Malleni 2019-08-29 17:48:35 UTC
Description of problem:
While trying to scale up a 103-node overcloud to 207 nodes, on multiple attempts we see the heat stack update finish successfully, but the deploy fails due to a timeout of the SSH enablement workflow.

ssh admin enablement workflow - TIMED OUT.

We saw the same result even after bumping ENABLE_SSH_ADMIN_TIMEOUT and ENABLE_SSH_ADMIN_SSH_PORT_TIMEOUT from 300 to 600.


James Slagle looked at the ansible log file, and it appears ansible itself succeeded, but in the logs we see:

2019-08-29 16:40:59.867 409069 ERROR mistral.db.utils [req-704d2539-505b-4259-b8a3-9f82c1ffe4da 2a6d10bddc274e00b00ad4d4adeffda5 c67ce78faf0643708bc7b067eb7525bd - default default] DB error detected, operation will be retried: <function on_action_complete at 0x7f1332047140>: DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (error(32, 'Broken pipe'))") [SQL: u'UPDATE action_executions_v2 SET updated_at=%(updated_at)s, state=%(state)s, accepted=%(accepted)s, output=%(output)s WHERE action_executions_v2.id = %(action_executions_v2_id)s'] [parameters: {'output': '{"result": {"log_path": "/tmp/ansible-mistral-actionwGCOhN/ansible.log", "stderr": "ansible-playbook 2.6.11\\n  config file = /tmp/ansible-mistral-ac ... (24673890 characters truncated) ... +0000 (0:00:20.791)       0:02:14.533 ******* \\n=============================================================================== \\n", "stdout": ""}}', 'state': 'SUCCESS', 'accepted': 1, 'updated_at': datetime.datetime(2019, 8, 29, 16, 40, 59), 'action_executions_v2_id': u'6712d0f7-0c20-4239-b03d-b4560193bf46'}] (Background on this error at: http://sqlalche.me/e/e3q8)
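The size of the `output` value in the failed UPDATE can be estimated from the truncation marker in the log line above. The following is a minimal sketch, assuming the marker always has the form `(N characters truncated)`; the helper name `truncated_chars` is hypothetical and not part of Mistral or SQLAlchemy.

```python
import re

# Matches the marker left in place of the elided parameter text,
# e.g. "(24673890 characters truncated)".
TRUNCATION_MARKER = re.compile(r"\((\d+) characters truncated\)")


def truncated_chars(log_line):
    """Return the number of characters elided from a log line, or None."""
    match = TRUNCATION_MARKER.search(log_line)
    return int(match.group(1)) if match else None


# Shortened excerpt of the log line from this report.
sample = '"stderr": "ansible-playbook 2.6.11 ... (24673890 characters truncated) ... "'
count = truncated_chars(sample)
# Assumes roughly one byte per character (mostly ASCII ansible output).
print(count, "characters elided, about", round(count / 1024 ** 2, 1), "MiB")
```

For the line in this report, that puts the ansible output stored in the `output` column at more than 23 MiB, before counting the parts that were not truncated.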


James feels this could be related to the stdout generated by the command.
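If the size of the captured output is indeed the trigger (MySQL error 2006 is commonly seen when a single statement exceeds the server's max_allowed_packet, although that was not confirmed here), one possible mitigation is to cap what gets stored in the action result. The following is a minimal sketch of that idea only, not necessarily how the issue was fixed; `cap_output` and the 64 KiB default are hypothetical.

```python
def cap_output(text, max_chars=65536):
    """Keep only the last max_chars characters of text.

    The tail is kept because ansible prints the play recap and timing
    summary at the end of its output.
    """
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    # Slice from an explicit start index so that max_chars=0 yields "".
    tail = text[len(text) - max_chars:]
    return "[... %d characters dropped ...]\n" % dropped + tail


# Example: a large fake log whose last line is the part worth keeping.
big = "x" * 1000000 + "PLAY RECAP"
small = cap_output(big, max_chars=100)
print(len(big), "->", len(small))
```

Capping on the producing side keeps the database row small regardless of how many nodes the playbook touches; the full output remains available in the ansible.log file referenced by `log_path`.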



Version-Release number of selected component (if applicable):
13

How reproducible:
100% on an overcloud of this size

Steps to Reproduce:
1. Deploy a large overcloud using config-download

Actual results:
The deploy fails during the SSH enablement workflow, after a successful heat stack create/update.

 

Expected results:
SSH enablement should succeed, and so should the overcloud deployment.

Additional info:

Comment 1 Sai Sindhur Malleni 2019-08-29 18:10:16 UTC
Created attachment 1609627
ansible-log

Comment 2 James Slagle 2019-08-30 16:41:53 UTC
https://review.opendev.org/#/c/679481/ should also be backported to queens

Comment 3 James Slagle 2019-08-30 17:17:11 UTC
*** Bug 1746953 has been marked as a duplicate of this bug. ***

Comment 4 John Fulton 2019-09-19 18:42:46 UTC
*** Bug 1703618 has been marked as a duplicate of this bug. ***

Comment 5 Lon Hohberger 2019-09-25 10:44:45 UTC
According to our records, this should be resolved by openstack-tripleo-common-8.6.8-16.el7ost.  This build is available now.

Comment 7 Giulio Fidente 2019-12-09 12:41:22 UTC
*** Bug 1780687 has been marked as a duplicate of this bug. ***

Comment 8 John Fulton 2020-01-20 15:25:34 UTC
*** Bug 1786063 has been marked as a duplicate of this bug. ***