Description of problem:
Damien and I saw this one while doing a 13->16.1 FFU, but it really can affect any workflow instantiated by tripleoclient. In our case, while FFUing overcloud-controller-1, for reasons yet unclear/unrelated to us, the connection to the mistral port received a TCP reset and the client failed like this:

2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/dfe7ee67-6cd0-407c-9f61-b355a1cf0b25: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

The full trace is:

2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/dfe7ee67-6cd0-407c-9f61-b355a1cf0b25: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun Traceback (most recent call last):
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     chunked=chunked)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     six.raise_from(e, None)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "<string>", line 3, in raise_from
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     httplib_response = conn.getresponse()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     response.begin()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 307, in begin
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     version, status, reason = self._read_status()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 276, in _read_status
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise RemoteDisconnected("Remote end closed connection without"
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun http.client.RemoteDisconnected: Remote end closed connection without response
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun During handling of the above exception, another exception occurred:
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun Traceback (most recent call last):
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     timeout=timeout
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     _stacktrace=sys.exc_info()[2])
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 368, in increment
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise six.reraise(type(error), error, _stacktrace)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/packages/six.py", line 674, in reraise
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise value.with_traceback(tb)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     chunked=chunked)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     six.raise_from(e, None)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "<string>", line 3, in raise_from
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     httplib_response = conn.getresponse()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     response.begin()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 307, in begin
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     version, status, reason = self._read_status()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 276, in _read_status
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise RemoteDisconnected("Remote end closed connection without"
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

Note that:
A) The mistral workflow kept running just fine in the background.
The issue is only with tripleoclient itself.
B) We observed that the mistral workflow execution completed successfully.

This was seen with compose RHOS-16.1-RHEL-8-20201214.n.3
python3-tripleoclient-12.3.2-1.20200914164930.el8ost.noarch

The ask in this BZ is to make the tripleoclient code a bit more robust in the face of a minor network hiccup.

A quick reproducer for this issue is to run a longer workflow (minor update or FFU of a node) and, once tripleoclient is just monitoring the mistral execution, run:

#!/bin/sh
iptables -I INPUT -p tcp --dport 13989 -j REJECT
sleep 13
iptables -D INPUT 1

(13 seconds because the monitoring interval in tripleoclient is 10 seconds, so the block is guaranteed to cover at least one poll.)

Full sosreport and logs are available at:
http://file.rdu.redhat.com/~mbaldess/ffu-tripleoclient-mistral-reset/

The patch that seems to fix it for us is:

diff --git a/workflows/base.py b/workflows/base.py
index b8afd22..1a82f7c 100644
--- a/workflows/base.py
+++ b/workflows/base.py
@@ -12,6 +12,7 @@
 import json
 import logging
 
+import keystoneauth1
 from tripleoclient import exceptions
 
 LOG = logging.getLogger(__name__)
@@ -93,7 +94,12 @@ def wait_for_messages(mistral, websocket, execution, timeout=None):
             # Workflows should end with SUCCESS or ERROR statuses.
             if payload.get('status', 'RUNNING') != "RUNNING":
                 return
-            execution = mistral.executions.get(execution.id)
+            try:
+                execution = mistral.executions.get(execution.id)
+            except keystoneauth1.exceptions.connection.ConnectFailure as e:
+                LOG.warning("Connection failure while fetching execution ID. Retrying: %s" % e)
+                continue
+
             if execution.state != "RUNNING":
                 # yield the output as the last payload which was missed
                 yield json.loads(execution.output)

With this patch, while running the reproducer we get:

PLAY [Gather facts from overcloud] *********************************************

TASK [Gathering Facts] *********************************************************
Tuesday 22 December 2020 13:27:38 +0000 (0:00:03.337) 0:00:03.438 ******
ok: [controller-1]
2020-12-22 13:27:41.242 3728 WARNING tripleoclient.workflows.base [-] Connection failure while fetching execution ID. Retrying: Unable to establish connection to https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209: HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4b22d9e80>: Failed to establish a new connection: [Errno 111] Connection refused',)): keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209: HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4b22d9e80>: Failed to establish a new connection: [Errno 111] Connection refused',))
ok: [controller-0]

PLAY [Load global variables] ***************************************************

The mistral execution continues correctly and tripleoclient deals with the hiccup without erroring out.
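
For illustration only, here is a minimal, self-contained sketch of the same idea: keep polling through a transient connection failure instead of aborting. It is not the tripleoclient code and not the patch above. TransientConnectFailure is a hypothetical stand-in for keystoneauth1.exceptions.connection.ConnectFailure, and poll_until_done, make_flaky_fetcher and the max_consecutive_failures cap are assumptions made up for this example (the patch itself retries without a cap).

```python
import time


class TransientConnectFailure(Exception):
    """Hypothetical stand-in for keystoneauth1's ConnectFailure."""


def poll_until_done(fetch_state, interval=10.0, max_consecutive_failures=6,
                    sleep=time.sleep):
    """Poll fetch_state() until it returns something other than "RUNNING".

    A connection failure is logged-and-retried rather than fatal, but only
    up to max_consecutive_failures times in a row (an assumed safety cap).
    """
    failures = 0
    while True:
        try:
            state = fetch_state()
        except TransientConnectFailure:
            failures += 1
            if failures > max_consecutive_failures:
                raise  # the outage no longer looks transient
        else:
            failures = 0  # a successful poll resets the counter
            if state != "RUNNING":
                return state
        sleep(interval)


def make_flaky_fetcher(responses):
    """Build a fake fetch_state() that replays states or raises exceptions."""
    it = iter(responses)

    def fetch():
        item = next(it)
        if isinstance(item, Exception):
            raise item
        return item

    return fetch


if __name__ == "__main__":
    # Simulate the reproducer: two failed polls in the middle of a run.
    fetch = make_flaky_fetcher([
        "RUNNING",
        TransientConnectFailure("connection reset"),
        TransientConnectFailure("connection refused"),
        "RUNNING",
        "SUCCESS",
    ])
    print(poll_until_done(fetch, interval=0, sleep=lambda _s: None))
```

The cap is a design choice for the sketch: it lets a short hiccup like the 13 second iptables block pass unnoticed while still surfacing an outage that persists across many polls.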
Moving to ON_QA as the patch is in python3-tripleoclient-12.4.1-2.20210316010910.7536d5b.el8ost which is in RHOS-16.2-RHEL-8-20210426.n.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483