Bug 1910051 - [OSP16] tripleoclient is not sufficiently resilient when monitoring mistral workflows
Summary: [OSP16] tripleoclient is not sufficiently resilient when monitoring mistral wo...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-tripleoclient
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Michele Baldessari
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-22 13:32 UTC by Michele Baldessari
Modified: 2021-09-15 07:11 UTC (History)

Fixed In Version: python-tripleoclient-12.4.1-2.20210316010910.7536d5b.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-15 07:11:01 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1909019 0 None None None 2020-12-22 15:24:31 UTC
OpenStack gerrit 771923 0 None MERGED Make workflow monitoring more resilient 2021-01-25 16:01:44 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:11:21 UTC

Description Michele Baldessari 2020-12-22 13:32:47 UTC
Description of problem:
Damien and I saw this one while doing a 13->16.1 FFU, but it can affect any workflow instantiated by tripleoclient.

In our case, while FFUing overcloud-controller-1, the connection to the mistral port received a TCP reset (for reasons still unclear and unrelated to us), and the client failed like this:
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/dfe7ee67-6cd0-407c-9f61-b355a1cf0b25: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))


The full trace is:
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun [-] Exception occured while running the command: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/dfe7ee67-6cd0-407c-9f61-b355a1cf0b25: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun Traceback (most recent call last):
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     chunked=chunked)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     six.raise_from(e, None)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "<string>", line 3, in raise_from
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     httplib_response = conn.getresponse()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     response.begin()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 307, in begin
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     version, status, reason = self._read_status()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 276, in _read_status
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise RemoteDisconnected("Remote end closed connection without"
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun http.client.RemoteDisconnected: Remote end closed connection without response
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun During handling of the above exception, another exception occurred:
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun Traceback (most recent call last):
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     timeout=timeout
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     _stacktrace=sys.exc_info()[2])
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 368, in increment
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise six.reraise(type(error), error, _stacktrace)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/packages/six.py", line 674, in reraise
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise value.with_traceback(tb)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     chunked=chunked)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     six.raise_from(e, None)
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "<string>", line 3, in raise_from
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     httplib_response = conn.getresponse()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     response.begin()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 307, in begin
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     version, status, reason = self._read_status()
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun   File "/usr/lib64/python3.6/http/client.py", line 276, in _read_status
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun     raise RemoteDisconnected("Remote end closed connection without"
2020-12-21 17:16:25 | 2020-12-21 17:16:25.015 297753 ERROR tripleoclient.v1.overcloud_upgrade.UpgradeRun urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

Note that:
A) The mistral workflow kept running just fine in the background. The issue is only with tripleoclient itself.
B) We observed that the mistral workflow execution completed successfully


This was seen with compose RHOS-16.1-RHEL-8-20201214.n.3
python3-tripleoclient-12.3.2-1.20200914164930.el8ost.noarch


The ask in this BZ is to make tripleoclient code a bit more robust in the face of a minor network hiccup.


A quick reproducer for this issue is to run a longer workflow (a minor update or an FFU of a node) and, once tripleoclient is only monitoring the mistral execution, run:
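#!/bin/sh
# Reject traffic to the mistral port for longer than one monitoring interval.
iptables -I INPUT -p tcp --dport 13989 -j REJECT
sleep 13
# Remove the same rule by specification rather than by position.
iptables -D INPUT -p tcp --dport 13989 -j REJECT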


(The sleep is 13 seconds because the monitoring interval in tripleoclient is 10 seconds, so at least one poll hits the blocked port.)
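
The timing argument can be checked with a small sketch. It assumes a simplified model in which the client polls at a fixed 10-second interval; the function name polls_in_window and the fixed-interval model are illustrative and not taken from tripleoclient. Under that assumption, any blocking window longer than the interval contains at least one poll, whatever the offset between the two.

```python
import math


def polls_in_window(interval, window_start, window_length, first_poll=0.0):
    """Count polls that land inside a blocking window.

    Polls are assumed to happen at first_poll + k * interval for k >= 0.
    The window covers [window_start, window_start + window_length).
    """
    window_end = window_start + window_length
    # Index of the first poll at or after the start of the window.
    k_first = max(0, math.ceil((window_start - first_poll) / interval))
    # Index of the first poll at or after the end of the window.
    k_end = max(0, math.ceil((window_end - first_poll) / interval))
    return k_end - k_first


if __name__ == "__main__":
    # A 13-second block always overlaps at least one 10-second poll.
    for start in range(10):
        print(start, polls_in_window(10, start, 13))
```

A window shorter than the interval (for example 5 seconds) can fall entirely between two polls, which is why the reproducer sleeps for longer than 10 seconds.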

Full sosreport and logs are available at:
http://file.rdu.redhat.com/~mbaldess/ffu-tripleoclient-mistral-reset/

The patch that seems to fix it for us is:
diff --git a/workflows/base.py b/workflows/base.py
index b8afd22..1a82f7c 100644
--- a/workflows/base.py
+++ b/workflows/base.py
@@ -12,6 +12,7 @@
 import json
 import logging

+import keystoneauth1
 from tripleoclient import exceptions

 LOG = logging.getLogger(__name__)
@@ -93,7 +94,12 @@ def wait_for_messages(mistral, websocket, execution, timeout=None):
             # Workflows should end with SUCCESS or ERROR statuses.
             if payload.get('status', 'RUNNING') != "RUNNING":
                 return
-            execution = mistral.executions.get(execution.id)
+            try:
+                execution = mistral.executions.get(execution.id)
+            except keystoneauth1.exceptions.connection.ConnectFailure as e:
+                LOG.warning("Connection failure while fetching execution ID. Retrying: %s" % e)
+                continue
+
             if execution.state != "RUNNING":
                 # yield the output as the last payload which was missed
                 yield json.loads(execution.output)


With this patch, while running the reproducer we get:
PLAY [Gather facts from overcloud] *********************************************

TASK [Gathering Facts] *********************************************************
Tuesday 22 December 2020  13:27:38 +0000 (0:00:03.337)       0:00:03.438 ******
ok: [controller-1]

2020-12-22 13:27:41.242 3728 WARNING tripleoclient.workflows.base [-] Connection failure while fetching execution ID. Retrying: Unable to establish connection to https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209: HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4b22d9e80>: Failed to establish a new connection: [Errno 111] Connection refused',)): keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://192.168.24.2:13989/v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209: HTTPSConnectionPool(host='192.168.24.2', port=13989): Max retries exceeded with url: /v2/executions/bb693f44-180f-4b16-a215-d379e0fd9209 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4b22d9e80>: Failed to establish a new connection: [Errno 111] Connection refused',))
ok: [controller-0]

PLAY [Load global variables] ***************************************************

The mistral execution continues correctly and tripleoclient deals with the hiccup without erroring out.
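
The retry-on-connection-failure idea can also be sketched in isolation. This is a minimal illustration and not the tripleoclient code: ConnectFailure here is a local stand-in for keystoneauth1.exceptions.connection.ConnectFailure, and poll_until_done, get_state and max_failures are hypothetical names. Unlike the patch above, which retries indefinitely, this variant gives up after a number of consecutive failures.

```python
import logging
import time

LOG = logging.getLogger(__name__)


class ConnectFailure(Exception):
    """Local stand-in for keystoneauth1.exceptions.connection.ConnectFailure."""


def poll_until_done(get_state, interval=10, max_failures=5, sleep=time.sleep):
    """Poll get_state() until it returns something other than "RUNNING".

    A ConnectFailure is logged and retried. After max_failures consecutive
    failures the last error is re-raised (an addition not in the patch).
    """
    failures = 0
    while True:
        try:
            state = get_state()
        except ConnectFailure as exc:
            failures += 1
            if failures >= max_failures:
                raise
            LOG.warning("Connection failure while polling (%d/%d). Retrying: %s",
                        failures, max_failures, exc)
            sleep(interval)
            continue
        failures = 0  # a successful poll resets the counter
        if state != "RUNNING":
            return state
        sleep(interval)


if __name__ == "__main__":
    # Simulated sequence: one network hiccup in the middle of a run.
    replies = iter(["RUNNING", ConnectFailure("reset"), "RUNNING", "SUCCESS"])

    def fake_get_state():
        reply = next(replies)
        if isinstance(reply, Exception):
            raise reply
        return reply

    print(poll_until_done(fake_get_state, sleep=lambda seconds: None))
```

Bounding the retries is a design choice: a short hiccup is absorbed, while a mistral endpoint that stays unreachable still surfaces as an error instead of leaving the client waiting forever.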

Comment 1 Michele Baldessari 2021-05-06 07:52:38 UTC
Moving to ON_QA as the patch is in python3-tripleoclient-12.4.1-2.20210316010910.7536d5b.el8ost which is in RHOS-16.2-RHEL-8-20210426.n.0

Comment 5 errata-xmlrpc 2021-09-15 07:11:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

