Bug 2121634

Summary: Designate DNS - CI fails on: "Overcloud deploy stage" when the Job is configured to be "TLS everywhere"
Product: Red Hat OpenStack Reporter: Arkady Shtempler <ashtempl>
Component: openstack-designate Assignee: Michael Johnson <michjohn>
Status: CLOSED ERRATA QA Contact: Toni Freger <tfreger>
Severity: high Docs Contact:
Priority: high    
Version: 17.0 (Wallaby) CC: astillma, beagles, erpeters, gregraka, jamsmith, lavraham, michjohn, mjohnson, mjohnson, scohen
Target Milestone: z1 Keywords: Triaged
Target Release: 17.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-designate-12.1.0-0.20220923170221.f255747.el9ost Doc Type: Bug Fix
Doc Text:
Before this update, the Red Hat OpenStack Platform (RHOSP) DNS service (designate) was unable to start its central process when TLS-everywhere was enabled. This was caused by an inability to connect to Redis over TLS. With this update in RHOSP 17.0.1, this issue has been resolved.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-25 12:28:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Arkady Shtempler 2022-08-26 05:33:50 UTC
I tried to run the existing Designate CI job with the changes needed to support TLS (--tls-everywhere, etc.).
Unfortunately, the job failed at the Overcloud deployment stage with [1]:
 
2022-08-25 10:53:23.152691 |                                      |    WARNING | ERROR: Can't run container designate_pool_manage
2022-08-25 10:53:23.154644 | 52540037-eae6-8809-5dbf-00000000dd03 |      FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_5 | controller-0 | error={"changed": false, "msg": "Failed containers: designate_pool_manage"}
2022-08-25 10:53:23.155762 | 52540037-eae6-8809-5dbf-00000000dd03 |     TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_5 | controller-0 | 0:42:25.602759 | 64.06s
In OSP 17 we only support the default pool, but perhaps "designate pool management" (the failed container appears to belong to that area) needs some attention with regard to TLS?

There are a large number of errors in the Designate logs: [2]

[1] - http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-openstack-designate-17.0_director-rhel-virthost-3cont_2comp-ipv4-geneve/154/undercloud-0/home/stack/overcloud_install.log.gz
[2] - https://paste.openstack.org/show/bLEkAt6wGW5Unb7vAjwn/

Comment 3 Brent Eagles 2022-08-26 14:56:19 UTC
This could be a bug in the tooz redis driver. The redis python module can definitely create the connection:

>>> import redis
>>> r = redis.from_url('rediss://:1NKFZEUgvAECd2vQSq0Jxrb7u@172.17.1.151:6379')
>>> r.ping()
True

>>> r = redis.StrictRedis(password='1NKFZEUgvAECd2vQSq0Jxrb7u', port=6379, host='172.17.1.151', ssl=True)
>>> r.ping()
True

I'll see if I can sort out exactly what's going on.

Comment 4 Brent Eagles 2022-08-26 15:20:43 UTC
It looks like a bug in how the tooz redis driver handles the ssl=true option that is passed in as a query parameter. For fun, I took the relevant bits out of the tooz code and did a little test:

>>> import urllib.parse
>>> from oslo_utils import netutils
>>> from oslo_utils import strutils
>>> 
>>> parsed_url = netutils.urlsplit('redis://:1NKFZEUgvAECd2vQSq0Jxrb7u@172.17.1.151:6379/?ssl=true')
>>> print(parsed_url)
_ModifiedSplitResult(scheme='redis', netloc=':1NKFZEUgvAECd2vQSq0Jxrb7u@172.17.1.151:6379', path='/', query='ssl=true', fragment='')
>>> parsed_qs = urllib.parse.parse_qs(parsed_url.query)
>>> print(parsed_qs)
{'ssl': ['true']}
>>> 
>>> # As it appears in the tooz redis driver
... print(strutils.bool_from_string(parsed_qs['ssl']))
False
>>> 
>>> # Probably the correct code
... print(strutils.bool_from_string(parsed_qs['ssl'][0]))
True

Comment 5 Brent Eagles 2022-08-29 14:04:03 UTC
It occurred to me that I should check that urllib is behaving as expected. According to the docs it is:

from https://docs.python.org/3/library/urllib.parse.html

" urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')

Parse a query string given as a string argument (data of type application/x-www-form-urlencoded). Data are returned as a dictionary. The dictionary keys are the unique query variable names and the values are lists of values for each name."

AFAICT the tooz redis driver has had this problem for a while.
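
To make the documented behaviour concrete, here is a minimal standard-library sketch of the list-versus-string pitfall. The URL, host, and password are placeholders, and is_true() is a simplified hypothetical stand-in for a string-to-bool helper; it is not the oslo.utils implementation.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder URL for illustration only; not taken from the failing deployment.
url = "redis://:secret@192.0.2.10:6379/?ssl=true"

# parse_qs maps every option name to a LIST of values.
query = parse_qs(urlsplit(url).query)


def is_true(value):
    """Hypothetical, simplified string-to-bool check (assumption, not oslo.utils)."""
    return str(value).strip().lower() in {"1", "t", "true", "on", "y", "yes"}


# Passing the whole list stringifies it to "['true']", which matches nothing.
whole_list = is_true(query["ssl"])

# Passing the first element compares the actual string "true".
first_item = is_true(query["ssl"][0])

print(query, whole_list, first_item)
```

Any code that consumes parse_qs output directly has to pick one element from each list before interpreting the value.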

Comment 6 Brent Eagles 2022-08-29 14:32:43 UTC
It seems the other tooz drivers use a utility function to handle this kind of thing. I've submitted an upstream bug, https://bugs.launchpad.net/python-tooz/+bug/1988059, and a patch, https://review.opendev.org/c/openstack/tooz/+/855039.

Comment 7 Brent Eagles 2022-08-29 16:31:59 UTC
This was a false lead. The tooz redis driver uses a helper function in a base class to properly pre-process the arguments so the bug isn't there.
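
As a rough illustration of that kind of pre-processing, the sketch below collapses each list returned by parse_qs to a single value before any option is interpreted. collapse_options() is a hypothetical name chosen for this example, and keeping the last value is an assumption; this is not the tooz helper itself.

```python
from urllib.parse import parse_qs


def collapse_options(parsed):
    """Hypothetical helper: keep only the last value supplied for each option."""
    return {key: values[-1] for key, values in parsed.items()}


# parse_qs yields {'ssl': ['true'], 'timeout': ['30', '5']} for this query string.
options = collapse_options(parse_qs("ssl=true&timeout=30&timeout=5"))

print(options)
```

Once the options are collapsed this way, a later string-to-bool conversion receives a plain string rather than a list.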

Comment 8 Michael Johnson 2022-09-07 16:34:48 UTC
This appears to be a known issue in eventlet (linked on the BZ). Eventlet raises the wrong exception type for a socket timeout, and redis relies on the correct one being raised.

There is a "monkey-patch the monkey-patch" workaround that resolved the issue in designate central.
I will propose a patch upstream in Designate. I think fixing eventlet will take too long, as OpenStack is pinned to an older version of eventlet because of other issues.
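
The general shape of the failure and of the workaround can be sketched with standard-library code only. Everything below is a hypothetical stand-in: green_layer, WrongTimeout, green_recv(), and client_can_read() are invented for illustration and are not eventlet, redis, or Designate code.

```python
import socket
import types


class WrongTimeout(Exception):
    """Hypothetical exception type raised by the patched layer on timeout."""


# Hypothetical stand-in for a monkey-patched socket layer with a
# module-level attribute that selects which exception signals a timeout.
green_layer = types.SimpleNamespace(timeout_exc=WrongTimeout)


def green_recv():
    """Simulate a read that always times out."""
    raise green_layer.timeout_exc("timed out")


def client_can_read():
    """Simulate a client that treats socket.timeout as 'nothing to read yet'."""
    try:
        green_recv()
    except socket.timeout:
        return False
    return True


def poll_outcome():
    """Report whether the client handled the timeout or let it escape."""
    try:
        return "data" if client_can_read() else "no data yet"
    except Exception as exc:
        return f"crashed: {type(exc).__name__}"


before = poll_outcome()

# "Monkey-patch the monkey-patch": rebind the layer's timeout exception
# to the type the client actually catches.
green_layer.timeout_exc = socket.timeout

after = poll_outcome()

print(before, "->", after)
```

The client is untouched in this sketch; only the exception type raised by the patched layer changes, which is why such a workaround can live in the consuming service instead of waiting for a fix in the patched library.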

Comment 13 mjohnson 2022-11-10 00:56:12 UTC
@michjohn this was meant for you :)

Comment 21 Lilach Avraham 2022-12-19 14:59:04 UTC
We ran the Designate CI job with the changes needed to support TLS [1], and it passed the "Overcloud" stage.
The job failed at a different stage, and we opened a new BZ for that [2].

I am moving this BZ status to VERIFIED.

[1] - https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/openstack-designate/job/DFG-network-openstack-designate-17.0_director-rhel-virthost-3cont_2comp_1ipa-ipv4-geneve-tls/
[2] - https://bugzilla.redhat.com/show_bug.cgi?id=2154887

Comment 22 Gregory Thiemonge 2023-01-03 15:45:09 UTC
*** Bug 2154887 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2023-01-25 12:28:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.0.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0271