Bug 2097372

Summary: In TLS-e setup RadosGW is not configured with SSL certificate
Product: Red Hat OpenStack
Reporter: Pavel Sedlák <psedlak>
Component: tripleo-ansible
Assignee: Francesco Pantano <fpantano>
Status: CLOSED ERRATA
QA Contact: Alfredo <alfrgarc>
Severity: high
Priority: high
Version: 17.0 (Wallaby)
CC: cephqe-warriors, fpantano, gfidente, jdurgin, jniu, johfulto, jparoly, jschluet, lhh, mhicks, mkrcmari, nlevinki, oblaut, sostapov, vereddy
Target Milestone: ga
Keywords: AutomationBlocker, Regression, Triaged
Target Release: 17.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-14.3.1-0.20220706080800.feca772.el9ost, tripleo-ansible-3.3.1-0.20220706140824.fa5422f.el9ost
Doc Type: No Doc Update
Last Closed: 2022-09-21 12:22:30 UTC
Type: Bug

Description Pavel Sedlák 2022-06-15 14:24:46 UTC
In a TLS-everywhere setup, object storage tests are failing because haproxy fails to connect to Ceph RadosGW.

The following can be seen in controller-2/var/log/containers/haproxy/haproxy.log:
> Jun 14 18:28:11 controller-1 haproxy[7]: Server ceph_rgw/controller-1.storage.redhat.local is DOWN, reason: Layer6 invalid response, info: "SSL handshake failure", check duration: 2ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Jun 14 18:28:11 controller-1 haproxy[7]: Server ceph_rgw/controller-0.storage.redhat.local is DOWN, reason: Layer6 invalid response, info: "SSL handshake failure", check duration: 2ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Jun 14 18:28:11 controller-1 haproxy[7]: Server ceph_rgw/controller-2.storage.redhat.local is DOWN, reason: Layer6 invalid response, info: "SSL handshake failure", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Jun 14 18:28:11 controller-1 haproxy[7]: proxy ceph_rgw has no server available!
> Jun 14 18:50:54 controller-1 haproxy[7]: 10.0.0.99:58596 [14/Jun/2022:18:50:54.455] ceph_rgw~ ceph_rgw/<NOSRV> 0/-1/-1/-1/0 503 217 - - SC-- 1/1/0/0/0 0/0 "GET /info HTTP/1.1"

That is all there is about ceph_rgw (only more NOSRV entries); the backend never gets marked as UP.
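
For reference, the Layer6 failure that haproxy's health check reports can presumably be reproduced by hand against one of the backends (a quick sketch; the IP/port are taken from the haproxy.cfg excerpt further below):

> [root@controller-1 heat-admin]# openssl s_client -connect 172.17.3.99:8080 </dev/null
> # expected to fail the TLS handshake (e.g. "wrong version number"), since the backend answers plain http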

Due to that, any attempt to use object storage fails, e.g. in this tempest case:
> testtools.testresult.real._StringException: pythonlogging:'': {{{
> 2022-06-14 19:08:38,830 193543 INFO     [tempest.lib.common.rest_client] Request (TestObjectStorageBasicOps:test_swift_basic_ops): 503 GET https://overcloud.redhat.local:13808/swift/v1/AUTH_d29b1bd5121f463e8da63823a5b39615 0.136s
> 2022-06-14 19:08:38,830 193543 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>'}
>         Body: None
>     Response - Headers: {'content-length': '107', 'cache-control': 'no-cache', 'content-type': 'text/html', 'connection': 'close', 'status': '503', 'content-location': 'https://overcloud.redhat.local:13808/swift/v1/AUTH_d29b1bd5121f463e8da63823a5b39615'}
>         Body: b'<html><body><h1>503 Service Unavailable</h1>\nNo server is available to handle this request.\n</body></html>\n'
> }}}
>
> Traceback (most recent call last):
>   File "/usr/lib/python3.9/site-packages/tempest/common/utils/__init__.py", line 70, in wrapper
>     return f(*func_args, **func_kwargs)
>   File "/usr/lib/python3.9/site-packages/tempest/scenario/test_object_storage_basic_ops.py", line 37, in test_swift_basic_ops
>     self.get_swift_stat()
>   File "/usr/lib/python3.9/site-packages/tempest/scenario/manager.py", line 1628, in get_swift_stat
>     self.account_client.list_account_containers()
>   File "/usr/lib/python3.9/site-packages/tempest/lib/services/object_storage/account_client.py", line 70, in list_account_containers
>     resp, body = self.get(url, headers={})
>   File "/usr/lib/python3.9/site-packages/tempest/lib/common/rest_client.py", line 314, in get
>     return self.request('GET', url, extra_headers, headers)
>   File "/usr/lib/python3.9/site-packages/tempest/lib/common/rest_client.py", line 703, in request
>     self._error_checker(resp, resp_body)
>   File "/usr/lib/python3.9/site-packages/tempest/lib/common/rest_client.py", line 883, in _error_checker
>     raise exceptions.UnexpectedResponseCode(str(resp.status),
> tempest.lib.exceptions.UnexpectedResponseCode: Unexpected response code received
> Details: 503
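
The failure is not tempest-specific; a plain client call against the swift endpoint should hit the same 503 (a minimal sketch, assuming the overcloud credentials are sourced on the undercloud):

> (overcloud) [stack@undercloud-0 ~]$ openstack container list
> # expected to fail with a 503 from https://overcloud.redhat.local:13808 for as long as the ceph_rgw backend stays DOWN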


I'm not exactly sure what the root cause behind it is, but it seems that haproxy is configured to connect to radosgw via https/ssl while rgw is listening on plain http:

1) haproxy rgw backends configured
> [root@controller-1 heat-admin]# grep -E 'listen |^ *server ' /var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg | head -n 4
> listen ceph_rgw
> server controller-0.storage.redhat.local 172.17.3.23:8080 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost controller-0.storage.redhat.local
> server controller-1.storage.redhat.local 172.17.3.99:8080 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost controller-1.storage.redhat.local
> server controller-2.storage.redhat.local 172.17.3.135:8080 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost controller-2.storage.redhat.local
2) trying it over https fails
> [root@controller-1 heat-admin]# curl 'https://172.17.3.99:8080/'
> curl: (35) error:0A00010B:SSL routines::wrong version number
3) trying it over plain http works
> [root@controller-1 heat-admin]# curl http://172.17.3.99:8080/
> <?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>
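
To confirm what the rgw daemons are actually binding, the frontend configuration can also be checked from the ceph side (a sketch, assuming cephadm shell access on a controller):

> [root@controller-1 heat-admin]# cephadm shell -- ceph config dump | grep -i rgw_frontends
> # if the value is only "beast port=8080" (no ssl_port/ssl_certificate options), the daemons serve plain http,
> # which would match the curl results above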

According to the ceph orch service spec, the certificate is expected to be used:
> service_type: rgw
> service_id: rgw
> service_name: rgw.rgw
> placement:
>   hosts:
>   - controller-0
>   - controller-1
>   - controller-2
> networks:
> - 172.17.3.0/24
> spec:
>   rgw_frontend_port: 8080
>   rgw_frontend_ssl_certificate: '-----BEGIN CERTIFICATE-----
>     MIIFdTCCA92gAwIBAgIBDTANBgkqhkiG9w0BAQsFADA3MRUwEwYDVQQKDAxSRURI
>     ...
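
Note that the exported spec above carries rgw_frontend_ssl_certificate but does not show ssl enabled; as far as I know cephadm only applies the certificate to the rgw frontend when the spec's ssl flag is set, so my suspicion is the generated spec would need to look roughly like this (a sketch, not the verified fix):

> spec:
>   rgw_frontend_port: 8080
>   ssl: true
>   rgw_frontend_ssl_certificate: '-----BEGIN CERTIFICATE-----
>     ...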

Comment 6 Pavel Sedlák 2022-06-16 10:06:50 UTC
Forgot to mention which versions are involved; here is a quick list:

rpms on overcloud (controller|ceph):
> cephadm-16.2.7-121.el9cp.noarch
> puppet-haproxy-4.2.2-0.20210812210050.a797b8c.el9ost.noarch
> certmonger-0.79.14-5.el9.x86_64
> puppet-certmonger-2.7.1-0.20210812224230.3e2e660.el9ost.noarch

containers on overcloud:
> # from podman ps
> rh-osbs/rhceph:5-170
> rh-osbs/rhceph@sha256:90e4316d65f4a76fea307705d9b0e4706f05e10a63bf041dbee379c8711db115
> # podman images
> undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph                              5-170            9ea8ac4eae90  2 months ago  1.05 GB

rpms on undercloud:
> certmonger-0.79.14-5.el9.x86_64
> openstack-tripleo-heat-templates-14.3.1-0.20220607161058.ced328c.el9ost.noarch
> tripleo-ansible-3.3.1-0.20220607162207.ae139c3.el9ost.noarch
> ansible-core-2.12.2-1.el9.x86_64
> ansible-collection-ansible-posix-1.2.0-1.3.el9ost.noarch
> ansible-collection-community-general-4.0.0-1.1.el9ost.noarch
> ansible-collection-containers-podman-1.9.3-1.el9ost.noarch
> ansible-role-container-registry-1.4.1-0.20220506220849.57da845.el9ost.noarch
> ansible-role-redhat-subscription-1.2.1-0.20220529221557.ef52a27.el9ost.noarch
> ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
> ansible-collection-ansible-utils-2.3.0-2.el9ost.noarch
> ansible-collection-ansible-netcommon-2.2.0-1.2.el9ost.noarch
> ansible-config_template-1.2.2-0.20220427223824.78e7f22.el9ost.noarch
> ansible-role-atos-hsm-1.0.1-0.20210908111811.ccd3896.el9ost.noarch
> ansible-role-chrony-1.2.1-0.20220607160358.7ccf873.el9ost.noarch
> ansible-role-collectd-config-0.0.2-0.20220204170819.1992666.el9ost.noarch
> ansible-role-lunasa-hsm-1.1.1-0.20210908110336.6ebc8f4.el9ost.noarch
> ansible-role-qdr-config-0.0.1-0.20210908110336.b456651.el9ost.noarch
> ansible-role-thales-hsm-1.0.1-0.20210908120803.e0f4569.el9ost.noarch
> ansible-freeipa-1.6.3-1.el9.noarch
> ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
> ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
> ansible-collections-openstack-1.8.0-0.20220513060934.5bb8312.el9ost.noarch
> ansible-pacemaker-1.0.4-0.20210910010919.666f706.el9ost.noarch
> ansible-role-openstack-operations-0.0.1-0.20210915011315.2ab288f.el9ost.noarch
> ansible-role-metalsmith-deployment-1.4.3-0.20220223021106.324b758.el9ost.noarch

Comment 20 errata-xmlrpc 2022-09-21 12:22:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543