Bug 2099855

Summary: [OSP17][TLS-E] haproxy check fails for ceph-grafana service
Product: Red Hat OpenStack
Component: puppet-tripleo
Version: 17.0 (Wallaby)
Target Milestone: ga
Target Release: 17.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Triaged
Reporter: Marian Krcmarik <mkrcmari>
Assignee: Francesco Pantano <fpantano>
QA Contact: Alfredo <alfrgarc>
CC: adking, epuertat, fpantano, jjoyce, jschluet, mburns, ramishra, slinaber, tvignaud
Hardware: Unspecified
OS: Unspecified
Fixed In Version: puppet-tripleo-14.2.3-0.20220705151704.bc62cd8.el9ost
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-09-21 12:23:00 UTC

Description Marian Krcmarik 2022-06-21 20:31:57 UTC
Description of problem:
If OSP is deployed with ceph-dashboard, multiple ceph-dashboard services are deployed and placed behind haproxy; one of those services is grafana. The following haproxy configuration is generated for grafana on OSP17:
listen ceph_grafana
  bind 192.168.24.71:3100 transparent ssl crt /etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.pem
  mode http
  balance source
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk HEAD /
  option httplog
  option forwardfor
  server central-controller-0.storage.redhat.local 172.23.1.55:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-0.storage.redhat.local
  server central-controller-1.storage.redhat.local 172.23.1.124:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-1.storage.redhat.local
  server central-controller-2.storage.redhat.local 172.23.1.243:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-2.storage.redhat.local

The haproxy configuration for the grafana service appears correct, and haproxy runs backend health checks regularly.
The problem is that the check fails: the grafana service logs the following every 2 seconds:
2022/06/21 12:36:00 http: TLS handshake error from 172.23.1.243:56364: remote error: tls: internal error
2022/06/21 12:36:00 http: TLS handshake error from 172.23.1.55:52296: remote error: tls: internal error
2022/06/21 12:36:01 http: TLS handshake error from 172.23.1.124:52898: remote error: tls: internal error

I think the reason is that the grafana server containers on all the controller nodes (in my case grafana is deployed on the controllers) have the same SSL certificate and key deployed in /etc/grafana/certs/cert_file|key; in my case it is the SSL certificate generated for the grafana service on controller-0. So the haproxy check succeeds against grafana on controller-0 but fails against the other grafana backends, because those containers also carry the certificate generated for controller-0 in /etc/grafana/certs/cert_file|key.

The container file /etc/grafana/certs/cert_file is bind-mounted from /var/lib/ceph/d5c621ae-ec54-5b9d-910d-b8dba8e6b5ba/grafana.central-controller-*/etc/grafana/certs/cert_file on the hosts, and it is the same file on all the hosts, whereas the certificates in /etc/pki/tls/certs/ceph_grafana.crt are different and correctly generated for each host.
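A quick way to confirm the mismatch on each controller is to compare fingerprints of the host's own certificate and the file the grafana container actually serves. A minimal per-host sketch (the helper names are mine, not part of TripleO; paths are the ones from this report, and the FSID directory differs per deployment):

```python
import hashlib


def cert_fingerprint(path: str) -> str:
    """Return the SHA-256 fingerprint of a certificate file's raw bytes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def certs_match(host_cert: str, container_cert: str) -> bool:
    """True when the bind-mounted container cert matches the per-host cert."""
    return cert_fingerprint(host_cert) == cert_fingerprint(container_cert)


if __name__ == "__main__":
    # Paths taken from this report; adjust the FSID and hostname per deployment.
    host_cert = "/etc/pki/tls/certs/ceph_grafana.crt"
    container_cert = ("/var/lib/ceph/d5c621ae-ec54-5b9d-910d-b8dba8e6b5ba/"
                      "grafana.central-controller-0/etc/grafana/certs/cert_file")
    print("match" if certs_match(host_cert, container_cert)
          else "MISMATCH (bug present)")
```

On an affected node other than controller-0, this should report a mismatch, matching the TLS handshake failures above.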

If I copy /etc/pki/tls/certs/ceph_grafana.crt to /var/lib/ceph/d5c621ae-ec54-5b9d-910d-b8dba8e6b5ba/grafana.central-controller-*/etc/grafana/certs/cert_file on each host and restart the grafana containers on all hosts, the haproxy checks start to succeed.
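The manual workaround can be sketched as a small per-host script. This is an illustration under assumptions, not a supported fix: the function names are mine, and the grafana systemd unit name and cephadm FSID directory vary per deployment:

```python
import shutil
import subprocess

# Host-side certificate generated correctly per node (path from this report).
HOST_CERT = "/etc/pki/tls/certs/ceph_grafana.crt"


def deploy_host_cert(container_cert_path: str,
                     host_cert_path: str = HOST_CERT) -> None:
    """Overwrite the bind-mounted grafana cert with this host's own certificate."""
    shutil.copy2(host_cert_path, container_cert_path)


def restart_grafana(unit: str) -> None:
    """Restart the grafana container via its systemd unit (name is deployment-specific)."""
    subprocess.run(["systemctl", "restart", unit], check=True)
```

Run on each controller with that node's bind-mount path, then restart grafana so it re-reads the certificate.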

I am not sure about the right component, so I am initially assigning this to THT.

Version-Release number of selected component (if applicable):
puppet-tripleo-14.2.3-0.20220607163018.bc63c9e.el9ost.noarch
ansible-tripleo-ipsec-11.0.1-0.20210910011424.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.2.3-0.20220301190449.6b0ed82.el9ost.noarch
ansible-role-tripleo-modify-image-1.3.1-0.20220216001439.30d23d5.el9ost.noarch
python3-tripleo-common-15.4.1-0.20220608140403.caa0c1f.el9ost.noarch
tripleo-ansible-3.3.1-0.20220607162207.ae139c3.el9ost.noarch
openstack-tripleo-validations-14.2.2-0.20220514020831.d2a1172.el9ost.noarch
openstack-tripleo-common-containers-15.4.1-0.20220608140403.caa0c1f.el9ost.noarch
openstack-tripleo-common-15.4.1-0.20220608140403.caa0c1f.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-0.20220607161058.ced328c.el9ost.noarch
python3-tripleoclient-16.4.1-0.20220607160517.4d2a5db.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy OSP17 with ceph-dashboard.
2. Check the grafana server log and haproxy status
Actual results:
2022/06/21 12:35:58 http: TLS handshake error from 172.23.1.243:56362: remote error: tls: internal error
2022/06/21 12:35:58 http: TLS handshake error from 172.23.1.55:52294: remote error: tls: internal error
2022/06/21 12:35:59 http: TLS handshake error from 172.23.1.124:52896: remote error: tls: internal error
2022/06/21 12:36:00 http: TLS handshake error from 172.23.1.243:56364: remote error: tls: internal error
2022/06/21 12:36:00 http: TLS handshake error from 172.23.1.55:52296: remote error: tls: internal error
2022/06/21 12:36:01 http: TLS handshake error from 172.23.1.124:52898: remote error: tls: internal error

and haproxy reports two of the three grafana backends as DOWN.

Additional info:
Feel free to request any logs

Comment 14 errata-xmlrpc 2022-09-21 12:23:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

Comment 15 Red Hat Bugzilla 2023-09-18 04:39:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.