Description of problem:

We had an outage in which no user could create a new Octavia load balancer. Existing load balancers continued to function correctly.

The root cause relates to the design of Octavia load balancer resource updates. The Octavia worker (living on controllers, networkers) sends calls to the Octavia Amphora API (living on each hosted Amphora) over a connection that uses mutual (two-way) TLS authentication (see https://docs.openstack.org/octavia/latest/admin/guides/certificates.html#two-way-tls-authentication-in-octavia and subsequent points).

The culprit lies in how the PKI that this authentication depends on is managed. Specifically, the entire PKI (from a product perspective) is managed through a self-signed CA that comes with no automation or monitoring to rotate the required certificates. During this service degradation, the client-side certificate used by the Octavia worker to authenticate to the Amphoras expired, leaving the workers with no way to communicate securely with the Amphoras. The expiration date was set to the 16th of July 2021, which tells us the service degradation started at that time.

Version-Release number of selected component (if applicable):
16.1.6

How reproducible:
Install an OpenStack cluster with TLS-everywhere IPA integration (NovaJoin) and wait a year.

Steps to Reproduce:
1. Install an OpenStack cluster with TLS-everywhere IPA integration (NovaJoin)
2. Wait a year
3. Watch /var/lib/config-data/puppet-generated/octavia/etc/octavia/certs/client.pem expire on the controllers.

Actual results:

[2021-07-21 13:15:36 -0400] [1248] [DEBUG] Failed to send error message.
[2021-07-21 13:15:41 -0400] [1248] [DEBUG] Error processing SSL request.
[2021-07-21 13:15:41 -0400] [1248] [DEBUG] Invalid request from ip=::ffff:172.24.3.245: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:2354)

[root@controller-2 certs]# cat client.pem.backup
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 1 (0x1)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=US, ST=Denial, L=Springfield, O=Dis, CN=www.example.com
        Validity
            Not Before: Jul 16 12:22:00 2020 GMT
            Not After : Jul 16 12:22:00 2021 GMT
        Subject: C=US, ST=Denial, O=Dis, CN=www.example.com

Expected results:

Either:
IPA to issue these client certs (which I think might still be a roadmap item with IPA) and renew them automatically
-or-
Some mechanism to auto-renew these client certs prior to expiration.

Additional info:
https://docs.google.com/document/d/1Jeok-VWayejYJnAW_z5lDmwm8MK-Bf1aTa-JrjXLuPk/edit#
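
To illustrate the kind of expiry check the "Expected results" ask for, here is a minimal Python sketch that reads the "Not After" field from a text dump like the one above and reports how many days remain. It is not part of Octavia, tripleo, or IPA; the function names, the SAMPLE constant, and the 30-day threshold are assumptions made for illustration only.

```python
import re
import ssl
from datetime import datetime, timezone

# Matches the "Not After" line of a certificate text dump such as the one
# pasted in this report.
NOT_AFTER_RE = re.compile(
    r"Not After\s*:\s*(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4} GMT)"
)

# Validity section copied from the client.pem.backup dump in this report.
SAMPLE = """
Validity
    Not Before: Jul 16 12:22:00 2020 GMT
    Not After : Jul 16 12:22:00 2021 GMT
"""


def not_after(cert_text):
    """Return the certificate's expiry time as an aware UTC datetime."""
    match = NOT_AFTER_RE.search(cert_text)
    if match is None:
        raise ValueError("no 'Not After' field found in certificate text")
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format.
    seconds = ssl.cert_time_to_seconds(match.group(1))
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def days_remaining(cert_text, now=None):
    """Days until expiry; negative once the certificate has expired."""
    now = now or datetime.now(timezone.utc)
    return (not_after(cert_text) - now).total_seconds() / 86400


def needs_renewal(cert_text, now=None, threshold_days=30):
    """True when fewer than threshold_days remain (hypothetical policy)."""
    return days_remaining(cert_text, now) < threshold_days


if __name__ == "__main__":
    print(f"days remaining: {days_remaining(SAMPLE):.1f}")
    print(f"needs renewal: {needs_renewal(SAMPLE)}")
```

Run periodically against the text form of client.pem on each controller, a check along these lines would have flagged the certificate before the 16th of July 2021 instead of after the outage began.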
Here is a little background information on how mutual-authentication TLS works in Octavia and OSP.

Communication between the control plane and the amphorae (load balancing service VMs) is over a TLS connection using mutual authentication. This means that the control plane authenticates the certificates issued to the amphorae, and the amphorae authenticate the certificates presented by the control plane. These certificates are used only for service-to-service communication.

The amphora certificates are issued at boot time, and the Octavia housekeeping process rotates them as necessary based on the configuration settings. On the control plane side, in the case of RHOSP, the certificates are created and managed by tripleo/director.

We are looking into why that "client" certificate has incorrect information on it.
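
As a generic illustration of the mutual authentication described above, the following Python sketch configures both ends with the standard ssl module. It is not Octavia's actual code; the function names and parameters are hypothetical, and the file paths are optional only so that the settings can be inspected without real certificates.

```python
import ssl


def amphora_side_context(server_pem=None, client_ca_pem=None):
    """Server side: present a certificate and require one from the caller."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Reject any client that does not present a certificate signed by a
    # trusted CA. An expired client certificate fails this verification.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if server_pem:
        ctx.load_cert_chain(server_pem)
    if client_ca_pem:
        ctx.load_verify_locations(cafile=client_ca_pem)
    return ctx


def control_plane_context(client_pem=None, server_ca_pem=None):
    """Client side: verify the server and present a client certificate."""
    # PROTOCOL_TLS_CLIENT enables certificate verification by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if client_pem:
        ctx.load_cert_chain(client_pem)
    if server_ca_pem:
        ctx.load_verify_locations(cafile=server_ca_pem)
    return ctx


if __name__ == "__main__":
    print(amphora_side_context().verify_mode)
    print(control_plane_context().verify_mode)
```

With both sides set to require a verified peer certificate, an expired certificate on either end is enough to break the handshake, which matches the CERTIFICATE_VERIFY_FAILED error reported in this bug.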
Brent, please find the needed info in https://access.redhat.com/support/cases/#/case/02996461
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0986