Bug 1845650
| Summary: | redis_tls_proxy fails to restart properly when doing a brownfield deployment | | |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Ade Lee <alee> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Damien Ciabrini <dciabrin> |
| Status: | CLOSED ERRATA | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | medium | Priority: | high |
| Version: | 16.1 (Train) | CC: | dciabrin, lmiccini, mburns |
| Target Milestone: | z2 | Keywords: | Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | Type: | Bug |
| Hardware: | x86_64 | OS: | Linux |
| Fixed In Version: | openstack-tripleo-heat-templates-11.3.2-0.20200728213431.6c7ccc9.el8ost | Last Closed: | 2020-10-28 15:37:36 UTC |
Description
Ade Lee 2020-06-09 18:08:36 UTC
We investigated the failure with Ade yesterday, and we understood what series of events caused the problem.

As a quick refresher: when Redis is configured without TLS, each Redis server listens on the internal API NIC, on port 6379. When Redis is configured with TLS [1], Redis listens on localhost:6379 instead, and a TLS tunnel on the internal API NIC, port 6379, forwards traffic to Redis. Likewise, Redis nodes connect to each other via dedicated TLS tunnels. From a deployment standpoint, a controller node with TLS-e runs an additional container, redis_tls_proxy, which manages all the necessary tunnels with stunnel. The redis and redis_tls_proxy containers are started/restarted during step 2. (A sketch of the resulting redis and stunnel configuration is included at the end of this report.)

And this is a problem with a brownfield deployment:

- Initially, TLS-e is disabled, so only the redis container is running, and it binds to internal-api:6379.
- A stack update is performed to convert the controller nodes to TLS-e. All nodes are converted in parallel.
- Redis configs are regenerated on all controllers to switch to TLS-e.
- On controller-0, when reaching step 2, a pacemaker command is triggered to restart the redis containers on all controllers. On restart, the redis containers read the new config file, which makes them bind to localhost:6379.
- On controller-0, after the pacemaker command has run, the redis_tls_proxy container is created and started. The stunnel process binds to internal-api:6379 and forwards traffic to localhost:6379.
- On controller-1 and controller-2, however, we don't restart redis; instead we rely on the restart command run on controller-0.
- Now the problem happens: on controller-1 and controller-2, the redis_tls_proxy container is created and started, but when stunnel tries to bind internal-api:6379, there is no guarantee that controller-0 has already run the pacemaker command.
- So when controller-1 or controller-2 runs before controller-0, redis_tls_proxy is started _before_ the old redis container has restarted, and stunnel fails because it cannot bind internal-api:6379.

Even if the deployment finishes, this sequence yields a control plane with redis potentially not reachable. (The ordering invariant the fix has to enforce is sketched at the end of this report.)

[1] Redis doesn't support TLS natively, so TripleO uses TLS tunnels via stunnel.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284
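For illustration, here is a minimal sketch of the two listening layouts described above. The addresses, peer tunnel ports, certificate paths, and service names are placeholders, not the exact configuration that TripleO generates. With TLS-e enabled, redis.conf switches from binding the internal API address to binding loopback:

```conf
# redis.conf fragment (sketch): with TLS-e, redis only listens on loopback
bind 127.0.0.1
port 6379
```

while redis_tls_proxy runs stunnel with a configuration along these lines:

```ini
; stunnel configuration sketch for redis_tls_proxy (paths and IPs are placeholders)
cert = /etc/pki/tls/certs/redis.crt
key = /etc/pki/tls/private/redis.key

[redis-server]
; terminate TLS on the internal API NIC and forward plaintext to the local redis
accept = 172.16.2.10:6379
connect = 127.0.0.1:6379

[redis-peer-controller-1]
; client-mode tunnel so the local redis can reach a peer's TLS endpoint
client = yes
accept = 127.0.0.1:6660
connect = controller-1.internalapi.example.com:6379
```

The line that matters for this bug is the accept on the internal API address: stunnel can only bind it once the pre-TLS redis process has released it.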
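And a minimal sketch, in Python, of the ordering invariant that redis_tls_proxy startup needs to enforce: do not start stunnel until the old redis container has released the internal API address. The address, port, and timeout are illustrative, and this is not the actual fix shipped in openstack-tripleo-heat-templates, only the condition it has to guarantee:

```python
import socket
import time

def wait_for_port_released(host: str, port: int, timeout: float = 300.0) -> bool:
    """Poll until nothing accepts connections on (host, port), or time out.

    Hypothetical guard: once the pre-TLS redis has restarted onto loopback,
    nothing listens on the internal API address and stunnel can take it over.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(1.0)
            if sock.connect_ex((host, port)) != 0:
                return True  # nothing listening anymore: safe to start stunnel
        time.sleep(2.0)
    return False

# Placeholder internal API address of this controller
if not wait_for_port_released("172.16.2.10", 6379):
    raise SystemExit("redis still bound to internal-api:6379; not starting stunnel")
print("port released; stunnel can now bind internal-api:6379")
```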