Bug 1845650 - redis_tls_proxy fails to restart properly when doing a brownfield deployment
Summary: redis_tls_proxy fails to restart properly when doing a brownfield deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Damien Ciabrini
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-09 18:08 UTC by Ade Lee
Modified: 2020-10-28 15:37 UTC
CC List: 3 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200728213431.6c7ccc9.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-28 15:37:36 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 735122 0 None MERGED Ensure redis_tls_proxy starts after all redis instances 2020-10-13 19:44:56 UTC
Red Hat Product Errata RHEA-2020:4284 0 None None None 2020-10-28 15:37:55 UTC

Description Ade Lee 2020-06-09 18:08:36 UTC
Description of problem:

When doing a brownfield deployment, a system that originally had only public TLS enabled is updated to deploy TLS-everywhere.

While the deployment is successful, further tests show that the redis_tls_proxy fails to restart correctly on controllers 2 and 3. The container fails to start with an error message indicating that it is trying to bind to a port that is already in use.

2020-06-05T19:53:24.373650132+00:00 stderr F [.] Binding service [redis] to 172.17.1.82:6379: Address already in use (98)

The most likely reason this is happening is that the redis_tls_proxy container is being restarted on the non-bootstrap nodes before the redis_restart_bundle runs on the bootstrap node.

See: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/database/redis-pacemaker-puppet.yaml#L262

The workaround for now appears to be to restart the redis_tls_proxy on the relevant controllers after the update. That appears to be sufficient to set everything up correctly.
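
For reference, a minimal sketch of that workaround, assuming the proxy container is managed by podman on the affected controllers (the container name is taken from this report; adjust as needed):

  # Run on each affected controller (e.g. controller-2 and controller-3) after the
  # stack update completes. Assumes the proxy container is named redis_tls_proxy
  # and is managed by podman, as on OSP 16.1 controllers.
  sudo podman restart redis_tls_proxy

  # Then check the container log to confirm stunnel bound internal-api:6379 cleanly.
  sudo podman logs --tail 20 redis_tls_proxy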

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Damien Ciabrini 2020-06-10 09:48:52 UTC
We investigated the failure with Ade yesterday, and we now understand the series of events that causes the problem.

As a quick refresher:
When Redis is configured without TLS, each redis server listens on the internal api NIC, on port 6379.

When Redis is configured with TLS [1], Redis listens on localhost:6379 instead. A TLS tunnel is created on the internal api NIC port 6379 to forward traffic to Redis. Likewise, Redis nodes connect to each other via a dedicated TLS tunnel.

From a deployment standpoint, a controller node with TLS-e runs an additional container, redis_tls_proxy, which manages all the necessary tunnels with stunnel. The redis and redis_tls_proxy containers are started/restarted during step 2, and this is where the problem lies with a brownfield deployment:
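
To make the expected layout concrete, here is a hedged way to inspect it from the controller host (exact process names and addresses will vary with the deployment):

  # With TLS-e, port 6379 should show up twice among the listening sockets:
  #   - redis-server bound to 127.0.0.1:6379 (the redis container)
  #   - stunnel bound to the internal-api address on :6379 (the redis_tls_proxy
  #     container), forwarding decrypted traffic to localhost:6379
  # Without TLS-e, only redis-server appears, bound directly to the internal-api address.
  sudo ss -tlnp | grep ':6379'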

  . Initially, TLS-e is disabled, so only the redis container is running, and binds to internal-api:6379
  
  . a stack update is performed to convert the controller nodes to TLS-e. All nodes are converted in parallel.

  . redis configs are regenerated on all controllers to switch to TLS-e.
  
  . On controller-0, when reaching step 2, a pacemaker command is triggered to restart the redis containers on all controllers. On restart, the redis containers read the new config file that makes them bind to localhost:6379.

  . On controller-0, after the pacemaker command has run, the redis_tls_proxy container is created and started. The stunnel process binds to internal-api:6379 and forwards traffic to localhost:6379.

  . On controller-1 and controller-2, however, we don't restart redis; instead we rely on the restart command run from controller-0.

  . Now the problem happens: on controller-1 and controller-2, the redis_tls_proxy container is created and started. But when stunnel tries to bind internal-api:6379, there is no guarantee that controller-0 has already run the pacemaker command.

  . So when controller-1 or controller-2 reaches this step before controller-0, redis_tls_proxy is started _before_ the old redis container has been restarted. Stunnel therefore cannot bind internal-api:6379 and fails (see the sketch right after this list).
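
The merged change in the review linked above is the authoritative fix; purely to illustrate the ordering constraint, here is a hedged sketch of the kind of guard the proxy's start command could run before launching stunnel (the INTERNAL_API_IP value, the availability of ss inside the container, and the stunnel config path are all assumptions of the sketch):

  # Illustrative only -- not the actual upstream fix. The idea: before stunnel
  # tries to bind the internal-api address, wait until the old non-TLS redis has
  # released it, i.e. controller-0 has run the pacemaker restart and redis now
  # listens on localhost only.
  INTERNAL_API_IP=172.17.1.82        # hypothetical value, reused from the error above
  while ss -tln | grep -q "${INTERNAL_API_IP}:6379"; do
    sleep 2
  done
  exec stunnel /etc/stunnel/stunnel.conf   # config path illustrative as well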

Even if the deployment finishes, this sequence can yield a control plane in which Redis is not reachable.


[1] Redis doesn't support TLS natively, so TripleO uses TLS tunnels via stunnel

Comment 19 errata-xmlrpc 2020-10-28 15:37:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

