Bug 1845650

Summary: redis_tls_proxy fails to restart properly when doing a brownfield deployment
Product: Red Hat OpenStack
Reporter: Ade Lee <alee>
Component: openstack-tripleo-heat-templates
Assignee: Damien Ciabrini <dciabrin>
Status: CLOSED ERRATA
QA Contact: David Rosenfeld <drosenfe>
Severity: medium
Priority: high
Version: 16.1 (Train)
CC: dciabrin, lmiccini, mburns
Target Milestone: z2
Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200728213431.6c7ccc9.el8ost
Last Closed: 2020-10-28 15:37:36 UTC
Type: Bug

Description Ade Lee 2020-06-09 18:08:36 UTC
Description of problem:

In a brownfield deployment, a system that originally had only public TLS enabled is updated to deploy TLS everywhere (TLS-e).

While the deployment itself succeeds, further tests show that the redis_tls_proxy container fails to restart correctly on controller-2 and controller-3. The container fails to start with an error indicating that it is trying to bind to a port that is already in use:

2020-06-05T19:53:24.373650132+00:00 stderr F [.] Binding service [redis] to 172.17.1.82:6379: Address already in use (98)

The most likely cause is that the redis_tls_proxy container is restarted on the non-bootstrap nodes before redis_restart_bundle has run on the bootstrap node.

See: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/database/redis-pacemaker-puppet.yaml#L262
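On an affected controller, the conflict can be seen directly: the old redis process is still holding the internal api port that stunnel needs. A quick check along these lines should show it (the address and container names follow this report; verify them on your deployment):

    # which process currently holds port 6379 on the internal api address?
    sudo ss -tlnp | grep 6379
    # state of the redis and redis_tls_proxy containers
    sudo podman ps -a --filter name=redis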

The workaround for now appears to be to restart redis_tls_proxy on the affected controllers after the update, along the lines of the sketch below. That appears to be sufficient to set everything up correctly.
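A minimal sketch of that workaround, assuming podman-managed containers as on RHOSP 16.1 (container names may differ on your deployment):

    # on each affected controller, after the redis container has picked up
    # its new localhost-only config, bounce the stunnel container
    sudo podman restart redis_tls_proxy
    # confirm stunnel now holds the internal api port
    sudo ss -tlnp | grep 6379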

Comment 1 Damien Ciabrini 2020-06-10 09:48:52 UTC
We investigated the failure with Ade yesterday, and we now understand the series of events that caused the problem.

As a quick refresher:
When Redis is configured without TLS, each redis server listens on the internal api NIC, on port 6379.

When Redis is configured with TLS [1], Redis listens on localhost:6379 instead. A TLS tunnel is created on the internal api NIC port 6379 to forward traffic to Redis. Likewise, Redis nodes connect to each other via a dedicated TLS tunnel.
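To make the tunnel layout concrete, here is a rough sketch of the stunnel service definition that redis_tls_proxy runs with. The address matches the log above, but the file path and certificate locations are illustrative assumptions, not taken from the templates:

    # illustrative only; real paths and cert locations differ
    cat > /etc/stunnel/redis.conf <<'EOF'
    [redis]
    accept  = 172.17.1.82:6379   ; TLS endpoint on the internal api NIC
    connect = 127.0.0.1:6379     ; plain-text Redis bound to localhost
    cert    = /etc/pki/tls/certs/redis.crt
    key     = /etc/pki/tls/private/redis.key
    EOF
    stunnel /etc/stunnel/redis.conf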

From a deployment standpoint, a controller node with TLS-e runs an additional container, redis_tls_proxy, which manages all the necessary tunnels with stunnel. The redis and redis_tls_proxy containers are started/restarted during step 2, and this is where a brownfield deployment runs into trouble:

  . Initially, TLS-e is disabled, so only the redis container is running, bound to internal-api:6379
  
  . A stack update is performed to convert the controller nodes to TLS-e. All nodes are converted in parallel.

  . Redis configs are regenerated on all controllers to switch to TLS-e.
  
  . On controller-0, when step 2 is reached, a pacemaker command is triggered to restart the redis containers on all controllers. On restart, the redis containers read the new config file, which makes them bind to localhost:6379.

  . On controller-0, after the pacemaker command has run, the redis_tls_proxy container is created and started. The stunnel process binds to internal-api:6379 and forwards traffic to localhost:6379.

  . On controller-1 and controller-2, however, we don't restart redis; instead we rely on the restart command run on controller-0.

  . Now the problem happens: on controller-1 and controller-2, the redis_tls_proxy container is created and started. But when stunnel tries to bind internal-api:6379, there is no guarantee that controller-0 has already run the pacemaker command.

  . So when controller-1 or controller-2 reaches this point before controller-0, redis_tls_proxy is started _before_ the old redis container has restarted, and stunnel won't be able to bind internal-api:6379 and will fail (see the sketch after this list).
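For reference, the restart triggered on the bootstrap node boils down to a pacemaker bundle restart, and one way to picture a fix is to have the proxy wait for the old redis to release the port before starting stunnel. A sketch under those assumptions (redis-bundle is the usual resource name on RHOSP 16.1; the wait loop is purely illustrative and not necessarily the fix that actually landed):

    # on the bootstrap node (controller-0), step 2 effectively runs:
    pcs resource restart redis-bundle

    # on the non-bootstrap nodes, redis_tls_proxy would have to wait until
    # the pre-TLS redis has released the internal api port, e.g.:
    while ss -tln | grep -q '172.17.1.82:6379'; do
        sleep 2    # old redis still holds the port; wait for its restart
    done
    sudo podman start redis_tls_proxy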

Even if the deployment finishes, this sequence can yield a control plane where Redis is unreachable.


[1] Redis doesn't support TLS natively, so TripleO uses TLS tunnels via stunnel

Comment 19 errata-xmlrpc 2020-10-28 15:37:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284