Bug 1884823

Summary: swift_proxy_tls_proxy fails to start after TLS-EW brownfield update

Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Target Release: 16.2 (Train on RHEL 8.4)
Target Milestone: Alpha
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: Triaged
Reporter: Ade Lee <alee>
Assignee: Giulio Fidente <gfidente>
QA Contact: David Rosenfeld <drosenfe>
CC: dciabrin, elicohen, gfidente, jagee, mburns, michele
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-11.4.1-2.20210323012110.c3396e2.el8
Last Closed: 2021-09-15 07:09:19 UTC
Type: Bug

Description Ade Lee 2020-10-02 21:29:17 UTC
Description of problem:

This is very similar to the problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1845650 (which had to do with redis, and for which redis_tls_proxy failed to start on the non-bootstrap controller).

In a brownfield update, a system that originally had only public TLS enabled is updated to deploy TLS everywhere.

While the deployment itself succeeds, further tests show that swift_proxy_tls_proxy fails to start correctly on controller-0. The container exits with an error indicating that it is trying to bind to a port that is already in use:

2020-09-26T10:39:50.166040690+00:00 stderr F (98)Address already in use: AH00072: make_sock: could not bind to address 172.17.3.86:8080

Restarting this container manually fixes the issue, and the tests pass from then on.
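
For reference, a minimal sketch of the manual workaround mentioned above, run on the affected controller. It assumes the proxy is managed by the standard TripleO systemd unit tripleo_swift_proxy_tls_proxy and that the conflicting bind is the internal API address/port from the error above:

  # see which process is currently holding the port the TLS proxy wants to bind
  sudo ss -tlnp | grep ':8080'

  # check the failed container and its systemd unit
  sudo podman ps -a --filter name=swift_proxy_tls_proxy
  sudo systemctl status tripleo_swift_proxy_tls_proxy

  # once the old swift_proxy no longer holds the port, a manual restart clears the failure
  sudo systemctl restart tripleo_swift_proxy_tls_proxy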


Comment 1 Michele Baldessari 2020-10-05 08:21:24 UTC
Damien and I took a look at this issue (note that we have only a basic understanding of Swift).

We think this is a race condition in the start ordering of the Swift proxy containers. Here is what happens during the stack update that brings in TLS everywhere; on controller-0 the timeline of events is:
A) The brand new swift_proxy_tls_proxy gets created (note it is not yet started) at 18:58:20:
2020-10-01 18:58:20.872 678431 DEBUG paunch [  ] Completed $ podman create --name swift_proxy_tls_proxy

B) At 18:59:09 swift_proxy_tls_proxy gets started via systemd
[root@controller-0 containers]# systemctl status tripleo_swift_proxy_tls_proxy
● tripleo_swift_proxy_tls_proxy.service - swift_proxy_tls_proxy container
   Loaded: loaded (/etc/systemd/system/tripleo_swift_proxy_tls_proxy.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-10-01 18:59:09 UTC; 3 days ago 
 Main PID: 724110 (code=exited, status=1/FAILURE)

C) Swift proxy gets stopped and removed around 18:59:12:
2020-10-01 18:59:12.416 678431 DEBUG paunch [  ] $ podman rm swift_proxy

D) Swift proxy gets recreated and started a second later, at 18:59:13:
2020-10-01 18:59:13.072 678431 DEBUG paunch [  ] $ podman create --name swift_proxy --label config_id=tripleo_step4 --label container_name=swift_proxy --label managed_by=tripleo-Controller   

So ultimately the problem is that at (B), when swift_proxy_tls_proxy is started, swift_proxy has not yet been stopped and is still holding the port (it listened on it pre-TLS-everywhere), so the new proxy fails with 'Address already in use'.

We suspect this is because swift_proxy_tls_proxy has no start_order defined within step 4, so paunch is free to start it at any point relative to the other step 4 containers: https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/swift/swift-proxy-container-puppet.yaml#L422

An untested fix could be:
diff --git a/deployment/swift/swift-proxy-container-puppet.yaml b/deployment/swift/swift-proxy-container-puppet.yaml
index d62f5c83e668..0300be90d638 100644
--- a/deployment/swift/swift-proxy-container-puppet.yaml
+++ b/deployment/swift/swift-proxy-container-puppet.yaml
@@ -419,6 +419,7 @@ outputs:
             - if:
                 - internal_tls_enabled
                 - swift_proxy_tls_proxy:
+                    start_order: 3
                     image: *swift_proxy_image
                     net: host
                     user: root


I am moving this to DFG:Storage so they can investigate and provide feedback. The sosreport for the broken node is here:
http://file.rdu.redhat.com/~mbaldess/swift_brownfield/sosreport-controller-0-2020-10-05-jxeyruo.tar.xz
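
In case it helps anyone re-checking the timeline above, a rough sketch of where the evidence comes from on the affected node; the paunch log path is an assumption and may differ in this environment:

  # timeline of paunch container actions for both proxies
  sudo grep -E 'podman (create|rm).*swift_proxy' /var/log/paunch.log

  # when systemd tried to start the TLS proxy and how it failed
  sudo journalctl -u tripleo_swift_proxy_tls_proxy --no-pager | tail -n 50

  # the container's own stderr, including the 'Address already in use' bind error
  sudo podman logs swift_proxy_tls_proxy 2>&1 | grep -i 'address already in use'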

Comment 2 Giulio Fidente 2020-10-21 12:38:47 UTC
(In reply to Michele Baldessari from comment #1)
> [...]
> An untested fix could be:
> [...]
> +                    start_order: 3
> [...]

Michele, thanks a lot for helping with this bug! I noticed you're suggesting start_order 3, while swift_proxy itself uses 2. Can you see any reason why we shouldn't use 2 for the tls_proxy container as well?

Comment 5 Michele Baldessari 2021-03-11 11:25:06 UTC
(In reply to Giulio Fidente from comment #2)
> [...]
> Michele, thanks a lot for helping with this bug! I noticed you're suggesting
> start_order 3, while swift_proxy itself uses 2. Can you see any reason why we
> shouldn't use 2 for the tls_proxy container as well?

Sorry, I had missed this question entirely. I think you should use 3; that way you have an explicit guarantee of starting after swift_proxy, which is what this case is about. If you put start_order: 2, I am just not sure whether paunch will honor the ordering in the file or whether it is free to start them in any order.
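
To make the intent concrete, here is a simplified sketch (not the literal template content) of the relevant docker_config step_4 entries as they would look with the proposed change, assuming swift_proxy keeps its existing start_order of 2:

    docker_config:
      step_4:
        map_merge:
          - swift_proxy:
              start_order: 2            # existing value, per comment 2
              image: *swift_proxy_image
              # ... remaining swift_proxy options unchanged ...
          - if:
              - internal_tls_enabled
              - swift_proxy_tls_proxy:
                  start_order: 3        # strictly greater than swift_proxy's 2, so paunch
                                        # only starts the TLS proxy after swift_proxy has
                                        # been reconfigured and released its old bind on :8080
                  image: *swift_proxy_image
                  net: host
                  user: root

With a strictly higher start_order, the result does not depend on how paunch breaks ties between containers that share the same value, which is the uncertainty raised above.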

Comment 8 Eliad Cohen 2021-08-03 20:13:48 UTC
Verified via our automation for the 16.2 brownfield update.

Comment 10 errata-xmlrpc 2021-09-15 07:09:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483