Description of problem:

This is very similar to the problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1845650 (which involved redis, where redis_tls_proxy failed to start on the non-bootstrap controller).

In a brownfield deployment, a system that originally had only public TLS enabled is updated to deploy tls-everywhere. The deployment itself succeeds, but further tests show that swift_proxy_tls_proxy fails to restart correctly on controller-0. The container fails to start with an error indicating that the port it is trying to bind to is already in use:

2020-09-26T10:39:50.166040690+00:00 stderr F (98)Address already in use: AH00072: make_sock: could not bind to address 172.17.3.86:8080

Restarting this container manually fixes the issue and the tests pass successfully from then on.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
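For reference, a minimal sketch of how the port conflict can be confirmed and the manual workaround applied on the affected controller. The systemd unit and container names are taken from this report; the exact commands are an illustration, not part of the original report:

# Check which process is still holding the internal API port (8080 in this report)
sudo ss -tlnp | grep ':8080'

# Confirm the state of the two swift proxy containers
sudo podman ps -a --filter name=swift_proxy

# Manual workaround reported above: restart the failed tls proxy container
sudo systemctl restart tripleo_swift_proxy_tls_proxy
sudo systemctl status tripleo_swift_proxy_tls_proxy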
Damien and I took a look at this issue (note we have only a basic understanding of swift).

We think this is a race condition in the start ordering of the swift proxy containers. Here is what happens during the stack update that brings in tls-everywhere. On controller-0, this is the timeline of events:

A) The brand new swift_proxy_tls_proxy gets created (note it is not yet started) at 18:58:20:

2020-10-01 18:58:20.872 678431 DEBUG paunch [ ] Completed $ podman create --name swift_proxy_tls_proxy

B) At 18:59:09 swift_proxy_tls_proxy gets started via systemd:

[root@controller-0 containers]# systemctl status tripleo_swift_proxy_tls_proxy
● tripleo_swift_proxy_tls_proxy.service - swift_proxy_tls_proxy container
   Loaded: loaded (/etc/systemd/system/tripleo_swift_proxy_tls_proxy.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-10-01 18:59:09 UTC; 3 days ago
 Main PID: 724110 (code=exited, status=1/FAILURE)

C) swift_proxy gets stopped and removed around 18:59:12:

2020-10-01 18:59:12.416 678431 DEBUG paunch [ ] $ podman rm swift_proxy

D) swift_proxy gets started a second later at 18:59:13:

2020-10-01 18:59:13.072 678431 DEBUG paunch [ ] $ podman create --name swift_proxy --label config_id=tripleo_step4 --label container_name=swift_proxy --label managed_by=tripleo-Controller

So ultimately the problem is that at (B), when swift_proxy_tls_proxy gets started, swift_proxy has not yet stopped and is still holding the port (it listened on it pre-tls-everywhere), so the tls proxy fails with 'Address already in use'.

We suspect this is because swift_proxy_tls_proxy has no start order defined within step 4, so paunch is free to start it at any point:
https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/swift/swift-proxy-container-puppet.yaml#L422

An untested fix could be:

diff --git a/deployment/swift/swift-proxy-container-puppet.yaml b/deployment/swift/swift-proxy-container-puppet.yaml
index d62f5c83e668..0300be90d638 100644
--- a/deployment/swift/swift-proxy-container-puppet.yaml
+++ b/deployment/swift/swift-proxy-container-puppet.yaml
@@ -419,6 +419,7 @@ outputs:
         - if:
           - internal_tls_enabled
           - swift_proxy_tls_proxy:
+              start_order: 3
              image: *swift_proxy_image
              net: host
              user: root

I am moving this to DFG:Storage so they can investigate and provide their feedback. The sosreport for the broken node is here:
http://file.rdu.redhat.com/~mbaldess/swift_brownfield/sosreport-controller-0-2020-10-05-jxeyruo.tar.xz
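As a follow-up check (not part of the original comment), one way to verify on a deployed node whether a start_order actually made it into the paunch-generated container definition is to inspect the container labels. The 'config_data' label name is an assumption about how paunch labels the containers it manages; adjust if your version labels them differently:

# Hypothetical verification sketch; assumes paunch stores the container's
# JSON config in a 'config_data' label on the created container.
sudo podman inspect swift_proxy_tls_proxy \
  --format '{{ index .Config.Labels "config_data" }}' | grep -o '"start_order": *[0-9]*'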
(In reply to Michele Baldessari from comment #1)
> An untested fix could be:
> [...]
>           - swift_proxy_tls_proxy:
> +              start_order: 3

Michele, thanks a lot for helping with this bug! I noticed you're suggesting 3; swift_proxy itself uses 2, though. Can you see any reason why we shouldn't use 2 for the tls_proxy container?
(In reply to Giulio Fidente from comment #2)
> Michele, thanks a lot for helping with this bug! I noticed you're suggesting
> 3; swift_proxy itself uses 2, though. Can you see any reason why we shouldn't
> use 2 for the tls_proxy container?

Sorry, I had missed this question entirely.

I think you should use 3: that way you have an explicit guarantee of starting after swift_proxy, which is what this case is about. If you put start_order: 2, I am just not sure whether paunch will use the ordering in the file or whether it is free to start them in any order.
Verified via our automation for 16.2 brownfield update
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483