Bug 1884823
| Summary: | swift_proxy_tls_proxy fails to start after TLS-EW brownfield update | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ade Lee <alee> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Giulio Fidente <gfidente> |
| Status: | CLOSED ERRATA | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 16.1 (Train) | CC: | dciabrin, elicohen, gfidente, jagee, mburns, michele |
| Target Milestone: | Alpha | Keywords: | Triaged |
| Target Release: | 16.2 (Train on RHEL 8.4) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | openstack-tripleo-heat-templates-11.4.1-2.20210323012110.c3396e2.el8 | Doc Type: | If docs needed, set a value |
| Last Closed: | 2021-09-15 07:09:19 UTC | Type: | Bug |
Description — Ade Lee, 2020-10-02 21:29:17 UTC

---

**Michele Baldessari (comment #1):**

Damien and I took a look at this issue (note we have only a bare understanding of Swift). We think this is a race condition in the start ordering of the Swift proxy containers. Here is what happens during the stack update that brings in TLS everywhere; on controller-0 this is the timeline of events:

A) The brand-new `swift_proxy_tls_proxy` container gets created (note it is not yet started) at 18:58:20:

```
2020-10-01 18:58:20.872 678431 DEBUG paunch [ ] Completed $ podman create --name swift_proxy_tls_proxy
```

B) At 18:59:09 `swift_proxy_tls_proxy` gets started via systemd and fails:

```
[root@controller-0 containers]# systemctl status tripleo_swift_proxy_tls_proxy
● tripleo_swift_proxy_tls_proxy.service - swift_proxy_tls_proxy container
   Loaded: loaded (/etc/systemd/system/tripleo_swift_proxy_tls_proxy.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-10-01 18:59:09 UTC; 3 days ago
 Main PID: 724110 (code=exited, status=1/FAILURE)
```

C) `swift_proxy` gets stopped and removed around 18:59:12:

```
2020-10-01 18:59:12.416 678431 DEBUG paunch [ ] $ podman rm swift_proxy
```

D) `swift_proxy` gets started a second later, at 18:59:13:

```
2020-10-01 18:59:13.072 678431 DEBUG paunch [ ] $ podman create --name swift_proxy --label config_id=tripleo_step4 --label container_name=swift_proxy --label managed_by=tripleo-Controller
```

So ultimately the problem is that at (B), when `swift_proxy_tls_proxy` gets started, `swift_proxy` has not yet stopped and is still holding the port busy (it listened on it pre-TLS-everywhere), so the new proxy fails with 'Address already in use'.
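The 'Address already in use' failure is ordinary TCP bind behaviour: a second socket cannot bind a port that another listener still holds, which is exactly what happens while the old `swift_proxy` is still up. A minimal, self-contained Python sketch (illustrative only, not TripleO code):

```python
import errno
import socket

# First "container" binds and listens, like the pre-TLS swift_proxy.
old_proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old_proxy.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
old_proxy.listen()
port = old_proxy.getsockname()[1]

# Second "container" (the new tls_proxy) tries to bind the same port
# while the first one is still running -> EADDRINUSE.
new_proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    new_proxy.bind(("127.0.0.1", port))
except OSError as e:
    assert e.errno == errno.EADDRINUSE  # "Address already in use"

# Once the old listener is gone, the bind succeeds -- which is why
# guaranteeing that swift_proxy is stopped/replaced before the TLS
# proxy starts removes the race.
old_proxy.close()
new_proxy.bind(("127.0.0.1", port))
new_proxy.close()
```

This is why the order of events matters: had (C) happened before (B), the TLS proxy would have found the port free.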
We suspect this has to do with the fact that `swift_proxy_tls_proxy` has no start order defined within step4, so paunch is free to start it whenever:
https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/swift/swift-proxy-container-puppet.yaml#L422

An untested fix could be:

```diff
diff --git a/deployment/swift/swift-proxy-container-puppet.yaml b/deployment/swift/swift-proxy-container-puppet.yaml
index d62f5c83e668..0300be90d638 100644
--- a/deployment/swift/swift-proxy-container-puppet.yaml
+++ b/deployment/swift/swift-proxy-container-puppet.yaml
@@ -419,6 +419,7 @@ outputs:
           - if:
             - internal_tls_enabled
             - swift_proxy_tls_proxy:
+                start_order: 3
                 image: *swift_proxy_image
                 net: host
                 user: root
```

I am moving this to DFG:Storage so they can investigate and provide their feedback. The sosreport for the broken node is here:
http://file.rdu.redhat.com/~mbaldess/swift_brownfield/sosreport-controller-0-2020-10-05-jxeyruo.tar.xz

---

**Giulio Fidente (comment #2):**

(In reply to Michele Baldessari from comment #1)

Michele, thanks a lot for helping with this bug! I noticed you're suggesting 3, but swift_proxy itself uses 2; can you see any reason why we shouldn't use 2 for the tls_proxy container?
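The effect of `start_order` can be modelled simply: within a config step, containers are started in ascending `start_order`, with unset values treated as 0. A toy model in Python (an assumption-laden sketch of the ordering semantics, not actual paunch code):

```python
# Toy model of per-step container start ordering (not actual paunch code).
# Containers with no explicit start_order are treated as 0 here, so the
# TLS proxy can be started before swift_proxy has been replaced.
step4 = {
    "swift_proxy": {"start_order": 2},      # per the bug, swift_proxy uses 2
    "swift_proxy_tls_proxy": {},            # no start_order in the template
}

def start_sequence(containers):
    # Sort by (start_order, name); missing start_order defaults to 0.
    return sorted(containers,
                  key=lambda n: (containers[n].get("start_order", 0), n))

print(start_sequence(step4))
# -> ['swift_proxy_tls_proxy', 'swift_proxy']   (TLS proxy first: the race)

# With the proposed start_order: 3 the TLS proxy is guaranteed to start
# after swift_proxy has been stopped and recreated at order 2.
step4["swift_proxy_tls_proxy"]["start_order"] = 3
print(start_sequence(step4))
# -> ['swift_proxy', 'swift_proxy_tls_proxy']
```

The model makes the proposed fix concrete: any value strictly greater than swift_proxy's 2 forces the TLS proxy to start last.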
---

**Michele Baldessari (comment #3):**

(In reply to Giulio Fidente from comment #2)

Sorry, I had missed this question entirely. I think you should use 3: that way you have the explicit guarantee of starting after swift_proxy, which is what this case is about. If you put `start_order: 2`, I am just not sure whether paunch will use the ordering in the file or is free to start them in any order.

---

Verified via our automation for the 16.2 brownfield update.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483
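Michele's caution about reusing `start_order: 2` comes down to sorting semantics: when two containers share the same order value, the order key itself no longer decides which starts first, and the outcome depends on whatever tie-break the tool happens to apply. A small Python illustration of this (generic sort behaviour, not paunch internals):

```python
# Two containers with equal start_order: the key gives no guarantee
# about their relative order -- any tie-break is an implementation
# detail of the tool, not something the template controls.
containers = ["swift_proxy_tls_proxy", "swift_proxy"]
orders = {"swift_proxy": 2, "swift_proxy_tls_proxy": 2}

# Python's sort is stable, so ties keep input order -- here the TLS
# proxy would still start first, reproducing the race:
print(sorted(containers, key=lambda n: orders[n]))
# -> ['swift_proxy_tls_proxy', 'swift_proxy']

# With start_order 3 the key alone decides, regardless of input order:
orders["swift_proxy_tls_proxy"] = 3
print(sorted(containers, key=lambda n: orders[n]))
# -> ['swift_proxy', 'swift_proxy_tls_proxy']
```

This is why 3 gives an explicit guarantee while 2 merely gives an ordering that might happen to work.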