Description of problem:

This is very similar to the problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1845650 (which involved redis, where redis_tls_proxy failed to start on the non-bootstrap controller).

In a brownfield deployment, a system that originally had only public TLS enabled is updated to deploy tls-everywhere. The deployment itself succeeds, but further tests show that swift_proxy_tls_proxy fails to restart correctly on controller-0. The container fails to start with an error indicating that the port it is trying to bind to is already in use:

2020-09-26T10:39:50.166040690+00:00 stderr F (98)Address already in use: AH00072: make_sock: could not bind to address 172.17.3.86:8080

Restarting this container manually fixes the issue and the tests pass successfully from then on.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
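For reference, a minimal sketch of how the port conflict can be confirmed and the manual workaround applied on the affected controller. The systemd unit and container names are taken from this report; the exact commands are an illustration, not part of the original report:

# Check which process is still holding the internal API port (8080 in this report)
sudo ss -tlnp | grep ':8080'

# Confirm the state of the two swift proxy containers
sudo podman ps -a --filter name=swift_proxy

# Manual workaround reported above: restart the failed tls proxy container
sudo systemctl restart tripleo_swift_proxy_tls_proxy
sudo systemctl status tripleo_swift_proxy_tls_proxy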
Damien and I took a look at this issue (note we have only a basic understanding of swift).

We think this is a race condition in the start ordering of the swift proxy containers. Here is what happens during the stack update that brings in tls-everywhere. On controller-0, this is the timeline of events:

A) The brand new swift_proxy_tls_proxy gets created (note it is not yet started) at 18:58:20:

2020-10-01 18:58:20.872 678431 DEBUG paunch [ ] Completed $ podman create --name swift_proxy_tls_proxy

B) At 18:59:09 swift_proxy_tls_proxy gets started via systemd:

[root@controller-0 containers]# systemctl status tripleo_swift_proxy_tls_proxy
● tripleo_swift_proxy_tls_proxy.service - swift_proxy_tls_proxy container
   Loaded: loaded (/etc/systemd/system/tripleo_swift_proxy_tls_proxy.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-10-01 18:59:09 UTC; 3 days ago
 Main PID: 724110 (code=exited, status=1/FAILURE)

C) swift_proxy gets stopped and removed around 18:59:12:

2020-10-01 18:59:12.416 678431 DEBUG paunch [ ] $ podman rm swift_proxy

D) swift_proxy gets started a second later at 18:59:13:

2020-10-01 18:59:13.072 678431 DEBUG paunch [ ] $ podman create --name swift_proxy --label config_id=tripleo_step4 --label container_name=swift_proxy --label managed_by=tripleo-Controller

So ultimately the problem is that at (B), when swift_proxy_tls_proxy gets started, swift_proxy has not yet stopped and is still holding the port (it listened on it pre-tls-everywhere), so the tls proxy fails with 'Address already in use'.

We suspect this is because swift_proxy_tls_proxy has no start order defined within step 4, so paunch is free to start it at any point:
https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/swift/swift-proxy-container-puppet.yaml#L422

An untested fix could be:

diff --git a/deployment/swift/swift-proxy-container-puppet.yaml b/deployment/swift/swift-proxy-container-puppet.yaml
index d62f5c83e668..0300be90d638 100644
--- a/deployment/swift/swift-proxy-container-puppet.yaml
+++ b/deployment/swift/swift-proxy-container-puppet.yaml
@@ -419,6 +419,7 @@ outputs:
         - if:
           - internal_tls_enabled
           - swift_proxy_tls_proxy:
+              start_order: 3
              image: *swift_proxy_image
              net: host
              user: root

I am moving this to DFG:Storage so they can investigate and provide their feedback. The sosreport for the broken node is here:
http://file.rdu.redhat.com/~mbaldess/swift_brownfield/sosreport-controller-0-2020-10-05-jxeyruo.tar.xz
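As a follow-up check (not part of the original comment), one way to verify on a deployed node whether a start_order actually made it into the paunch-generated container definition is to inspect the container labels. The 'config_data' label name is an assumption about how paunch labels the containers it manages; adjust if your version labels them differently:

# Hypothetical verification sketch; assumes paunch stores the container's
# JSON config in a 'config_data' label on the created container.
sudo podman inspect swift_proxy_tls_proxy \
  --format '{{ index .Config.Labels "config_data" }}' | grep -o '"start_order": *[0-9]*'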
(In reply to Michele Baldessari from comment #1)
> An untested fix could be:
> [...]
>           - swift_proxy_tls_proxy:
> +              start_order: 3

Michele, thanks a lot for helping with this bug! I noticed you're suggesting 3; swift_proxy itself uses 2, though. Can you see any reason why we shouldn't use 2 for the tls_proxy container?
(In reply to Giulio Fidente from comment #2)
> Michele, thanks a lot for helping with this bug! I noticed you're suggesting
> 3; swift_proxy itself uses 2, though. Can you see any reason why we shouldn't
> use 2 for the tls_proxy container?

Sorry, I had missed this question entirely.

I think you should use 3: that way you have an explicit guarantee of starting after swift_proxy, which is what this case is about. If you put start_order: 2, I am just not sure whether paunch will use the ordering in the file or whether it is free to start them in any order.
Verified via our automation for 16.2 brownfield update
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483