Description of problem:
After enabling SSL on an existing deployment, the stack update hung at step 3 and failed with the error below:
2019-08-07 08:11:53Z [overcloud-AllNodesDeploySteps-e5knnpw4iluz.ControllerDeployment_Step3]: UPDATE_FAILED resources.ControllerDeployment_Step3: Stack UPDATE cancelled
2019-08-07 08:11:53Z [overcloud-AllNodesDeploySteps-e5knnpw4iluz]: UPDATE_FAILED Resource UPDATE failed: resources.ControllerDeployment_Step3: Stack UPDATE cancelled
Stack overcloud UPDATE_FAILED
resources.ControllerDeployment_Step3: Stack UPDATE cancelled
Heat Stack update failed.
(undercloud) [stack@directorvm ~]$
We came across a similar bugzilla, but the workaround given there did not work; the update failed again with the same error.
We tried restarting haproxy-bundle manually, but it did not help.
We made some manual changes to haproxy.conf and restarted haproxy; the update still failed.
We then tried to roll back to the original state (without SSL); that update also hung at step 3.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
The customer is available from 09:30 to 21:30 IST (Indian Standard Time).
We tracked this down to the haproxy-bundle resource not having a bind mount for the directory containing the certificate, so the haproxy container failed when accessing /etc/pki/tls/private/overcloud_endpoint.pem simply because the file was not there.
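A quick way to confirm the missing bind mount on a controller (the container and resource names below are illustrative, taken from a typical OSP 13 deployment; adjust them to the environment):

```shell
# Check whether the certificate is visible inside the running haproxy container
# (container name assumed; list containers with `docker ps` first).
docker inspect haproxy-bundle-docker-0 \
  --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' | grep pem

# Compare against the storage-map entries in the Pacemaker bundle definition.
pcs resource show haproxy-bundle
```

If the pem file's directory shows up in the bundle's storage-map but not in the container's mounts (or vice versa), the bundle definition and the running container are out of sync.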
Theoretically, deep_compare should have done the right thing (added that directory to the resource definition); we will try to reproduce this to rule out an issue with it.
We investigated further with Luca by redeploying a stock OSP 13 and then adding the TLS external endpoints with a stack redeploy.
We think we reproduced the error, although in our case the environment ultimately healed itself and the deploy finished.
What we think is happening is the following:
. the haproxy pacemaker service file is designed to restart haproxy when either the haproxy configuration changes or the pacemaker bundle configuration changes.
. when the TLS endpoints are added, the haproxy config is regenerated and the pacemaker bundle needs to bind-mount a new pem file into the haproxy container.
. this therefore triggers two restarts of the haproxy containers.
. the first restart is triggered by the config change.
. when the haproxy container is restarted, haproxy reads the new config and tries to load the certificates and keys from the pem file, but this pem file is not bind-mounted yet (the bundle configuration has not been updated at that point).
. so all the haproxy containers fail to start.
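The sequence above can be sketched as a toy simulation (the class and function names are illustrative, not the actual Pacemaker/TripleO code): the first restart happens before the pem file is bind-mounted and fails; the second happens after the bundle update and succeeds.

```python
# Toy model of the two-restart ordering when TLS endpoints are enabled.
PEM = "/etc/pki/tls/private/overcloud_endpoint.pem"

class Haproxy:
    def __init__(self):
        self.bind_mounts = set()      # paths visible inside the container
        self.config_needs_pem = False # does haproxy.cfg reference the pem?
        self.running = True

    def restart(self):
        # haproxy starts only if every file its config references is visible
        self.running = not (self.config_needs_pem and PEM not in self.bind_mounts)
        return self.running

def enable_tls(h):
    outcomes = []
    # restart #1: config regenerated first, bundle not yet updated
    h.config_needs_pem = True
    outcomes.append(h.restart())      # fails: pem not bind-mounted yet
    # restart #2: bundle definition updated, pem now bind-mounted
    h.bind_mounts.add(PEM)
    outcomes.append(h.restart())      # succeeds
    return outcomes

print(enable_tls(Haproxy()))          # [False, True]
```

The first `False` is the window in which every haproxy container sits in a failed state and the Heat stack update appears hung.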
In our case, once the haproxy bundle config had been updated, pacemaker cleaned up all the errors and restarted the haproxy containers; this time haproxy could access the new pem file, so everything recovered automatically.
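In our reproduction the recovery happened on its own; on a stuck environment, a manual cleanup after the bundle update may speed it up (resource name assumed from a default TripleO deployment):

```shell
# Clear the failed-start history so Pacemaker retries the haproxy containers
pcs resource cleanup haproxy-bundle

# Verify the bundle comes back up
pcs status resources | grep -i haproxy
```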
In the current design, we deliberately schedule the "restart on config change" before the "restart on bundle config change", but unfortunately this ordering breaks use cases such as "upgrade haproxy to use TLS".
We need to discuss internally how best to fix the double restart ordering issue.