Description of problem:

After enabling SSL on an existing deployment, the stack update hung at step 3 and failed with the following error:

~~~
2019-08-07 08:11:53Z [overcloud-AllNodesDeploySteps-e5knnpw4iluz.ControllerDeployment_Step3]: UPDATE_FAILED  resources.ControllerDeployment_Step3: Stack UPDATE cancelled
2019-08-07 08:11:53Z [overcloud-AllNodesDeploySteps-e5knnpw4iluz]: UPDATE_FAILED  Resource UPDATE failed: resources.ControllerDeployment_Step3: Stack UPDATE cancelled

 Stack overcloud UPDATE_FAILED

overcloud.AllNodesDeploySteps.ControllerDeployment_Step3:
  resource_type: OS::TripleO::DeploymentSteps
  physical_resource_id: 634d0ccc-b73c-42e3-8278-c8ad3ab54399
  status: UPDATE_FAILED
  status_reason: |
    resources.ControllerDeployment_Step3: Stack UPDATE cancelled
Heat Stack update failed.
Heat Stack update failed.
(undercloud) [stack@directorvm ~]$
~~~

We came across a similar Bugzilla [1], but the workaround given there [2] did not work; the update failed again with the same error. We tried restarting haproxy-bundle manually, but it did not help. We also made some manual changes to haproxy.conf and restarted haproxy, and it still failed. We then tried to roll back to the original state (without SSL), and that update also hung at step 3.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1699101

[2]
~~~
parameter_defaults:
  ControllerExtraConfig:
    pacemaker::resource::bundle::deep_compare: true
    pacemaker::resource::ip::deep_compare: true
    pacemaker::resource::ocf::deep_compare: true
~~~

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Case: 02443042
Account: 1233645
Severity: Sev1
Cu is available from 9:30 to 21:30 IST (Indian Standard Time)
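For the record, the workaround in [2] is a Heat environment snippet, so it is applied by saving it to a file and re-running the same overcloud deploy command with an extra `-e` argument (the file name below is hypothetical; the rest of the deploy arguments must match the original deployment):

~~~
(undercloud) [stack@directorvm ~]$ openstack overcloud deploy --templates \
    <original -e environment files> \
    -e ~/deep-compare.yaml
~~~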
We tracked this down to the haproxy-bundle resource not having a bind mount for the directory containing the certificate, so the haproxy container failed while accessing /etc/pki/tls/private/overcloud_endpoint.pem simply because the file wasn't visible inside the container. In theory, deep_compare should have done the right thing (adding that directory to the resource definition); we will try to reproduce this to rule out any issue with it.
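For reference, the missing piece is a storage mapping on the bundle. Added by hand it would look roughly like this (a sketch only; the mapping id is made up, and in a supported deployment this should come from the Heat/puppet-generated bundle definition rather than a manual pcs call):

~~~
pcs resource bundle update haproxy-bundle \
    storage-map add id=haproxy-cert \
    source-dir=/etc/pki/tls/private \
    target-dir=/etc/pki/tls/private options=ro
~~~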
We investigated further with Luca by redeploying a stock OSP13 and then adding the TLS external endpoints with a stack update. We think we reproduced the error, although in our case the environment ultimately healed itself and the deploy finished. What we think is happening is the following:

. The haproxy pacemaker service file is designed to restart haproxy when either the haproxy configuration changes or the pacemaker bundle configuration changes.

. When the TLS endpoints are added, the haproxy config is regenerated and the pacemaker bundle needs to bind-mount a new pem file into the haproxy container. This triggers two restarts of the haproxy containers.

. The first restart is triggered by the config change: when the haproxy container is restarted, haproxy reads the new config and tries to load the certificates and keys from the pem file, but this pem file is not bind-mounted yet (the bundle configuration hasn't been updated yet).

. So all haproxy containers fail to start.

In our case, once the haproxy bundle config had been updated, pacemaker cleaned up all the errors and restarted the haproxy containers; this time haproxy could access the new pem file, so everything recovered automatically.

In our current design, we deliberately schedule the "restart on config change" before the "restart on bundle config change", but unfortunately this ordering breaks use cases such as "update haproxy to use TLS". We need to discuss internally how best to fix this double-restart ordering issue.
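The failure sequence above can be sketched with a small simulation (all names are hypothetical; this only models the ordering of the two restarts, not pacemaker or haproxy themselves):

```python
# Minimal model of the double-restart ordering issue. The key point:
# the config-change restart happens before the bundle gains the new
# bind mount, so the first restart must fail and the second succeed.

class HaproxyBundle:
    def __init__(self):
        self.bind_mounts = {"/etc/haproxy"}   # dirs visible inside the container
        self.config_needs_tls = False         # does haproxy.cfg reference the pem?
        self.errors = []

    def restart(self):
        """Restart the container; fail if the config needs a file it can't see."""
        if self.config_needs_tls and "/etc/pki/tls/private" not in self.bind_mounts:
            self.errors.append(
                "cannot open /etc/pki/tls/private/overcloud_endpoint.pem")
            return False
        return True

bundle = HaproxyBundle()

# Step 1: TLS endpoints added -> haproxy config is regenerated first,
# which triggers the "restart on config change". The pem directory is
# not bind-mounted yet, so this restart fails on every controller.
bundle.config_needs_tls = True
first_restart_ok = bundle.restart()

# Step 2: the bundle definition is updated with the new storage mapping,
# pacemaker cleans up the errors and restarts again; this one succeeds.
bundle.bind_mounts.add("/etc/pki/tls/private")
bundle.errors.clear()
second_restart_ok = bundle.restart()

print(first_restart_ok, second_restart_ok)  # → False True
```

Swapping the two steps (updating the bundle's bind mounts before regenerating the config) makes both restarts succeed, which is essentially the ordering fix we need to discuss.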