Bug 1738927

Summary:	Enabling TLS results in timeout during deployment and haproxy containers are in stopped state
Product:	Red Hat OpenStack	Reporter:	vivek koul <vkoul>
Component:	puppet-haproxy	Assignee:	RHOS Maint <rhos-maint>
Status:	CLOSED CURRENTRELEASE	QA Contact:	nlevinki <nlevinki>
Severity:	urgent	Docs Contact:
Priority:	high
Version:	13.0 (Queens)	CC:	bperkins, dciabrin, jjoyce, jschluet, lmiccini, michele, pkomarov, sisadoun, slinaber, tvignaud
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1809958 (view as bug list)		Environment:
Last Closed:	2020-10-15 13:02:11 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1809958

Description vivek koul 2019-08-08 11:54:37 UTC

Description of problem:

After enabling SSL on existing deployment, it got hung at step 3 and gets failed with below error:

~~~
2019-08-07 08:11:53Z [overcloud-AllNodesDeploySteps-e5knnpw4iluz.ControllerDeployment_Step3]: UPDATE_FAILED  resources.ControllerDeployment_Step3: Stack UPDATE cancelled
2019-08-07 08:11:53Z [overcloud-AllNodesDeploySteps-e5knnpw4iluz]: UPDATE_FAILED  Resource UPDATE failed: resources.ControllerDeployment_Step3: Stack UPDATE cancelled

 Stack overcloud UPDATE_FAILED

overcloud.AllNodesDeploySteps.ControllerDeployment_Step3:
  resource_type: OS::TripleO::DeploymentSteps
  physical_resource_id: 634d0ccc-b73c-42e3-8278-c8ad3ab54399
  status: UPDATE_FAILED
  status_reason: |
    resources.ControllerDeployment_Step3: Stack UPDATE cancelled
Heat Stack update failed.
Heat Stack update failed.
(undercloud) [stack@directorvm ~]$
~~~

We came across a similar bugzilla[1] and the given workaround[2] did not work and it again failed with the same error.

We tried restarting haproxy-bundle manually but it did not work.

We made some manual changes in haproxy.conf and restarted haproxy, it still failed.

we thought of moving back to the original state(without SSl) it was also hung at step 3

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1699101
[2] 
~~~
parameter_defaults:
  ControllerExtraConfig:
    pacemaker::resource::bundle::deep_compare: true
    pacemaker::resource::ip::deep_compare: true
    pacemaker::resource::ocf::deep_compare: true
~~~

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Case: 02443042
Account: 1233645
Severity: Sev1
Cu is available  from 9:30 to 21:30 IST(Indian standard time)

Comment 6 Luca Miccini 2019-08-09 11:10:46 UTC

We tracked this down to the haproxy-bundle resource not having a bind_mount to the dir containing the certificate, so the haproxy container failed while accessing /etc/pki/tls/private/overcloud_endpoint.pem simply because it wasn't there.

Theoretically deep_compare should've done the right thing (adding such dir to the resource definition) we will try to reproduce to rule out any issue with it.

Comment 7 Damien Ciabrini 2019-08-09 14:39:46 UTC

We further investigated with Luca and by redeploying a stock osp13, and then add the TLS external endpoints with a stack redeploy.
We think we reproduced the error, although in our case ultimately the environment healed itself and the deploy finished.

What we think is happening is the following:
. the haproxy pacemaker service file is designed to restart haproxy when either the haproxy configuration changes, or when the pacemaker bundle configuration changes.
. when the TLS endpoints are added, the haproxy config is regenerated and the pacemaker bundle needs to bind-mount a new pem file in the haproxy container.
. so this triggers two restart of the haproxy containers.
. however the first restart is triggered by the config change
. when the haproxy container is restarted, haproxy reads the new config, tries to certificates and keys from the pem file, but this pem file is not bind-monted yet (the bundle configuration hasn't been updated yet).
. so all haproxy container fail to start

In our case, when the haproxy bundle config has been updated, pacemaker cleaned all the errors, restarted the haproxy containers and this time haproxy could access the new pem file, so everything recovered automatically.

In our current design, we schedule the "restart on config change" before the "restart on bundle config change" on purpose, and but unfortunately this design breaks use cases such as "upgrade haproxy to use TLS".

We need to discuss internally how best to fix the double restart ordering issue.