Bug 1738927 - Enabling TLS results in timeout during deployment and haproxy containers are in stopped state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-haproxy
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
: ---
Assignee: RHOS Maint
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks: 1809958
 
Reported: 2019-08-08 11:54 UTC by vivek koul
Modified: 2021-01-06 09:56 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1809958 (view as bug list)
Environment:
Last Closed: 2020-10-15 13:02:11 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 675993 0 'None' MERGED HA: reorder init_bundle and restart_bundle for improved updates 2021-01-06 09:51:24 UTC

Description vivek koul 2019-08-08 11:54:37 UTC
Description of problem:

After enabling SSL on an existing deployment, the stack update hung at step 3 and then failed with the error below:

~~~
2019-08-07 08:11:53Z [overcloud-AllNodesDeploySteps-e5knnpw4iluz.ControllerDeployment_Step3]: UPDATE_FAILED  resources.ControllerDeployment_Step3: Stack UPDATE cancelled
2019-08-07 08:11:53Z [overcloud-AllNodesDeploySteps-e5knnpw4iluz]: UPDATE_FAILED  Resource UPDATE failed: resources.ControllerDeployment_Step3: Stack UPDATE cancelled

 Stack overcloud UPDATE_FAILED

overcloud.AllNodesDeploySteps.ControllerDeployment_Step3:
  resource_type: OS::TripleO::DeploymentSteps
  physical_resource_id: 634d0ccc-b73c-42e3-8278-c8ad3ab54399
  status: UPDATE_FAILED
  status_reason: |
    resources.ControllerDeployment_Step3: Stack UPDATE cancelled
Heat Stack update failed.
Heat Stack update failed.
(undercloud) [stack@directorvm ~]$
~~~

We came across a similar bugzilla[1], but the workaround given there[2] did not work: the deployment failed again with the same error.

We tried restarting haproxy-bundle manually but it did not work.

We made some manual changes in haproxy.conf and restarted haproxy, but it still failed.

We tried moving back to the original state (without SSL), but that deployment also hung at step 3.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1699101
[2] 
~~~
parameter_defaults:
  ControllerExtraConfig:
    pacemaker::resource::bundle::deep_compare: true
    pacemaker::resource::ip::deep_compare: true
    pacemaker::resource::ocf::deep_compare: true
~~~

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Case: 02443042
Account: 1233645
Severity: Sev1
Customer is available from 9:30 to 21:30 IST (Indian Standard Time)

Comment 6 Luca Miccini 2019-08-09 11:10:46 UTC
We tracked this down to the haproxy-bundle resource not having a bind_mount to the dir containing the certificate, so the haproxy container failed while accessing /etc/pki/tls/private/overcloud_endpoint.pem simply because it wasn't there.
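
The failure mode described above can be illustrated with a minimal sketch: given the host directories bind-mounted into a container, check whether a file the service needs falls under any of them. Only the certificate path comes from this report; the function name and both mount lists are hypothetical and are not taken from the affected environment or from the pacemaker tooling.

```python
from pathlib import PurePosixPath


def is_visible(path, mounted_dirs):
    """Return True if `path` equals or lies under one of the bind-mounted paths."""
    target = PurePosixPath(path)
    for d in mounted_dirs:
        base = PurePosixPath(d)
        if target == base or base in target.parents:
            return True
    return False


# Certificate path quoted in this bug report.
CERT = "/etc/pki/tls/private/overcloud_endpoint.pem"

# Hypothetical mount lists, for illustration only.
mounts_before = [
    "/var/lib/kolla/config_files",
    "/var/lib/config-data/puppet-generated/haproxy",
]
mounts_after = mounts_before + ["/etc/pki/tls/private"]

print(is_visible(CERT, mounts_before))
print(is_visible(CERT, mounts_after))
```

With the first list the certificate is not reachable from inside the container, which corresponds to the failing state; adding the certificate directory makes it reachable.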

Theoretically, deep_compare should have done the right thing (adding that dir to the resource definition). We will try to reproduce this to rule out any issue with it.

Comment 7 Damien Ciabrini 2019-08-09 14:39:46 UTC
We investigated further with Luca by deploying a stock OSP13 and then adding the TLS external endpoints with a stack redeploy.
We think we reproduced the error, although in our case the environment ultimately healed itself and the deploy finished.

What we think is happening is the following:
  . the haproxy pacemaker service file is designed to restart haproxy when either the haproxy configuration changes, or when the pacemaker bundle configuration changes.
  . when the TLS endpoints are added, the haproxy config is regenerated and the pacemaker bundle needs to bind-mount a new pem file in the haproxy container.
  . so this triggers two restarts of the haproxy containers.
  . however the first restart is triggered by the config change
  . when the haproxy container is restarted, haproxy reads the new config and tries to load certificates and keys from the pem file, but this pem file is not bind-mounted yet (the bundle configuration hasn't been updated yet).
  . so all haproxy containers fail to start

In our case, when the haproxy bundle config has been updated, pacemaker cleaned all the errors, restarted the haproxy containers and this time haproxy could access the new pem file, so everything recovered automatically.
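
The sequence described above can be sketched as a small simulation. This is a toy model under stated assumptions, not the actual TripleO or pacemaker logic: the function name and trigger labels are hypothetical, the regenerated haproxy config is assumed to be on disk already, and a restart is assumed to succeed only once the bundle update has added the pem bind mount.

```python
def simulate(restart_order):
    """Replay haproxy restarts and report whether each start succeeds.

    Assumption: the new haproxy config, which references the pem file, is
    already in place. Only the "bundle_change" restart adds the bind mount.
    """
    pem_mounted = False  # the bundle definition has not been updated yet
    results = []
    for trigger in restart_order:
        if trigger == "bundle_change":
            pem_mounted = True  # bundle update bind-mounts the pem file
        results.append((trigger, "started" if pem_mounted else "failed"))
    return results


# Ordering described in this comment: config restart first, bundle restart second.
print(simulate(["config_change", "bundle_change"]))

# Hypothetical reversed ordering, shown only for comparison.
print(simulate(["bundle_change", "config_change"]))
```

In this model the first ordering fails on the config-triggered restart and recovers on the bundle-triggered one, matching the self-healing seen in the reproduction. The reversed ordering never hits the missing file, but whether that is the right fix is a separate design question.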


In our current design, we schedule the "restart on config change" before the "restart on bundle config change" on purpose, but unfortunately this design breaks use cases such as "upgrade haproxy to use TLS".

We need to discuss internally how best to fix the double restart ordering issue.

