Bug 1738927
Summary: | Enabling TLS results in timeout during deployment and haproxy containers are in stopped state | | | |
---|---|---|---|---
Product: | Red Hat OpenStack | Reporter: | vivek koul <vkoul> | |
Component: | puppet-haproxy | Assignee: | RHOS Maint <rhos-maint> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | nlevinki <nlevinki> | |
Severity: | urgent | Docs Contact: | ||
Priority: | high | |||
Version: | 13.0 (Queens) | CC: | bperkins, dciabrin, jjoyce, jschluet, lmiccini, michele, pkomarov, sisadoun, slinaber, tvignaud | |
Target Milestone: | --- | Keywords: | Triaged, ZStream | |
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1809958 (view as bug list) | Environment: | ||
Last Closed: | 2020-10-15 13:02:11 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1809958 |
Description
vivek koul
2019-08-08 11:54:37 UTC

We tracked this down to the haproxy-bundle resource not having a bind mount for the directory containing the certificate, so the haproxy container failed while accessing /etc/pki/tls/private/overcloud_endpoint.pem simply because it wasn't there. In theory, deep_compare should have done the right thing (adding that directory to the resource definition); we will try to reproduce to rule out any issue with it.

We investigated further with Luca by redeploying a stock OSP 13 and then adding the TLS external endpoints with a stack redeploy. We think we reproduced the error, although in our case the environment ultimately healed itself and the deploy finished.

What we think is happening is the following:

. The haproxy pacemaker service file is designed to restart haproxy when either the haproxy configuration changes or the pacemaker bundle configuration changes.
. When the TLS endpoints are added, the haproxy config is regenerated and the pacemaker bundle needs to bind-mount a new pem file into the haproxy container.
. This triggers two restarts of the haproxy containers.
. However, the first restart is triggered by the config change.
. When the haproxy container is restarted, haproxy reads the new config and tries to load certificates and keys from the pem file, but this pem file is not bind-mounted yet (the bundle configuration hasn't been updated yet).
. So all haproxy containers fail to start.

In our case, once the haproxy bundle config had been updated, pacemaker cleaned all the errors and restarted the haproxy containers; this time haproxy could access the new pem file, so everything recovered automatically.

In our current design, we schedule the "restart on config change" before the "restart on bundle config change" on purpose, but unfortunately this design breaks use cases such as "upgrade haproxy to use TLS". We need to discuss internally how best to fix the double-restart ordering issue.
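The failure mode above (the container restarting before its bundle definition bind-mounts the pem's directory) can be checked for directly from the container engine's inspect output. The sketch below is illustrative only, not part of any fix: it assumes docker/podman-style `inspect` JSON with a `Mounts` list, and the helper name `has_bind_mount` is hypothetical.

```python
import json

def has_bind_mount(inspect_json: str, target: str) -> bool:
    """Return True if any container in the inspect output bind-mounts
    `target` itself or a parent directory of `target`."""
    containers = json.loads(inspect_json)
    for container in containers:
        for mount in container.get("Mounts", []):
            dest = mount.get("Destination", "")
            if not dest:
                continue
            # Match an exact destination, or a destination that is a
            # parent directory of the path we need inside the container.
            if target == dest or target.startswith(dest.rstrip("/") + "/"):
                return True
    return False

# Hypothetical inspect output for a haproxy bundle container that is
# missing the /etc/pki/tls/private mount (the situation in this bug).
sample = json.dumps([{
    "Mounts": [
        {"Source": "/var/lib/kolla/config_files/haproxy.json",
         "Destination": "/var/lib/kolla/config_files/config.json"},
        {"Source": "/etc/pki/tls/certs",
         "Destination": "/etc/pki/tls/certs"},
    ],
}])

print(has_bind_mount(sample, "/etc/pki/tls/private/overcloud_endpoint.pem"))
# Prints False here: the pem's directory is not mounted, so haproxy
# inside the container would fail to load the certificate.
```

A check like this, run before relying on a config-change restart, would distinguish "haproxy config is bad" from "bundle definition has not caught up yet", which is exactly the ordering gap described above.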
We tracked this down to the haproxy-bundle resource not having a bind_mount to the dir containing the certificate, so the haproxy container failed while accessing /etc/pki/tls/private/overcloud_endpoint.pem simply because it wasn't there. Theoretically deep_compare should've done the right thing (adding such dir to the resource definition) we will try to reproduce to rule out any issue with it. We further investigated with Luca and by redeploying a stock osp13, and then add the TLS external endpoints with a stack redeploy. We think we reproduced the error, although in our case ultimately the environment healed itself and the deploy finished. What we think is happening is the following: . the haproxy pacemaker service file is designed to restart haproxy when either the haproxy configuration changes, or when the pacemaker bundle configuration changes. . when the TLS endpoints are added, the haproxy config is regenerated and the pacemaker bundle needs to bind-mount a new pem file in the haproxy container. . so this triggers two restart of the haproxy containers. . however the first restart is triggered by the config change . when the haproxy container is restarted, haproxy reads the new config, tries to certificates and keys from the pem file, but this pem file is not bind-monted yet (the bundle configuration hasn't been updated yet). . so all haproxy container fail to start In our case, when the haproxy bundle config has been updated, pacemaker cleaned all the errors, restarted the haproxy containers and this time haproxy could access the new pem file, so everything recovered automatically. In our current design, we schedule the "restart on config change" before the "restart on bundle config change" on purpose, and but unfortunately this design breaks use cases such as "upgrade haproxy to use TLS". We need to discuss internally how best to fix the double restart ordering issue. |