Bug 1699101 - [OSP13] Enabling TLS results in timeout during deployment and haproxy containers disappear
Summary: [OSP13] Enabling TLS results in timeout during deployment and haproxy containers disappear
Keywords:
Status: CLOSED DUPLICATE of bug 1643535
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-haproxy
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Michele Baldessari
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-11 19:15 UTC by ggrimaux
Modified: 2019-05-08 07:38 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 07:38:00 UTC
Target Upstream Version:
Embargoed:


Attachments

Description ggrimaux 2019-04-11 19:15:28 UTC
Description of problem:
This comes from a customer case; I was able to reproduce the issue.

On a clean, fresh OSP13 installation, I followed the procedure to install and configure TLS on the overcloud: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/advanced_overcloud_customization/sect-enabling_ssltls_on_the_overcloud
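
For reference, the enable-tls.yaml and inject-trust-anchor.yaml files follow the layout of the environment files shipped under /usr/share/openstack-tripleo-heat-templates/environments/ssl/, roughly like this (a minimal sketch from memory of that procedure; certificate and key material elided, parameter names as in the shipped files as far as I know):

parameter_defaults:
  SSLCertificate: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
  SSLKey: |
    -----BEGIN RSA PRIVATE KEY-----
    ...
    -----END RSA PRIVATE KEY-----
resource_registry:
  OS::TripleO::NodeTLSData: /usr/share/openstack-tripleo-heat-templates/puppet/extraconfig/tls/tls-cert-inject.yaml

inject-trust-anchor.yaml is the same idea for the CA: the root certificate goes into SSLRootCertificate and OS::TripleO::NodeTLSCAData points at puppet/extraconfig/tls/ca-inject.yaml.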

The deployment fails with this:
2019-04-11 16:35:51Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g-CephStorageDeployment_Step3-c6abx2pr42ag]: UPDATE_COMPLETE  Stack UPDATE completed successfully
2019-04-11 16:35:51Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g.CephStorageDeployment_Step3]: UPDATE_COMPLETE  state changed
2019-04-11 16:36:09Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g-ControllerDeployment_Step3-s7kyvaruayay.2]: SIGNAL_IN_PROGRESS  Signal: deployment f6dec962-30d0-4434-ab2b-895b813850a5 succeeded
2019-04-11 16:36:10Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g-ControllerDeployment_Step3-s7kyvaruayay.2]: UPDATE_COMPLETE  state changed
2019-04-11 16:36:14Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g-ControllerDeployment_Step3-s7kyvaruayay.1]: SIGNAL_IN_PROGRESS  Signal: deployment 6faa301b-5485-464d-bcce-085475f1bdd6 succeeded
2019-04-11 16:36:14Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g-ControllerDeployment_Step3-s7kyvaruayay.1]: UPDATE_COMPLETE  state changed
2019-04-11 17:25:59Z [AllNodesDeploySteps]: UPDATE_FAILED  UPDATE aborted (Task update from TemplateResource "AllNodesDeploySteps" [5d539de6-0f6b-4725-bb0b-1f01612db057] Stack "overcloud" [b8920176-a660-447e-9426-d364d169edf1] Timed out)
2019-04-11 17:26:00Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g]: UPDATE_FAILED  Stack UPDATE cancelled
2019-04-11 17:26:00Z [overcloud]: UPDATE_FAILED  Timed out
2019-04-11 17:26:00Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g-ControllerDeployment_Step3-s7kyvaruayay]: UPDATE_FAILED  Stack UPDATE cancelled
2019-04-11 17:26:01Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g.ControllerDeployment_Step3]: UPDATE_FAILED  resources.ControllerDeployment_Step3: Stack UPDATE cancelled
2019-04-11 17:26:01Z [overcloud-AllNodesDeploySteps-mue3k3oadw6g]: UPDATE_FAILED  Resource UPDATE failed: resources.ControllerDeployment_Step3: Stack UPDATE cancelled

 Stack overcloud UPDATE_FAILED 

overcloud.AllNodesDeploySteps.ControllerDeployment_Step3:
  resource_type: OS::TripleO::DeploymentSteps
  physical_resource_id: 63fc1f75-7b7b-4e71-9e40-4601d47af22c
  status: UPDATE_FAILED
  status_reason: |
    resources.ControllerDeployment_Step3: Stack UPDATE cancelled
Heat Stack update failed.
Heat Stack update failed.

The "openstack software deployment show 70049cee-a8b2-4bbc-b69b-6586355f8a6b" that failed:
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Field         | Value                                                                                                                                                                      
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| id            | 70049cee-a8b2-4bbc-b69b-6586355f8a6b                                                                                                                                       
| server_id     | bfe73e5e-d7ad-4481-8ad0-eca26a005e08                                                                                                                                       
| config_id     | d058abec-38fe-40c6-bdd6-2ac5b2cdfc7d                                                                                                                                       
| creation_time | 2019-04-10T19:45:44Z                                                                                                                                                       
| updated_time  | 2019-04-11T18:14:43Z                                                                                                                                                       
| status        | FAILED                                                                                                                                                                     
| status_reason | Deployment cancelled.                                                                                                                                                      
| input_values  | {u'role_data_docker_config': {u'step_1': {u'cinder_volume_image_tag': {u'start_order': 1, u'image': u'192.168.24.1:8787/rhosp13/openstack-cinder-volume:2019-02-28.1', u'co
| action        | UPDATE                                                                                                                                                                     
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

On the controllers, the haproxy-bundle containers are gone (they were all up and running fine before the update, I made sure of that):

[root@controller-0 ~]# docker ps --all|grep haproxy
a7051a82532f        192.168.24.1:8787/rhosp13/openstack-haproxy:2019-02-28.1                     "/docker_puppet_ap..."   2 hours ago         Exited (0) 2 hours ago                        haproxy_init_bundle
123c4c3c834b        192.168.24.1:8787/rhosp13/openstack-haproxy:2019-02-28.1                     "/usr/bin/bootstra..."   2 hours ago         Exited (0) 2 hours ago                        haproxy_restart_bundle
f76af7f46f0a        192.168.24.1:8787/rhosp13/openstack-haproxy:2019-02-28.1                     "/bin/bash -c '/us..."   29 hours ago        Exited (0) 29 hours ago                       haproxy_image_tag

So haproxy_init_bundle was triggered. Here are the docker logs of that container:
Notice: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy_bundle/Tripleo::Pacemaker::Resource_restart_flag[haproxy-clone]/File[/var/lib/tripleo]/ensure: created
Notice: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy_bundle/Tripleo::Pacemaker::Resource_restart_flag[haproxy-clone]/File[/var/lib/tripleo/pacemaker-restarts]/ensure: created
Info: Tripleo::Pacemaker::Resource_restart_flag[haproxy-clone]: Unscheduling all events on Tripleo::Pacemaker::Resource_restart_flag[haproxy-clone]
Info: Creating state file /var/lib/puppet/state/state.yaml
Notice: Applied catalog in 76.56 seconds
Changes:
            Total: 9
Events:
          Success: 9
            Total: 9
Resources:
            Total: 250
          Skipped: 37
      Out of sync: 8
          Changed: 8
Time:
      Concat file: 0.00
        File line: 0.00
   Concat fragment: 0.01
         Firewall: 0.07
             File: 0.13
         Last run: 1555000471
   Pcmk constraint: 21.20
    Pcmk resource: 37.04
      Pcmk bundle: 6.01
   Config retrieval: 7.72
            Total: 81.97
    Pcmk property: 9.79
Version:
           Config: 1555000386
           Puppet: 4.8.2

And here are the docker logs of the haproxy_restart_bundle container:
 Bundle: haproxy-bundle
  Docker: image=192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest network=host options="--user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" replicas=3 run-command="/bin/bash /usr/local/bin/kolla_start"
  Storage Mapping:
   options=ro source-dir=/var/lib/kolla/config_files/haproxy.json target-dir=/var/lib/kolla/config_files/config.json (haproxy-cfg-files)
   options=ro source-dir=/var/lib/config-data/puppet-generated/haproxy/ target-dir=/var/lib/kolla/config_files/src (haproxy-cfg-data)
   options=ro source-dir=/etc/hosts target-dir=/etc/hosts (haproxy-hosts)
   options=ro source-dir=/etc/localtime target-dir=/etc/localtime (haproxy-localtime)
   options=rw source-dir=/var/lib/haproxy target-dir=/var/lib/haproxy (haproxy-var-lib)
   options=ro source-dir=/etc/pki/ca-trust/extracted target-dir=/etc/pki/ca-trust/extracted (haproxy-pki-extracted)
   options=ro source-dir=/etc/pki/tls/certs/ca-bundle.crt target-dir=/etc/pki/tls/certs/ca-bundle.crt (haproxy-pki-ca-bundle-crt)
   options=ro source-dir=/etc/pki/tls/certs/ca-bundle.trust.crt target-dir=/etc/pki/tls/certs/ca-bundle.trust.crt (haproxy-pki-ca-bundle-trust-crt)
   options=ro source-dir=/etc/pki/tls/cert.pem target-dir=/etc/pki/tls/cert.pem (haproxy-pki-cert)
   options=rw source-dir=/dev/log target-dir=/dev/log (haproxy-dev-log)
haproxy-bundle successfully restarted
haproxy-bundle restart invoked

Here is the pacemaker info about haproxy:
[root@controller-0 ~]# pcs status|grep haproxy
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped
* haproxy-bundle-docker-0_start_0 on controller-2 'unknown error' (1): call=145, status=complete, exitreason='Newly created docker container exited after start',
* haproxy-bundle-docker-1_start_0 on controller-2 'unknown error' (1): call=149, status=complete, exitreason='Newly created docker container exited after start',
* haproxy-bundle-docker-2_start_0 on controller-2 'unknown error' (1): call=133, status=complete, exitreason='Newly created docker container exited after start',
* haproxy-bundle-docker-0_start_0 on controller-0 'unknown error' (1): call=139, status=complete, exitreason='Newly created docker container exited after start',
* haproxy-bundle-docker-1_start_0 on controller-0 'unknown error' (1): call=143, status=complete, exitreason='Newly created docker container exited after start',
* haproxy-bundle-docker-2_start_0 on controller-0 'unknown error' (1): call=145, status=complete, exitreason='Newly created docker container exited after start',
* haproxy-bundle-docker-0_start_0 on controller-1 'unknown error' (1): call=137, status=complete, exitreason='Newly created docker container exited after start',
* haproxy-bundle-docker-1_start_0 on controller-1 'unknown error' (1): call=133, status=complete, exitreason='Newly created docker container exited after start',
* haproxy-bundle-docker-2_start_0 on controller-1 'unknown error' (1): call=149, status=complete, exitreason='Newly created docker container exited after start',

So the containers were started but exited right after start, for a reason I haven't been able to identify yet.
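
For anyone else digging into this: the bundle is started with --log-driver=journald, so whatever the vanished containers printed should still be in the journal even after pacemaker removed them, and pacemaker keeps per-resource failure counters. A sketch of the checks I would run (container names taken from the pcs output above):

[root@controller-0 ~]# journalctl CONTAINER_NAME=haproxy-bundle-docker-0    # stdout/stderr of the removed container
[root@controller-0 ~]# pcs resource failcount show haproxy-bundle           # pacemaker's failure counters
[root@controller-0 ~]# pcs status --full | grep -i haproxy                  # failed actions with exit reasons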

My deploy command (the only things I added are the three new TLS-related yaml files: enable-tls.yaml, inject-trust-anchor.yaml and tls-endpoints-public-ip.yaml):
(undercloud) [stack@undercloud-0 ~]$ cat overcloud_deploy_TLS.sh 
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /home/stack/virt/nodes_data.yaml \
-e /home/stack/virt/extra_templates.yaml \
-e /home/stack/virt/docker-images.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
--log-file overcloud_deployment_48.log


I need your help to troubleshoot this.

I have an environment you can have access to if you want to troubleshoot it quickly. Just poke me on IRC.

Version-Release number of selected component (if applicable):
Most recent OSP13 installation.

How reproducible:
100%

Steps to Reproduce:
1. Install OSP13 without TLS
2. Follow the procedure: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/advanced_overcloud_customization/sect-enabling_ssltls_on_the_overcloud
3. The deployment times out and the haproxy containers disappear from each controller node.

Actual results:
The openstack overcloud deploy command times out and the haproxy containers disappear.

Expected results:
TLS is enabled and the deployment completes successfully.

Additional info:

Comment 1 ggrimaux 2019-04-15 11:26:02 UTC
A working workaround is to apply the following to your templates:
parameter_defaults:
  ControllerExtraConfig:
    pacemaker::resource::bundle::deep_compare: true
    pacemaker::resource::ip::deep_compare: true
    pacemaker::resource::ocf::deep_compare: true

And rerun your overcloud deploy command (with your added TLS yaml files of course).
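
One way to apply it (a sketch; the deep-compare.yaml file name is just something made up for this example, not anything the procedure mandates) is to drop the snippet into its own environment file and append it to the deploy script from the description:

(undercloud) [stack@undercloud-0 ~]$ cat /home/stack/virt/deep-compare.yaml
parameter_defaults:
  ControllerExtraConfig:
    pacemaker::resource::bundle::deep_compare: true
    pacemaker::resource::ip::deep_compare: true
    pacemaker::resource::ocf::deep_compare: true

and then add "-e /home/stack/virt/deep-compare.yaml \" to overcloud_deploy_TLS.sh before rerunning it.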

Thank you Michele Baldessari

Comment 2 ggrimaux 2019-04-16 08:31:33 UTC
You also need this package on your overcloud nodes:
puppet-pacemaker-0.7.2-0.20180423212257.el7ost.noarch.rpm
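
A quick way to check whether a node already has that build is simply:

[root@controller-0 ~]# rpm -q puppet-pacemaker

which should report at least puppet-pacemaker-0.7.2-0.20180423212257.el7ost.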

