Bug 1468256

Summary: rhosp-director: HA Overcloud deployment with SSL fails: Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: puppet-tripleo
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: high
Priority: urgent
Version: 12.0 (Pike)
CC: aschultz, dbecker, dprince, jjoyce, jschluet, m.andre, mburns, mcornea, morazi, ohochman, rhel-osp-director-maint, slinaber, tvignaud
Target Milestone: beta
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: puppet-tripleo-7.1.1-0.20170715004705.el7ost openstack-tripleo-heat-templates-7.0.0-0.20170715081739.el7ost
Last Closed: 2017-12-13 21:39:17 UTC
Type: Bug

Description Alexander Chuzhoy 2017-07-06 13:26:35 UTC
rhosp-director: HA Overcloud deployment with SSL fails: Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout

Environment:
openstack-tripleo-heat-templates-7.0.0-0.20170628002128.el7ost.noarch
openstack-puppet-modules-10.0.0-0.20170315222135.0333c73.el7.1.noarch
instack-undercloud-7.1.1-0.20170623182135.el7ost.noarch


Steps to reproduce:
Attempt an HA deployment with SSL.


Result:
The deployment fails:

After breaking the very long one-line output into multiple lines and grepping it for errors:
Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2 deploy_stdout: | ... TASK [Write the config_step hieradata] changed: [localhost] TASK [Run puppet host configuration for step 3] fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "/usr/bin/timeout -s 9 30m /usr/bin/puppet apply --detailed-exitcodes --no-noop /var/lib/tripleo-config/puppet_step_config.pp failed with return code: 6", "rc": 6, "stderr": "exception: connect failed
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout", "
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout"], "stdout": "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend
Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout
Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: Command exceeded timeout", "
Error: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout", "
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout", "
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout"], "stdout": "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend
Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2 deploy_stdout: | ... TASK [Write the config_step hieradata] changed: [localhost] TASK [Run puppet host configuration for step 3] fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "/usr/bin/timeout -s 9 30m /usr/bin/puppet apply --detailed-exitcodes --no-noop /var/lib/tripleo-config/puppet_step_config.pp failed with return code: 6", "rc": 6, "stderr": "exception: connect failed
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout", "
Error: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout"], "stdout": "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend
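
The "break into lines and grep for errors" step used on this output can be sketched in Python. This is only a minimal illustration, not the command the reporter actually ran; the function name extract_errors is hypothetical. It assumes the dump contains JSON-escaped newlines (\n) and colour codes (\u001b[...m), as seen in the os-collect-config journal, and that no error message contains a double quote.

```python
import re
from typing import List

# Matches ANSI colour sequences, both as a real ESC byte and as the
# JSON-escaped "\u001b[...m" form that shows up in the journal output.
ANSI_RE = re.compile(r"(?:\x1b|\\u001b)\[[0-9;]*m")

# An error runs from "Error: " up to the next double quote or newline.
# Assumption: the error text itself never contains a double quote.
ERROR_RE = re.compile(r'Error: [^"\n]*')


def extract_errors(raw: str) -> List[str]:
    """Return the unique 'Error: ...' messages from a one-line failure dump."""
    # Turn JSON-escaped newlines into real ones, then drop colour codes.
    text = ANSI_RE.sub("", raw.replace("\\n", "\n"))
    unique: List[str] = []
    for match in ERROR_RE.findall(text):
        message = match.rstrip()
        if message not in unique:  # keep first-seen order, drop repeats
            unique.append(message)
    return unique
```

Running it over the saved deployment output should collapse the repeated stderr/stderr_lines copies into one entry per failing resource, which makes it easier to see that only the Heat and Cinder db syncs timed out.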







Checking os-collect-config on controller:
Jul 06 03:04:25 overcloud-controller-0.redhat.local os-collect-config[3169]: module list --tree' to see information about modules\n   (file & line not available)\u001b[0m\n\u001b[1;33mWarning: ModuleLoader: module 'mysql' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules\n   (file & line not available)\u001b[0m\n\u001b[1;31mError: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: Command exceeded timeout\u001b[0m\n\u001b[1;31mError: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout\u001b[0m\n\u001b[1;31mError: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout\u001b[0m\n\u001b[1;31mError: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout\u001b[0m\n", "stderr_lines": ["exception: connect failed", "\u001b[1;33mWarning: Facter: Could not retrieve fact='rabbitmq_nodename', resolution='<anonymous>': undefined method `[]' for nil:NilClass\u001b[0m", "\u001b[1;33mWarning: Undefined variable 'deploy_config_name'; ", "   (file & line not available)\u001b[0m", "\u001b[1;33mWarning: ModuleLoader: module 'openstacklib' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", "   (file & line not available)\u001b[0m", "\u001b[1;33mWarning: This method is deprecated, please use the stdlib validate_legacy function, with Pattern[]. There is further documentation for validate_legacy function in the README. 
at [\"/etc/puppet/modules/cinder/manifests/db.pp\", 64]:[\"/etc/puppet/modules/cinder/manifests/init.pp\", 385]", "   (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:25:in `deprecation')\u001b[0m", "\u001b[1;33mWarning: Scope(Class[Cinder]): host is deprecated, has no effect and will be removed in a future release, use backend_host instead\u001b[0m", "\u001b[1;33mWarning: Scope(Class[Cinder]): cinder::rabbit_host, cinder::rabbit_hosts, cinder::rabbit_password, cinder::rabbit_port, cinder
Jul 06 03:04:25 overcloud-controller-0.redhat.local os-collect-config[3169]: e_legacy function in the README. at [\"/etc/puppet/modules/ntp/manifests/init.pp\", 76]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/time/ntp.pp\", 29]", "   (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:25:in `deprecation')\u001b[0m", "\u001b[1;33mWarning: ModuleLoader: module 'ssh' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", "   (file & line not available)\u001b[0m", "\u001b[1;33mWarning: ModuleLoader: module 'timezone' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", "   (file & line not available)\u001b[0m", "\u001b[1;33mWarning: ModuleLoader: module 'mysql' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", "   (file & line not available)\u001b[0m", "\u001b[1;31mError: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Failed to call refresh: Command exceeded timeout\u001b[0m", "\u001b[1;31mError: /Stage[main]/Cinder::Db::Sync/Exec[cinder-manage db_sync]: Command exceeded timeout\u001b[0m", "\u001b[1;31mError: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Failed to call refresh: Command exceeded timeout\u001b[0m", "\u001b[1;31mError: /Stage[main]/Heat::Db::Sync/Exec[heat-dbsync]: Command exceeded timeout\u001b[0m"], "stdout": "\u001b[mNotice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend\u001b[0m\n\u001b[mNotice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend\u001b[0m\n\u001b[mNotice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.\u001b[0m\n\u001b[mNotice: Compiled catalog for overcloud-controller-0.redhat.local in environment 
production in 5.32 seconds\u001b[0m\n\u001b[mNotice: /Stage[main]/Cinder/Cinder_config[DEFAULT/api_paste_config]/ensure: created\u00
Jul 06 03:04:25 overcloud-controller-0.redhat.local os-collect-config[3169]: [2017-07-06 03:04:25,638] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-ansible/793083e2-ffdd-4c9d-9a8e-3293d80312a9_playbook.yaml. [2]
Jul 06 03:04:27 overcloud-controller-0.redhat.local os-collect-config[3169]: [2017-07-06 03:04:27,080] (heat-config) [ERROR] Skipping group os-apply-config with no hook script None



[root@overcloud-controller-0 ~]# cat /var/lib/heat-config/heat-config-ansible/793083e2-ffdd-4c9d-9a8e-3293d80312a9_playbook.yaml
- hosts: localhost
  connection: local
  tasks:
    #####################################################
    # Per step puppet configuration of the baremetal host
    #####################################################
    - name: Write the config_step hieradata
      copy: content="{{dict(step=step|int)|to_json}}" dest=/etc/puppet/hieradata/config_step.json force=true
    - name: Run puppet host configuration for step {{step}}
      # FIXME: modulepath requires ansible 2.4, our builds currently only have 2.3
      # puppet: manifest=/var/lib/tripleo-config/puppet_step_config.pp modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules
      puppet: manifest=/var/lib/tripleo-config/puppet_step_config.pp
    ######################################
    # Generate config via docker-puppet.py
    ######################################
    - name: Run docker-puppet tasks (generate config)
      shell: python /var/lib/docker-puppet/docker-puppet.py
      environment:
        NET_HOST: 'true'
        DEBUG: '{{docker_puppet_debug}}'
      when: step == "1"
      changed_when: false
      check_mode: no
    ##################################################
    # Per step starting of the containers using paunch
    ##################################################
    - name: Check if /var/lib/hashed-tripleo-config/docker-container-startup-config-step_{{step}}.json exists
      stat:
        path: /var/lib/tripleo-config/hashed-docker-container-startup-config-step_{{step}}.json
      register: docker_config_json
    # Note docker-puppet.py generates the hashed-*.json file, which is a copy of
    # the *step_n.json with a hash of the generated external config added
    # This acts as a salt to enable restarting the container if config changes
    - name: Start containers for step {{step}}
      command: paunch --debug apply --file /var/lib/tripleo-config/hashed-docker-container-startup-config-step_{{step}}.json --config-id tripleo_step{{step}} --managed-by tripleo-{{role_name}}
      when: docker_config_json.stat.exists
      changed_when: false
      check_mode: no
    ########################################################
    # Bootstrap tasks, only performed on bootstrap_server_id
    ########################################################
    - name: Run docker-puppet tasks (bootstrap tasks)
      shell: python /var/lib/docker-puppet/docker-puppet.py
      environment:
        CONFIG: /var/lib/docker-puppet/docker-puppet-tasks{{step}}.json
        NET_HOST: "true"
        NO_ARCHIVE: "true"
        STEP: "{{step}}"
      when: deploy_server_id == bootstrap_server_id
      changed_when: false
      check_mode: no

Comment 2 Alexander Chuzhoy 2017-07-06 15:25:14 UTC
Retried the deployment including /home/stack/tripleo-heat-templates/environments/low-memory-usage.yaml


Same result.

Comment 3 Alex Schultz 2017-07-06 15:51:05 UTC
In the past, the db-sync processes have been very sensitive to the I/O performance of the underlying disks. If the database is containerized and the environment runs on a VM, this may cause problems. That said, Sasha mentioned that this only seems to happen when SSL is enabled, so I'm also wondering about the performance of the database when TLS is enabled. Might want to check that as well. In the past we usually hit this with nova or neutron syncs, so I'm not sure whether the heat/cinder db sync timeouts are affected by the setting in low-memory-usage.yaml.

Comment 4 Dan Prince 2017-07-06 17:36:54 UTC
With OSP12 Heat should be executing the 'heat-manage db_sync' command via docker-cmd like this:

http://git.openstack.org/cgit/openstack/tripleo-heat-templates/tree/docker/services/heat-engine.yaml#n110

The stack trace here shows that it is a Puppet resource that is failing. I would like to understand more about why this is happening, since Puppet should not be trying to execute the DB syncs unless Heat is running on bare metal.

Comment 5 Dan Prince 2017-07-06 18:27:10 UTC
Took a look with Sasha at the raw Puppet manifest that fails at step 3 of the deployment. It shows that this is included in the deployment:

include ::tripleo::profile::base::heat::api_cloudwatch

---

AFAIK the cloudwatch API is deprecated. We haven't containerized it, nor do we have plans to, I think. So perhaps this is something we need to "stub out" for the containerized effort, so that users who include the old cloudwatch role are handled gracefully with containers?
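
A quick way to spot bare-metal Heat profiles in a step manifest such as /var/lib/tripleo-config/puppet_step_config.pp is to list its top-level include statements. The following is a rough sketch only: the helper names are hypothetical, and it assumes one "include ::class" statement per line, as in the line quoted in this comment.

```python
import re
from typing import List

# One "include ::some::class" statement per line; commented-out lines
# start with "#" and therefore do not match.
INCLUDE_RE = re.compile(r"^\s*include\s+(?:::)?(?P<name>[\w:]+)", re.MULTILINE)

HEAT_PREFIX = "tripleo::profile::base::heat"


def included_classes(manifest_text: str) -> List[str]:
    """Return the class names pulled in by include statements."""
    return [m.group("name") for m in INCLUDE_RE.finditer(manifest_text)]


def heat_profiles(manifest_text: str) -> List[str]:
    """Return only the Heat profiles that Puppet would apply on the host."""
    return [c for c in included_classes(manifest_text) if c.startswith(HEAT_PREFIX)]
```

Any Heat profile reported for a containerized deployment would be a candidate for pulling in Heat::Db::Sync on the host.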

Comment 6 Alexander Chuzhoy 2017-07-06 21:11:39 UTC
Tried a few more times to deploy with and without SSL.

The HA deployment consistently fails with the same error with SSL and passes without SSL.

Comment 7 Martin André 2017-07-10 08:27:01 UTC
We've debugged with Damien and Omri and identified that the haproxy container fails to start because it's missing /etc/pki/tls/private/overcloud_endpoint.pem. We need to add the bind mount to puppet-tripleo, similar to what https://review.openstack.org/#/c/473854/ does for the non-HA case.
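
For illustration, the missing-mount condition can be expressed as a small check over a container's bind-mount list. This is a sketch under assumptions, not the puppet-tripleo fix itself: the function name is hypothetical, and the volumes are assumed to be plain "source:destination[:mode]" strings in Docker's -v syntax.

```python
from typing import Iterable

CERT_PATH = "/etc/pki/tls/private/overcloud_endpoint.pem"


def mounts_path(volumes: Iterable[str], path: str = CERT_PATH) -> bool:
    """Return True if any 'src:dest[:mode]' bind spec exposes `path`."""
    for spec in volumes:
        parts = spec.split(":")
        if len(parts) < 2:
            continue  # not a bind mount spec, e.g. an anonymous volume
        dest = parts[1]
        # Either the file itself is mounted or one of its parent directories.
        if path == dest or path.startswith(dest.rstrip("/") + "/"):
            return True
    return False
```

For example, mounts_path(["/etc/hosts:/etc/hosts:ro"]) is False, which corresponds to the failing haproxy container here, while adding "/etc/pki/tls/private/overcloud_endpoint.pem:/etc/pki/tls/private/overcloud_endpoint.pem:ro" to the list makes it True.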

Comment 8 Martin André 2017-07-17 14:00:46 UTC
All fixes merged upstream.

Comment 11 Alexander Chuzhoy 2017-10-23 16:59:51 UTC
Verified:
Environment:
puppet-tripleo-7.4.2-0.20171007035632.195db7c.el7ost.noarch
openstack-tripleo-heat-templates-7.0.2-0.20171007062244.el7ost.noarch


The reported issue doesn't reproduce.
Was able to deploy overcloud with SSL.

Comment 15 errata-xmlrpc 2017-12-13 21:39:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462