Bug 1386447

Summary: Newton (HA) Deployments failing w/ httpd service restart timeout
Product: [Community] RDO Reporter: Matt Young <matyoung>
Component: openstack-tripleo-heat-templatesAssignee: Tomas Sedovic <tsedovic>
Status: CLOSED ERRATA QA Contact: Arik Chernetsky <achernet>
Severity: high Docs Contact:
Priority: high    
Version: trunkCC: apevec, chris.brown, jschluet, matyoung, mburns, whayutin
Target Milestone: ---Keywords: Automation, AutomationBlocker
Target Release: trunk   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-22 15:23:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Matt Young 2016-10-19 01:26:59 UTC
Description of problem:

We have seen this issue multiple times in the internal newton tripleo-quickstart based pipelines when deploying an overcloud in virt HA configuration.  What we observe is thus:

deploy command
--------------

openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates --libvirt-type kvm --control-flavor oooq_control --compute-flavor oooq_compute --ceph-storage-flavor oooq_ceph --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml --control-scale 3 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server pool.ntp.org

deploy output
-------------

Started Mistral Workflow. Execution ID: a00f2423-5c6d-4957-94c8-a853457abbe9
Plan updated
Deploying templates in the directory /tmp/tripleoclient-PD8sog/tripleo-heat-templates
Started Mistral Workflow. Execution ID: f800a2b6-6353-4798-9971-b10d9a42194c
2016-10-18 18:22:22Z [overcloud]: CREATE_IN_PROGRESS  Stack CREATE started
2016-10-18 18:22:22Z [overcloud.Networks]: CREATE_IN_PROGRESS  state changed
2016-10-18 18:22:23Z [overcloud.HorizonSecret]: CREATE_IN_PROGRESS  state changed
2016-10-18 18:22:23Z [overcloud.MysqlRootPassword]: CREATE_IN_PROGRESS  state changed

<snip>many CREATE_IN_PROGRESS && CREATE_COMPLETE</snip>

2016-10-18 19:04:21Z [overcloud.Heat Stack create failed.
AllNodesDeploySteps.ComputeDeployment_Step5.0]: SIGNAL_IN_PROGRESS  Signal: deployment 4e020e01-acd5-45f3-ad29-be122f27bcb5 succeeded
2016-10-18 19:04:21Z [overcloud.AllNodesDeploySteps.ComputeDeployment_Step5.0]: CREATE_COMPLETE  state changed
2016-10-18 19:04:21Z [overcloud.AllNodesDeploySteps.ComputeDeployment_Step5]: CREATE_COMPLETE  Stack CREATE completed successfully
2016-10-18 19:04:22Z [overcloud.AllNodesDeploySteps.ComputeDeployment_Step5]: CREATE_COMPLETE  state changed
2016-10-18 19:24:18Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.2]: SIGNAL_IN_PROGRESS  Signal: deployment 48ddd689-3651-47cd-b0cd-e7da00cc70b5 succeeded
2016-10-18 19:24:19Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.2]: CREATE_COMPLETE  state changed
2016-10-18 19:27:18Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.1]: SIGNAL_IN_PROGRESS  Signal: deployment 81a8ffaf-07a2-4b59-9e48-1ccb89335f55 failed (6)
2016-10-18 19:27:18Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.1]: CREATE_FAILED  Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2016-10-18 19:31:18Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.0]: CREATE_FAILED  CREATE aborted
2016-10-18 19:31:18Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5]: CREATE_FAILED  Resource CREATE failed: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
2016-10-18 19:31:19Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step5]: CREATE_FAILED  Error: resources.ControllerDeployment_Step5.resources[1]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
2016-10-18 19:31:19Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED  Resource CREATE failed: Error: resources.ControllerDeployment_Step5.resources[1]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
2016-10-18 19:31:20Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED  Error: resources.AllNodesDeploySteps.resources.ControllerDeployment_Step5.resources[1]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
2016-10-18 19:31:20Z [overcloud]: CREATE_FAILED  Resource CREATE failed: Error: resources.AllNodesDeploySteps.resources.ControllerDeployment_Step5.resources[1]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6

failed_deployment_list.log
--------------------------

(see private comments for links to full logs)

overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 81a8ffaf-07a2-4b59-9e48-1ccb89335f55
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    [mNotice: /Stage[main]/Keystone::Deps/Anchor[keystone::service::end]: Triggered 'refresh' from 6 events[0m
    [mNotice: /Stage[main]/Neutron::Deps/Anchor[neutron::config::end]: Triggered 'refresh' from 30 events[0m
    [mNotice: /Stage[main]/Neutron::Deps/Anchor[neutron::service::begin]: Triggered 'refresh' from 1 events[0m
    [mNotice: /Stage[main]/Neutron::Server/Service[neutron-server]/ensure: ensure changed 'stopped' to 'running'[0m
    [mNotice: /Stage[main]/Neutron::Agents::Ml2::Ovs/Service[neutron-ovs-agent-service]: Triggered 'refresh' from 1 events[0m
    [mNotice: /Stage[main]/Neutron::Agents::Dhcp/Service[neutron-dhcp-service]: Triggered 'refresh' from 1 events[0m
    [mNotice: /Stage[main]/Neutron::Agents::L3/Service[neutron-l3]: Triggered 'refresh' from 1 events[0m
    [mNotice: /Stage[main]/Neutron::Agents::Metadata/Service[neutron-metadata]: Triggered 'refresh' from 1 events[0m
    [mNotice: /Stage[main]/Neutron::Deps/Anchor[neutron::service::end]: Triggered 'refresh' from 5 events[0m
    [mNotice: Finished catalog run in 1081.00 seconds[0m
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    [1;31mWarning: Scope(Class[Gnocchi::Api]): gnocchi:api::keystone_identity_uri is deprecated, use gnocchi::keystone::authtoken::auth_url instead[0m
    [1;31mWarning: Scope(Class[Gnocchi::Api]): gnocchi::api::keystone_auth_uri is deprecated, use gnocchi::keystone::authtoken::auth_uri instead[0m
    [1;31mWarning: Not collecting exported resources without storeconfigs[0m
    [1;31mWarning: Not collecting exported resources without storeconfigs[0m
    [1;31mWarning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.[0m
    [1;31mWarning: Not collecting exported resources without storeconfigs[0m
    [1;31mWarning: Not collecting exported resources without storeconfigs[0m
    [1;31mWarning: Not collecting exported resources without storeconfigs[0m
    [1;31mError: /Stage[main]/Apache::Service/Service[httpd]: Failed to call refresh: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service failed because a timeout was exceeded. See "systemctl status httpd.service" and "journalctl -xe" for details.[0m
    [1;31mError: /Stage[main]/Apache::Service/Service[httpd]: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service failed because a timeout was exceeded. See "systemctl status httpd.service" and "journalctl -xe" for details.[0m
    (truncated, view all with --long)
overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: d55a81aa-3454-4eda-b164-f5b323416c08
  status: CREATE_FAILED
  status_reason: |
    CREATE aborted
  deploy_stdout: |
None
  deploy_stderr: |
None


---

Version-Release number of selected component (if applicable):

We started observing this at:
- https://trunk.rdoproject.org/centos7-newton/b0/bc/b0bca42c26080066f5f649c5cb76746e242f0c23_0dcd1644

Currently (tonight, 10/18) RDO just promoted:
- https://trunk.rdoproject.org/centos7-newton/b2/6c/b26caaaf6e975588be93dbd873f4d995159d9905_70164fd8

which is currently in test/CI (awaiting results)

How reproducible:

Reproduced in CI, and we have successfully reproduced this in isolation as well (see comments below for internal RH links to artifacts

Steps to Reproduce:

We are using tripleo-quickstart to deploy HA --> virt, and CI jobs are reproducing.  If we need to set up a separate reproducer box we can address.  We can also pull a CI node in the failed state out of rotation to debug, if the artifacts are not enough to go on (for now).  

Actual results:
overcloud deploy fails

Expected results:
overcloud deploys in HA configuration

Additional info:

https://review.rdoproject.org/etherpad/p/rdo-internal-issues #38

Comment 2 Alan Pevec 2016-10-19 11:14:19 UTC
> Error: /Stage[main]/Apache::Service/Service[httpd]: Could not restart Service[httpd]: Execution of '/usr/bin/systemctl restart httpd' returned 1: Job for httpd.service failed because a timeout was exceeded. See "systemctl status httpd.service" and "journalctl -xe" for details.

Timeout was increased to 5min in https://review.rdoproject.org/r/1822
Is it really taking that much, is this emulated qemu VM?

Comment 3 wes hayutin 2016-10-19 14:41:49 UTC
@Alan,

If httpd fails to start 3/10 times which is about what we've seen, then yes.
Ideally, this setting should be configurable from the overcloud deployment.
I need to check on that.

It's not 100% clear yet the issue would be resolved w/ a longer timeout.

Comment 4 Matt Young 2016-10-19 15:23:59 UTC
@Alan, this is running in nested KVM, with --libvirt-type kvm being passed to overcloud deploy, so not qemu mode.

Comment 5 Christopher Brown 2017-06-22 15:23:58 UTC
I'm pretty sure these were resolved a while back with various fixes.
I'll close but please re-open if not the case.