Bug 1379274 - overcloud deployment stuck
Summary: overcloud deployment stuck
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Angus Thomas
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-09-26 09:12 UTC by Yogev Rabl
Modified: 2019-05-12 11:07 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-02 16:16:45 UTC
Target Upstream Version:


Attachments
templates used for the deployment (149.61 KB, application/x-gzip)
2016-09-26 09:12 UTC, Yogev Rabl
heat-engine.log (2.24 MB, application/x-gzip)
2016-09-26 09:12 UTC, Yogev Rabl
templates used (10.33 KB, application/x-rar)
2018-03-26 16:57 UTC, d_mor_hua
heat resource list (283.61 KB, text/plain)
2018-03-27 08:56 UTC, d_mor_hua

Description Yogev Rabl 2016-09-26 09:12:11 UTC
Created attachment 1204745 [details]
templates used for the deployment

Description of problem:
An overcloud deployment of 
- 3 controller nodes
- 3 compute nodes
- 3 Ceph storage nodes (each with 10 OSDs) 
got stuck with an error containing a *very* long string. The deployment command is:
# openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/net-two-nic-with-vlans.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    --libvirt-type qemu \
    --control-scale 3 --compute-scale 3 --ceph-storage-scale 3 \
    --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph-storage \
    --ntp-server clock.redhat.com

The templates directory used is attached.

A similar deployment with the same topology was successful in OSPD 9.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.0.0-0.20160907212643.90c852e.2.el7ost.noarch
openstack-heat-templates-0.0.1-0.20160906185549.ac2db55.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
python-heat-tests-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy with the command above and the templates provided in the attachments

Actual results:
The deployment is stuck 

Expected results:
One of two results is expected:
1. If there is a fault, the deployment should fail with errors.
2. If there is no fault, the deployment should succeed.

Additional info:

Comment 1 Yogev Rabl 2016-09-26 09:12:50 UTC
Created attachment 1204746 [details]
heat-engine.log

Comment 2 Yogev Rabl 2016-09-26 11:08:26 UTC
The error in the heat-engine log wasn't the cause of the deployment freeze. The issue lay somewhere else.
Moving the priority to low.

Comment 3 Zane Bitter 2016-09-26 16:06:37 UTC
Yeah, that log message always appears, regardless of success or failure. From the log, it looks like stuff was still in progress at the time the log finished; it's not clear whether it was going to complete or not. If it was not, the more likely cause is a software deployment failing to signal back, for whatever reason.

I'm going to change the component to Director.

Comment 4 James Slagle 2017-03-01 16:10:14 UTC
We need sosreports from the overcloud nodes to see why things may have been stuck,
as well as the output of "heat resource-list -n 5 overcloud", to see which resources were in progress.
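
Once that resource listing has been saved to a file, the rows of interest are the ones that are not in a COMPLETE state. The following is a minimal sketch of such a filter, not part of any OpenStack tool. It assumes the listing is the usual ASCII table and that the header contains columns named resource_name and resource_status; the function name is hypothetical.

```python
def unfinished_resources(table_text):
    """Return (resource_name, resource_status) pairs that are not *_COMPLETE.

    Assumption: table_text is an ASCII table whose header row contains
    the columns "resource_name" and "resource_status".
    """
    header = None
    name_idx = status_idx = None
    pending = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip "+----+" borders and blank lines
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            # The first "|" row is the header; locate the two columns we need.
            header = cells
            name_idx = header.index("resource_name")
            status_idx = header.index("resource_status")
            continue
        if len(cells) != len(header):
            continue  # malformed or wrapped row
        status = cells[status_idx]
        if not status.endswith("_COMPLETE"):
            pending.append((cells[name_idx], status))
    return pending
```

Calling unfinished_resources(open("resource-list.txt").read()) on a saved listing (the file name is only an example) returns the resources that are still in progress or have failed, which is usually a much shorter list than the full output.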

Comment 5 Alex Schultz 2017-03-24 20:07:17 UTC
Closing due to lack of information and updates. Feel free to reopen with logs if this issue occurs again.

Comment 6 d_mor_hua 2018-03-26 16:03:04 UTC
(In reply to Yogev Rabl from comment #0)

Hi, I am having the same issue. Did you reach a conclusion about it? What kind of logs do you need?

Comment 7 Alex Schultz 2018-03-26 16:41:20 UTC
We would need sosreports from the undercloud and the affected overcloud nodes. The original bug report was not specific enough to figure out what was failing. As previously mentioned, the output of 'heat resource-list -n 5 overcloud' would be a start.

Comment 8 d_mor_hua 2018-03-26 16:51:54 UTC
The deployment involves:

3 controller nodes
5 compute nodes (3 SR-IOV and 2 DPDK)
3 Ceph storage nodes

The command used to launch the deployment is:

openstack overcloud deploy --templates --environment-directory /home/stack/environments/ --ntp-server 163.162.16.29

and it remains stuck here:

2018-03-26 16:14:21Z [overcloud.Compute.2.SshHostPubKey]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.ComputeExtraConfigPre]: CREATE_COMPLETE  state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.NodeTLSCAData]: CREATE_COMPLETE  state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.NodeExtraConfig]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:23Z [overcloud.Compute.2.NodeExtraConfig]: CREATE_COMPLETE  state changed
2018-03-26 16:14:35Z [overcloud.Compute.2.SshHostPubKey]: CREATE_COMPLETE  state changed
2018-03-26 16:14:35Z [overcloud.Compute.2]: CREATE_COMPLETE  Stack CREATE completed successfully
2018-03-26 16:14:36Z [overcloud.Compute.2]: CREATE_COMPLETE  state changed
2018-03-26 16:14:36Z [overcloud.Compute]: CREATE_COMPLETE  Stack CREATE completed successfully
2018-03-26 16:14:36Z [overcloud.Compute]: CREATE_COMPLETE  state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap]: CREATE_IN_PROGRESS  Stack CREATE started
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap.EnabledServicesValue]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap.EnabledServicesValue]: CREATE_COMPLETE  state changed
2018-03-26 16:14:44Z [overcloud.ComputeIpListMap]: CREATE_COMPLETE  Stack CREATE completed successfully
2018-03-26 16:14:44Z [overcloud.ComputeIpListMap]: CREATE_COMPLETE  state changed


No error is shown
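
Every resource in the excerpt above ends in CREATE_COMPLETE, so whatever is still pending has to be looked for elsewhere in the event stream. A minimal sketch of how to check that mechanically, assuming each event line has the shape "timestamp [resource]: STATUS reason" as shown above; the function names are hypothetical and this is not how the client itself processes events.

```python
import re

# One event line: "<date> <time> [<resource>]: <STATUS>  <reason>"
EVENT_RE = re.compile(r"^(\S+ \S+) \[([^\]]+)\]: (\S+)\s+(.*)$")


def last_status_by_resource(log_text):
    """Map each resource path to the last status seen for it."""
    latest = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            _timestamp, resource, status, _reason = match.groups()
            latest[resource] = status  # later lines overwrite earlier ones
    return latest


def still_in_progress(log_text):
    """Return the sorted resources whose most recent status is *_IN_PROGRESS."""
    latest = last_status_by_resource(log_text)
    return sorted(r for r, s in latest.items() if s.endswith("_IN_PROGRESS"))


EXCERPT = """\
2018-03-26 16:14:21Z [overcloud.Compute.2.SshHostPubKey]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:35Z [overcloud.Compute.2.SshHostPubKey]: CREATE_COMPLETE  state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap]: CREATE_IN_PROGRESS  Stack CREATE started
2018-03-26 16:14:44Z [overcloud.ComputeIpListMap]: CREATE_COMPLETE  state changed
"""

print(still_in_progress(EXCERPT))
```

Running the same functions over the complete deploy output, rather than only its tail, would list the resources that started but never reported completion.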


openstack baremetal node list
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name        | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
| cf2cbfbf-7e69-491d-aa50-d5331c0707ec | controller0 | 8bf5dc6a-0136-403c-8125-04faa62bf620 | power on    | active             | False       |
| 17ebf3d3-f00e-48df-b3ee-1eed588f1586 | controller1 | 0736209a-b201-4903-a0cd-e8e8e7d8a8b1 | power on    | active             | False       |
| 1ee11087-4db4-4a52-8712-cf9a9cb8ecc8 | controller2 | c8c9e49c-f3a2-4ccb-93c7-eb2f6e8ab7b4 | power on    | active             | False       |
| 0eae7e4d-1e02-4101-947a-b2abe0163434 | compute0    | 940b6c3d-08e1-4f1f-aba8-cb73ab199978 | power on    | active             | False       |
| d242faa2-4c6e-401c-9584-f2e3e335306b | compute1    | e0d1499d-c2e2-4cab-bdeb-74c61f9b0c9f | power on    | active             | False       |
| 22bd879d-a551-4a0b-984f-0c46d8124200 | compute2    | 3c966a6c-e9a2-45df-b3ef-94e0b946e246 | power on    | active             | False       |
| 5d898c66-936a-468d-abb3-ac6276f835a1 | compute3    | 3aed3d8b-298a-4661-8779-f188045b4d9a | power on    | active             | False       |
| 4584c456-fadc-4813-91ce-7afc167c9a9a | compute4    | 7b11c68e-2331-423c-906d-b58eab937400 | power on    | active             | False       |
| 0f438ba7-076b-4f85-b95d-2537a1700e44 | storage1    | 91918cc1-30b7-44e9-9acb-4174bb183f34 | power on    | active             | False       |
| 505e7ac1-8eec-46e5-9cc9-603fbd833263 | storage2    | 8ded8faa-13ce-4cf5-b291-8569ec012679 | power on    | active             | False       |
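| bd2437e0-9436-48e1-bf96-94b4b76729c1 | storage0    | 9a9dd4a7-fd8f-4b5e-9c53-8e6b348bcc06 | power on    | active             | False       |
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
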
openstack overcloud profiles list
+--------------------------------------+-------------+-----------------+-----------------+-------------------+
| Node UUID                            | Node Name   | Provision State | Current Profile | Possible Profiles |
+--------------------------------------+-------------+-----------------+-----------------+-------------------+
| cf2cbfbf-7e69-491d-aa50-d5331c0707ec | controller0 | active          | control         |                   |
| 17ebf3d3-f00e-48df-b3ee-1eed588f1586 | controller1 | active          | control         |                   |
| 1ee11087-4db4-4a52-8712-cf9a9cb8ecc8 | controller2 | active          | control         |                   |
| 0eae7e4d-1e02-4101-947a-b2abe0163434 | compute0    | active          | compute         |                   |
| d242faa2-4c6e-401c-9584-f2e3e335306b | compute1    | active          | compute         |                   |
| 22bd879d-a551-4a0b-984f-0c46d8124200 | compute2    | active          | compute         |                   |
| 5d898c66-936a-468d-abb3-ac6276f835a1 | compute3    | active          | compute         |                   |
| 4584c456-fadc-4813-91ce-7afc167c9a9a | compute4    | active          | compute         |                   |
| 0f438ba7-076b-4f85-b95d-2537a1700e44 | storage1    | active          | ceph-storage    |                   |
| 505e7ac1-8eec-46e5-9cc9-603fbd833263 | storage2    | active          | ceph-storage    |                   |
| bd2437e0-9436-48e1-bf96-94b4b76729c1 | storage0    | active          | ceph-storage    |                   |
+--------------------------------------+-------------+-----------------+-----------------+-------------------+


All nodes appear to be active, but the stack still shows CREATE_IN_PROGRESS, even after hours:

openstack stack list
+--------------------------------------+------------+--------------------+----------------------+--------------+
| ID                                   | Stack Name | Stack Status       | Creation Time        | Updated Time |
+--------------------------------------+------------+--------------------+----------------------+--------------+
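| 6ae96997-cd9e-4143-8c6e-9e530c8f99d8 | overcloud  | CREATE_IN_PROGRESS | 2018-03-26T15:50:00Z | None         |
+--------------------------------------+------------+--------------------+----------------------+--------------+
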
If I interrupt the deploy command, this is the traceback that appears:

  File "/bin/openstack", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 209, in main
    return OpenStackShell().run(argv)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 267, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 180, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 59, in run
    return self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 1200, in take_action
    self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 395, in _deploy_tripleo_heat_templates_tmpdir
    new_tht_root, tht_root)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 467, in _deploy_tripleo_heat_templates
    parsed_args.skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 479, in _try_overcloud_deploy_with_compat_yaml
    skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 254, in _heat_deploy
    skip_deploy_identifier=skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 78, in deploy_and_wait
    orchestration_client, plan_name, marker, action, verbose_events)
  File "/usr/lib/python2.7/site-packages/tripleoclient/utils.py", line 204, in wait_for_stack_ready
    poll_period=5, marker=marker, out=out, nested_depth=2)
  File "/usr/lib/python2.7/site-packages/heatclient/common/event_utils.py", line 228, in poll_for_events
    time.sleep(poll_period)
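
The traceback ends inside the client's polling loop, which is consistent with the client simply waiting for the stack to leave CREATE_IN_PROGRESS. As an illustration of the "fail with errors" outcome requested in the description, here is a minimal sketch of a polling loop with a deadline. It is not the heatclient implementation; wait_for_status, get_status and the parameter names are hypothetical.

```python
import time


def wait_for_status(get_status, timeout_s, poll_period_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it stops reporting *_IN_PROGRESS.

    get_status is any callable returning the current stack status string
    (hypothetical; in practice it would query the orchestration service).
    Raises TimeoutError if the status is still in progress at the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if not status.endswith("_IN_PROGRESS"):
            return status  # e.g. CREATE_COMPLETE or CREATE_FAILED
        if clock() >= deadline:
            raise TimeoutError(
                "stack still %s after %s seconds" % (status, timeout_s))
        sleep(poll_period_s)
```

The sleep and clock parameters exist only so the loop can be exercised without real waiting; with the defaults it sleeps five seconds between polls, matching the poll_period=5 visible in the traceback.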

Comment 9 d_mor_hua 2018-03-26 16:57:18 UTC
Created attachment 1413244 [details]
templates used

Here are the templates used

Comment 10 Alex Schultz 2018-03-26 17:26:37 UTC
It's not obvious what's happening in the output provided. Please provide the output of 'heat resource-list -n 5 overcloud' and also a sosreport from the undercloud. From there we might be able to determine what is happening. Often, when a deployment just hangs, the network configuration is incorrect and the nodes are no longer able to connect back to the undercloud to report their status and continue the deployment. You may also want to log in to the nodes being deployed and verify that they still have connectivity back to the undercloud.
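
One way to run that connectivity check from an overcloud node is a plain TCP connection attempt toward the undercloud. The sketch below uses only the Python standard library. The address and ports in the usage comment are assumptions: 192.0.2.1 is a placeholder for the undercloud's provisioning-network IP, and 8000 and 8004 are the commonly used defaults for heat-api-cfn and heat-api, which should be confirmed against the actual undercloud configuration.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


# Example usage on an overcloud node (placeholder address, assumed ports):
#   for port in (8000, 8004):
#       print(port, can_reach("192.0.2.1", port))
```

A False result for every port points at routing, VLAN or switch problems on the provisioning network rather than at the Heat templates themselves.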

Comment 11 d_mor_hua 2018-03-27 08:56:32 UTC
Created attachment 1413618 [details]
heat resource list

Comment 12 d_mor_hua 2018-03-27 09:02:32 UTC
I cannot upload the sosreport; the file is too large. Is something specific needed from it?

Comment 13 d_mor_hua 2018-03-27 09:32:51 UTC
All nodes are accessible through ssh heat-admin@IP

Comment 14 d_mor_hua 2018-03-27 09:46:02 UTC
Except for the storage nodes.

Comment 15 Alex Schultz 2018-03-27 14:59:42 UTC
So from the resource list you can see that it's failed on the NetworkDeployment of the CephStorage configuration. The CephStorage deployment has also failed.  What version are you attempting to deploy?  Is this OSP10 or something newer? We've seen something similar with Bug 1559536 but that's for newer versions.

Can you provide the messages logs from the ceph nodes?

Comment 16 d_mor_hua 2018-03-29 09:13:07 UTC
Hi, the errors were:

- ceph.storage.yaml [a stray "-" at the end of the network interfaces configuration]
- controller.yaml [an error in the default route setting]
- networking issue [switch configuration]

Thanks for your support
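
A bare "-" with nothing nested under it parses as an empty (null) list item in YAML rather than raising a syntax error, which is one reason a slip like the first one can go unnoticed. The following is a rough, line-based heuristic for spotting such markers before deploying. It is only a sketch: it does not parse YAML, the function name is hypothetical, and it may misjudge unusual layouts such as block scalars.

```python
def dangling_list_markers(yaml_text):
    """Return 1-based line numbers of "-" list markers with no content.

    A marker counts as dangling when the line holds only "-" (optionally
    followed by a comment) and the next non-blank, non-comment line is not
    indented more deeply than the marker itself.
    """
    lines = yaml_text.splitlines()
    hits = []
    for i, raw in enumerate(lines):
        code = raw.split("#", 1)[0].rstrip()  # drop any trailing comment
        if code.strip() != "-":
            continue
        indent = len(code) - len(code.lstrip())
        # Find the next line that carries actual content.
        following = next(
            (l for l in lines[i + 1:]
             if l.strip() and not l.strip().startswith("#")),
            None,
        )
        if following is None:
            next_indent = -1  # end of file: nothing under the marker
        else:
            next_indent = len(following) - len(following.lstrip())
        if next_indent <= indent:
            hits.append(i + 1)
    return hits
```

Running it over each NIC configuration template and reviewing any reported line numbers by hand is cheap compared with waiting hours on a stack that never leaves CREATE_IN_PROGRESS.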

Comment 17 Alex Schultz 2018-04-02 16:16:45 UTC
Closing the bug out again.

