Bug 1379274

Summary: overcloud deployment stuck
Product: Red Hat OpenStack
Reporter: Yogev Rabl <yrabl>
Component: rhosp-director
Assignee: Angus Thomas <athomas>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Omri Hochman <ohochman>
Severity: medium
Docs Contact:
Priority: low
Version: 10.0 (Newton)
CC: aschultz, daniele.morvillo, dbecker, jslagle, lpic.lt, mburns, mcornea, morazi, rhel-osp-director-maint, sbaker, shardy, srevivo, yrabl
Target Milestone: ---
Keywords: Reopened, Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-02 16:16:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments (description, flags):
- templates used for the deployment (flags: none)
- heat-engine.log (flags: none)
- templates used (flags: none)
- heat resource list (flags: none)

Description Yogev Rabl 2016-09-26 09:12:11 UTC
Created attachment 1204745 [details]
templates used for the deployment

Description of problem:
An overcloud deployment of 
- 3 controller nodes
- 3 compute nodes
- 3 Ceph storage nodes (each with 10 OSDs) 
got stuck, with an error containing a *very* long string. The deployment command is:
# openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-two-nic-with-vlans.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml --libvirt-type qemu --control-scale 3 --compute-scale 3 --ceph-storage-scale 3 --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph-storage --ntp-server clock.redhat.com

The templates directory is attached.

A similar deployment with the same topology was successful in OSPD 9.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.0.0-0.20160907212643.90c852e.2.el7ost.noarch
openstack-heat-templates-0.0.1-0.20160906185549.ac2db55.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
python-heat-tests-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy with the command above and the templates provided in the attachments

Actual results:
The deployment is stuck 

Expected results:
Two results are expected:
1. If there is a fault, the deployment should fail with errors.
2. If there is no fault, the deployment should succeed.

Additional info:

Comment 1 Yogev Rabl 2016-09-26 09:12:50 UTC
Created attachment 1204746 [details]
heat-engine.log

Comment 2 Yogev Rabl 2016-09-26 11:08:26 UTC
The error in the heat-engine log wasn't the cause of the deployment freeze; the issue lay elsewhere.
Moving the priority to low.

Comment 3 Zane Bitter 2016-09-26 16:06:37 UTC
Yeah, that log message always appears, regardless of success or failure. From the log, it looks like stuff was still in progress at the time the log finished; it's not clear whether it was going to complete or not. If it was not, it's more likely to be due to a software deployment failing to signal back, for whatever reason.

I'm going to change the component to Director.

Comment 4 James Slagle 2017-03-01 16:10:14 UTC
We need sosreports from the overcloud nodes to see why things may have been stuck, as well as the output of "heat resource list -n 5 overcloud" to see what resources were in progress.

Comment 5 Alex Schultz 2017-03-24 20:07:17 UTC
Closing due to lack of information and updates. Feel free to reopen with logs if this issue occurs again.

Comment 6 d_mor_hua 2018-03-26 16:03:04 UTC
(In reply to Yogev Rabl from comment #0)

Hi, I am having the same issue. Did you reach a conclusion about it? What kind of logs do you need?

Comment 7 Alex Schultz 2018-03-26 16:41:20 UTC
We would need sosreports from the undercloud and the affected overcloud nodes. The original bug report was not specific enough to figure out what was failing. As previously mentioned, the output of 'heat resource list -n 5 overcloud' would be a start.
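
A minimal sketch of how that output could be narrowed down before attaching it, assuming a bash shell on the undercloud with the stack credentials already sourced. The helper name unfinished_resources is hypothetical; it only filters whatever table is piped into it and keeps the rows whose status contains IN_PROGRESS or FAILED.

```shell
# Hypothetical helper: read a resource table on stdin and print only the
# rows that are still in progress or have failed.
unfinished_resources() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E '(IN_PROGRESS|FAILED)' || true
}

# Possible usage (the legacy heat client spells the subcommand "resource-list"):
#   heat resource-list -n 5 overcloud | unfinished_resources
#   openstack stack resource list -n 5 overcloud | unfinished_resources
```

The filter does not replace the full listing or the sosreports; it is only a quick way to see which nested resources never reached a COMPLETE state.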

Comment 8 d_mor_hua 2018-03-26 16:51:54 UTC
The deployment involves:

- 3 controller nodes
- 5 compute nodes (3 SR-IOV and 2 DPDK)
- 3 Ceph storage nodes

The command used to launch the deployment is:

openstack overcloud deploy --templates --environment-directory /home/stack/environments/ --ntp-server 163.162.16.29

and it remains stuck here:

2018-03-26 16:14:21Z [overcloud.Compute.2.SshHostPubKey]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.ComputeExtraConfigPre]: CREATE_COMPLETE  state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.NodeTLSCAData]: CREATE_COMPLETE  state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.NodeExtraConfig]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:23Z [overcloud.Compute.2.NodeExtraConfig]: CREATE_COMPLETE  state changed
2018-03-26 16:14:35Z [overcloud.Compute.2.SshHostPubKey]: CREATE_COMPLETE  state changed
2018-03-26 16:14:35Z [overcloud.Compute.2]: CREATE_COMPLETE  Stack CREATE completed successfully
2018-03-26 16:14:36Z [overcloud.Compute.2]: CREATE_COMPLETE  state changed
2018-03-26 16:14:36Z [overcloud.Compute]: CREATE_COMPLETE  Stack CREATE completed successfully
2018-03-26 16:14:36Z [overcloud.Compute]: CREATE_COMPLETE  state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap]: CREATE_IN_PROGRESS  Stack CREATE started
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap.EnabledServicesValue]: CREATE_IN_PROGRESS  state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap.EnabledServicesValue]: CREATE_COMPLETE  state changed
2018-03-26 16:14:44Z [overcloud.ComputeIpListMap]: CREATE_COMPLETE  Stack CREATE completed successfully
2018-03-26 16:14:44Z [overcloud.ComputeIpListMap]: CREATE_COMPLETE  state changed


No error is shown
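
An event stream like the one above can be checked mechanically for resources that started but never finished. The following is a rough sketch, assuming the deploy output was saved to a file in the same "[resource]: STATUS" format shown here; the function name pending_from_events and the file name deploy.log are hypothetical.

```shell
# Hypothetical helper: read deploy event lines on stdin and print the
# resources whose last seen create status is still CREATE_IN_PROGRESS.
pending_from_events() {
    awk '
        match($0, /\[[^]]+\]/) {
            # Resource name is the text between the square brackets.
            name = substr($0, RSTART + 1, RLENGTH - 2)
            if ($0 ~ /CREATE_IN_PROGRESS/)
                pending[name] = 1
            else if ($0 ~ /CREATE_(COMPLETE|FAILED)/)
                delete pending[name]
        }
        END { for (n in pending) print n }
    ' | sort
}

# Possible usage:
#   pending_from_events < deploy.log
```

The client in the traceback below polls with nested_depth=2, so resources nested deeper than that would not appear in this stream at all; an empty result from this check therefore does not rule out a deeper resource that is still waiting.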


openstack baremetal node list
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name        | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
| cf2cbfbf-7e69-491d-aa50-d5331c0707ec | controller0 | 8bf5dc6a-0136-403c-8125-04faa62bf620 | power on    | active             | False       |
| 17ebf3d3-f00e-48df-b3ee-1eed588f1586 | controller1 | 0736209a-b201-4903-a0cd-e8e8e7d8a8b1 | power on    | active             | False       |
| 1ee11087-4db4-4a52-8712-cf9a9cb8ecc8 | controller2 | c8c9e49c-f3a2-4ccb-93c7-eb2f6e8ab7b4 | power on    | active             | False       |
| 0eae7e4d-1e02-4101-947a-b2abe0163434 | compute0    | 940b6c3d-08e1-4f1f-aba8-cb73ab199978 | power on    | active             | False       |
| d242faa2-4c6e-401c-9584-f2e3e335306b | compute1    | e0d1499d-c2e2-4cab-bdeb-74c61f9b0c9f | power on    | active             | False       |
| 22bd879d-a551-4a0b-984f-0c46d8124200 | compute2    | 3c966a6c-e9a2-45df-b3ef-94e0b946e246 | power on    | active             | False       |
| 5d898c66-936a-468d-abb3-ac6276f835a1 | compute3    | 3aed3d8b-298a-4661-8779-f188045b4d9a | power on    | active             | False       |
| 4584c456-fadc-4813-91ce-7afc167c9a9a | compute4    | 7b11c68e-2331-423c-906d-b58eab937400 | power on    | active             | False       |
| 0f438ba7-076b-4f85-b95d-2537a1700e44 | storage1    | 91918cc1-30b7-44e9-9acb-4174bb183f34 | power on    | active             | False       |
| 505e7ac1-8eec-46e5-9cc9-603fbd833263 | storage2    | 8ded8faa-13ce-4cf5-b291-8569ec012679 | power on    | active             | False       |
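| bd2437e0-9436-48e1-bf96-94b4b76729c1 | storage0    | 9a9dd4a7-fd8f-4b5e-9c53-8e6b348bcc06 | power on    | active             | False       |
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
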
openstack overcloud profiles list
+--------------------------------------+-------------+-----------------+-----------------+-------------------+
| Node UUID                            | Node Name   | Provision State | Current Profile | Possible Profiles |
+--------------------------------------+-------------+-----------------+-----------------+-------------------+
| cf2cbfbf-7e69-491d-aa50-d5331c0707ec | controller0 | active          | control         |                   |
| 17ebf3d3-f00e-48df-b3ee-1eed588f1586 | controller1 | active          | control         |                   |
| 1ee11087-4db4-4a52-8712-cf9a9cb8ecc8 | controller2 | active          | control         |                   |
| 0eae7e4d-1e02-4101-947a-b2abe0163434 | compute0    | active          | compute         |                   |
| d242faa2-4c6e-401c-9584-f2e3e335306b | compute1    | active          | compute         |                   |
| 22bd879d-a551-4a0b-984f-0c46d8124200 | compute2    | active          | compute         |                   |
| 5d898c66-936a-468d-abb3-ac6276f835a1 | compute3    | active          | compute         |                   |
| 4584c456-fadc-4813-91ce-7afc167c9a9a | compute4    | active          | compute         |                   |
| 0f438ba7-076b-4f85-b95d-2537a1700e44 | storage1    | active          | ceph-storage    |                   |
| 505e7ac1-8eec-46e5-9cc9-603fbd833263 | storage2    | active          | ceph-storage    |                   |
| bd2437e0-9436-48e1-bf96-94b4b76729c1 | storage0    | active          | ceph-storage    |                   |
+--------------------------------------+-------------+-----------------+-----------------+-------------------+


All nodes appear active, but the deployment still shows CREATE_IN_PROGRESS, even after hours:

openstack stack list
+--------------------------------------+------------+--------------------+----------------------+--------------+
| ID                                   | Stack Name | Stack Status       | Creation Time        | Updated Time |
+--------------------------------------+------------+--------------------+----------------------+--------------+
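| 6ae96997-cd9e-4143-8c6e-9e530c8f99d8 | overcloud  | CREATE_IN_PROGRESS | 2018-03-26T15:50:00Z | None         |
+--------------------------------------+------------+--------------------+----------------------+--------------+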


If I interrupt the command, this traceback appears:

  File "/bin/openstack", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 209, in main
    return OpenStackShell().run(argv)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 267, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 180, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 59, in run
    return self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 1200, in take_action
    self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 395, in _deploy_tripleo_heat_templates_tmpdir
    new_tht_root, tht_root)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 467, in _deploy_tripleo_heat_templates
    parsed_args.skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 479, in _try_overcloud_deploy_with_compat_yaml
    skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 254, in _heat_deploy
    skip_deploy_identifier=skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 78, in deploy_and_wait
    orchestration_client, plan_name, marker, action, verbose_events)
  File "/usr/lib/python2.7/site-packages/tripleoclient/utils.py", line 204, in wait_for_stack_ready
    poll_period=5, marker=marker, out=out, nested_depth=2)
  File "/usr/lib/python2.7/site-packages/heatclient/common/event_utils.py", line 228, in poll_for_events
    time.sleep(poll_period)

Comment 9 d_mor_hua 2018-03-26 16:57:18 UTC
Created attachment 1413244 [details]
templates used

Here are the templates used

Comment 10 Alex Schultz 2018-03-26 17:26:37 UTC
It's not obvious what's happening in the output provided. Please provide the output of 'heat resource list -n 5 overcloud' and also a sosreport from the undercloud. From there we might be able to determine what is happening. Often, when a deployment just hangs, the network configuration is incorrect and the nodes can no longer connect back to the undercloud to report their status and continue the deployment. You may also want to log in to the nodes being deployed and verify that they still have connectivity back to the undercloud.
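
A rough sketch of such a connectivity check, assuming it is run in bash on an overcloud node (for example after logging in as heat-admin). The function name check_hosts is hypothetical, the ping options assume the Linux iputils ping, and 192.0.2.1 is only a placeholder for the undercloud's provisioning address.

```shell
# Hypothetical helper: run a probe command against each host and report
# whether it succeeded. The probe is passed in so it can be swapped
# (ping, a TCP check, and so on).
check_hosts() {
    local probe="$1"
    shift
    local host
    for host in "$@"; do
        # $probe is intentionally unquoted so its options are split into words.
        if $probe "$host" >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
        fi
    done
}

# Possible usage (placeholder address, replace with the real undercloud IP):
#   check_hosts "ping -c 1 -W 2" 192.0.2.1
```

A FAIL line only shows that the probe did not succeed from that node; it does not say whether the cause is the node's interface configuration, routing, or the switch.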

Comment 11 d_mor_hua 2018-03-27 08:56:32 UTC
Created attachment 1413618 [details]
heat resource list

Comment 12 d_mor_hua 2018-03-27 09:02:32 UTC
I cannot upload the sosreport; the file is too large. Is something specific needed from it?

Comment 13 d_mor_hua 2018-03-27 09:32:51 UTC
All nodes are accessible through ssh heat-admin@IP

Comment 14 d_mor_hua 2018-03-27 09:46:02 UTC
Except for the storage nodes.

Comment 15 Alex Schultz 2018-03-27 14:59:42 UTC
So from the resource list you can see that it's failed on the NetworkDeployment of the CephStorage configuration. The CephStorage deployment has also failed.  What version are you attempting to deploy?  Is this OSP10 or something newer? We've seen something similar with Bug 1559536 but that's for newer versions.

Can you provide the messages logs from the ceph nodes?

Comment 16 d_mor_hua 2018-03-29 09:13:07 UTC
Hi, so the errors were:

- ceph.storage.yaml: a stray "-" at the end of the network interfaces configuration
- controller.yaml: an error in the default route setting
- networking issue: switch configuration

Thanks for your support
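
A stray list marker like the one in the first item is easy to miss by eye. As a rough sketch, assuming bash and grep on the undercloud, the hypothetical helper below lists lines in a template that consist of nothing but a dash; the path in the usage comment is hypothetical as well.

```shell
# Hypothetical helper: print (with line numbers) every line of a YAML file
# that contains only a list dash and optional whitespace.
find_dangling_dashes() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -nE '^[[:space:]]*-[[:space:]]*$' "$1" || true
}

# Possible usage (hypothetical path):
#   find_dangling_dashes /home/stack/templates/nic-configs/ceph-storage.yaml
```

A lone dash is valid YAML when the item's content follows on the next lines, so each hit is only a candidate to review, not a confirmed error.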

Comment 17 Alex Schultz 2018-04-02 16:16:45 UTC
Closing the bug out again.