Created attachment 1204745 [details]
templates used for the deployment

Description of problem:
An overcloud deployment of
- 3 controller nodes
- 3 compute nodes
- 3 Ceph storage nodes (each with 10 OSDs)
got stuck with an error containing a *very* long string. The deployment command is
# openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-two-nic-with-vlans.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml --libvirt-type qemu --control-scale 3 --compute-scale 3 --ceph-storage-scale 3 --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph-storage --ntp-server clock.redhat.com

With the templates directory attached.

A similar deployment with the same topology was successful in OSPD 9

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.0.0-0.20160907212643.90c852e.2.el7ost.noarch
openstack-heat-templates-0.0.1-0.20160906185549.ac2db55.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
python-heat-tests-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy with the command above and the templates provided in the attachments

Actual results:
The deployment is stuck

Expected results:
Two results are expected:
1. if there's a fault - the deployment should fail with errors
2. if there isn't a fault - the deployment should succeed

Additional info:
Created attachment 1204746 [details] heat-engine.log
The error in the heat-engine log wasn't the cause of the deployment freeze. The issue lay somewhere else. Moving the priority to low.
Yeah, that log message always appears, regardless of success or failure. It looks from the log that stuff was still in progress at the time the log finished; it's not clear if it was going to complete or not. If it was not, it's more likely to be due to a failure to signal back from a software deployment, for whatever reason. I'm going to change the component to Director.
We need sosreports from the overcloud nodes to see why things may have been stuck, as well as "heat resource list -n 5 overcloud" to see what resources were in progress.
Closing due to lack of information and updates. Feel free to reopen with logs if this issue occurs again.
(In reply to Yogev Rabl from comment #0)

Hi, I am having the same issue. Did you come to a conclusion about it? What kind of logs do you need?
We would need sosreports from the undercloud and the affected overcloud nodes. The original bug report was not specific enough to figure out what was failing. As previously mentioned 'heat resource list -n 5 overcloud' would be a start.
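To narrow down a long nested resource listing, something like the following could help. It is only a minimal sketch: it assumes the listing has been exported as JSON (the unified `openstack` client can emit JSON with `-f json`) and that each row carries `resource_name`, `resource_type` and `resource_status` keys. The sample rows are invented for illustration and are not taken from this bug.

```python
import json


def unfinished_resources(rows):
    """Return the rows whose status does not end in _COMPLETE.

    Anything still IN_PROGRESS or FAILED is a candidate for the
    resource that is holding the stack up.
    """
    return [
        row for row in rows
        if not row.get("resource_status", "").endswith("_COMPLETE")
    ]


# Hypothetical sample data, shaped like an exported resource listing.
SAMPLE = json.loads("""
[
  {"resource_name": "Controller", "resource_type": "OS::Heat::ResourceGroup",
   "resource_status": "CREATE_COMPLETE"},
  {"resource_name": "NetworkDeployment", "resource_type": "OS::Heat::StructuredDeployment",
   "resource_status": "CREATE_IN_PROGRESS"},
  {"resource_name": "CephStorage", "resource_type": "OS::Heat::ResourceGroup",
   "resource_status": "CREATE_FAILED"}
]
""")

if __name__ == "__main__":
    for row in unfinished_resources(SAMPLE):
        print(row["resource_status"], row["resource_name"], row["resource_type"])
```

Only the rows that are not complete are printed, which is usually a much shorter list than the full output.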
The deployment involves
3 controller
5 compute (3 SR-IOV and 2 DPDK)
3 ceph storage

The command used to launch the deployment is:
openstack overcloud deploy --templates --environment-directory /home/stack/environments/ --ntp-server 163.162.16.29

and it remains stuck here:
2018-03-26 16:14:21Z [overcloud.Compute.2.SshHostPubKey]: CREATE_IN_PROGRESS state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.ComputeExtraConfigPre]: CREATE_COMPLETE state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.NodeTLSCAData]: CREATE_COMPLETE state changed
2018-03-26 16:14:22Z [overcloud.Compute.2.NodeExtraConfig]: CREATE_IN_PROGRESS state changed
2018-03-26 16:14:23Z [overcloud.Compute.2.NodeExtraConfig]: CREATE_COMPLETE state changed
2018-03-26 16:14:35Z [overcloud.Compute.2.SshHostPubKey]: CREATE_COMPLETE state changed
2018-03-26 16:14:35Z [overcloud.Compute.2]: CREATE_COMPLETE Stack CREATE completed successfully
2018-03-26 16:14:36Z [overcloud.Compute.2]: CREATE_COMPLETE state changed
2018-03-26 16:14:36Z [overcloud.Compute]: CREATE_COMPLETE Stack CREATE completed successfully
2018-03-26 16:14:36Z [overcloud.Compute]: CREATE_COMPLETE state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap]: CREATE_IN_PROGRESS state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap]: CREATE_IN_PROGRESS Stack CREATE started
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap.EnabledServicesValue]: CREATE_IN_PROGRESS state changed
2018-03-26 16:14:43Z [overcloud.ComputeIpListMap.EnabledServicesValue]: CREATE_COMPLETE state changed
2018-03-26 16:14:44Z [overcloud.ComputeIpListMap]: CREATE_COMPLETE Stack CREATE completed successfully
2018-03-26 16:14:44Z [overcloud.ComputeIpListMap]: CREATE_COMPLETE state changed

No error is shown

openstack baremetal node list
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name        | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
| cf2cbfbf-7e69-491d-aa50-d5331c0707ec | controller0 | 8bf5dc6a-0136-403c-8125-04faa62bf620 | power on    | active             | False       |
| 17ebf3d3-f00e-48df-b3ee-1eed588f1586 | controller1 | 0736209a-b201-4903-a0cd-e8e8e7d8a8b1 | power on    | active             | False       |
| 1ee11087-4db4-4a52-8712-cf9a9cb8ecc8 | controller2 | c8c9e49c-f3a2-4ccb-93c7-eb2f6e8ab7b4 | power on    | active             | False       |
| 0eae7e4d-1e02-4101-947a-b2abe0163434 | compute0    | 940b6c3d-08e1-4f1f-aba8-cb73ab199978 | power on    | active             | False       |
| d242faa2-4c6e-401c-9584-f2e3e335306b | compute1    | e0d1499d-c2e2-4cab-bdeb-74c61f9b0c9f | power on    | active             | False       |
| 22bd879d-a551-4a0b-984f-0c46d8124200 | compute2    | 3c966a6c-e9a2-45df-b3ef-94e0b946e246 | power on    | active             | False       |
| 5d898c66-936a-468d-abb3-ac6276f835a1 | compute3    | 3aed3d8b-298a-4661-8779-f188045b4d9a | power on    | active             | False       |
| 4584c456-fadc-4813-91ce-7afc167c9a9a | compute4    | 7b11c68e-2331-423c-906d-b58eab937400 | power on    | active             | False       |
| 0f438ba7-076b-4f85-b95d-2537a1700e44 | storage1    | 91918cc1-30b7-44e9-9acb-4174bb183f34 | power on    | active             | False       |
| 505e7ac1-8eec-46e5-9cc9-603fbd833263 | storage2    | 8ded8faa-13ce-4cf5-b291-8569ec012679 | power on    | active             | False       |
| bd2437e0-9436-48e1-bf96-94b4b76729c1 | storage0    | 9a9dd4a7-fd8f-4b5e-9c53-8e6b348bcc06 | power on    | active             | False       |
+--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+

openstack overcloud profiles list
+--------------------------------------+-------------+-----------------+-----------------+-------------------+
| Node UUID                            | Node Name   | Provision State | Current Profile | Possible Profiles |
+--------------------------------------+-------------+-----------------+-----------------+-------------------+
| cf2cbfbf-7e69-491d-aa50-d5331c0707ec | controller0 | active          | control         |                   |
| 17ebf3d3-f00e-48df-b3ee-1eed588f1586 | controller1 | active          | control         |                   |
| 1ee11087-4db4-4a52-8712-cf9a9cb8ecc8 | controller2 | active          | control         |                   |
| 0eae7e4d-1e02-4101-947a-b2abe0163434 | compute0    | active          | compute         |                   |
| d242faa2-4c6e-401c-9584-f2e3e335306b | compute1    | active          | compute         |                   |
| 22bd879d-a551-4a0b-984f-0c46d8124200 | compute2    | active          | compute         |                   |
| 5d898c66-936a-468d-abb3-ac6276f835a1 | compute3    | active          | compute         |                   |
| 4584c456-fadc-4813-91ce-7afc167c9a9a | compute4    | active          | compute         |                   |
| 0f438ba7-076b-4f85-b95d-2537a1700e44 | storage1    | active          | ceph-storage    |                   |
| 505e7ac1-8eec-46e5-9cc9-603fbd833263 | storage2    | active          | ceph-storage    |                   |
| bd2437e0-9436-48e1-bf96-94b4b76729c1 | storage0    | active          | ceph-storage    |                   |
+--------------------------------------+-------------+-----------------+-----------------+-------------------+

All nodes seem active but the deployment still shows as "IN PROGRESS", even after hours

openstack stack list
+--------------------------------------+------------+--------------------+----------------------+--------------+
| ID                                   | Stack Name | Stack Status       | Creation Time        | Updated Time |
+--------------------------------------+------------+--------------------+----------------------+--------------+
| 6ae96997-cd9e-4143-8c6e-9e530c8f99d8 | overcloud  | CREATE_IN_PROGRESS | 2018-03-26T15:50:00Z | None         |
+--------------------------------------+------------+--------------------+----------------------+--------------+

If I try to interrupt it, this is the message that appears:
  File "/bin/openstack", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 209, in main
    return OpenStackShell().run(argv)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 135, in run
    ret_val = super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 267, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/osc_lib/shell.py", line 180, in run_subcommand
    ret_value = super(OpenStackShell, self).run_subcommand(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 59, in run
    return self.take_action(parsed_args) or 0
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 1200, in take_action
    self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 395, in _deploy_tripleo_heat_templates_tmpdir
    new_tht_root, tht_root)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 467, in _deploy_tripleo_heat_templates
    parsed_args.skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 479, in _try_overcloud_deploy_with_compat_yaml
    skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 254, in _heat_deploy
    skip_deploy_identifier=skip_deploy_identifier)
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 78, in deploy_and_wait
    orchestration_client, plan_name, marker, action, verbose_events)
  File "/usr/lib/python2.7/site-packages/tripleoclient/utils.py", line 204, in wait_for_stack_ready
    poll_period=5, marker=marker, out=out, nested_depth=2)
  File "/usr/lib/python2.7/site-packages/heatclient/common/event_utils.py", line 228, in poll_for_events
    time.sleep(poll_period)
Created attachment 1413244 [details] templates used Here are the templates used
It's not obvious what's happening in the output provided. Please provide 'heat resource list -n 5 overcloud' and also a sosreport from the undercloud. From there we might be able to determine what is happening. Many times if the deployment just hangs, the network configuration is incorrect and the nodes are no longer able to connect back to the undercloud to report the status and continue the deployment. You may also want to login to the node being deployed and verify they still have connectivity back to the undercloud.
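For the connectivity check, a quick TCP probe run from an overcloud node can be enough to tell whether the undercloud is reachable at all. The sketch below is a rough illustration using only the Python standard library; the address and ports in the usage note are assumptions and need to be replaced with the values from the actual undercloud.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


def report(host, ports, timeout=3.0):
    """Return one human-readable status line per probed port."""
    return [
        "%s:%d %s" % (host, port,
                      "reachable" if can_connect(host, port, timeout) else "NOT reachable")
        for port in ports
    ]
```

As a usage example, `print("\n".join(report("192.0.2.1", [8000, 8004])))` would probe a hypothetical undercloud address on 8000 and 8004, the usual default ports for heat-api-cfn and heat-api. A port that is not reachable from a node that has already had its network configuration applied points towards the kind of network problem described above.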
Created attachment 1413618 [details] heat resource list
Cannot upload the sosreport, the file is too large. Is something specific needed?
All nodes are accessible through ssh heat-admin@IP
besides the storage nodes
So from the resource list you can see that it's failed on the NetworkDeployment of the CephStorage configuration. The CephStorage deployment has also failed. What version are you attempting to deploy? Is this OSP10 or something newer? We've seen something similar with Bug 1559536 but that's for newer versions. Can you provide the messages logs from the ceph nodes?
Hi,
so the errors were:
- ceph.storage.yaml [a "-" at the end of the network interfaces configuration]
- controller.yaml [error in setting the default route]
- networking issue [switch configuration]
Thanks for your support
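A stray "-" like the one in the first item is easy to miss, because a bare list marker is still syntactically valid YAML (it just yields an empty list entry). As a hedged illustration, the sketch below scans a template for list markers that have no content after them. The exact form of the stray "-" in the reporter's file is not known, so the sample text here is purely hypothetical, and the check is a simple line-based heuristic rather than a real YAML parser.

```python
def dangling_dashes(text):
    """Return 1-based line numbers of bare "-" markers with no nested content.

    A bare "-" is fine when the next meaningful line is indented deeper
    (the item's body follows). It is suspicious when nothing follows, or
    when the next line is at the same or a shallower indentation.
    """
    lines = text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if line.strip() != "-":
            continue
        indent = len(line) - len(line.lstrip())
        # Next line that is neither blank nor a comment, if any.
        following = next(
            (l for l in lines[index + 1:]
             if l.strip() and not l.lstrip().startswith("#")),
            None,
        )
        if following is None or len(following) - len(following.lstrip()) <= indent:
            hits.append(index + 1)
    return hits


# Hypothetical fragment shaped like a network interface list.
SAMPLE = "\n".join([
    "network_config:",
    "  -",
    "    type: interface",
    "    name: nic1",
    "  -",
    "    type: vlan",
    "    vlan_id: 30",
    "  -",
])

if __name__ == "__main__":
    print(dangling_dashes(SAMPLE))
```

Running a check of this kind over the custom NIC templates before deploying is cheap compared with waiting hours on a stack that never leaves CREATE_IN_PROGRESS.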
Closing the bug out again.