Description of problem:
Referring to bug https://bugzilla.redhat.com/show_bug.cgi?id=1417914, we are not able to get the additional compute nodes to power on during the deployment.

Version-Release number of selected component (if applicable):
RHOS 8

How reproducible:
Always

Actual results:
The additional compute nodes do not power on during the deployment.

Expected results:
Nodes should power on during deployment (stack update).

Additional info:
There was an earlier failed scale-out operation: the deployment completed successfully, but 'nova hypervisor-list' did not show the newly scaled nodes. Also, all the ironic nodes were in maintenance mode. They were deleted from the undercloud using 'nova delete <instance_id>' and 'ironic node-delete <node_id>' and then introspected again to carry on with the deployment, after which this issue is seen.
Could you please clarify the exact sequence of events?

1. Do you see 'nova hypervisor-stats' reporting enough free resources before scale-out?
2. Do you see new nova instances created during the scale out?
3. What's the status of the nova instances, if they do get created? Do you see ironic nodes getting assigned to them?

The following walk-through may help you figure out the part of the deployment that failed: http://tripleo.org/troubleshooting/troubleshooting.html#identifying-failed-component
(In reply to Dmitry Tantsur from comment #2)
> Could you please clarify the exact sequence of events?
>
> 1. Do you see 'nova hypervisor-stats' reporting enough free resources before
> scale-out?

I will provide you with the output soon.

> 2. Do you see new nova instances created during the scale out?

No. We do not see nova instances created during scale out.

> 3. What's the status of the nova instances, if they do get created? Do you
> see ironic nodes getting assigned to them?

Instances do not get created. We see the ironic node in the power-off state, and the deployment gets stuck at the 'ComputeAllNodesDeployment' resource.

Please note that there was an earlier failed scale-out operation: the deployment completed successfully, but 'nova hypervisor-list' did not show the newly scaled nodes. Also, all the ironic nodes were in maintenance mode. They were deleted from the undercloud using 'nova delete <instance_id>' and 'ironic node-delete <node_id>' and then introspected again to carry on with the deployment, after which this issue is seen.
Output of 'nova hypervisor-stats':

$ nova hypervisor-stats
+----------------------+--------+
| Property             | Value  |
+----------------------+--------+
| count                | 9      |
| current_workload     | 0      |
| disk_available_least | 277    |
| free_disk_gb         | 1462   |
| free_ram_mb          | 786432 |
| local_gb             | 1662   |
| local_gb_used        | 200    |
| memory_mb            | 806912 |
| memory_mb_used       | 20480  |
| running_vms          | 5      |
| vcpus                | 53     |
| vcpus_used           | 5      |
+----------------------+--------+
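For reference, a minimal sketch of how the free capacity asked about in question 1 can be derived from the figures above. `parse_stats` is a hypothetical helper written for this illustration, not part of novaclient. Treating `count - running_vms` as the number of idle nodes assumes one instance per bare-metal hypervisor, which this report does not confirm.

```python
def parse_stats(table: str) -> dict:
    """Parse the ASCII table printed by 'nova hypervisor-stats' into a dict.

    Hypothetical helper for illustration only; not an OpenStack API.
    """
    stats = {}
    for line in table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only data rows: exactly two cells, the second one numeric.
        if len(cells) == 2 and cells[1].isdigit():
            stats[cells[0]] = int(cells[1])
    return stats


# Output copied from the comment above.
OUTPUT = """
+----------------------+--------+
| Property             | Value  |
+----------------------+--------+
| count                | 9      |
| current_workload     | 0      |
| disk_available_least | 277    |
| free_disk_gb         | 1462   |
| free_ram_mb          | 786432 |
| local_gb             | 1662   |
| local_gb_used        | 200    |
| memory_mb            | 806912 |
| memory_mb_used       | 20480  |
| running_vms          | 5      |
| vcpus                | 53     |
| vcpus_used           | 5      |
+----------------------+--------+
"""

stats = parse_stats(OUTPUT)

# Free capacity as total minus used.
free_vcpus = stats["vcpus"] - stats["vcpus_used"]
free_ram_mb = stats["memory_mb"] - stats["memory_mb_used"]
free_disk_gb = stats["local_gb"] - stats["local_gb_used"]

# Assumption: each running instance occupies exactly one hypervisor entry.
idle_hypervisors = stats["count"] - stats["running_vms"]

print(f"free vCPUs:       {free_vcpus}")
print(f"free RAM (MB):    {free_ram_mb}")
print(f"free disk (GB):   {free_disk_gb}")
print(f"idle hypervisors: {idle_hypervisors}")
```

The derived RAM and disk values match the `free_ram_mb` and `free_disk_gb` rows that nova reports directly, so the output is at least internally consistent. Whether that capacity is actually schedulable is a separate question that these totals cannot answer.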
Could you please check if the nova-compute logs on the undercloud contain any errors? If not, I'm probably not the right person to continue the investigation, as it's indeed somewhere in Heat.
There are no errors in nova-compute. From what I observed in nova, there are no requests to create instances. Possibly Heat is not deploying the compute nodes.
Reading between the lines, it sounds like you scaled out, then deleted the servers from Nova behind Heat's back, then tried to update Heat again with the same scale. Likely it's stuck trying to update software deployments on machines that will never reply because you've already deleted them from Nova. You need to delete the phantom machines from Heat by scaling down to the previous size first before trying to scale up again. Note that you will likely need to use the workaround documented in bug 1313885 to scale down (since there is a deployment that runs during the delete phase too).
Zane, thanks for the investigation! I'm moving this back to you, as it proved not to be (directly) related to the hardware provisioning phase. Please triage on. Unfortunately, I cannot help further. (Also removing needinfo from Angus, as I don't see how he can help here.)