1418566 – Not able to scale up after manually removing nodes from nova

Summary: Not able to scale up after manually removing nodes from nova

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rhosp-director
Sub Component:
Version:	8.0 (Liberty)
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	async
Target Release:	---
Assignee:	Zane Bitter
QA Contact:	Omri Hochman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-02-02 06:05 UTC by Chaitanya Shastri
Modified:	2020-04-15 15:12 UTC
CC List:	14 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-07-28 17:27:16 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1417914	0	high	CLOSED	ipmitool not able to bring up the node	2021-02-22 00:41:40 UTC

Description Chaitanya Shastri 2017-02-02 06:05:20 UTC

Description of problem:
Referring to bug https://bugzilla.redhat.com/show_bug.cgi?id=1417914, the additional compute nodes do not power on during the deployment.

Version-Release number of selected component (if applicable):
RHOS 8

How reproducible:
Always

Actual results:
Not able to get the additional compute nodes up during the deployment

Expected results:
Nodes should power on during deployment (stack update)

Additional info:
A failed scale-out operation was carried out earlier: the deployment reported success, but 'nova hypervisor-list' did not show the newly scaled nodes, and all the ironic nodes were in maintenance mode. The nodes were deleted from the undercloud using 'nova delete <instance_id>' and 'ironic node-delete <node_id>' and then introspected again to continue the deployment, after which this issue is seen.

Comment 2 Dmitry Tantsur 2017-02-02 10:23:40 UTC

Could you please clarify the exact sequence of events?

1. Do you see 'nova hypervisor-stats' reporting enough free resources before scale-out?

2. Do you see new nova instances created during the scale out?

3. What's the status of the nova instances, if they do get created? Do you see ironic nodes getting assigned to them?

The following walk-through may help you figure out which part of the deployment failed: http://tripleo.org/troubleshooting/troubleshooting.html#identifying-failed-component

Comment 3 Chaitanya Shastri 2017-02-02 11:27:38 UTC

(In reply to Dmitry Tantsur from comment #2)
> Could you please clarify the exact sequence of events?
> 
> 1. Do you see 'nova hypervisor-stats' reporting enough free resources before
> scale-out?

I will provide you the output soon.

> 2. Do you see new nova instances created during the scale out?

No. We do not see nova instances created during scale out.

> 3. What's the status of the nova instances, if they do get created? Do you
> see ironic nodes getting assigned to them?

Instances do not get created. We see the ironic node in the power-off state, and the deployment gets stuck at the 'ComputeAllNodesDeployment' resource.

Please note that a failed scale-out operation was carried out earlier: the deployment reported success, but 'nova hypervisor-list' did not show the newly scaled nodes, and all the ironic nodes were in maintenance mode. The nodes were deleted from the undercloud using 'nova delete <instance_id>' and 'ironic node-delete <node_id>' and then introspected again to continue the deployment, after which this issue is seen.

Comment 4 Chaitanya Shastri 2017-02-02 11:34:11 UTC

Output of 'nova hypervisor-stats':

$ nova hypervisor-stats
+----------------------+--------+
| Property             | Value  |
+----------------------+--------+
| count                | 9      |
| current_workload     | 0      |
| disk_available_least | 277    |
| free_disk_gb         | 1462   |
| free_ram_mb          | 786432 |
| local_gb             | 1662   |
| local_gb_used        | 200    |
| memory_mb            | 806912 |
| memory_mb_used       | 20480  |
| running_vms          | 5      |
| vcpus                | 53     |
| vcpus_used           | 5      |
+----------------------+--------+
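For comparing this output before and after a scale-out attempt, the table can be turned into numbers with a few lines of Python. This is only a sketch: the function name `parse_stats_table` is made up for illustration, and reading `count - running_vms` as "nodes without an instance" assumes that each hypervisor entry on the undercloud corresponds to one bare-metal node, which this report does not confirm.

```python
def parse_stats_table(text):
    """Parse a two-column Property/Value table (as printed above) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "Property":
            continue  # skip the header row and anything malformed
        stats[cells[0]] = int(cells[1])
    return stats


# Values copied from the output in this comment.
SAMPLE = """
+----------------------+--------+
| Property             | Value  |
+----------------------+--------+
| count                | 9      |
| current_workload     | 0      |
| disk_available_least | 277    |
| free_disk_gb         | 1462   |
| free_ram_mb          | 786432 |
| local_gb             | 1662   |
| local_gb_used        | 200    |
| memory_mb            | 806912 |
| memory_mb_used       | 20480  |
| running_vms          | 5      |
| vcpus                | 53     |
| vcpus_used           | 5      |
+----------------------+--------+
"""

stats = parse_stats_table(SAMPLE)

# Assumption: one hypervisor entry per bare-metal node.
nodes_without_instance = stats["count"] - stats["running_vms"]
free_vcpus = stats["vcpus"] - stats["vcpus_used"]

print("hypervisors:", stats["count"])
print("nodes without an instance:", nodes_without_instance)
print("free vcpus:", free_vcpus)
```

Under that assumption, the figures suggest that free capacity is reported, which is consistent with the problem lying elsewhere than in resource availability.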

Comment 7 Dmitry Tantsur 2017-02-02 16:15:19 UTC

Could you please check if the nova-compute logs on the undercloud contain any errors? If not, I'm probably not the right person to continue the investigation, as it's indeed somewhere in Heat.

Comment 8 PURANDHAR SAIRAM MANNIDI 2017-02-02 23:48:07 UTC

There are no errors in nova-compute. From what I observed in nova, there are no requests to create instances. Possibly heat is not deploying the computes.

Comment 10 Zane Bitter 2017-02-03 05:01:48 UTC

Reading between the lines, it sounds like you scaled out, then deleted the servers from Nova behind Heat's back, then tried to update Heat again with the same scale. Likely it's stuck trying to update software deployments on machines that will never reply because you've already deleted them from Nova. You need to delete the phantom machines from Heat by scaling down to the previous size first before trying to scale up again.

Note that you will likely need to use the workaround documented in bug 1313885 to scale down (since there is a deployment that runs during the delete phase too).
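The mismatch described above can be pictured as a set difference between the servers Heat still tracks and the servers Nova actually has. The following is a minimal sketch with made-up data: the server names, the function `find_phantom_servers`, and the two input lists are all hypothetical, and in practice the lists would have to be gathered from Heat and Nova by whatever means is available.

```python
def find_phantom_servers(heat_server_ids, nova_server_ids):
    """Return servers Heat still tracks that no longer exist in Nova."""
    return sorted(set(heat_server_ids) - set(nova_server_ids))


# Hypothetical example: Heat was scaled out to four compute servers,
# then two of them were deleted directly from Nova.
heat_view = ["compute-0", "compute-1", "compute-2", "compute-3"]
nova_view = ["compute-0", "compute-1"]

phantoms = find_phantom_servers(heat_view, nova_view)

# Scale Heat down to this count first, then scale up again.
scale_down_target = len(heat_view) - len(phantoms)

print("phantom servers:", phantoms)
print("scale down to:", scale_down_target)
```

In this made-up case, the scale-down target equals the size before the failed scale-out, matching the advice to return to the previous size before scaling up again.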

Comment 12 Dmitry Tantsur 2017-02-03 09:31:07 UTC

Zane, thanks for the investigation! I'm moving this back to you, as it proved not to be (directly) related to the hardware provisioning phase. Please continue the triage. Unfortunately, I cannot help further.

(also removing needinfo from Angus, I don't see how he can help here)
