Description of problem:

director fails to deploy an additional compute node.

Version-Release number of selected component (if applicable): 7.2

How reproducible: Every time

Steps to Reproduce:
1. Attempt to deploy an additional compute node from director

Actual results:
The deploy fails.

Expected results:
The deploy should succeed.

Additional info:

It appears that the actual software deploy is working correctly: the software is available on the compute node and the network has been configured. However, the OpenStack services themselves do not appear to be configured, and nothing is running. Here is the output from heat:

[stack@blkcclu001 ~]$ heat resource-show overcloud Compute
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Property               | Value                                                                                                                                            |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes             | {                                                                                                                                                |
|                        |   "attributes": null,                                                                                                                            |
|                        |   "refs": null                                                                                                                                   |
|                        | }                                                                                                                                                |
| description            |                                                                                                                                                  |
| links                  | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b/resources/Compute (self)      |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b (stack)                       |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud-Compute-vyutisc7pljo/b4ac65d4-8b33-4337-859a-1a453b8f3034 (nested) |
| logical_resource_id    | Compute                                                                                                                                          |
| physical_resource_id   | b4ac65d4-8b33-4337-859a-1a453b8f3034                                                                                                             |
| required_by            | AllNodesExtraConfig                                                                                                                              |
|                        | ComputeCephDeployment                                                                                                                            |
|                        | allNodesConfig                                                                                                                                   |
|                        | ComputeAllNodesDeployment                                                                                                                        |
|                        | ComputeNodesPostDeployment                                                                                                                       |
|                        | ComputeAllNodesValidationDeployment                                                                                                              |
| resource_name          | Compute                                                                                                                                          |
| resource_status        | UPDATE_FAILED                                                                                                                                    |
| resource_status_reason | resources.Compute: MessagingTimeout: resources[8]: Timed out waiting for a reply to message ID eabc9302615648ab8b29adc361b4bfda                  |
| resource_type          | OS::Heat::ResourceGroup                                                                                                                          |
| updated_time           | 2016-02-05T15:19:19Z                                                                                                                             |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+

I was unable to find any further information regarding the timeout. We attempted to re-run the deploy, but it fails quickly with the same error.
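For reference: the failing Compute resource is an OS::Heat::ResourceGroup, and the MessagingTimeout points at resources[8] inside the nested stack named in the (nested) link above. Drilling into that nested stack to find the failed member should be possible with something like the following (a sketch only; the nested stack name is taken from the link, and in a ResourceGroup the member resources are simply named by index):

[stack@blkcclu001 ~]$ heat resource-list overcloud-Compute-vyutisc7pljo
[stack@blkcclu001 ~]$ heat resource-show overcloud-Compute-vyutisc7pljo 8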
*** This bug has been marked as a duplicate of bug 1290949 ***
My bad, this is on an undercloud machine with 8 cores so it shouldn't have been closed as a duplicate; there's something else to investigate here.
Could you please try increasing the RPC response timeout? As root:

openstack-config --set /etc/heat/heat.conf DEFAULT rpc_response_timeout 600
systemctl restart openstack-heat-engine

My current theory is that there is a flood of RPC calls during stack updates, and as the stack is scaled up the volume of these concurrent calls increases, leading to these timeouts. I've attached an upstream change which should reduce the number of these concurrent RPC calls during stack updates enough to avoid this problem, but hopefully raising rpc_response_timeout will be enough for now.
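To confirm the new value took effect before restarting the service, openstack-config can read it back (a minimal check, assuming the same heat.conf path as above):

openstack-config --get /etc/heat/heat.conf DEFAULT rpc_response_timeout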
The rpc_response_timeout default has been 600 since 7.1. Can you confirm whether this undercloud was upgraded from 7.0? If so, that would explain why it wasn't already set to 600. In that case, if the change in the above comment fixes the problem, this could be marked as resolved.
Can you also check "journalctl -u openstack-heat-engine" to see if there are any suspicious exceptions in the journal?
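For example, something along these lines should surface any tracebacks around the failure window (just a sketch; the date comes from the updated_time above and the grep context sizes are arbitrary):

journalctl -u openstack-heat-engine --since "2016-02-05" | grep -Ei -B 2 -A 15 "Traceback|MessagingTimeout"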
I think there was some miscommunication with the customer overnight. The deployment was proceeding by re-trying the deploy until it worked without hitting the MessagingTimeout; that is the original issue. They have not moved forward from that with any other node deploys yet. They have set the RPC timeout to 600, but have NOT retried the deploy yet.

The customer is concerned about the naming of their nodes. They want their compute# to match their host names; this is the reason they are deploying 1 compute at a time.

Currently, there are 2 nodes in ERROR state in nova:

-------------------
| 3e2bea55-a20b-43b9-96c0-4a1045bf6fe9 | blkcclc0011 | ERROR | - | NOSTATE | |
| 57e85dd6-790c-4b8d-a45b-bab035d8ac6a | blkcclc0011 | ERROR | - | NOSTATE | |
-------------------

However, these nodes are NOT in ironic. Attempting to use the `openstack overcloud node delete` command returns a traceback that the node does not exist.

-------------------
[stack@blkcclu001 ~]$ ironic node-list | grep 3e2bea55-a20b-43b9-96c0-4a1045bf6fe9
[stack@blkcclu001 ~]$ ironic node-list | grep 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
-------------------

Here is the traceback:

-------------------
ERROR: openstack Couldn't find following instances in stack overcloud: 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
    self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_node.py", line 74, in take_action
    scale_manager.scaledown(parsed_args.nodes)
  File "/usr/lib/python2.7/site-packages/tripleo_common/scale.py", line 107, in scaledown
    (self.stack_id, ','.join(instance_list)))
ValueError: Couldn't find following instances in stack overcloud: 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
DEBUG: openstackclient.shell clean_up DeleteNode
DEBUG: openstackclient.shell got an error: Couldn't find following instances in stack overcloud: 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
ERROR: openstackclient.shell Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 176, in run
    return super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 230, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
    self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_node.py", line 74, in take_action
    scale_manager.scaledown(parsed_args.nodes)
  File "/usr/lib/python2.7/site-packages/tripleo_common/scale.py", line 107, in scaledown
    (self.stack_id, ','.join(instance_list)))
ValueError: Couldn't find following instances in stack overcloud: 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
-------------------

The customer has 2 questions at this point:

* What is the correct way to clean up these ERROR instances from nova?
* Is there a way to reset the index count for the nodes so that they can continue to deploy with node names that match their hostnames? This will be the determining factor on whether they are going to redeploy the entire stack or not.
From a Heat perspective, the main thing is to remove any stacks that may be referring to these two nodes *before* deleting them from Nova. (In fact, once you've done that they should be gone from Nova.) If you manually remove them behind Heat's back then things can get messier. It's not clear from the info above what the state in Heat is, and therefore hard to give more specific advice. If the first 9 nodes are OK and the current scale is >9 then the easiest way to resolve the problem is to scale down to 9. (If the errored nodes still exist in Nova after this, then delete them manually.)
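To make that concrete, a possible sequence for the scale-down approach (a sketch only; it assumes the overcloud was deployed with the --compute-scale flag, and the exact templates and environment files from the original deployment must be passed again, which I can't verify from here):

# Re-run the deploy with the Compute count reduced to 9; all other flags
# and -e environment files must match the original deployment exactly.
openstack overcloud deploy --templates --compute-scale 9 [original -e files]

# If the two ERROR instances are still in nova afterwards, delete them
# directly (UUIDs from the nova output in the comment above):
nova delete 3e2bea55-a20b-43b9-96c0-4a1045bf6fe9
nova delete 57e85dd6-790c-4b8d-a45b-bab035d8ac6a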
The customer has decided to delete the stack, clean up the database, and start fresh. They will put the RPC timeout into place and follow the same procedure.
Note that we raised (and fixed) a separate bz for the rpc_response_timeout issue, bug 1305947.
This is now resolved through the fix linked in Comment 29.