Description of problem:

director fails to deploy an additional compute node.

Version-Release number of selected component (if applicable): 7.2

How reproducible: Every time

Steps to Reproduce:
1. Attempt to deploy an additional compute node from director

Actual results:
The deploy fails.

Expected results:
The deploy should succeed.

Additional info:

It appears that the actual software deploy is working correctly: the software is available on the compute node and the network has been configured. However, the OpenStack services themselves do not appear to be configured, and nothing is running. Here is the output from heat:

[stack@blkcclu001 ~]$ heat resource-show overcloud Compute
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Property               | Value                                                                                                                                            |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes             | {                                                                                                                                                |
|                        |   "attributes": null,                                                                                                                            |
|                        |   "refs": null                                                                                                                                   |
|                        | }                                                                                                                                                |
| description            |                                                                                                                                                  |
| links                  | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b/resources/Compute (self)      |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud/d7e1d0ee-4af6-4a9e-9c5e-d37f571f202b (stack)                       |
|                        | http://45.32.159.21:8004/v1/0642009b359d416bbe1dfb7ab813db6e/stacks/overcloud-Compute-vyutisc7pljo/b4ac65d4-8b33-4337-859a-1a453b8f3034 (nested) |
| logical_resource_id    | Compute                                                                                                                                          |
| physical_resource_id   | b4ac65d4-8b33-4337-859a-1a453b8f3034                                                                                                             |
| required_by            | AllNodesExtraConfig                                                                                                                              |
|                        | ComputeCephDeployment                                                                                                                            |
|                        | allNodesConfig                                                                                                                                   |
|                        | ComputeAllNodesDeployment                                                                                                                        |
|                        | ComputeNodesPostDeployment                                                                                                                       |
|                        | ComputeAllNodesValidationDeployment                                                                                                              |
| resource_name          | Compute                                                                                                                                          |
| resource_status        | UPDATE_FAILED                                                                                                                                    |
| resource_status_reason | resources.Compute: MessagingTimeout: resources[8]: Timed out waiting for a reply to message ID eabc9302615648ab8b29adc361b4bfda                  |
| resource_type          | OS::Heat::ResourceGroup                                                                                                                          |
| updated_time           | 2016-02-05T15:19:19Z                                                                                                                             |
+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+

I was unable to find any further information regarding the timeout. We attempted to re-run the deploy, but it fails quickly with the same error.
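For reference: the failing Compute resource is an OS::Heat::ResourceGroup, and the MessagingTimeout points at resources[8] inside the nested stack named in the (nested) link above. Drilling into that nested stack to find the failed member should be possible with something like the following (a sketch only; the nested stack name is taken from the link, and in a ResourceGroup the member resources are simply named by index):

[stack@blkcclu001 ~]$ heat resource-list overcloud-Compute-vyutisc7pljo
[stack@blkcclu001 ~]$ heat resource-show overcloud-Compute-vyutisc7pljo 8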
*** This bug has been marked as a duplicate of bug 1290949 ***
My bad, this is on an undercloud machine with 8 cores so it shouldn't have been closed as a duplicate; there's something else to investigate here.
Could you please try increasing the RPC response timeout? As root:

openstack-config --set /etc/heat/heat.conf DEFAULT rpc_response_timeout 600
systemctl restart openstack-heat-engine

My current theory is that there is a flood of RPC calls during stack updates, and as the stack is scaled up the volume of these concurrent calls increases, leading to these timeouts. I've attached an upstream change which should reduce the number of these concurrent RPC calls during stack updates enough to avoid this problem, but hopefully raising rpc_response_timeout will be enough for now.
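To confirm the new value took effect before restarting the service, openstack-config can read it back (a minimal check, assuming the same heat.conf path as above):

openstack-config --get /etc/heat/heat.conf DEFAULT rpc_response_timeout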
The rpc_response_timeout default has been 600 since 7.1. Can you confirm whether this undercloud was upgraded from 7.0? If so, that would explain why it wasn't already set to 600. In that case, if the change in the above comment fixes the problem, this could be marked as resolved.
Can you also check "journalctl -u openstack-heat-engine" to see if there are any suspicious exceptions in the journal?
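For example, something along these lines should surface any tracebacks around the failure window (just a sketch; the date comes from the updated_time above and the grep context sizes are arbitrary):

journalctl -u openstack-heat-engine --since "2016-02-05" | grep -Ei -B 2 -A 15 "Traceback|MessagingTimeout"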
I think there was some miscommunication with the customer overnight. The deployment was proceeding by re-trying the deploy until it worked without hitting the MessagingTimeout; that is the original issue. They have not moved forward from that with any other node deploys yet. They have set the RPC timeout to 600, but have NOT retried the deploy yet.

The customer is concerned about the naming of their nodes. They want their compute# to match their host names; this is the reason they are deploying 1 compute at a time.

Currently, there are 2 nodes in ERROR state in nova:

-------------------
| 3e2bea55-a20b-43b9-96c0-4a1045bf6fe9 | blkcclc0011 | ERROR | - | NOSTATE | |
| 57e85dd6-790c-4b8d-a45b-bab035d8ac6a | blkcclc0011 | ERROR | - | NOSTATE | |
-------------------

However, these nodes are NOT in ironic. Attempting to use the `openstack overcloud node delete` command returns a traceback that the node does not exist.

-------------------
[stack@blkcclu001 ~]$ ironic node-list | grep 3e2bea55-a20b-43b9-96c0-4a1045bf6fe9
[stack@blkcclu001 ~]$ ironic node-list | grep 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
-------------------

Here is the traceback:

-------------------
ERROR: openstack Couldn't find following instances in stack overcloud: 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
    self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_node.py", line 74, in take_action
    scale_manager.scaledown(parsed_args.nodes)
  File "/usr/lib/python2.7/site-packages/tripleo_common/scale.py", line 107, in scaledown
    (self.stack_id, ','.join(instance_list)))
ValueError: Couldn't find following instances in stack overcloud: 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
DEBUG: openstackclient.shell clean_up DeleteNode
DEBUG: openstackclient.shell got an error: Couldn't find following instances in stack overcloud: 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
ERROR: openstackclient.shell Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 176, in run
    return super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 230, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
    self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_node.py", line 74, in take_action
    scale_manager.scaledown(parsed_args.nodes)
  File "/usr/lib/python2.7/site-packages/tripleo_common/scale.py", line 107, in scaledown
    (self.stack_id, ','.join(instance_list)))
ValueError: Couldn't find following instances in stack overcloud: 57e85dd6-790c-4b8d-a45b-bab035d8ac6a
-------------------

The customer has 2 questions at this point:

* What is the correct way to clean up these ERROR instances from nova?
* Is there a way to reset the index count for the nodes so that they can continue to deploy with node names that match their hostnames? This will be the determining factor on whether they are going to redeploy the entire stack or not.
From a Heat perspective, the main thing is to remove any stacks that may be referring to these two nodes *before* deleting them from Nova. (In fact, once you've done that they should be gone from Nova.) If you manually remove them behind Heat's back then things can get messier. It's not clear from the info above what the state in Heat is, and therefore hard to give more specific advice. If the first 9 nodes are OK and the current scale is >9 then the easiest way to resolve the problem is to scale down to 9. (If the errored nodes still exist in Nova after this, then delete them manually.)
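To make that concrete, a possible sequence for the scale-down approach (a sketch only; it assumes the overcloud was deployed with the --compute-scale flag, and the exact templates and environment files from the original deployment must be passed again, which I can't verify from here):

# Re-run the deploy with the Compute count reduced to 9; all other flags
# and -e environment files must match the original deployment exactly.
openstack overcloud deploy --templates --compute-scale 9 [original -e files]

# If the two ERROR instances are still in nova afterwards, delete them
# directly (UUIDs from the nova output in the comment above):
nova delete 3e2bea55-a20b-43b9-96c0-4a1045bf6fe9
nova delete 57e85dd6-790c-4b8d-a45b-bab035d8ac6a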
The customer has decided to delete the stack, clean up the database, and start fresh. They will put the RPC timeout into place and follow the same procedure.
Note that we raised (and fixed) a separate bz for the rpc_response_timeout issue, bug 1305947.
This is now resolved through the fix linked in Comment 29.