Bug 1276433 - Unable to deploy more than 60 compute nodes [NEEDINFO]
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: Hugh Brock
QA Contact: Shai Revivo
Depends On:
Blocks:
Reported: 2015-10-29 12:40 EDT by bigswitch
Modified: 2016-10-10 23:25 EDT
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-10 23:25:58 EDT
Type: Bug
morazi: needinfo? (rhosp-bugs-internal)


Attachments: None
Description bigswitch 2015-10-29 12:40:18 EDT
Description of problem:
Unable to deploy more than 60 compute nodes with the following message:
ERROR: openstack Heat Stack update failed.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
    self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 1223, in take_action
    self._deploy_tripleo_heat_templates(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 527, in _deploy_tripleo_heat_templates
    environments, parsed_args.timeout)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 451, in _heat_deploy
    raise Exception("Heat Stack update failed.")
Exception: Heat Stack update failed.
DEBUG: openstackclient.shell clean_up DeployOvercloud
DEBUG: openstackclient.shell got an error: Heat Stack update failed.
ERROR: openstackclient.shell Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 176, in run
    return super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 230, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
    self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 1223, in take_action
    self._deploy_tripleo_heat_templates(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 527, in _deploy_tripleo_heat_templates
    environments, parsed_args.timeout)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 451, in _heat_deploy
    raise Exception("Heat Stack update failed.")
Exception: Heat Stack update failed.

Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
Stack failed with status: MessagingTimeout: resources.Compute: Timed out waiting for a reply to message ID d91a569be47c409f8764f31386eb85d2

We have seen this twice: once when deploying more than 80 nodes, and a second time when deploying more than 60 nodes. The first attempt used a Dell R220 as the undercloud; the second used a Dell R620 with 32 GB of memory.
In each case, nodes were deployed 10 at a time.
In the second attempt there were three controllers, three Ceph nodes, and 60 compute nodes. The timeout message appeared when attempting to scale to 70 compute nodes. No nodes were deployed. I reran the deployment, and now nova list shows 70 compute nodes but nova service-list still shows 60.
We also noticed that the deployment gets slower and slower as the number of nodes increases. Going from 40 to 50 nodes took almost 3 hours (earlier, with fewer nodes, a deployment sometimes took less than 30 minutes).
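
The mismatch between nova list and nova service-list described above can be narrowed down to specific hosts by diffing the two outputs. The following is a minimal sketch, not part of the original report. It assumes both commands print the usual ASCII table, that nova list has a "Name" column, that nova service-list has "Binary" and "Host" columns, and that the server name matches the short hostname of the nova-compute service. The sample tables and the overcloud-* names are made up for illustration.

```python
def parse_table(text):
    """Parse an ASCII table ('+---+' borders, '| a | b |' rows) into dicts."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip borders, blank lines, and anything that is not a table row.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first row is the column header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def missing_compute_services(nova_list_text, service_list_text, marker="compute"):
    """Return servers whose name contains `marker` but have no nova-compute service.

    Assumption: the server name equals the short hostname of the service host.
    """
    servers = {
        row.get("Name", "")
        for row in parse_table(nova_list_text)
        if marker in row.get("Name", "")
    }
    services = {
        row.get("Host", "").split(".")[0]
        for row in parse_table(service_list_text)
        if row.get("Binary") == "nova-compute"
    }
    return sorted(servers - services)


# Made-up sample output, for illustration only.
SAMPLE_NOVA_LIST = """
+----+------------------------+--------+
| ID | Name                   | Status |
+----+------------------------+--------+
| a1 | overcloud-compute-0    | ACTIVE |
| a2 | overcloud-compute-1    | ACTIVE |
| a3 | overcloud-controller-0 | ACTIVE |
+----+------------------------+--------+
"""

SAMPLE_SERVICE_LIST = """
+----+----------------+------------------------------------+---------+-------+
| Id | Binary         | Host                               | Status  | State |
+----+----------------+------------------------------------+---------+-------+
| 1  | nova-scheduler | overcloud-controller-0.localdomain | enabled | up    |
| 2  | nova-compute   | overcloud-compute-0.localdomain    | enabled | up    |
+----+----------------+------------------------------------+---------+-------+
"""

if __name__ == "__main__":
    for name in missing_compute_services(SAMPLE_NOVA_LIST, SAMPLE_SERVICE_LIST):
        print("no nova-compute service registered for", name)
```

To use it on a real deployment, save the output of each command to a file and pass the file contents to missing_compute_services. The hosts it returns are the ones that were created but never registered a compute service.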


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Attempt to scale the deployment from 60 to 70 compute nodes

Actual results:


Expected results:


Additional info:
Comment 2 Mike Orazi 2016-01-07 11:52:40 EST
Is this still an issue?

I believe that, at the very least, we have been able to work around the issues encountered so far. I wanted to verify that, and to see whether we should close this bug and open more specific, targeted bugs if some issues still exist.
Comment 4 Mike Burns 2016-04-07 16:54:03 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 6 Jaromir Coufal 2016-10-10 23:25:58 EDT
No more information was provided; this should already be fixed. Please reopen if it reappears.
