Bug 1313479 - [Heat] NodeUserData cannot scale beyond 3 nodes
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-puppet-modules
Version: 7.0 (Kilo)
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: beta
Target Release: 10.0 (Newton)
Assigned To: Emilien Macchi
QA Contact: Arik Chernetsky
Keywords: TestOnly, Triaged
Depends On:
Blocks:
 
Reported: 2016-03-01 11:36 EST by Joe Talerico
Modified: 2016-12-14 10:25 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-12-14 10:09:26 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Joe Talerico 2016-03-01 11:36:21 EST
Description of problem:
I am running OSPd with a pre-deployment script (wipe_disk) for the Ceph nodes, and I am unable to scale beyond 3 Ceph nodes. If I attempt to deploy more than 3, the deployment fails because the wipe_disk task never runs on some of the nodes.


Version-Release number of selected component (if applicable):
openstack-heat-engine-2015.1.2-9.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy with a pre-deployment task, and with a node count > 3

Actual results:
Install fails

Expected results:
Install succeeds with a node count >3

Additional info:
The current workaround is to run the deployment with a node count of 3, but when you are trying to reach 32 nodes, this becomes painful.
Comment 2 Giulio Fidente 2016-03-02 07:11:53 EST
Judging by the stack resource list, NodeUserData seems to have been propagated to all nodes, but not all of them actually receive and execute it.
Comment 6 Mike Burns 2016-04-07 17:11:06 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 7 Steven Hardy 2016-05-04 08:58:12 EDT
I think we need some more information here. If the NodeUserData resources are all created OK in Heat and all the servers start OK, the next step is to look at the cloud-init logs on each node, and at the data retrieved by cloud-init (NodeUserData is provided to the node via Nova user-data).

I assume either some nodes don't get the data, or are failing silently to run it for some reason.

Another option would be to use OS::TripleO::CephStorageExtraConfigPre instead of NodeUserData, as this provides an error path in the event the script fails to run.  This could be made to run only on CREATE, and the deployment would stop if it failed to run on any nodes. Let me know if you need an example of this.
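
A minimal sketch of what such a template might look like, for illustration only and not taken from this report. It assumes the usual TripleO ExtraConfigPre interface of that era (a `server` parameter and a `deploy_stdout` output). The resource names `WipeDiskConfig` and `WipeDiskDeployment` are hypothetical, and the script body is a placeholder because the actual wipe_disk script is not shown here.

```yaml
heat_template_version: 2014-10-16

description: >
  Hypothetical pre-deployment config that runs a disk-wipe script
  on a Ceph storage node and fails the deployment if the script fails.

parameters:
  server:
    description: ID of the node to apply this config to
    type: string

resources:
  WipeDiskConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: |
        #!/bin/bash
        set -eu
        # Placeholder: the real wipe_disk logic is not part of this report.
        # A non-zero exit status here marks the deployment as failed.
        echo "wipe_disk would run here on $(hostname)"

  WipeDiskDeployment:
    type: OS::Heat::SoftwareDeployment
    properties:
      name: WipeDiskDeployment
      server: {get_param: server}
      config: {get_resource: WipeDiskConfig}
      # Run only when the node is first created, not on stack updates.
      actions: ['CREATE']

outputs:
  deploy_stdout:
    description: Output of the wipe script
    value: {get_attr: [WipeDiskDeployment, deploy_stdout]}
```

The template would then be mapped in an environment file passed to the deploy command. The path below is an assumption:

```yaml
resource_registry:
  OS::TripleO::CephStorageExtraConfigPre: /home/stack/templates/wipe_disk_pre.yaml
```

Unlike user-data, a SoftwareDeployment reports its exit status back to Heat, so a node that never runs the script shows up as a failed or stalled resource rather than passing silently.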
Comment 8 Joe Talerico 2016-05-04 10:35:14 EDT
Hey Steve, @Giulio should be able to provide that information; he really did the legwork here to determine the scale issue.
Comment 9 Steve Baker 2016-05-04 19:47:19 EDT
This may have happened when we had the RPC response timeout config regression. Trying to reproduce it now would be useful.
Comment 10 Joe Talerico 2016-05-04 20:03:40 EDT
If I had the OSIC lab again I would, but I don't :(
Comment 11 Jaromir Coufal 2016-07-05 14:55:11 EDT
Do we have somewhere to reproduce this setup? It seems we might have a solution.
Comment 12 Joe Talerico 2016-07-06 07:09:27 EDT
Jaromir - Maybe the end of July/August for in-house testing. Maybe sooner.
Comment 13 Federico Lucifredi 2016-07-13 15:29:39 EDT
Assigning to Julio since he has already done the legwork here.
Comment 15 Jeff Brown 2016-08-04 15:51:42 EDT
The fix needs to be validated by QE.
Comment 19 Yogev Rabl 2016-10-14 05:19:24 EDT
Verified on:
openstack-tripleo-ui-1.0.3-0.20160930145215.f7297c3.el7ost.noarch
openstack-tripleo-puppet-elements-5.0.0-0.20160929220627.200d011.el7ost.noarch
python-tripleoclient-5.2.0-1.el7ost.noarch
openstack-tripleo-common-5.2.1-0.20160930181658.40ad7e5.el7ost.noarch
puppet-tripleo-5.2.0-1.el7ost.noarch
openstack-tripleo-0.0.1-0.20160916135259.4de13b3.el7ost.noarch
openstack-tripleo-image-elements-5.0.0-0.20161002235922.14e1f41.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20161003064637.d636e3a.1.1.el7ost.noarch

I was able to deploy 4 Ceph storage nodes on a fresh deployment, and I was able to scale up from 3 to 4 nodes.
Comment 23 Zane Bitter 2016-12-14 10:09:26 EST

*** This bug has been marked as a duplicate of bug 1305947 ***
Comment 24 errata-xmlrpc 2016-12-14 10:25:04 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
