Bug 1313479

Summary: [Heat] NodeUserData cannot scale beyond 3 nodes
Product: Red Hat OpenStack Reporter: Joe Talerico <jtaleric>
Component: openstack-puppet-modules Assignee: Emilien Macchi <emacchi>
Status: CLOSED ERRATA QA Contact: Arik Chernetsky <achernet>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 7.0 (Kilo) CC: dbecker, flucifre, gfidente, jcoufal, jefbrown, jliberma, jschluet, mburns, mcornea, morazi, rhel-osp-director-maint, sbaker, sclewis, shardy, srevivo, yrabl, zbitter
Target Milestone: beta Keywords: TestOnly, Triaged
Target Release: 10.0 (Newton)   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-14 15:09:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joe Talerico 2016-03-01 16:36:21 UTC
Description of problem:
I am running OSPd with a pre-deployment script (wipe_disk) for the Ceph nodes, and I am unable to scale beyond 3 Ceph nodes. If I attempt to deploy more than 3, the deployment fails because the wipe_disk task never runs on some of the nodes.


Version-Release number of selected component (if applicable):
openstack-heat-engine-2015.1.2-9.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy with a pre-deployment task, and with a node count > 3

Actual results:
Install fails

Expected results:
Install succeeds with a node count >3

Additional info:
The current workaround is to run the deployment with a node count == 3, but when you are trying to reach 32 nodes, this becomes painful.

Comment 2 Giulio Fidente 2016-03-02 12:11:53 UTC
Looking at the stack resource list, NodeUserData seems to have been propagated to all nodes, but not all of them actually receive and execute it.

Comment 6 Mike Burns 2016-04-07 21:11:06 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 7 Steven Hardy 2016-05-04 12:58:12 UTC
We need some more information here, I think. If the NodeUserData resources are all created OK in Heat and all the servers start OK, the next step is to look at the cloud-init logs on each node, and at the data retrieved by cloud-init (NodeUserData is provided to the node via Nova user-data).

I assume some nodes either don't get the data or silently fail to run it for some reason.
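
One way to narrow this down is to collect the cloud-init output from every node and check which ones never ran the script. The following is a minimal Python sketch of that check, not a tested tool. It assumes the logs have already been copied off the nodes (for example from /var/log/cloud-init-output.log) and that the wipe_disk script prints a recognizable marker when it runs. The marker string, the node names, and the function name `nodes_missing_marker` are all hypothetical.

```python
def nodes_missing_marker(logs_by_node, marker="wipe_disk"):
    """Return the sorted names of nodes whose log text never contains `marker`.

    logs_by_node maps a node name to the full text of its cloud-init log.
    The default marker is an assumption: it only works if the user-data
    script actually writes that string to the log when it runs.
    """
    return sorted(
        node for node, text in logs_by_node.items()
        if marker not in text  # plain, case-sensitive substring match
    )


if __name__ == "__main__":
    # Hypothetical sample data standing in for logs collected from real nodes.
    sample_logs = {
        "ceph-0": "Cloud-init running\nwipe_disk: wiping disks\nfinished",
        "ceph-3": "Cloud-init running\nfinished",
    }
    print(nodes_missing_marker(sample_logs))  # prints ['ceph-3']
```

Any node listed in the result either did not receive the user-data or did not run it, which is the distinction to chase next on that node.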

Another option would be to use OS::TripleO::CephStorageExtraConfigPre instead of NodeUserData, as this provides an error path in the event the script fails to run.  This could be made to run only on CREATE, and the deployment would stop if it failed to run on any nodes. Let me know if you need an example of this.

Comment 8 Joe Talerico 2016-05-04 14:35:14 UTC
Hey Steve, @Giulio should be able to provide that information; he really did the legwork here to determine the scale issue.

Comment 9 Steve Baker 2016-05-04 23:47:19 UTC
This may have happened when we had the RPC response timeout config regression. Trying to reproduce it now would be useful.

Comment 10 Joe Talerico 2016-05-05 00:03:40 UTC
If I had the OSIC lab again I would, but I don't :(

Comment 11 Jaromir Coufal 2016-07-05 18:55:11 UTC
Do we have somewhere to reproduce this setup? It seems we might have a solution.

Comment 12 Joe Talerico 2016-07-06 11:09:27 UTC
Jaromir - Maybe the end of July/August for in-house testing. Maybe sooner.

Comment 13 Federico Lucifredi 2016-07-13 19:29:39 UTC
Assigning to Julio since he has already done the legwork here.

Comment 15 Jeff Brown 2016-08-04 19:51:42 UTC
The fix needs to be validated by QE.

Comment 19 Yogev Rabl 2016-10-14 09:19:24 UTC
Verified on:
openstack-tripleo-ui-1.0.3-0.20160930145215.f7297c3.el7ost.noarch
openstack-tripleo-puppet-elements-5.0.0-0.20160929220627.200d011.el7ost.noarch
python-tripleoclient-5.2.0-1.el7ost.noarch
openstack-tripleo-common-5.2.1-0.20160930181658.40ad7e5.el7ost.noarch
puppet-tripleo-5.2.0-1.el7ost.noarch
openstack-tripleo-0.0.1-0.20160916135259.4de13b3.el7ost.noarch
openstack-tripleo-image-elements-5.0.0-0.20161002235922.14e1f41.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20161003064637.d636e3a.1.1.el7ost.noarch

I was able to deploy 4 Ceph storage nodes on a fresh deployment and to scale up from 3 to 4 nodes.

Comment 23 Zane Bitter 2016-12-14 15:09:26 UTC

*** This bug has been marked as a duplicate of bug 1305947 ***

Comment 24 errata-xmlrpc 2016-12-14 15:25:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html