1313479 – [Heat] NodeUserData cannot scale beyond 3 nodes

Summary: [Heat] NodeUserData cannot scale beyond 3 nodes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-puppet-modules
Sub Component:
Version:	7.0 (Kilo)
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	urgent
Target Milestone:	beta
Target Release:	10.0 (Newton)
Assignee:	Emilien Macchi
QA Contact:	Arik Chernetsky
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-03-01 16:36 UTC by Joe Talerico
Modified:	2016-12-14 15:25 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-12-14 15:09:26 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2016:2948	0	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 enhancement update	2016-12-14 19:55:27 UTC

Description Joe Talerico 2016-03-01 16:36:21 UTC

Description of problem:
I am running OSPd with a pre-deployment script (wipe_disk) for the Ceph nodes, and I am unable to scale beyond 3 Ceph nodes. If I attempt to deploy more than 3, the deployment fails because the wipe_disk task never runs on some of the nodes.


Version-Release number of selected component (if applicable):
openstack-heat-engine-2015.1.2-9.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy with a pre-deployment task, and with a node count > 3

Actual results:
Install fails

Expected results:
Install succeeds with a node count >3

Additional info:
The current workaround is to run the deployment with a node count == 3, but when you are trying to reach 32 nodes, this becomes painful.

Comment 2 Giulio Fidente 2016-03-02 12:11:53 UTC

Judging by the stack resource list, NodeUserData seems to have been propagated to all nodes, but not all of them actually receive and execute it.

Comment 6 Mike Burns 2016-04-07 21:11:06 UTC

This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 7 Steven Hardy 2016-05-04 12:58:12 UTC

I think we need some more information here. If the NodeUserData resources are all created OK in Heat, and all the servers start OK, the next step is to look at the cloud-init logs on each node, and at the data retrieved by cloud-init (NodeUserData is provided to the node via Nova user-data).

I assume either some nodes don't get the data, or are failing silently to run it for some reason.

Another option would be to use OS::TripleO::CephStorageExtraConfigPre instead of NodeUserData, as this provides an error path in the event the script fails to run.  This could be made to run only on CREATE, and the deployment would stop if it failed to run on any nodes. Let me know if you need an example of this.
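
A minimal sketch of the OS::TripleO::CephStorageExtraConfigPre approach described above. It assumes the role template passes a `server` parameter and reads a `deploy_stdout` output, as the extra-config templates shipped in tripleo-heat-templates do. The resource names, the file path in the comments, and the script body are placeholders for illustration, not the reporter's actual wipe_disk script.

```yaml
# wipe-disk.yaml (hypothetical file name)
# Map it in an environment file, for example:
#   resource_registry:
#     OS::TripleO::CephStorageExtraConfigPre: /path/to/wipe-disk.yaml
heat_template_version: 2015-04-30

description: Run a disk-wipe script on each Ceph storage node before configuration.

parameters:
  server:
    description: ID of the Ceph storage node to apply this config to
    type: string

resources:
  WipeDiskConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: |
        #!/bin/bash
        # A non-zero exit status marks the deployment as failed,
        # which stops the stack instead of failing silently.
        set -e
        # Placeholder: the real wipe_disk commands go here.
        echo "wipe_disk placeholder ran on $(hostname)"

  WipeDiskDeployment:
    type: OS::Heat::SoftwareDeployment
    properties:
      config: {get_resource: WipeDiskConfig}
      server: {get_param: server}
      # Run only when the node is first created, not on stack updates.
      actions: ['CREATE']

outputs:
  deploy_stdout:
    description: Output of the wipe-disk deployment
    value: {get_attr: [WipeDiskDeployment, deploy_stdout]}
```

The environment file containing the resource_registry mapping would then be passed to `openstack overcloud deploy` with `-e`. Unlike NodeUserData, a failure here shows up as a failed deployment resource for the affected node.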

Comment 8 Joe Talerico 2016-05-04 14:35:14 UTC

Hey Steve, @Giulio should be able to provide that information; he really did the legwork here to determine the scale issue.

Comment 9 Steve Baker 2016-05-04 23:47:19 UTC

This may have happened when we had the rpc response timeout config regression. Trying to reproduce it now would be useful.

Comment 10 Joe Talerico 2016-05-05 00:03:40 UTC

If I had the OSIC lab again I would, but I don't :(

Comment 11 Jaromir Coufal 2016-07-05 18:55:11 UTC

Do we have somewhere to reproduce this setup? It seems we might have a solution.

Comment 12 Joe Talerico 2016-07-06 11:09:27 UTC

Jaromir - Maybe the end of July/August for in-house testing. Maybe sooner.

Comment 13 Federico Lucifredi 2016-07-13 19:29:39 UTC

Assigning to Giulio since he has already done the legwork here.

Comment 15 Jeff Brown 2016-08-04 19:51:42 UTC

The fix needs to be validated by QE.

Comment 19 Yogev Rabl 2016-10-14 09:19:24 UTC

Verified on:
openstack-tripleo-ui-1.0.3-0.20160930145215.f7297c3.el7ost.noarch
openstack-tripleo-puppet-elements-5.0.0-0.20160929220627.200d011.el7ost.noarch
python-tripleoclient-5.2.0-1.el7ost.noarch
openstack-tripleo-common-5.2.1-0.20160930181658.40ad7e5.el7ost.noarch
puppet-tripleo-5.2.0-1.el7ost.noarch
openstack-tripleo-0.0.1-0.20160916135259.4de13b3.el7ost.noarch
openstack-tripleo-image-elements-5.0.0-0.20161002235922.14e1f41.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20161003064637.d636e3a.1.1.el7ost.noarch

Was able to deploy 4 Ceph storage nodes on a fresh deployment, and was able to scale up from 3 to 4 nodes.

Comment 23 Zane Bitter 2016-12-14 15:09:26 UTC


*** This bug has been marked as a duplicate of bug 1305947 ***

Comment 24 errata-xmlrpc 2016-12-14 15:25:04 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
