Bug 1313479
Summary: | [Heat] NodeUserData cannot scale beyond 3 nodes | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Joe Talerico <jtaleric> |
Component: | openstack-puppet-modules | Assignee: | Emilien Macchi <emacchi> |
Status: | CLOSED ERRATA | QA Contact: | Arik Chernetsky <achernet> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.0 (Kilo) | CC: | dbecker, flucifre, gfidente, jcoufal, jefbrown, jliberma, jschluet, mburns, mcornea, morazi, rhel-osp-director-maint, sbaker, sclewis, shardy, srevivo, yrabl, zbitter |
Target Milestone: | beta | Keywords: | TestOnly, Triaged |
Target Release: | 10.0 (Newton) | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-12-14 15:09:26 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Joe Talerico
2016-03-01 16:36:21 UTC
Judging by the stack resource list, NodeUserData appears to have been propagated to all nodes, but not all of them actually receive and execute it.

This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.

We need some more information here, I think. If the NodeUserData resources are all created OK in Heat and all the servers start OK, the next step is to look at the cloud-init logs on each node, and at the data retrieved by cloud-init (NodeUserData is provided to the node via Nova user-data). I assume some nodes either don't get the data or are failing silently to run it for some reason.

Another option would be to use OS::TripleO::CephStorageExtraConfigPre instead of NodeUserData, as this provides an error path in the event the script fails to run. This could be made to run only on CREATE, and the deployment would stop if it failed to run on any nodes. Let me know if you need an example of this.

Hey Steve, @Giulio should be able to provide that information; he really did the legwork here to determine the scale issue.

This may have happened when we had the RPC response timeout config regression. Trying to reproduce it now would be useful.

If I had the OSIC lab again I would, but I don't :(

Do we have somewhere to reproduce this setup? It seems we might have a solution.

Jaromir - Maybe the end of July/August for in-house testing. Maybe sooner.

Assigning to Giulio since he has already done the legwork here.

The fix needs to be validated by QE.
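The per-node cloud-init check suggested earlier in this thread (confirming on each node whether the user-data script actually ran) could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes each node's cloud-init log (typically /var/log/cloud-init.log or /var/log/cloud-init-output.log) has already been copied to a local directory as `<node>.log`, and that the user-data script prints a recognizable marker line when it runs. The marker string, node names, and helper names below are hypothetical and do not come from this bug.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical marker: a line the user-data script is assumed to print when it runs.
MARKER = "nodeuserdata-finished"


def load_logs(directory: str) -> dict[str, str]:
    """Read every <node>.log file in a directory into a {node: log text} mapping."""
    return {
        path.stem: path.read_text(errors="replace")
        for path in Path(directory).glob("*.log")
    }


def nodes_missing_marker(logs: dict[str, str], marker: str = MARKER) -> list[str]:
    """Return the sorted names of nodes whose log text never contains the marker."""
    return sorted(name for name, text in logs.items() if marker not in text)


if __name__ == "__main__":
    # Made-up sample data standing in for collected logs.
    sample = {
        "cephstorage-0": "boot\nnodeuserdata-finished\n",
        "cephstorage-1": "boot\n",
    }
    for node in nodes_missing_marker(sample):
        print(f"{node}: user-data marker not found")
```

A node listed by `nodes_missing_marker` either never received the user-data or failed to run it; the log alone does not tell those two cases apart, so the data retrieved by cloud-init on that node still needs to be inspected.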
Verified on:

openstack-tripleo-ui-1.0.3-0.20160930145215.f7297c3.el7ost.noarch
openstack-tripleo-puppet-elements-5.0.0-0.20160929220627.200d011.el7ost.noarch
python-tripleoclient-5.2.0-1.el7ost.noarch
openstack-tripleo-common-5.2.1-0.20160930181658.40ad7e5.el7ost.noarch
puppet-tripleo-5.2.0-1.el7ost.noarch
openstack-tripleo-0.0.1-0.20160916135259.4de13b3.el7ost.noarch
openstack-tripleo-image-elements-5.0.0-0.20161002235922.14e1f41.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20161003064637.d636e3a.1.1.el7ost.noarch

Was able to deploy 4 Ceph storage nodes on a fresh deployment and was able to scale up from 3 to 4 nodes.

*** This bug has been marked as a duplicate of bug 1305947 ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html