Red Hat Bugzilla – Bug 1313479
[Heat] NodeUserData cannot scale beyond 3 nodes
Last modified: 2016-12-14 10:25:04 EST
Description of problem:
Running OSPd and having a pre-deployment script run (wipe_disk) for the ceph nodes. I am unable to scale beyond 3 ceph nodes. If I attempt to run > 3, the deployment fails because the wipe_disk task never runs on some of the nodes.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy with a pre-deployment task, and with a node count > 3
Install succeeds with a node count >3
Current fix is to run the deployment with a node count == 3, but when you have 32 nodes you are trying to reach, this becomes painful.
NodeUserData seems to have been propagated to all nodes looking at the stack resources list, but not all of them will actually receive and execute it.
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
We need some more information here I think, if the NodeUserData resources are all created OK in heat, and we start all the servers OK, the next step is to look at the cloud-init logs on each node, and at the data retrieved by cloud-init (NodeUserData gets provided to the node via Nova user-data).
I assume either some nodes don't get the data, or are failing silently to run it for some reason.
Another option would be to use OS::TripleO::CephStorageExtraConfigPre instead of NodeUserData, as this provides an error path in the event the script fails to run. This could be made to run only on CREATE, and the deployment would stop if it failed to run on any nodes. Let me know if you need an example of this.
Hey Steve @Giulio should be able to provide that information, he really did the leg work here to determine the scale issue.
This may have happened when we had the rpc response timeout config regression. It trying to reproduce now would be useful
If i had the OSIC lab again I would but I don't :(
Do we have somewhere to reproduce this setup? Since it seems we might have solution.
Jaromir - Maybe the end of July/August for in-house testing. Maybe sooner.
Assigning to Julio since he has already done the legwork here.
The fixed needs to be validated by QE.
Was able to deploy 4 Ceph storage nodes on a fresh deployment and was able to scale up from 3 to 4 nodes
*** This bug has been marked as a duplicate of bug 1305947 ***
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.