Description of problem:
Occasionally the /setup/key/ API endpoint fails to return the SSH public key to the client. Instead, it returns an error message. Here is an example where one of my cluster nodes got the failure (the other four nodes have the proper key; only this one was broken):

$ cat /home/ceph-installer/.ssh/authorized_keys
{"message": "stdout: \"/var/lib/ceph-installer/.ssh/id_rsa already exists.\nOverwrite (y/n)"}

Version-Release number of selected component (if applicable):
ceph-installer-1.0.7-1.el7scon

How reproducible:
Sporadic, maybe 1 out of 10? It seems to be more reproducible with Xenial cluster nodes for some reason (maybe something to do with the speed at which the Xenial images boot).

Steps to Reproduce:
1. Install ceph-installer on a brand new system.
2. Boot several cluster nodes and have them all run the /setup/ script at roughly the same time.
3. Check /home/ceph-installer/.ssh/authorized_keys on each cluster node.

Actual results:
authorized_keys contains the JSON error message from ssh-keygen.

Expected results:
authorized_keys contains the SSH public key for Ansible.
Created attachment 1154772: `sudo journalctl -u ceph-installer`

The installer's log (from systemd-journald) shows the error as well as the two separate ssh-keygen invocations.
Something else to note: I'm booting these nodes in sequential order, but they all start very close to each other. The order in which they boot is:

node-1: installer node
node-2: mon, has the key
node-3: osd, has the error
node-4: osd, has the key
node-5: osd, has the key

So maybe node-2 and node-3 are racing there: node-2's ssh-keygen step hasn't finished before node-3 initiates the HTTP request, which triggers the second (doomed-to-fail) ssh-keygen operation.
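To make the suspected race concrete, here is a small simulation of the check-then-act pattern. It is only a sketch and not the installer's actual code: `FakeKeyStore`, `handle_request`, and `simulate` are hypothetical names, and a `threading.Barrier` forces both "requests" past the existence check before either one generates the key.

```python
import threading


class FakeKeyStore:
    """Hypothetical stand-in for the key file on the installer node."""

    def __init__(self):
        self.exists = False
        self.errors = []
        self._lock = threading.Lock()

    def generate(self):
        # Mimics ssh-keygen: it refuses (prompts) if the key is already there.
        with self._lock:
            if self.exists:
                self.errors.append("id_rsa already exists. Overwrite (y/n)")
                return
            self.exists = True


def handle_request(store, barrier):
    needs_key = not store.exists  # check: both requests see "no key yet"
    barrier.wait()                # force the unlucky interleaving
    if needs_key:
        store.generate()          # act: the second call hits the prompt


def simulate():
    store = FakeKeyStore()
    barrier = threading.Barrier(2)
    threads = [
        threading.Thread(target=handle_request, args=(store, barrier))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.errors
```

With this forced interleaving, exactly one of the two requests ends up with the "already exists" message, which would match one node receiving the error instead of the key.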
A solution here is to run ssh-keygen very early, before the HTTP server will accept any connections from clients. Looks like gunicorn's on_starting() might work?
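A rough sketch of that idea, assuming the service runs under gunicorn with a Python config file that defines server hooks. `ensure_ssh_key` and `_run_ssh_keygen` are hypothetical helper names, not existing ceph-installer functions; the key path is the one from the error message above.

```python
import os
import subprocess

# Path taken from the error message in this report.
KEY_PATH = "/var/lib/ceph-installer/.ssh/id_rsa"


def _run_ssh_keygen(key_path):
    # -N "" sets an empty passphrase, -q suppresses the banner output.
    subprocess.check_call(
        ["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", key_path]
    )


def ensure_ssh_key(key_path, keygen=_run_ssh_keygen):
    """Generate the key pair only if it is missing; return the .pub path."""
    if not os.path.exists(key_path):
        os.makedirs(os.path.dirname(key_path), mode=0o700, exist_ok=True)
        keygen(key_path)
    return key_path + ".pub"


def on_starting(server):
    # gunicorn server hook: runs in the master process before it is
    # initialized, so no client request can arrive before the key exists.
    ensure_ssh_key(KEY_PATH)
```

If this lived in a gunicorn config file (for example one passed with `-c`), the request handlers would only ever need to read the existing public key and would never have to invoke ssh-keygen themselves.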
We can do this when the app is getting started as part of loading the Pecan application. Thanks for debugging this!
Pull request opened https://github.com/ceph/ceph-installer/pull/146
Alfredo, would you mind reviewing https://github.com/ceph/ceph-installer/pull/147 for this as well?
Reviewed and merged
This will be fixed in the upcoming v1.0.9.
Checked with ceph-installer-1.0.11-1.el7scon.noarch and we don't see this issue in our test environment. -> Verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1754