Bug 1334008

Summary: /setup/key/ endpoint occasionally fails with "/var/lib/ceph-installer/.ssh/id_rsa already exists"
Product: [Red Hat Storage] Red Hat Storage Console Reporter: Ken Dreyer (Red Hat) <kdreyer>
Component: ceph-installer Assignee: Alfredo Deza <adeza>
Status: CLOSED ERRATA QA Contact: sds-qe-bugs
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 2 CC: adeza, aschoen, ceph-eng-bugs, mkudlej, nthomas, sankarshan, sds-qe-bugs
Target Milestone: ---   
Target Release: 2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-installer-1.0.9-1.el7scon Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-23 19:50:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
`sudo journalctl -u ceph-installer` (flags: none)

Description Ken Dreyer (Red Hat) 2016-05-07 02:29:07 UTC
Description of problem:
Occasionally the /setup/key/ API endpoint fails to return the SSH public key to the client. Instead, it prints an error message. Here is an example where one of my cluster nodes got the failure (the other four nodes have the proper key; only this one was broken):

$ cat /home/ceph-installer/.ssh/authorized_keys

{"message": "stdout: \"/var/lib/ceph-installer/.ssh/id_rsa already exists.\nOverwrite (y/n)"}


Version-Release number of selected component (if applicable):
ceph-installer-1.0.7-1.el7scon

How reproducible:
Sporadic, maybe 1 out of 10? It seems to be more reproducible with Xenial cluster nodes for some reason (possibly something to do with how quickly the Xenial images boot).


Steps to Reproduce:
1. Install ceph-installer on a brand new system.
2. Boot several cluster nodes and have them all run the /setup/ script at roughly the same time (a condensed reproducer sketch follows this list).
3. Check /home/ceph-installer/.ssh/authorized_keys on each cluster node.
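
A condensed, hypothetical reproducer of steps 2-3 (the installer host, port, and the idea that clients simply fetch /setup/key/ over HTTP are assumptions here; the real /setup/ script does more than this):

# reproduce_race.py -- hypothetical sketch, not the shipped /setup/ script
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen

URL = "http://installer.example.com:8181/setup/key/"  # assumed host and port

def fetch(node):
    try:
        body = urlopen(URL).read().decode()
    except HTTPError as err:
        body = err.read().decode()  # a failing request may come back as an error status
    return node, body

with ThreadPoolExecutor(max_workers=4) as pool:
    # hit the endpoint from several "nodes" at roughly the same time
    for node, body in pool.map(fetch, range(2, 6)):
        status = "hit the ssh-keygen race" if "already exists" in body else "got a key"
        print("node-{}: {}".format(node, status))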

Actual results:
authorized_keys contains the JSON error message from ssh-keygen.

Expected results:
authorized_keys contains the SSH public key for Ansible.

Comment 1 Ken Dreyer (Red Hat) 2016-05-07 02:31:38 UTC
Created attachment 1154772 [details]
`sudo journalctl -u ceph-installer`

The installer's log (from systemd-journald) shows the error as well as the two separate ssh-keygen invocations.

Comment 2 Ken Dreyer (Red Hat) 2016-05-07 02:36:12 UTC
Something else to note: I'm booting these nodes in sequential order, but they all start very close to each other. The order in which they boot is:

node-1: installer node
node-2: mon, has the key
node-3: osd, has the error
node-4: osd, has the key
node-5: osd, has the key

So node-2 and node-3 may be racing there: node-2's ssh-keygen step hasn't finished before node-3 initiates its HTTP request, thereby triggering a second (doomed-to-fail) ssh-keygen operation.
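
For illustration, a hypothetical sketch of the racy lazy-generation pattern (not the actual ceph-installer handler): both requests pass the "key is missing" check, and the slower ssh-keygen run then trips over the file the faster one just created, so its prompt text ends up in the JSON error returned to the client.

# racy_key_endpoint.py -- illustration only, not the real ceph-installer code
import os
import subprocess

KEY_PATH = "/var/lib/ceph-installer/.ssh/id_rsa"  # path from the error message

def get_public_key():
    # Two concurrent requests can both observe the key as missing...
    if not os.path.exists(KEY_PATH):
        # ...and the second ssh-keygen then finds the file already written by
        # the first and prints "id_rsa already exists. Overwrite (y/n)"
        # instead of generating a key.
        subprocess.check_output(
            ["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", KEY_PATH],
            stderr=subprocess.STDOUT,
        )
    with open(KEY_PATH + ".pub") as pub:
        return pub.read()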

Comment 3 Ken Dreyer (Red Hat) 2016-05-09 13:47:53 UTC
A solution here is to run ssh-keygen very early, before the HTTP server accepts any connections from clients. Looks like gunicorn's on_starting() might work?
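
A minimal sketch of that idea, assuming a gunicorn config module (on_starting() is a real gunicorn server hook; the paths and key type come from the error message above, everything else is an assumption):

# gunicorn_config.py -- sketch only; see the following comments for where the fix actually landed
import os
import subprocess

SSH_DIR = "/var/lib/ceph-installer/.ssh"
KEY_PATH = os.path.join(SSH_DIR, "id_rsa")

def on_starting(server):
    """Runs once in the master process, before any worker accepts requests."""
    if not os.path.exists(KEY_PATH):
        if not os.path.isdir(SSH_DIR):
            os.makedirs(SSH_DIR, 0o700)
        # Generate the keypair exactly once, so concurrent /setup/key/
        # requests can never trigger a second ssh-keygen.
        subprocess.check_call(
            ["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", KEY_PATH]
        )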

Comment 4 Alfredo Deza 2016-05-09 14:57:36 UTC
We can do this when the app is getting started as part of loading the Pecan application.
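
As a rough sketch of that approach (module layout and helper name are assumptions; the merged pull request below may differ):

# app.py -- hypothetical sketch of generating the key while the Pecan application is loaded
import os
import subprocess

from pecan import make_app

KEY_PATH = "/var/lib/ceph-installer/.ssh/id_rsa"

def ensure_ssh_keypair():
    # Same idea as the on_starting() sketch above: create the keypair once,
    # before the app can serve its first /setup/key/ request.
    if not os.path.exists(KEY_PATH):
        subprocess.check_call(
            ["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", KEY_PATH]
        )

def setup_app(config):
    ensure_ssh_keypair()
    app_conf = dict(config.app)
    return make_app(app_conf.pop("root"), **app_conf)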

Thanks for debugging this!

Comment 5 Alfredo Deza 2016-05-09 15:11:56 UTC
Pull request opened https://github.com/ceph/ceph-installer/pull/146

Comment 6 Ken Dreyer (Red Hat) 2016-05-09 17:16:19 UTC
Alfredo, mind reviewing https://github.com/ceph/ceph-installer/pull/147 for this as well?

Comment 7 Alfredo Deza 2016-05-09 17:59:13 UTC
Reviewed and merged

Comment 8 Ken Dreyer (Red Hat) 2016-05-09 18:07:28 UTC
This will be fixed in the upcoming v1.0.9.

Comment 12 Martin Kudlej 2016-06-21 09:31:03 UTC
Checked with ceph-installer-1.0.11-1.el7scon.noarch and we don't see this issue in our test environment. -> Verified

Comment 14 errata-xmlrpc 2016-08-23 19:50:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754