Bug 1334008 - /setup/key/ endpoint occasionally fails with "/var/lib/ceph-installer/.ssh/id_rsa already exists"
Summary: /setup/key/ endpoint occasionally fails with "/var/lib/ceph-installer/.ssh/id...
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat
Component: ceph-installer
Version: 2
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Alfredo Deza
QA Contact: sds-qe-bugs
Depends On:
Reported: 2016-05-07 02:29 UTC by Ken Dreyer (Red Hat)
Modified: 2016-08-23 19:50 UTC (History)
7 users

Fixed In Version: ceph-installer-1.0.9-1.el7scon
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2016-08-23 19:50:05 UTC
Target Upstream Version:

Attachments (Terms of Use)
`sudo journalctl -u ceph-installer` (2.78 KB, text/plain)
2016-05-07 02:31 UTC, Ken Dreyer (Red Hat)

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:1754 0 normal SHIPPED_LIVE New packages: Red Hat Storage Console 2.0 2017-04-18 19:09:06 UTC

Description Ken Dreyer (Red Hat) 2016-05-07 02:29:07 UTC
Description of problem:
Occasionally the /setup/key/ API endpoint will fail to return the SSH public key to the client. Instead, it returns an error message. Here is an example where one of my cluster nodes got the failure (the other four nodes have the proper key; it was just this one that was broken):

$ cat /home/ceph-installer/.ssh/authorized_keys

{"message": "stdout: \"/var/lib/ceph-installer/.ssh/id_rsa already exists.\nOverwrite (y/n)"}

Version-Release number of selected component (if applicable):

How reproducible:
Sporadic, maybe 1 out of 10? It seems to be more reproducible with Xenial cluster nodes for some reason (maybe something to do with the speed at which the Xenial images boot).

Steps to Reproduce:
1. Install ceph-installer on a brand new system.
2. Boot several cluster nodes and have them all run the /setup/ script at roughly the same time.
3. Check /home/ceph-installer/.ssh/authorized_keys on each cluster node.

Actual results:
authorized_keys contains the JSON error message from ssh-keygen.

Expected results:
authorized_keys contains the SSH public key for Ansible.

Comment 1 Ken Dreyer (Red Hat) 2016-05-07 02:31:38 UTC
Created attachment 1154772 [details]
`sudo journalctl -u ceph-installer`

The installer's log (from systemd-journald) shows the error as well as the two separate ssh-keygen invocations.

Comment 2 Ken Dreyer (Red Hat) 2016-05-07 02:36:12 UTC
Something else to note: I'm booting these nodes in sequential order, but they all start very close to each other. The order in which they boot is:

node-1: installer node
node-2: mon, has the key
node-3: osd, has the error
node-4: osd, has the key
node-5: osd, has the key

So maybe node-2 and node-3 are racing there, and node-2's ssh-keygen step hasn't finished before node-3 initiates the HTTP request, therefore triggering the second (doomed-to-fail) ssh-keygen operation.

Comment 3 Ken Dreyer (Red Hat) 2016-05-09 13:47:53 UTC
A solution here is to run ssh-keygen very early, before the HTTP server will accept any connections from clients. Looks like gunicorn's on_starting() might work?
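A minimal sketch of that idea, as a gunicorn config file. on_starting() is a real gunicorn server hook that runs once in the master process before any worker starts accepting connections; the key path and ssh-keygen arguments here are illustrative, not ceph-installer's actual values:

```python
# gunicorn_config.py -- sketch only; assumes the service runs under gunicorn.
import os
import subprocess

SSH_KEY = "/var/lib/ceph-installer/.ssh/id_rsa"  # illustrative path

def on_starting(server):
    """Runs exactly once in the master process before workers bind,
    so no client request can ever race the key generation."""
    if not os.path.exists(SSH_KEY):
        subprocess.check_call(
            ["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", SSH_KEY]
        )
```

Because the hook finishes before the HTTP socket serves requests, /setup/key/ can simply read the existing public key and never needs to generate one itself.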

Comment 4 Alfredo Deza 2016-05-09 14:57:36 UTC
We can do this when the app is getting started as part of loading the Pecan application.

Thanks for debugging this!

Comment 5 Alfredo Deza 2016-05-09 15:11:56 UTC
Pull request opened https://github.com/ceph/ceph-installer/pull/146

Comment 6 Ken Dreyer (Red Hat) 2016-05-09 17:16:19 UTC
Alfredo, mind reviewing https://github.com/ceph/ceph-installer/pull/147 for this as well?

Comment 7 Alfredo Deza 2016-05-09 17:59:13 UTC
Reviewed and merged

Comment 8 Ken Dreyer (Red Hat) 2016-05-09 18:07:28 UTC
This will be fixed in the upcoming v1.0.9.

Comment 12 Martin Kudlej 2016-06-21 09:31:03 UTC
Checked with ceph-installer-1.0.11-1.el7scon.noarch and we don't see this issue in our test environment. -> Verified

Comment 14 errata-xmlrpc 2016-08-23 19:50:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

