Description of problem:
When attempting to import an external Ceph cluster that was deployed by the
OSP Director, the task fails because of a mismatch between the FQDNs of
the Monitor nodes (which are the OSP Controllers). The FQDNs of the Monitor
nodes reported by the storage console agent do not match the FQDNs of the
same nodes reported by the calamari-server. This confuses the Storage Console
because it treats the differing FQDNs as different nodes.
Version-Release number of selected component (if applicable):
Red Hat Storage Console 2.0
Red Hat Ceph Storage 2.0
Steps to Reproduce:
1. Deploy a Ceph cluster on baremetal hardware using OSP-10 Director
2. Deploy Storage Console in a VM, and console agent on OSP Ceph nodes
3. Attempt to import the cluster into the Storage Console
Actual results:
The import task fails, and the Storage Console displays the error message,
"One or more of the hosts in this cluster cannot be found."
Expected results:
The cluster is successfully imported.
Additional info:
The issue is related to Bug #1387797 in that it reveals how the problem
occurs. The calamari-server running on one of the Monitor nodes calls
socket.getfqdn() to determine the FQDN of every node in the cluster, and the
value returned is based on the contents of the calamari-server's /etc/hosts
file. Unfortunately, the FQDNs for all Monitor node addresses on the storage
network contain an extra ".storage" subdomain, which in turn causes the FQDNs
reported by the calamari-server to contain the ".storage" subdomain.
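The lookup behaviour described above can be reproduced without touching a live node. The following is a minimal sketch, not calamari's actual code: it mimics a files-based reverse lookup, in which the first name listed on the matching /etc/hosts line is taken as the canonical name. The addresses, the example.com domain, and the helper name canonical_name are assumptions made for illustration only.

```python
# Hypothetical /etc/hosts content; the addresses and names are made up.
HOSTS = """\
# external network
10.0.0.10   controller-0.example.com controller-0
# storage network
172.16.1.10 controller-0.storage.example.com controller-0.storage
"""


def canonical_name(hosts_text, address):
    """Return the first name listed for `address`, or None if it is absent.

    The first name after the address on the matching line is treated as
    the canonical name, as a files-based reverse lookup would do.
    """
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == address:
            return fields[1]
    return None


# Same node, looked up through two different network addresses.
name_via_external = canonical_name(HOSTS, "10.0.0.10")
name_via_storage = canonical_name(HOSTS, "172.16.1.10")

print(name_via_external)
print(name_via_storage)
print(name_via_external == name_via_storage)
```

If the two lookups disagree, any component that keys nodes by FQDN will count one physical node twice, which is consistent with the import failure reported here.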
The problem can be worked around by patching /etc/hosts on the calamari-server
so that the FQDN for the storage network address of each Monitor is the same
as its canonical FQDN (no ".storage" subdomain). However, it should not be
necessary to patch a file that is managed by the OSP Director.
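For anyone who needs the workaround in the meantime, the edit can be sketched as below. This is only an illustration under stated assumptions: the helper names drop_network_label and patch_hosts_line are hypothetical, the sample line is made up, and the sketch assumes the network label always appears as the second DNS label and is never part of the real domain.

```python
def drop_network_label(name, label="storage"):
    """Remove `label` when it is the second DNS label of `name`.

    Two-label short aliases such as "host.storage" are left alone.
    """
    parts = name.split(".")
    if len(parts) > 2 and parts[1] == label:
        del parts[1]
    return ".".join(parts)


def patch_hosts_line(line, label="storage"):
    """Rewrite only the first (canonical) name on a hosts line."""
    body, sep, comment = line.partition("#")
    fields = body.split()
    if len(fields) < 2:
        # Blank line, comment-only line, or malformed entry: keep as-is.
        return line
    fields[1] = drop_network_label(fields[1], label)
    patched = " ".join(fields)
    # Re-attach a trailing comment if the line had one.
    return patched + (" " + sep + comment if sep else "")


# Hypothetical storage-network entry.
before = "172.16.1.10 controller-0.storage.example.com controller-0.storage"
print(patch_hosts_line(before))
```

Because the OSP Director manages /etc/hosts, a manual edit like this may be overwritten on the next deployment update, which is why it is only a stopgap.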
Alan's fix for the /etc/hosts file was working for us, but a new issue associated with the storage nodes has appeared with the latest CDN updates.
Previously working Ceph packages were:
Current non-working packages are:
When importing the cluster, we now see an issue where the storage hostname contains the string "storagemgmt" in the FQDN (e.g. r14-cephstorage-0.storagemgmt.oss.labs). Note that this name does not appear on the "Select Monitor Host" GUI page. It shows up on the subsequent "Cluster Summary" page after selecting the r14-controller-0.oss.labs host and clicking Continue.
I tried adding the new FQDN to the storage node's hosts file as shown below, still with no success.
192.168.170.18 r14-cephstorage-0.oss.labs r14-cephstorage-0.storagemgmt.oss.labs
I checked the salt minions in the /etc/salt/pki/master/minions_pre directory as shown below:
-rw-r--r--. 1 root root 451 Mar 21 17:39 r14-cephstorage-0.oss.labs
The filename does not include the string "storagemgmt".
Nor does the storage node return the string when "hostname --fqdn" is run from the command line:
$ hostname --fqdn
I am not sure how the code is getting this information.
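One way to narrow this down is to collect the names each component reports and flag hosts for which they disagree. The sketch below is a hypothetical helper, not part of the Storage Console or calamari. The source labels "salt_minion_keys" and "cluster_summary" are just illustrative tags for where each name was observed in this report.

```python
from collections import defaultdict


def find_fqdn_mismatches(names_by_source):
    """Return {short hostname: sorted FQDNs} for hosts seen under several FQDNs."""
    seen = defaultdict(set)
    for names in names_by_source.values():
        for fqdn in names:
            # Group by the first DNS label, i.e. the short hostname.
            seen[fqdn.split(".", 1)[0]].add(fqdn)
    return {short: sorted(fqdns) for short, fqdns in seen.items() if len(fqdns) > 1}


# Names taken from this report, tagged by where they were observed.
reported = {
    "salt_minion_keys": ["r14-cephstorage-0.oss.labs", "r14-controller-0.oss.labs"],
    "cluster_summary": ["r14-cephstorage-0.storagemgmt.oss.labs", "r14-controller-0.oss.labs"],
}

for short, fqdns in find_fqdn_mismatches(reported).items():
    print(short, "->", ", ".join(fqdns))
```

With the names above, only r14-cephstorage-0 is flagged, which matches the host that blocks the import.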
@John: Is r14-cephstorage-0.storagemgmt.oss.labs the FQDN of a cluster network NIC? If so, the build should really help here. We mistakenly resolved the cluster address instead of the public address in calamari, which might explain why you are seeing these mixed-up FQDNs.
btw: Even if it is not the FQDN of the cluster network NIC, the build might still help, as it contains a few more name resolution fixes.
Yes, it is the FQDN for the cluster network NIC. I have tested a TEST version of the package, calamari-server-1.5.4-0.2.TEST.bz1434608.b57245a.el7cp.x86_64, and it resolves this issue. See BZ-1434874, which is probably a duplicate of this issue.
John Willams, thanks for confirming that the test package fixes your issue. We're tracking a rebase to calamari v1.5.5 in bz 1430104, which is going through QE now.
If you don't mind, I'm going to close this BZ as a dup of 1430104 so we have a single place to track this issue.
*** This bug has been marked as a duplicate of bug 1430104 ***
When do you expect the patches to be pushed to CDN?
I wanted to close this as a duplicate myself but then I remembered that this bugzilla discusses two issues (the more pressing one with the OSDs and the original one with MONs).
The one with OSDs is fixed by bug 1430104. The other one might have been fixed by some other commit, as there have been several recent commits addressing name resolution issues in Calamari (all of them are in the v1.5.5 package).
Do you still need to apply the workaround for patching /etc/hosts file? If so, maybe we should re-open for that issue.
>Do you still need to apply the workaround for patching /etc/hosts file? If so, maybe we should re-open for that issue.
I'm not sure. Let me look into it first and I'll report back here.