Bug 1414918

Summary: Error importing external OSP Ceph cluster due to FQDN mismatch
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Alan Bishop <alan_bishop>
Component: Calamari
Assignee: Christina Meno <gmeno>
Calamari sub component: Back-end
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: medium
Priority: unspecified
CC: abishop, arkady_kanevsky, branto, cdevine, ceph-eng-bugs, christopher_dearborn, dcain, gmeno, japplewh, John_walsh, j_t_williams, kdreyer, kurt_hey, lmiccini, morazi, nthomas, randy_perryman, sankarshan, smerrow, sreichar, vereddy
Version: 1.3.3
Target Milestone: rc
Target Release: 2.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-06 20:47:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1387797    
Bug Blocks: 1356451    

Description Alan Bishop 2017-01-19 17:44:38 UTC
Description of problem:

When attempting to import an external Ceph cluster that was deployed by the
OSP Director, the task fails because of an FQDN mismatch for the Monitor nodes
(which are the OSP Controllers). The FQDNs for the Monitor nodes reported by
the storage console agent do not match the FQDNs for the same nodes reported
by the calamari-server. This confuses the Storage Console, which treats the
differing FQDNs as different nodes.

Version-Release number of selected component (if applicable):

Red Hat Storage Console 2.0
Red Hat Ceph Storage 2.0
OSP-10

How reproducible:

Always

Steps to Reproduce:

1. Deploy a Ceph cluster on baremetal hardware using OSP-10 Director
2. Deploy Storage Console in a VM, and console agent on OSP Ceph nodes
3. Attempt to import the cluster into the Storage Console

Actual results:

The import task fails, and the Storage Console displays an error message that
states, "One or more of the hosts in this cluster cannot be found."

Expected results:

The cluster is successfully imported.

Additional info:

The issue is related to Bug #1387797 in that it reveals how the problem
occurs. The calamari-server running on one of the Monitor nodes calls
socket.getfqdn() to determine the FQDN of every node in the cluster, and the
value returned is based on the contents of the calamari-server's /etc/hosts
file. Unfortunately, the FQDNs for all Monitor node addresses on the storage
network contain an extra ".storage" subdomain, which in turn causes the FQDNs
reported by the calamari-server to contain the ".storage" subdomain.
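
For illustration only (hypothetical address and hostnames, not taken from the
affected deployment), the following Python snippet shows how socket.getfqdn()
simply returns whatever FQDN the resolver -- and therefore /etc/hosts --
associates with an address:

import socket

# Assume /etc/hosts on the calamari-server maps the monitor's storage-network
# address to a ".storage" name, e.g.:
#   172.16.1.10  overcloud-controller-0.storage.localdomain overcloud-controller-0
# Reverse-resolving that address then yields the ".storage" FQDN:
name, aliases, addrs = socket.gethostbyaddr('172.16.1.10')
print(name)                           # overcloud-controller-0.storage.localdomain
print(socket.getfqdn('172.16.1.10'))  # same ".storage" name, hence the mismatch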

The problem can be worked around by patching /etc/hosts on the calamari-server
so that the FQDN for the storage network address of each Monitor is the same
as its canonical FQDN (no ".storage" subdomain).  However, it should not be
necessary to patch a file that is managed by the OSP Director.
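
As a concrete (hypothetical) example of that workaround, the /etc/hosts entry
for each Monitor's storage-network address can be changed so it resolves to
the canonical name instead of the ".storage" name:

# before: 172.16.1.10  overcloud-controller-0.storage.localdomain overcloud-controller-0
# after:
172.16.1.10  overcloud-controller-0.localdomain overcloud-controller-0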

Comment 2 John Williams 2017-03-21 18:58:15 UTC
Alan's fix for the /etc/hosts file was working for us, but a new issue associated with the storage nodes has appeared with the latest CDN updates.

Previously working ceph packages were: 
ceph-installer-1.0.15-2.el7scon.noarch
ceph-ansible-1.0.5-46.el7scon.noarch
rhscon-ceph-0.0.43-1.el7scon.x86_64 

Current non-working packages are:
ceph-installer-1.2.2-1.el7scon.noarch
ceph-ansible-2.1.9-1.el7scon.noarch
rhscon-ceph-0.0.43-1.el7scon.x86_64

When we try to import the cluster, we now see an issue where the storage node hostname contains the string "storagemgmt" in the FQDN (e.g. r14-cephstorage-0.storagemgmt.oss.labs). Note that this name does not appear on the "Select Monitor Host" GUI page. It shows up on the subsequent "Cluster Summary" page after selecting the r14-controller-0.oss.labs host and clicking Continue.

I tried adding the new FQDN to the storage node's hosts file as shown below, but still had no success.
192.168.170.18 r14-cephstorage-0.oss.labs r14-cephstorage-0.storagemgmt.oss.labs

I checked the salt minion keys in the /etc/salt/pki/master/minions_pre directory, as shown below:
-rw-r--r--. 1 root root  451 Mar 21 17:39 r14-cephstorage-0.oss.labs 

The filename does not include the string "storagemgmt". 

Nor does the storage node return the string when running "hostname --fqdn" from the command line:
$ hostname --fqdn
r14-cephstorage-0.oss.labs

I am not sure how the code is getting this information.

Comment 6 Boris Ranto 2017-04-05 20:18:09 UTC
@John: Is r14-cephstorage-0.storagemgmt.oss.labs the FQDN for a cluster network NIC? If yes, then the build should really help here. We mistakenly resolved the cluster address instead of the public address in calamari, and that might explain why you are seeing these mixed-up FQDNs.

btw: Even if it is not the FQDN of the cluster network NIC, the build might still help, as it contains a few more name resolution fixes.
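
As an illustration of that mix-up (hypothetical IP addresses, not taken from
the lab), the same host can reverse-resolve to different FQDNs depending on
which of its addresses is looked up:

import socket

# Hypothetical addresses for one storage node:
public_ip = '198.51.100.18'   # public network  -> r14-cephstorage-0.oss.labs
cluster_ip = '203.0.113.18'   # cluster network -> r14-cephstorage-0.storagemgmt.oss.labs
print(socket.getfqdn(public_ip))   # the name the Storage Console expects
print(socket.getfqdn(cluster_ip))  # the ".storagemgmt" name calamari reported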

Comment 7 John Williams 2017-04-05 20:44:27 UTC
Yes, it is the FQDN for the cluster network NIC. I have tested a TEST version of the calamari-server package (calamari-server-1.5.4-0.2.TEST.bz1434608.b57245a.el7cp.x86_64) and it resolves this issue. See BZ-1434874, which is probably a duplicate of this issue.

Comment 13 Ken Dreyer (Red Hat) 2017-04-06 20:47:23 UTC
John Williams, thanks for confirming that the test package fixes your issue. We're tracking a rebase to calamari v1.5.5 in bz 1430104, which is going through QE now.

If you don't mind, I'm going to close this BZ as a dup of 1430104 so we have a single place to track this issue.

*** This bug has been marked as a duplicate of bug 1430104 ***

Comment 14 arkady kanevsky 2017-04-06 20:54:32 UTC
Ken,
When do you expect the patches to be pushed to CDN?

Comment 15 Boris Ranto 2017-04-06 20:57:29 UTC
I wanted to close this as a duplicate myself, but then I remembered that this bugzilla discusses two issues (the more pressing one with the OSDs and the original one with the MONs).

The one with the OSDs is fixed by bug 1430104. The other one might have been fixed by some other commit, as there have been several commits trying to fix name resolution issues in Calamari recently (all of them are in the v1.5.5 package).

Do you still need to apply the workaround of patching the /etc/hosts file? If so, maybe we should re-open this bug for that issue.

Comment 16 John Williams 2017-04-07 13:47:48 UTC
> Do you still need to apply the workaround of patching the /etc/hosts file? If so, maybe we should re-open this bug for that issue.

I'm not sure.  Let me look into it first and I'll report back here.