Bug 1387797

Summary:	socket.getnameinfo returns an IP and calamari fails to list all participating OSD in cluster
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Christina Meno <gmeno>
Component:	Calamari	Assignee:	Christina Meno <gmeno>
Calamari sub component:	Back-end	QA Contact:	Vasishta <vashastr>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	amarango, ceph-eng-bugs, gmeno, hnallurv, japplewh, kdreyer, vsarmila
Version:	2.1
Target Milestone:	rc
Target Release:	2.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	RHEL: calamari-server-1.5.0-1.el7cp Ubuntu: calamari_1.5.0-2redhat1xenial	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-03-14 15:45:56 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1414918

Description Christina Meno 2016-10-21 22:48:28 UTC

Description of problem:
When doing GET /api/v2/cluster/a471cad3-2036-4607-8ecb-ccf3e4927b22/server

It returns only the 3 MONs and a single OSD. The single OSD has the following info:
[...]
  {
    "managed": false,
    "last_contact": null,
    "ceph_version": null,
    "backend_addr": "172.17.36.10",
    "hostname": "172",
    "frontend_iface": null,
    "fqdn": "172.17.36.16",
    "boot_time": null,
    "frontend_addr": "172.17.28.10",
    "services": [
[...]

Which might explain the two issues:
 - the hostname being set to 172 would explain why I only see 1 out of 32 OSD nodes (all hostnames look the same to calamari)
 - the fqdn is what the UI reports as the hostname
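
A minimal sketch of the first point, in Python 3. It assumes the short hostname is obtained by taking the first dot-separated label of the reported name, which is consistent with the values above but is not confirmed by this report. The function names are hypothetical, and the 32 addresses are made-up stand-ins for OSD nodes whose names did not resolve.

```python
import ipaddress


def naive_short_hostname(name):
    """Take the first dot-separated label, as one would for a real FQDN."""
    return name.split(".")[0]


def is_ip_literal(name):
    """Return True when name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def guarded_short_hostname(name):
    """Keep IP literals whole; shorten only real names."""
    return name if is_ip_literal(name) else name.split(".")[0]


# Hypothetical addresses for 32 OSD nodes whose names did not resolve.
unresolved = ["172.17.36.%d" % n for n in range(10, 42)]

# Keyed by the naive short hostname, every node lands on the same key.
by_naive = {naive_short_hostname(a): a for a in unresolved}
# Keyed by the guarded variant, each node keeps its own entry.
by_guarded = {guarded_short_hostname(a): a for a in unresolved}

print(len(by_naive), len(by_guarded))
```

With these inputs the naive keying leaves a single entry ("172") and the guarded keying leaves 32, which matches the symptom of seeing 1 out of 32 OSD nodes.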

Does anybody know where these values come from? With that information I can try to work around the issue. I assume the optimal configuration would be to have DNS resolution, but that is not an option at this time.



Version-Release number of selected component (if applicable):


How reproducible:
????

Steps to Reproduce:
TBD - Alex

Actual results:
server details list the FQDN as an IP address and the hostname as part of the IP address

Expected results:
server details list FQDN as fully qualified domain name and hostname as hostname

Additional info:

Comment 2 Christina Meno 2016-10-21 22:50:42 UTC

Patch ready for testing upstream. https://github.com/ceph/calamari/pull/497

Alex, Would you please help describe steps to reproduce and impact?

Comment 3 Alexandre Marangone 2016-10-24 14:50:14 UTC

To reproduce, you will need to deploy a cluster with no DNS resolution and with shortnames only (i.e. hostname, not hostname.something).
I also tried renaming my hosts to hostname.something. With that naming, the bug was not hit.

Let me know if you need more info but I think that's just it.
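
The role of DNS in these steps can be sketched in Python as follows. By default, socket.getnameinfo falls back to the numeric form of the address when no name can be found; passing socket.NI_NAMEREQD makes it raise socket.gaierror instead. The helper name resolved_name_or_none and its injectable lookup parameter are hypothetical, added so the logic can be exercised without a network. This is an illustration, not calamari's code.

```python
import socket


def resolved_name_or_none(ip, lookup=socket.getnameinfo):
    """Return the reverse-resolved name for ip, or None if there is none."""
    # NI_NAMEREQD turns "no name found" into an error instead of silently
    # handing back the numeric address; NI_NUMERICSERV skips the service
    # name lookup for the dummy port.
    flags = socket.NI_NAMEREQD | socket.NI_NUMERICSERV
    try:
        host, _service = lookup((ip, 0), flags)
    except socket.gaierror:
        return None
    return host
```

A caller can then treat None as "unresolved" rather than deriving a hostname from an IP address.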

Comment 4 Christina Meno 2016-11-28 19:41:53 UTC

Boris, please work with Martin to get this issue resolved. I hear that he has a setup that is failing in the same section of code.

Comment 6 Christina Meno 2017-01-11 16:02:51 UTC

Boris is on leave. I'm taking this back.

Comment 7 Christina Meno 2017-01-12 23:49:27 UTC

Some context to the larger problem from Alan Bishop:

We’re in the right neighborhood, but the fix for BZ 1387797 won’t address my issue. I looked at the code change (https://github.com/ceph/calamari/pull/497), and it actually highlights the problem. That is, calamari is using socket.getfqdn() to get the server’s FQDN. While that appears to be reasonable, the problem is the value returned is influenced by the contents of the calamari server’s /etc/hosts file. In OSP-10, the hosts file contains different FQDN values for the same host when that host has addresses on multiple networks.

As far as I can tell (from reading the code and observing my system), the Storage Console “discovers” hosts that are running the console agent, and the FQDN they report is their own salt minion_id. That value matches “hostname --fqdn” on the server. However, the calamari server identifies each server using socket.getfqdn(), and the value returned may not match the server’s salt minion_id. Here’s how this happens in an OpenStack environment.

Every node in the OSP overcloud (OpenStack controllers, computes and storage nodes) has multiple network connections. When the OSP Director (OSPd) deploys an overcloud node, it creates /etc/hosts entries for every IP address associated with that node, and then it pushes updates to *every* other OpenStack node. That way all nodes have host entries for every other node.

The host entries for a storage (OSD) node look like this:

192.168.XXX.aaa overcloud-cephstorage-0.overcloud.localdomain

And the host entries for controller (MON) nodes look like this:

192.168.YYY.bbb overcloud-controller-0.overcloud.localdomain
192.168.XXX.bbb overcloud-controller-0.storage.overcloud.localdomain

Here, XXX is the network associated with the public side of the Ceph cluster, and YYY is the internal OpenStack API network. The actual FQDN of each node (“hostname --fqdn”) depends on the OpenStack node type:
- For controllers, their FQDN is the one on the YYY network
- For storage nodes, their FQDN is the one on the XXX network

So here’s what happens.
- The salt minion_id for the MON servers (i.e. the OpenStack controllers) will report the YYY name
-> overcloud-controller-0.overcloud.localdomain
- The calamari server will report socket.getfqdn() on the XXX address, and that will be the XXX name
-> overcloud-controller-0.storage.overcloud.localdomain

This confuses the Storage Console because it treats the two FQDNs as two different servers. I don’t think the calamari server is wrong in using socket.getfqdn(). And I don’t think OSP is wrong to be assigning multiple names (with subdomains) to each server in the /etc/hosts file. So where does that leave us?
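
The mismatch can be sketched with a small Python model. The dictionary stands in for the controller's /etc/hosts entries shown above, with the elided address octets kept as written. The variable names are hypothetical and no real name resolution is performed.

```python
# Stand-in for the calamari server's /etc/hosts entries for one controller.
# The address strings keep the placeholders used above.
hosts_entries = {
    "192.168.YYY.bbb": "overcloud-controller-0.overcloud.localdomain",
    "192.168.XXX.bbb": "overcloud-controller-0.storage.overcloud.localdomain",
}

# Name the node reports for itself (its salt minion_id, per the text above).
minion_id = hosts_entries["192.168.YYY.bbb"]
# Name obtained when the address on the Ceph public network is looked up.
name_from_storage_addr = hosts_entries["192.168.XXX.bbb"]

# A consumer that keys servers by FQDN ends up with two records
# for what is physically one machine.
servers_by_fqdn = {minion_id: "agent", name_from_storage_addr: "calamari"}
print(len(servers_by_fqdn))
```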

One idea might be to enhance the Storage Console so that it understands the FQDN reported by the console agent during discovery may not match the FQDN reported by the calamari server, when the only difference is a subdomain. Basically it would treat “host.domain” and “host.subdomain.domain” as the same machine.
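
A rough Python sketch of that matching rule, under the assumption that two names refer to the same machine when the first label is identical and one domain is the other with extra leading labels. The function name same_machine is hypothetical; nothing here reflects how the Storage Console is actually implemented.

```python
def same_machine(fqdn_a, fqdn_b):
    """Treat host.domain and host.subdomain.domain as one machine."""
    host_a, _, domain_a = fqdn_a.lower().rstrip(".").partition(".")
    host_b, _, domain_b = fqdn_b.lower().rstrip(".").partition(".")
    if host_a != host_b:
        return False
    if domain_a == domain_b:
        return True
    # One domain must equal the other plus extra leading labels.
    longer, shorter = sorted((domain_a, domain_b), key=len, reverse=True)
    return bool(shorter) and longer.endswith("." + shorter)
```

A rule this loose would also merge hosts that legitimately share a short name across a parent domain and one of its subdomains, so it is a trade-off rather than a safe default.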

Alan

Comment 8 Christina Meno 2017-01-12 23:54:48 UTC

Jeff,

I've seen a thread discussing this bug, but I'm a little confused on what specifically you need in addition to what this patch does. Would you please help me understand that?

cheers,
G

Comment 9 Christina Meno 2017-01-12 23:55:57 UTC

Jeff,

I've seen a thread discussing this bug, but I'm a little confused on what specifically you need in addition to what this patch does. Would you please help me understand that?

cheers,
G

Comment 10 Harish NV Rao 2017-01-17 09:56:08 UTC

@Gregory, 
a) what is the decision on this bug?
b) what are the steps to verify the fix if it's going to be fixed in 2.2

Comment 11 Christina Meno 2017-01-18 15:39:00 UTC

Harish, we will fix what Alex reported. See comment #3 for (b).

Comment 15 Vasishta 2017-02-23 14:57:12 UTC

Hi All,

I tried the steps Alex described in Comment 3. I commented out all server entries in /etc/resolv.conf. My nodes (VMs) were not configured with an FQDN.

I didn't hit the issue. So, I'm moving this bug to VERIFIED state. 
Please feel free to yell at me if steps I followed are not appropriate.


Regards,
Vasishta

Comment 17 errata-xmlrpc 2017-03-14 15:45:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html