Bug 1387797 - socket.getnameinfo returns an IP and calamari fails to list all participating OSD in cluster
Summary: socket.getnameinfo returns an IP and calamari fails to list all participating...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Calamari
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 2.2
Assignee: Christina Meno
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks: 1414918
Reported: 2016-10-21 22:48 UTC by Christina Meno
Modified: 2017-03-14 15:45 UTC

Fixed In Version: RHEL: calamari-server-1.5.0-1.el7cp Ubuntu: calamari_1.5.0-2redhat1xenial
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-14 15:45:56 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2017:0514 | 0 | normal | SHIPPED_LIVE | Red Hat Ceph Storage 2.2 bug fix and enhancement update | 2017-03-21 07:24:26 UTC

Description Christina Meno 2016-10-21 22:48:28 UTC
Description of problem:
When doing GET /api/v2/cluster/a471cad3-2036-4607-8ecb-ccf3e4927b22/server

It returns only the 3 MONs and a single OSD. That single OSD has the following info:
[...]
  {
    "managed": false,
    "last_contact": null,
    "ceph_version": null,
    "backend_addr": "172.17.36.10",
    "hostname": "172",
    "frontend_iface": null,
    "fqdn": "172.17.36.16",
    "boot_time": null,
    "frontend_addr": "172.17.28.10",
    "services": [
[...]

This might explain the two issues:
 - the hostname being set to "172" would explain why I only see 1 out of 32 OSD nodes (all hostnames look the same to calamari)
 - the fqdn, which holds an IP address, is what the UI reports as the hostname
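
A minimal sketch of why every OSD node would collapse into one entry, assuming (this is not confirmed anywhere in this report) that the hostname is derived by cutting the fqdn at its first dot. The helper names and the second and third addresses are hypothetical; only 172.17.36.16 comes from the output above.

```python
import ipaddress

def naive_hostname(fqdn):
    """Hypothetical helper: take everything before the first dot."""
    return fqdn.split(".")[0]

def is_ip_literal(name):
    """Return True if `name` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False

# First address is from the API output above; the others are made up.
osd_fqdns = ["172.17.36.16", "172.17.36.17", "172.17.36.18"]

# Every IP on the same /8 yields the same "hostname".
hostnames = {naive_hostname(f) for f in osd_fqdns}
print(hostnames)

# A guard like this would let the caller notice that the "fqdn" is an IP.
print([is_ip_literal(f) for f in osd_fqdns])
```

If the hostname is used as a key, all 32 OSD nodes would map to the same key, which matches the single entry seen here.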

Does anybody know where these values come from? With that info I can try to work around the issue. I would assume the optimal config would be to have DNS resolution, but that's not an option at this time.



Version-Release number of selected component (if applicable):


How reproducible:
????

Steps to Reproduce:
TBD - Alex

Actual results:
server details list FQDN as an IP address and hostname as a part of that IP

Expected results:
server details list FQDN as fully qualified domain name and hostname as hostname

Additional info:

Comment 2 Christina Meno 2016-10-21 22:50:42 UTC
Patch ready for testing upstream. https://github.com/ceph/calamari/pull/497

Alex, would you please help describe steps to reproduce and impact?

Comment 3 Alexandre Marangone 2016-10-24 14:50:14 UTC
To reproduce, you will need to deploy a cluster with no DNS resolution and with shortnames only (i.e. hostname, not hostname.something).
I did try renaming my hosts to hostname.something; the bug was not hit in that case.

Let me know if you need more info but I think that's just it.

Comment 4 Christina Meno 2016-11-28 19:41:53 UTC
Boris, please work with Martin to get this issue resolved. I hear that he has a setup that is failing in the same section of code.

Comment 6 Christina Meno 2017-01-11 16:02:51 UTC
Boris is on leave. I'm taking this back.

Comment 7 Christina Meno 2017-01-12 23:49:27 UTC
Some context to the larger problem from Alan Bishop:

We’re in the right neighborhood, but the fix for BZ 1387797 won’t address my issue. I looked at the code change (https://github.com/ceph/calamari/pull/497), and it actually highlights the problem. That is, calamari is using socket.getfqdn() to get the server’s FQDN. While that appears to be reasonable, the problem is the value returned is influenced by the contents of the calamari server’s /etc/hosts file. In OSP-10, the hosts file contains different FQDN values for the same host when that host has addresses on multiple networks.
 
As far as I can tell (from reading the code and observing my system), the Storage Console “discovers” hosts that are running the console agent, and the FQDN they report is their own salt minion_id. That value matches “hostname --fqdn” on the server. However, the calamari server identifies each server using socket.getfqdn(), and the value returned may not match the server’s salt minion_id. Here’s how this happens in an OpenStack environment.
 
Every node in the OSP overcloud (OpenStack controllers, computes and storage nodes) has multiple network connections. When the OSP Director (OSPd) deploys an overcloud node, it creates /etc/hosts entries for every IP address associated with that node, and then it pushes updates to *every* other OpenStack node. That way all nodes have host entries for every other node.
 
The host entries for a storage (OSD) node look like this:
 
  192.168.XXX.aaa overcloud-cephstorage-0.overcloud.localdomain
 
And the host entries for controller (MON) nodes look like this:
 
  192.168.YYY.bbb overcloud-controller-0.overcloud.localdomain
  192.168.XXX.bbb overcloud-controller-0.storage.overcloud.localdomain
 
Here, XXX is the network associated with the public side of the Ceph cluster, and YYY is the internal OpenStack API network. The actual FQDN of each node (“hostname --fqdn”) depends on the OpenStack node type:
- For controllers, the FQDN is the one on the YYY network
- For storage nodes, the FQDN is the one on the XXX network
 
So here’s what happens.
- The salt minion_id for the MON servers (i.e. the OpenStack controllers) will report the YYY name
    -> overcloud-controller-0.overcloud.localdomain
- The calamari server will report socket.getfqdn() on the XXX address, and that will be the XXX name
    -> overcloud-controller-0.storage.overcloud.localdomain
 
This confuses the Storage Console because it treats the two FQDNs as two different servers. I don’t think the calamari server is wrong in using socket.getfqdn(). And I don’t think OSP is wrong to be assigning multiple names (with subdomains) to each server in the /etc/hosts file. So where does that leave us?
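
The mismatch can be sketched with a toy hosts table. This only models the lookup as a dictionary and does not call socket.getfqdn(); the address strings are the placeholders from the controller entries above, not real IPs, and parse_hosts is a hypothetical helper.

```python
# Controller entries copied from the example above (placeholders, not real IPs).
HOSTS_TEXT = """
192.168.YYY.bbb overcloud-controller-0.overcloud.localdomain
192.168.XXX.bbb overcloud-controller-0.storage.overcloud.localdomain
"""

def parse_hosts(text):
    """Map each address to the first name on its line, hosts-file style."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        addr, *names = line.split()
        if names:
            table[addr] = names[0]
    return table

hosts = parse_hosts(HOSTS_TEXT)

# Name the node reports for itself (its FQDN is on the YYY network).
minion_id = hosts["192.168.YYY.bbb"]
# Name obtained when resolving the node's XXX (Ceph public) address.
calamari_name = hosts["192.168.XXX.bbb"]

# One physical host, two distinct identities.
distinct_servers = {minion_id, calamari_name}
print(len(distinct_servers))
```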
 
One idea might be to enhance the Storage Console so that it understands the FQDN reported by the console agent during discovery may not match the FQDN reported by the calamari server, when the only difference is a subdomain. Basically it would treat “host.domain” and “host.subdomain.domain” as the same machine.
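
A rough sketch of that comparison, under the assumption that two names refer to the same machine when the host label matches and the shorter domain is a suffix of the longer one. The function same_machine is hypothetical and is not part of the Storage Console or calamari.

```python
def same_machine(fqdn_a, fqdn_b):
    """Treat host.domain and host.subdomain.domain as one machine."""
    a = fqdn_a.lower().rstrip(".").split(".")
    b = fqdn_b.lower().rstrip(".").split(".")
    if a[0] != b[0]:
        return False  # different host label
    # Compare the domain parts: the shorter must be a suffix of the longer.
    short, long_ = sorted((a[1:], b[1:]), key=len)
    if not short:
        return False  # bare short names carry no domain to compare
    return long_[-len(short):] == short

print(same_machine(
    "overcloud-controller-0.overcloud.localdomain",
    "overcloud-controller-0.storage.overcloud.localdomain",
))
```

One caveat with this rule: two genuinely different hosts that share a short name across a domain and one of its subdomains would be merged, so it trades one kind of confusion for another.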
 
Alan

Comment 8 Christina Meno 2017-01-12 23:54:48 UTC
Jeff,

I've seen a thread discussing this bug, but I'm a little confused on what specifically you need in addition to what this patch does. Would you please help me understand that?

cheers,
G

Comment 10 Harish NV Rao 2017-01-17 09:56:08 UTC
@Gregory, 
a) what is the decision on this bug?
b) what are the steps to verify the fix if it's going to be fixed in 2.2

Comment 11 Christina Meno 2017-01-18 15:39:00 UTC
Harish, we will fix what Alex reported. See comment #3 for (b).

Comment 15 Vasishta 2017-02-23 14:57:12 UTC
Hi All,

I tried the steps mentioned in Comment 3 by Alex. I commented out all server entries in /etc/resolv.conf. My nodes (VMs) were not configured to have an FQDN.

I didn't hit the issue, so I'm moving this bug to the VERIFIED state.
Please feel free to yell at me if the steps I followed are not appropriate.


Regards,
Vasishta

Comment 17 errata-xmlrpc 2017-03-14 15:45:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html

