Description of problem:
When doing GET /api/v2/cluster/a471cad3-2036-4607-8ecb-ccf3e4927b22/server, it returns only the 3 MONs and a single OSD. The single OSD has the following info:

[...]
{
    "managed": false,
    "last_contact": null,
    "ceph_version": null,
    "backend_addr": "172.17.36.10",
    "hostname": "172",
    "frontend_iface": null,
    "fqdn": "172.17.36.16",
    "boot_time": null,
    "frontend_addr": "172.17.28.10",
    "services": [
[...]

Which might explain the two issues:
- hostname set to "172" would explain why I only see 1 out of 32 OSD nodes (all hostnames look the same to calamari)
- and the fqdn is what is reported by the UI as hostname

Does anybody know where these values come from? With that info I can try to work around the issue. I assume the optimal config would be to have DNS resolution, but that's not an option at this time.

Version-Release number of selected component (if applicable):

How reproducible:
????

Steps to Reproduce:
TBD - Alex

Actual results:
Server details list the FQDN as an IP and the hostname as part of the IP.

Expected results:
Server details list the FQDN as a fully qualified domain name and the hostname as a hostname.

Additional info:
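The bogus "hostname": "172" looks like what you get when the IP address stands in for the FQDN and the short hostname is derived by splitting on the first dot. A minimal sketch of that failure mode, assuming the split-on-dot derivation (short_hostname is a hypothetical helper for illustration, not calamari's actual code):

```python
def short_hostname(fqdn):
    """Return the leading label of an FQDN.

    When no DNS or /etc/hosts entry exists, the "FQDN" can end up
    being the raw IP address; splitting that on "." yields a bogus
    short hostname such as "172", identical for every node on the
    same /8 -- which would collapse 32 OSD nodes into one.
    """
    return fqdn.split(".")[0]

# With a resolvable name, the result is sensible:
print(short_hostname("osd-node-07.example.com"))  # -> "osd-node-07"

# With an IP literal standing in for the FQDN, it is not:
print(short_hostname("172.17.36.16"))  # -> "172"
```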
Patch ready for testing upstream. https://github.com/ceph/calamari/pull/497 Alex, Would you please help describe steps to reproduce and impact?
To reproduce, you will need to deploy a cluster with no DNS resolution and with shortnames only (i.e. hostname, not hostname.something). I did try renaming my hosts to hostname.something; the bug was not hit. Let me know if you need more info, but I think that's just it.
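The precondition above can be checked from Python: when a name cannot be resolved at all (no DNS, no /etc/hosts entry), socket.getfqdn() returns the name unchanged, so a shortname-only host never gains a domain part. A small sketch (the name used is an arbitrary unresolvable placeholder):

```python
import socket

# socket.getfqdn(name) tries a reverse lookup; if that fails, it
# returns the name as given. On a cluster with no DNS and
# shortname-only hosts, this is exactly the state the repro needs.
name = "calamari-repro-shortname-xyz"
print(socket.getfqdn(name))  # falls back to the shortname itself
```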
Boris, please work with Martin to get this issue resolved; I hear he's got a setup that is failing in the same section of code.
Boris is on leave. I'm taking this back.
Some context on the larger problem from Alan Bishop:

We're in the right neighborhood, but the fix for BZ 1387797 won't address my issue. I looked at the code change (https://github.com/ceph/calamari/pull/497), and it actually highlights the problem. That is, calamari is using socket.getfqdn() to get the server's FQDN. While that appears reasonable, the problem is that the value returned is influenced by the contents of the calamari server's /etc/hosts file. In OSP-10, the hosts file contains different FQDN values for the same host when that host has addresses on multiple networks.

As far as I can tell (from reading the code and observing my system), the Storage Console "discovers" hosts that are running the console agent, and the FQDN they report is their own salt minion_id. That value matches "hostname --fqdn" on the server. However, the calamari server identifies each server using socket.getfqdn(), and the value returned may not match the server's salt minion_id.

Here's how this happens in an OpenStack environment. Every node in the OSP overcloud (OpenStack controllers, computes and storage nodes) has multiple network connections. When the OSP Director (OSPd) deploys an overcloud node, it creates /etc/hosts entries for every IP address associated with that node, and then pushes updates to *every* other OpenStack node. That way all nodes have host entries for every other node.

The host entries for a storage (OSD) node look like this:

    192.168.XXX.aaa overcloud-cephstorage-0.overcloud.localdomain

And the host entries for controller (MON) nodes look like this:

    192.168.YYY.bbb overcloud-controller-0.overcloud.localdomain
    192.168.XXX.bbb overcloud-controller-0.storage.overcloud.localdomain

Here, XXX is the network associated with the public side of the Ceph cluster, and YYY is the internal OpenStack API network.
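The mismatch Alan describes can be modeled in a few lines: the canonical name attached to an address is whatever /etc/hosts (or DNS) maps that address to, and OSPd writes a different FQDN per network. A toy sketch using the placeholder addresses from the comment above (fqdn_for_addr is a hypothetical stand-in for the reverse lookup, not calamari code):

```python
# Per-network /etc/hosts entries as OSPd writes them for one
# controller (XXX/YYY/bbb are the same placeholders used above).
HOSTS = {
    "192.168.YYY.bbb": "overcloud-controller-0.overcloud.localdomain",
    "192.168.XXX.bbb": "overcloud-controller-0.storage.overcloud.localdomain",
}

def fqdn_for_addr(addr):
    """Mimic a reverse lookup against the hosts table."""
    return HOSTS[addr]

minion_id = "overcloud-controller-0.overcloud.localdomain"  # hostname --fqdn
seen_by_calamari = fqdn_for_addr("192.168.XXX.bbb")         # storage-net addr

# The two names disagree, so the console counts one server twice.
print(seen_by_calamari == minion_id)  # -> False
```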
The actual FQDN of each node ("hostname --fqdn") depends on the OpenStack node type:
- For controllers, their FQDN is the one on the YYY network
- For storage nodes, their FQDN is the one on the XXX network

So here's what happens:
- The salt minion_id for the MON servers (i.e. the OpenStack controllers) will report the YYY name -> overcloud-controller-0.overcloud.localdomain
- The calamari server will call socket.getfqdn() on the XXX address, and that will return the XXX name -> overcloud-controller-0.storage.overcloud.localdomain

This confuses the Storage Console because it treats the two FQDNs as two different servers. I don't think the calamari server is wrong to use socket.getfqdn(), and I don't think OSP is wrong to assign multiple names (with subdomains) to each server in the /etc/hosts file. So where does that leave us? One idea might be to enhance the Storage Console so that it understands that the FQDN reported by the console agent during discovery may not match the FQDN reported by the calamari server, when the only difference is a subdomain. Basically it would treat "host.domain" and "host.subdomain.domain" as the same machine.

Alan
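The proposed "same machine" rule could be sketched as follows. This is a hypothetical illustration of the enhancement suggested above, not shipped code; a production version should also compare on dot-separated label boundaries rather than a raw suffix test:

```python
def same_machine(fqdn_a, fqdn_b):
    """Treat two FQDNs as the same host when the shortname matches
    and one domain is a suffix of the other, i.e. the only
    difference is an extra subdomain level."""
    host_a, _, dom_a = fqdn_a.partition(".")
    host_b, _, dom_b = fqdn_b.partition(".")
    if host_a != host_b:
        return False
    return dom_a.endswith(dom_b) or dom_b.endswith(dom_a)

# Same host seen on two networks -> matched:
print(same_machine(
    "overcloud-controller-0.overcloud.localdomain",
    "overcloud-controller-0.storage.overcloud.localdomain"))  # -> True

# Different hosts -> not matched:
print(same_machine(
    "overcloud-controller-1.overcloud.localdomain",
    "overcloud-controller-0.overcloud.localdomain"))          # -> False
```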
Jeff, I've seen a thread discussing this bug, but I'm a little confused about what specifically you need in addition to what this patch does. Would you please help me understand that? cheers, G
@Gregory, a) what is the decision on this bug? b) what are the steps to verify the fix, if it's going to be fixed in 2.2?
Harish, we will fix what Alex reported; see comment #3 for (b).
Hi All, I tried the steps mentioned in comment 3 by Alex. I commented out all nameserver info in /etc/resolv.conf. My nodes (VMs) were not configured to have FQDNs. I didn't hit the issue, so I'm moving this bug to VERIFIED state. Please feel free to yell at me if the steps I followed are not appropriate. Regards, Vasishta
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0514.html