Bug 1387797
Summary: | socket.getnameinfo returns an IP and calamari fails to list all participating OSD in cluster | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Christina Meno <gmeno> |
Component: | Calamari | Assignee: | Christina Meno <gmeno> |
Calamari sub component: | Back-end | QA Contact: | Vasishta <vashastr> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | unspecified | CC: | amarango, ceph-eng-bugs, gmeno, hnallurv, japplewh, kdreyer, vsarmila |
Version: | 2.1 | ||
Target Milestone: | rc | ||
Target Release: | 2.2 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | RHEL: calamari-server-1.5.0-1.el7cp Ubuntu: calamari_1.5.0-2redhat1xenial | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-03-14 15:45:56 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1414918 |
Description
Christina Meno
2016-10-21 22:48:28 UTC
Patch ready for testing upstream: https://github.com/ceph/calamari/pull/497

Alex, would you please help describe steps to reproduce and impact?

To reproduce, you will need to deploy a cluster with no DNS resolution and with short names only (i.e. hostname, not hostname.something). I did try renaming my hosts to hostname.something, and the bug was not hit. Let me know if you need more info, but I think that's all.

Boris, please work with Martin to get this issue resolved. I hear that he's got a setup that is failing in the same section of code.

Boris is on leave. I'm taking this back.

Some context to the larger problem from Alan Bishop:

We’re in the right neighborhood, but the fix for BZ 1387797 won’t address my issue. I looked at the code change (https://github.com/ceph/calamari/pull/497), and it actually highlights the problem. That is, calamari is using socket.getfqdn() to get the server’s FQDN. While that appears to be reasonable, the problem is that the value returned is influenced by the contents of the calamari server’s /etc/hosts file. In OSP-10, the hosts file contains different FQDN values for the same host when that host has addresses on multiple networks.

As far as I can tell (from reading the code and observing my system), the Storage Console “discovers” hosts that are running the console agent, and the FQDN they report is their own salt minion_id. That value matches “hostname --fqdn” on the server. However, the calamari server identifies each server using socket.getfqdn(), and the value returned may not match the server’s salt minion_id.

Here’s how this happens in an OpenStack environment. Every node in the OSP overcloud (OpenStack controllers, computes and storage nodes) has multiple network connections. When the OSP Director (OSPd) deploys an overcloud node, it creates /etc/hosts entries for every IP address associated with that node, and then it pushes updates to *every* other OpenStack node. That way all nodes have host entries for every other node.
The host entries for a storage (OSD) node look like this:

192.168.XXX.aaa overcloud-cephstorage-0.overcloud.localdomain

And the host entries for controller (MON) nodes look like this:

192.168.YYY.bbb overcloud-controller-0.overcloud.localdomain
192.168.XXX.bbb overcloud-controller-0.storage.overcloud.localdomain

Here, XXX is the network associated with the public side of the Ceph cluster, and YYY is the internal OpenStack API network. The actual FQDN of each node (“hostname --fqdn”) depends on the OpenStack node type:

- For controllers, the FQDN is the one on the YYY network
- For storage nodes, the FQDN is the one on the XXX network

So here’s what happens.

- The salt minion_id for the MON servers (i.e. the OpenStack controllers) will report the YYY name -> overcloud-controller-0.overcloud.localdomain
- The calamari server will report socket.getfqdn() on the XXX address, and that will be the XXX name -> overcloud-controller-0.storage.overcloud.localdomain

This confuses the Storage Console because it treats the two FQDNs as two different servers.

I don’t think the calamari server is wrong in using socket.getfqdn(). And I don’t think OSP is wrong to be assigning multiple names (with subdomains) to each server in the /etc/hosts file. So where does that leave us?

One idea might be to enhance the Storage Console so that it understands that the FQDN reported by the console agent during discovery may not match the FQDN reported by the calamari server, when the only difference is a subdomain. Basically it would treat “host.domain” and “host.subdomain.domain” as the same machine.

Alan

Jeff, I've seen a thread discussing this bug, but I'm a little confused about what specifically you need in addition to what this patch does. Would you please help me understand that?

cheers, G

@Gregory,
a) What is the decision on this bug?
b) What are the steps to verify the fix if it's going to be fixed in 2.2?

Harish

We will fix what Alex reported. See comment #3 for (b).

Hi All,

I tried the steps mentioned in Comment 3 by Alex. I commented out all server info in /etc/resolv.conf. My nodes (VMs) were not configured to have an FQDN. I didn't hit the issue, so I'm moving this bug to VERIFIED state. Please feel free to yell at me if the steps I followed are not appropriate.

Regards,
Vasishta

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html
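
For reference, the idea floated earlier in this thread (treating “host.domain” and “host.subdomain.domain” as the same machine) could be sketched roughly as below. This is only an illustration under stated assumptions, not code from calamari or the Storage Console: the helper names split_fqdn and same_machine are hypothetical, and the rule used (same first label, and the shorter domain is a suffix of the longer one) is one possible reading of “the only difference is a subdomain”.

```python
def split_fqdn(name):
    """Split a dotted name into (host_label, domain_labels), lowercased.

    Hypothetical helper for illustration; not part of calamari.
    """
    labels = name.strip().rstrip(".").lower().split(".")
    return labels[0], labels[1:]


def same_machine(name_a, name_b):
    """Return True if two names plausibly refer to the same host.

    Assumed rule: the first label must match, and the shorter domain
    must be a suffix of the longer one, so that host.domain and
    host.subdomain.domain compare equal. A bare short name is only
    considered equal to the same bare short name.
    """
    host_a, dom_a = split_fqdn(name_a)
    host_b, dom_b = split_fqdn(name_b)
    if host_a != host_b:
        return False
    shorter, longer = sorted((dom_a, dom_b), key=len)
    if not shorter:
        # Do not match an unqualified name against a qualified one.
        return not longer
    return longer[-len(shorter):] == shorter


if __name__ == "__main__":
    # The two names reported for the same controller in this thread.
    minion_id = "overcloud-controller-0.overcloud.localdomain"
    calamari_name = "overcloud-controller-0.storage.overcloud.localdomain"
    print(same_machine(minion_id, calamari_name))
```

A rule like this would also merge any two hosts that share a first label and a parent domain, so it would only be safe where host labels are unique across subdomains, as they appear to be in the OSP layout described above.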