Description of problem:
I have 3 OSD nodes, two of which have the entry "osd crush location hook = /usr/bin/calamari-crush-location" in ceph.conf. On these two nodes, running "service ceph start osd.<id>" prints the following error messages, but the daemon starts successfully.

[cephuser@magna086 ceph]$ sudo service ceph start osd.4
=== osd.4 ===
ERROR:calamari_osd_location:Error 2 running ceph config-key get:'libust[24687/24687]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
Error ENOENT: error obtaining 'daemon-private/osd.4/v1/calamari/osd_crush_location': (2) No such file or directory'
ERROR:calamari_osd_location:Error 1 running ceph config-key get:'2015-10-23 09:11:00.642080 7fafec7e4700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2015-10-23 09:11:00.642083 7fafec7e4700  0 librados: client.admin initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound'
libust[24763/24763]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
create-or-move updated item name 'osd.4' weight 0.9 at location {host=magna086} to crush map
Starting Ceph osd.4 on magna086...
Running as unit run-24806.service.

Version-Release number of selected component (if applicable):
0.94.3

How reproducible:
always

Steps to Reproduce:
1. Ensure /etc/ceph/ceph.conf has the entry for "osd crush location hook" on an OSD node
2. Restart an OSD on that node

Actual results:
The ERROR:calamari_osd_location messages shown above are printed during OSD startup.

Expected results:
OSD daemon starts without the above-mentioned error messages

Additional info:
These error messages are legitimate but could be toned down, as the OSD starts fine. The real bug here is why only two of the three OSD nodes are configured with a crush location hook. Lack of the location hook will only affect users who use calamari to alter their CRUSH map and place an OSD under a host node named differently from what $(hostname -s) reports. In that case, if the OSD gets restarted, it might not stick to the location previously specified. I'm suggesting this get fixed in 1.3.2, since it doesn't prevent OSDs from starting or being placed correctly in most cases.
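For context, a crush location hook is simply an executable that Ceph runs at OSD start; whatever the hook prints to stdout (key=value pairs) is used as the daemon's CRUSH location. A minimal illustrative sketch (this is not the calamari hook; the location keys shown are just an example):

```shell
#!/bin/sh
# Minimal crush location hook sketch. Ceph invokes the hook with
# --cluster/--id/--type arguments identifying the starting daemon and
# reads the CRUSH location from its stdout. A hook that always places
# the OSD under the short hostname in the default root could print:
echo "host=$(hostname -s) root=default"
```

The calamari hook does more than this: it looks the location up via "ceph config-key get", which is the call failing in the log above.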
patch upstream: https://github.com/ceph/calamari/pull/435/commits/798625c24d8f8b1710ce03c3824ddef172e2ec35
Gregory, the upstream patch is not merged and the PR is closed. Could you please let us know the correct patch for this bug?
@Manohar, Boris and I discussed this over IRC and came up with the following steps:
1) Stop an OSD daemon
2) Copy /opt/calamari/salt/salt/base/calamari-crush-location.py to /usr/bin/calamari-crush-location
3) In ceph.conf, add the entry "osd crush location hook = /usr/bin/calamari-crush-location"
4) Distribute this conf file to all nodes in the cluster
5) Start the stopped OSD daemon

Expected: no error messages are printed, the OSD starts successfully, and the cluster shows the active+clean state. Please execute the above steps and, if they work fine, move the BZ to verified state.
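The steps above can be sketched as shell commands. The cluster-side commands (stop/start, distribution) require a live cluster and are shown as comments; the osd id is a placeholder, and the conf edit is demonstrated against a scratch copy of ceph.conf rather than the real file:

```shell
# 1) Stop an OSD daemon (placeholder id):
#      sudo service ceph stop osd.<id>
# 2) Install the hook shipped with calamari:
#      sudo cp /opt/calamari/salt/salt/base/calamari-crush-location.py \
#              /usr/bin/calamari-crush-location
#      sudo chmod +x /usr/bin/calamari-crush-location
# 3) Add the hook entry to ceph.conf (demonstrated on a temp copy here):
conf=$(mktemp)
cat >> "$conf" <<'EOF'
[osd]
osd crush location hook = /usr/bin/calamari-crush-location
EOF
# 4) Distribute the conf file to all nodes, then
# 5) start the OSD again and confirm the cluster reports active+clean:
#      sudo service ceph start osd.<id>
#      ceph -s
grep -q 'osd crush location hook' "$conf" && echo "entry present"
rm -f "$conf"
```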
New PR: https://github.com/ceph/calamari/pull/530
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0626