Bug 1360467 - Newly added but never started OSDs cause osdmap update issues for Calamari
Newly added but never started OSDs cause osdmap update issues for Calamari
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Calamari
Version: 1.3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 1.3.3
Assigned To: Boris Ranto
QA Contact: Ramakrishnan Periyasamy
Docs Contact: Bara Ancincova
Depends On:
Blocks: 1372735
 
Reported: 2016-07-26 16:20 EDT by Justin Bautista
Modified: 2018-02-23 07:13 EST
CC List: 6 users

See Also:
Fixed In Version: RHEL: calamari-server-1.3.3-2.el7cp Ubuntu: calamari-server_1.3.3-3redhat1trusty
Doc Type: Bug Fix
Doc Text:
.Calamari now correctly handles manually added OSDs that do not have "ceph-osd" running
Previously, when OSD nodes were added manually to the Calamari server but the `ceph-osd` daemon was not started on the nodes, the Calamari server returned error messages and stopped updating statuses for the rest of the OSD nodes. The underlying source code has been modified, and Calamari now handles such OSDs properly.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-29 09:00:27 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID: Red Hat Product Errata RHSA-2016:1972
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Ceph Storage 1.3.3 security, bug fix, and enhancement update
Last Updated: 2016-09-29 12:51:21 EDT

Description Justin Bautista 2016-07-26 16:20:34 EDT
Description of problem:

Newly added but never started OSDs cause osdmap update issues for Calamari (OSDs are listed in Web UI, but are grayed out)

Version-Release number of selected component (if applicable):

calamari-server-1.3.3-1.el7cp.x86_64

How reproducible:

Always

Steps to Reproduce:
1. Add OSDs to the Ceph cluster without starting them
2. Add the OSD host to Calamari for monitoring (a command-level sketch follows these steps)
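
A minimal reproduction sketch, under stated assumptions: the host and device names (osd-node4, /dev/sdb) are placeholders, and the exact Calamari enrollment step may differ per deployment.

# Prepare a new OSD on the host but never activate it, so ceph-osd stays down
ceph-deploy osd prepare osd-node4:/dev/sdb
# (deliberately skip: ceph-deploy osd activate osd-node4:/dev/sdb1)

# Enroll the OSD host with Calamari for monitoring, e.g.:
ceph-deploy calamari connect osd-node4

# The OSD now exists in the osdmap with its daemon never started;
# in the affected Calamari builds the UI greys out OSDs at this point
ceph osd tree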

Actual results:

OSDs are grayed out in Calamari UI

Expected results:

OSDs should be listed/monitored (not grayed out)
Comment 3 Gregory Meno 2016-08-17 14:18:50 EDT
https://github.com/ceph/calamari/pull/483
Comment 4 Gregory Meno 2016-08-17 14:20:00 EDT
Federico,

I'd like to ship this already fixed issue in 1.3.3
Comment 6 Gregory Meno 2016-08-19 09:30:48 EDT
Steps to test:

1. Setup a ceph cluster and calamari.
2. Add an OSD as described here BUT DO NOT start the OSD daemon: http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/#adding-an-osd-manual
3. Observe that Calamari no longer updates OSD status from that node (a condensed command sketch follows these steps).
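
For reference, a condensed sketch of the manual procedure from the linked hammer documentation, stopping just before the daemon is started. The OSD id (10), host name, and CRUSH weight below are illustrative, not taken from this bug.

ceph osd create                          # allocates the next OSD id (assume it returns 10)
ssh osd-node4
mkdir /var/lib/ceph/osd/ceph-10          # (disk formatting/mounting omitted here)
ceph-osd -i 10 --mkfs --mkkey            # initialize the OSD data directory and key
ceph auth add osd.10 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-10/keyring
ceph osd crush add osd.10 1.0 host=osd-node4
# Deliberately do NOT run the final step from the docs that starts the daemon
# (e.g. "sudo /etc/init.d/ceph start osd.10" on sysvinit hosts)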
Comment 12 Ramakrishnan Periyasamy 2016-09-15 05:06:36 EDT
Based on the observations below, it is not clear whether the OSD should be visible in some sections or not.

1. In the Workbench section, the OSD is listed with status down (red color).
2. In the Manage > OSD section, the OSD is not visible.

Still waiting for a reply to comment 10. Marking this bug as failed verification from QE, since comment 6 says the OSD should not be visible.
Comment 13 Gregory Meno 2016-09-15 17:35:22 EDT
What I mean in c6 step 3 is that, before the fix from this BZ is applied, reporting on the state of OSD 10 no longer happens. OSD 9 is in a state that causes the error.

Verification should not be based on the fact that it is visible in the UI. Does this make sense?

So, to confirm that you have the fix, you should be able to see changes in state to OSD 10.
Comment 14 Ramakrishnan Periyasamy 2016-09-16 03:23:33 EDT
(In reply to Gregory Meno from comment #13)
> What I mean in c6 step 3 is that reporting on the state of OSD 10 will no
> longer happen before applying the fix from this BZ. OSD 9 is in a state that
> causes the error.

I would like to clarify a few things below. Maybe based on that we can take this defect to closure.

Please note that osd.9 is under test, not osd.10.

Initially my test setup had 11 OSDs. I removed osd.9 from it and added it back to the cluster to verify the fix. As per the steps, I did not start osd.9 after adding it. As per the fix, osd.9 is expected not to be visible in Calamari.

Please correct me if I am wrong.

> 
> Verification should not be based on the fact that it is visible in the UI.
> Does this make sense?
> 
> so to confirm that you have a fix you should be able to see changes in state
> to OSD10
Comment 15 Gregory Meno 2016-09-16 13:24:59 EDT
I can understand why you feel that way. That being said, it is actually Calamari's ceph.py module on magna108 that is under test; osd.9 is the cause of the failure. The way to verify the fix is to observe updates to the state of any other OSD.

That module is responsible for reporting cluster status. If the failure is present, you will not get cluster status updates.

The UI will report osd.9 and is expected to do so. The purpose of Calamari is to help manage and monitor Ceph; if we hid nodes/OSDs in bad states, we might not know there are problems.

Hope this helps,
Gregory
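
One way to exercise the check described above, as a sketch: osd.3 is an arbitrary healthy OSD, and RHCS 1.3 (hammer) on RHEL 7 is assumed to use the sysvinit service wrapper.

service ceph stop osd.3       # run on osd.3's host; flap an unrelated, healthy OSD
ceph osd tree                 # the cluster itself reports osd.3 as down
# With the fix, Calamari's reported state for osd.3 follows this change;
# with the pre-fix ceph.py module, the never-started OSD blocks such status updates
service ceph start osd.3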
Comment 16 Ramakrishnan Periyasamy 2016-09-19 07:54:30 EDT
Steps followed to verify this bug.

1. Configured a Ceph cluster and Calamari with 9 OSDs and 1 MON.
2. The OSDs are listed in the OSD tree and in the Calamari UI with status up.
3. Added one OSD (osd.9) from a new OSD host using "ceph-deploy prepare" but did not run "ceph-deploy activate", to make sure the OSD service was not started (see the command sketch after these steps).
4. Verified the OSD tree, where the OSD was listed under the new host (CRUSH bucket) and the OSD status was not up.
5. Verified in the Calamari UI that all the existing OSDs' statuses were good and none were greyed out.
6. Activated the service for the OSD (osd.9) and added one more OSD (osd.10) to the cluster using ceph-deploy.
7. Now all 11 OSDs were listed in the Calamari UI with status up.
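
Roughly the ceph-deploy commands corresponding to steps 3 and 6 above; the host and device names are illustrative, not taken from this bug.

ceph-deploy osd prepare new-osd-host:/dev/sdb     # step 3: creates osd.9, daemon left down
ceph osd tree                                     # step 4: osd.9 under the new CRUSH bucket, not up
ceph-deploy osd activate new-osd-host:/dev/sdb1   # step 6: start osd.9
ceph-deploy osd create new-osd-host:/dev/sdc      # step 6: add osd.10 (prepare + activate in one)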
Comment 17 Gregory Meno 2016-09-19 13:36:12 EDT
The only concern I have here is that in c16 you list slightly different steps to reproduce, using ceph-deploy vs. doing the OSD setup by hand.
I would be satisfied with this alternative if you would be willing to confirm the negative side of the test, i.e., that the steps in c16 with a previous version result in a reproduction of the error. What do you think?
Comment 23 errata-xmlrpc 2016-09-29 09:00:27 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html
