Bug 1360467

Summary: Newly added but never started OSDs cause osdmap update issues for Calamari
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Justin Bautista <jbautist>
Component: Calamari Assignee: Boris Ranto <branto>
Calamari sub component: Back-end QA Contact: Ramakrishnan Periyasamy <rperiyas>
Status: CLOSED ERRATA Docs Contact: Bara Ancincova <bancinco>
Severity: medium    
Priority: unspecified CC: ceph-eng-bugs, flucifre, gmeno, hnallurv, kdreyer, rperiyas
Version: 1.3.2   
Target Milestone: rc   
Target Release: 1.3.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHEL: calamari-server-1.3.3-2.el7cp Ubuntu: calamari-server_1.3.3-3redhat1trusty Doc Type: Bug Fix
Doc Text:
.Calamari now correctly handles manually added OSDs that do not have "ceph-osd" running
Previously, when OSD nodes were added manually to the Calamari server but the `ceph-osd` daemon was not started on the nodes, the Calamari server returned error messages and stopped updating statuses for the rest of the OSD nodes. The underlying source code has been modified, and Calamari now handles such OSDs properly.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-09-29 13:00:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1372735    

Description Justin Bautista 2016-07-26 20:20:34 UTC
Description of problem:

Newly added but never started OSDs cause osdmap update issues for Calamari (OSDs are listed in Web UI, but are grayed out)

Version-Release number of selected component (if applicable):

calamari-server-1.3.3-1.el7cp.x86_64

How reproducible:

Always

Steps to Reproduce:
1. Add OSDs to the Ceph cluster without starting them
2. Add the OSD host to Calamari for monitoring (see the sketch below)
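
A rough sketch of one way to get into this state (not the exact commands used here; the OSD id, host name, CRUSH weight, and Calamari server address are placeholders, and the host is assumed to be registered with Calamari through the usual salt-minion configuration):

  # Allocate a new OSD id in the osdmap; no ceph-osd daemon is started,
  # so the OSD exists in the map but never boots.
  ceph osd create

  # Optionally place the new OSD (assumed to be osd.9) in the CRUSH map
  # so it appears in 'ceph osd tree'.
  ceph osd crush add osd.9 1.0 host=osd-host-new

  # On the OSD host: point the salt-minion at the Calamari server and restart it,
  # then accept the new key on the Calamari server.
  echo "master: calamari-server.example.com" > /etc/salt/minion.d/calamari.conf
  systemctl restart salt-minion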

Actual results:

OSDs are grayed out in Calamari UI

Expected results:

OSDs should be listed/monitored (not grayed out)

Comment 3 Christina Meno 2016-08-17 18:18:50 UTC
https://github.com/ceph/calamari/pull/483

Comment 4 Christina Meno 2016-08-17 18:20:00 UTC
Federico,

I'd like to ship this already-fixed issue in 1.3.3.

Comment 6 Christina Meno 2016-08-19 13:30:48 UTC
Steps to test:

1. Set up a Ceph cluster and Calamari.
2. Add an OSD as described here, BUT DO NOT start the OSD daemon (see the sketch below): http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/#adding-an-osd-manual
3. Observe that Calamari no longer updates OSD status from that node.
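
A condensed sketch of the manual add from that page, stopping right before the daemon would normally be started (OSD id 9, the device, and the paths are placeholders):

  # Allocate an OSD id (assume it comes back as 9) and prepare its data directory.
  ceph osd create
  mkdir -p /var/lib/ceph/osd/ceph-9
  mkfs -t xfs /dev/sdb1
  mount /dev/sdb1 /var/lib/ceph/osd/ceph-9

  # Initialize the OSD data directory, register its key, and add it to CRUSH.
  ceph-osd -i 9 --mkfs --mkkey
  ceph auth add osd.9 osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-9/keyring
  ceph osd crush add osd.9 1.0 host=osd-host-new

  # Deliberately stop here: do NOT run 'service ceph start osd.9',
  # so osd.9 sits in the osdmap without a running daemon.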

Comment 12 Ramakrishnan Periyasamy 2016-09-15 09:06:36 UTC
Based on the observations below, it is not clear whether the OSD should be visible in these sections or not.

1. In the Workbench section, the OSD is listed with status down (red color).
2. In the Manage > OSD section, the OSD is not visible.

I have yet to receive a reply to comment 10. Marking this bug as failed verification from QE, since comment 6 says the OSD should not be visible.

Comment 13 Christina Meno 2016-09-15 21:35:22 UTC
What I mean in c6 step 3 is that reporting on the state of OSD 10 will no longer happen before applying the fix from this BZ. OSD 9 is in a state that causes the error.

Verification should not be based on the fact that it is visible in the UI. Does this make sense?

so to confirm that you have a fix you should be able to see changes in state to OSD10

Comment 14 Ramakrishnan Periyasamy 2016-09-16 07:23:33 UTC
(In reply to Gregory Meno from comment #13)
> What I mean in c6 step 3 is that reporting on the state of OSD 10 will no
> longer happen before applying the fix from this BZ. OSD 9 is in a state that
> causes the error.

I would like to clarify a few things below; maybe based on that we can take this defect to closure.

Please note that osd.9 is under test, not osd.10.

Initially, my test setup had 11 OSDs. I removed osd.9 from it and then added it back to the cluster to verify the fix. As per the steps, I did not start osd.9 after adding it. As per the fix, osd.9 is expected not to be visible in Calamari.

Please correct me if I am wrong.

> 
> Verification should not be based on the fact that it is visible in the UI.
> Does this make sense?
> 
> so to confirm that you have a fix you should be able to see changes in state
> to OSD10

Comment 15 Christina Meno 2016-09-16 17:24:59 UTC
I can understand why you feel that way. That being said, it is actually Calamari's ceph.py module on magna108 that is under test; osd.9 is the cause of the failure. The way to verify the fix is to observe updates to the state of any other OSD.

That module is responsible for reporting cluster status. If the failure is present, you will not get cluster status updates.
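
One rough way to observe this (the Calamari REST endpoint, credentials, cluster fsid, and the use of osd.10 are assumptions and may differ on a given installation): force a status change on a healthy OSD and check whether Calamari picks it up.

  # Mark a running OSD down to create an observable state change.
  ceph osd down osd.10

  # With the failure present, the reported state stops updating;
  # with the fix, the change shows up shortly afterwards.
  curl -s -u admin:admin "http://calamari-server.example.com/api/v2/cluster/<fsid>/osd/10"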

The UI will report osd.9 and is expected to do so. The purpose of Calamari is to help manage and monitor Ceph; if we hid nodes/OSDs in bad states, we might not know there are problems.

Hope this helps,
Gregory

Comment 16 Ramakrishnan Periyasamy 2016-09-19 11:54:30 UTC
Steps followed to verify this bug.

1. Configured a Ceph cluster and Calamari with 9 OSDs and 1 MON.
2. The OSDs were listed in the OSD tree and in the Calamari UI with status up.
3. Added one OSD (osd.9) on a new OSD host using "ceph-deploy prepare" but did not run "ceph-deploy activate", to make sure the OSD service was not started.
4. Verified the OSD tree: the OSD was listed under the new host (CRUSH bucket) and its status was not up.
5. Verified in the Calamari UI that the status of all existing OSDs was good and none were grayed out.
6. Activated the service for osd.9 and added one more OSD (osd.10) to the cluster using ceph-deploy (commands sketched below).
7. All 11 OSDs were then listed in the Calamari UI with status up.
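
For reference, a rough version of the ceph-deploy commands behind steps 3 and 6 (the host name and devices are placeholders):

  # Step 3: prepare the OSD on the new host without activating it,
  # so no ceph-osd daemon is started.
  ceph-deploy osd prepare osd-host-new:/dev/sdb

  # Step 6: activate the prepared OSD, then add one more OSD
  # with a combined prepare+activate.
  ceph-deploy osd activate osd-host-new:/dev/sdb1
  ceph-deploy osd create osd-host-new:/dev/sdc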

Comment 17 Christina Meno 2016-09-19 17:36:12 UTC
The only concern I have here is that in c16 you list slightly different steps to reproduce, using ceph-deploy vs. doing the OSD setup by hand.
I would be satisfied with this alternative if you are willing to confirm the negative side of the test, i.e., that the steps in c16 run against a previous version reproduce the error. What do you think?

Comment 23 errata-xmlrpc 2016-09-29 13:00:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html