Description of problem:
Newly added but never started OSDs cause osdmap update issues for Calamari (OSDs are listed in Web UI, but are grayed out)
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Add OSDs to ceph cluster without starting them
2. Add the OSD hosts to Calamari for monitoring
OSDs are grayed out in Calamari UI
OSDs should be listed/monitored (not grayed out)
I'd like to ship this already-fixed issue in 1.3.3.
Steps to test:
1. Set up a Ceph cluster and Calamari.
2. Add an OSD as described here, BUT DO NOT start the OSD daemon: http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-osds/#adding-an-osd-manual
3. Observe that Calamari will no longer update OSD status from that node.
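To confirm that the cluster really is in the triggering state before looking at Calamari, something like the following sketch could be used. It assumes that `ceph osd dump --format json` returns an `osds` list whose entries carry `osd`, `up` and `state` fields, and that a never-started OSD shows the `new` flag in `state`. The function names are made up for illustration.

```python
import json
import subprocess


def never_started_osds(osd_dump):
    """Return ids of OSDs that exist in the osdmap but have never booted."""
    ids = []
    for osd in osd_dump.get("osds", []):
        state = osd.get("state", [])
        # Assumption: a never-started OSD carries the "new" flag and is not up.
        if "new" in state and not osd.get("up"):
            ids.append(osd["osd"])
    return sorted(ids)


def load_osd_dump():
    """Fetch the osdmap as JSON; needs the ceph CLI and an admin keyring."""
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)
```

On a node with an admin keyring, `never_started_osds(load_osd_dump())` should then list the OSD that was added in step 2.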
Based on the observations below, it is not clear whether the OSD should be visible in some sections or not.
1. In the Workbench section, the OSD is listed with status down (red).
2. In the Manage > OSD section, the OSD is not visible.
Still waiting for a reply to comment 10. Marking this bug as failed verification from QE, since comment 6 says the OSD should not be visible.
What I mean in c6 step 3 is that reporting on the state of OSD 10 will no longer happen before applying the fix from this BZ. OSD 9 is in a state that causes the error.
Verification should not be based on the fact that it is visible in the UI. Does this make sense?
so to confirm that you have a fix you should be able to see changes in state to OSD10
(In reply to Gregory Meno from comment #13)
> What I mean in c6 step 3 is that reporting on the state of OSD 10 will no
> longer happen before applying the fix from this BZ. OSD 9 is in a state that
> causes the error.
I would like to clarify a few things below. Maybe based on that we can bring this defect to closure.
Please note that osd.9 is under test, not osd.10.
Initially my test setup had 11 OSDs. I removed osd.9 from it, then added it back to the cluster to verify the fix. As per the steps, I did not start osd.9 after adding it. As per the fix, osd.9 is expected not to be visible in Calamari.
Please correct me if I am wrong.
> Verification should not be based on the fact that it is visible in the UI.
> Does this make sense?
> so to confirm that you have a fix you should be able to see changes in state
> to OSD10
I can understand why you feel that way. That being said, it is actually Calamari's ceph.py module on magna108 that is under test. osd.9 is the cause of the failure. The way to verify the fix is to observe updates to the state of any other OSD.
That module is responsible for reporting cluster status. If the failure is present, you will not get cluster status updates.
The UI will report osd.9 and is expected to do so. The purpose of Calamari is to help manage and monitor Ceph. If we hide nodes/OSDs in bad states, we might not know there are problems.
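A rough sketch of that check: take two snapshots of the OSD list as Calamari reports it, change the state of a healthy OSD in between (for example, stop osd.10), and see whether the change shows up. The record shape used here (`id`, `up`, `in`) is an assumption about what the OSD listing looks like, and the helper names are hypothetical.

```python
def osd_states(osds):
    """Map OSD id -> (up, in) from a list of OSD records."""
    return {o["id"]: (bool(o["up"]), bool(o["in"])) for o in osds}


def changed_osds(before, after):
    """Return ids whose (up, in) state differs between two snapshots."""
    a, b = osd_states(before), osd_states(after)
    # Only compare OSDs present in both snapshots.
    return sorted(i for i in a.keys() & b.keys() if a[i] != b[i])
```

If the second snapshot is taken after stopping osd.10 and `changed_osds` returns an empty list, status updates are probably still stuck. If it returns `[10]`, updates are flowing again even though osd.9 was never started.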
Hope this helps,
Steps followed to verify this bug.
1. Configured a Ceph cluster and Calamari with 9 OSDs and 1 mon.
2. OSDs are listed in the OSD tree and in the Calamari UI with status up.
3. Added one OSD (osd.9) from a new OSD host using "ceph-deploy prepare" but did not run "ceph-deploy activate", to make sure the OSD service was not started.
4. Verified in the OSD tree that the OSD was listed under the new host (crush bucket) and that the OSD status was not up.
5. Verified in the Calamari UI that the status of all existing OSDs was good and none were greyed out.
6. Activated the service for the OSD (osd.9) and added one more OSD (osd.10) to the cluster using ceph-deploy.
7. Now all 11 OSDs were listed in the Calamari UI with status up.
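The OSD tree check in step 4 could be scripted roughly as below. This is only a sketch: it assumes the JSON from `ceph osd tree --format json` has a `nodes` list in which host entries have `type`, `name` and `children`, and OSD entries have `type`, `name` and `status`. The function name is made up.

```python
def osd_status_by_host(tree):
    """Map host name -> {osd name: status} from parsed osd tree JSON."""
    nodes = {n["id"]: n for n in tree.get("nodes", [])}
    result = {}
    for node in nodes.values():
        if node.get("type") != "host":
            continue
        statuses = {}
        for child_id in node.get("children", []):
            child = nodes.get(child_id)
            # Skip children that are missing or are not OSDs.
            if child and child.get("type") == "osd":
                statuses[child["name"]] = child.get("status", "unknown")
        result[node["name"]] = statuses
    return result
```

After step 3, the new host should map osd.9 to a status other than "up". After step 6, every entry should read "up".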
The only concern I have here is that in c16 you list slightly different steps to reproduce, using ceph-deploy vs. doing the OSD setup by hand.
I would be satisfied with this alternative if you would be willing to confirm the negative side of the test, i.e. that the steps in c16 with a previous version reproduce the error. What do you think?
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.