Created attachment 1182823 [details]
cluster dashboard

Description of problem:
The OSD card on any dashboard can show the wrong status. When some OSDs go down, this is not reflected in the card; all OSDs still look OK.

Ceph status:
osdmap e83: 8 osds: 6 up, 6 in

API status:
"slucount": {
    "criticalAlerts": 0,
    "down": 0,
    "error": 2,
    "nearfull": 0,
    "total": 8
}

Version-Release number of selected component (if applicable):
rhscon-core-0.0.34-1.el7scon.x86_64
rhscon-ui-0.0.48-1.el7scon.noarch
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-ceph-0.0.33-1.el7scon.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stop a host with the OSD role.
2. Wait a while until all related events are present in USM.
3. Go to any dashboard.

Actual results:
It looks like all OSDs are fine.

Expected results:
The dashboard should show that some OSDs are down.

Additional info:
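For illustration only (not product code): a small standalone Go snippet that decodes the summary output pasted above and compares it with the 'ceph -s' osdmap line. The SluCount struct is hypothetical and simply mirrors the fields shown in the description.

package main

import (
	"encoding/json"
	"fmt"
)

// SluCount mirrors the "slucount" fields pasted in the description
// (hypothetical struct, used here only for illustration).
type SluCount struct {
	CriticalAlerts int `json:"criticalAlerts"`
	Down           int `json:"down"`
	Error          int `json:"error"`
	NearFull       int `json:"nearfull"`
	Total          int `json:"total"`
}

func main() {
	// API status from the description, written as valid JSON.
	data := []byte(`{"criticalAlerts":0,"down":0,"error":2,"nearfull":0,"total":8}`)

	var c SluCount
	if err := json.Unmarshal(data, &c); err != nil {
		panic(err)
	}

	// 'ceph -s' reported "osdmap e83: 8 osds: 6 up, 6 in".
	downPerCeph := 8 - 6

	fmt.Printf("ceph -s: %d OSDs down; API: down=%d, error=%d\n",
		downPerCeph, c.Down, c.Error)
	// Output: ceph -s: 2 OSDs down; API: down=0, error=2
	// The two down OSDs are reported under "error", while "down" stays 0.
}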
Does calamari show the OSDs in the Down state?
Also check the 'ceph -s' command on the MON to see whether the OSDs show as up or down.
(In reply to Nishanth Thomas from comment #2)
> Also check the 'ceph -s' command on the MON to see whether the OSDs show as up or down.

The Ceph status line in the description is from the 'ceph -s' output. I presume calamari shows the OSDs as down, because they are in that state on the cluster's OSDs tab.
If a node that has OSDs is down, the USM back-end considers the status of those OSDs to be error. Please use the "red multiplication" sign, as in the PGs card in the slide at https://docs.google.com/presentation/d/1E7ZHHMYufugMjuVceluP7FCfUM9CQNsN5QWmGruRth0/edit#slide=id.gc5e5a5c3c_0_12, to indicate the error status of an OSD.
The UI should use the "error" field to show the number of OSDs that are down. As mentioned in comment 4, USM considers OSDs that are down to be in error.
Currently, it appears that USM marks an OSD as ERROR in the following situation (per Anmol & Nishanth):

[1] The server not being able to communicate with a node can be due to either of the reasons below:
    a. Salt communication is broken.
    b. The node is actually down.

[2] The USM server can find out whether the OSD is actually down only when it freshly syncs the cluster details using the calamari APIs. This happens once every 24 hours (for performance reasons). The OSD would be marked down after the sync, based on the response from the calamari API.

[3] Between the time USM detects the node to be down and the time it next syncs the cluster details, the USM server marks the OSDs from the inaccessible node as being in the error state. This is because USM cannot detect (until the sync runs) whether the node, and hence the OSDs it contributes, are actually down, or whether only the salt communication is broken and the node is therefore inaccessible.

My current thinking is that if salt communication is broken and we don't really know whether the OSD is down or not, we should show the OSD as Unknown, so that we don't create a false positive and potentially panic users into thinking that an OSD in error will cause data loss or unrecoverable data.

If the node is actually down, the OSD should be marked as down too.
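A rough Go sketch of the behaviour proposed above (the type and constant names are hypothetical and not taken from the actual skyring code base):

package main

import "fmt"

// NodeState captures what USM knows about a node between syncs.
type NodeState int

const (
	NodeReachable       NodeState = iota // node reachable via salt
	NodeSaltUnreachable                  // salt communication broken, real state not yet known
	NodeConfirmedDown                    // confirmed down via a calamari event or the daily sync
)

// Proposed behaviour: report "unknown" rather than "error" when only the
// salt connectivity is lost, and report "down" only once calamari confirms it.
func osdStatusForNode(state NodeState) string {
	switch state {
	case NodeConfirmedDown:
		return "down"
	case NodeSaltUnreachable:
		return "unknown" // avoid a false positive until the sync runs
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(osdStatusForNode(NodeSaltUnreachable)) // unknown
	fmt.Println(osdStatusForNode(NodeConfirmedDown))   // down
}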
(In reply to Ju Lim from comment #6)
> Currently, it appears that USM marks an OSD as ERROR in the following
> situation (per Anmol & Nishanth):
>
> [1] The server not being able to communicate with a node can be due to
> either of the reasons below:
>     a. Salt communication is broken.
>     b. The node is actually down.
>
> [2] The USM server can find out whether the OSD is actually down only when
> it freshly syncs the cluster details using the calamari APIs. This happens
> once every 24 hours (for performance reasons). The OSD would be marked down
> after the sync, based on the response from the calamari API.
>
> [3] Between the time USM detects the node to be down and the time it next
> syncs the cluster details, the USM server marks the OSDs from the
> inaccessible node as being in the error state. This is because USM cannot
> detect (until the sync runs) whether the node, and hence the OSDs it
> contributes, are actually down, or whether only the salt communication is
> broken and the node is therefore inaccessible.
>
> My current thinking is that if salt communication is broken and we don't
> really know whether the OSD is down or not, we should show the OSD as
> Unknown, so that we don't create a false positive and potentially panic
> users into thinking that an OSD in error will cause data loss or
> unrecoverable data.
>
> If the node is actually down, the OSD should be marked as down too.

We don't update the OSD status based on the status (or connectivity) of the host. For the OSD status we depend solely on the event sent by calamari about the OSD status change, or on the daily sync, which uses the calamari API. Calamari raises an event saying an OSD is down when the host it resides on goes down, so the issue of raising a false positive won't arise.

The issue in this bug is that in the USM backend we have the following statuses for an OSD:

OK      - OSD is UP and IN
WARNING - OSD is UP and OUT
ERROR   - OSD is DOWN
UNKNOWN - USM does not recognize the status reported by calamari

OSD summary API fields (the dashboard uses this API):
{
    "criticalAlerts": 0,
    "down": 0,
    "error": 2,
    "nearfull": 0,
    "total": 8
}

There is an issue in mapping the status in the USM backend to the summary API. Currently the "error" field of the summary API is mapped to the ERROR status of the backend (which effectively means the OSD is DOWN), but the UI uses the "down" field of the summary API to show the count of down OSDs. Having both "down" and "error" in the summary API is confusing. To avoid this confusion we can map the USM backend statuses directly to the summary API.

New fields of the summary API:
{
    "criticalAlerts": 0,
    "ok": count of OSDs with OK status,
    "warning": count of OSDs with WARNING status,
    "error": count of OSDs with ERROR status,
    "unknown": count of OSDs with UNKNOWN status,
    "nearfull": 0,
    "total": 8
}

The UI can then use the "error" field of the summary API to show the count of down OSDs.
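For illustration, a minimal Go sketch of how the backend statuses could map one-to-one onto the proposed summary fields (the type and field names are hypothetical, not the actual skyring code; criticalAlerts is left out for brevity):

package main

import "fmt"

// OSD is a hypothetical representation of what USM knows about one OSD.
type OSD struct {
	Up       bool
	In       bool
	Known    bool // false if calamari reported a status USM does not recognize
	NearFull bool
}

// OSDSummary maps one-to-one onto the backend statuses, as proposed above.
type OSDSummary struct {
	OK       int `json:"ok"`
	Warning  int `json:"warning"`
	Error    int `json:"error"`
	Unknown  int `json:"unknown"`
	NearFull int `json:"nearfull"`
	Total    int `json:"total"`
}

func summarize(osds []OSD) OSDSummary {
	var s OSDSummary
	for _, osd := range osds {
		s.Total++
		if osd.NearFull {
			s.NearFull++
		}
		switch {
		case !osd.Known:
			s.Unknown++ // UNKNOWN: status from calamari not recognized
		case osd.Up && osd.In:
			s.OK++ // OK: up and in
		case osd.Up && !osd.In:
			s.Warning++ // WARNING: up and out
		default:
			s.Error++ // ERROR: down
		}
	}
	return s
}

func main() {
	// 8 OSDs, 2 of them down, matching the cluster from the description.
	osds := make([]OSD, 8)
	for i := range osds {
		osds[i] = OSD{Up: true, In: true, Known: true}
	}
	osds[6] = OSD{Known: true} // down
	osds[7] = OSD{Known: true} // down

	fmt.Printf("%+v\n", summarize(osds))
	// {OK:6 Warning:0 Error:2 Unknown:0 NearFull:0 Total:8}
}

With a direct mapping like this, the card only has to read the "error" (and, if desired, "unknown") counts, so a down OSD can no longer be hidden behind a field the UI never looks at.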
Tested with
ceph-ansible-1.0.5-33.el7scon.noarch
ceph-installer-1.0.15-2.el7scon.noarch
rhscon-ceph-0.0.43-1.el7scon.x86_64
rhscon-core-0.0.44-1.el7scon.x86_64
rhscon-core-selinux-0.0.43-1.el7scon.noarch
rhscon-ui-0.0.58-1.el7scon.noarch
and it works.
Hi Anmol,

I have edited the doc text for this bug. Kindly review and approve the text to be included in the async errata.

Bobb
Looks good to me
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:2082