Bug 1359129

Summary: Bad OSD status
Product: [Red Hat Storage] Red Hat Storage Console
Component: UI
Assignee: kamlesh <kaverma>
Reporter: Lubos Trilety <ltrilety>
QA Contact: Martin Kudlej <mkudlej>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Version: 2
Target Release: 2
CC: anbabu, dnarayan, julim, kchidamb, ltrilety, mkudlej, nthomas, rghatvis, sankarshan, vsarmila
Hardware: Unspecified
OS: Unspecified
Fixed In Version: rhscon-core-0.0.44-1.el7scon.x86_64, rhscon-ui-0.0.58-1.el7scon.noarch
Doc Type: Bug Fix
Doc Text:
Previously, there was a delay before Calamari reflected the correct values for OSD status, whether offline or running. As a result, the OSD error count on the dashboard was displayed incorrectly. With this update, the issue has been fixed and the dashboard displays the correct, real-time OSD status.
Last Closed: 2016-10-19 15:20:52 UTC
Type: Bug
Bug Blocks: 1357777
Attachments: cluster dashboard

Description Lubos Trilety 2016-07-22 11:30:10 UTC
Created attachment 1182823 [details]
cluster dashboard

Description of problem:
The OSD card on any dashboard could show an incorrect status. When some OSDs went down, this was not reflected in the card; all OSDs looked OK.

Ceph status:
osdmap e83: 8 osds: 6 up, 6 in

API status:
"slucount": {
"criticalAlerts": 0
"down": 0
"error": 2
"nearfull": 0
"total": 8
}

Version-Release number of selected component (if applicable):
rhscon-core-0.0.34-1.el7scon.x86_64
rhscon-ui-0.0.48-1.el7scon.noarch
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-ceph-0.0.33-1.el7scon.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stop some host with OSD role
2. Wait a while until all related events are present in USM
3. Go to any dashboard

Actual results:
The OSD card shows all OSDs as fine.

Expected results:
The card should show that some OSDs are down.

Additional info:

Comment 1 Nishanth Thomas 2016-07-22 11:58:41 UTC
Does Calamari show the OSDs in the Down state?

Comment 2 Nishanth Thomas 2016-07-22 11:59:26 UTC
Also check the 'ceph -s' output on the MON to see whether the OSDs show as up or down.

Comment 3 Lubos Trilety 2016-07-22 12:33:41 UTC
(In reply to Nishanth Thomas from comment #2)
> Also check ceph -s command on MOn whether the OSDs shows as up/down?

The line in the description about the Ceph status is from the 'ceph -s' output.
I presume Calamari shows the OSDs as down, because on the cluster OSDs tab they are in that state.

Comment 4 Ju Lim 2016-09-08 14:01:09 UTC
If a node that has OSDs is down, the USM back-end considers the status of such OSDs to be error. Please use the "red multiplication" sign, as in the PGs card in the slide at
https://docs.google.com/presentation/d/1E7ZHHMYufugMjuVceluP7FCfUM9CQNsN5QWmGruRth0/edit#slide=id.gc5e5a5c3c_0_12 , to indicate the error status of the OSDs.

Comment 5 Darshan 2016-09-16 11:30:25 UTC
The UI should use the "error" field to show the number of OSDs that are down. As mentioned in comment 4, USM considers the OSDs that are down to be in error.

Comment 6 Ju Lim 2016-09-16 13:39:41 UTC
Currently, it appears that USM "marks" an OSD in ERROR in the following cases (per Anmol & Nishant):

[1] The server not being able to communicate with a node can be due to either of the reasons below:
   a. Salt communication broken.
   b. Node actually down.

[2] The USM server can find out whether an OSD is actually down only when it freshly syncs the cluster details using the Calamari APIs.
   This happens once every 24 hours (for performance reasons).
   The OSD would be marked down after the sync, based on the response from the Calamari API.

[3] Between the time USM syncs the cluster details and the time USM detects the node to be down, the USM server marks the OSDs from the inaccessible node as being in the error state.

   This is because USM cannot detect (until the sync runs) whether the node, and hence the OSDs contributed by the node, is actually down, or whether just the Salt communication is broken and the node is therefore inaccessible.

My current thinking is that if Salt communication is broken and we don't really know whether the OSD is down or not, we should indicate the OSD as Unknown, so that we don't cause a false positive and potentially panic users into thinking that an OSD in error will cause data loss or data unrecoverability.

If the node is actually down, the OSD should be marked as down too.
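
A minimal Go sketch of the status decision proposed above, assuming hypothetical type and field names (none of these identifiers come from the rhscon code base); it only illustrates the distinction between Unknown and Down that this comment suggests:

// Hypothetical sketch of the proposal above: report "unknown" when Salt
// communication is broken but the node state is not yet confirmed, and
// "down" only once the sync confirms the node is down. Names are
// illustrative only.
package main

import "fmt"

type OSDStatus string

const (
	OSDUp      OSDStatus = "up"
	OSDDown    OSDStatus = "down"
	OSDUnknown OSDStatus = "unknown"
)

// proposedOSDStatus maps what USM knows about a node to an OSD status.
// saltReachable: whether Salt can talk to the node.
// nodeConfirmedDown: whether the periodic Calamari sync confirmed the node is down.
func proposedOSDStatus(saltReachable, nodeConfirmedDown bool) OSDStatus {
	switch {
	case nodeConfirmedDown:
		return OSDDown // node (and its OSDs) confirmed down by the sync
	case !saltReachable:
		return OSDUnknown // could be just a broken Salt channel, not necessarily down
	default:
		return OSDUp
	}
}

func main() {
	fmt.Println(proposedOSDStatus(false, false)) // unknown
	fmt.Println(proposedOSDStatus(false, true))  // down
}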

Comment 7 Darshan 2016-09-19 07:16:15 UTC
(In reply to Ju Lim from comment #6)
> Currently, it appears that USM "marks" an OSD in ERROR in the following (per
> Anmol & Nishant):
> 
> [1] The server not being able to communicate to a node can be due to any of
> the reasons below:
>    a. Salt communication broken.
>    b. Node actually down
> 
> [2] USM server can get to know if the OSD is actually down or not only when
> the server tries to sync the cluster details freshly using the calamari apis.
>    This happens once in 24 hours(for performance reasons).
>    The OSD would be marked down after the sync based on response from
> calamari api.
> 
> [3] Between the time USM tries to sync the cluster details and the USM
> detects the node to be down, the usm server marks the OSDs from the
> in-accessible node as being in error state.
> 
>    This is because USM cannot detect(until the sync runs) if the node and
> hence the osd's contributed by the node are actually down or its just the
> salt-communication broken and hence node is inaccessible.
> 
> My current thinking is that if Salt communications is broken and we don't
> really know if the OSD down or not, is to indicate the OSD is Unknown (so
> that we don't cause a false positive in causing potential panic to users in
> thinking that an OSD in error will cause data loss or data unrecoverability.
> 
> If the node is actually down, the OSD should be marked as down too.

We don't update the OSD status based on the status (or connectivity) of the host. For the OSD status we rely solely on the event sent by Calamari about the OSD status change, or on the daily sync which uses the Calamari API. Calamari raises an event saying an OSD is down when the host on which it resides goes down, so the issue of raising a false positive won't arise.

The issue in this bug is that in the USM backend we have the following statuses for an OSD:
OK - if the OSD is UP and IN
WARNING - if the OSD is UP and OUT
ERROR - if the OSD is DOWN
UNKNOWN - if USM does not recognize the status reported by Calamari

OSD summary API fields (the dashboard uses this API):
{
    "criticalAlerts": 0,
    "down": 0,
    "error": 2,
    "nearfull": 0,
    "total": 8
}

There is an issue in mapping the statuses in the USM backend to the summary API. Currently the "error" field in the summary API is mapped to the ERROR status of the backend (which effectively means the OSD is DOWN), but the UI is using the "down" field of the summary API to show the count of down OSDs. Having both "down" and "error" in the summary API is confusing. To avoid this confusion, we can map the USM backend statuses directly to the summary API.

NEW fields of summary API:

{
    "criticalAlerts": 0,
    "ok": <count of OSDs with OK status>,
    "warning": <count of OSDs with WARNING status>,
    "error": <count of OSDs with ERROR status>,
    "unknown": <count of OSDs with UNKNOWN status>,
    "nearfull": 0,
    "total": 8
}

The UI can then use the "error" field of the summary API to show the count of down OSDs.
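
A minimal Go sketch of the proposed mapping, assuming hypothetical type and field names (the identifiers below are illustrative and do not come from the rhscon code base); it counts OSDs per backend status and emits one summary field per status, using the 8-OSD / 2-down scenario from this report:

// Hypothetical sketch: derive the proposed summary fields from the
// per-OSD backend statuses described above.
package main

import (
	"encoding/json"
	"fmt"
)

type SLUStatus int

const (
	StatusOK      SLUStatus = iota // OSD is UP and IN
	StatusWarning                  // OSD is UP and OUT
	StatusError                    // OSD is DOWN
	StatusUnknown                  // status from Calamari not recognized
)

type SLUSummary struct {
	CriticalAlerts int `json:"criticalAlerts"`
	OK             int `json:"ok"`
	Warning        int `json:"warning"`
	Error          int `json:"error"`
	Unknown        int `json:"unknown"`
	NearFull       int `json:"nearfull"`
	Total          int `json:"total"`
}

// summarize counts OSDs per backend status, giving the summary API a
// direct one-to-one mapping with the backend statuses.
func summarize(statuses []SLUStatus) SLUSummary {
	s := SLUSummary{Total: len(statuses)}
	for _, st := range statuses {
		switch st {
		case StatusOK:
			s.OK++
		case StatusWarning:
			s.Warning++
		case StatusError:
			s.Error++ // the UI shows this as the count of down OSDs
		default:
			s.Unknown++
		}
	}
	return s
}

func main() {
	// 8 OSDs, 2 of them down, as in the report above.
	statuses := []SLUStatus{
		StatusOK, StatusOK, StatusOK, StatusOK, StatusOK, StatusOK,
		StatusError, StatusError,
	}
	out, _ := json.MarshalIndent(summarize(statuses), "", "  ")
	fmt.Println(string(out))
}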

Comment 8 Martin Kudlej 2016-09-30 10:17:43 UTC
Tested with
ceph-ansible-1.0.5-33.el7scon.noarch
ceph-installer-1.0.15-2.el7scon.noarch
rhscon-ceph-0.0.43-1.el7scon.x86_64
rhscon-core-0.0.44-1.el7scon.x86_64
rhscon-core-selinux-0.0.43-1.el7scon.noarch
rhscon-ui-0.0.58-1.el7scon.noarch
and it works.

Comment 9 Rakesh 2016-10-17 11:07:01 UTC
Hi Anmol,

I have edited the doc text for this bug. Kindly review and approve the text to be included in the async errata.

Bobb

Comment 10 anmol babu 2016-10-17 11:08:44 UTC
Looks good to me

Comment 11 errata-xmlrpc 2016-10-19 15:20:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:2082