Created attachment 1182823 [details]
cluster dashboard

Description of problem:
The OSD card on any dashboard can show the wrong status. When some OSDs go down, this is not reflected in the card; all OSDs still look OK.

Ceph status:
osdmap e83: 8 osds: 6 up, 6 in

API status:
"slucount": {
    "criticalAlerts": 0,
    "down": 0,
    "error": 2,
    "nearfull": 0,
    "total": 8
}

Version-Release number of selected component (if applicable):
rhscon-core-0.0.34-1.el7scon.x86_64
rhscon-ui-0.0.48-1.el7scon.noarch
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-ceph-0.0.33-1.el7scon.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Stop a host with the OSD role.
2. Wait a while until all related events are present in USM.
3. Go to any dashboard.

Actual results:
It looks like all OSDs are fine.

Expected results:
The dashboard should show that some OSDs are down.

Additional info:
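For illustration only (not product code): a small standalone Go snippet that decodes the summary output pasted above and compares it with the 'ceph -s' osdmap line. The SluCount struct is hypothetical and simply mirrors the fields shown in the description.

package main

import (
	"encoding/json"
	"fmt"
)

// SluCount mirrors the "slucount" fields pasted in the description
// (hypothetical struct, used here only for illustration).
type SluCount struct {
	CriticalAlerts int `json:"criticalAlerts"`
	Down           int `json:"down"`
	Error          int `json:"error"`
	NearFull       int `json:"nearfull"`
	Total          int `json:"total"`
}

func main() {
	// API status from the description, written as valid JSON.
	data := []byte(`{"criticalAlerts":0,"down":0,"error":2,"nearfull":0,"total":8}`)

	var c SluCount
	if err := json.Unmarshal(data, &c); err != nil {
		panic(err)
	}

	// 'ceph -s' reported "osdmap e83: 8 osds: 6 up, 6 in".
	downPerCeph := 8 - 6

	fmt.Printf("ceph -s: %d OSDs down; API: down=%d, error=%d\n",
		downPerCeph, c.Down, c.Error)
	// Output: ceph -s: 2 OSDs down; API: down=0, error=2
	// The two down OSDs are reported under "error", while "down" stays 0.
}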
Does calamari show the OSDs in the Down state?
Also check the 'ceph -s' command on the MON to see whether the OSDs show as up or down.
(In reply to Nishanth Thomas from comment #2)
> Also check the 'ceph -s' command on the MON to see whether the OSDs show as up or down.

The Ceph status line in the description is from the 'ceph -s' output. I presume calamari shows the OSDs as down, because they are in that state on the cluster's OSDs tab.
If a node that has OSDs is down, the USM back-end considers the status of those OSDs to be error. Please use the "red multiplication" sign, as in the PGs card in the slide at https://docs.google.com/presentation/d/1E7ZHHMYufugMjuVceluP7FCfUM9CQNsN5QWmGruRth0/edit#slide=id.gc5e5a5c3c_0_12, to indicate the error status of an OSD.
The UI should use the "error" field to show the number of OSDs that are down. As mentioned in comment 4, USM considers OSDs that are down to be in error.
Currently, it appears that USM marks an OSD as ERROR in the following situation (per Anmol & Nishanth):

[1] The server not being able to communicate with a node can be due to either of the reasons below:
    a. Salt communication is broken.
    b. The node is actually down.

[2] The USM server can find out whether the OSD is actually down only when it freshly syncs the cluster details using the calamari APIs. This happens once every 24 hours (for performance reasons). The OSD would be marked down after the sync, based on the response from the calamari API.

[3] Between the time USM detects the node to be down and the time it next syncs the cluster details, the USM server marks the OSDs from the inaccessible node as being in the error state. This is because USM cannot detect (until the sync runs) whether the node, and hence the OSDs it contributes, are actually down, or whether only the salt communication is broken and the node is therefore inaccessible.

My current thinking is that if salt communication is broken and we don't really know whether the OSD is down or not, we should show the OSD as Unknown, so that we don't create a false positive and potentially panic users into thinking that an OSD in error will cause data loss or unrecoverable data.

If the node is actually down, the OSD should be marked as down too.
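A rough Go sketch of the behaviour proposed above (the type and constant names are hypothetical and not taken from the actual skyring code base):

package main

import "fmt"

// NodeState captures what USM knows about a node between syncs.
type NodeState int

const (
	NodeReachable       NodeState = iota // node reachable via salt
	NodeSaltUnreachable                  // salt communication broken, real state not yet known
	NodeConfirmedDown                    // confirmed down via a calamari event or the daily sync
)

// Proposed behaviour: report "unknown" rather than "error" when only the
// salt connectivity is lost, and report "down" only once calamari confirms it.
func osdStatusForNode(state NodeState) string {
	switch state {
	case NodeConfirmedDown:
		return "down"
	case NodeSaltUnreachable:
		return "unknown" // avoid a false positive until the sync runs
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(osdStatusForNode(NodeSaltUnreachable)) // unknown
	fmt.Println(osdStatusForNode(NodeConfirmedDown))   // down
}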
(In reply to Ju Lim from comment #6)
> Currently, it appears that USM marks an OSD as ERROR in the following
> situation (per Anmol & Nishanth):
>
> [1] The server not being able to communicate with a node can be due to
> either of the reasons below:
>     a. Salt communication is broken.
>     b. The node is actually down.
>
> [2] The USM server can find out whether the OSD is actually down only when
> it freshly syncs the cluster details using the calamari APIs. This happens
> once every 24 hours (for performance reasons). The OSD would be marked down
> after the sync, based on the response from the calamari API.
>
> [3] Between the time USM detects the node to be down and the time it next
> syncs the cluster details, the USM server marks the OSDs from the
> inaccessible node as being in the error state. This is because USM cannot
> detect (until the sync runs) whether the node, and hence the OSDs it
> contributes, are actually down, or whether only the salt communication is
> broken and the node is therefore inaccessible.
>
> My current thinking is that if salt communication is broken and we don't
> really know whether the OSD is down or not, we should show the OSD as
> Unknown, so that we don't create a false positive and potentially panic
> users into thinking that an OSD in error will cause data loss or
> unrecoverable data.
>
> If the node is actually down, the OSD should be marked as down too.

We don't update the OSD status based on the status (or connectivity) of the host. For the OSD status we depend solely on the event sent by calamari about the OSD status change, or on the daily sync, which uses the calamari API. Calamari raises an event saying an OSD is down when the host it resides on goes down, so the issue of raising a false positive won't arise.

The issue in this bug is that in the USM backend we have the following statuses for an OSD:

OK      - OSD is UP and IN
WARNING - OSD is UP and OUT
ERROR   - OSD is DOWN
UNKNOWN - USM does not recognize the status reported by calamari

OSD summary API fields (the dashboard uses this API):
{
    "criticalAlerts": 0,
    "down": 0,
    "error": 2,
    "nearfull": 0,
    "total": 8
}

There is an issue in mapping the status in the USM backend to the summary API. Currently the "error" field of the summary API is mapped to the ERROR status of the backend (which effectively means the OSD is DOWN), but the UI uses the "down" field of the summary API to show the count of down OSDs. Having both "down" and "error" in the summary API is confusing. To avoid this confusion we can map the USM backend statuses directly to the summary API.

New fields of the summary API:
{
    "criticalAlerts": 0,
    "ok": count of OSDs with OK status,
    "warning": count of OSDs with WARNING status,
    "error": count of OSDs with ERROR status,
    "unknown": count of OSDs with UNKNOWN status,
    "nearfull": 0,
    "total": 8
}

The UI can then use the "error" field of the summary API to show the count of down OSDs.
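For illustration, a minimal Go sketch of how the backend statuses could map one-to-one onto the proposed summary fields (the type and field names are hypothetical, not the actual skyring code; criticalAlerts is left out for brevity):

package main

import "fmt"

// OSD is a hypothetical representation of what USM knows about one OSD.
type OSD struct {
	Up       bool
	In       bool
	Known    bool // false if calamari reported a status USM does not recognize
	NearFull bool
}

// OSDSummary maps one-to-one onto the backend statuses, as proposed above.
type OSDSummary struct {
	OK       int `json:"ok"`
	Warning  int `json:"warning"`
	Error    int `json:"error"`
	Unknown  int `json:"unknown"`
	NearFull int `json:"nearfull"`
	Total    int `json:"total"`
}

func summarize(osds []OSD) OSDSummary {
	var s OSDSummary
	for _, osd := range osds {
		s.Total++
		if osd.NearFull {
			s.NearFull++
		}
		switch {
		case !osd.Known:
			s.Unknown++ // UNKNOWN: status from calamari not recognized
		case osd.Up && osd.In:
			s.OK++ // OK: up and in
		case osd.Up && !osd.In:
			s.Warning++ // WARNING: up and out
		default:
			s.Error++ // ERROR: down
		}
	}
	return s
}

func main() {
	// 8 OSDs, 2 of them down, matching the cluster from the description.
	osds := make([]OSD, 8)
	for i := range osds {
		osds[i] = OSD{Up: true, In: true, Known: true}
	}
	osds[6] = OSD{Known: true} // down
	osds[7] = OSD{Known: true} // down

	fmt.Printf("%+v\n", summarize(osds))
	// {OK:6 Warning:0 Error:2 Unknown:0 NearFull:0 Total:8}
}

With a direct mapping like this, the card only has to read the "error" (and, if desired, "unknown") counts, so a down OSD can no longer be hidden behind a field the UI never looks at.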
Tested with
ceph-ansible-1.0.5-33.el7scon.noarch
ceph-installer-1.0.15-2.el7scon.noarch
rhscon-ceph-0.0.43-1.el7scon.x86_64
rhscon-core-0.0.44-1.el7scon.x86_64
rhscon-core-selinux-0.0.43-1.el7scon.noarch
rhscon-ui-0.0.58-1.el7scon.noarch
and it works.
Hi Anmol,

I have edited the doc text for this bug. Kindly review and approve the text to be included in the async errata.

Bobb
Looks good to me
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:2082