Bug 2022693

Summary: [RFE] ODF health should reflect the health of Ceph + NooBaa
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Bipin Kunal <bkunal>
Component: management-console
Assignee: Bipul Adhikari <badhikar>
Status: CLOSED ERRATA
QA Contact: Mugdha Soni <musoni>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.9
CC: afrahman, badhikar, etamir, jefbrown, madam, mmuench, muagarwa, nthomas, ocs-bugs, odf-bz-bot, rcyriac, rperiyas, shilpsha, ygalanti
Target Milestone: ---
Keywords: AutomationBackLog, FutureFeature
Target Release: ODF 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.10.0-156
Doc Type: Enhancement
Doc Text:
.View the Block and File or Object Service subcomponents on the ODF Dashboard
With this update, the ODF Dashboard shows information about the ODF subcomponents, Block and File or Object Service, whenever either of them is down.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-04-13 18:50:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2056571    
Attachments: screencast (flags: none)

Description Bipin Kunal 2021-11-12 11:02:00 UTC
Created attachment 1841421 [details]
screencast

Description of problem (please be as detailed as possible and provide log
snippets):

Right now the ODF cluster health is just a reflection of the Ceph health and doesn't reflect NooBaa health. Ideally it should reflect a status that takes both Ceph and NooBaa into account.
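For illustration only, here is a minimal TypeScript sketch of the kind of worst-of aggregation being asked for; the names (HealthState, SubsystemHealth, overallHealth) are hypothetical and not the actual odf-console code:

    // Hypothetical sketch: overall ODF health is the worst state reported by
    // any subsystem, so an unhealthy NooBaa is no longer masked by a healthy Ceph.
    type HealthState = 'OK' | 'Warning' | 'Error';

    const severity: Record<HealthState, number> = { OK: 0, Warning: 1, Error: 2 };

    type SubsystemHealth = {
      name: 'Block and File' | 'Object'; // Ceph vs. NooBaa/MCG
      state: HealthState;
    };

    const overallHealth = (subsystems: SubsystemHealth[]): HealthState =>
      subsystems.reduce<HealthState>(
        (worst, s) => (severity[s.state] > severity[worst] ? s.state : worst),
        'OK',
      );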

Version of all relevant components (if applicable):
OCP-4.9.0
OCS-quay.io/rhceph-dev/ocs-registry:4.9.0-233.ci

However, this applies to all versions.

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

ODF health shows Green even though NooBaa/Object is unhealthy.

screencast attached

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
No

Comment 5 Bipul Adhikari 2021-12-09 06:39:12 UTC
The fix for this will be part of ODF 4.10.0 (stretch goal). This bug needs major changes to how the ODF dashboard works: just changing the query is not enough, we also need to make changes on the UXD side. Multiple PRs will be sent to fix this issue.

Comment 6 Bipul Adhikari 2022-01-05 08:05:23 UTC
UX changes that are planned:

Show muted text under the status.
The muted text will say which subsystem (NooBaa/Ceph) is down.
When everything is okay, the muted text will not be shown.

We will aggregate the health of both subsystems via extension points in the UI (no changes to the standard metrics); a rough sketch follows below.

The UI requires extensive changes; we are trying to achieve this by the FF date.
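As a rough illustration, a minimal TypeScript sketch of how that muted text could be derived; the names (SubsystemStatus, degradedSubsystemsText) are hypothetical and not the actual extension-point code:

    // Hypothetical sketch: name only the degraded subsystems under the status,
    // and show nothing when everything is healthy.
    type SubsystemStatus = { name: 'Block and File' | 'Object'; healthy: boolean };

    const degradedSubsystemsText = (subsystems: SubsystemStatus[]): string | undefined => {
      const degraded = subsystems.filter((s) => !s.healthy).map((s) => s.name);
      return degraded.length ? `Degraded: ${degraded.join(', ')}` : undefined;
    };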

Comment 7 Yuval 2022-01-12 09:10:17 UTC
A couple of comments:
1. I agree that it's a bit weird when the list shows the storage system in an error state because of an issue on the MCG side, yet when you drill down into the system you see the Block and File overview and everything looks fine there.
2. I agree with Bipul's suggestion to add descriptive text explaining what is wrong. Maybe in 4.11 we can make the status clickable, add clearer text about the subsystem, and point the user to the right overview.

Comment 8 Bipin Kunal 2022-02-01 07:41:07 UTC
Bipul,
  Any update on the progress?

-Bipin

Comment 9 Bipul Adhikari 2022-02-15 15:08:36 UTC
The fix is now available.

Comment 15 Mugdha Soni 2022-03-22 09:05:49 UTC
Tested with the following builds:

OCP : 4.10.0-0.nightly-2022-03-19-230512
ODF : 4.10.0-198

The following steps were taken:

(a) Successfully deployed the ODF cluster and brought down one worker node.

    NAME              STATUS     ROLES    AGE   VERSION
    compute-0         NotReady   worker   24h   v1.23.3+e419edf
    compute-1         Ready      worker   24h   v1.23.3+e419edf
    compute-2         Ready      worker   24h   v1.23.3+e419edf
    control-plane-0   Ready      master   25h   v1.23.3+e419edf
    control-plane-1   Ready      master   25h   v1.23.3+e419edf
    control-plane-2   Ready      master   25h   v1.23.3+e419edf

The alerts were present on the Data Foundation details page, and the screenshots are available in comment #13 and comment #14.

Do we need to validate the fix with any other scenarios, or can we move it to verified based on this test scenario?

Thanks and Regards
Mugdha

Comment 17 Bipul Adhikari 2022-03-22 11:24:51 UTC
Can you test for MCG as well? You could bring NooBaa into an error state by creating a backing store and then breaking it (for example, by removing its target bucket).

Comment 18 Mugdha Soni 2022-03-29 13:27:06 UTC
Tested the steps mentioned in comment #17 with the following builds:

(a) OCP : 4.10.0-0.nightly-2022-03-27-074444
(b) ODF : 4.10.0-210

The following steps were performed:

(a) Deleted the target bucket of the default backing store.


**Observations**

(a) Alert generated: "A NooBaa bucket first.bucket is in error state for more than 5m"
    Alert name: NooBaaBucketErrorState


Screenshots are available at "https://docs.google.com/document/d/1fHUupVhplWKjNr1BUuErcRg22wMYnwrTeihM0jzrUm4/edit?usp=sharing".

Since the alerts are also triggering for MCG, I believe the bug is good to be verified.


Thanks and Regards

Mugdha Soni

Comment 19 Mugdha Soni 2022-03-30 05:53:46 UTC
Based on comment #15 and comment #18, moving the bug to the verified state.


Thanks and Regards
Mugdha Soni

Comment 20 Mudit Agarwal 2022-03-31 15:00:31 UTC
Please add doc text.

Comment 24 errata-xmlrpc 2022-04-13 18:50:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372