Created attachment 1841421 [details]
screencast

Description of problem (please be as detailed as possible and provide log snippets):
Currently the ODF cluster health is just a reflection of Ceph health and does not reflect NooBaa health. Ideally it should report a status that takes both Ceph and NooBaa into account.

Version of all relevant components (if applicable):
OCP: 4.9.0
OCS: quay.io/rhceph-dev/ocs-registry:4.9.0-233.ci
This is applicable to all versions.

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
ODF health reports Green even though NooBaa/Object is unhealthy; see the attached screencast.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI? Yes

If this is a regression, please provide more details to justify this:
No
The fix for this will be part of ODF 4.10.0 (stretch goal). This bug needs major changes to how the ODF dashboard works: changing the query alone is not enough, we also need changes on the UXD side. Multiple PRs will be sent to fix this issue.
Planned UX changes: show muted text under the status saying which subsystem (NooBaa/Ceph) is down. When everything is OK, this muted text will not be shown. The health of both subsystems will be aggregated via extension points in the UI (no changes to the standard metrics). The UI requires extensive changes; we are trying to land this by the feature-freeze date.
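To illustrate the planned aggregation, here is a minimal sketch of a worst-of health roll-up with muted text naming the degraded subsystem. The type names, health-state values, and message wording below are assumptions for illustration only, not the actual ODF console extension-point code:

```ts
// Hypothetical sketch: aggregate Ceph and NooBaa health into one dashboard status.
type HealthState = 'OK' | 'Warning' | 'Error';

interface SubsystemHealth {
  name: 'Ceph' | 'NooBaa';
  state: HealthState;
}

const severity: Record<HealthState, number> = { OK: 0, Warning: 1, Error: 2 };

// Overall status = worst subsystem state; muted text lists unhealthy subsystems.
function aggregateHealth(subsystems: SubsystemHealth[]): {
  state: HealthState;
  mutedText?: string;
} {
  const worst = subsystems.reduce((acc, s) =>
    severity[s.state] > severity[acc.state] ? s : acc
  );
  const unhealthy = subsystems
    .filter((s) => s.state !== 'OK')
    .map((s) => s.name);
  return {
    state: worst.state,
    // Only shown when something is degraded, matching the planned UX.
    mutedText: unhealthy.length ? `${unhealthy.join(', ')} degraded` : undefined,
  };
}

// Example: Ceph healthy, NooBaa in error -> overall Error, muted text "NooBaa degraded".
console.log(
  aggregateHealth([
    { name: 'Ceph', state: 'OK' },
    { name: 'NooBaa', state: 'Error' },
  ])
);
```

With a worst-of rule like this, a NooBaa error surfaces at the top-level status even when Ceph is healthy, which is the behavior this bug asks for.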
Some comments:
1. I agree that it is a bit weird when the list shows the storage system in an error state because of an issue on the MCG side, but when you drill down into the system you land on the block and file overview and everything looks fine there.
2. I agree with Bipul's suggestion to add descriptive text explaining what is wrong. Maybe in 4.11 we can make the status clickable, add clearer text about the affected subsystem, and point the user to the right overview.
Bipul, Any update on the progress? -Bipin
The fix is now available.
Tested with the following builds:
OCP: 4.10.0-0.nightly-2022-03-19-230512
ODF: 4.10.0-198

The following steps were taken:
(a) Successfully deployed the ODF cluster and brought down one worker node.

NAME              STATUS     ROLES    AGE   VERSION
compute-0         NotReady   worker   24h   v1.23.3+e419edf
compute-1         Ready      worker   24h   v1.23.3+e419edf
compute-2         Ready      worker   24h   v1.23.3+e419edf
control-plane-0   Ready      master   25h   v1.23.3+e419edf
control-plane-1   Ready      master   25h   v1.23.3+e419edf
control-plane-2   Ready      master   25h   v1.23.3+e419edf

The alerts were present on the Data Foundation details page; screenshots are in comment #13 and comment #14.

Do we need to validate the fix with any other scenarios, or can we move it to verified based on this test scenario?

Thanks and Regards,
Mugdha
Can you test for MCG as well? You could bring NooBaa into an error state by creating a backing store and messing it up.
Tested the step mentioned in comment #17 with the following builds:
(a) OCP: 4.10.0-0.nightly-2022-03-27-074444
(b) ODF: 4.10.0-210

The following steps were performed:
(a) Deleted the target bucket of the default backing store.

**Observations**
(a) Alert generated: "A NooBaa bucket first.bucket is in error state for more than 5m"
Alert Name: NooBaaBucketErrorState

Screenshots are available at "https://docs.google.com/document/d/1fHUupVhplWKjNr1BUuErcRg22wMYnwrTeihM0jzrUm4/edit?usp=sharing".

Since the alerts are triggering for MCG as well, I believe the bug is good to be verified.

Thanks and Regards,
Mugdha Soni
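The verification above depends on firing NooBaa alerts (such as NooBaaBucketErrorState) being surfaced as degraded object-service health in the dashboard. As a rough illustration of that mapping only — the Alert shape, severity values, and thresholds below are assumptions, not the console's real types or logic:

```ts
// Illustrative only: derive an object-service health state from firing alerts.
interface Alert {
  name: string;          // e.g. "NooBaaBucketErrorState"
  severity: 'warning' | 'critical';
  state: 'firing' | 'pending';
}

function objectServiceHealth(alerts: Alert[]): 'OK' | 'Warning' | 'Error' {
  const firingNooBaa = alerts.filter(
    (a) => a.state === 'firing' && a.name.startsWith('NooBaa')
  );
  if (firingNooBaa.some((a) => a.severity === 'critical')) return 'Error';
  if (firingNooBaa.length > 0) return 'Warning';
  return 'OK';
}

// Example input modeled on the alert observed in this test (severity assumed).
console.log(
  objectServiceHealth([
    { name: 'NooBaaBucketErrorState', severity: 'warning', state: 'firing' },
  ])
); // -> "Warning"
```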
Based on comment #15 and comment #18, moving the bug to the verified state.

Thanks and Regards,
Mugdha Soni
Please add doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1372