Description of problem (please be as detailed as possible and provide log snippets):

As discussed recently in the ODF operators weekly, the Rook community has determined that the Prometheus alerts should no longer be installed by the Rook operator with the CephCluster CR option "monitoring.enabled: true".

- The alerts now need to be created by the OCS operator. The code that was in Rook for creating the rules should be straightforward to move over to the OCS operator.
- The alerts will be customizable in the future by
- The "monitoring.enabled: true" option will only create resources around Prometheus, but not the alerts.
- This change allows the upstream alerts to be updated more aggressively to match the Ceph repo, while the downstream alerts can be updated when QE is ready to sign off.
- The Prometheus rules will be installed upstream with the Helm chart.

The upstream Rook change is found here: https://github.com/rook/rook/pull/9837

Version of all relevant components (if applicable):
The change affects 4.11

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Alerts will not be available after the next time Rook is synced until this fix is made.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes
To finish the second bullet above:

- The alerts will be customizable in the future by an OCP feature described in this enhancement: https://github.com/openshift/enhancements/pull/958

Customization upstream will be based on a Helm chart post-processor.
Do I understand correctly that unless QE needs to disable or tweak monitoring, the default behavior doesn't change, so this is mostly regression testing (monitoring should still work as it did before)?
Correct, a regression test will be sufficient since the alerts are expected to remain the same. Also, could you test during an upgrade from 4.10 to 4.11 that the alerts are still preserved? The PrometheusRule CR now created by the OCS operator has a slightly different resource name (prometheus-ceph-rules) than the CR that had been created by Rook. The rules CR previously had the major Ceph version in its name (v14 or v16) and survived ODF upgrades, but I just want to confirm it's OK in this case as well, thanks.
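As a rough illustration of the upgrade check requested above, here is a minimal Python sketch that compares the alert names in two PrometheusRule objects, ignoring the resource name since that is expected to change. It assumes the objects have already been loaded into plain dicts (for example from exported YAML or JSON). The sample alert names, expressions, and the pre-upgrade resource name are hypothetical placeholders, not the actual ODF rules; only "prometheus-ceph-rules" comes from this report.

```python
def alert_names(prometheus_rule):
    """Collect the alert names defined in a PrometheusRule-shaped dict."""
    names = set()
    for group in prometheus_rule.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules carry "record" instead of "alert"; skip them.
            if "alert" in rule:
                names.add(rule["alert"])
    return names


def compare_alerts(before, after):
    """Return (missing, added) alert names between two PrometheusRule dicts."""
    old, new = alert_names(before), alert_names(after)
    return sorted(old - new), sorted(new - old)


# Hypothetical sample data: alert names and expressions are placeholders.
sample_rules = [
    {"alert": "ExampleAlertA", "expr": "up == 0"},
    {"alert": "ExampleAlertB", "expr": "up == 0"},
    {"record": "example:recording_rule", "expr": "sum(up)"},
]

before_upgrade = {
    # Placeholder for the name of the CR previously created by Rook.
    "metadata": {"name": "rules-created-by-rook-placeholder"},
    "spec": {"groups": [{"name": "example-group", "rules": sample_rules}]},
}

after_upgrade = {
    # Resource name of the CR created by the OCS operator, per this report.
    "metadata": {"name": "prometheus-ceph-rules"},
    "spec": {"groups": [{"name": "example-group", "rules": sample_rules}]},
}

missing, added = compare_alerts(before_upgrade, after_upgrade)
print("missing after upgrade:", missing)
print("added after upgrade:", added)
```

An empty "missing" list would indicate that no alert was lost across the upgrade, regardless of which PrometheusRule resource now carries the rules.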
Thanks, so besides standard regression testing, we will also run alerting tests after an upgrade.
Moving to verified based on test results of applicable alerting tests for ODF 4.11.0 RC3 build (run ID 1660738848).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6156