Description of problem (please be as detailed as possible and provide log snippets):

The newly created alert `CephMonLowNumber` is raised based on `label_failure_domain_zones`, which takes into account only the zone failure domains present in the cluster. But if the cluster doesn't have zone failure domains set, for example on platforms like vSphere/BM where we have rack/host based failure domains, the alert is not shown.

Version of all relevant components (if applicable):
4.15

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
N.A

Is there any workaround available to the best of your knowledge?
Yes

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
N.A

Steps to Reproduce:
1. Create a cluster with 5 or more nodes using a rack/host based failure domain (an example labeling sequence is sketched below)
2. No alert will be fired for low mon count in the cluster

Actual results:
The "CephMonLowNumber" alert is not shown to the user.

Expected results:
The "CephMonLowNumber" alert should be shown to the user.
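For step 1, a minimal sketch of one way to get a rack-based failure domain on an existing cluster with six workers: label each worker as an ODF storage node and assign it its own rack before creating the StorageCluster. The node names (compute-0 .. compute-5) and rack values are illustrative assumptions, not part of this report.

# Illustrative only: mark each assumed worker node as an ODF storage node
# and place it in its own rack so the StorageCluster picks a rack failure domain.
for i in 0 1 2 3 4 5; do
  oc label node "compute-${i}" cluster.ocs.openshift.io/openshift-storage="" --overwrite=true
  oc label node "compute-${i}" topology.rook.io/rack="rack${i}" --overwrite=true
done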
Verified with ODF build 4.15.0-134 and OCP 4.15. The CephMonLowNumber alert is triggered when all the worker nodes contain the openshift-storage label (oc label node compute-5 cluster.ocs.openshift.io/openshift-storage="") and are labelled as different racks (oc label node compute-3 topology.rook.io/rack=rack3 --overwrite=true).

[jopinto@jopinto 5mon]$ oc get nodes
NAME              STATUS   ROLES                  AGE   VERSION
compute-0         Ready    worker                 15h   v1.28.6+f1618d5
compute-1         Ready    worker                 15h   v1.28.6+f1618d5
compute-2         Ready    worker                 15h   v1.28.6+f1618d5
compute-3         Ready    worker                 15h   v1.28.6+f1618d5
compute-4         Ready    worker                 15h   v1.28.6+f1618d5
compute-5         Ready    worker                 15h   v1.28.6+f1618d5
control-plane-0   Ready    control-plane,master   15h   v1.28.6+f1618d5
control-plane-1   Ready    control-plane,master   15h   v1.28.6+f1618d5
control-plane-2   Ready    control-plane,master   15h   v1.28.6+f1618d5

[jopinto@jopinto 5mon]$ oc get storagecluster -o yaml -n openshift-storage
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2024-02-14T07:08:20Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 3
    managedFields:
    - apiVersion: ocs.openshift.io/v1
  .....
    currentMonCount: 3
    failureDomain: rack
    failureDomainKey: topology.rook.io/rack
    failureDomainValues:
    - rack0
    - rack1
    - rack3
    - rack4
    - rack5
    kmsServerConnection: {}
    lastAppliedResourceProfile: balanced
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-0
        - compute-1
        - compute-2
        - compute-3
        - compute-4
        - compute-5
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack3
        - rack4
        - rack5
    phase: Ready
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "545068"
      uid: 8834765c-9c1e-452c-9249-ccda00361b6e
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "545238"
      uid: fbb90d4e-f3ec-4cf3-bfd9-6cbbe5a3ae29
    version: 4.15.0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

[jopinto@jopinto 5mon]$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/alerts' | grep mon
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "CephMonLowNumber",
          "container": "ocs-metrics-exporter",
          "endpoint": "metrics",
          "exported_namespace": "openshift-storage",
          "failure_domain": "rack",
          "instance": "10.130.2.22:8080",
          "job": "ocs-metrics-exporter",
          "managedBy": "ocs-storagecluster",
          "name": "ocs-storagecluster",
          "namespace": "openshift-storage",
          "pod": "ocs-metrics-exporter-8bf58c567-f5wrk",
          "service": "ocs-metrics-exporter",
          "severity": "info"
        },
        "annotations": {
          "description": "The number of node failure zones available (5) allow to increase the number of Ceph monitors from 3 to 5 in order to improve cluster resilience.",
          "message": "The current number of Ceph monitors can be increased in order to improve cluster resilience.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMonLowNumber.md",
          "severity_level": "info",
          "storage_type": "ceph"
        },
        "state": "firing",
        "activeAt": "2024-02-14T09:26:10.40133668Z",
        "value": "-2e+00"
      }]
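For convenience, the same check can be narrowed to just this alert. The jq filter below is an illustrative assumption (it needs jq installed on the workstation running oc); the raw curl | grep form above works as well.

# Query the in-cluster Prometheus and print only the CephMonLowNumber alert's
# state and failure domain. jq runs locally on the oc client machine.
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname == "CephMonLowNumber") | {state: .state, failure_domain: .labels.failure_domain}'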
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383