Bug 2259852

Summary: Alert "CephMonLowNumber" not triggered for rack,host based failure domains
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Nikhil Ladha <nladha>
Component: ocs-operator
Assignee: umanga <uchapaga>
Status: CLOSED ERRATA
QA Contact: Joy John Pinto <jopinto>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.15
CC: muagarwa, ngowda, nladha, nthomas, odf-bz-bot
Target Milestone: ---
Target Release: ODF 4.15.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.15.0-134
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-03-19 15:32:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2260340

Description Nikhil Ladha 2024-01-23 11:05:01 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The newly created alert `CephMonLowNumber` is raised based on `label_failure_domain_zones`, which only takes into account the failure zones present in the cluster. If the cluster has no failure zones set, for example on platforms such as vSphere/BM where the failure domains are rack- or host-based, the alert is not shown.

Version of all relevant components (if applicable):
4.15

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
N.A

Is there any workaround available to the best of your knowledge?
Yes

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
N.A

Steps to Reproduce:
1. Create a cluster with 5 or more nodes that uses rack- or host-based failure domains
2. Observe that no alert is fired for the low mon count in the cluster


Actual results:
The "CephMonLowNumber" alert is not shown to the user.

Expected results:
The "CephMonLowNumber" alert should be shown to the user
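
The expected behaviour can be illustrated with a minimal sketch. It is not the actual alert rule shipped by ocs-operator (that expression is not shown in this report). It assumes the alert should fire when the cluster has at least 5 distinct failure domains of any kind (zone, rack or host) while fewer than 5 mons are running. The names `ClusterTopology`, `mon_low_number` and `desired_mon_count` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterTopology:
    """Hypothetical view of the fields relevant to the alert."""
    failure_domain: str               # "zone", "rack" or "host"
    failure_domain_values: list[str]  # e.g. ["rack0", "rack1", ...]
    current_mon_count: int


def mon_low_number(topology: ClusterTopology, desired_mon_count: int = 5) -> bool:
    """Return True when the mon count could be raised to desired_mon_count.

    Assumption: the decision depends only on how many distinct failure
    domains exist, not on whether they are zones, racks or hosts.
    """
    distinct_domains = len(set(topology.failure_domain_values))
    return (
        distinct_domains >= desired_mon_count
        and topology.current_mon_count < desired_mon_count
    )


# A rack-based cluster with five racks and three mons should raise the alert.
rack_cluster = ClusterTopology(
    failure_domain="rack",
    failure_domain_values=["rack0", "rack1", "rack3", "rack4", "rack5"],
    current_mon_count=3,
)
print(mon_low_number(rack_cluster))
```

The bug described here corresponds to a check that only counts zone labels, so a cluster like `rack_cluster` would never satisfy it.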

Comment 9 Joy John Pinto 2024-02-14 09:33:05 UTC
Verified with ODF build 4.15.0-134 and OCP 4.15.

The CephMonLowNumber alert is triggered when all the worker nodes carry the openshift-storage label (oc label node compute-5 cluster.ocs.openshift.io/openshift-storage="") and are labelled as different racks using the command (oc label node compute-3 topology.rook.io/rack=rack3 --overwrite=true)

[jopinto@jopinto 5mon]$ oc get nodes
NAME              STATUS   ROLES                  AGE   VERSION
compute-0         Ready    worker                 15h   v1.28.6+f1618d5
compute-1         Ready    worker                 15h   v1.28.6+f1618d5
compute-2         Ready    worker                 15h   v1.28.6+f1618d5
compute-3         Ready    worker                 15h   v1.28.6+f1618d5
compute-4         Ready    worker                 15h   v1.28.6+f1618d5
compute-5         Ready    worker                 15h   v1.28.6+f1618d5
control-plane-0   Ready    control-plane,master   15h   v1.28.6+f1618d5
control-plane-1   Ready    control-plane,master   15h   v1.28.6+f1618d5
control-plane-2   Ready    control-plane,master   15h   v1.28.6+f1618d5

[jopinto@jopinto 5mon]$ oc get storagecluster -o yaml -n openshift-storage
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2024-02-14T07:08:20Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 3
    managedFields:
    - apiVersion: ocs.openshift.io/v1
.....
    currentMonCount: 3
    failureDomain: rack
    failureDomainKey: topology.rook.io/rack
    failureDomainValues:
    - rack0
    - rack1
    - rack3
    - rack4
    - rack5
    kmsServerConnection: {}
    lastAppliedResourceProfile: balanced
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-0
        - compute-1
        - compute-2
        - compute-3
        - compute-4
        - compute-5
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack3
        - rack4
        - rack5
    phase: Ready
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "545068"
      uid: 8834765c-9c1e-452c-9249-ccda00361b6e
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "545238"
      uid: fbb90d4e-f3ec-4cf3-bfd9-6cbbe5a3ae29
    version: 4.15.0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/alerts' | grep mon
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "CephMonLowNumber",
          "container": "ocs-metrics-exporter",
          "endpoint": "metrics",
          "exported_namespace": "openshift-storage",
          "failure_domain": "rack",
          "instance": "10.130.2.22:8080",
          "job": "ocs-metrics-exporter",
          "managedBy": "ocs-storagecluster",
          "name": "ocs-storagecluster",
          "namespace": "openshift-storage",
          "pod": "ocs-metrics-exporter-8bf58c567-f5wrk",
          "service": "ocs-metrics-exporter",
          "severity": "info"
        },
        "annotations": {
          "description": "The number of node failure zones available (5) allow to increase the number of Ceph monitors from 3 to 5 in order to improve cluster resilience.",
          "message": "The current number of Ceph monitors can be increased in order to improve cluster resilience.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMonLowNumber.md",
          "severity_level": "info",
          "storage_type": "ceph"
        },
        "state": "firing",
        "activeAt": "2024-02-14T09:26:10.40133668Z",
        "value": "-2e+00"
      }]
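
The same check can be done programmatically instead of with grep. The following is a minimal sketch, assuming the complete JSON body returned by the Prometheus `/api/v1/alerts` endpoint is available as a string. The helper name `firing_alerts` is hypothetical, and the sample payload is a shortened, well-formed copy of the output above (the pasted output itself is truncated).

```python
import json


def firing_alerts(payload: str, alertname: str) -> list[dict]:
    """Return the alerts named `alertname` that are in the "firing" state."""
    body = json.loads(payload)
    if body.get("status") != "success":
        return []
    alerts = body.get("data", {}).get("alerts", [])
    return [
        alert
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == alertname
        and alert.get("state") == "firing"
    ]


# Shortened sample built from the fields shown in the verification output.
sample = json.dumps({
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {
                    "alertname": "CephMonLowNumber",
                    "failure_domain": "rack",
                    "severity": "info",
                },
                "state": "firing",
                "value": "-2e+00",
            }
        ]
    },
})

matches = firing_alerts(sample, "CephMonLowNumber")
print(len(matches), matches[0]["labels"]["failure_domain"])
```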

Comment 10 errata-xmlrpc 2024-03-19 15:32:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383