Bug 2259852 - Alert "CephMonLowNumber" not triggered for rack,host based failure domains
Summary: Alert "CephMonLowNumber" not triggered for rack,host based failure domains
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: umanga
QA Contact: Joy John Pinto
URL:
Whiteboard:
Depends On:
Blocks: 2260340
 
Reported: 2024-01-23 11:05 UTC by Nikhil Ladha
Modified: 2024-03-19 15:32 UTC
CC: 5 users

Fixed In Version: 4.15.0-134
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:32:05 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 2430 0 None open Fix CephMonLowNumber alert 2024-01-30 11:38:10 UTC
Github red-hat-storage ocs-operator pull 2432 0 None open Bug 2259852: [release-4.15] Fix CephMonLowNumber alert 2024-01-30 13:13:22 UTC
Github red-hat-storage ocs-operator pull 2443 0 None open Bug 2259852: [release-4.15] stop CephMonLowNumber alert when MON count is 5 2024-02-05 07:33:51 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:32:12 UTC

Description Nikhil Ladha 2024-01-23 11:05:01 UTC
Description of problem (please be detailed as possible and provide log
snippets):
The newly added `CephMonLowNumber` alert is raised based on `label_failure_domain_zones`, which only takes into account the failure zones present in the cluster. On platforms such as vSphere/BM that have no zones and instead use rack- or host-based failure domains, the alert is never shown.
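For illustration only, a minimal PrometheusRule sketch of the kind of rule involved; the metric names odf_mon_count and odf_failure_domain_count are placeholders, not series actually exported by ocs-operator. The point is that the rule should compare the MON count against the number of available failure domains of any type (zone, rack or host), not zones only:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cephmonlownumber-sketch
  namespace: openshift-storage
spec:
  groups:
  - name: ceph-mon-sketch
    rules:
    - alert: CephMonLowNumber
      # Placeholder metrics: fire when fewer than 5 MONs are running but at
      # least 5 failure domains (of any type) are available.
      expr: (odf_mon_count < 5) and (odf_failure_domain_count >= 5)
      for: 15m
      labels:
        severity: info
      annotations:
        message: The current number of Ceph monitors can be increased in order to improve cluster resilience.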

Version of all relevant components (if applicable):
4.15

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
N.A

Is there any workaround available to the best of your knowledge?
Yes

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
N.A

Steps to Reproduce:
1. Create a cluster with 5 or more nodes and a rack/host-based failure domain
2. Observe that no alert is fired for the low MON count in the cluster


Actual results:
The "CephMonLowNumber" alert is not shown to the user.

Expected results:
The "CephMonLowNumber" alert should be shown to the user

Comment 9 Joy John Pinto 2024-02-14 09:33:05 UTC
Verified with ODF build 4.15.0-134 and OCP 4.15.

The CephMonLowNumber alert is triggered when all the worker nodes carry the openshift-storage label (oc label node compute-5 cluster.ocs.openshift.io/openshift-storage="") and are labelled as different racks (oc label node compute-3 topology.rook.io/rack=rack3 --overwrite=true).
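For reference, a sketch of those labelling commands expanded over all six workers (the rack names only need to be distinct per node; adjust node names to your environment):

for i in 0 1 2 3 4 5; do
  # Mark the node as an ODF storage node, then pin it to its own rack.
  oc label node "compute-${i}" cluster.ocs.openshift.io/openshift-storage="" --overwrite=true
  oc label node "compute-${i}" "topology.rook.io/rack=rack${i}" --overwrite=true
done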

[jopinto@jopinto 5mon]$ oc get nodes
NAME              STATUS   ROLES                  AGE   VERSION
compute-0         Ready    worker                 15h   v1.28.6+f1618d5
compute-1         Ready    worker                 15h   v1.28.6+f1618d5
compute-2         Ready    worker                 15h   v1.28.6+f1618d5
compute-3         Ready    worker                 15h   v1.28.6+f1618d5
compute-4         Ready    worker                 15h   v1.28.6+f1618d5
compute-5         Ready    worker                 15h   v1.28.6+f1618d5
control-plane-0   Ready    control-plane,master   15h   v1.28.6+f1618d5
control-plane-1   Ready    control-plane,master   15h   v1.28.6+f1618d5
control-plane-2   Ready    control-plane,master   15h   v1.28.6+f1618d5

[jopinto@jopinto 5mon]$ oc get storagecluster -o yaml -n openshift-storage
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2024-02-14T07:08:20Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 3
    managedFields:
    - apiVersion: ocs.openshift.io/v1
.....
    currentMonCount: 3
    failureDomain: rack
    failureDomainKey: topology.rook.io/rack
    failureDomainValues:
    - rack0
    - rack1
    - rack3
    - rack4
    - rack5
    kmsServerConnection: {}
    lastAppliedResourceProfile: balanced
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - compute-0
        - compute-1
        - compute-2
        - compute-3
        - compute-4
        - compute-5
        topology.rook.io/rack:
        - rack0
        - rack1
        - rack3
        - rack4
        - rack5
    phase: Ready
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "545068"
      uid: 8834765c-9c1e-452c-9249-ccda00361b6e
    - apiVersion: noobaa.io/v1alpha1
      kind: NooBaa
      name: noobaa
      namespace: openshift-storage
      resourceVersion: "545238"
      uid: fbb90d4e-f3ec-4cf3-bfd9-6cbbe5a3ae29
    version: 4.15.0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

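As a quick cross-check (assuming the default ocs-storagecluster resource name shown above), the failure domain and current MON count can be read directly from the StorageCluster status:

oc get storagecluster ocs-storagecluster -n openshift-storage \
  -o jsonpath='{.status.failureDomain}{" "}{.status.currentMonCount}{"\n"}'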
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/alerts' | grep mon
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "CephMonLowNumber",
          "container": "ocs-metrics-exporter",
          "endpoint": "metrics",
          "exported_namespace": "openshift-storage",
          "failure_domain": "rack",
          "instance": "10.130.2.22:8080",
          "job": "ocs-metrics-exporter",
          "managedBy": "ocs-storagecluster",
          "name": "ocs-storagecluster",
          "namespace": "openshift-storage",
          "pod": "ocs-metrics-exporter-8bf58c567-f5wrk",
          "service": "ocs-metrics-exporter",
          "severity": "info"
        },
        "annotations": {
          "description": "The number of node failure zones available (5) allow to increase the number of Ceph monitors from 3 to 5 in order to improve cluster resilience.",
          "message": "The current number of Ceph monitors can be increased in order to improve cluster resilience.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMonLowNumber.md",
          "severity_level": "info",
          "storage_type": "ceph"
        },
        "state": "firing",
        "activeAt": "2024-02-14T09:26:10.40133668Z",
        "value": "-2e+00"
      }]

Comment 10 errata-xmlrpc 2024-03-19 15:32:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

