Bug 2259852
| Summary: | Alert "CephMonLowNumber" not triggered for rack,host based failure domains | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Nikhil Ladha <nladha> |
| Component: | ocs-operator | Assignee: | umanga <uchapaga> |
| Status: | CLOSED ERRATA | QA Contact: | Joy John Pinto <jopinto> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.15 | CC: | muagarwa, ngowda, nladha, nthomas, odf-bz-bot |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.15.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.15.0-134 | Doc Type: | No Doc Update |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2024-03-19 15:32:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2260340 | ||
|
Description
Nikhil Ladha
2024-01-23 11:05:01 UTC
Verified with ODF build 4.15.0-134 and OCP 4.15.
The CephMonLowNumber alert is triggered when all the worker nodes contain the openshift-storage label (oc label node compute-5 cluster.ocs.openshift.io/openshift-storage="") and are labeled as different racks using a command such as (oc label node compute-3 topology.rook.io/rack=rack3 --overwrite=true):
[jopinto@jopinto 5mon]$ oc get nodes
NAME STATUS ROLES AGE VERSION
compute-0 Ready worker 15h v1.28.6+f1618d5
compute-1 Ready worker 15h v1.28.6+f1618d5
compute-2 Ready worker 15h v1.28.6+f1618d5
compute-3 Ready worker 15h v1.28.6+f1618d5
compute-4 Ready worker 15h v1.28.6+f1618d5
compute-5 Ready worker 15h v1.28.6+f1618d5
control-plane-0 Ready control-plane,master 15h v1.28.6+f1618d5
control-plane-1 Ready control-plane,master 15h v1.28.6+f1618d5
control-plane-2 Ready control-plane,master 15h v1.28.6+f1618d5
[jopinto@jopinto 5mon]$ oc get storagecluster -o yaml -n openshift-storage
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
annotations:
uninstall.ocs.openshift.io/cleanup-policy: delete
uninstall.ocs.openshift.io/mode: graceful
creationTimestamp: "2024-02-14T07:08:20Z"
finalizers:
- storagecluster.ocs.openshift.io
generation: 3
managedFields:
- apiVersion: ocs.openshift.io/v1
.....
currentMonCount: 3
failureDomain: rack
failureDomainKey: topology.rook.io/rack
failureDomainValues:
- rack0
- rack1
- rack3
- rack4
- rack5
kmsServerConnection: {}
lastAppliedResourceProfile: balanced
nodeTopologies:
labels:
kubernetes.io/hostname:
- compute-0
- compute-1
- compute-2
- compute-3
- compute-4
- compute-5
topology.rook.io/rack:
- rack0
- rack1
- rack3
- rack4
- rack5
phase: Ready
relatedObjects:
- apiVersion: ceph.rook.io/v1
kind: CephCluster
name: ocs-storagecluster-cephcluster
namespace: openshift-storage
resourceVersion: "545068"
uid: 8834765c-9c1e-452c-9249-ccda00361b6e
- apiVersion: noobaa.io/v1alpha1
kind: NooBaa
name: noobaa
namespace: openshift-storage
resourceVersion: "545238"
uid: fbb90d4e-f3ec-4cf3-bfd9-6cbbe5a3ae29
version: 4.15.0
kind: List
metadata:
resourceVersion: ""
selfLink: ""
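The nodeTopologies section above is what the recommendation is derived from. A minimal sketch (illustrative only, not the operator's actual code; the data is copied from the status output above) of how the failure-domain count follows from those labels:

```python
# Hypothetical sketch: derive the number of failure domains from the
# nodeTopologies labels reported in the StorageCluster status above.
node_topologies = {
    "kubernetes.io/hostname": [
        "compute-0", "compute-1", "compute-2",
        "compute-3", "compute-4", "compute-5",
    ],
    "topology.rook.io/rack": ["rack0", "rack1", "rack3", "rack4", "rack5"],
}

# failureDomainKey from the same status output
failure_domain_key = "topology.rook.io/rack"

# 6 storage nodes, but only 5 distinct rack failure domains
zones = len(set(node_topologies[failure_domain_key]))
print(zones)
```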
[jopinto@jopinto 5mon]$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -s 'http://localhost:9090/api/v1/alerts' | grep mon
{
"status": "success",
"data": {
"alerts": [
{
"labels": {
"alertname": "CephMonLowNumber",
"container": "ocs-metrics-exporter",
"endpoint": "metrics",
"exported_namespace": "openshift-storage",
"failure_domain": "rack",
"instance": "10.130.2.22:8080",
"job": "ocs-metrics-exporter",
"managedBy": "ocs-storagecluster",
"name": "ocs-storagecluster",
"namespace": "openshift-storage",
"pod": "ocs-metrics-exporter-8bf58c567-f5wrk",
"service": "ocs-metrics-exporter",
"severity": "info"
},
"annotations": {
"description": "The number of node failure zones available (5) allow to increase the number of Ceph monitors from 3 to 5 in order to improve cluster resilience.",
"message": "The current number of Ceph monitors can be increased in order to improve cluster resilience.",
"runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMonLowNumber.md",
"severity_level": "info",
"storage_type": "ceph"
},
"state": "firing",
"activeAt": "2024-02-14T09:26:10.40133668Z",
"value": "-2e+00"
}]
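The firing alert's value of -2e+00 is consistent with the current monitor count minus the count the five failure domains could host (3 - 5). A minimal illustration, assuming that interpretation (the function names and the 3/5 threshold are assumptions based on the alert description above, not the exporter's actual code):

```python
def recommended_mon_count(failure_domains: int) -> int:
    """Largest odd mon count (3 or 5) placeable with one mon per domain.
    The 3/5 threshold is an assumption matching the alert description."""
    return 5 if failure_domains >= 5 else 3

def alert_value(current_mons: int, failure_domains: int) -> int:
    # Negative when the cluster runs fewer mons than its zones allow,
    # which is when CephMonLowNumber would fire.
    return current_mons - recommended_mon_count(failure_domains)

print(alert_value(3, 5))  # -2, matching the value reported above
```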
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383