Description of problem:
The rook-ceph-mon-pdb PDB under openshift-storage has currentHealthy=0 and desiredHealthy=0, which triggers the PodDisruptionBudgetAtLimit alert.

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="PodDisruptionBudgetAtLimit"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "PodDisruptionBudgetAtLimit",
          "alertstate": "firing",
          "namespace": "openshift-storage",
          "poddisruptionbudget": "rook-ceph-mon-pdb",
          "prometheus": "openshift-monitoring/k8s",
          "severity": "warning"
        },
        "value": [
          1653373189.529,
          "1"
        ]
      }
    ]
  }
}

The alert is defined as:

- alert: PodDisruptionBudgetAtLimit
  annotations:
    description: The pod disruption budget is at minimum disruptions allowed level.
      The number of current healthy pods is equal to desired healthy pods.
    summary: The pod disruption budget is preventing further disruption to pods.
  expr: |
    max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy == kube_poddisruptionbudget_status_desired_healthy)
  for: 60m
  labels:
    severity: warning

# oc -n openshift-storage get pdb
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb   N/A             1                 0                     21h

# oc -n openshift-storage get pdb rook-ceph-mon-pdb -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2022-05-23T08:24:23Z"
  generation: 1
  name: rook-ceph-mon-pdb
  namespace: openshift-storage
  resourceVersion: "335364"
  uid: bce93fc1-1ded-4930-8535-1b67618bbf51
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: rook-ceph-mon
status:
  conditions:
  - lastTransitionTime: "2022-05-23T09:28:36Z"
    message: ""
    observedGeneration: 1
    reason: InsufficientPods
    status: "False"
    type: DisruptionAllowed
  currentHealthy: 0
  desiredHealthy: 0
  disruptionsAllowed: 0
  expectedPods: 0
  observedGeneration: 1

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-20-213928

How reproducible:
Always

Steps to Reproduce:
1. Check alerts
2.
3.

Actual results:
PodDisruptionBudgetAtLimit alert fires for openshift-storage

Expected results:
No such alert

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
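For reference, the raw kube-state-metrics series behind the alert expression can be inspected the same way as the ALERTS query above. This is a sketch reusing the same token and thanos-querier endpoint; only the query string changes, with the namespace and PDB name taken from this report:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query={__name__=~"kube_poddisruptionbudget_status_.*",namespace="openshift-storage",poddisruptionbudget="rook-ceph-mon-pdb"}' | jq

This should show both kube_poddisruptionbudget_status_current_healthy and kube_poddisruptionbudget_status_desired_healthy at 0 for rook-ceph-mon-pdb, matching the PDB status dump above and satisfying the alert's equality check.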
# oc -n openshift-storage get pod -l app=rook-ceph-mon
No resources found in openshift-storage namespace.

# oc -n openshift-storage get pod
NAME                                                         READY   STATUS      RESTARTS   AGE
cluster-cleanup-job-4c075bcb5648a69b9dcf18fe5fd45337-xgp4n   0/1     Completed   0          21h
cluster-cleanup-job-b58e4193dd2b3724954abf30fec35ce1-72b5p   0/1     Completed   0          21h
cluster-cleanup-job-d251d276fdbde7696d559d6acb423ad5-jmsqd   0/1     Completed   0          21h
csi-addons-controller-manager-7f59b4549c-25k7s               2/2     Running     0          21h
noobaa-core-0                                                1/1     Running     0          21h
noobaa-db-pg-0                                               1/1     Running     0          21h
noobaa-endpoint-5b667d696d-s7xpf                             1/1     Running     0          21h
noobaa-operator-986dbff8c-jxbf2                              1/1     Running     0          21h
ocs-metrics-exporter-5565885f75-rmwht                        1/1     Running     0          21h
ocs-operator-c76b54f4d-bqs2n                                 1/1     Running     0          21h
odf-console-7b7848fb96-98hrq                                 1/1     Running     0          21h
odf-operator-controller-manager-7896b69588-8vzpp             2/2     Running     0          21h
rook-ceph-operator-56f9f8695b-vsqpt                          1/1     Running     0          21h
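Two additional checks that could help narrow this down (a sketch, assuming the PDB selector app=rook-ceph-mon from the spec above and that the CephCluster CRD is still installed, since rook-ceph-operator is running): Rook creates one deployment per mon, so listing deployments with the same selector shows whether the mons were removed or simply never scaled up, and the CephCluster CR shows whether the Ceph cluster itself still exists or is being deleted.

# oc -n openshift-storage get deploy -l app=rook-ceph-mon
# oc -n openshift-storage get cephcluster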
Junqi, please provide more details:

- How did you reproduce this? Did you install and then uninstall? The cluster-cleanup jobs suggest there was an uninstall.
- Please share an ODF must-gather. At a minimum, the rook-ceph-operator log likely shows why there are no mons running (see the commands sketched below).

The alert itself is valid because there are no mons running. The real question is how the cluster arrived at this invalid configuration.
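A sketch of the commands that would capture the requested data. The deploy/rook-ceph-operator name is inferred from the operator pod in the listing above, and <odf-must-gather-image> is a placeholder for the must-gather image matching the installed ODF version:

# oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator.log
# oc adm must-gather --image=<odf-must-gather-image> --dest-dir=odf-must-gather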