Bug 2089606 - PodDisruptionBudgetAtLimit alert for openshift-storage
Summary: PodDisruptionBudgetAtLimit alert for openshift-storage
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-24 07:06 UTC by Junqi Zhao
Modified: 2023-08-09 17:03 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-31 06:29:57 UTC
Embargoed:



Description Junqi Zhao 2022-05-24 07:06:12 UTC
Description of problem:
The rook-ceph-mon-pdb PodDisruptionBudget under openshift-storage has currentHealthy=0 and desiredHealthy=0, which triggers the PodDisruptionBudgetAtLimit alert.
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="PodDisruptionBudgetAtLimit"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "PodDisruptionBudgetAtLimit",
          "alertstate": "firing",
          "namespace": "openshift-storage",
          "poddisruptionbudget": "rook-ceph-mon-pdb",
          "prometheus": "openshift-monitoring/k8s",
          "severity": "warning"
        },
        "value": [
          1653373189.529,
          "1"
        ]
      }
    ]
  }
}


        - alert: PodDisruptionBudgetAtLimit
          annotations:
            description: The pod disruption budget is at minimum disruptions allowed level.
              The number of current healthy pods is equal to desired healthy pods.
            summary: The pod disruption budget is preventing further disruption to pods.
          expr: |
            max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy == kube_poddisruptionbudget_status_desired_healthy)
          for: 60m
          labels:
            severity: warning
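
For reference, the raw kube-state-metrics series behind the alert expression can be queried through the same Thanos querier used in the description; a minimal sketch, reusing the $token obtained above:

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=kube_poddisruptionbudget_status_current_healthy{namespace="openshift-storage"}' | jq '.data.result'
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=kube_poddisruptionbudget_status_desired_healthy{namespace="openshift-storage"}' | jq '.data.result'

Both series report 0 for rook-ceph-mon-pdb here (see the PDB status below), so the equality in the alert expression matches and the alert fires.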

# oc -n openshift-storage get pdb
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb   N/A             1                 0                     21h
# oc -n openshift-storage get pdb rook-ceph-mon-pdb -oyaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2022-05-23T08:24:23Z"
  generation: 1
  name: rook-ceph-mon-pdb
  namespace: openshift-storage
  resourceVersion: "335364"
  uid: bce93fc1-1ded-4930-8535-1b67618bbf51
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: rook-ceph-mon
status:
  conditions:
  - lastTransitionTime: "2022-05-23T09:28:36Z"
    message: ""
    observedGeneration: 1
    reason: InsufficientPods
    status: "False"
    type: DisruptionAllowed
  currentHealthy: 0
  desiredHealthy: 0
  disruptionsAllowed: 0
  expectedPods: 0
  observedGeneration: 1

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-20-213928

How reproducible:
always

Steps to Reproduce:
1. check alerts
2.
3.

Actual results:
PodDisruptionBudgetAtLimit alert for openshift-storage

Expected results:
no such alert

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 2 Junqi Zhao 2022-05-24 07:08:31 UTC
# oc -n openshift-storage get pod -l app=rook-ceph-mon
No resources found in openshift-storage namespace.

# oc -n openshift-storage get pod
NAME                                                         READY   STATUS      RESTARTS   AGE
cluster-cleanup-job-4c075bcb5648a69b9dcf18fe5fd45337-xgp4n   0/1     Completed   0          21h
cluster-cleanup-job-b58e4193dd2b3724954abf30fec35ce1-72b5p   0/1     Completed   0          21h
cluster-cleanup-job-d251d276fdbde7696d559d6acb423ad5-jmsqd   0/1     Completed   0          21h
csi-addons-controller-manager-7f59b4549c-25k7s               2/2     Running     0          21h
noobaa-core-0                                                1/1     Running     0          21h
noobaa-db-pg-0                                               1/1     Running     0          21h
noobaa-endpoint-5b667d696d-s7xpf                             1/1     Running     0          21h
noobaa-operator-986dbff8c-jxbf2                              1/1     Running     0          21h
ocs-metrics-exporter-5565885f75-rmwht                        1/1     Running     0          21h
ocs-operator-c76b54f4d-bqs2n                                 1/1     Running     0          21h
odf-console-7b7848fb96-98hrq                                 1/1     Running     0          21h
odf-operator-controller-manager-7896b69588-8vzpp             2/2     Running     0          21h
rook-ceph-operator-56f9f8695b-vsqpt                          1/1     Running     0          21h
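
As a side note, when no mon pods exist, the CephCluster resource's status usually records why the mons were not created; a possible check, assuming the CephCluster CR still exists in this namespace (the cleanup jobs above suggest it may already have been removed):

# oc -n openshift-storage get cephcluster
# oc -n openshift-storage get cephcluster -o yaml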

Comment 3 Travis Nielsen 2022-05-24 18:41:22 UTC
Junqi, please provide more details:
- How did you reproduce this? Did you install and then uninstall? Since there are cleanup jobs, it appears there was an uninstall.
- Please share an ODF must-gather. At a minimum, the rook-ceph-operator log likely shows why there are no mons running.

The alert is valid because there are no mons running. The question is really how you arrived at this invalid config.
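
For reference, the requested data can typically be gathered along these lines; this is only a sketch, and the must-gather image reference is an assumption that may differ per ODF release:

# oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator.log
# oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.11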

