Bug 1954790
| Summary: | KCM Alert PodDisruptionBudget At and Limit do not alert with maxUnavailable or MinAvailable by percentage | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Matthew Robson <mrobson> |
| Component: | kube-controller-manager | Assignee: | ravig <rgudimet> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | aos-bugs, dhellmann, knarra, maszulik, mfojtik, rgudimet, steven.barre, vrutkovs, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:04:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1968532, 1968555 | | |
Description
Matthew Robson
2021-04-28 18:59:09 UTC
This caused a regression in upgrade jobs: it assumes that all master nodes must upgrade within 15 minutes. Instead, this alert should use a more sophisticated expression:

count_over_time((kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy)[15m:10s]) > 0

This ensures that the PDB was not violated for more than 10 seconds within a 15-minute window.

A better idea: check the `cluster_version` metric; if `type` is `updating`, the alert should not fire.

Not firing the alert during upgrades would be an issue as well. That is how we found the issue with the alert. A customer had some bad PDBs that caused the MCP rollout to hang for hours on the 4.6.25 upgrade before someone noticed. Then we realized the alerts were broken.

Matt

Can see the alert now with the latest payload:

[root@localhost ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-11-024306   True        False         9m30s   Cluster version is 4.8.0-0.nightly-2021-06-11-024306

Steps:

1) Cordon one of the nodes:

[root@localhost ~]# oc adm cordon yinzhou-bug-pkv6w-master-0.c.openshift-qe.internal
node/yinzhou-bug-pkv6w-master-0.c.openshift-qe.internal cordoned
[root@localhost ~]# oc get node
NAME                                                 STATUS                     ROLES    AGE   VERSION
yinzhou-bug-pkv6w-master-0.c.openshift-qe.internal   Ready,SchedulingDisabled   master   50m   v1.21.0-rc.0+a5ec692

2) Delete one of the etcd pods:

[root@localhost ~]# oc delete po etcd-quorum-guard-b8668f655-28c4x -n openshift-etcd
pod "etcd-quorum-guard-b8668f655-28c4x" deleted
[root@localhost ~]# oc get po
NAME                                READY   STATUS    RESTARTS   AGE
etcd-quorum-guard-b8668f655-5z524   1/1     Running   0          49m
etcd-quorum-guard-b8668f655-ck6ps   0/1     Pending   0          14s

3) Wait for some time, then check the alerts:

[root@localhost ~]# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
[root@localhost ~]# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4278    0  4278    0     0  97227      0 --:--:-- --:--:-- --:--:-- 97227
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "KubePodNotReady",
          "namespace": "openshift-etcd",
          "pod": "etcd-quorum-guard-b8668f655-ck6ps",
          "severity": "warning"
        },
        "annotations": {
          "description": "Pod openshift-etcd/etcd-quorum-guard-b8668f655-ck6ps has been in a non-ready state for longer than 15 minutes.",
          "summary": "Pod has been in a non-ready state for more than 15 minutes."

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
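For readers unfamiliar with subqueries, the expression proposed in the comments, count_over_time((kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy)[15m:10s]) > 0, can be roughly modeled as follows. This is only a simplified sketch, not the operator's rule or Prometheus itself: the function names and the sample data are hypothetical, and it assumes both series are sampled at the same 10-second steps across one 15-minute window.

```python
from typing import Sequence

STEP_SECONDS = 10          # subquery resolution in the proposed expression ([15m:10s])
WINDOW_SECONDS = 15 * 60   # subquery range in the proposed expression


def count_violations(current_healthy: Sequence[int],
                     desired_healthy: Sequence[int]) -> int:
    """Count the samples in which current_healthy < desired_healthy.

    Hypothetical helper: models the filtering comparison followed by
    count_over_time over the sampled window.
    """
    if len(current_healthy) != len(desired_healthy):
        raise ValueError("both series must be sampled at the same steps")
    return sum(1 for cur, want in zip(current_healthy, desired_healthy)
               if cur < want)


def alert_would_fire(current_healthy: Sequence[int],
                     desired_healthy: Sequence[int]) -> bool:
    """Model the trailing '> 0': any violating sample in the window counts."""
    return count_violations(current_healthy, desired_healthy) > 0


# Synthetic data (not taken from the bug): one sample per 10 s over 15 min.
steps = WINDOW_SECONDS // STEP_SECONDS
desired = [2] * steps
healthy_all_along = [3] * steps
brief_dip = [3] * steps
brief_dip[40] = 1  # a single sample where the PDB is violated

print(alert_would_fire(healthy_all_along, desired))  # False
print(alert_would_fire(brief_dip, desired))          # True
```

In this model a single violating sample anywhere in the window is enough to satisfy the condition, while a violation that falls entirely between two 10-second samples is not seen at all.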