Description of problem:

alert: KubeCPUQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable{resource="cpu"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted CPU resource requests for Namespaces.

Querying the metric sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="cpu",type="hard"}) returns 'No datapoints found.'

alert: KubeMemoryQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="memory",type="hard"}) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted memory resource requests for Namespaces.

Querying the metric sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="memory",type="hard"}) returns 'No datapoints found.'

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-10-31-133814
A 4.8 customer hit the issue; presumably 4.9 is affected as well.

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Our query expr is wrong.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    requests.nvidia.com/gpu: 4

4.8 alert rule:

alert: KubeCPUQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable{resource="cpu"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted CPU resource requests for Namespaces.
----
alert: KubeMemoryQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="memory",type="hard"}) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted memory resource requests for Namespaces.
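A quick way to reproduce the failure (a minimal check, runnable in the console or against the query API) is to run the selector from the alert expr on its own; on a cluster whose quotas only define requests.*/limits.* keys it matches no series, hence 'No datapoints found':

# Matches nothing when no ResourceQuota uses the bare "cpu" key:
kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="cpu",type="hard"}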
PromQL query results for kube_resourcequota{resource=~".*cpu.*"} (4 results):

kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="hard"} 2
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="used"} 0
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="hard"} 1
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="used"} 0

Only "requests.cpu" and "limits.cpu" appear in the 'resource' label; resource="cpu" matches no series.
PromQL query results for kube_resourcequota{resource=~".*memory.*"} (4 results):

kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.memory",resourcequota="compute-resources",service="kube-state-metrics",type="hard"} 2147483648
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.memory",resourcequota="compute-resources",service="kube-state-metrics",type="used"} 0
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.memory",resourcequota="compute-resources",service="kube-state-metrics",type="hard"} 1073741824
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.memory",resourcequota="compute-resources",service="kube-state-metrics",type="used"}

Only "limits.memory" and "requests.memory" appear in the 'resource' label; resource="memory" matches no series.
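kube-state-metrics exports each key under a quota's spec.hard verbatim as the resource label, so series with resource="cpu" or resource="memory" would only exist if a quota were written with the bare compute keys. A hypothetical quota of that shape (not the one on this cluster) would yield series such as:

# Hypothetical: only produced if spec.hard contained cpu: "1" / memory: 1Gi
kube_resourcequota{resource="cpu", type="hard"}
kube_resourcequota{resource="memory", type="hard"}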
These are the alerts from 4.10.0-0.nightly-2021-10-31-133814 (note: comment 1 is from 4.9, not 4.10).
****************************
- alert: KubeCPUQuotaOvercommit
  annotations:
    description: Cluster has overcommitted CPU resource requests for Namespaces.
    summary: Cluster has overcommitted CPU resource requests.
  expr: |
    sum(kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics", type="hard", resource="cpu"})
      /
    sum(kube_node_status_allocatable{resource="cpu"})
    > 1.5
  for: 5m
  labels:
    severity: warning
- alert: KubeMemoryQuotaOvercommit
  annotations:
    description: Cluster has overcommitted memory resource requests for Namespaces.
    summary: Cluster has overcommitted memory resource requests.
  expr: |
    sum(kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics", type="hard", resource="memory"})
      /
    sum(kube_node_status_allocatable{resource="memory",job="kube-state-metrics"})
    > 1.5
  for: 5m
  labels:
    severity: warning
****************************

The reason you get "No datapoints found" is that there is no kube_resourcequota series with labels {type="hard", resource="cpu"} or {type="hard", resource="memory"}, as can be seen from:
********************************
count(kube_resourcequota) by (namespace, job, type, resource)

{job="kube-state-metrics", namespace="openshift-host-network", resource="count/daemonsets.apps", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/deployments.apps", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.cpu", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.cpu", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.memory", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.memory", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="pods", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/daemonsets.apps", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/deployments.apps", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="pods", type="hard"} 1
********************************
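For contrast, a selector widened to the keys kube-state-metrics actually exports does match series (a sketch built from the count output above; on this cluster it would return the limits.cpu series for openshift-host-network):

count(kube_resourcequota{job="kube-state-metrics", type="hard", resource=~".*cpu.*"}) by (namespace, resource)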
(In reply to Junqi Zhao from comment #3)
> The reason you get "No datapoints found" is that there is no kube_resourcequota series with labels {type="hard", resource="cpu"} or {type="hard", resource="memory"}

That should read:

The reason you get "No datapoints found" is that there is no kube_resourcequota series with labels {type="hard", resource="cpu"} or {type="hard", resource="memory"} on your cluster.
From #c1 and #c2 we can see that, on the environment where we hit this issue, both requests.cpu and requests.memory have data, but our alerts use cpu and memory in the expr and therefore return 'No datapoints found'.
(In reply to hongyan li from comment #9)
> From #c1 and #c2 we can see that, on the environment where we hit this
> issue, both requests.cpu and requests.memory have data, but our alerts use
> cpu and memory in the expr and therefore return 'No datapoints found'.

Yes, indeed.
According to https://kubernetes.io/docs/concepts/policy/resource-quotas/#compute-resource-quota, cpu is the same as requests.cpu, and memory is the same as requests.memory. IMHO, the expressions should be modified to use kube_resourcequota{resource=~"(requests.cpu|cpu)"} and kube_resourcequota{resource=~"(requests.memory|memory)"}. I can raise an upstream PR to fix this.
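Applied to the CPU alert, the proposal would look roughly like this (a sketch of the suggested change, not the merged fix):

sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(requests.cpu|cpu)",type="hard"})
  /
sum(kube_node_status_allocatable{resource="cpu"}) > 1.5

One caveat: because cpu is an alias for requests.cpu, a quota that set both keys would export two series with the same value, and a plain sum would count that request twice.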
@arajkuma Yes, you're right. The query expression needs to be adjusted. Do you want to raise the PR or should I do it?
@fpetkovs https://github.com/openshift/cluster-monitoring-operator/pull/1491 should fix this issue.
Verified in payload 4.10.0-0.nightly-2021-12-01-072705. The alert exprs have been changed as follows and now work well:

alert: KubeCPUQuotaOvercommit
expr: sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(cpu|requests.cpu)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) > 1.5

alert: KubeMemoryQuotaOvercommit
expr: sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(memory|requests.memory)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
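The min without(resource) aggregation is presumably what guards against the double counting noted above: it drops the resource label and keeps the minimum per remaining label set, so a quota exposing both cpu and requests.cpu (which carry the same value, since cpu is an alias) contributes a single sample to the outer sum. A worked sketch on the quota from the description (hard value taken from #c1):

# type="hard" input for the inner selector:
#   kube_resourcequota{resource="requests.cpu", resourcequota="compute-resources", ...} = 1
# After min without(resource), each quota keeps one value, which the outer
# sum then totals across namespaces:
min without(resource) (kube_resourcequota{job="kube-state-metrics",resource=~"(cpu|requests.cpu)",type="hard"})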
Needs more testing.
% oc label ns default openshift.io/cluster-monitoring="true"

% oc project default
Now using project "default" on server "https://api.hongyli-1202.qe.devcluster.openshift.com:6443".

% oc apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
EOF
resourcequota/compute-resources created

% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(memory|requests.memory)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"})'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1638496559.589,"0.011685650023197984"]}]}}

% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(cpu|requests.cpu)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1638496645.636,"0.047619047619047616"]}]}}
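Both queries now return datapoints. The resulting ratios (about 0.0117 for memory and 0.0476 for CPU) are far below the 1.5 threshold, so with the "> 1.5" comparison appended the full alert exprs return an empty result and the alerts correctly stay inactive for this small test quota.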
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056