Description of problem:

alert: KubeCPUQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable{resource="cpu"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted CPU resource requests for Namespaces.

Querying the metric sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="cpu",type="hard"}) returns 'No datapoints found.'

alert: KubeMemoryQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="memory",type="hard"}) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted memory resource requests for Namespaces.

Querying the metric sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="memory",type="hard"}) returns 'No datapoints found.'

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-10-31-133814
A 4.8 customer hit the issue; presumably 4.9 is affected as well.

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Our query expr is wrong.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    requests.nvidia.com/gpu: 4

4.8 alert rule:

alert: KubeCPUQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable{resource="cpu"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted CPU resource requests for Namespaces.
----
alert: KubeMemoryQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="memory",type="hard"}) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted memory resource requests for Namespaces.
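A quick way to reproduce the failure (a minimal check, runnable in the console or against the query API) is to run the selector from the alert expr on its own; on a cluster whose quotas only define requests.*/limits.* keys it matches no series, hence 'No datapoints found':

# Matches nothing when no ResourceQuota uses the bare "cpu" key:
kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="cpu",type="hard"}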
PromQL query results for kube_resourcequota{resource=~".*cpu.*"} (4 results):

kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="hard"} 2
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="used"} 0
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="hard"} 1
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="used"} 0

Only "requests.cpu" and "limits.cpu" appear in the 'resource' label; resource="cpu" matches no series.
PromQL query results for kube_resourcequota{resource=~".*memory.*"} (4 results):

kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.memory",resourcequota="compute-resources",service="kube-state-metrics",type="hard"} 2147483648
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.memory",resourcequota="compute-resources",service="kube-state-metrics",type="used"} 0
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.memory",resourcequota="compute-resources",service="kube-state-metrics",type="hard"} 1073741824
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.memory",resourcequota="compute-resources",service="kube-state-metrics",type="used"}

Only "limits.memory" and "requests.memory" appear in the 'resource' label; resource="memory" matches no series.
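kube-state-metrics exports each key under a quota's spec.hard verbatim as the resource label, so series with resource="cpu" or resource="memory" would only exist if a quota were written with the bare compute keys. A hypothetical quota of that shape (not the one on this cluster) would yield series such as:

# Hypothetical: only produced if spec.hard contained cpu: "1" / memory: 1Gi
kube_resourcequota{resource="cpu", type="hard"}
kube_resourcequota{resource="memory", type="hard"}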
These are the alerts from 4.10.0-0.nightly-2021-10-31-133814 (note: comment 1 is from 4.9, not 4.10).
****************************
- alert: KubeCPUQuotaOvercommit
  annotations:
    description: Cluster has overcommitted CPU resource requests for Namespaces.
    summary: Cluster has overcommitted CPU resource requests.
  expr: |
    sum(kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics", type="hard", resource="cpu"})
      /
    sum(kube_node_status_allocatable{resource="cpu"})
    > 1.5
  for: 5m
  labels:
    severity: warning
- alert: KubeMemoryQuotaOvercommit
  annotations:
    description: Cluster has overcommitted memory resource requests for Namespaces.
    summary: Cluster has overcommitted memory resource requests.
  expr: |
    sum(kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics", type="hard", resource="memory"})
      /
    sum(kube_node_status_allocatable{resource="memory",job="kube-state-metrics"})
    > 1.5
  for: 5m
  labels:
    severity: warning
****************************

The reason you get "No datapoints found" is that there is no kube_resourcequota series with labels {type="hard", resource="cpu"} or {type="hard", resource="memory"}, as can be seen from:
********************************
count(kube_resourcequota) by (namespace, job, type, resource)

{job="kube-state-metrics", namespace="openshift-host-network", resource="count/daemonsets.apps", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/deployments.apps", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.cpu", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.cpu", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.memory", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.memory", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="pods", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/daemonsets.apps", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/deployments.apps", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="pods", type="hard"} 1
********************************
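For contrast, a selector widened to the keys kube-state-metrics actually exports does match series (a sketch built from the count output above; on this cluster it would return the limits.cpu series for openshift-host-network):

count(kube_resourcequota{job="kube-state-metrics", type="hard", resource=~".*cpu.*"}) by (namespace, resource)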
(In reply to Junqi Zhao from comment #3)
> The reason you get "No datapoints found" is that there is no kube_resourcequota series with labels {type="hard", resource="cpu"} or {type="hard", resource="memory"}

That should read:

The reason you get "No datapoints found" is that there is no kube_resourcequota series with labels {type="hard", resource="cpu"} or {type="hard", resource="memory"} on your cluster.
From #c1 and #c2 we can see that, on the environment where we hit this issue, both requests.cpu and requests.memory have data, but our alerts use cpu and memory in the expr and therefore return 'No datapoints found'.
(In reply to hongyan li from comment #9)
> From #c1 and #c2 we can see that, on the environment where we hit this
> issue, both requests.cpu and requests.memory have data, but our alerts use
> cpu and memory in the expr and therefore return 'No datapoints found'.

Yes, indeed.
According to https://kubernetes.io/docs/concepts/policy/resource-quotas/#compute-resource-quota, cpu is the same as requests.cpu, and memory is the same as requests.memory. IMHO, the expressions should be modified to use kube_resourcequota{resource=~"(requests.cpu|cpu)"} and kube_resourcequota{resource=~"(requests.memory|memory)"}. I can raise an upstream PR to fix this.
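Applied to the CPU alert, the proposal would look roughly like this (a sketch of the suggested change, not the merged fix):

sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(requests.cpu|cpu)",type="hard"})
  /
sum(kube_node_status_allocatable{resource="cpu"}) > 1.5

One caveat: because cpu is an alias for requests.cpu, a quota that set both keys would export two series with the same value, and a plain sum would count that request twice.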
@arajkuma Yes, you're right. The query expression needs to be adjusted. Do you want to raise the PR or should I do it?
@fpetkovs https://github.com/openshift/cluster-monitoring-operator/pull/1491 should fix this issue.
Verified in payload 4.10.0-0.nightly-2021-12-01-072705. The alert exprs have been changed as follows and now work well:

alert: KubeCPUQuotaOvercommit
expr: sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(cpu|requests.cpu)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) > 1.5

alert: KubeMemoryQuotaOvercommit
expr: sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(memory|requests.memory)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
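The min without(resource) aggregation is presumably what guards against the double counting noted above: it drops the resource label and keeps the minimum per remaining label set, so a quota exposing both cpu and requests.cpu (which carry the same value, since cpu is an alias) contributes a single sample to the outer sum. A worked sketch on the quota from the description (hard value taken from #c1):

# type="hard" input for the inner selector:
#   kube_resourcequota{resource="requests.cpu", resourcequota="compute-resources", ...} = 1
# After min without(resource), each quota keeps one value, which the outer
# sum then totals across namespaces:
min without(resource) (kube_resourcequota{job="kube-state-metrics",resource=~"(cpu|requests.cpu)",type="hard"})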
Needs more testing.
% oc label ns default openshift.io/cluster-monitoring="true"

% oc project default
Now using project "default" on server "https://api.hongyli-1202.qe.devcluster.openshift.com:6443".

% oc apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
EOF
resourcequota/compute-resources created

% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(memory|requests.memory)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"})'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1638496559.589,"0.011685650023197984"]}]}}

% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(cpu|requests.cpu)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1638496645.636,"0.047619047619047616"]}]}}
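Both queries now return datapoints. The resulting ratios (about 0.0117 for memory and 0.0476 for CPU) are far below the 1.5 threshold, so with the "> 1.5" comparison appended the full alert exprs return an empty result and the alerts correctly stay inactive for this small test quota.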
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056