Bug 2018880

Summary:	Querying the metrics behind alert rules KubeCPUQuotaOvercommit and KubeMemoryQuotaOvercommit returns 'No datapoints found.'
Product:	OpenShift Container Platform	Reporter:	hongyan li <hongyli>
Component:	Monitoring	Assignee:	Simon Pasquier <spasquie>
Status:	CLOSED ERRATA	QA Contact:	hongyan li <hongyli>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.10	CC:	amuller, anpicker, aos-bugs, arajkuma, erooth, jiewu, pgough
Target Milestone:	---	Flags:	hongyli: needinfo-
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-10 16:23:41 UTC	Type:	Bug

Description hongyan li 2021-11-01 06:13:53 UTC

Description of problem:

alert: KubeCPUQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable{resource="cpu"}) > 1.5

for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted CPU resource requests for Namespaces.

Querying the metric sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="cpu",type="hard"}) returns 'No datapoints found.'

alert: KubeMemoryQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="memory",type="hard"}) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted memory resource requests for Namespaces.

Querying the metric sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource="memory",type="hard"}) returns 'No datapoints found.'

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-10-31-133814
A customer on 4.8 has the issue; 4.9 is presumably affected as well.

How reproducible:
always

Additional info:
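The query expression in our alert rules is wrong. A ResourceQuota such as the following exposes requests.cpu and requests.memory, which the expression does not match: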

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    requests.nvidia.com/gpu: 4

4.8 alert rules:
alert: KubeCPUQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable{resource="cpu"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted CPU resource requests for Namespaces.
----
alert: KubeMemoryQuotaOvercommit
expr: sum(kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",resource="memory",type="hard"}) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
for: 5m
labels:
  severity: warning
annotations:
  message: Cluster has overcommitted memory resource requests for Namespaces.

Comment 1 Jie Wu 2021-11-01 07:10:47 UTC

PromQL query:
kube_resourcequota{resource=~".*cpu.*"}

With 4 results:
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="hard"}	2

kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="used"}	0

kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="hard"}	1

kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.cpu",resourcequota="compute-resources",service="kube-state-metrics",type="used"}	0

Only "requests.cpu" and "limits.cpu" appear in the 'resource' label, so the matcher resource="cpu" returns no results.

Comment 2 Jie Wu 2021-11-01 07:18:54 UTC

PromQL query:
kube_resourcequota{resource=~".*memory.*"}

With 4 results:

Element	Value
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.memory",resourcequota="compute-resources",service="kube-state-metrics",type="hard"}	2147483648
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="limits.memory",resourcequota="compute-resources",service="kube-state-metrics",type="used"}	0
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.memory",resourcequota="compute-resources",service="kube-state-metrics",type="hard"}	1073741824
kube_resourcequota{container="kube-rbac-proxy-main",endpoint="https-main",instance="10.129.2.7:8443",job="kube-state-metrics",namespace="quotapj",pod="kube-state-metrics-6d766d775-qtl5d",resource="requests.memory",resourcequota="compute-resources",service="kube-state-metrics",type="used"}

Only "limits.memory" and "requests.memory" appear in the 'resource' label, so the matcher resource="memory" returns no results.
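
The observation in comments 1 and 2 can be reproduced offline. The following is a minimal Python sketch, not Prometheus code. It assumes, as the Prometheus documentation describes, that `=` matchers compare label values exactly and that `=~` matchers are fully anchored regular expressions. The `matches` helper and the `series_resources` list are hypothetical names introduced for illustration; the label values are copied from the query results above.

```python
import re

# 'resource' label values copied from the query results in comments 1 and 2.
series_resources = [
    "limits.cpu", "requests.cpu",
    "limits.memory", "requests.memory",
]

def matches(value, op, pattern):
    """Hypothetical helper mimicking PromQL '=' and '=~' label matchers."""
    if op == "=":
        return value == pattern
    if op == "=~":
        # Prometheus anchors regex matchers at both ends, hence fullmatch.
        return re.fullmatch(pattern, value) is not None
    raise ValueError(f"unsupported operator: {op}")

exact_cpu = [r for r in series_resources if matches(r, "=", "cpu")]
regex_cpu = [r for r in series_resources if matches(r, "=~", ".*cpu.*")]

print(exact_cpu)  # [] : nothing to sum, so the alert expression has no data
print(regex_cpu)  # ['limits.cpu', 'requests.cpu']
```

With an empty selection the numerator of the alert expression has no samples, which is consistent with the 'No datapoints found.' message in the description.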

Comment 3 Junqi Zhao 2021-11-01 07:22:00 UTC

These are the alerts from 4.10.0-0.nightly-2021-10-31-133814. Comment 1 is from 4.9, not 4.10.
****************************
        - alert: KubeCPUQuotaOvercommit
          annotations:
            description: Cluster has overcommitted CPU resource requests for Namespaces.
            summary: Cluster has overcommitted CPU resource requests.
          expr: |
            sum(kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics", type="hard", resource="cpu"})
              /
            sum(kube_node_status_allocatable{resource="cpu"})
              > 1.5
          for: 5m
          labels:
            severity: warning
        - alert: KubeMemoryQuotaOvercommit
          annotations:
            description: Cluster has overcommitted memory resource requests for Namespaces.
            summary: Cluster has overcommitted memory resource requests.
          expr: |
            sum(kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics", type="hard", resource="memory"})
              /
            sum(kube_node_status_allocatable{resource="memory",job="kube-state-metrics"})
              > 1.5
          for: 5m
          labels:
            severity: warning
****************************
The reason you get "No datapoints found" is that there is no kube_resourcequota series with labels {type="hard", resource="cpu"} or {type="hard", resource="memory"}, as the following query shows:
********************************
count(kube_resourcequota) by (namespace, job, type, resource)
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/daemonsets.apps", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/deployments.apps", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.cpu", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.cpu", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.memory", type="hard"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="limits.memory", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="pods", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/daemonsets.apps", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="count/deployments.apps", type="used"} 1
{job="kube-state-metrics", namespace="openshift-host-network", resource="pods", type="hard"} 1
********************************

Comment 4 Junqi Zhao 2021-11-01 07:23:16 UTC

(In reply to Junqi Zhao from comment #3)
> reason why you get "No datapoints found" is there is not kube_resourcequota with labels {type="hard", resource="cpu"} and {type="hard", resource="memory"}


Correction:
The reason you get "No datapoints found" is that your cluster has no kube_resourcequota series with labels {type="hard", resource="cpu"} or {type="hard", resource="memory"}.

Comment 9 hongyan li 2021-11-08 05:46:47 UTC

Comments #c1 and #c2 show the environment where we hit the issue: both requests.cpu and requests.memory have data, but our alerts use cpu and memory in the expr, so the query shows 'No datapoints found.'

Comment 10 Junqi Zhao 2021-11-08 06:35:53 UTC

(In reply to hongyan li from comment #9)
> From the #c1 and #c2, we can know the environment on which we face the
> issue, both requests.cpu and requests.memory have data, but our alert use
> cpu and memory in the expr and show 'No datapoint found'

Yes, indeed.

Comment 11 Arunprasad Rajkumar 2021-11-10 12:22:16 UTC

According to https://kubernetes.io/docs/concepts/policy/resource-quotas/#compute-resource-quota, cpu is the same as requests.cpu and memory is the same as requests.memory.

IMHO, the expression must be modified to kube_resourcequota{resource=~"(requests.cpu|cpu)"} and kube_resourcequota{resource=~"(requests.memory|memory)"}. I can raise an upstream PR to fix this.
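
As a rough check of the proposed matcher, the sketch below applies the pattern from this comment with Python's `re.fullmatch`. This assumes Python's regex semantics agree with Prometheus (RE2, fully anchored) for this simple pattern; the `candidates` list is hypothetical and includes one artificial value to show the effect of the unescaped dot.

```python
import re

# Pattern proposed in this comment; Prometheus anchors it at both ends.
proposed_cpu = re.compile(r"(requests.cpu|cpu)")

# Hypothetical label values; "requestsXcpu" is artificial.
candidates = ["cpu", "requests.cpu", "limits.cpu", "requestsXcpu"]

selected = [c for c in candidates if proposed_cpu.fullmatch(c)]
print(selected)  # ['cpu', 'requests.cpu', 'requestsXcpu']
```

The pattern selects `cpu` and `requests.cpu` and excludes `limits.cpu`. The unescaped `.` also matches any single character, which is unlikely to matter for real quota resource names but could be written as `requests\.cpu` to be strict.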

Comment 12 Filip Petkovski 2021-11-22 05:59:28 UTC

@arajkuma Yes, you're right. The query expression needs to be adjusted. Do you want to raise the PR or should I do it?

Comment 13 Arunprasad Rajkumar 2021-11-30 03:38:35 UTC

@fpetkovs https://github.com/openshift/cluster-monitoring-operator/pull/1491 should fix this issue.

Comment 16 hongyan li 2021-12-01 09:29:35 UTC

Verified in payload 4.10.0-0.nightly-2021-12-01-072705.

The alert expressions have changed as follows and work well:

alert: KubeCPUQuotaOvercommit
expr: sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(cpu|requests.cpu)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) > 1.5

 

alert: KubeMemoryQuotaOvercommit
expr: sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(memory|requests.memory)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
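
The `sum(min without(resource) (...))` shape of the new expressions can be sketched in Python as follows. This is an illustration under assumptions, not the operator's code: the samples are hypothetical, and the stated motivation (avoiding double counting when one quota sets both `cpu` and `requests.cpu`, which comment 11 notes are equivalent) is an inference from the thread rather than something the fix states.

```python
from collections import defaultdict

# Hypothetical samples, already filtered by type="hard" and the resource regex.
samples = [
    ({"namespace": "default", "resourcequota": "compute-resources",
      "resource": "requests.cpu"}, 1.0),
    ({"namespace": "openshift-example", "resourcequota": "both-keys",
      "resource": "cpu"}, 2.0),
    ({"namespace": "openshift-example", "resourcequota": "both-keys",
      "resource": "requests.cpu"}, 2.0),
]

def sum_min_without(samples, dropped_label):
    """Mimic sum(min without(<label>) (...)) over (labels, value) pairs."""
    groups = defaultdict(list)
    for labels, value in samples:
        # Group by every label except the dropped one.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != dropped_label))
        groups[key].append(value)
    return sum(min(values) for values in groups.values())

total = sum_min_without(samples, "resource")
naive = sum(value for _, value in samples)
print(total, naive)  # 3.0 5.0
```

In this sketch the quota that sets both keys contributes once to `total`, whereas a plain `sum` would count it twice.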

Comment 17 hongyan li 2021-12-02 02:07:00 UTC

Needs more testing.

Comment 18 hongyan li 2021-12-03 01:58:36 UTC

% oc label ns default openshift.io/cluster-monitoring="true"
% oc project default
Now using project "default" on server "https://api.hongyli-1202.qe.devcluster.openshift.com:6443".
% oc apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
EOF
resourcequota/compute-resources created

% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(memory|requests.memory)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"})' 
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1638496559.589,"0.011685650023197984"]}]}}
% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default)",resource=~"(cpu|requests.cpu)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1638496645.636,"0.047619047619047616"]}]}}
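
As a sanity check on the two ratios returned above, the sketch below back-computes the allocatable totals they imply. It assumes the ResourceQuota created in this comment is the only one matching the selector, which the thread does not confirm, so the implied totals are estimates only. All variable names are hypothetical.

```python
import math

cpu_ratio = 0.047619047619047616   # from the CPU query response above
mem_ratio = 0.011685650023197984   # from the memory query response above

hard_cpu = 1.0                     # requests.cpu: "1"
hard_mem = 1 * 1024**3             # requests.memory: 1Gi, in bytes

# Assumption: the quota created above is the only one matching the selector.
implied_cpu = hard_cpu / cpu_ratio
implied_mem_gib = hard_mem / mem_ratio / 1024**3

print(round(implied_cpu, 6))       # about 21 allocatable cores
print(round(implied_mem_gib, 2))   # implied allocatable memory in GiB

# Neither ratio exceeds the 1.5 threshold, so neither alert would fire.
print(cpu_ratio > 1.5, mem_ratio > 1.5)  # False False
```

Under that assumption the CPU ratio corresponds to 1 core of quota against roughly 21 allocatable cores, and both ratios are far below the 1.5 threshold used by the alerts.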

Comment 22 errata-xmlrpc 2022-03-10 16:23:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
