Bug 1962261

Summary: Monitoring components requesting more memory than they use
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Filip Petkovski <fpetkovs>
Assignee: Filip Petkovski <fpetkovs>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, aos-bugs, dgrisonn, erooth, kakkoyun, pkrupa
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-07-27 23:09:14 UTC

Comment 2 Junqi Zhao 2021-05-20 07:23:48 UTC
Based on the attached picture and testing in our cluster, resources.requests.memory should be changed to a lower value for thanos-sidecar:
Container Name: thanos-sidecar
resources: map[requests:map[cpu:1m memory:100Mi]]

(max (kube_pod_container_resource_requests{resource="memory", namespace="openshift-monitoring",container="thanos-sidecar"})  - max (container_memory_usage_bytes{namespace="openshift-monitoring",container="thanos-sidecar"})) /1024 /1024 
{}
69.3671875
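
For reference, the same instant query can be re-run from a workstation against the cluster's query API. A minimal sketch, assuming the thanos-querier route exists in openshift-monitoring, the logged-in user is allowed to query that namespace, and oc and curl are available on the client:

TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# Requested minus used memory for thanos-sidecar, in MiB (same expression as above)
QUERY='(max(kube_pod_container_resource_requests{resource="memory",namespace="openshift-monitoring",container="thanos-sidecar"}) - max(container_memory_usage_bytes{namespace="openshift-monitoring",container="thanos-sidecar"})) / 1024 / 1024'
curl -sk -H "Authorization: Bearer ${TOKEN}" --data-urlencode "query=${QUERY}" "https://${HOST}/api/v1/query"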

Comment 3 Filip Petkovski 2021-05-20 08:00:11 UTC
*** Bug 1962305 has been marked as a duplicate of this bug. ***

Comment 4 Filip Petkovski 2021-05-20 08:02:25 UTC
I have made further adjustments in a new PR: https://github.com/openshift/cluster-monitoring-operator/pull/1172

Comment 6 Junqi Zhao 2021-05-24 03:19:06 UTC
Tested with 4.8.0-0.nightly-2021-05-21-233425. resources.requests.memory still needs to be changed to a bigger value for the prometheus-operator container; please change the bug back to ON_QA if this is also fine.
******************************** 
Query:
sort(
  max by (container) (container_memory_usage_bytes{namespace="openshift-monitoring"} or on(container) container_memory_rss{namespace="openshift-monitoring"}) -
  max by (container) (kube_pod_container_resource_requests{resource="memory", namespace="openshift-monitoring"})) / 1024 /1024
Result:
{container="prometheus-operator"}   -27.0703125
{container="telemeter-client"}   -14.890625
{container="alertmanager"}   -13.015625
{container="cluster-monitoring-operator"}   -12.5859375
{container="openshift-state-metrics"}   -12.3671875
{container="grafana"}   -11.5
{container="node-exporter"}   -5.5078125
{container="kube-state-metrics"}   -4.28125
{container="kube-rbac-proxy-self"}   -2.484375
{container="prom-label-proxy"}   -1.14453125
{container="kube-rbac-proxy"}   -0.18359375
{container="kube-rbac-proxy-main"}   1.30078125
{container="thanos-sidecar"}   3.62109375
{container="kube-rbac-proxy-rules"}   3.703125
{container="prometheus-proxy"}   5.26953125
{container="oauth-proxy"}   6.1015625
{container="reload"}   7.16015625
{container="alertmanager-proxy"}   7.78515625
{container="kube-rbac-proxy-thanos"}   8.29296875
{container="prometheus-adapter"}   9.58203125
{container="grafana-proxy"}   9.8125
{container="config-reloader"}   10.67578125
{container="thanos-query"}   60.05859375
{container="prometheus"}   1112.93359375
********************************
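
The same per-container comparison can also be re-run from inside one of the Prometheus pods. A sketch, assuming the pod name prometheus-k8s-0, that promtool is shipped in the prometheus container image, and that Prometheus listens on localhost:9090 inside the pod:

# Per-container memory usage minus requests, in MiB (same expression as the query above)
QUERY='sort(max by (container) (container_memory_usage_bytes{namespace="openshift-monitoring"} or on(container) container_memory_rss{namespace="openshift-monitoring"}) - max by (container) (kube_pod_container_resource_requests{resource="memory", namespace="openshift-monitoring"})) / 1024 / 1024'
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
  promtool query instant http://localhost:9090 "${QUERY}"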

Comment 7 Junqi Zhao 2021-05-24 03:19:35 UTC
# for i in $(kubectl -n openshift-monitoring get pod --no-headers | awk '{print $1}'); do echo $i; kubectl -n openshift-monitoring get pod $i -o go-template='{{range.spec.containers}}{{"Container Name: "}}{{.name}}{{"\r\nresources: "}}{{.resources}}{{"\n"}}{{end}}'; echo -e "\n"; done
alertmanager-main-0
Container Name: alertmanager
resources: map[requests:map[cpu:4m memory:40Mi]]
Container Name: config-reloader
resources: map[requests:map[cpu:1m memory:10Mi]]
Container Name: alertmanager-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: prom-label-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]


alertmanager-main-1
Container Name: alertmanager
resources: map[requests:map[cpu:4m memory:40Mi]]
Container Name: config-reloader
resources: map[requests:map[cpu:1m memory:10Mi]]
Container Name: alertmanager-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: prom-label-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]


alertmanager-main-2
Container Name: alertmanager
resources: map[requests:map[cpu:4m memory:40Mi]]
Container Name: config-reloader
resources: map[requests:map[cpu:1m memory:10Mi]]
Container Name: alertmanager-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: prom-label-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]


cluster-monitoring-operator-fdb9d949c-vkl5q
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: cluster-monitoring-operator
resources: map[requests:map[cpu:10m memory:75Mi]]


grafana-7bb7f88d68-7ks6f
Container Name: grafana
resources: map[requests:map[cpu:4m memory:64Mi]]
Container Name: grafana-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]


kube-state-metrics-69cc98557f-stb24
Container Name: kube-state-metrics
resources: map[requests:map[cpu:2m memory:80Mi]]
Container Name: kube-rbac-proxy-main
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: kube-rbac-proxy-self
resources: map[requests:map[cpu:1m memory:15Mi]]


node-exporter-2v86g
Container Name: node-exporter
resources: map[requests:map[cpu:8m memory:32Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]


node-exporter-427b8
Container Name: node-exporter
resources: map[requests:map[cpu:8m memory:32Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]


node-exporter-5whz5
Container Name: node-exporter
resources: map[requests:map[cpu:8m memory:32Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]


node-exporter-9r2bz
Container Name: node-exporter
resources: map[requests:map[cpu:8m memory:32Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]


node-exporter-khtd6
Container Name: node-exporter
resources: map[requests:map[cpu:8m memory:32Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]


node-exporter-psxqr
Container Name: node-exporter
resources: map[requests:map[cpu:8m memory:32Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]


openshift-state-metrics-5f54b4ff58-w674d
Container Name: kube-rbac-proxy-main
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: kube-rbac-proxy-self
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: openshift-state-metrics
resources: map[requests:map[cpu:1m memory:32Mi]]


prometheus-adapter-6cb7687895-9bfn8
Container Name: prometheus-adapter
resources: map[requests:map[cpu:1m memory:40Mi]]


prometheus-adapter-6cb7687895-sppwt
Container Name: prometheus-adapter
resources: map[requests:map[cpu:1m memory:40Mi]]


prometheus-k8s-0
Container Name: prometheus
resources: map[requests:map[cpu:70m memory:1Gi]]
Container Name: config-reloader
resources: map[requests:map[cpu:1m memory:10Mi]]
Container Name: thanos-sidecar
resources: map[requests:map[cpu:1m memory:25Mi]]
Container Name: prometheus-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: prom-label-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: kube-rbac-proxy-thanos
resources: map[requests:map[cpu:1m memory:10Mi]]


prometheus-k8s-1
Container Name: prometheus
resources: map[requests:map[cpu:70m memory:1Gi]]
Container Name: config-reloader
resources: map[requests:map[cpu:1m memory:10Mi]]
Container Name: thanos-sidecar
resources: map[requests:map[cpu:1m memory:25Mi]]
Container Name: prometheus-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: prom-label-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: kube-rbac-proxy-thanos
resources: map[requests:map[cpu:1m memory:10Mi]]


prometheus-operator-fd77ffdd8-6brvp
Container Name: prometheus-operator
resources: map[requests:map[cpu:5m memory:150Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]


telemeter-client-5657ccddfb-74fhr
Container Name: telemeter-client
resources: map[requests:map[cpu:1m memory:40Mi]]
Container Name: reload
resources: map[requests:map[cpu:1m memory:10Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]


thanos-querier-db74d4959-74v2r
Container Name: thanos-query
resources: map[requests:map[cpu:10m memory:12Mi]]
Container Name: oauth-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: prom-label-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: kube-rbac-proxy-rules
resources: map[requests:map[cpu:1m memory:15Mi]]


thanos-querier-db74d4959-lbqbv
Container Name: thanos-query
resources: map[requests:map[cpu:10m memory:12Mi]]
Container Name: oauth-proxy
resources: map[requests:map[cpu:1m memory:20Mi]]
Container Name: kube-rbac-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: prom-label-proxy
resources: map[requests:map[cpu:1m memory:15Mi]]
Container Name: kube-rbac-proxy-rules
resources: map[requests:map[cpu:1m memory:15Mi]]

Comment 8 Filip Petkovski 2021-05-25 06:29:09 UTC
Hi Junqi, 

The way to calculate the correct memory request is actually documented here: https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#resources-and-limits.

The guidelines say that the requested memory should be 10% higher than the 90th percentile of memory usage during a CI run.
I went ahead and made a PR with the actual query which would calculate the discrepancy: https://github.com/openshift/enhancements/pull/788/files

When I ran the query after the adjustments, the difference between requested and used memory was within a 20Mi range.
Do you think you can do your verification with this query as well?
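
For completeness, a rough sketch of that guideline as a single instant query (my own approximation, not necessarily the exact query from the enhancements PR; assumes the thanos-querier route, jq on the client, and container_memory_working_set_bytes as the usage metric):

TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# Suggested request per container, in MiB: 10% above the 90th percentile of
# working-set memory over the last hour
QUERY='1.1 * max by (container) (quantile_over_time(0.9, container_memory_working_set_bytes{namespace="openshift-monitoring",container!="",container!="POD"}[1h])) / 1024 / 1024'
curl -sk -H "Authorization: Bearer ${TOKEN}" --data-urlencode "query=${QUERY}" "https://${HOST}/api/v1/query" \
  | jq -r '.data.result[] | "\(.metric.container)\t\(.value[1])"'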

Comment 12 errata-xmlrpc 2021-07-27 23:09:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438