Description of problem:

- OpenShift 3.11 metric nodes lock up and become unresponsive, requiring a reboot.
- The reason containerd appears to be unresponsive is that the node has run out of all usable memory, which has been consumed by Prometheus.
- In vmcores captured from two separate nodes with two separate occurrences, all available memory was used by Prometheus:

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  10254501      39.1 GB         ----
         FREE     58857     229.9 MB    0% of TOTAL MEM
         USED  10195644      38.9 GB   99% of TOTAL MEM
       SHARED      9881      38.6 MB    0% of TOTAL MEM
      BUFFERS         0            0    0% of TOTAL MEM
       CACHED     35778     139.8 MB    0% of TOTAL MEM
         SLAB     86828     339.2 MB    0% of TOTAL MEM

   TOTAL HUGE         0            0         ----
    HUGE FREE         0            0    0% of TOTAL HUGE

   TOTAL SWAP         0            0         ----
    SWAP USED         0            0    0% of TOTAL SWAP
    SWAP FREE         0            0    0% of TOTAL SWAP

 COMMIT LIMIT   5127250      19.6 GB         ----
    COMMITTED  11012811        42 GB  214% of TOTAL LIMIT

- The following shows 38.1 GB of the 42 GB of committed memory belongs to prometheus, leaving none for basic operations.

crash> ps | head -n1
   PID    PPID  CPU       TASK        ST  %MEM     VSZ       RSS     COMM
crash> ps | sort -nrk 8 | head -n50
>  54299  38195   2  ffff9881a87f2100  RU  88.6  70417064  38109120  prometheus
>  54298  38195   5  ffff9881a87f0000  RU  88.6  70417064  38109120  prometheus
>  38307  38195   3  ffff9881b1ba9080  RU  88.6  70417064  38109120  prometheus
   54746  38195   3  ffff9879037c6300  UN  88.6  70417064  38109120  prometheus
   54301  38195   0  ffff98820752e300  UN  88.6  70417064  38109120  prometheus
   54300  38195   1  ffff9881a87f6300  IN  88.6  70417064  38109120  prometheus
   54297  38195   4  ffff98819c3a2100  IN  88.6  70417064  38109120  prometheus
   38811  38195   0  ffff987a239aa100  IN  88.6  70417064  38109120  prometheus
   38467  38195   4  ffff98820708a100  UN  88.6  70417064  38109120  prometheus
   38466  38195   0  ffff987ce2ee1080  IN  88.6  70417064  38109120  prometheus
   38311  38195   7  ffff98820aa28000  IN  88.6  70417064  38109120  prometheus
   38310  38195   3  ffff98820aa2b180  IN  88.6  70417064  38109120  prometheus
   38308  38195   1  ffff9881b8d31080  UN  88.6  70417064  38109120  prometheus
   38305  38195   1  ffff9881cafc2100  UN  88.6  70417064  38109120  prometheus
   38304  38195   2  ffff9881b1bac200  IN  88.6  70417064  38109120  prometheus
   38303  38195   3  ffff98820a8dd280  IN  88.6  70417064  38109120  prometheus
   38302  38195   0  ffff9881b5944200  IN  88.6  70417064  38109120  prometheus
   38301  38195   1  ffff98819ba99080  IN  88.6  70417064  38109120  prometheus
   38300  38195   6  ffff98819ba98000  IN  88.6  70417064  38109120  prometheus
   38299  38195   7  ffff987dca870000  IN  88.6  70417064  38109120  prometheus
   38298  38195   3  ffff987dca872100  IN  88.6  70417064  38109120  prometheus
   38228  38195   1  ffff9879037c1080  IN  88.6  70417064  38109120  prometheus

We applied a memory limit on Prometheus as per https://access.redhat.com/solutions/3867881 to prevent it from crashing the nodes. The pods are now OOMKilled instead:

prometheus-k8s-1   3/4   OOMKilled   5   6m

Version-Release number of selected component (if applicable):
v3.11.286

How reproducible:
Unconfirmed

Additional info:
A previous bug on Prometheus memory usage (https://bugzilla.redhat.com/show_bug.cgi?id=1790265) suggests checking against the known capacity requirements:
https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html#cluster-monitoring-capacity-planning

However, this is not a very large cluster:

# oc get nodes | wc -l
20
# oc get pods --all-namespaces | wc -l
532

We have currently set a 3Gi memory limit so that Prometheus is OOMKilled instead of crashing the node. I will upload config details.
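For illustration, a memory limit of this kind is normally expressed through the monitoring stack's configuration. The following is only a sketch assuming the cluster-monitoring-config ConfigMap mechanism with illustrative request/limit values; the file name is hypothetical and the authoritative steps are in the KB article linked above:

# cluster-monitoring-config.yaml (hypothetical file name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      resources:
        requests:
          memory: 1Gi   # illustrative request value
        limits:
          memory: 3Gi   # the 3Gi limit mentioned above; later raised to 8G in this bug

$ oc apply -f cluster-monitoring-config.yaml

After the ConfigMap is applied, the cluster-monitoring-operator should roll out new prometheus-k8s pods with the limit in place.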
Hi, after setting the limit to 8G, the issue seems to be resolved.

I wonder if we should consider updating the scaling / performance doc:
https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html#cluster-monitoring-capacity-planning

As noted in the initial bug description, the customer's cluster is actually smaller than the smallest configuration in the capacity-planning table, yet it needed more memory. Are there certain situations where more memory is required? I'm not sure if this is a priority since it's 3.11, but if this is a common issue, I wonder if we should tweak the document. Otherwise I think we can close the bz.
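For completeness, a quick sanity check after raising the limit is to confirm the prometheus-k8s pods stay Running with no further OOMKilled restarts, and to compare actual usage against the new limit. These commands are illustrative rather than taken from the case notes:

$ oc -n openshift-monitoring get pods | grep prometheus-k8s
$ oc -n openshift-monitoring describe pod prometheus-k8s-1 | grep -A2 'Last State'
$ oc adm top pod -n openshift-monitoring    # requires cluster metrics to be available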
3.11 PR - https://github.com/openshift/openshift-docs/pull/29502
4.5-4.7 PR - https://github.com/openshift/openshift-docs/pull/29503

Simon, PTAL at the above PRs and let me know if more details are needed in the new note. Thank you.
LGTM
Verified fix is published and live on docs.openshift.com [3.11]:
https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html
Verified fix is published and live on docs.openshift.com [4.5/4.6]:
https://docs.openshift.com/container-platform/4.5/scalability_and_performance/scaling-cluster-monitoring-operator.html
https://docs.openshift.com/container-platform/4.6/scalability_and_performance/scaling-cluster-monitoring-operator.html

Verified fix will be available upon release of 4.7:
https://docs.openshift.com/container-platform/4.7/scalability_and_performance/scaling-cluster-monitoring-operator.html