Bug 1559450 - Hawkular Metrics crashes with OutOfMemoryError under moderate load
Summary: Hawkular Metrics crashes with OutOfMemoryError under moderate load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: John Sanda
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1559440 1559443 1559448
Blocks: 1558677
 
Reported: 2018-03-22 15:12 UTC by John Sanda
Modified: 2018-06-07 08:40 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1559448
Environment:
Last Closed: 2018-06-07 08:40:35 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker HWKMETRICS-768 0 Major Closed DeleteExpiredMetrics job causes OutOfMemoryErrors 2019-09-27 02:27:44 UTC
Red Hat Product Errata RHBA-2018:1801 0 None None None 2018-06-07 08:40:49 UTC

Description John Sanda 2018-03-22 15:12:25 UTC
+++ This bug was initially created as a clone of Bug #1559448 +++

+++ This bug was initially created as a clone of Bug #1559443 +++

+++ This bug was initially created as a clone of Bug #1559440 +++

Description of problem:
DeleteExpiredMetrics is a background job that runs in the hawkular-metrics server. It was introduced in OCP 3.6 to clean up the index tables, removing rows for metrics/pods that no longer exist. This was needed because the indexes grow without bound, which led to other problems. The queries the job performs can pull back a large amount of data, which makes hawkular-metrics very susceptible to OOMEs under a relatively modest load. I first observed the problem in a cluster of about 9k pods, but I have also observed it in a cluster with around 2k pods. Increasing the heap size of hawkular-metrics can alleviate the issue; however, there is no need for the job to query as aggressively as it does, since it is not latency sensitive.
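To make the failure mode concrete, here is a minimal sketch (not Hawkular code; the class, method names, and row counts are all hypothetical) contrasting the risky pattern of materializing an entire index result set on the heap with a paged cleanup whose peak memory is bounded by the page size rather than the index size:

```java
import java.util.ArrayList;
import java.util.List;

public class PagedCleanupSketch {

    // Stand-in for the metrics index rows that the cleanup job scans.
    static List<String> loadAllRows(int total) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            rows.add("metric-" + i);
        }
        return rows;
    }

    // Risky pattern: pull the whole index into memory in one query.
    // Peak heap usage grows with the number of pods/metrics.
    static int deleteAllAtOnce(List<String> index) {
        List<String> expired = new ArrayList<>(index); // entire index on the heap
        return expired.size();
    }

    // Safer pattern for a job that is not latency sensitive: walk the
    // index in small pages, so peak heap usage is bounded by pageSize.
    static int deleteInPages(List<String> index, int pageSize) {
        int deleted = 0;
        for (int from = 0; from < index.size(); from += pageSize) {
            int to = Math.min(from + pageSize, index.size());
            List<String> page = index.subList(from, to);
            deleted += page.size(); // real code would issue DELETEs here
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> index = loadAllRows(10_000);
        System.out.println("all-at-once: " + deleteAllAtOnce(index));
        System.out.println("paged (500): " + deleteInPages(index, 500));
    }
}
```

Both paths delete the same rows; the difference is only in how much of the result set is resident at once, which is the property that matters for avoiding OOMEs here.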

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Configure hawkular-metrics with 3 GB of memory. This should result in a JVM heap size of around 1300 MB.
2. Create about 2k pods
3. Let the DeleteExpiredMetrics job run
4. Note that the job is only scheduled to run every 7 days; I can provide additional details on making it run more frequently to assist with testing.
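For step 1, the heap the container actually gets can be confirmed from inside the pod by asking the JVM itself; this minimal check uses only the standard `Runtime` API and makes no assumptions about the image's startup scripts:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use, as configured by -Xmx
        // or derived from the container memory limit.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap MB: " + maxBytes / (1024 * 1024));
    }
}
```

With a 3 GB container limit, the reported value should be in the neighborhood of the ~1300 MB figure mentioned in step 1.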

Actual results:


Expected results:


Additional info:

Comment 4 Junqi Zhao 2018-05-31 01:33:48 UTC
Verification steps followed https://bugzilla.redhat.com/show_bug.cgi?id=1559440#c3; the DeleteExpiredMetrics job has already been dropped from the code.

# openshift version
openshift v3.6.173.0.122
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

Images:
metrics-cassandra-v3.6.173.0.122-2
metrics-hawkular-metrics-v3.6.173.0.122-2
metrics-heapster-v3.6.173.0.122-2

Comment 6 errata-xmlrpc 2018-06-07 08:40:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1801

