Description of problem:
Kibana-proxy cannot handle a sustained stream of requests; its memory grows until the container is killed by the cgroup OOM killer.

Version-Release number of selected component (if applicable):
Openshift auth-proxy 3.4.1-21

How reproducible:
100%

Steps to Reproduce:
1. Get the token: `oc whoami -t`
2. Run the following script twice in the background:

kibana_stress.sh:
for i in {1..300}
do
  curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://kibana.apps.rromeromlogging34.quicklab.pnq2.cee.redhat.com/elasticsearch/ -sk > /dev/null
done

Actual results:
logging-kibana-1-jo9n6    1/2    OOMKilled    0    4m

Expected results:
Memory usage should remain stable.

Additional info:
After approximately 300 HTTP requests Node.js exceeds the default memory limit of 96 MB and is killed by the cgroup OOM killer. A huge IOP spike occurs at the same time (reads only; see "iotop -m 1"). I'm not sure why that happens, but it seems to originate in Node.js too.
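One way to watch the proxy's memory while the two loops run is to sample the container's cgroup counter; the pod and container names below are examples from this environment, and the cgroup v1 path inside the image is an assumption:

  # Sample the kibana-proxy container's memory usage once a second (pod name is illustrative).
  while true; do
    oc exec logging-kibana-1-jo9n6 -c kibana-proxy -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
    sleep 1
  done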
*** Bug 1465464 has been marked as a duplicate of this bug. ***
We need to find out what the memory usage would be if we ran this stress test twice with a much higher memory limit. If it levels off as expected, we just need to bump the default limit. If it does not level off, but shows signs of continuous, unbounded growth, then this is likely a bug in the kibana-proxy code itself.
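One quick way to run that experiment is to bump the limit on the kibana deployment config and repeat the two loops; the dc and container names below assume the default logging stack, and the value is only an example:

  # Temporarily raise the kibana-proxy memory limit for the stress test (512Mi is illustrative).
  oc set resources dc/logging-kibana -c kibana-proxy --limits=memory=512Mi
  # Re-run both curl loops and check whether usage plateaus well below the new limit.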
I believe this has been fixed upstream and needs to be backported. Looking at the 3.4 release, I don't see where the proxy propagates the memory request to the Node.js runtime. Marking as upcoming release to remove it from the 3.6 blocker list.
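For context, propagating the limit usually means converting the container's cgroup memory limit into V8's --max-old-space-size flag when the proxy starts; the snippet below is only a sketch of that pattern, not the actual startup script shipped in the image:

  # Sketch: derive V8's old-space cap from the cgroup memory limit (paths, headroom, and entrypoint are assumptions).
  limit_bytes=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
  limit_mb=$((limit_bytes / 1024 / 1024))
  # Leave headroom for V8's other spaces (code, new, large-object) and native memory.
  exec node --max-old-space-size=$((limit_mb * 3 / 4)) server.js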
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/e15bbb65fedf640e452c6ba48c633c0d5bca2dcc
bug 1464020. bump kibana-proxy to fix memory switches
It's fixed. Verified with this command:

$ for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://{kibana-route}/elasticsearch/ -sk > /dev/null; done

Ran it twice from the oc client side, and the kibana containers keep running fine:

# oc get po
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-c45kr                       1/1       Running   0          23h
logging-curator-ops-1-4c2db                   1/1       Running   0          23h
logging-es-data-master-upd7q5u4-1-03r8m       1/1       Running   0          23h
logging-es-ops-data-master-m0rsr8oo-1-zll8h   1/1       Running   0          23h
logging-fluentd-hr9qd                         1/1       Running   0          23h
logging-fluentd-mxwjw                         1/1       Running   0          23h
logging-kibana-1-x849x                        2/2       Running   8          23h
logging-kibana-ops-1-n9b5g                    2/2       Running   6          23h

Test env:
# openshift version
openshift v3.6.126.14
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

Images tested with:
openshift3/logging-auth-proxy    4cf6b1d60d2b
openshift3/logging-kibana        4563b27eac07
openshift3/logging-elasticsearch 8809f390a819
openshift3/logging-fluentd       a2ea005ef4f6
openshift3/logging-curator       ea1887b8e441
I have a sneaking suspicion that it has not been fixed yet. Looking at the above comment, kibana has been restarted a few times during the test (possibly OOM, but hard to prove without logs).

We currently set the memory allocated to the kibana-proxy container to be the same as `max_old_space_size` for nodejs. But in V8 the total heap consists of multiple spaces; the old space holds only the memory that is ready for GC, and measuring the heap from the kibana-proxy code shows at least an additional 32 MB sitting in the code space.
https://nodejs.org/api/v8.html#v8_v8_getheapspacestatistics
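For anyone who wants to see the per-space breakdown themselves, Node's v8 module exposes it directly; a quick check from inside the container could look like the following, assuming the example pod name and that the image ships a reasonably recent node (v6+) on the PATH:

  # Print V8 heap usage per space from inside the kibana-proxy container (pod name is illustrative).
  oc exec logging-kibana-1-x849x -c kibana-proxy -- node -e 'require("v8").getHeapSpaceStatistics().forEach(function(s){ console.log(s.space_name, s.space_used_size); });'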
(In reply to Jan Wozniak from comment #7)
> I have a sneaking suspicion that it has not been fixed yet. Looking at the
> above comment, kibana has been restarted a few times during the test
> (possibly OOM, but hard to prove without logs)

Hmm, I often see kibana restart 6 or 8 times before it runs stably at the very beginning of a logging deployment, especially when the ops cluster is enabled. I assumed it was caused by es/es-ops not being ready for connections, or by network latency in the IaaS at that moment. I'll provide the kibana log the next time this situation is encountered; thank you for observing and pointing this out. Let me know if you have any further thoughts on the restart count.

To clarify: the restart counts above (6 and 8) were already present in the logging stack before I ran the curl loops to verify the fix in this scenario. Please feel free to advise if the tests should be performed differently. Thanks in advance!
Can we keep this bug about the kibana-auth-proxy container, and open a new BZ about the kibana container being restarted?
Strike that, we can track Kibana container restarts via BZ https://bugzilla.redhat.com/show_bug.cgi?id=1465464.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188