Bug 1464020 - Kibana-proxy gets OOMKilled
Status: VERIFIED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Jeff Cantrill
QA Contact: Xia Zhao
Depends On:
Blocks: 1468734 1468987

Reported: 2017-06-22 05:54 EDT by Ruben Romero Montes
Modified: 2017-08-14 14:06 EDT
CC: 5 users

Doc Type: Bug Fix
Doc Text:
Cause:
Consequence:
Fix: Use underscores instead of dashes when providing memory switches to the Node.js runtime.
Result: The Node.js interpreter understands the request.
Clone Of:
Clones: 1468734 1468987
Type: Bug




External Trackers:
Github fabric8io/openshift-auth-proxy/pull/20 (last updated 2017-06-28 17:58 EDT)
Github openshift/origin-aggregated-logging/pull/521 (last updated 2017-07-06 15:03 EDT)

Description Ruben Romero Montes 2017-06-22 05:54:02 EDT
Description of problem:
Kibana-proxy cannot handle continuous requests; its memory usage keeps growing until the container is killed by the cgroup OOM killer.

Version-Release number of selected component (if applicable):
Openshift auth-proxy 3.4.1-21

How reproducible:
100%

Steps to Reproduce:
1. Get a token: `oc whoami -t`
2. Run the following script twice in the background:

kibana_stress.sh:
 #!/bin/bash
 # Issue 300 authenticated requests against the Kibana route via kibana-proxy.
 for i in {1..300}
 do
   curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://kibana.apps.rromeromlogging34.quicklab.pnq2.cee.redhat.com/elasticsearch/ -sk > /dev/null
 done

Actual results:
logging-kibana-1-jo9n6        1/2       OOMKilled   0          4m

Expected results:
expect the memory to be stable

Additional info:
After approximately 300 HTTP requests Node.js exceeds the default memory limit of 96 MB and is killed by the cgroup OOM killer. A huge IOP spike occurs at the same time (reads only; see "iotop -m 1"). I'm not sure why that happens, but it seems to originate in Node.js too.
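
One way to confirm that it is the proxy container hitting its cgroup limit (a hedged suggestion, not part of the original report; the label selector is an assumption based on the standard logging labels):

 # An OOM kill shows up as Reason: OOMKilled in the container's last state.
 oc describe pod -l component=kibana | grep -B2 -A4 "Last State"
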
Comment 1 Jeff Cantrill 2017-06-27 09:47:44 EDT
*** Bug 1465464 has been marked as a duplicate of this bug. ***
Comment 2 Peter Portante 2017-06-28 11:44:13 EDT
We need to find out what the memory usage would be if we ran this stress test twice with a much higher memory limit.  If it levels off as expected, we just need to bump the default limit.  If it does not level off, but shows signs of continuous, unbounded growth, then this is likely a bug in the kibana-proxy code itself.
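
A minimal sketch of that experiment (the resource, container, and label names are assumptions based on the pod names shown later in this bug, not something given in the comment):

 # Raise the kibana-proxy container's memory limit well above the 96Mi default
 # (the config change rolls out a new pod), rerun kibana_stress.sh twice, and
 # check whether the proxy's memory usage levels off or keeps growing.
 oc set resources dc/logging-kibana -c kibana-proxy --limits=memory=512Mi
 KIBANA_POD=$(oc get pod -l component=kibana -o jsonpath='{.items[0].metadata.name}')
 oc exec ${KIBANA_POD} -c kibana-proxy -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
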
Comment 3 Jeff Cantrill 2017-06-28 11:51:14 EDT
I believe this has been fixed upstream and needs to be backported. Looking at the 3.4 release, I don't see where the proxy is propagating the memory request to the Node.js runtime. Marking as upcoming release to remove it from the 3.6 blocker list.
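
For illustration, propagating that request could look like the start wrapper below (a sketch with assumed file and variable names, not the actual openshift-auth-proxy entrypoint); note the underscore spelling of the V8 switch, which is what the eventual fix settles on:

 #!/bin/bash
 # Derive a heap cap (in MB) from the container's cgroup memory limit and
 # hand it to the Node.js runtime as a V8 old-space switch.
 LIMIT_BYTES=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
 HEAP_MB=$(( LIMIT_BYTES / 1024 / 1024 ))
 exec node --max_old_space_size=${HEAP_MB} openshift-auth-proxy.js
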
Comment 4 openshift-github-bot 2017-07-07 09:54:20 EDT
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/e15bbb65fedf640e452c6ba48c633c0d5bca2dcc
bug 1464020. bump kibana-proxy to fix memory switches
Comment 6 Xia Zhao 2017-07-11 04:08:12 EDT
It's fixed. Verified with this command:

$  for i in {1..300};  do    curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://{kibana-route}/elasticsearch/ -sk > /dev/null;  done

Ran it twice from the oc client side, and the kibana containers keep running fine:

# oc get po
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-c45kr                       1/1       Running   0          23h
logging-curator-ops-1-4c2db                   1/1       Running   0          23h
logging-es-data-master-upd7q5u4-1-03r8m       1/1       Running   0          23h
logging-es-ops-data-master-m0rsr8oo-1-zll8h   1/1       Running   0          23h
logging-fluentd-hr9qd                         1/1       Running   0          23h
logging-fluentd-mxwjw                         1/1       Running   0          23h
logging-kibana-1-x849x                        2/2       Running   8          23h
logging-kibana-ops-1-n9b5g                    2/2       Running   6          23h

Test env:
# openshift version
openshift v3.6.126.14
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

Images tested with:
openshift3/logging-auth-proxy    4cf6b1d60d2b
openshift3/logging-kibana    4563b27eac07
openshift3/logging-elasticsearch    8809f390a819
openshift3/logging-fluentd    a2ea005ef4f6
openshift3/logging-curator    ea1887b8e441
Comment 7 Jan Wozniak 2017-07-13 12:35:21 EDT
I have a sneaking suspicion that it has not been fixed yet. Looking at the above comment, kibana has been restarted a few times during the test (possibly OOM, but hard to prove without logs)

We currently set the memory allocated to the kibana-proxy container to the same value as `max_old_space_size` for Node.js. But in V8 the total heap consists of multiple spaces, and the old space contains only memory that is ready for GC; measuring the heap from the kibana-proxy code shows at least an additional 32 MB sitting in the code space.

https://nodejs.org/api/v8.html#v8_v8_getheapspacestatistics
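
For example, the individual spaces can be listed with the documented v8.getHeapSpaceStatistics() API (a quick illustration, not code taken from the proxy); it makes clear that old_space is only one contributor to the footprint the container limit has to cover:

 node -e 'const v8 = require("v8");
 for (const s of v8.getHeapSpaceStatistics()) {
   console.log(s.space_name, (s.space_used_size / 1048576).toFixed(1) + " MiB used");
 }'
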
Comment 8 Xia Zhao 2017-07-25 01:43:11 EDT
(In reply to Jan Wozniak from comment #7)
> I have a sneaking suspicion that it has not been fixed yet. Looking at the
> above comment, kibana has been restarted a few times during the test
> (possibly OOM, but hard to prove without logs)

Hmm, I often see kibana restart 6 or 8 times before it runs stably at the very beginning of a logging deployment, especially when the ops cluster is enabled. I thought it was caused by es/es-ops not being ready for connections, or by network latency in the IaaS at that moment. I'll provide the kibana log the next time this situation is encountered; thank you for observing and pointing this out here. Let me know if you have any further thoughts on the restart count.

To clarify, the restart counts above (6 and 8) already existed in the logging stacks before I ran the curl commands in a loop to test the bug fix in this scenario. Please feel free to advise if the tests should be performed in a better way. Thanks in advance!
Comment 9 Peter Portante 2017-07-25 09:38:16 EDT
Can we keep this bug about the kibana-auth-proxy container, and open a new BZ about the kibana container being restarted?
Comment 10 Peter Portante 2017-07-25 09:43:42 EDT
Strike that, we can track Kibana container restarts via BZ https://bugzilla.redhat.com/show_bug.cgi?id=1465464.
