Description of problem:
Kibana-proxy cannot handle a sustained stream of requests; its memory grows until the container is killed by the cgroup OOM killer.

Version-Release number of selected component (if applicable):
Openshift auth-proxy 3.4.1-21

How reproducible:
100%

Steps to Reproduce:
1. Get the token: `oc whoami -t`
2. Run the following script twice in the background:

kibana_stress.sh:
for i in {1..300}
do
  curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://kibana.apps.rromeromlogging34.quicklab.pnq2.cee.redhat.com/elasticsearch/ -sk > /dev/null
done

Actual results:
logging-kibana-1-jo9n6    1/2    OOMKilled    0    4m

Expected results:
Memory usage should remain stable.

Additional info:
After approximately 300 HTTP requests Node.js exceeds the default memory limit of 96 MB and is killed by the cgroup OOM killer. A huge IOP spike occurs at the same time (reads only; see "iotop -m 1"). I'm not sure why that happens, but it seems to originate in Node.js too.
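One way to watch the proxy's memory while the two loops run is to sample the container's cgroup counter; the pod and container names below are examples from this environment, and the cgroup v1 path inside the image is an assumption:

  # Sample the kibana-proxy container's memory usage once a second (pod name is illustrative).
  while true; do
    oc exec logging-kibana-1-jo9n6 -c kibana-proxy -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
    sleep 1
  done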
*** Bug 1465464 has been marked as a duplicate of this bug. ***
We need to find out what the memory usage would be if we ran this stress test twice with a much higher memory limit. If it levels off as expected, we just need to bump the default limit. If it does not level off, but shows signs of continuous, unbounded growth, then this is likely a bug in the kibana-proxy code itself.
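One quick way to run that experiment is to bump the limit on the kibana deployment config and repeat the two loops; the dc and container names below assume the default logging stack, and the value is only an example:

  # Temporarily raise the kibana-proxy memory limit for the stress test (512Mi is illustrative).
  oc set resources dc/logging-kibana -c kibana-proxy --limits=memory=512Mi
  # Re-run both curl loops and check whether usage plateaus well below the new limit.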
I believe this has been fixed upstream and needs to be backported. Looking at the 3.4 release, I don't see where the proxy propagates the memory request to the Node.js runtime. Marking as upcoming release to remove it from the 3.6 blocker list.
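For context, propagating the limit usually means converting the container's cgroup memory limit into V8's --max-old-space-size flag when the proxy starts; the snippet below is only a sketch of that pattern, not the actual startup script shipped in the image:

  # Sketch: derive V8's old-space cap from the cgroup memory limit (paths, headroom, and entrypoint are assumptions).
  limit_bytes=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
  limit_mb=$((limit_bytes / 1024 / 1024))
  # Leave headroom for V8's other spaces (code, new, large-object) and native memory.
  exec node --max-old-space-size=$((limit_mb * 3 / 4)) server.js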
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/e15bbb65fedf640e452c6ba48c633c0d5bca2dcc
bug 1464020. bump kibana-proxy to fix memory switches
It's fixed. Verified with this command:

$ for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://{kibana-route}/elasticsearch/ -sk > /dev/null; done

Ran it twice from the oc client side, and the kibana containers keep running fine:

# oc get po
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-c45kr                       1/1       Running   0          23h
logging-curator-ops-1-4c2db                   1/1       Running   0          23h
logging-es-data-master-upd7q5u4-1-03r8m       1/1       Running   0          23h
logging-es-ops-data-master-m0rsr8oo-1-zll8h   1/1       Running   0          23h
logging-fluentd-hr9qd                         1/1       Running   0          23h
logging-fluentd-mxwjw                         1/1       Running   0          23h
logging-kibana-1-x849x                        2/2       Running   8          23h
logging-kibana-ops-1-n9b5g                    2/2       Running   6          23h

Test env:
# openshift version
openshift v3.6.126.14
kubernetes v1.6.1+5115d708d7
etcd 3.2.0

Images tested with:
openshift3/logging-auth-proxy    4cf6b1d60d2b
openshift3/logging-kibana        4563b27eac07
openshift3/logging-elasticsearch 8809f390a819
openshift3/logging-fluentd       a2ea005ef4f6
openshift3/logging-curator       ea1887b8e441
I have a sneaking suspicion that it has not been fixed yet. Looking at the above comment, kibana has been restarted a few times during the test (possibly OOM, but hard to prove without logs).

We currently set the memory allocated to the kibana-proxy container to be the same as `max_old_space_size` for nodejs. But in V8 the total heap consists of multiple spaces; the old space holds only the memory that is ready for GC, and measuring the heap from the kibana-proxy code shows at least an additional 32 MB sitting in the code space.
https://nodejs.org/api/v8.html#v8_v8_getheapspacestatistics
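For anyone who wants to see the per-space breakdown themselves, Node's v8 module exposes it directly; a quick check from inside the container could look like the following, assuming the example pod name and that the image ships a reasonably recent node (v6+) on the PATH:

  # Print V8 heap usage per space from inside the kibana-proxy container (pod name is illustrative).
  oc exec logging-kibana-1-x849x -c kibana-proxy -- node -e 'require("v8").getHeapSpaceStatistics().forEach(function(s){ console.log(s.space_name, s.space_used_size); });'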
(In reply to Jan Wozniak from comment #7)
> I have a sneaking suspicion that it has not been fixed yet. Looking at the
> above comment, kibana has been restarted a few times during the test
> (possibly OOM, but hard to prove without logs)

Hmm, I often see kibana restart 6 or 8 times before it runs stably at the very beginning of a logging deployment, especially when the ops cluster is enabled. I assumed it was caused by es/es-ops not being ready for connections, or by network latency in the IaaS at that moment. I'll provide the kibana log the next time this situation is encountered; thank you for observing and pointing this out. Let me know if you have any further thoughts on the restart count.

To clarify: the restart counts above (6 and 8) were already present in the logging stack before I ran the curl loops to verify the fix in this scenario. Please feel free to advise if the tests should be performed differently. Thanks in advance!
Can we keep this bug about the kibana-auth-proxy container, and open a new BZ about the kibana container being restarted?
Strike that, we can track Kibana container restarts via BZ https://bugzilla.redhat.com/show_bug.cgi?id=1465464.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188