Bug 1465464
| Summary: | Kibana container grows in memory till out of memory | | | |
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> | |
| Component: | Logging | Assignee: | Jeff Cantrill <jcantril> | |
| Status: | CLOSED WONTFIX | QA Contact: | Junqi Zhao <juzhao> | |
| Severity: | medium | Docs Contact: | | |
| Priority: | unspecified | | | |
| Version: | 3.4.1 | CC: | aos-bugs, erich, hhorak, jcantril, juzhao, jwozniak, pportant, stwalter, wsun | |
| Target Milestone: | --- | Keywords: | Reopened | |
| Target Release: | 3.4.z | | | |
| Hardware: | Unspecified | | | |
| OS: | Unspecified | | | |
| Whiteboard: | | | | |
| Fixed In Version: | | Doc Type: | Bug Fix | |
| Doc Text: | Fix: Use underscores instead of dashes for the memory switch. Result: Memory settings are respected by the nodejs runtime. | Story Points: | --- | |
| Clone Of: | | | | |
| : | 1469711 (view as bug list) | Environment: | | |
| Last Closed: | 2017-09-04 02:15:40 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | | |
| Verified Versions: | | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | | |
| Cloudforms Team: | --- | Target Upstream Version: | | |
| Embargoed: | | | | |
| Bug Depends On: | 1474689 | | | |
| Bug Blocks: | 1469711 | | | |
| Attachments: | | | | |
Description
Steven Walter
2017-06-27 13:33:41 UTC
*** This bug has been marked as a duplicate of bug 1464020 ***

As per conversation, re-opened as a separate issue from bug 1464020; that bug is specifically for the kibana-proxy container and this one is for the kibana container. I believe this has been fixed upstream and needs to be backported.

Looking at the 3.4 release, I don't see where the proxy is propagating the memory request to the nodejs runtime. Marking as upcoming release to remove from the 3.6 blocker list.

Strike comment #5, as I pasted it into the wrong issue.

@Steven, can you provide the version of the image you are using for kibana? Something like 3.4.1-XX. That would give me a better idea what specifically was tested other than the sha.

Found a few posts that seem to indicate we need to configure --max_old_space_size. [1] https://github.com/nodejs/node/issues/7937 [2] https://github.com/elastic/kibana/issues/9006. Note that, per the node docs, the flag should have underscores and not dashes. (A sketch of passing this flag appears after the pod output below.)

Where do we specify --max_old_space_size?

Deployed logging 3.4.1 and let it run for a few hours; during this period, created a few projects to populate logs. Describing the kibana pod shows OOMKilled for both the kibana and kibana-proxy containers.
# oc describe po ${kibana-pods}
Containers:
kibana:
..........
Port:
Limits:
memory: 736Mi
Requests:
memory: 736Mi
State: Running
Started: Wed, 26 Jul 2017 18:06:26 -0400
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 26 Jul 2017 12:43:40 -0400
Finished: Wed, 26 Jul 2017 18:06:24 -0400
kibana-proxy:
...............
Port: 3000/TCP
Limits:
memory: 96Mi
Requests:
memory: 96Mi
State: Running
Started: Wed, 26 Jul 2017 20:15:41 -0400
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 26 Jul 2017 12:43:44 -0400
Finished: Wed, 26 Jul 2017 20:15:39 -0400
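For reference, a minimal sketch of how the container memory limit could be propagated to the nodejs runtime using the underscore form of the flag discussed above. The variable name, entry point, and the halving heuristic (taken from the later comments about PR 529) are illustrative, not the image's actual startup script:

# Illustrative only: derive the node heap cap from the container memory limit (MiB)
# and pass it with underscores, as noted in the node docs discussion above.
KIBANA_MEMORY_LIMIT_MIB=${KIBANA_MEMORY_LIMIT_MIB:-736}   # hypothetical env var; 736Mi is the kibana limit seen in this bug
MAX_OLD_SPACE=$((KIBANA_MEMORY_LIMIT_MIB / 2))            # half of the pod limit, per the later PR 529 discussion
exec node --max_old_space_size=${MAX_OLD_SPACE} ./src/cli # placeholder entry point for kibana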
Created attachment 1305197 [details]
kibana pod info
Created a few projects to populate logs and ran the following twice:
$ for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://${kibana_route}/elasticsearch/ -sk > /dev/null; done; sleep 120s; for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://${kibana_ops_route}/elasticsearch/ -sk > /dev/null; done
The kibana pod restarted 3 times and the kibana-ops pod restarted 2 times. Although the kibana container did not terminate with an OOMKilled error, the kibana-proxy container did.
# oc get po
NAME READY STATUS RESTARTS AGE
java-mainclass-1-56prz 1/1 Running 0 44m
logging-curator-1-gw3l4 1/1 Running 0 3h
logging-curator-ops-1-0neb5 1/1 Running 0 3h
logging-deployer-zjpl7 0/1 Completed 0 3h
logging-es-axorjnf6-1-ivr7c 1/1 Running 0 3h
logging-es-ops-plqwi2jt-1-5atpn 1/1 Running 0 3h
logging-fluentd-s318q 1/1 Running 0 3h
logging-kibana-1-4aqqx 2/2 Running 3 3h
logging-kibana-ops-1-d39aj 2/2 Running 2 3h
# oc describe po ${kibana-pods} (see the attached file for more info)
Containers:
kibana:
.......................
Limits:
memory: 736Mi
Requests:
memory: 736Mi
State: Running
Started: Tue, 08 Aug 2017 04:38:32 -0400
Ready: True
Restart Count: 0
.......................
kibana-proxy:
.......................
Limits:
memory: 96Mi
Requests:
memory: 96Mi
State: Running
Started: Tue, 08 Aug 2017 07:45:02 -0400
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 08 Aug 2017 07:14:19 -0400
Finished: Tue, 08 Aug 2017 07:44:59 -0400
Ready: True
Restart Count: 3
Created attachment 1310599 [details]
OOMKilled for container kibana-proxy
Verified with logging-kibana:3.4.1-28 using the same steps as in Comment 22; the kibana-proxy container still terminated with an OOMKilled error. See the attached file for more info.
# oc get po
NAME                              READY     STATUS      RESTARTS   AGE
logging-curator-1-2gij7           1/1       Running     0          1h
logging-curator-ops-1-yeau9       1/1       Running     0          1h
logging-deployer-nj623            0/1       Completed   0          1h
logging-es-7dw1dcne-1-86w7j       1/1       Running     0          1h
logging-es-ops-bv2i5e2v-1-8fu68   1/1       Running     0          1h
logging-fluentd-ydj7a             1/1       Running     0          1h
logging-kibana-1-i0gdv            2/2       Running     2          1h
logging-kibana-ops-1-68105        2/2       Running     2          1h
Created attachment 1314999 [details]
OOMKilled for container kibana-proxy -20170818
In order to debug this, we need to be sure we add up all the "limits" for pods running on the node, compare against the total memory on the node and the node reserve configured in /etc/origin/node/node-config.yml, and see if everything "fits". Second, we need to make sure the "requests" size is the same as the "limits" explicitly set in all the DCs and DSs. (A rough sketch of this check appears after the verification steps below.)

Changed resources.limits.memory from the default 96Mi to 150Mi for kibana-proxy; no OOM error was found for either the kibana or kibana-proxy container.
See the attached file.
That makes sense: the OOM would occur if resources.limits.memory is too small.
Steps:
Created a few projects to populate logs and ran the following four times:
$ for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://${kibana_route}/elasticsearch/ -sk > /dev/null; done; sleep 120s; for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://${kibana_ops_route}/elasticsearch/ -sk > /dev/null; done
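As a rough illustration of the check described in the earlier comment (summing the limits of pods on the node, accounting for the node reserve, and confirming requests match limits), something like the following could be used. The node name is a placeholder and the config path is the one given above:

# Allocatable memory on the node and the summed requests/limits of the pods scheduled there
# (placeholder node name; pick the node running the logging pods).
oc describe node node1.example.com | grep -A 10 "Allocated resources"

# Node reserve, if configured, sits under kubeletArguments in the node config (path as given above).
grep -A 2 -E "kube-reserved|system-reserved" /etc/origin/node/node-config.yml

# Confirm requests match the limits explicitly set in the logging DCs.
oc get dc -n logging -o yaml | grep -A 6 "resources:"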
Created attachment 1315146 [details]
changed resources.limits.memory to 150Mi, no OOM error now
No OOM error shows now.
Verification steps:
Verified with logging-kibana:3.4.1-28: changed resources.limits.memory from the default 96Mi to a larger value, created a few projects to populate logs, and ran the following four times:
$ for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://${kibana_route}/elasticsearch/ -sk > /dev/null; done; sleep 120s; for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://${kibana_ops_route}/elasticsearch/ -sk > /dev/null; done
When we adjusted the calculation of max-old-space-size to half [1] of what the pod receives, another PR to openshift-ansible [2] got lost in the process. When it merges, the default memory should be set to a conservative 256MB, which should suffice. (A back-of-the-envelope sketch of this arithmetic appears after the attachment note below.)

[1] https://github.com/openshift/origin-aggregated-logging/pull/529
[2] https://github.com/openshift/openshift-ansible/pull/4761

Don't we also need a fix to halve max-old-space-size for the kibana-proxy as well?

Created attachment 1319837 [details]
resources.limits.memory is still 96Mi for kibana-proxy container
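Back-of-the-envelope arithmetic for the halving logic described above (a sketch only; the exact calculation lives in PR 529 and is not reproduced here):

# With the default kibana-proxy limit of 96Mi, halving leaves roughly 48MB of old space for node,
# which appears to be too little under load (hence the OOMKilled state above); the proposed
# 256MB default would leave roughly 128MB.
for limit_mib in 96 256; do
    echo "container limit ${limit_mib}Mi -> --max_old_space_size=$((limit_mib / 2))"
done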
Wei, we should encourage users of 3.4 to migrate to 3.5. The workaround for this issue is to deploy logging and then edit the logging-kibana DeploymentConfig:

1. Set the env variable OCP_AUTH_PROXY_MEMORY_LIMIT to 256Mi for the kibana-proxy container. (A CLI sketch of this workaround appears at the end of this report.)

The workaround in Comment 40 works; there is no OOM error for the kibana-proxy container. Since we are not going to fix it for logging 3.4, and there is no documentation encouraging users to migrate to 3.5, I think we should close it as WONTFIX and remove this defect from the errata.

Verification step: create a few projects to populate logs and run:
$ for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://${kibana_route}/elasticsearch/ -sk > /dev/null; done; sleep 120s; for i in {1..300}; do curl --fail --max-time 10 -H "Authorization: Bearer `oc whoami -t`" https://${kibana_ops_route}/elasticsearch/ -sk > /dev/null; done

@Jeff, I have reported a documentation defect for this issue; it describes the workaround if customers come across this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1488001. We closed this defect as WONTFIX, are you OK with this solution?

Yes.
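For completeness, a sketch of applying the Comment 40 workaround from the CLI. The env variable name is the one given above; the container and DC names are the ones seen in this bug, and logging-kibana-ops would need the same change if deployed:

# Raise the memory available to the auth proxy via the DC environment variable.
oc set env dc/logging-kibana -c kibana-proxy OCP_AUTH_PROXY_MEMORY_LIMIT=256Mi

# If the container limit itself also needs to grow (as QE did earlier in this bug),
# edit the DC and raise resources.limits.memory / resources.requests.memory for kibana-proxy.
oc edit dc/logging-kibana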