Bug 1570104 - [Infra] OpenDaylight is consuming too much heap memory
Summary: [Infra] OpenDaylight is consuming too much heap memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Tim Rozet
QA Contact: Itzik Brown
URL:
Whiteboard: odl_infra
Depends On:
Blocks:
Reported: 2018-04-20 16:06 UTC by Sai Sindhur Malleni
Modified: 2018-10-18 07:19 UTC (History)

Fixed In Version: puppet-opendaylight-8.1.0-0.20180321182557
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-06-27 13:52:02 UTC
Target Upstream Version:


Attachments
/bin/inc (9.76 KB, text/plain)
2018-04-23 19:20 UTC, Sai Sindhur Malleni
bin/setenv (2.17 KB, text/plain)
2018-04-23 19:20 UTC, Sai Sindhur Malleni


Links
System ID Priority Status Summary Last Updated
OpenDaylight Bug INTPAK-163 None None None 2018-04-24 15:32:05 UTC
OpenDaylight gerrit 71267 None None None 2018-04-24 17:51:00 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:53:11 UTC

Description Sai Sindhur Malleni 2018-04-20 16:06:15 UTC
Description of problem:
I noticed something strange and want to make sure I'm not missing something very obvious. Looking at the karaf process in the ODL container I see

[root@overcloud-odl-0 heat-admin]# ps aux | grep karaf | grep server
42462      19245 27.5  6.4 60158752 8518440 ?    Sl   13:50  34:55 /usr/bin/java -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:/dev/./urandom -Djava.endorsed.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/jre/lib/endorsed:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/lib/endorsed:/opt/opendaylight/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/jre/lib/ext:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/lib/ext:/opt/opendaylight/lib/ext -Dkaraf.instances=/opt/opendaylight/instances -Dkaraf.home=/opt/opendaylight -Dkaraf.base=/opt/opendaylight -Dkaraf.data=/opt/opendaylight/data -Dkaraf.etc=/opt/opendaylight/etc -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir=/opt/opendaylight/data/tmp -Djava.util.logging.config.file=/opt/opendaylight/etc/java.util.logging.properties -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /opt/opendaylight/lib/boot/org.apache.karaf.diagnostic.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.jaas.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.main-4.1.3.jar:/opt/opendaylight/lib/boot/org.osgi.core-6.0.0.jar org.apache.karaf.main.Main


I do not see any startup options for the starting heap size or maximum heap size.
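One way to make this check less error-prone than reading the long command line by eye is to filter it for heap flags. This is only a rough sketch: the helper name heap_flags is made up for illustration, and the sample string is a shortened version of the command line above.

```shell
#!/bin/sh
# heap_flags: hypothetical helper. Prints any -Xms/-Xmx arguments found in
# a java command line, or "none" if there are none.
heap_flags() {
    found=$(printf '%s\n' "$1" | tr ' ' '\n' | grep -E '^-Xm[sx]' | tr '\n' ' ')
    if [ -z "$found" ]; then
        echo "none"
    else
        # Drop the trailing space left by the final tr.
        echo "${found% }"
    fi
}

# Shortened form of the command line reported above (no heap flags).
heap_flags "/usr/bin/java -Djava.net.preferIPv4Stack=true -Dkaraf.home=/opt/opendaylight org.apache.karaf.main.Main"
```

On a live node the argument could come from something like `ps -o args= -p <pid>` for the karaf process.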

Meanwhile, while running perf tests we are seeing ODL hog a lot of heap memory. Earlier, the default heap max was set to 2G, and if usage went over that we would just see an OOM and ODL would be killed. However, it now seems to use as much as it likes.

You can see heap size here[1] (nice sawtooth :-) )


[1]- https://snapshot.raintank.io/dashboard/snapshot/iJPjHdURQ3IIDTL8fqYCZ7kST5T2Ca3r

Version-Release number of selected component (if applicable):
OSP 13
opendaylight-8.0.0-5.el7ost.noarch  
puppet-opendaylight-8.1.0-0.20180321182556.45c4db7.el7ost.noarch 

How reproducible:
100%

Steps to Reproduce:
1. Install OSP13 + ODL
2. Monitor default java startup opts and heap size
3.

Actual results:
ODL seems to be consuming as much heap as it likes

Expected results:
Heap should be set to a max of 2G and trying to go over should result in the JVM getting killed

Additional info:

Comment 1 Michael Vorburger 2018-04-23 14:52:08 UTC
This new "memory problem" here is very different from the ones we chased earlier; and hopefully much easier to solve: What seems to be happening here is that when we moved into a container with ODL, we somehow lost the JVM options we had earlier!

Or perhaps it's not even related to going into a container, but some other upstream (Karaf version bump?) or downstream (RPM? TripleO) change? Whatever the culprit, we can see from the /usr/bin/java ... line above that there are no JVM memory settings such as -Xmx anymore.

It's actually about more than just -Xmx; if we compare the lines above with what I see on a netvirt/karaf (master Fluorine, but should be the same on Oxygen) upstream, we also lost Xms, UnlockDiagnosticVMOptions and HeapDumpOnOutOfMemoryError, which is curious:

/usr/bin/java -Djava.security.properties=/home/vorburger/dev/ODL/git/netvirt/karaf/target/assembly/etc/odl.java.security -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote -Djava.security.egd=file:/dev/./urandom 

Someone needs to figure out why all of these options which are there for a good reason upstream ;) seem to have been lost in the RPM that runs inside the container downstream?

Now from what I understand about containers, they can but don't have to have memory limits (I'm more familiar with how this works for application containers in OpenShift than with how an ODL container in OSP is configured in detail).

It would seem that this ODL container has neither the required JVM options for memory management, nor does its enclosing container have any limit.  Therefore I suspect it just keeps growing and grabbing GB after GB from the underlying host node on which the container runs.

We may also want to add additional container specific JVM options; in addition to the regular (non container related) usual Xmx and Xms, I would recommend we also consider adding this magic:

    -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap

We should inherit as much as possible from upstream (so the fix for this issue should NOT contain Xmx & Xms, because that is already available upstream), and pass additional options like above as parameters or set ENV VARs (JAVA_OPTS ?) instead of patching existing upstream Karaf launch scripts such as /opt/opendaylight/bin/karaf directly.
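The cgroup-based flags can only help if the container actually has a memory limit, so it may be worth checking that first. The following is a minimal sketch, not a tested procedure: the helper name limit_is_set is hypothetical, the path is the usual cgroup v1 location and is an assumption about this environment, and treating very large values as "no limit" is a heuristic.

```shell
#!/bin/sh
# limit_is_set: hypothetical helper. Takes the contents of a cgroup memory
# limit file and succeeds only if it looks like a real limit.
# Assumption: an empty value, "max" (cgroup v2), anything non-numeric, or a
# number with 19 or more digits (cgroup v1 reports a huge value when
# unlimited) all mean "no limit".
limit_is_set() {
    case "$1" in
        ''|max|*[!0-9]*) return 1 ;;
    esac
    # Comparing the digit count avoids integer overflow in the shell.
    [ "${#1}" -lt 19 ]
}

# Assumed cgroup v1 path; prints "no limit" if the file is missing.
if limit_is_set "$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null)"; then
    echo "container memory limit is set"
else
    echo "no limit"
fi
```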

Comment 2 Sai Sindhur Malleni 2018-04-23 15:12:05 UTC
I think this bug's priority should be more than "medium". In cases where we have ODL collocated with OpenStack controllers this could cause severe performance and even functional issues, probably taking down other services in OpenStack.

Comment 3 Sai Sindhur Malleni 2018-04-23 15:13:09 UTC
Also, +1 to Michael's point. Pretty sure this is a startup option thing. In OSP12, I remember we had the java specific startup args for heap even in the container. So something changed with how TripleO sets up ODL.

Comment 4 Tim Rozet 2018-04-23 16:23:23 UTC
We do not set a cgroup mem limit on the container, so inheriting that for the JVM process won't do anything (for now, but it is good to remember for the future).  The JVM process itself should have a heap limit, but it is possible that it is being overwritten.  We used to pass JAVA_OPTS into the systemd env, however with the move to containers that no longer works (because we do not use systemd in the container).  We had to make a change to be able to pass JAVA_OPTS by modifying the start script used by karaf:

https://git.opendaylight.org/gerrit/#/c/68783/

My hunch here is that we are accidentally overwriting the final opts in the start script and so dropping the other arguments.  Can you please provide your karaf script from the container so we can see how it has been modified by puppet?

Comment 5 Sai Sindhur Malleni 2018-04-23 16:31:47 UTC
Attaching /opt/opendaylight/bin/karaf
https://gist.github.com/smalleni/b23c22d6f91229f3cd609f70fe29c58c

Comment 6 Michael Vorburger 2018-04-23 17:21:27 UTC
Could I please also ask for the "bin/inc" and "bin/setenv" scripts from your end, not just "bin/karaf"?  In the latest upstream Oxygen maintenance branch, I can see that the "standard" Xmx I've referred to above comes from here, and then we can understand how we lose it downstream:

bin/setenv:    export JAVA_MAX_MEM="2048m"

bin/inc:    DEFAULT_JAVA_OPTS="-Xms${JAVA_MIN_MEM} -Xmx${JAVA_MAX_MEM} -XX:+UnlockDiagnosticVMOptions "

Comment 7 Sai Sindhur Malleni 2018-04-23 19:20:09 UTC
Created attachment 1425722 [details]
/bin/inc

Comment 8 Sai Sindhur Malleni 2018-04-23 19:20:32 UTC
Created attachment 1425723 [details]
bin/setenv

Comment 9 Michael Vorburger 2018-04-24 11:06:14 UTC
> comes from here, and then we can understand how we lose it downstream

The attached bin/setenv and bin/inc are correct, so we probably start differently.

Comment 10 Tim Rozet 2018-04-24 13:09:31 UTC
We start with:
https://github.com/openstack/tripleo-heat-templates/blob/master/docker/services/opendaylight-api.yaml#L111


One easy way to fix this is to just specify the heap size as part of JAVA_OPTS in puppet-opendaylight, and any other arguments we are also missing.

Comment 11 Tim Rozet 2018-04-24 14:54:58 UTC
The reason nothing is set is that there is a check in the setupDebugOptions function in bin/inc:

setupDebugOptions() {
    if [ "x${JAVA_OPTS}" = "x" ]; then
        JAVA_OPTS="${DEFAULT_JAVA_OPTS}"
    fi
    # (rest of the function omitted)
}

Since we already set JAVA_OPTS, it never applies the default opts, which include the memory settings.
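The effect of that check can be reproduced in isolation. This is a simplified model rather than the real bin/inc: the helper name effective_opts is made up, the memory values are taken from the upstream command line quoted earlier, and the pre-set JAVA_OPTS value is assumed from the process listing in the description.

```shell
#!/bin/sh
# Simplified model of the check in setupDebugOptions; not the full bin/inc.
JAVA_MIN_MEM="128M"      # values as seen in the upstream command line
JAVA_MAX_MEM="2048m"
DEFAULT_JAVA_OPTS="-Xms${JAVA_MIN_MEM} -Xmx${JAVA_MAX_MEM} -XX:+UnlockDiagnosticVMOptions"

# effective_opts: hypothetical helper that applies the same test to $1.
effective_opts() {
    JAVA_OPTS="$1"
    if [ "x${JAVA_OPTS}" = "x" ]; then
        JAVA_OPTS="${DEFAULT_JAVA_OPTS}"
    fi
    echo "${JAVA_OPTS}"
}

effective_opts ""                                   # empty: defaults apply
effective_opts "-Djava.net.preferIPv4Stack=true"    # pre-set: defaults are skipped
```

The second call prints only the pre-set option, with no -Xms or -Xmx, which matches the command line in the description.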

Comment 12 Tim Rozet 2018-04-24 15:02:33 UTC
As Stephen mentioned, we should use EXTRA_JAVA_OPTS here so that we are able to include the DEFAULT_JAVA_OPTS.

Comment 13 Michael Vorburger 2018-04-24 15:05:06 UTC
Skitt on IRC pointed out we could just use EXTRA_JAVA_OPTS instead of JAVA_OPTS.  That way, the standard upstream JVM args (Xmx etc) are preserved, but we can augment with additional options - such as (apparently) the -Djava.net.preferIPv4Stack=true thing.
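A rough sketch of why this works, assuming the launcher passes both JAVA_OPTS and EXTRA_JAVA_OPTS to java (that assumption, and the helper name build_args, are illustrative and not taken from the actual Karaf scripts):

```shell
#!/bin/sh
# Simplified model, not the real Karaf launch script.
DEFAULT_JAVA_OPTS="-Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions"

# build_args: hypothetical helper, $1 = JAVA_OPTS, $2 = EXTRA_JAVA_OPTS.
build_args() {
    opts="$1"
    # Same test as in setupDebugOptions: defaults apply only if JAVA_OPTS is empty.
    if [ "x${opts}" = "x" ]; then
        opts="${DEFAULT_JAVA_OPTS}"
    fi
    # Extra options are appended and never replace the defaults.
    if [ -n "$2" ]; then
        opts="${opts} $2"
    fi
    echo "${opts}"
}

# JAVA_OPTS left empty, custom option passed via EXTRA_JAVA_OPTS.
build_args "" "-Djava.net.preferIPv4Stack=true"
```

With JAVA_OPTS left unset, the heap settings survive and the custom option is added after them.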

Comment 20 Itzik Brown 2018-05-22 07:22:36 UTC
Checked with:
puppet-opendaylight-8.1.2-1.38977efgit.el7ost.noarch

Seems like it's right:

[root@controller-1 heat-admin]# ps -ef |grep java
42462      37818   37069 27 May21 ?        06:35:06 /usr/bin/java -Djava.security.properties=/opt/opendaylight/etc/odl.java.security -Xms128M -Xmx2048m -XX:+UnlockDiagnosticVMOptions -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote -Djava.net.preferIPv4Stack=true -Djava.security.egd=file:/dev/./urandom -Djava.endorsed.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/jre/lib/endorsed:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/lib/endorsed:/opt/opendaylight/lib/endorsed -Djava.ext.dirs=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/jre/lib/ext:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.171-7.b10.el7.x86_64/jre/lib/ext:/opt/opendaylight/lib/ext -Dkaraf.instances=/opt/opendaylight/instances -Dkaraf.home=/opt/opendaylight -Dkaraf.base=/opt/opendaylight -Dkaraf.data=/opt/opendaylight/data -Dkaraf.etc=/opt/opendaylight/etc -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir=/opt/opendaylight/data/tmp -Djava.util.logging.config.file=/opt/opendaylight/etc/java.util.logging.properties -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /opt/opendaylight/lib/boot/org.apache.karaf.diagnostic.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.jaas.boot-4.1.3.jar:/opt/opendaylight/lib/boot/org.apache.karaf.main-4.1.3.jar:/opt/opendaylight/lib/boot/org.osgi.core-6.0.0.jar org.apache.karaf.main.Main

Comment 22 errata-xmlrpc 2018-06-27 13:52:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

