Bug 1559152 - hawkular-metrics fails to start, enters CrashLoopBackoff
Summary: hawkular-metrics fails to start, enters CrashLoopBackoff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.z
Assignee: Ruben Vargas Palma
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-03-21 19:59 UTC by Dan Yocum
Modified: 2018-05-18 03:55 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-18 03:54:45 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1576 0 None None None 2018-05-18 03:55:23 UTC

Description Dan Yocum 2018-03-21 19:59:22 UTC
Description of problem:

On a fresh 3.7.23 deployment, hawkular metrics fails to start with error:

"Initial heap size set to a larger value than the maximum heap size"

(Tries to create a pod w/ 3G and largest pod allowed on OSD is 2G)

Version-Release number of selected component (if applicable):

3.7

How reproducible:

Every

Steps to Reproduce:
1. deploy metrics

Actual results:

# oc logs hawkular-metrics-gzj2x
2018-03-21 19:54:43 Starting Hawkular Metrics
The service account has read permissions for its project. Proceeding
The service account has permission to watch namespaces. Proceeding
Creating the Hawkular Metrics keystore from the Secret's cert data
Converting the PKCS12 keystore into a Java Keystore
Importing keystore /opt/hawkular/auth/hawkular-metrics.pkcs12 to /opt/hawkular/auth/hawkular-metrics.keystore...
Entry for alias hawkular-metrics successfully imported.
Import command completed:  1 entries successfully imported, 0 entries failed or cancelled
[Storing /opt/hawkular/auth/hawkular-metrics.keystore]

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore /opt/hawkular/auth/hawkular-metrics.keystore -destkeystore /opt/hawkular/auth/hawkular-metrics.keystore -deststoretype pkcs12".
Building the trust store
Certificate was added to keystore
Certificate was added to keystore
Splitting up the Kubernetes CA into individual certificates
Adding the Kubernetes CAs into the trust store
Certificate was added to keystore
Retrieving the Logging's CA and adding to the trust store, if Logging is available
Could not get the logging secret! Status code: 403. The Hawkular Alerts integration with Logging might not work properly.
-Xms1536m -Xmx1536m -XX:+UseParallelGC -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:MaxMetaspaceSize=100m -XX:+ExitOnOutOfMemoryError
/opt/eap/bin/standalone.conf: line 105: max_mem: command not found
=========================================================================

  JBoss Bootstrap Environment

  JBOSS_HOME: /opt/eap

  JAVA: /usr/lib/jvm/java-1.8.0/bin/java

  JAVA_OPTS:  -server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m -Xmx1303m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/.overlays/layer-base-jboss-eap-7.0.9.CP/org/jboss/logmanager/main/jboss-logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -javaagent:/opt/eap/jolokia.jar=port=8778,protocol=https,caCert=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt,clientPrincipal=cn=system:master-proxy,useSslClientAuthentication=true,extraClientCheck=true,host=0.0.0.0,discoveryEnabled=false -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump




Expected results:

Starts



Additional info:

# oc describe rc/hawkular-metrics
Name:		hawkular-metrics
Namespace:	openshift-infra
Selector:	name=hawkular-metrics
Labels:		metrics-infra=hawkular-metrics
		name=hawkular-metrics
Annotations:	kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"ReplicationController","metadata":{"annotations":{},"creationTimestamp":"2018-03-21T17:58:21Z","generation":2,"labels":{"met...
Replicas:	1 current / 1 desired
Pods Status:	1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:		metrics-infra=hawkular-metrics
			name=hawkular-metrics
  Service Account:	hawkular
  Containers:
   hawkular-metrics:
    Image:	registry.reg-aws.openshift.com:443/openshift3/metrics-hawkular-metrics:v3.7
    Ports:	8080/TCP, 8443/TCP, 8888/TCP
    Command:
      /opt/hawkular/scripts/hawkular-metrics-wrapper.sh
      -b
      0.0.0.0
      -Dhawkular.metrics.cassandra.nodes=hawkular-cassandra
      -Dhawkular.metrics.cassandra.use-ssl
      -Dhawkular.metrics.openshift.auth-methods=openshift-oauth,htpasswd
      -Dhawkular.metrics.openshift.htpasswd-file=/hawkular-account/hawkular-metrics.htpasswd
      -Dhawkular.metrics.allowed-cors-access-control-allow-headers=authorization
      -Dhawkular.metrics.default-ttl=7
      -Dhawkular.metrics.admin-tenant=_hawkular_admin
      -Dhawkular-alerts.cassandra-nodes=hawkular-cassandra
      -Dhawkular-alerts.cassandra-use-ssl
      -Dhawkular.alerts.openshift.auth-methods=openshift-oauth,htpasswd
      -Dhawkular.alerts.openshift.htpasswd-file=/hawkular-account/hawkular-metrics.htpasswd
      -Dhawkular.alerts.allowed-cors-access-control-allow-headers=authorization
      -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
      -Dorg.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true
      -Dcom.datastax.driver.FORCE_NIO=true
      -DKUBERNETES_MASTER_URL=https://kubernetes.default.svc
      -DUSER_WRITE_ACCESS=False
      -Dhawkular.metrics.jmx-reporting-enabled
    Limits:
      memory:	3Gi
    Requests:
      cpu:	100m
      memory:	3Gi
    Liveness:	exec [/opt/hawkular/scripts/hawkular-metrics-liveness.py] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:	exec [/opt/hawkular/scripts/hawkular-metrics-readiness.py] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:			 (v1:metadata.namespace)
      MASTER_URL:			https://kubernetes.default.svc
      JGROUPS_PASSWORD:			XrZ1lKOPdJdtgWJLh
      TRUSTSTORE_AUTHORITIES:		/hawkular-metrics-certs/tls.truststore.crt
      ENABLE_PROMETHEUS_ENDPOINT:	True
      OPENSHIFT_KUBE_PING_NAMESPACE:	 (v1:metadata.namespace)
      OPENSHIFT_KUBE_PING_LABELS:	metrics-infra=hawkular-metrics,name=hawkular-metrics
      STARTUP_TIMEOUT:			500
    Mounts:
      /hawkular-account from hawkular-metrics-account (rw)
      /hawkular-metrics-certs from hawkular-metrics-certs (rw)
  Volumes:
   hawkular-metrics-certs:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-certs
    Optional:	false
   hawkular-metrics-account:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-account
    Optional:	false
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----			-------------	--------	------			-------
  1h		1h		1	replication-controller			Normal		SuccessfulCreate	Created pod: hawkular-metrics-6lhcs
  1h		1h		1	replication-controller			Normal		SuccessfulDelete	Deleted pod: hawkular-metrics-6lhcs
  1h		1h		1	replication-controller			Normal		SuccessfulCreate	Created pod: hawkular-metrics-9kwzq
  9m		9m		1	replication-controller			Normal		SuccessfulDelete	Deleted pod: hawkular-metrics-9kwzq
  8m		8m		1	replication-controller			Normal		SuccessfulCreate	Created pod: hawkular-metrics-gzj2x

Comment 1 Dan Yocum 2018-03-21 20:57:00 UTC
My apologies - my initial assessment was wrong.  It's not that the pod size was too large, it's that the Xms is set larger than the Xmx:

  JAVA_OPTS:  -server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m -Xmx1303m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/.overlays/layer-base-jboss-eap-7.0.9.CP/org/jboss/logmanager/main/jboss-logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -javaagent:/opt/eap/jolokia.jar=port=8778,protocol=https,caCert=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt,clientPrincipal=cn=system:master-proxy,useSslClientAuthentication=true,extraClientCheck=true,host=0.0.0.0,discoveryEnabled=false -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump



There is no container memory size limit on the openshift-infra project.
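
The mismatch described in this comment can be detected mechanically from the JAVA_OPTS string before the JVM is started. The following is only a minimal sketch and is not part of the hawkular-metrics image: the function names to_kib, last_flag, and check_heap are hypothetical, and it assumes that when a heap flag is repeated the last occurrence is the one that takes effect.

```shell
# Convert a JVM size such as 1536m, 2g, or 512k to KiB.
# A bare number is treated as bytes.
to_kib() {
  local value="$1" num unit
  num=${value%[kKmMgG]}
  unit=${value#"$num"}
  case "$unit" in
    k|K) echo "$num" ;;
    m|M) echo $(( num * 1024 )) ;;
    g|G) echo $(( num * 1024 * 1024 )) ;;
    "")  echo $(( num / 1024 )) ;;
  esac
}

# Print the size given by the last occurrence of a flag prefix (-Xms or -Xmx).
# Assumption: the last occurrence is the one that takes effect.
# The body runs in a subshell so that "set -f" (no globbing) does not leak out.
last_flag() (
  set -f
  prefix="$1"
  result=""
  for tok in $2; do
    case "$tok" in
      "$prefix"*) result=${tok#"$prefix"} ;;
    esac
  done
  printf '%s\n' "$result"
)

# Return 1 if -Xms is larger than -Xmx, 0 otherwise.
check_heap() {
  local opts="$1" xms xmx
  xms=$(last_flag -Xms "$opts")
  xmx=$(last_flag -Xmx "$opts")
  if [ -z "$xms" ] || [ -z "$xmx" ]; then
    echo "heap flags not fully specified"
    return 0
  fi
  if [ "$(to_kib "$xms")" -gt "$(to_kib "$xmx")" ]; then
    echo "invalid: -Xms$xms is larger than -Xmx$xmx"
    return 1
  fi
  echo "ok: -Xms$xms <= -Xmx$xmx"
}
```

Called with the JAVA_OPTS line from the log above, check_heap reports that -Xms1536m is larger than -Xmx1303m and returns a non-zero status.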

Comment 2 John Sanda 2018-03-21 21:39:03 UTC
(In reply to Dan Yocum from comment #1)
> My apologies - my initial assessment was wrong.  It's not that the pod size
> was too large, it's that the Xms is set larger than the Xmx:
> 
>   JAVA_OPTS:  -server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log"
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading
> -Xms1536m -Xmx1303m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m
> -Djava.net.preferIPv4Stack=true
> -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api
> -Djava.awt.headless=true
> -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/
> base/.overlays/layer-base-jboss-eap-7.0.9.CP/org/jboss/logmanager/main/jboss-
> logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/
> jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar
> -Djava.util.logging.manager=org.jboss.logmanager.LogManager
> -javaagent:/opt/eap/jolokia.jar=port=8778,protocol=https,caCert=/var/run/
> secrets/kubernetes.io/serviceaccount/ca.crt,clientPrincipal=cn=system:master-
> proxy,useSslClientAuthentication=true,extraClientCheck=true,host=0.0.0.0,
> discoveryEnabled=false -Djava.security.egd=file:/dev/./urandom
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump
> 
> 
> 
> There is no container memory size limit on the openshift-infra project.

Thanks for the info. We are able to reproduce with brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/metrics-hawkular-metrics:v3.7.39.

Comment 5 Dan Yocum 2018-03-26 19:18:16 UTC
There is a workaround.  It's ugly, but it overrides the JAVA_OPTS env var that is passed in the hawkular-metrics-wrapper.sh script.



oc scale --replicas=0 rc/hawkular-metrics

oc env rc/hawkular-metrics JAVA_OPTS=/'-server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m -Xms1536m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/.overlays/layer-base-jboss-eap-7.0.9.CP/org/jboss/logmanager/main/jboss-logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -javaagent:/opt/eap/jolokia.jar=port=8778,protocol=https,caCert=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt,clientPrincipal=cn=system:master-proxy,useSslClientAuthentication=true,extraClientCheck=true,host=0.0.0.0,discoveryEnabled=false -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump'

oc scale --replicas=1 rc/hawkular-metrics

Comment 6 Dan Yocum 2018-03-26 19:30:27 UTC
Sorry - there's a typo in the above command (an extra '/' that is in the old 3.0 oc cli docs).  This is the right command:

oc scale --replicas=0 rc/hawkular-metrics

oc env rc/hawkular-metrics JAVA_OPTS='-server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m -Xms1536m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/.overlays/layer-base-jboss-eap-7.0.9.CP/org/jboss/logmanager/main/jboss-logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -javaagent:/opt/eap/jolokia.jar=port=8778,protocol=https,caCert=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt,clientPrincipal=cn=system:master-proxy,useSslClientAuthentication=true,extraClientCheck=true,host=0.0.0.0,discoveryEnabled=false -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump'

oc scale --replicas=1 rc/hawkular-metrics

Comment 7 Dan Yocum 2018-03-26 19:31:25 UTC
John just saw another typo:

oc scale --replicas=0 rc/hawkular-metrics

oc env rc/hawkular-metrics JAVA_OPTS='-server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m -Xmx1536m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/.overlays/layer-base-jboss-eap-7.0.9.CP/org/jboss/logmanager/main/jboss-logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -javaagent:/opt/eap/jolokia.jar=port=8778,protocol=https,caCert=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt,clientPrincipal=cn=system:master-proxy,useSslClientAuthentication=true,extraClientCheck=true,host=0.0.0.0,discoveryEnabled=false -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump'

oc scale --replicas=1 rc/hawkular-metrics
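
Retyping the full option string by hand is error-prone. As a hedged alternative, the heap flags alone can be rewritten in a copy of the existing JAVA_OPTS value. The helper name set_heap below is hypothetical, and the sketch assumes the current JAVA_OPTS string has already been copied from the pod log into a shell variable. The oc env call is left commented out so the snippet runs without a cluster.

```shell
# Replace the size on every -Xms and -Xmx token in a JAVA_OPTS string,
# leaving all other options untouched.
set_heap() {
  local opts="$1" size="$2"
  printf '%s\n' "$opts" | sed -E \
    -e "s/(^|[[:space:]])-Xms[0-9]+[kKmMgG]?/\1-Xms${size}/g" \
    -e "s/(^|[[:space:]])-Xmx[0-9]+[kKmMgG]?/\1-Xmx${size}/g"
}

# Shortened example value; in practice this would be the full string from the log.
current='-server -Xms1536m -Xmx1303m -XX:MetaspaceSize=96M'
fixed=$(set_heap "$current" 1536m)
echo "$fixed"
# prints: -server -Xms1536m -Xmx1536m -XX:MetaspaceSize=96M

# oc env rc/hawkular-metrics JAVA_OPTS="$fixed"
```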

Comment 8 Dan Yocum 2018-04-02 21:21:13 UTC
I just deployed a new cluster and apparently the rc now sets the JAVA_OPTS to this:

... -Xms1303m -Xmx1303m ...

Is this right??

Comment 9 John Sanda 2018-04-13 17:28:45 UTC
(In reply to Dan Yocum from comment #8)
> I just deployed a new cluster and apparently the rc now sets the JAVA_OPTS to
> this:
> 
> ... -Xms1303m -Xmx1303m ...
> 
> Is this right??

Sorry for the late response. Yes, that looks right. What is the status with this issue?

Comment 10 Dan Yocum 2018-04-13 19:55:05 UTC
I don't think it's right - we talked about it in https://bugzilla.redhat.com/show_bug.cgi?id=1559477#c15.  The heap should be 50% of the container limit, which is 3GB, so these should be 1536m.  1303m caused the INTERNAL_SERVER_ERROR seen in https://bugzilla.redhat.com/show_bug.cgi?id=1559477#c10
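
The 50% figure can be checked with a short calculation. This is only a sketch of the arithmetic, not the logic the wrapper script actually uses: the function names limit_to_mib and heap_flags_for_limit are hypothetical, and only the binary suffixes Ki, Mi, and Gi are handled.

```shell
# Convert a container memory limit with a binary suffix (Ki, Mi, Gi) to MiB.
limit_to_mib() {
  local q="$1"
  case "$q" in
    *Gi) echo $(( ${q%Gi} * 1024 )) ;;
    *Mi) echo "${q%Mi}" ;;
    *Ki) echo $(( ${q%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# Print matching -Xms/-Xmx flags sized at 50% of the container limit.
heap_flags_for_limit() {
  local mib half
  mib=$(limit_to_mib "$1") || return 1
  half=$(( mib / 2 ))
  echo "-Xms${half}m -Xmx${half}m"
}

heap_flags_for_limit 3Gi
# prints: -Xms1536m -Xmx1536m
```

With the 3Gi limit shown in the rc description, 3 * 1024 = 3072 MiB, and half of that is 1536 MiB.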

Comment 13 Junqi Zhao 2018-05-08 13:25:17 UTC
Tested with metrics-hawkular-metrics-v3.7.46-1, hawkular-metrics pod runs well

# oc get po 
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-l5jjt   1/1       Running   0          14m
hawkular-metrics-pdszs       1/1       Running   0          14m
heapster-nk2kw               1/1       Running   0          14m

Xmx and Xms use the same value, which is 50% of the hawkular-metrics container memory limit
*****************************************************************************
  JBoss Bootstrap Environment
  JBOSS_HOME: /opt/eap
  JAVA: /usr/lib/jvm/java-1.8.0/bin/java
  JAVA_OPTS:  -server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m  -Xmx1536m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/main/jboss-logmanager-2.0.3.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -javaagent:/opt/eap/jolokia.jar=port=8778,protocol=https,caCert=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt,clientPrincipal=cn=system:master-proxy,useSslClientAuthentication=true,extraClientCheck=true,host=0.0.0.0,discoveryEnabled=false -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump

*****************************************************************************

Comment 16 errata-xmlrpc 2018-05-18 03:54:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1576

