Created attachment 1411793 [details] oc logs hawular-metrics Description of problem: hawkular-metrics pod is running, but failing to do anything useful. Logs show this (see attatchment for full logs): 2018-03-22 15:42:15,235 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-1) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception <snip> Caused by: java.io.IOException: Connection reset by peer ... # oc describe rc/hawkular-metrics Name: hawkular-metrics Namespace: openshift-infra Selector: name=hawkular-metrics Labels: metrics-infra=hawkular-metrics name=hawkular-metrics Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"ReplicationController","metadata":{"annotations":{},"creationTimestamp":"2018-03-22T13:40:44Z","generation":2,"labels":{"met... Replicas: 1 current / 1 desired Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed Pod Template: Labels: metrics-infra=hawkular-metrics name=hawkular-metrics Service Account: hawkular Containers: hawkular-metrics: Image: registry.reg-aws.openshift.com:443/openshift3/metrics-hawkular-metrics:v3.6 Ports: 8080/TCP, 8443/TCP, 8888/TCP Command: /opt/hawkular/scripts/hawkular-metrics-wrapper.sh -b 0.0.0.0 -Dhawkular.metrics.cassandra.nodes=hawkular-cassandra -Dhawkular.metrics.cassandra.use-ssl -Dhawkular.metrics.openshift.auth-methods=openshift-oauth,htpasswd -Dhawkular.metrics.openshift.htpasswd-file=/hawkular-account/hawkular-metrics.htpasswd -Dhawkular.metrics.allowed-cors-access-control-allow-headers=authorization -Dhawkular.metrics.default-ttl=7 -Dhawkular.metrics.admin-tenant=_hawkular_admin -Dhawkular-alerts.cassandra-nodes=hawkular-cassandra -Dhawkular-alerts.cassandra-use-ssl -Dhawkular.alerts.openshift.auth-methods=openshift-oauth,htpasswd -Dhawkular.alerts.openshift.htpasswd-file=/hawkular-account/hawkular-metrics.htpasswd -Dhawkular.alerts.allowed-cors-access-control-allow-headers=authorization -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true -Dorg.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true -Dcom.datastax.driver.FORCE_NIO=true -DKUBERNETES_MASTER_URL=https://kubernetes.default.svc -DUSER_WRITE_ACCESS=False Limits: memory: 3Gi Requests: cpu: 100m memory: 3Gi Liveness: exec [/opt/hawkular/scripts/hawkular-metrics-liveness.py] delay=0s timeout=1s period=10s #success=1 #failure=3 Readiness: exec [/opt/hawkular/scripts/hawkular-metrics-readiness.py] delay=0s timeout=1s period=10s #success=1 #failure=3 Environment: POD_NAMESPACE: (v1:metadata.namespace) MASTER_URL: https://kubernetes.default.svc JGROUPS_PASSWORD: R4GfpSKaCon1SIcE5 TRUSTSTORE_AUTHORITIES: /hawkular-metrics-certs/tls.truststore.crt OPENSHIFT_KUBE_PING_NAMESPACE: (v1:metadata.namespace) OPENSHIFT_KUBE_PING_LABELS: metrics-infra=hawkular-metrics,name=hawkular-metrics STARTUP_TIMEOUT: 500 Mounts: /hawkular-account from hawkular-metrics-account (rw) /hawkular-metrics-certs from hawkular-metrics-certs (rw) Volumes: hawkular-metrics-certs: Type: Secret (a volume populated by a Secret) SecretName: hawkular-metrics-certs Optional: false hawkular-metrics-account: Type: Secret (a volume populated by a Secret) SecretName: hawkular-metrics-account Optional: false Events: FirstSeen LastSeen Count From SubObjectPath Type Reason Message --------- -------- ----- ---- ------------- -------- ------ ------- 21m 21m 1 replication-controller Normal SuccessfulDelete Deleted pod: hawkular-metrics-vx9zl 20m 20m 1 replication-controller Normal SuccessfulCreate Created pod: hawkular-metrics-98qkv 9m 9m 1 replication-controller Normal SuccessfulDelete Deleted pod: hawkular-metrics-98qkv 9m 9m 1 replication-controller Normal SuccessfulCreate Created pod: hawkular-metrics-vtgk1 Version-Release number of selected component (if applicable): atomic-openshift-3.6.173.0.96-1.git.4.e6301f8.el7.x86_64 How reproducible: always Steps to Reproduce: 1. install metrics 2. browse to https://metrics.<cluster>.openshift.com/hawkular/metrics 3. Actual results: Application is not available Expected results: succeeds Additional info:
See: https://bugzilla.redhat.com/show_bug.cgi?id=1478410
Can you please also provide the following information: * logs for heapster * logs for cassandra * output `oc get pods --all-namespaces | wc -l` * for each cassandra pod output of: * oc -n openshift-infra exec <cassandra pod> nodetool status * oc -n openshift-infra exec <cassandra pod> nodetool hawkluar_metrics tablestats * oc -n openshift-infra exec <cassandra pod> nodetool hawkular_metrics tablehistograms hawkular_metrics metrics_tags_idx * oc -n openshift-infra exec <cassandra pod> nodetool tablehistograms hawkular_metrics data * oc -n openshift-infra exec <cassandra pod> nodetool tpstats * oc -n openshift-infra exec <cassandra pod> nodetool proxyhistograms The errors in the logs indicate an HTTP timeout. Hawkular Metrics was taking too long to process the requests. Unfortunately the logs do not tell us which REST endpoints were involved. There is one in particular that has been problematic with respect to resulting in timeouts. We have a couple upstream bug fixes that are in the process of being back ported to help with this. I will link to this ticket.
(In reply to John Sanda from comment #2) > Can you please also provide the following information: > > * logs for heapster > * logs for cassandra > * output `oc get pods --all-namespaces | wc -l` > * for each cassandra pod output of: > * oc -n openshift-infra exec <cassandra pod> nodetool status > * oc -n openshift-infra exec <cassandra pod> nodetool hawkluar_metrics > tablestats > * oc -n openshift-infra exec <cassandra pod> nodetool hawkular_metrics There is a slight error in the last two commands. The correct commands are: oc -n openshift-infra exec <cassandra pod> nodetool tablestats hawkluar_metrics oc -n openshift-infra exec <cassandra pod> nodetool hawkular_metrics tablehistograms hawkular_metrics metrics_tags_idx
Created attachment 1411852 [details] more logs and a script!
I just rolled out a 3.7.23 cluster and I'm having the RestEASY issue. Also, I'm seeing this in the hawkular-metrics logs 2018-04-02 21:02:05,196 WARN [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceListener] (default I/O-1) Error getting project metadata. Code 404: Not Found 2018-04-02 21:02:05,197 INFO [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceHandler] (default I/O-1) Could not determine a namespace id for namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR. 2018-04-02 21:02:35,166 WARN [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceListener] (default I/O-1) Error getting project metadata. Code 404: Not Found 2018-04-02 21:02:35,166 INFO [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceHandler] (default I/O-1) Could not determine a namespace id for namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR. I'd also like to point out that this version of the rc has these JAVA_OPTS: -Xms1303m -Xmx1303m Is this correct, shouldn't they be 1536??
Here are some of the hawkular-image particulars from docker inspect: "build-date": "2018-03-29T15:34:13.809374", "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/metrics-hawkular-metrics/images/v3.7.42-2",
(In reply to Dan Yocum from comment #10) > I just rolled out a 3.7.23 cluster and I'm having the RestEASY issue. > > Also, I'm seeing this in the hawkular-metrics logs > > 2018-04-02 21:02:05,196 WARN > [org.hawkular.openshift.auth.org.hawkular.openshift.namespace. > NamespaceListener] (default I/O-1) Error getting project metadata. Code 404: > Not Found > 2018-04-02 21:02:05,197 INFO > [org.hawkular.openshift.auth.org.hawkular.openshift.namespace. > NamespaceHandler] (default I/O-1) Could not determine a namespace id for > namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR. > 2018-04-02 21:02:35,166 WARN > [org.hawkular.openshift.auth.org.hawkular.openshift.namespace. > NamespaceListener] (default I/O-1) Error getting project metadata. Code 404: > Not Found > 2018-04-02 21:02:35,166 INFO > [org.hawkular.openshift.auth.org.hawkular.openshift.namespace. > NamespaceHandler] (default I/O-1) Could not determine a namespace id for > namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR. This issue first came up in bug 1506736 and was fixed upstream in https://issues.jboss.org/browse/HWKMETRICS-741. The changes for HWKMETRICS-741 went into OCP 3.9. When a project is deleted and pods go into the terminating state, there is still a window in which Heapster can collect and send metrics to Hawkular Metrics for those deleted pods. Hawkular Metrics cannot find the namespace for those metrics, and it returns a 500. > > > I'd also like to point out that this version of the rc has these JAVA_OPTS: > > -Xms1303m -Xmx1303m > > Is this correct, shouldn't they be 1536?? The heap size is calculated based on the pod's total memory.
The RESTEasy error, which is an HTTP timeout, is being addressed in part by https://issues.jboss.org/browse/HWKMETRICS-754. That should be in the next 3.7 release. It can't hurt though for me to take a look at the environment and see if anything else is going on.
For hawkular-metrics v3.7.40-1, I see these limits in the rc: resources: limits: memory: 3Gi requests: cpu: 100m memory: 3Gi And this for the heap: ... -Xms1536m -Xmx1536m ... Is the heap supposed to be 50% of the memory limit (or request)? I'd like to add that the INTERNAL_SERVER_ERROR errors I saw in comment #10, above, went away when I set the heap sizes to -Xms1536m -Xmx1536m.
(In reply to Dan Yocum from comment #14) > For hawkular-metrics v3.7.40-1, I see these limits in the rc: > > resources: > limits: > memory: 3Gi > requests: > cpu: 100m > memory: 3Gi > > > And this for the heap: > > ... -Xms1536m -Xmx1536m ... > > Is the heap supposed to be 50% of the memory limit (or request)? > > > I'd like to add that the INTERNAL_SERVER_ERROR errors I saw in comment #10, > above, went away when I set the heap sizes to -Xms1536m -Xmx1536m. Yes, I believe the heap is supposed to be about 50% of the total memory.
Ruben, Can you take a look and see if the problem here is the base image changing like in bug 1567827.
Tested with metrics-hawkular-metrics-v3.6.173.0.117-1, hawkular-metrics pod runs well # openshift version openshift v3.6.173.0.117 kubernetes v1.6.1+5115d708d7 etcd 3.2.1 # oc get po NAME READY STATUS RESTARTS AGE hawkular-cassandra-1-x0wcx 1/1 Running 0 13m hawkular-metrics-9rc6f 1/1 Running 0 13m heapster-2b1xp 1/1 Running 0 13m Xmx and Xms are used the same value, it is 50% of the hawkular-metrics container memory limit ========================================================================= JBoss Bootstrap Environment JBOSS_HOME: /opt/eap JAVA: /usr/lib/jvm/java-1.8.0/bin/java JAVA_OPTS: -server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m -Xmx1536m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -javaagent:/opt/jolokia/jolokia.jar=config=/opt/jolokia/etc/jolokia.properties -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/.overlays/layer-base-jboss-eap-7.0.8.CP/org/jboss/logmanager/main/jboss-logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -XX:+UseParallelGC -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump =========================================================================
There is not error "RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception" in hawkular-metrics pod, and sanity testing is passed
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1579