Bug 1559477 - Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception
Summary: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.Un...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.6.z
Assignee: Ruben Vargas Palma
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-22 15:55 UTC by Dan Yocum
Modified: 2018-05-17 07:59 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-17 07:58:35 UTC
Target Upstream Version:
Embargoed:


Attachments
oc logs hawkular-metrics (12.30 KB, application/x-gzip)
2018-03-22 15:55 UTC, Dan Yocum
more logs and a script! (60.19 KB, application/x-gzip)
2018-03-22 21:00 UTC, Dan Yocum


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker HWKMETRICS-753 0 Major Closed Improve SimpleTagQueryParser to avoid regexp when possible 2020-11-10 14:53:37 UTC
Red Hat Issue Tracker HWKMETRICS-757 0 Major Closed Tags queries unnecessarily enrich with full Metric<T> data 2020-11-10 14:53:38 UTC
Red Hat Product Errata RHBA-2018:1579 0 None None None 2018-05-17 07:59:26 UTC

Description Dan Yocum 2018-03-22 15:55:17 UTC
Created attachment 1411793 [details]
oc logs hawkular-metrics

Description of problem:

hawkular-metrics pod is running, but failing to do anything useful.

Logs show this (see attachment for full logs):

2018-03-22 15:42:15,235 ERROR [org.jboss.resteasy.resteasy_jaxrs.i18n] (RxComputationScheduler-1) RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception

<snip>

Caused by: java.io.IOException: Connection reset by peer
...


# oc describe rc/hawkular-metrics
Name:		hawkular-metrics
Namespace:	openshift-infra
Selector:	name=hawkular-metrics
Labels:		metrics-infra=hawkular-metrics
		name=hawkular-metrics
Annotations:	kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"ReplicationController","metadata":{"annotations":{},"creationTimestamp":"2018-03-22T13:40:44Z","generation":2,"labels":{"met...
Replicas:	1 current / 1 desired
Pods Status:	1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:		metrics-infra=hawkular-metrics
			name=hawkular-metrics
  Service Account:	hawkular
  Containers:
   hawkular-metrics:
    Image:	registry.reg-aws.openshift.com:443/openshift3/metrics-hawkular-metrics:v3.6
    Ports:	8080/TCP, 8443/TCP, 8888/TCP
    Command:
      /opt/hawkular/scripts/hawkular-metrics-wrapper.sh
      -b
      0.0.0.0
      -Dhawkular.metrics.cassandra.nodes=hawkular-cassandra
      -Dhawkular.metrics.cassandra.use-ssl
      -Dhawkular.metrics.openshift.auth-methods=openshift-oauth,htpasswd
      -Dhawkular.metrics.openshift.htpasswd-file=/hawkular-account/hawkular-metrics.htpasswd
      -Dhawkular.metrics.allowed-cors-access-control-allow-headers=authorization
      -Dhawkular.metrics.default-ttl=7
      -Dhawkular.metrics.admin-tenant=_hawkular_admin
      -Dhawkular-alerts.cassandra-nodes=hawkular-cassandra
      -Dhawkular-alerts.cassandra-use-ssl
      -Dhawkular.alerts.openshift.auth-methods=openshift-oauth,htpasswd
      -Dhawkular.alerts.openshift.htpasswd-file=/hawkular-account/hawkular-metrics.htpasswd
      -Dhawkular.alerts.allowed-cors-access-control-allow-headers=authorization
      -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
      -Dorg.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH=true
      -Dcom.datastax.driver.FORCE_NIO=true
      -DKUBERNETES_MASTER_URL=https://kubernetes.default.svc
      -DUSER_WRITE_ACCESS=False
    Limits:
      memory:	3Gi
    Requests:
      cpu:	100m
      memory:	3Gi
    Liveness:	exec [/opt/hawkular/scripts/hawkular-metrics-liveness.py] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:	exec [/opt/hawkular/scripts/hawkular-metrics-readiness.py] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:			 (v1:metadata.namespace)
      MASTER_URL:			https://kubernetes.default.svc
      JGROUPS_PASSWORD:			R4GfpSKaCon1SIcE5
      TRUSTSTORE_AUTHORITIES:		/hawkular-metrics-certs/tls.truststore.crt
      OPENSHIFT_KUBE_PING_NAMESPACE:	 (v1:metadata.namespace)
      OPENSHIFT_KUBE_PING_LABELS:	metrics-infra=hawkular-metrics,name=hawkular-metrics
      STARTUP_TIMEOUT:			500
    Mounts:
      /hawkular-account from hawkular-metrics-account (rw)
      /hawkular-metrics-certs from hawkular-metrics-certs (rw)
  Volumes:
   hawkular-metrics-certs:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-certs
    Optional:	false
   hawkular-metrics-account:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	hawkular-metrics-account
    Optional:	false
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----			-------------	--------	------			-------
  21m		21m		1	replication-controller			Normal		SuccessfulDelete	Deleted pod: hawkular-metrics-vx9zl
  20m		20m		1	replication-controller			Normal		SuccessfulCreate	Created pod: hawkular-metrics-98qkv
  9m		9m		1	replication-controller			Normal		SuccessfulDelete	Deleted pod: hawkular-metrics-98qkv
  9m		9m		1	replication-controller			Normal		SuccessfulCreate	Created pod: hawkular-metrics-vtgk1


Version-Release number of selected component (if applicable):

atomic-openshift-3.6.173.0.96-1.git.4.e6301f8.el7.x86_64


How reproducible:

always

Steps to Reproduce:
1. install metrics
2. browse to https://metrics.<cluster>.openshift.com/hawkular/metrics (the backing route and pod can also be checked from the CLI; see the sketch after these steps)
3.
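
A CLI version of the same check, assuming the default openshift-infra project and the standard resource names created by the metrics installer (both are assumptions; adjust to the actual deployment):

# Sketch only: confirm the objects that back the metrics URL before browsing to it.
oc -n openshift-infra get route hawkular-metrics           # route behind metrics.<cluster>.openshift.com
oc -n openshift-infra get pods -l name=hawkular-metrics    # pod should report 1/1 Running and Ready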

Actual results:

Application is not available


Expected results:

The page loads successfully.

Additional info:

Comment 1 Dan Yocum 2018-03-22 15:57:42 UTC
See: https://bugzilla.redhat.com/show_bug.cgi?id=1478410

Comment 2 John Sanda 2018-03-22 18:33:54 UTC
Can you please also provide the following information:

* logs for heapster
* logs for cassandra
* output `oc get pods --all-namespaces | wc -l`
* for each cassandra pod output of:
    * oc -n openshift-infra exec <cassandra pod> nodetool status
    * oc -n openshift-infra exec <cassandra pod> nodetool hawkluar_metrics tablestats
    * oc -n openshift-infra exec <cassandra pod> nodetool hawkular_metrics tablehistograms hawkular_metrics metrics_tags_idx
    * oc -n openshift-infra exec <cassandra pod> nodetool tablehistograms hawkular_metrics data
    * oc -n openshift-infra exec <cassandra pod> nodetool tpstats
    * oc -n openshift-infra exec <cassandra pod> nodetool proxyhistograms


The errors in the logs indicate an HTTP timeout. Hawkular Metrics was taking too long to process the requests. Unfortunately the logs do not tell us which REST endpoints were involved. There is one endpoint in particular that has been problematic with respect to timeouts. We have a couple of upstream bug fixes in the process of being backported to help with this. I will link them to this ticket.
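
A minimal shell sketch for timing one of those requests from outside the cluster (the route hostname is a placeholder; the /hawkular/metrics/status endpoint is an assumption about this deployment):

# Sketch only: time a single request against the Hawkular Metrics route to see
# whether responses are slow enough to run into the HTTP/router timeout.
HAWKULAR_HOST="metrics.<cluster>.openshift.com"   # placeholder
TOKEN="$(oc whoami -t)"

curl -k -s -o /dev/null \
  -w 'HTTP %{http_code} in %{time_total}s\n' \
  --max-time 30 \
  -H "Authorization: Bearer ${TOKEN}" \
  "https://${HAWKULAR_HOST}/hawkular/metrics/status"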

Comment 3 John Sanda 2018-03-22 18:50:50 UTC
(In reply to John Sanda from comment #2)
> Can you please also provide the following information:
> 
> * logs for heapster
> * logs for cassandra
> * output `oc get pods --all-namespaces | wc -l`
> * for each cassandra pod output of:
>     * oc -n openshift-infra exec <cassandra pod> nodetool status
>     * oc -n openshift-infra exec <cassandra pod> nodetool hawkluar_metrics
> tablestats
>     * oc -n openshift-infra exec <cassandra pod> nodetool hawkular_metrics


There is a slight error in the last two commands. The correct commands are:

oc -n openshift-infra exec <cassandra pod> nodetool tablestats hawkular_metrics

oc -n openshift-infra exec <cassandra pod> nodetool tablehistograms hawkular_metrics metrics_tags_idx

Comment 4 Dan Yocum 2018-03-22 21:00:03 UTC
Created attachment 1411852 [details]
more logs and a script!

Comment 10 Dan Yocum 2018-04-02 21:25:39 UTC
I just rolled out a 3.7.23 cluster and I'm having the RESTEasy issue.

Also, I'm seeing this in the hawkular-metrics logs

2018-04-02 21:02:05,196 WARN  [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceListener] (default I/O-1) Error getting project metadata. Code 404: Not Found
2018-04-02 21:02:05,197 INFO  [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceHandler] (default I/O-1) Could not determine a namespace id for namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR.
2018-04-02 21:02:35,166 WARN  [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceListener] (default I/O-1) Error getting project metadata. Code 404: Not Found
2018-04-02 21:02:35,166 INFO  [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceHandler] (default I/O-1) Could not determine a namespace id for namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR.


I'd also like to point out that this version of the rc has these JAVA_OPTS:

-Xms1303m -Xmx1303m

Is this correct? Shouldn't they be 1536m?

Comment 11 Dan Yocum 2018-04-02 23:17:22 UTC
Here are some of the hawkular-image particulars from docker inspect:

 "build-date": "2018-03-29T15:34:13.809374",
 "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/metrics-hawkular-metrics/images/v3.7.42-2",

Comment 12 John Sanda 2018-04-04 14:34:41 UTC
(In reply to Dan Yocum from comment #10)
> I just rolled out a 3.7.23 cluster and I'm having the RestEASY issue. 
> 
> Also, I'm seeing this in the hawkular-metrics logs
> 
> 2018-04-02 21:02:05,196 WARN 
> [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.
> NamespaceListener] (default I/O-1) Error getting project metadata. Code 404:
> Not Found
> 2018-04-02 21:02:05,197 INFO 
> [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.
> NamespaceHandler] (default I/O-1) Could not determine a namespace id for
> namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR.
> 2018-04-02 21:02:35,166 WARN 
> [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.
> NamespaceListener] (default I/O-1) Error getting project metadata. Code 404:
> Not Found
> 2018-04-02 21:02:35,166 INFO 
> [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.
> NamespaceHandler] (default I/O-1) Could not determine a namespace id for
> namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR.

This issue first came up in bug 1506736 and was fixed upstream in https://issues.jboss.org/browse/HWKMETRICS-741. The changes for HWKMETRICS-741 went into OCP 3.9. 

When a project is deleted and pods go into the terminating state, there is still a window in which Heapster can collect and send metrics to Hawkular Metrics for those deleted pods. Hawkular Metrics cannot find the namespace for those metrics, and it returns a 500.
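
That window can be observed from the CLI; a rough sketch (the project name is a placeholder):

# Sketch only: while a project is being torn down, its pods can still show up
# as Terminating, and metrics Heapster collects for them during this window
# arrive after the project metadata lookup has started returning 404, which
# Hawkular Metrics then surfaces as the INTERNAL_SERVER_ERROR lines above.
PROJECT="demo-project"                 # placeholder

oc delete project "${PROJECT}"
oc get pods -n "${PROJECT}"            # pods may still be listed as Terminating
oc get namespace "${PROJECT}" -o name  # eventually fails with NotFound (404)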

> 
> 
> I'd also like to point out that this version of the rc has these JAVA_OPTS:
> 
> -Xms1303m -Xmx1303m
> 
> Is this correct, shouldn't they be 1536??

The heap size is calculated based on the pod's total memory.

Comment 13 John Sanda 2018-04-04 14:54:28 UTC
The RESTEasy error, which is an HTTP timeout, is being addressed in part by https://issues.jboss.org/browse/HWKMETRICS-754. That should be in the next 3.7 release. It can't hurt though for me to take a look at the environment and see if anything else is going on.

Comment 14 Dan Yocum 2018-04-04 21:13:04 UTC
For hawkular-metrics v3.7.40-1, I see these limits in the rc:

        resources:
          limits:
            memory: 3Gi
          requests:
            cpu: 100m
            memory: 3Gi


And this for the heap:

... -Xms1536m -Xmx1536m ...

Is the heap supposed to be 50% of the memory limit (or request)?


I'd like to add that the INTERNAL_SERVER_ERROR errors I saw in comment #10, above, went away when I set the heap sizes to -Xms1536m -Xmx1536m.

Comment 15 John Sanda 2018-04-06 19:30:43 UTC
(In reply to Dan Yocum from comment #14)
> For hawkular-metrics v3.7.40-1, I see these limits in the rc:
> 
>         resources:
>           limits:
>             memory: 3Gi
>           requests:
>             cpu: 100m
>             memory: 3Gi
> 
> 
> And this for the heap:
> 
> ... -Xms1536m -Xmx1536m ...
> 
> Is the heap supposed to be 50% of the memory limit (or request)?
> 
> 
> I'd like to add that the INTERNAL_SERVER_ERROR errors I saw in comment #10,
> above, went away when I set the heap sizes to -Xms1536m -Xmx1536m.

Yes, I believe the heap is supposed to be about 50% of the total memory.
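
For reference, a sketch of that calculation, assuming the heap is sized at roughly half of the container memory limit read from the cgroup v1 limit file (an assumption about the wrapper script, not taken from it):

# Sketch only: derive -Xms/-Xmx as ~50% of the container memory limit.
LIMIT_BYTES=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
LIMIT_MB=$(( LIMIT_BYTES / 1024 / 1024 ))   # 3Gi limit -> 3072 MiB
HEAP_MB=$(( LIMIT_MB / 2 ))                 # 3072 / 2 = 1536
echo "-Xms${HEAP_MB}m -Xmx${HEAP_MB}m"      # -> -Xms1536m -Xmx1536m for a 3Gi limit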

Comment 16 John Sanda 2018-04-18 13:55:44 UTC
Ruben,

Can you take a look and see if the problem here is the base image changing, as in bug 1567827?

Comment 19 Junqi Zhao 2018-05-09 02:47:10 UTC
Tested with metrics-hawkular-metrics-v3.6.173.0.117-1; the hawkular-metrics pod runs well.

# openshift version
openshift v3.6.173.0.117
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-x0wcx   1/1       Running   0          13m
hawkular-metrics-9rc6f       1/1       Running   0          13m
heapster-2b1xp               1/1       Running   0          13m

Xmx and Xms are set to the same value, which is 50% of the hawkular-metrics container memory limit:
=========================================================================
  JBoss Bootstrap Environment
  JBOSS_HOME: /opt/eap
  JAVA: /usr/lib/jvm/java-1.8.0/bin/java
  JAVA_OPTS:  -server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m -Xmx1536m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -javaagent:/opt/jolokia/jolokia.jar=config=/opt/jolokia/etc/jolokia.properties -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/.overlays/layer-base-jboss-eap-7.0.8.CP/org/jboss/logmanager/main/jboss-logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -XX:+UseParallelGC -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump
=========================================================================

Comment 20 Junqi Zhao 2018-05-09 03:04:07 UTC
There is no "RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception" error in the hawkular-metrics pod, and sanity testing passed.

Comment 23 errata-xmlrpc 2018-05-17 07:58:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1579

