Bug 1559477
Summary: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception

| Field | Value |
| --- | --- |
| Product | OpenShift Container Platform |
| Component | Hawkular |
| Version | 3.6.1 |
| Target Release | 3.6.z |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | unspecified |
| Type | Bug |
| Reporter | Dan Yocum <dyocum> |
| Assignee | Ruben Vargas Palma <rvargasp> |
| QA Contact | Junqi Zhao <juzhao> |
| CC | aos-bugs, dyocum |
| Keywords | OnlineDedicated, OpsBlocker |
| Hardware | Unspecified |
| OS | Unspecified |
| Last Closed | 2018-05-17 07:58:35 UTC |
Description
Dan Yocum, 2018-03-22 15:55:17 UTC
John Sanda (comment #2):

Can you please also provide the following information:

* logs for heapster
* logs for cassandra
* output of `oc get pods --all-namespaces | wc -l`
* for each cassandra pod, the output of:
  * `oc -n openshift-infra exec <cassandra pod> nodetool status`
  * `oc -n openshift-infra exec <cassandra pod> nodetool hawkluar_metrics tablestats`
  * `oc -n openshift-infra exec <cassandra pod> nodetool hawkular_metrics tablehistograms hawkular_metrics metrics_tags_idx`
  * `oc -n openshift-infra exec <cassandra pod> nodetool tablehistograms hawkular_metrics data`
  * `oc -n openshift-infra exec <cassandra pod> nodetool tpstats`
  * `oc -n openshift-infra exec <cassandra pod> nodetool proxyhistograms`

The errors in the logs indicate an HTTP timeout: Hawkular Metrics was taking too long to process the requests. Unfortunately, the logs do not tell us which REST endpoints were involved. One endpoint in particular has been problematic with respect to timeouts. We have a couple of upstream bug fixes that are in the process of being backported to help with this; I will link them to this ticket.

(In reply to John Sanda from comment #2)
> [...]
> * oc -n openshift-infra exec <cassandra pod> nodetool hawkluar_metrics tablestats
> * oc -n openshift-infra exec <cassandra pod> nodetool hawkular_metrics tablehistograms hawkular_metrics metrics_tags_idx

There is a slight error in the last two commands. The correct commands are:

    oc -n openshift-infra exec <cassandra pod> nodetool tablestats hawkular_metrics
    oc -n openshift-infra exec <cassandra pod> nodetool tablehistograms hawkular_metrics metrics_tags_idx

Created attachment 1411852 [details]: more logs and a script!
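The per-pod diagnostics requested earlier in this ticket are tedious to type by hand. As a sketch (not part of the ticket), a small shell helper can print the corrected nodetool command set for every Cassandra pod; selecting pods by grepping for "cassandra" is an assumption about pod naming, so review the printed commands before piping them to `sh`.

```shell
#!/bin/sh
# Sketch only: generate the nodetool diagnostics requested above for each
# Cassandra pod in openshift-infra. The "grep cassandra" pod filter is an
# assumption about how the pods are named.

# Emit the (corrected) command set for one pod. Pure shell, no cluster
# access, so the commands can be reviewed before anything is executed.
nodetool_cmds() {
  pod=$1
  cat <<EOF
oc -n openshift-infra exec $pod -- nodetool status
oc -n openshift-infra exec $pod -- nodetool tablestats hawkular_metrics
oc -n openshift-infra exec $pod -- nodetool tablehistograms hawkular_metrics metrics_tags_idx
oc -n openshift-infra exec $pod -- nodetool tablehistograms hawkular_metrics data
oc -n openshift-infra exec $pod -- nodetool tpstats
oc -n openshift-infra exec $pod -- nodetool proxyhistograms
EOF
}

# Dry run: print the commands for every Cassandra pod; pipe the output to
# "sh" to actually run them. Guarded so the script is a no-op without oc.
if command -v oc >/dev/null 2>&1; then
  for pod in $(oc -n openshift-infra get pods -o name | grep cassandra); do
    nodetool_cmds "${pod#pod/}"
  done
fi
```

Printing the commands first (instead of running them directly) makes it easy to spot argument-order mistakes like the one corrected above.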
Dan Yocum (comment #10):

I just rolled out a 3.7.23 cluster and I'm having the RESTEasy issue. Also, I'm seeing this in the hawkular-metrics logs:

```
2018-04-02 21:02:05,196 WARN [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceListener] (default I/O-1) Error getting project metadata. Code 404: Not Found
2018-04-02 21:02:05,197 INFO [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceHandler] (default I/O-1) Could not determine a namespace id for namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR.
2018-04-02 21:02:35,166 WARN [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceListener] (default I/O-1) Error getting project metadata. Code 404: Not Found
2018-04-02 21:02:35,166 INFO [org.hawkular.openshift.auth.org.hawkular.openshift.namespace.NamespaceHandler] (default I/O-1) Could not determine a namespace id for namespace. Cannot process request. Returning an INTERNAL_SERVER_ERROR.
```

I'd also like to point out that this version of the rc has these JAVA_OPTS: `-Xms1303m -Xmx1303m`. Is this correct? Shouldn't they be 1536?

Here are some of the hawkular-metrics image particulars from `docker inspect`:

```
"build-date": "2018-03-29T15:34:13.809374",
"url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift3/metrics-hawkular-metrics/images/v3.7.42-2",
```

(In reply to Dan Yocum from comment #10)
> I just rolled out a 3.7.23 cluster and I'm having the RESTEasy issue.
>
> Also, I'm seeing this in the hawkular-metrics logs
> [log excerpt quoted above]

This issue first came up in bug 1506736 and was fixed upstream in https://issues.jboss.org/browse/HWKMETRICS-741. The changes for HWKMETRICS-741 went into OCP 3.9. When a project is deleted and its pods go into the terminating state, there is still a window in which Heapster can collect and send metrics to Hawkular Metrics for those deleted pods. Hawkular Metrics cannot find the namespace for those metrics, and it returns a 500.

> I'd also like to point out that this version of the rc has these JAVA_OPTS:
> -Xms1303m -Xmx1303m
>
> Is this correct, shouldn't they be 1536??

The heap size is calculated based on the pod's total memory. The RESTEasy error, which is an HTTP timeout, is being addressed in part by https://issues.jboss.org/browse/HWKMETRICS-754. That should be in the next 3.7 release. It can't hurt, though, for me to take a look at the environment and see if anything else is going on.

Dan Yocum (comment #14):

For hawkular-metrics v3.7.40-1, I see these limits in the rc:

```yaml
resources:
  limits:
    memory: 3Gi
  requests:
    cpu: 100m
    memory: 3Gi
```

And this for the heap: `... -Xms1536m -Xmx1536m ...`

Is the heap supposed to be 50% of the memory limit (or request)?

I'd like to add that the INTERNAL_SERVER_ERROR errors I saw in comment #10, above, went away when I set the heap sizes to -Xms1536m -Xmx1536m.

(In reply to Dan Yocum from comment #14)
> For hawkular-metrics v3.7.40-1, I see these limits in the rc:
> [limits quoted above]
>
> And this for the heap:
>
> ... -Xms1536m -Xmx1536m ...
> Is the heap supposed to be 50% of the memory limit (or request)?
>
> I'd like to add that the INTERNAL_SERVER_ERROR errors I saw in comment #10, above, went away when I set the heap sizes to -Xms1536m -Xmx1536m.

Yes, I believe the heap is supposed to be about 50% of the total memory.

Ruben, can you take a look and see if the problem here is the base image changing, like in bug 1567827?

Tested with metrics-hawkular-metrics-v3.6.173.0.117-1; the hawkular-metrics pod runs well.

```
# openshift version
openshift v3.6.173.0.117
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-x0wcx   1/1       Running   0          13m
hawkular-metrics-9rc6f       1/1       Running   0          13m
heapster-2b1xp               1/1       Running   0          13m
```

Xmx and Xms are set to the same value, which is 50% of the hawkular-metrics container memory limit:

```
=========================================================================

  JBoss Bootstrap Environment

  JBOSS_HOME: /opt/eap

  JAVA: /usr/lib/jvm/java-1.8.0/bin/java

  JAVA_OPTS: -server -verbose:gc -Xloggc:"/opt/eap/standalone/log/gc.log" -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=3M -XX:-TraceClassUnloading -Xms1536m -Xmx1536m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=true -Djboss.modules.system.pkgs=org.jboss.logmanager,jdk.nashorn.api -Djava.awt.headless=true -javaagent:/opt/jolokia/jolokia.jar=config=/opt/jolokia/etc/jolokia.properties -Xbootclasspath/p:/opt/eap/jboss-modules.jar:/opt/eap/modules/system/layers/base/.overlays/layer-base-jboss-eap-7.0.8.CP/org/jboss/logmanager/main/jboss-logmanager-2.0.7.Final-redhat-1.jar:/opt/eap/modules/system/layers/base/org/jboss/logmanager/ext/main/jboss-logmanager-ext-1.0.0.Alpha2-redhat-1.jar -Djava.util.logging.manager=org.jboss.logmanager.LogManager -XX:+UseParallelGC -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump

=========================================================================
```

There is no "RESTEASY002020: Unhandled asynchronous exception, sending back 500: org.jboss.resteasy.spi.UnhandledException: RESTEASY003770: Response is committed, can't handle exception" error in the hawkular-metrics pod, and sanity testing passed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1579
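A closing note on the heap-sizing rule confirmed in the comments above (heap is about 50% of the container memory limit): the arithmetic below is only a sketch of that rule, not the image's actual startup script, but it reproduces the observed values.

```shell
#!/bin/sh
# Sketch of the heap-sizing rule discussed above: -Xms/-Xmx are set to
# roughly 50% of the container memory limit. A 3Gi limit gives the 1536m
# heap observed for hawkular-metrics v3.7.40-1.
limit_mib=3072                 # 3Gi memory limit, expressed in MiB
heap_mib=$((limit_mib / 2))    # the 50% rule
echo "-Xms${heap_mib}m -Xmx${heap_mib}m"   # prints -Xms1536m -Xmx1536m
```

The 1303m heap reported earlier in the ticket would, under the same 50% rule, correspond to a smaller effective memory limit at the time, which is consistent with the advice to check the rc's resource limits when the heap looks wrong.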