Bug 1299466
Summary: | heapster pod crashes repeatedly: invalid memory address or nil pointer dereference
---|---
Product: | OpenShift Container Platform
Component: | Hawkular
Version: | 3.1.0
Status: | CLOSED ERRATA
Severity: | medium
Priority: | unspecified
Reporter: | Evgheni Dereveanchin <ederevea>
Assignee: | Matt Wringe <mwringe>
QA Contact: | chunchen <chunchen>
CC: | aos-bugs, asogukpi, caugello, jcantril, mwringe, tdawson, wsun
Hardware: | Unspecified
OS: | Unspecified
Doc Type: | Bug Fix
Type: | Bug
Last Closed: | 2016-05-12 16:26:49 UTC
Description
Evgheni Dereveanchin
2016-01-18 12:38:09 UTC
Created attachment 1115843
metrics graphs during issue
Attaching a screenshot of what I saw in the graphs that caused me to investigate further: the pod was idle and stats were gathered just fine. Then I pointed some traffic at the pod at 13:17. A spike is seen, and then the heapster pod starts crashing, causing gaps in the metrics data. Around 13:32 the load test was stopped and the heapster pod stopped restarting; notice there are no more gaps in the metrics.
Appears to have been fixed here: https://github.com/kubernetes/heapster/pull/693

Working on patching the previous build with the fix from https://github.com/kubernetes/heapster/pull/693 to minimize overall changes to the heapster release.

Can you edit your deployment config to use the image from https://brewweb.devel.redhat.com/buildinfo?buildID=474926 to see if it resolves the issue?

Tried with the latest openshift3/metrics-heapster (feddf8a2c405) and hit the errors below in the heapster pod:

    [chunchen@F17-CCY daily]$ oc get pod
    NAME                         READY     STATUS    RESTARTS   AGE
    hawkular-cassandra-1-gmkxd   1/1       Running   0          58m
    hawkular-metrics-i6xxs       1/1       Running   0          58m
    heapster-ljzj1               1/1       Running   4          58m

    [chunchen@F17-CCY daily]$ curl -kI -H "Authorization: Bearer `oc whoami -t`" https://hawkular-metrics.0119-7pd.qe.rhcloud.com/hawkular/metrics
    HTTP/1.0 200 Connection established

    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Accept-Ranges: bytes
    ETag: W/"1487-1444516544000"
    Last-Modified: Sat, 10 Oct 2015 22:35:44 GMT
    Content-Type: text/html
    Content-Length: 1487
    Date: Tue, 19 Jan 2016 08:47:01 GMT

    [chunchen@F17-CCY daily]$ oc logs heapster-ljzj1
    <-------------snip-------------->
    I0119 02:44:07.240608 1 heapster.go:71] Starting heapster on port 8082
    E0119 02:44:50.882027 1 driver.go:234] Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.OperationTimedOutException: [hawkular-cassandra/172.30.58.121:9042] Operation timed out))
    E0119 03:30:04.391991 1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
    E0119 03:30:04.405392 1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
    E0119 03:30:04.462373 1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)

When you updated the Heapster image, did you make any other changes to your deployment? It seems like this is a problem with Hawkular Metrics not being able to communicate with Cassandra.

What exactly is your setup for the pod that is generating the load? How many of them are you running, and are they all just doing a wget in a loop? How many OpenShift nodes are there, etc.?

Moving to ON_QA in hopes of getting a response to https://bugzilla.redhat.com/show_bug.cgi?id=1299466#c8

Checked with the latest openshift3/metrics-heapster (32b5f5bab5e7) image: neither the nil pointer issue nor the other error I reported reproduces, so I am marking the bug as verified.

@chunchen do you know if anything special happened when you got the second, new error? It's probably something we should try to reproduce.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064
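When triaging heapster output like the `oc logs` excerpt quoted in this report, it can help to tally error entries by the source location that emitted them (for example `driver.go:311`). The following is a minimal sketch in Python, not part of the original report or of heapster itself. It assumes the log lines use the glog-style header seen in the excerpt (severity letter, month and day, time, thread id, `file:line]`). The function name `count_errors_by_site` and the `SAMPLE` text are hypothetical, and the sample messages are shortened versions of the lines quoted in the report.

```python
import re
from collections import Counter

# glog-style header: severity letter, mmdd, time, thread id, file:line, "] message".
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def count_errors_by_site(log_text):
    """Count error (E) and fatal (F) entries per 'file:line' call site."""
    counts = Counter()
    for raw in log_text.splitlines():
        match = GLOG_LINE.match(raw.strip())
        # Lines that do not match the header (e.g. snip markers) are skipped.
        if match and match.group("sev") in ("E", "F"):
            counts[f"{match.group('file')}:{match.group('line')}"] += 1
    return counts


# Shortened copies of the log lines quoted in the report (messages truncated).
SAMPLE = """\
<-------------snip-------------->
I0119 02:44:07.240608 1 heapster.go:71] Starting heapster on port 8082
E0119 02:44:50.882027 1 driver.go:234] Could not update tags: Hawkular returned status code 500
E0119 03:30:04.391991 1 driver.go:311] Hawkular returned status code 500
E0119 03:30:04.405392 1 driver.go:311] Hawkular returned status code 500
"""

if __name__ == "__main__":
    # Print call sites from most to least frequent.
    for site, n in count_errors_by_site(SAMPLE).most_common():
        print(site, n)
```

In practice the input would be the saved output of `oc logs <heapster pod>`. A sudden concentration of errors at one call site, as with `driver.go:311` here, suggests a single failing operation (writes to Hawkular in this report) rather than unrelated faults.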