Bug 1299466

Summary: heapster pod crashes repeatedly: invalid memory address or nil pointer dereference
Product: OpenShift Container Platform
Component: Hawkular
Version: 3.1.0
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Evgheni Dereveanchin <ederevea>
Assignee: Matt Wringe <mwringe>
QA Contact: chunchen <chunchen>
CC: aos-bugs, asogukpi, caugello, jcantril, mwringe, tdawson, wsun
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-05-12 16:26:49 UTC

Attachments: metrics graphs during issue

Description Evgheni Dereveanchin 2016-01-18 12:38:09 UTC
Description of problem:

After deploying Metrics in a 3.1 environment according to [1] and simulating load on some pods, the heapster pod appears to be crashing constantly.

Version-Release number of selected component (if applicable):
REPOSITORY                                                         TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
registry.access.redhat.com/openshift3/metrics-heapster             3.1.0               800434e62203        3 months ago        228.4 MB

How reproducible:
Every couple of minutes in my test environment, which has a couple of projects and several dozen running pods.

Steps to Reproduce:
1. deploy metrics [1]
2. ensure graphs are visible on pod/metrics page in the WebUI
3. simulate load on the pod (wget in a loop in my case; see the sketch below)
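
A minimal sketch of the load loop from step 3 (the URL is hypothetical; point it at the pod under test or its route):

    # hammer the pod with requests until interrupted
    while true; do
        wget -q -O /dev/null http://pod-under-test.example.com/
    done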

Actual results:

Gaps start appearing in the graphs, and the heapster pod reports errors:

oc logs heapster-31ks6

Starting Heapster with the following arguments: --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=QarkQCrkkm5yV5Y&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy
I0118 07:21:59.943799       1 heapster.go:60] heapster --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=QarkQCrkkm5yV5Y&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy
I0118 07:22:00.040800       1 heapster.go:61] Heapster version 0.18.0
I0118 07:22:00.044948       1 kube_factory.go:168] Using Kubernetes client with master "https://kubernetes.default.svc:443" and version "v1"
I0118 07:22:00.044963       1 kube_factory.go:169] Using kubelet port 10250
I0118 07:22:00.045359       1 driver.go:491] Initialised Hawkular Sink with parameters {_system https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=QarkQCrkkm5yV5Y&filter=label(container_name:^/system.slice.*|^/user.slice) 0xc208194360 }
I0118 07:22:00.925034       1 heapster.go:71] Starting heapster on port 8082
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x68 pc=0x4e6db6]

goroutine 9820 [running]:
k8s.io/heapster/sources.(*kubeNodeMetrics).updateStats(0xc208190630, 0xc20a2783a0, 0x19, 0xc20a9955a0, 0xf, 0xc20a9955c0, 0xf, 0xc20a9953e0, 0xf, 0x3e8, ...)
        /builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/_build/src/k8s.io/heapster/sources/kube_nodes.go:89 +0xc06
k8s.io/heapster/sources.func·002(0xc20a2783a0, 0x19, 0xc20a9955a0, 0xf, 0xc20a9955c0, 0xf, 0xc20a9953e0, 0xf, 0x3e8, 0xc1f15000)
        /builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/_build/src/k8s.io/heapster/sources/kube_nodes.go:106 +0x10a
created by k8s.io/heapster/sources.(*kubeNodeMetrics).getNodesInfo
        /builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/_build/src/k8s.io/heapster/sources/kube_nodes.go:114 +0x4a3

goroutine 1 [IO wait, 3 minutes]:
net.(*pollDesc).Wait(0xc208718680, 0x72, 0x0, 0x0)
        /usr/lib/golang/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc208718680, 0x0, 0x0)
        /usr/lib/golang/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc208718620, 0x0, 0x7f90450c20f0, 0xc208635430)
        /usr/lib/golang/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc20803a270, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/tcpsock_posix.go:234 +0x4e
net/http.tcpKeepAliveListener.Accept(0xc20803a270, 0x0, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/net/http/server.go:1976 +0x4c
crypto/tls.(*listener).Accept(0xc208568d00, 0x0, 0x0, 0x0, 0x0)
        /usr/lib/golang/src/crypto/tls/tls.go:46 +0x6b
net/http.(*Server).Serve(0xc2083912c0, 0x7f90450dd9c8, 0xc208568d00, 0x0, 0x0)
        /usr/lib/golang/src/net/http/server.go:1728 +0x92
net/http.(*Server).ListenAndServeTLS(0xc2083912c0, 0x7fffeb20d5b0, 0x16, 0x7fffeb20d5d1, 0x15, 0x0, 0x0)
        /usr/lib/golang/src/net/http/server.go:1879 +0x55b
main.main()
        /builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/heapster.go:88 +0xa3f

goroutine 5 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0x13e5ee0)
        /builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/Godeps/_workspace/src/github.com/golang/glog/glog.go:879 +0x78
created by github.com/golang/glog.init·1
        /builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/Godeps/_workspace/src/github.com/golang/glog/glog.go:410 +0x2a7

goroutine 17 [syscall, 3 minutes, locked to thread]:
runtime.goexit()
        /usr/lib/golang/src/runtime/asm_amd64.s:2232 +0x1

goroutine 12 [syscall, 3 minutes]:
os/signal.loop()
        /usr/lib/golang/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
        /usr/lib/golang/src/os/signal/signal_unix.go:27 +0x35

goroutine 15 [sleep]:

Expected results:

heapster runs without crashing and the graphs show no interruptions

Additional info:

[1] https://access.redhat.com/documentation/en/openshift-enterprise/3.1/installation-and-configuration/chapter-16-enabling-cluster-metrics

Comment 1 Evgheni Dereveanchin 2016-01-18 12:42:13 UTC
Created attachment 1115843 [details]
metrics graphs during issue

Attaching a screenshot of what I saw in the graphs, which prompted me to investigate further: the pod was idle and stats were being gathered just fine. Then I pointed some traffic at the pod at 13:17 - a spike is visible, after which the heapster pod starts crashing, causing gaps in the metrics data. Around 13:32 the load test was stopped and the heapster pod stopped restarting - note that there are no more gaps in the metrics.

Comment 2 Jeff Cantrill 2016-01-18 16:33:37 UTC
Appears to have been fixed here: https://github.com/kubernetes/heapster/pull/693

Comment 3 Jeff Cantrill 2016-01-18 17:00:55 UTC
Working on patching the previous build with the fix from https://github.com/kubernetes/heapster/pull/693 to minimize overall changes to the heapster release.

Comment 5 Jeff Cantrill 2016-01-18 18:18:46 UTC
Can you edit your deployment config to use the image from https://brewweb.devel.redhat.com/buildinfo?buildID=474926 to see if it resolves the issue?
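
If it helps, the swap can look something like the following (a sketch only: it assumes the heapster pod is managed by a replication controller named heapster, as in the 3.1 metrics deployment, and the pod name is the one from the log above):

    # point the container image at the patched build, then delete the pod
    # so the replication controller recreates it from the updated template
    oc edit rc heapster
    oc delete pod heapster-31ks6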

Comment 6 chunchen 2016-01-19 08:47:50 UTC
Tried with the latest openshift3/metrics-heapster (feddf8a2c405) and hit the errors below in the heapster pod:

[chunchen@F17-CCY daily]$ oc get pod
NAME                              READY     STATUS              RESTARTS   AGE
hawkular-cassandra-1-gmkxd        1/1       Running             0          58m
hawkular-metrics-i6xxs            1/1       Running             0          58m
heapster-ljzj1                    1/1       Running             4          58m

[chunchen@F17-CCY daily]$ curl  -kI -H "Authorization: Bearer `oc whoami -t`" https://hawkular-metrics.0119-7pd.qe.rhcloud.com/hawkular/metrics
HTTP/1.0 200 Connection established
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Accept-Ranges: bytes
ETag: W/"1487-1444516544000"
Last-Modified: Sat, 10 Oct 2015 22:35:44 GMT
Content-Type: text/html
Content-Length: 1487
Date: Tue, 19 Jan 2016 08:47:01 GMT


[chunchen@F17-CCY daily]$ oc logs heapster-ljzj1
<-------------snip-------------->
I0119 02:44:07.240608       1 heapster.go:71] Starting heapster on port 8082
E0119 02:44:50.882027       1 driver.go:234] Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.OperationTimedOutException: [hawkular-cassandra/172.30.58.121:9042] Operation timed out))
E0119 03:30:04.391991       1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
E0119 03:30:04.405392       1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
E0119 03:30:04.462373       1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)

Comment 8 Matt Wringe 2016-01-19 14:30:30 UTC
When you updated the Heapster image, did you make any other changes to your deployment? It seems like this is a problem with Hawkular Metrics being able to communicate with Cassandra.

What exactly is the setup of the pods generating the load? How many of them are you running, and are they all just doing a wget in a loop? How many OpenShift nodes, etc.?
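
To help rule out Cassandra itself, a quick check along these lines may be useful (the pod name is taken from the oc get pod output above; nodetool being on the image's PATH is an assumption):

    # is the Cassandra node up and accepting operations?
    oc exec hawkular-cassandra-1-gmkxd -- nodetool status
    # look for timeouts, GC pauses or dropped mutations in its log
    oc logs hawkular-cassandra-1-gmkxd | tail -n 50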

Comment 9 Jeff Cantrill 2016-01-21 12:45:23 UTC
Moving to ON_QA in hopes of getting a response to https://bugzilla.redhat.com/show_bug.cgi?id=1299466#c8

Comment 18 chunchen 2016-02-03 11:27:09 UTC
Checked with the latest openshift3/metrics-heapster (32b5f5bab5e7) image; neither the nil pointer issue nor the other issue I raised reproduces, so I am marking the bug as verified.

Comment 19 Matt Wringe 2016-02-03 14:01:47 UTC
@chunchen do you know if anything special happened when you hit the second, new error? It's probably something we should see if we can reproduce.

Comment 22 errata-xmlrpc 2016-05-12 16:26:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064