Description of problem:
After deploying Metrics in a 3.1 environment according to [1] and simulating load on some pods, the heapster pod crashes repeatedly.

Version-Release number of selected component (if applicable):
REPOSITORY                                               TAG     IMAGE ID       CREATED        VIRTUAL SIZE
registry.access.redhat.com/openshift3/metrics-heapster   3.1.0   800434e62203   3 months ago   228.4 MB

How reproducible:
Every couple of minutes on my test environment with a couple of projects and several dozen running pods.

Steps to Reproduce:
1. Deploy metrics [1].
2. Ensure graphs are visible on the pod's Metrics page in the web UI.
3. Simulate load on the pod (wget in a loop, in my case).

Actual results:
Gaps start appearing in the graphs and the heapster pod reports errors:

oc logs heapster-31ks6
Mon Jan 18 13:26:04 2016
Starting Heapster with the following arguments: --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=QarkQCrkkm5yV5Y&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy
I0118 07:21:59.943799       1 heapster.go:60] heapster --source=kubernetes:https://kubernetes.default.svc:443?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --sink=hawkular:https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=QarkQCrkkm5yV5Y&filter=label(container_name:^/system.slice.*|^/user.slice) --logtostderr=true --tls_cert=/secrets/heapster.cert --tls_key=/secrets/heapster.key --tls_client_ca=/secrets/heapster.client-ca --allowed_users=system:master-proxy
I0118 07:22:00.040800       1 heapster.go:61] Heapster version 0.18.0
I0118 07:22:00.044948       1 kube_factory.go:168] Using Kubernetes client with master "https://kubernetes.default.svc:443" and version "v1"
I0118 07:22:00.044963       1 kube_factory.go:169] Using kubelet port 10250
I0118 07:22:00.045359       1 driver.go:491] Initialised Hawkular Sink with parameters {_system https://hawkular-metrics:443?tenant=_system&labelToTenant=pod_namespace&caCert=/hawkular-cert/hawkular-metrics-ca.certificate&user=hawkular&pass=QarkQCrkkm5yV5Y&filter=label(container_name:^/system.slice.*|^/user.slice) 0xc208194360 }
I0118 07:22:00.925034       1 heapster.go:71] Starting heapster on port 8082
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x68 pc=0x4e6db6]

goroutine 9820 [running]:
k8s.io/heapster/sources.(*kubeNodeMetrics).updateStats(0xc208190630, 0xc20a2783a0, 0x19, 0xc20a9955a0, 0xf, 0xc20a9955c0, 0xf, 0xc20a9953e0, 0xf, 0x3e8, ...)
	/builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/_build/src/k8s.io/heapster/sources/kube_nodes.go:89 +0xc06
k8s.io/heapster/sources.func·002(0xc20a2783a0, 0x19, 0xc20a9955a0, 0xf, 0xc20a9955c0, 0xf, 0xc20a9953e0, 0xf, 0x3e8, 0xc1f15000)
	/builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/_build/src/k8s.io/heapster/sources/kube_nodes.go:106 +0x10a
created by k8s.io/heapster/sources.(*kubeNodeMetrics).getNodesInfo
	/builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/_build/src/k8s.io/heapster/sources/kube_nodes.go:114 +0x4a3

goroutine 1 [IO wait, 3 minutes]:
net.(*pollDesc).Wait(0xc208718680, 0x72, 0x0, 0x0)
	/usr/lib/golang/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc208718680, 0x0, 0x0)
	/usr/lib/golang/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).accept(0xc208718620, 0x0, 0x7f90450c20f0, 0xc208635430)
	/usr/lib/golang/src/net/fd_unix.go:419 +0x40b
net.(*TCPListener).AcceptTCP(0xc20803a270, 0x0, 0x0, 0x0)
	/usr/lib/golang/src/net/tcpsock_posix.go:234 +0x4e
net/http.tcpKeepAliveListener.Accept(0xc20803a270, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/golang/src/net/http/server.go:1976 +0x4c
crypto/tls.(*listener).Accept(0xc208568d00, 0x0, 0x0, 0x0, 0x0)
	/usr/lib/golang/src/crypto/tls/tls.go:46 +0x6b
net/http.(*Server).Serve(0xc2083912c0, 0x7f90450dd9c8, 0xc208568d00, 0x0, 0x0)
	/usr/lib/golang/src/net/http/server.go:1728 +0x92
net/http.(*Server).ListenAndServeTLS(0xc2083912c0, 0x7fffeb20d5b0, 0x16, 0x7fffeb20d5d1, 0x15, 0x0, 0x0)
	/usr/lib/golang/src/net/http/server.go:1879 +0x55b
main.main()
	/builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/heapster.go:88 +0xa3f

goroutine 5 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0x13e5ee0)
	/builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/Godeps/_workspace/src/github.com/golang/glog/glog.go:879 +0x78
created by github.com/golang/glog.init·1
	/builddir/build/BUILD/heapster-af4752e8f366cdce38d5b768c2f6ec36248ec9e3/Godeps/_workspace/src/github.com/golang/glog/glog.go:410 +0x2a7

goroutine 17 [syscall, 3 minutes, locked to thread]:
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:2232 +0x1

goroutine 12 [syscall, 3 minutes]:
os/signal.loop()
	/usr/lib/golang/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
	/usr/lib/golang/src/os/signal/signal_unix.go:27 +0x35

goroutine 15 [sleep]:

Expected results:
Heapster works without crashing and the graphs have no interruptions.

Additional info:
[1] https://access.redhat.com/documentation/en/openshift-enterprise/3.1/installation-and-configuration/chapter-16-enabling-cluster-metrics
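For step 3, the reporter used wget in a loop; any steady stream of HTTP requests against the pod should reproduce the load. A minimal load-generator sketch in Go, assuming the pod exposes an HTTP endpoint (the URL and worker count below are placeholders, not values from this report):

package main

import (
	"flag"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// Both flag defaults are hypothetical; point -url at your pod's route.
	url := flag.String("url", "http://my-pod.example.com/", "endpoint to load")
	workers := flag.Int("workers", 4, "number of concurrent request loops")
	flag.Parse()

	for i := 0; i < *workers; i++ {
		go func() {
			for {
				resp, err := http.Get(*url)
				if err != nil {
					log.Printf("request failed: %v", err)
					time.Sleep(time.Second)
					continue
				}
				// Drain and close the body so connections get reused.
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
			}
		}()
	}
	select {} // run until interrupted
}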
Created attachment 1115843 [details]
metrics graphs during issue

Attaching a screenshot of what I saw in the graphs that caused me to investigate further: while the pod was idle, stats were gathered just fine. Then I pointed some traffic at the pod at 13:17 - a spike is visible, and then the heapster pod starts crashing, causing gaps in the metrics data. Around 13:32 the load test was stopped and the heapster pod stopped restarting - note that there are no further gaps in the metrics.
Appears to have been fixed here: https://github.com/kubernetes/heapster/pull/693
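The panic in the description originates in updateStats() in kube_nodes.go, which is consistent with a stats pointer from the kubelet being nil under load. A sketch of that failure class and the guard that prevents it (the types here are hypothetical simplifications, not Heapster's actual ones; see the PR for the real diff):

package main

import "fmt"

// Hypothetical, simplified stand-ins for the kubelet response types.
type containerStats struct {
	CPUUsageNanos uint64
}

type kubeletResponse struct {
	Stats *containerStats // can come back nil when the kubelet is under load
}

// cpuUsage demonstrates the guard pattern: check the pointer before
// dereferencing instead of assuming the kubelet always populated it.
func cpuUsage(resp *kubeletResponse) (uint64, bool) {
	if resp == nil || resp.Stats == nil {
		return 0, false
	}
	return resp.Stats.CPUUsageNanos, true
}

func main() {
	// Simulates the crashing case: a response with no stats payload.
	if usage, ok := cpuUsage(&kubeletResponse{}); !ok {
		fmt.Println("no stats for node, skipping sample")
	} else {
		fmt.Println("cpu usage:", usage)
	}
}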
Working on patching the previous build with the fix from https://github.com/kubernetes/heapster/pull/693 to minimize the overall changes to the heapster release.
Can you edit your deployment config to use the image from https://brewweb.devel.redhat.com/buildinfo?buildID=474926 to see if it resolves the issue?
Tried with the latest openshift3/metrics-heapster image (feddf8a2c405); hit the errors below in the heapster pod:

[chunchen@F17-CCY daily]$ oc get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-gmkxd   1/1       Running   0          58m
hawkular-metrics-i6xxs       1/1       Running   0          58m
heapster-ljzj1               1/1       Running   4          58m

[chunchen@F17-CCY daily]$ curl -kI -H "Authorization: Bearer `oc whoami -t`" https://hawkular-metrics.0119-7pd.qe.rhcloud.com/hawkular/metrics
HTTP/1.0 200 Connection established

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Accept-Ranges: bytes
ETag: W/"1487-1444516544000"
Last-Modified: Sat, 10 Oct 2015 22:35:44 GMT
Content-Type: text/html
Content-Length: 1487
Date: Tue, 19 Jan 2016 08:47:01 GMT

[chunchen@F17-CCY daily]$ oc logs heapster-ljzj1
<-------------snip-------------->
I0119 02:44:07.240608       1 heapster.go:71] Starting heapster on port 8082
E0119 02:44:50.882027       1 driver.go:234] Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.OperationTimedOutException: [hawkular-cassandra/172.30.58.121:9042] Operation timed out))
E0119 03:30:04.391991       1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
E0119 03:30:04.405392       1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.58.121:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
E0119 03:30:04.462373       1 driver.go:311] Hawkular returned status code 500, error message: Failed to perform operation due to an error: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
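The 500s above come from Hawkular Metrics failing to reach Cassandra rather than from Heapster itself. A quick diagnostic is to check whether the Cassandra node answers queries at all. A minimal probe sketch, run from inside the cluster since 172.30.0.0/16 is the service network (the address 172.30.58.121:9042 is taken from the log above; the gocql driver, github.com/gocql/gocql, is an assumption and not part of this deployment):

package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Service IP and port of the Cassandra pod, copied from the error log.
	cluster := gocql.NewCluster("172.30.58.121")
	cluster.Port = 9042
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		// Mirrors the "All host(s) tried for query failed" symptom.
		log.Fatalf("cannot connect: %v", err)
	}
	defer session.Close()

	// A trivial read from a system table to confirm the node answers queries.
	var release string
	if err := session.Query("SELECT release_version FROM system.local").Scan(&release); err != nil {
		log.Fatalf("query failed: %v", err)
	}
	log.Printf("Cassandra reachable, release_version=%s", release)
}

If the probe connects but the timeouts persist under load, the hint in the error ("increase the driver number of per-host connections") would point at tuning on the Hawkular Metrics side rather than at Heapster.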
When you updated the Heapster image, did you make any other changes to your deployment? This looks like a problem with Hawkular Metrics communicating with Cassandra. What exactly is the setup of the pod that is generating the load? How many of them are you running, and are they all just doing a wget in a loop? How many OpenShift nodes, etc.?
Moving to ON_QA in hopes of getting a response to https://bugzilla.redhat.com/show_bug.cgi?id=1299466#c8
Checked with the latest openshift3/metrics-heapster image (32b5f5bab5e7): neither the nil pointer issue nor the other issue I raised above reproduces, so I am marking the bug as verified.
@chunchen, do you know if anything special happened when you hit the second, new error? It's probably something we should try to reproduce.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064