Bug 1673178 - node-exporter and kube-state-metrics pods in CrashLoopBackOff
Summary: node-exporter and kube-state-metrics pods in CrashLoopBackOff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-06 20:36 UTC by Sai Sindhur Malleni
Modified: 2020-02-20 02:46 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-20 14:11:12 UTC
Target Upstream Version:
Embargoed:


Links
Red Hat Product Errata RHBA-2019:0326 (last updated 2019-02-20 14:11:14 UTC)

Description Sai Sindhur Malleni 2019-02-06 20:36:15 UTC
Description of problem: After installing a fresh OCP 3.11 cluster on OSP 14, I am seeing that several of the monitoring-related pods are in CrashLoopBackOff:
[openshift@master-0 ~]$ oc get pods --all-namespaces
NAMESPACE               NAME                                                READY     STATUS             RESTARTS   AGE
default                 docker-registry-1-znmb9                             1/1       Running            0          1d
default                 registry-console-1-g59d5                            1/1       Running            0          1d
default                 router-1-t6vtk                                      1/1       Running            0          1d
kube-system             master-api-master-0.openshift.example.com           1/1       Running            0          1d
kube-system             master-controllers-master-0.openshift.example.com   1/1       Running            0          1d
kube-system             master-etcd-master-0.openshift.example.com          1/1       Running            0          1d
openshift-console       console-66549ff897-rd95z                            1/1       Running            0          1d
openshift-infra         bootstrap-autoapprover-0                            1/1       Running            0          1d
openshift-infra         kuryr-cni-ds-bkvfx                                  2/2       Running            0          1d
openshift-infra         kuryr-cni-ds-fk8gg                                  2/2       Running            0          1d
openshift-infra         kuryr-cni-ds-fnb8n                                  2/2       Running            0          1d
openshift-infra         kuryr-cni-ds-nw2th                                  2/2       Running            0          1d
openshift-infra         kuryr-controller-7bdfdf4ddb-cw5gt                   1/1       Running            0          1d
openshift-monitoring    alertmanager-main-0                                 3/3       Running            0          1d
openshift-monitoring    alertmanager-main-1                                 3/3       Running            0          1d
openshift-monitoring    alertmanager-main-2                                 3/3       Running            0          1d
openshift-monitoring    cluster-monitoring-operator-75c6b544dd-r2xxc        1/1       Running            0          1d
openshift-monitoring    grafana-c7d5bc87c-dl69x                             2/2       Running            0          1d
openshift-monitoring    kube-state-metrics-c57bd9dfd-hlghv                  1/3       CrashLoopBackOff   1078       1d
openshift-monitoring    node-exporter-7tbv5                                 1/2       CrashLoopBackOff   540        1d
openshift-monitoring    node-exporter-d8rcx                                 1/2       CrashLoopBackOff   534        1d
openshift-monitoring    node-exporter-ghxrx                                 1/2       CrashLoopBackOff   540        1d
openshift-monitoring    node-exporter-wdv79                                 1/2       CrashLoopBackOff   541        1d
openshift-monitoring    prometheus-k8s-0                                    4/4       Running            1          1d
openshift-monitoring    prometheus-k8s-1                                    4/4       Running            1          1d
openshift-monitoring    prometheus-operator-5b47ff445b-nz8sf                1/1       Running            0          1d
openshift-node          sync-lmksf                                          1/1       Running            0          1d
openshift-node          sync-ls6t5                                          1/1       Running            0          1d
openshift-node          sync-mnj4s                                          1/1       Running            0          1d
openshift-node          sync-qvw9z                                          1/1       Running            0          1d
openshift-web-console   webconsole-787f54c7f8-p2lcv                         1/1       Running            0          1d



Version-Release number of selected component (if applicable):
OCP Enterprise 3.11
[openshift@master-0 ~]$ rpm -qa | grep openshift
atomic-openshift-hyperkube-3.11.77-1.git.0.8baa0fb.el7.x86_64
atomic-openshift-clients-3.11.77-1.git.0.8baa0fb.el7.x86_64
atomic-openshift-3.11.77-1.git.0.8baa0fb.el7.x86_64
atomic-openshift-docker-excluder-3.11.77-1.git.0.8baa0fb.el7.noarch
atomic-openshift-node-3.11.77-1.git.0.8baa0fb.el7.x86_64



How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 3.11
2. oc get pods --all-namespaces

Actual results:
The node-exporter and kube-state-metrics pods are in CrashLoopBackOff.

Expected results:
The monitoring pods should all be in the Running state.

Additional info:
[openshift@master-0 ~]$ oc get ev
LAST SEEN   FIRST SEEN   COUNT     NAME                                                  KIND      SUBOBJECT                               TYPE      REASON    SOURCE                                        MESSAGE
1h          1d           526       kube-state-metrics-c57bd9dfd-hlghv.158047ad2fb3f7f1   Pod       spec.containers{kube-rbac-proxy-main}   Normal    Pulled    kubelet, infra-node-0.openshift.example.com   Container image "registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11" already present on machine
1h          1d           527       node-exporter-7tbv5.158047ad0b9a2181                  Pod       spec.containers{kube-rbac-proxy}        Normal    Pulled    kubelet, app-node-0.openshift.example.com     Container image "registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11" already present on machine
45m         1d           533       node-exporter-wdv79.158047ad159af523                  Pod       spec.containers{kube-rbac-proxy}        Normal    Pulled    kubelet, app-node-1.openshift.example.com     Container image "registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11" already present on machine
39m         1d           533       node-exporter-ghxrx.158047acf3e366bd                  Pod       spec.containers{kube-rbac-proxy}        Normal    Pulled    kubelet, infra-node-0.openshift.example.com   Container image "registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11" already present on machine
29m         1d           12648     kube-state-metrics-c57bd9dfd-hlghv.158047b45eab46e6   Pod       spec.containers{kube-rbac-proxy-main}   Warning   BackOff   kubelet, infra-node-0.openshift.example.com   Back-off restarting failed container
5m          1d           12309     node-exporter-7tbv5.158047ae3c8dd05b                  Pod       spec.containers{kube-rbac-proxy}        Warning   BackOff   kubelet, app-node-0.openshift.example.com     Back-off restarting failed container
5m          1d           12275     node-exporter-ghxrx.158047ae64407027                  Pod       spec.containers{kube-rbac-proxy}        Warning   BackOff   kubelet, infra-node-0.openshift.example.com   Back-off restarting failed container
4m          1d           12164     node-exporter-d8rcx.158047b0a6da0e9d                  Pod       spec.containers{kube-rbac-proxy}        Warning   BackOff   kubelet, master-0.openshift.example.com       Back-off restarting failed container
4m          1d           12180     kube-state-metrics-c57bd9dfd-hlghv.158047b4e3ccfb49   Pod       spec.containers{kube-rbac-proxy-self}   Warning   BackOff   kubelet, infra-node-0.openshift.example.com   Back-off restarting failed container
6s          1d           535       node-exporter-d8rcx.158047ae3e451bee                  Pod       spec.containers{kube-rbac-proxy}        Normal    Pulled    kubelet, master-0.openshift.example.com       Container image "registry.reg-aws.openshift.com:443/openshift3/ose-kube-rbac-proxy:v3.11" already present on machine
3s          1d           12313     node-exporter-wdv79.158047ae4985686b                  Pod       spec.containers{kube-rbac-proxy}        Warning   BackOff   kubelet, app-node-1.openshift.example.com     Back-off restarting failed container


[openshift@master-0 ~]$ oc logs node-exporter-ghxrx -c kube-rbac-proxy                                                                                                                      
F0206 20:28:10.780000  125990 main.go:240] failed to configure http2 server: http2: TLSConfig.CipherSuites index 11 contains an HTTP/2-approved cipher suite (0xc02f), but it comes after unapproved cipher suites. With this configuration, clients that don't support previous, approved cipher suites may be given an unapproved one and reject the connection.
goroutine 1 [running]:
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.stacks(0xc420406b00, 0xc4202b4000, 0x163, 0x1b7)
        /go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:769 +0xcf
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).output(0x1a80580, 0xc400000003, 0xc4200c26e0, 0x19e36de, 0x7, 0xf0, 0x0)
        /go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:720 +0x32d
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).printf(0x1a80580, 0xc400000003, 0x1210047, 0x24, 0xc420457da0, 0x1, 0x1)
        /go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:655 +0x14b
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.Fatalf(0x1210047, 0x24, 0xc420457da0, 0x1, 0x1)
        /go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:1148 +0x67
main.main()
        /go/src/github.com/brancz/kube-rbac-proxy/main.go:240 +0x18fc


[openshift@master-0 ~]$ oc logs kube-state-metrics-c57bd9dfd-hlghv -c kube-rbac-proxy-self
F0206 20:30:04.979538       1 main.go:240] failed to configure http2 server: http2: TLSConfig.CipherSuites index 11 contains an HTTP/2-approved cipher suite (0xc02f), but it comes after unapproved cipher suites. With this configuration, clients that don't support previous, approved cipher suites may be given an unapproved one and reject the connection.
goroutine 1 [running]:
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.stacks(0xc42040f800, 0xc4204ae000, 0x163, 0x1b7)
        /go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:769 +0xcf
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).output(0x1a80580, 0xc400000003, 0xc4200c2630, 0x19e36de, 0x7, 0xf0, 0x0)
        /go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:720 +0x32d
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).printf(0x1a80580, 0xc400000003, 0x1210047, 0x24, 0xc42014dda0, 0x1, 0x1)
        /go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:655 +0x14b
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.Fatalf(0x1210047, 0x24, 0xc42014dda0, 0x1, 0x1)
        /go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:1148 +0x67
main.main()
        /go/src/github.com/brancz/kube-rbac-proxy/main.go:240 +0x18fc
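
This fatal error comes from Go's golang.org/x/net/http2 package: ConfigureServer rejects a tls.Config that lists an HTTP/2-approved cipher suite after suites blacklisted by RFC 7540 Appendix A, and kube-rbac-proxy treats that error as fatal, hence the CrashLoopBackOff. A minimal standalone sketch that reproduces the same check (not kube-rbac-proxy code; the suite choices are only illustrative):

package main

import (
	"crypto/tls"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	// The HTTP/2-approved suite 0xc02f (TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256)
	// listed after a suite that RFC 7540 Appendix A forbids for HTTP/2.
	misordered := &tls.Config{
		CipherSuites: []uint16{
			tls.TLS_RSA_WITH_AES_128_CBC_SHA,          // not HTTP/2-approved
			tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, // approved, but listed too late
		},
	}
	srv := &http.Server{Addr: ":8443", TLSConfig: misordered}

	// ConfigureServer returns the same "contains an HTTP/2-approved cipher
	// suite ... after unapproved cipher suites" error seen in the logs above.
	if err := http2.ConfigureServer(srv, nil); err != nil {
		log.Printf("http2 configuration rejected: %v", err)
	}

	// Listing the approved suite first satisfies the ordering check.
	srv.TLSConfig = &tls.Config{
		CipherSuites: []uint16{
			tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, // approved suites first
			tls.TLS_RSA_WITH_AES_128_CBC_SHA,
		},
	}
	if err := http2.ConfigureServer(srv, nil); err != nil {
		log.Fatalf("unexpected error: %v", err)
	}
	log.Println("http2 configured successfully")
}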

Comment 1 Frederic Branczyk 2019-02-06 20:52:18 UTC
Could you please share a Pod manifest of one of the pods in question? Thanks!

Comment 3 Frederic Branczyk 2019-02-06 21:26:22 UTC
Sorry about the inconvenience. What you are seeing is a defect that was introduced by this pull request: https://github.com/openshift/cluster-monitoring-operator/pull/210. It has already been fixed by https://github.com/openshift/cluster-monitoring-operator/pull/225; we just need to wait for the next OCP z-stream release.
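
The pull requests themselves are not reproduced here, but the shape of such a fix is to hand kube-rbac-proxy a cipher-suite list in which the HTTP/2-approved suites come first. A hypothetical reordering helper, using a deliberately abbreviated "approved" set (an illustration only, not the actual cluster-monitoring-operator change):

package main

import (
	"crypto/tls"
	"fmt"
)

// http2Approved is an illustrative, non-exhaustive set of cipher suites that
// RFC 7540 Appendix A does not blacklist; the real list is longer.
var http2Approved = map[uint16]bool{
	tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256:   true,
	tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256: true,
	tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384:   true,
	tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384: true,
}

// reorderForHTTP2 keeps all configured suites but moves the HTTP/2-approved
// ones to the front, which is enough to satisfy the ordering check that
// crashes kube-rbac-proxy.
func reorderForHTTP2(suites []uint16) []uint16 {
	front := make([]uint16, 0, len(suites))
	back := make([]uint16, 0, len(suites))
	for _, s := range suites {
		if http2Approved[s] {
			front = append(front, s)
		} else {
			back = append(back, s)
		}
	}
	return append(front, back...)
}

func main() {
	// A mis-ordered list similar in shape to the broken configuration.
	in := []uint16{
		tls.TLS_RSA_WITH_AES_128_CBC_SHA,
		tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, // 0xc02f, the suite named in the logs
	}
	for _, s := range reorderForHTTP2(in) {
		fmt.Printf("%#04x\n", s)
	}
}

Configuring only HTTP/2-approved suites in the first place avoids the problem entirely.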

Comment 5 Junqi Zhao 2019-02-11 07:41:39 UTC
Tested with v3.11.82; the issue is not fixed:
# oc get pod -n openshift-monitoring
NAME                                           READY     STATUS             RESTARTS   AGE
alertmanager-main-0                            3/3       Running            0          12m
alertmanager-main-1                            3/3       Running            0          12m
alertmanager-main-2                            3/3       Running            0          12m
cluster-monitoring-operator-548fc4f6d4-pmkfh   1/1       Running            0          13m
grafana-69bb9997f5-ppswq                       2/2       Running            0          13m
kube-state-metrics-946b9f84d-s4hzr             1/3       CrashLoopBackOff   14         12m
node-exporter-h7wlb                            1/2       CrashLoopBackOff   7          12m
node-exporter-nztr7                            1/2       CrashLoopBackOff   7          12m
node-exporter-zdnzg                            1/2       CrashLoopBackOff   7          12m
prometheus-k8s-0                               4/4       Running            1          13m
prometheus-k8s-1                               4/4       Running            1          13m
prometheus-operator-55bbdd949b-wq7bt           1/1       Running            0          13m

Comment 6 Junqi Zhao 2019-02-11 08:37:49 UTC
# oc -n openshift-monitoring logs node-exporter-h7wlb -c kube-rbac-proxy
F0211 08:32:06.230291   75189 main.go:240] failed to configure http2 server: http2: TLSConfig.CipherSuites index 11 contains an HTTP/2-approved cipher suite (0xc02f), but it comes after unapproved cipher suites. With this configuration, clients that don't support previous, approved cipher suites may be given an unapproved one and reject the connection.
goroutine 1 [running]:
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.stacks(0xc4202e6200, 0xc420326000, 0x163, 0x1b7)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:769 +0xcf
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).output(0x1a80580, 0xc400000003, 0xc4200de630, 0x19e36de, 0x7, 0xf0, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:720 +0x32d
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).printf(0x1a80580, 0xc400000003, 0x1210047, 0x24, 0xc420319da0, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:655 +0x14b
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.Fatalf(0x1210047, 0x24, 0xc420319da0, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:1148 +0x67
main.main()
	/go/src/github.com/brancz/kube-rbac-proxy/main.go:240 +0x18fc
****************************************************************************************************************************
# oc -n openshift-monitoring logs kube-state-metrics-946b9f84d-s4hzr -c kube-rbac-proxy-main
F0211 08:33:48.829740       1 main.go:240] failed to configure http2 server: http2: TLSConfig.CipherSuites index 11 contains an HTTP/2-approved cipher suite (0xc02f), but it comes after unapproved cipher suites. With this configuration, clients that don't support previous, approved cipher suites may be given an unapproved one and reject the connection.
goroutine 1 [running]:
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.stacks(0xc420070100, 0xc4205fc000, 0x163, 0x1b7)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:769 +0xcf
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).output(0x1a80580, 0xc400000003, 0xc4200de630, 0x19e36de, 0x7, 0xf0, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:720 +0x32d
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).printf(0x1a80580, 0xc400000003, 0x1210047, 0x24, 0xc42032fda0, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:655 +0x14b
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.Fatalf(0x1210047, 0x24, 0xc42032fda0, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:1148 +0x67
main.main()
	/go/src/github.com/brancz/kube-rbac-proxy/main.go:240 +0x18fc
****************************************************************************************************************************
# oc -n openshift-monitoring logs kube-state-metrics-946b9f84d-s4hzr -c kube-rbac-proxy-self
F0211 08:33:49.430652       1 main.go:240] failed to configure http2 server: http2: TLSConfig.CipherSuites index 11 contains an HTTP/2-approved cipher suite (0xc02f), but it comes after unapproved cipher suites. With this configuration, clients that don't support previous, approved cipher suites may be given an unapproved one and reject the connection.
goroutine 1 [running]:
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.stacks(0xc420411c00, 0xc420436000, 0x163, 0x1b7)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:769 +0xcf
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).output(0x1a80580, 0xc400000003, 0xc4200de630, 0x19e36de, 0x7, 0xf0, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:720 +0x32d
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.(*loggingT).printf(0x1a80580, 0xc400000003, 0x1210047, 0x24, 0xc420317da0, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:655 +0x14b
github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog.Fatalf(0x1210047, 0x24, 0xc420317da0, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/github.com/golang/glog/glog.go:1148 +0x67
main.main()
	/go/src/github.com/brancz/kube-rbac-proxy/main.go:240 +0x18fc

Comment 7 minden 2019-02-11 12:47:58 UTC
@Junqi could you share one of the pod manifests? Just to check if the patch [1] mentioned above ever made it into the images.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/225/files

Comment 8 minden 2019-02-11 13:30:19 UTC
Looking into this some more: the described patch did not make it into the binary properly, due to an issue in our build system. https://github.com/openshift/cluster-monitoring-operator/pull/241 should fix this.

Comment 10 minden 2019-02-13 13:10:27 UTC
I just validated the fix on OpenShift 3.11. Once this gets another code review, we will go ahead and merge.

Comment 11 minden 2019-02-13 13:45:17 UTC
https://github.com/openshift/cluster-monitoring-operator/pull/241 is merged. Would you mind taking another look, Junqi?

Comment 13 Junqi Zhao 2019-02-14 01:50:05 UTC
Tested with ose-cluster-monitoring-operator:v3.11.82-4; the issue is fixed and cluster monitoring works well:
# oc -n openshift-monitoring get po
NAME                                          READY     STATUS    RESTARTS   AGE
alertmanager-main-0                           3/3       Running   0          18m
alertmanager-main-1                           3/3       Running   0          18m
alertmanager-main-2                           3/3       Running   0          17m
cluster-monitoring-operator-98f84d4dd-st5v7   1/1       Running   0          25m
grafana-7fb8d6b4bf-b7nqs                      2/2       Running   0          22m
kube-state-metrics-9bf978578-z6pwz            3/3       Running   0          16m
node-exporter-887fk                           2/2       Running   0          17m
node-exporter-mfdvx                           2/2       Running   0          17m
node-exporter-w6fpx                           2/2       Running   0          17m
prometheus-k8s-0                              4/4       Running   1          21m
prometheus-k8s-1                              4/4       Running   1          20m
prometheus-operator-544d79d996-gmhnb          1/1       Running   0          24m

Comment 16 errata-xmlrpc 2019-02-20 14:11:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326

