Bug 1674372 - Prometheus unable to scrape kubelet metrics "x509: certificate signed by unknown authority"
Description Junqi Zhao 2019-02-11 07:20:44 UTC
Description of problem:
`oc adm top pod` only reports metrics for pods running on master nodes. For example,
the following command shows metrics for only these pods:
$ oc adm top pod -n openshift-monitoring
NAME                                           CPU(cores)   MEMORY(bytes)   
cluster-monitoring-operator-8499bf9b58-m6dqk   1m           27Mi            
node-exporter-2qfxz                            0m           21Mi            
node-exporter-7v7sl                            3m           22Mi            
node-exporter-vb7f5                            4m           23Mi    

These pods are all on master nodes:
$ oc get pod -n openshift-monitoring -o wide
alertmanager-main-0 3/3 Running 0 5h7m ip-10-0-138-64.us-east-2.compute.internal <none>
alertmanager-main-1 3/3 Running 0 5h7m ip-10-0-152-149.us-east-2.compute.internal <none>
alertmanager-main-2 3/3 Running 0 5h7m ip-10-0-171-54.us-east-2.compute.internal <none>
cluster-monitoring-operator-8499bf9b58-m6dqk 1/1 Running 0 5h12m ip-10-0-37-21.us-east-2.compute.internal <none>
grafana-78765ddcc7-p4vhn 2/2 Running 0 5h11m ip-10-0-152-149.us-east-2.compute.internal <none>
kube-state-metrics-67479bfb84-dmpb9 3/3 Running 0 5h6m ip-10-0-152-149.us-east-2.compute.internal <none>
node-exporter-2qfxz 2/2 Running 0 5h6m ip-10-0-2-30.us-east-2.compute.internal <none>
node-exporter-496nm 2/2 Running 0 5h6m ip-10-0-152-149.us-east-2.compute.internal <none>
node-exporter-7v7sl 2/2 Running 0 5h6m ip-10-0-24-53.us-east-2.compute.internal <none>
node-exporter-jm9sp 2/2 Running 0 5h6m ip-10-0-171-54.us-east-2.compute.internal <none>
node-exporter-qrdlg 2/2 Running 0 5h6m ip-10-0-138-64.us-east-2.compute.internal <none>
node-exporter-vb7f5 2/2 Running 0 5h6m ip-10-0-37-21.us-east-2.compute.internal <none>
prometheus-adapter-78bd784f5d-n8xgd 1/1 Running 0 6m9s ip-10-0-171-54.us-east-2.compute.internal <none>
prometheus-k8s-0 6/6 Running 1 5h9m ip-10-0-171-54.us-east-2.compute.internal <none>
prometheus-k8s-1 6/6 Running 1 5h9m ip-10-0-152-149.us-east-2.compute.internal <none>
prometheus-operator-6cbfc9949-kcllm 1/1 Running 0 5h12m ip-10-0-152-149.us-east-2.compute.internal <none>
telemeter-client-6b7dd49d98-jxxrn 3/3 Running 0 5h6m ip-10-0-152-149.us-east-2.compute.internal <none>

$ oc get node
ip-10-0-138-64.us-east-2.compute.internal Ready worker 3h53m v1.12.4+bdfe8e3f3a
ip-10-0-152-149.us-east-2.compute.internal Ready worker 3h53m v1.12.4+bdfe8e3f3a
ip-10-0-171-54.us-east-2.compute.internal Ready worker 3h53m v1.12.4+bdfe8e3f3a
ip-10-0-2-30.us-east-2.compute.internal Ready master 4h7m v1.12.4+bdfe8e3f3a
ip-10-0-24-53.us-east-2.compute.internal Ready master 4h7m v1.12.4+bdfe8e3f3a
ip-10-0-37-21.us-east-2.compute.internal Ready master 4h7m v1.12.4+bdfe8e3f3a

Version-Release number of selected component (if applicable):
$ oc version
oc v4.0.0-0.168.0
kubernetes v1.12.4+bdfe8e3f3a
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:

Steps to Reproduce:
1. `oc adm top pod`

Actual results:
`oc adm top pod` only reports metrics for pods on master nodes

Expected results:
Metrics should be reported for all pods

Additional info:

Comment 1 Junqi Zhao 2019-02-11 08:50:07 UTC
The reason may be that the 10250/metrics and 10250/metrics/cadvisor targets for the worker nodes are all down; see the attached picture targets_down.png.

The error is:

x509: certificate signed by unknown authority

Comment 2 Junqi Zhao 2019-02-11 08:50:23 UTC
Created attachment 1528904 [details]

Comment 3 Seth Jennings 2019-02-11 16:25:11 UTC
This is a new regression after https://github.com/openshift/cluster-kube-controller-manager-operator/pull/132.

The issue is that Prometheus is not being updated to trust the CA that the kube-controller-manager uses to sign the kubelet server CSRs.

Working to resolve this...
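
The failure mode described in this comment can be reproduced outside the cluster. The following is a minimal sketch, not the actual cluster configuration: it assumes the openssl CLI is installed, and the helper names (make_ca, sign_leaf, trusts) and any CA names passed to them are made up for illustration. A leaf certificate signed by one CA fails verification against a trust bundle that contains only a different CA, which is the situation Prometheus is in when its CA bundle does not contain the CA that signed the kubelet serving certificates.

```shell
#!/bin/sh
# Illustrative helpers only; none of these names exist in OpenShift.

# make_ca DIR NAME: create a throwaway self-signed CA (DIR/NAME.key, DIR/NAME.crt).
make_ca() {
    openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
        -subj "/CN=$2" -keyout "$1/$2.key" -out "$1/$2.crt" 2>/dev/null
}

# sign_leaf DIR CA_NAME: issue DIR/leaf.crt, signed by the CA called CA_NAME.
sign_leaf() {
    openssl req -new -newkey rsa:2048 -nodes -subj "/CN=kubelet-demo" \
        -keyout "$1/leaf.key" -out "$1/leaf.csr" 2>/dev/null &&
    openssl x509 -req -days 1 -in "$1/leaf.csr" \
        -CA "$1/$2.crt" -CAkey "$1/$2.key" -CAcreateserial \
        -out "$1/leaf.crt" 2>/dev/null
}

# trusts DIR CA_NAME: succeed only if DIR/leaf.crt chains to the CA called CA_NAME.
trusts() {
    openssl verify -CAfile "$1/$2.crt" "$1/leaf.crt" >/dev/null 2>&1
}
```

For example, after make_ca "$dir" old-ca, make_ca "$dir" new-ca and sign_leaf "$dir" new-ca, the check trusts "$dir" new-ca succeeds while trusts "$dir" old-ca fails, because old-ca never signed the leaf certificate.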

Comment 4 Junqi Zhao 2019-02-12 02:03:50 UTC
BTW, after the cluster has been running for a few hours:
$ oc adm top node
error: You must be logged in to the server (Unauthorized)
$ oc adm top po
error: You must be logged in to the server (Unauthorized)

I think this is also related to the x509 issue.

Comment 5 Seth Jennings 2019-02-12 14:34:33 UTC
*** Bug 1674368 has been marked as a duplicate of this bug. ***

Comment 6 Seth Jennings 2019-02-12 21:44:56 UTC
To fix this, Monitoring is going to have to add code to watch the csr-controller-ca configmap in openshift-config-managed namespace, update prometheus config, and reload prometheus.
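
The steps in this comment (pick up changes to the CA configmap, update the Prometheus config, reload) can be sketched in shell. This is only an illustration of the idea, not the code that went into cluster-monitoring-operator: the function names and file paths are hypothetical, the configmap key name ca-bundle.crt is an assumption, the example polls instead of using a real watch, and the reload call assumes Prometheus runs with its lifecycle API enabled.

```shell
#!/bin/sh
# Hypothetical helpers; the names and paths are illustrative only.

# ca_bundle_changed NEW APPLIED: succeed when the freshly fetched bundle NEW
# differs from APPLIED, the copy Prometheus was last configured with.
ca_bundle_changed() {
    new_bundle=$1
    applied_bundle=$2
    # No applied copy yet counts as a change.
    [ ! -f "$applied_bundle" ] && return 0
    ! cmp -s "$new_bundle" "$applied_bundle"
}

# sync_ca_bundle NEW APPLIED RELOAD_CMD: install NEW over APPLIED and run
# RELOAD_CMD, but only when the bundle actually changed.
sync_ca_bundle() {
    reload_cmd=$3
    if ca_bundle_changed "$1" "$2"; then
        cp "$1" "$2" && "$reload_cmd"
    fi
}

# Example wiring (not executed here). Assumptions: the bundle is stored under
# the key "ca-bundle.crt", and Prometheus listens on localhost:9090 with
# --web.enable-lifecycle so that POST /-/reload is available.
#
#   reload_prometheus() { curl -fsS -X POST http://localhost:9090/-/reload; }
#   while true; do
#       oc get configmap csr-controller-ca -n openshift-config-managed \
#           -o jsonpath='{.data.ca-bundle\.crt}' > /tmp/ca.new &&
#           sync_ca_bundle /tmp/ca.new /tmp/ca.applied reload_prometheus
#       sleep 60
#   done
```

Comparing the fetched bundle with the last applied copy keeps the reload idempotent: Prometheus is only reloaded when the CA bundle really changes, for example after a CA rotation.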

Comment 9 Seth Jennings 2019-02-15 15:47:06 UTC
*** Bug 1674341 has been marked as a duplicate of this bug. ***

Comment 14 Junqi Zhao 2019-02-19 03:32:01 UTC
We should test with OCP images. Tested with the following configuration and still see "x509: certificate signed by unknown authority" for worker nodes;
see the attached picture.
Assigning it back.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-18-224151   True        False         57m     Cluster version is 4.0.0-0.nightly-2019-02-18-224151

configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:24eb3125b5fec17e2db68b7fcd406d5aecba67ebe6da18fbd9c2c7e884ce00f8
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2d0d8d43b79fb970a7a090a759da06aebb1dec7e31fffd2d3ed455f92a998522
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31905d24b331859b99852c6f4ef916539508bfb61f443c94e0f46a83093f7dc0
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3f0b3aa9c8923c95233f2872a6d4842796ab202a91faa8595518ad6a154f1d87
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:580e5a5cd057e2c09ea132fed5c75b59423228587631dcd47f9471b0d1f9a872
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b4ba55ab5ec5bb1b4c024a7b99bc67fe108a28e564288734f9884bc1055d4ed
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5de207bf1cdbdcbe54fe97684d6b3aaf9d362a46f7d0a7af1e989cdf57b59599
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8b88bd937ccf01b9cb2584ceb45b829406ebc3b35201f73eead00605b4fdfc
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b50f38e8f288fdba31527bfcb631d0a15bb2c9409631ef30275f5483946aba6f
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c6cbfe8c7034edf8d0df1df4208543fe5f37a8ad306eaf736bcd7c1cbb999ffc
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:efe301356a6f40679e27e6e8287ed6d8316e54410415f4f3744f3182c1d4e07e
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest

RHCOS build: 47.318

# oc get node -o wide | grep worker | awk '{print $3"    "$6}'

Comment 15 Junqi Zhao 2019-02-19 03:32:38 UTC
Created attachment 1536207 [details]
"x509: certificate signed by unknown authority" for worker node

Comment 16 Junqi Zhao 2019-02-19 09:45:57 UTC
Tested with 
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-19-024716   True        False         50m     Cluster version is 4.0.0-0.nightly-2019-02-19-024716

configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:037fa98f23ff812b6861675127d52eea43caa44bb138e7fe41c7199cb8d4d634
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31905d24b331859b99852c6f4ef916539508bfb61f443c94e0f46a83093f7dc0
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:36f168dc7fc6ada9af0f2eeb88f394f2e7311340acc25f801830fe509fd93911
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534a71a355e3b9c79ef5a192a200730b8641f5e266abe290b6f7c6342210d8a0
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b4ba55ab5ec5bb1b4c024a7b99bc67fe108a28e564288734f9884bc1055d4ed
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5bc582cfbe8b24935e4f9ee1fe6660e13353377473e09a63b51d4e3d24a7ade3
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6cb6cd27a308c2ae9e0c714c8633792cc151e17312bd74da45255980eabf5ecf
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8675adb4a2a367c9205e3879b986da69400b9187df7ac3f3fbf9882e6a356252
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9021d3e9ce028fc72301f8e0a40c37e488db658e1500a790c794bfd38903bef1
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba01869048bf44fc5e8c57f0a34369750ce27e3fb0b5eb47c78f42022640154c
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee79721af3078dfbcfaa75e9a47da1526464cf6685a7f4195ea214c840b59e9f
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest

The issue is partly fixed: for the openshift-monitoring/kubelet/1 target, the 10250/metrics/cadvisor endpoints are Down for master nodes.
# oc get node -o wide | grep master | awk '{print $3"    "$6}'

FYI, masters do not expose a public IP now; not sure whether that is related.

Comment 17 Junqi Zhao 2019-02-19 09:47:35 UTC
Created attachment 1536277 [details]
endpoint 10250/metrics/cadvisor are Down for master nodes

Comment 18 lserven 2019-02-19 09:55:13 UTC
Great, this first PR was only expected to fix scraping for worker nodes. As noted in the PR:

These changes fix scraping of all kubelets on worker nodes, however, scraping
master kubelets will be broken until
openshift/cluster-kube-apiserver-operator#247 lands and
makes it into the installer. Once that is in, we can change the CA configmap to

I’ll post a follow up shortly to fix scraping of master node kubelets and update this BZ.

Comment 19 Junqi Zhao 2019-02-19 10:21:24 UTC
kubelet_running_pod_count cannot count the pods on master nodes; this should also have to wait until openshift/cluster-kube-apiserver-operator#247 lands and makes it into the installer.

The result is "No datapoints found."

# oc get node -o wide | grep master | awk '{print $3"    "$6}'
master (

Comment 20 lserven 2019-02-19 16:53:27 UTC
#247 landed and I created the follow up PR to be able to scrape both master AND worker kubelets: https://github.com/openshift/cluster-monitoring-operator/pull/256

Comment 21 lserven 2019-02-19 18:09:40 UTC
PR is merged. Please review, Junqi Zhao.

Comment 22 Junqi Zhao 2019-02-20 07:48:46 UTC
For pods on master nodes, the metrics diagrams in Pod Overview are empty.
See the attached picture "empty metrics diagrams from Pod Overview for pods on master":
cluster-monitoring-operator-549bdc94fb-svvpr is on a master, and its metrics diagrams in Pod Overview are empty.

This issue is also related to Comment 18
$ oc -n openshift-monitoring get pod -o wide | grep cluster-monitoring-operator-549bdc94fb-svvpr
cluster-monitoring-operator-549bdc94fb-svvpr   1/1     Running   0          4h26m    ip-10-0-174-27.us-east-2.compute.internal    <none>

$ oc get node -o wide | grep master | grep ip-10-0-174-27.us-east-2.compute.internal
ip-10-0-174-27.us-east-2.compute.internal    Ready    master   4h42m   v1.12.4+ec459b84aa    <none>        Red Hat CoreOS 4.0   3.10.0-957.5.1.el7.x86_64   cri-o://1.12.5-6.rhaos4.0.git80d1487.el7

Comment 23 Junqi Zhao 2019-02-20 07:49:19 UTC
Created attachment 1536600 [details]
empty metrics diagrams from Pod Overview for pods on master

Comment 26 Junqi Zhao 2019-02-22 08:34:51 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-21-215247   True        False         3h44m   Cluster version is 4.0.0-0.nightly-2019-02-21-215247

The kubelet targets on masters are UP now and Prometheus can scrape pods' metrics on masters; attaching the targets file.

Comment 27 Junqi Zhao 2019-02-22 08:36:27 UTC
Created attachment 1537372 [details]
kubelet targets on masters are UP

Comment 30 errata-xmlrpc 2019-06-04 10:42:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


