Bug 1674372

Summary:

Prometheus unable to scrape kubelet metrics "x509: certificate signed by unknown authority"

Product:

OpenShift Container Platform

Reporter:

Junqi Zhao <juzhao>

Component:

Monitoring

Assignee:

lserven

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

urgent

Priority:

urgent

Version:

4.1.0

CC:

aos-bugs, eparis, jokerman, lserven, mloibl, mmccomas, pweil, sjenning, surbania, weinliu

Keywords:

Regression, TestBlocker

Target Release:

4.1.0

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2019-06-04 10:42:43 UTC

Type:

Bug

Attachments:

Description	Flags
targets_down.png	none
"x509: certificate signed by unknown authority" for worker node	none
endpoint 10250/metrics/cadvisor are Down for master nodes	none
empty metrics diagrams from Pod Overview for pods on master	none
kubelet targets on masters are UP	none

Description Junqi Zhao 2019-02-11 07:20:44 UTC

Description of problem:
`oc adm top pod` only reports metrics for pods on master nodes. For example,
the command below shows metrics for only the following pods:
$ oc adm top pod -n openshift-monitoring
NAME                                           CPU(cores)   MEMORY(bytes)   
cluster-monitoring-operator-8499bf9b58-m6dqk   1m           27Mi            
node-exporter-2qfxz                            0m           21Mi            
node-exporter-7v7sl                            3m           22Mi            
node-exporter-vb7f5                            4m           23Mi    

These pods are all on master nodes:
$ oc get pod -n openshift-monitoring -o wide
NAME                                           READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE
alertmanager-main-0                            3/3     Running   0          5h7m    10.131.0.8     ip-10-0-138-64.us-east-2.compute.internal    <none>
alertmanager-main-1                            3/3     Running   0          5h7m    10.129.2.7     ip-10-0-152-149.us-east-2.compute.internal   <none>
alertmanager-main-2                            3/3     Running   0          5h7m    10.128.2.8     ip-10-0-171-54.us-east-2.compute.internal    <none>
cluster-monitoring-operator-8499bf9b58-m6dqk   1/1     Running   0          5h12m   10.129.0.22    ip-10-0-37-21.us-east-2.compute.internal     <none>
grafana-78765ddcc7-p4vhn                       2/2     Running   0          5h11m   10.129.2.5     ip-10-0-152-149.us-east-2.compute.internal   <none>
kube-state-metrics-67479bfb84-dmpb9            3/3     Running   0          5h6m    10.129.2.8     ip-10-0-152-149.us-east-2.compute.internal   <none>
node-exporter-2qfxz                            2/2     Running   0          5h6m    10.0.2.30      ip-10-0-2-30.us-east-2.compute.internal      <none>
node-exporter-496nm                            2/2     Running   0          5h6m    10.0.152.149   ip-10-0-152-149.us-east-2.compute.internal   <none>
node-exporter-7v7sl                            2/2     Running   0          5h6m    10.0.24.53     ip-10-0-24-53.us-east-2.compute.internal     <none>
node-exporter-jm9sp                            2/2     Running   0          5h6m    10.0.171.54    ip-10-0-171-54.us-east-2.compute.internal    <none>
node-exporter-qrdlg                            2/2     Running   0          5h6m    10.0.138.64    ip-10-0-138-64.us-east-2.compute.internal    <none>
node-exporter-vb7f5                            2/2     Running   0          5h6m    10.0.37.21     ip-10-0-37-21.us-east-2.compute.internal     <none>
prometheus-adapter-78bd784f5d-n8xgd            1/1     Running   0          6m9s    10.128.2.11    ip-10-0-171-54.us-east-2.compute.internal    <none>
prometheus-k8s-0                               6/6     Running   1          5h9m    10.128.2.7     ip-10-0-171-54.us-east-2.compute.internal    <none>
prometheus-k8s-1                               6/6     Running   1          5h9m    10.129.2.6     ip-10-0-152-149.us-east-2.compute.internal   <none>
prometheus-operator-6cbfc9949-kcllm            1/1     Running   0          5h12m   10.129.2.3     ip-10-0-152-149.us-east-2.compute.internal   <none>
telemeter-client-6b7dd49d98-jxxrn              3/3     Running   0          5h6m    10.129.2.9     ip-10-0-152-149.us-east-2.compute.internal   <none>

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-138-64.us-east-2.compute.internal    Ready    worker   3h53m   v1.12.4+bdfe8e3f3a
ip-10-0-152-149.us-east-2.compute.internal   Ready    worker   3h53m   v1.12.4+bdfe8e3f3a
ip-10-0-171-54.us-east-2.compute.internal    Ready    worker   3h53m   v1.12.4+bdfe8e3f3a
ip-10-0-2-30.us-east-2.compute.internal      Ready    master   4h7m    v1.12.4+bdfe8e3f3a
ip-10-0-24-53.us-east-2.compute.internal     Ready    master   4h7m    v1.12.4+bdfe8e3f3a
ip-10-0-37-21.us-east-2.compute.internal     Ready    master   4h7m    v1.12.4+bdfe8e3f3a

Version-Release number of selected component (if applicable):
$ oc version
oc v4.0.0-0.168.0
kubernetes v1.12.4+bdfe8e3f3a
features: Basic-Auth GSSAPI Kerberos SPNEGO


How reproducible:
Always

Steps to Reproduce:
1. `oc adm top pod`

Actual results:
`oc adm top pod` only reports metrics for pods on master nodes.

Expected results:
It should report metrics for all pods.

Additional info:
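
The cross-check done by hand in the description (pods that have metrics versus the role of the node they run on) could be sketched roughly as follows. This is an illustrative sketch only: the helper name `pods_without_metrics` is hypothetical, and the dictionaries are a hand-copied subset of the `oc` output above rather than anything produced by a tool.

```python
# Hypothetical helper for illustration; not part of oc or any OpenShift component.
def pods_without_metrics(pod_nodes, node_roles, pods_with_metrics):
    """Group pods that report no metrics by the role of the node they run on."""
    missing = {}
    for pod, node in pod_nodes.items():
        if pod in pods_with_metrics:
            continue
        role = node_roles.get(node, "unknown")
        missing.setdefault(role, []).append(pod)
    return {role: sorted(pods) for role, pods in missing.items()}


# Subset of `oc get pod -n openshift-monitoring -o wide` from the description.
pod_nodes = {
    "alertmanager-main-0": "ip-10-0-138-64.us-east-2.compute.internal",
    "cluster-monitoring-operator-8499bf9b58-m6dqk": "ip-10-0-37-21.us-east-2.compute.internal",
    "node-exporter-2qfxz": "ip-10-0-2-30.us-east-2.compute.internal",
    "node-exporter-496nm": "ip-10-0-152-149.us-east-2.compute.internal",
    "prometheus-k8s-0": "ip-10-0-171-54.us-east-2.compute.internal",
}

# Subset of `oc get node` from the description.
node_roles = {
    "ip-10-0-138-64.us-east-2.compute.internal": "worker",
    "ip-10-0-152-149.us-east-2.compute.internal": "worker",
    "ip-10-0-171-54.us-east-2.compute.internal": "worker",
    "ip-10-0-2-30.us-east-2.compute.internal": "master",
    "ip-10-0-37-21.us-east-2.compute.internal": "master",
}

# Pods from the subset above that appear in the `oc adm top pod` output.
pods_with_metrics = {
    "cluster-monitoring-operator-8499bf9b58-m6dqk",
    "node-exporter-2qfxz",
}

result = pods_without_metrics(pod_nodes, node_roles, pods_with_metrics)
print(result)
```

With this subset, every pod lacking metrics is grouped under the worker role, which matches the observation that only pods on master nodes are reported.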

Comment 1 Junqi Zhao 2019-02-11 08:50:07 UTC

The reason may be that the 10250/metrics and 10250/metrics/cadvisor targets for the worker nodes are all down; see the attached picture targets_down.png.

The error is:

x509: certificate signed by unknown authority
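
The target state shown in the screenshot can also be read from the Prometheus HTTP API (`/api/v1/targets`), whose response lists each active target with its `health`, `scrapeUrl`, and `lastError`. The following is a minimal sketch that filters such a response for targets that are not up. The helper name `down_targets` is hypothetical, and the sample payload is hand-written for illustration (trimmed to the fields used here), not captured from the affected cluster.

```python
import json


# Hypothetical helper for illustration.
def down_targets(targets_response):
    """Return (scrapeUrl, lastError) for every active target whose health is not "up"."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


# Hand-written sample in the shape of a /api/v1/targets response (assumption:
# only the fields used above are included; real responses carry more).
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"scrapeUrl": "https://10.0.2.30:10250/metrics",
       "health": "up", "lastError": ""},
      {"scrapeUrl": "https://10.0.138.64:10250/metrics",
       "health": "down",
       "lastError": "x509: certificate signed by unknown authority"},
      {"scrapeUrl": "https://10.0.138.64:10250/metrics/cadvisor",
       "health": "down",
       "lastError": "x509: certificate signed by unknown authority"}
    ]
  }
}
""")

for url, err in down_targets(sample):
    print(url, "->", err)
```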

Comment 2 Junqi Zhao 2019-02-11 08:50:23 UTC

Created attachment 1528904
targets_down.png

Comment 3 Seth Jennings 2019-02-11 16:25:11 UTC

This is a new regression after https://github.com/openshift/cluster-kube-controller-manager-operator/pull/132.

The issue is that Prometheus is not being updated to trust the CA that the kube-controller-manager uses to sign the kubelet server CSRs.

Working to resolve this...

Comment 4 Junqi Zhao 2019-02-12 02:03:50 UTC

BTW: After running for a few hours, 
$ oc adm top node
error: You must be logged in to the server (Unauthorized)
$ oc adm top po
error: You must be logged in to the server (Unauthorized)

I think this is also related to the x509 issue.

Comment 5 Seth Jennings 2019-02-12 14:34:33 UTC

*** Bug 1674368 has been marked as a duplicate of this bug. ***

Comment 6 Seth Jennings 2019-02-12 21:44:56 UTC

To fix this, Monitoring is going to have to add code to watch the csr-controller-ca configmap in the openshift-config-managed namespace, update the Prometheus config, and reload Prometheus.
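
The watch, update, and reload sequence described in this comment can be sketched as below. This is not the code from the eventual fix; it is a minimal illustration under stated assumptions. The class `CABundleSyncer` and its callbacks are hypothetical. In a real operator the bundle would arrive from a watch on the configmap, and the reload callback would trigger a Prometheus config reload (for example SIGHUP, or a POST to `/-/reload` when the lifecycle API is enabled).

```python
import hashlib


class CABundleSyncer:
    """Hypothetical sketch: propagate a CA bundle only when its content changes."""

    def __init__(self, write_bundle, reload_prometheus):
        # Both callbacks are assumptions standing in for real side effects:
        # write_bundle(pem) would store the bundle where Prometheus reads it,
        # reload_prometheus() would ask Prometheus to reload its configuration.
        self._write_bundle = write_bundle
        self._reload = reload_prometheus
        self._last_digest = None

    def sync(self, bundle_pem):
        """Handle one observed configmap state; return True if a reload was triggered."""
        digest = hashlib.sha256(bundle_pem.encode("utf-8")).hexdigest()
        if digest == self._last_digest:
            return False  # unchanged bundle: nothing to do
        self._write_bundle(bundle_pem)
        self._reload()
        self._last_digest = digest
        return True


# Usage with recording callbacks instead of real side effects.
events = []
syncer = CABundleSyncer(
    write_bundle=lambda pem: events.append(("write", pem)),
    reload_prometheus=lambda: events.append(("reload",)),
)
first = syncer.sync("CA-v1")    # new bundle: write and reload
repeat = syncer.sync("CA-v1")   # same bundle again: skipped
rotated = syncer.sync("CA-v2")  # rotated CA: write and reload again
print(first, repeat, rotated, events)
```

Comparing a digest of the bundle keeps repeated watch events for an unchanged configmap from causing needless reloads.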

Comment 9 Seth Jennings 2019-02-15 15:47:06 UTC

*** Bug 1674341 has been marked as a duplicate of this bug. ***

Comment 10 lserven 2019-02-16 02:03:44 UTC

PR: https://github.com/openshift/cluster-monitoring-operator/pull/250

Comment 14 Junqi Zhao 2019-02-19 03:32:01 UTC

We should test with OCP images. Tested with the following configuration and still see "x509: certificate signed by unknown authority" for worker nodes;
see the attached picture.
Assigning it back.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-18-224151   True        False         57m     Cluster version is 4.0.0-0.nightly-2019-02-18-224151


configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:24eb3125b5fec17e2db68b7fcd406d5aecba67ebe6da18fbd9c2c7e884ce00f8
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2d0d8d43b79fb970a7a090a759da06aebb1dec7e31fffd2d3ed455f92a998522
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31905d24b331859b99852c6f4ef916539508bfb61f443c94e0f46a83093f7dc0
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3f0b3aa9c8923c95233f2872a6d4842796ab202a91faa8595518ad6a154f1d87
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:580e5a5cd057e2c09ea132fed5c75b59423228587631dcd47f9471b0d1f9a872
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b4ba55ab5ec5bb1b4c024a7b99bc67fe108a28e564288734f9884bc1055d4ed
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5de207bf1cdbdcbe54fe97684d6b3aaf9d362a46f7d0a7af1e989cdf57b59599
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8b88bd937ccf01b9cb2584ceb45b829406ebc3b35201f73eead00605b4fdfc
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b50f38e8f288fdba31527bfcb631d0a15bb2c9409631ef30275f5483946aba6f
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c6cbfe8c7034edf8d0df1df4208543fe5f37a8ad306eaf736bcd7c1cbb999ffc
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:efe301356a6f40679e27e6e8287ed6d8316e54410415f4f3744f3182c1d4e07e
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest


RHCOS build: 47.318


# oc get node -o wide | grep worker | awk '{print $3"    "$6}'
worker    10.0.138.250
worker    10.0.154.163
worker    10.0.175.239

Comment 15 Junqi Zhao 2019-02-19 03:32:38 UTC

Created attachment 1536207
"x509: certificate signed by unknown authority" for worker node

Comment 16 Junqi Zhao 2019-02-19 09:45:57 UTC

Tested with 
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-19-024716   True        False         50m     Cluster version is 4.0.0-0.nightly-2019-02-19-024716



configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:037fa98f23ff812b6861675127d52eea43caa44bb138e7fe41c7199cb8d4d634
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31905d24b331859b99852c6f4ef916539508bfb61f443c94e0f46a83093f7dc0
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:36f168dc7fc6ada9af0f2eeb88f394f2e7311340acc25f801830fe509fd93911
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534a71a355e3b9c79ef5a192a200730b8641f5e266abe290b6f7c6342210d8a0
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b4ba55ab5ec5bb1b4c024a7b99bc67fe108a28e564288734f9884bc1055d4ed
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5bc582cfbe8b24935e4f9ee1fe6660e13353377473e09a63b51d4e3d24a7ade3
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6cb6cd27a308c2ae9e0c714c8633792cc151e17312bd74da45255980eabf5ecf
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8675adb4a2a367c9205e3879b986da69400b9187df7ac3f3fbf9882e6a356252
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9021d3e9ce028fc72301f8e0a40c37e488db658e1500a790c794bfd38903bef1
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba01869048bf44fc5e8c57f0a34369750ce27e3fb0b5eb47c78f42022640154c
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee79721af3078dfbcfaa75e9a47da1526464cf6685a7f4195ea214c840b59e9f
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest

The issue is partly fixed: for the openshift-monitoring/kubelet/1 target, the 10250/metrics/cadvisor endpoints are Down for master nodes.
# oc get node -o wide | grep master | awk '{print $3"    "$6}'
master    10.0.140.224
master    10.0.152.209
master    10.0.163.116

FYI, masters do not expose a public IP now; not sure whether that is related.

Comment 17 Junqi Zhao 2019-02-19 09:47:35 UTC

Created attachment 1536277
endpoint 10250/metrics/cadvisor are Down for master nodes

Comment 18 lserven 2019-02-19 09:55:13 UTC

Great, this first PR was only expected to fix scraping for worker nodes. As noted in the PR:

These changes fix scraping of all kubelets on worker nodes, however, scraping
master kubelets will be broken until
openshift/cluster-kube-apiserver-operator#247 lands and
makes it into the installer. Once that is in, we can change the CA configmap to
kubelet-serving-ca.

I'll post a follow-up shortly to fix scraping of master node kubelets and update this BZ.

Comment 19 Junqi Zhao 2019-02-19 10:21:24 UTC

kubelet_running_pod_count cannot count the pods on master nodes; this should also wait until openshift/cluster-kube-apiserver-operator#247 lands and makes it into the installer.

kubelet_running_pod_count{instance=~'.*10.0.163.116.*'}
The result is "No datapoints found."

# oc get node -o wide | grep master | awk '{print $3"    "$6}'
master    10.0.140.224
master    10.0.152.209
master    10.0.163.116
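
The per-node check above can be repeated for all masters at once by inspecting the result of an instant query for `kubelet_running_pod_count` from the Prometheus HTTP API (`/api/v1/query`). The following is a rough sketch: the helper name `nodes_without_series` is hypothetical, and the sample response is hand-written, with a made-up worker instance and value, purely to show the shape of the data.

```python
# Hypothetical helper for illustration.
def nodes_without_series(query_response, node_ips):
    """Return the node IPs that have no series in an instant-query result."""
    seen = set()
    for series in query_response.get("data", {}).get("result", []):
        instance = series.get("metric", {}).get("instance", "")
        # Assumption: the instance label has the form "<ip>:<port>".
        seen.add(instance.rsplit(":", 1)[0])
    return [ip for ip in node_ips if ip not in seen]


# Master IPs from the `oc get node -o wide` output above.
masters = ["10.0.140.224", "10.0.152.209", "10.0.163.116"]

# Hand-written sample in the shape of a /api/v1/query response. The single
# series (instance 192.0.2.10, value "12") is made up for illustration.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "kubelet_running_pod_count",
                    "instance": "192.0.2.10:10250",
                },
                "value": [1550571684, "12"],
            }
        ],
    },
}

missing = nodes_without_series(sample, masters)
print(missing)
```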

Comment 20 lserven 2019-02-19 16:53:27 UTC

#247 landed and I created the follow-up PR to be able to scrape both master AND worker kubelets: https://github.com/openshift/cluster-monitoring-operator/pull/256

Comment 21 lserven 2019-02-19 18:09:40 UTC

PR is merged. Please review, Junqi Zhao.

Comment 22 Junqi Zhao 2019-02-20 07:48:46 UTC

For pods on master nodes, the metrics diagrams in Pod Overview are empty.
See the attached picture "empty metrics diagrams from Pod Overview for pods on master":
cluster-monitoring-operator-549bdc94fb-svvpr is on a master, and its metrics diagrams in Pod Overview are empty.

This issue is also related to Comment 18.
$ oc -n openshift-monitoring get pod -o wide | grep cluster-monitoring-operator-549bdc94fb-svvpr
cluster-monitoring-operator-549bdc94fb-svvpr   1/1     Running   0          4h26m   10.130.0.19    ip-10-0-174-27.us-east-2.compute.internal    <none>

$ oc get node -o wide | grep master | grep ip-10-0-174-27.us-east-2.compute.internal
ip-10-0-174-27.us-east-2.compute.internal    Ready    master   4h42m   v1.12.4+ec459b84aa   10.0.174.27    <none>        Red Hat CoreOS 4.0   3.10.0-957.5.1.el7.x86_64   cri-o://1.12.5-6.rhaos4.0.git80d1487.el7

Comment 23 Junqi Zhao 2019-02-20 07:49:19 UTC

Created attachment 1536600
empty metrics diagrams from Pod Overview for pods on master

Comment 26 Junqi Zhao 2019-02-22 08:34:51 UTC

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-21-215247   True        False         3h44m   Cluster version is 4.0.0-0.nightly-2019-02-21-215247

The kubelet targets on masters are UP now and the metrics for pods on masters can be scraped; attaching the targets file.

Comment 27 Junqi Zhao 2019-02-22 08:36:27 UTC

Created attachment 1537372
kubelet targets on masters are UP

Comment 30 errata-xmlrpc 2019-06-04 10:42:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758