Bug 1674372
| Summary: | Prometheus unable to scrape kubelet metrics "x509: certificate signed by unknown authority" | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Monitoring | Assignee: | lserven |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 4.1.0 | CC: | aos-bugs, eparis, jokerman, lserven, mloibl, mmccomas, pweil, sjenning, surbania, weinliu |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | 4.1.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-06-04 10:42:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
|
Description
Junqi Zhao
2019-02-11 07:20:44 UTC
The likely cause: the 10250/metrics and 10250/metrics/cadvisor targets for the worker nodes are all down; see the attached picture targets_down.png. The error is "x509: certificate signed by unknown authority".
Created attachment 1528904 [details]
targets_down.png
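To triage a failure like this without the UI, the same information is available from the Prometheus HTTP API at /api/v1/targets. The following is a minimal Python sketch, not part of the original report: it filters a targets response for down targets whose last error mentions x509. The helper name `down_targets` and the sample payload (including its IP addresses and error strings) are made up for illustration; only the response fields `activeTargets`, `health`, `scrapeUrl`, and `lastError` come from the Prometheus API.

```python
def down_targets(targets_response, error_substring="x509"):
    """Return (scrapeUrl, lastError) pairs for down targets whose error matches."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") == "down" and error_substring in t.get("lastError", "")
    ]


# Hypothetical payload shaped like a /api/v1/targets response (values are invented).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapeUrl": "https://10.0.0.10:10250/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "scrapeUrl": "https://10.0.0.11:10250/metrics",
                "health": "down",
                "lastError": "x509: certificate signed by unknown authority",
            },
            {
                "scrapeUrl": "https://10.0.0.12:10250/metrics/cadvisor",
                "health": "down",
                "lastError": "context deadline exceeded",
            },
        ]
    },
}

for url, err in down_targets(SAMPLE):
    print(url, "->", err)
```

In a real cluster the payload would be fetched from the Prometheus route with a valid token; how to obtain that is deployment-specific and is left out of this sketch.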
This is a new regression after https://github.com/openshift/cluster-kube-controller-manager-operator/pull/132. The issue is that Prometheus is not updating to trust the CA that the kube-controller-manager uses to sign the kubelet server CSRs. Working to resolve this...
BTW: After running for a few hours:
$ oc adm top node
error: You must be logged in to the server (Unauthorized)
$ oc adm top po
error: You must be logged in to the server (Unauthorized)
I think it is also related to the x509 issue.
*** Bug 1674368 has been marked as a duplicate of this bug. ***
To fix this, Monitoring is going to have to add code to watch the csr-controller-ca configmap in the openshift-config-managed namespace, update the Prometheus config, and reload Prometheus.
*** Bug 1674341 has been marked as a duplicate of this bug. ***
We should test with OCP images. Tested with the following configuration; still see "x509: certificate signed by unknown authority" for worker nodes,
see the attached picture.
Assign it back
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.0.0-0.nightly-2019-02-18-224151 True False 57m Cluster version is 4.0.0-0.nightly-2019-02-18-224151
configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:24eb3125b5fec17e2db68b7fcd406d5aecba67ebe6da18fbd9c2c7e884ce00f8
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2d0d8d43b79fb970a7a090a759da06aebb1dec7e31fffd2d3ed455f92a998522
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31905d24b331859b99852c6f4ef916539508bfb61f443c94e0f46a83093f7dc0
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3f0b3aa9c8923c95233f2872a6d4842796ab202a91faa8595518ad6a154f1d87
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:580e5a5cd057e2c09ea132fed5c75b59423228587631dcd47f9471b0d1f9a872
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b4ba55ab5ec5bb1b4c024a7b99bc67fe108a28e564288734f9884bc1055d4ed
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5de207bf1cdbdcbe54fe97684d6b3aaf9d362a46f7d0a7af1e989cdf57b59599
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8b88bd937ccf01b9cb2584ceb45b829406ebc3b35201f73eead00605b4fdfc
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b50f38e8f288fdba31527bfcb631d0a15bb2c9409631ef30275f5483946aba6f
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c6cbfe8c7034edf8d0df1df4208543fe5f37a8ad306eaf736bcd7c1cbb999ffc
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:efe301356a6f40679e27e6e8287ed6d8316e54410415f4f3744f3182c1d4e07e
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest
RHCOS build: 47.318
# oc get node -o wide | grep worker | awk '{print $3" "$6}'
worker 10.0.138.250
worker 10.0.154.163
worker 10.0.175.239
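The fix proposed earlier in this thread (watch the csr-controller-ca configmap in the openshift-config-managed namespace, update the Prometheus config, reload Prometheus) can be illustrated with a small Python sketch. This is not the cluster-monitoring-operator code; the class `CABundleSyncer` and its three injected callables are hypothetical stand-ins for reading the configmap, writing the CA file that Prometheus uses, and triggering a reload. It only shows the change-detection idea: reload when the bundle content changes, and do nothing otherwise.

```python
import hashlib


def bundle_digest(pem_text):
    """Return a stable fingerprint of the CA bundle contents."""
    return hashlib.sha256(pem_text.encode("utf-8")).hexdigest()


class CABundleSyncer:
    """Propagate CA bundle changes to Prometheus (illustrative only).

    fetch_bundle, write_bundle and reload_prometheus are assumptions:
    callables supplied by the caller, not real OpenShift or Prometheus APIs.
    """

    def __init__(self, fetch_bundle, write_bundle, reload_prometheus):
        self._fetch = fetch_bundle
        self._write = write_bundle
        self._reload = reload_prometheus
        self._last_digest = None

    def sync_once(self):
        """Run one reconcile step; return True if a reload was triggered."""
        pem = self._fetch()
        digest = bundle_digest(pem)
        if digest == self._last_digest:
            return False  # bundle unchanged, nothing to do
        self._write(pem)   # update the CA file first
        self._reload()     # then ask Prometheus to pick it up
        self._last_digest = digest
        return True
```

A caller would invoke `sync_once()` on every configmap event or on a timer. Comparing digests rather than reloading on every event avoids needless Prometheus reloads when the configmap is rewritten with identical content.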
Created attachment 1536207 [details]
"x509: certificate signed by unknown authority" for worker node
Tested with
# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.0.0-0.nightly-2019-02-19-024716 True False 50m Cluster version is 4.0.0-0.nightly-2019-02-19-024716
configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:037fa98f23ff812b6861675127d52eea43caa44bb138e7fe41c7199cb8d4d634
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31905d24b331859b99852c6f4ef916539508bfb61f443c94e0f46a83093f7dc0
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:36f168dc7fc6ada9af0f2eeb88f394f2e7311340acc25f801830fe509fd93911
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534a71a355e3b9c79ef5a192a200730b8641f5e266abe290b6f7c6342210d8a0
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b4ba55ab5ec5bb1b4c024a7b99bc67fe108a28e564288734f9884bc1055d4ed
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5bc582cfbe8b24935e4f9ee1fe6660e13353377473e09a63b51d4e3d24a7ade3
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6cb6cd27a308c2ae9e0c714c8633792cc151e17312bd74da45255980eabf5ecf
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8675adb4a2a367c9205e3879b986da69400b9187df7ac3f3fbf9882e6a356252
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9021d3e9ce028fc72301f8e0a40c37e488db658e1500a790c794bfd38903bef1
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba01869048bf44fc5e8c57f0a34369750ce27e3fb0b5eb47c78f42022640154c
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee79721af3078dfbcfaa75e9a47da1526464cf6685a7f4195ea214c840b59e9f
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest
The issue is partly fixed: for the openshift-monitoring/kubelet/1 target, the 10250/metrics/cadvisor endpoints are Down for master nodes.
# oc get node -o wide | grep master | awk '{print $3" "$6}'
master 10.0.140.224
master 10.0.152.209
master 10.0.163.116
FYI, masters do not expose a public IP now; not sure whether that is related.
Created attachment 1536277 [details]
endpoint 10250/metrics/cadvisor are Down for master nodes
Great, this first PR was only expected to fix scraping for worker nodes. As noted in the PR: These changes fix scraping of all kubelets on worker nodes, however, scraping master kubelets will be broken until openshift/cluster-kube-apiserver-operator#247 lands and makes it into the installer. Once that is in, we can change the CA configmap to kubelet-serving-ca. I'll post a follow-up shortly to fix scraping of master node kubelets and update this BZ.
kubelet_running_pod_count cannot report the pod count on master nodes; this should also wait until openshift/cluster-kube-apiserver-operator#247 lands and makes it into the installer.
kubelet_running_pod_count{instance=~'.*10.0.163.116.*'}
result is No datapoints found.
# oc get node -o wide | grep master | awk '{print $3" "$6}'
master 10.0.140.224
master 10.0.152.209
master 10.0.163.116
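A side note on the query above: in the selector `instance=~'.*10.0.163.116.*'` the dots are regex wildcards, so it can also match addresses other than the intended one. That is unrelated to the "No datapoints found" result reported here, but a stricter selector is easy to build. The sketch below is an illustration only; the helper name `instance_selector` is hypothetical. It relies on `re.escape` from the Python standard library and on the PromQL rule that backslashes inside quoted string literals must themselves be escaped.

```python
import re


def instance_selector(metric, ip):
    """Build a PromQL selector that matches the given IP literally."""
    # re.escape turns "." into "\."; PromQL quoted strings need "\\." instead.
    pattern = re.escape(ip).replace("\\", "\\\\")
    return '%s{instance=~".*%s.*"}' % (metric, pattern)


print(instance_selector("kubelet_running_pod_count", "10.0.163.116"))
```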
#247 landed and I created the follow-up PR to be able to scrape both master AND worker kubelets: https://github.com/openshift/cluster-monitoring-operator/pull/256
The PR is merged. Please review, Junqi Zhao.
If pods are on master nodes, the metrics diagrams in Pod Overview are empty for those pods. See the attached picture "empty metrics diagrams from Pod Overview for pods on master": cluster-monitoring-operator-549bdc94fb-svvpr is on a master, and its Pod Overview metrics diagrams are empty. This issue is also related to Comment 18.
$ oc -n openshift-monitoring get pod -o wide | grep cluster-monitoring-operator-549bdc94fb-svvpr
cluster-monitoring-operator-549bdc94fb-svvpr 1/1 Running 0 4h26m 10.130.0.19 ip-10-0-174-27.us-east-2.compute.internal <none>
$ oc get node -o wide | grep master | grep ip-10-0-174-27.us-east-2.compute.internal
ip-10-0-174-27.us-east-2.compute.internal Ready master 4h42m v1.12.4+ec459b84aa 10.0.174.27 <none> Red Hat CoreOS 4.0 3.10.0-957.5.1.el7.x86_64 cri-o://1.12.5-6.rhaos4.0.git80d1487.el7
Created attachment 1536600 [details]
empty metrics diagrams from Pod Overview for pods on master
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.0.0-0.nightly-2019-02-21-215247 True False 3h44m Cluster version is 4.0.0-0.nightly-2019-02-21-215247
The kubelet targets on masters are UP now and Prometheus can scrape pods' metrics on masters; attaching the targets file.
Created attachment 1537372 [details]
kubelet targets on masters are UP
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758 |