Description of problem:
`oc adm top pod` only calculates metrics for pods on master nodes. For example, the command below only shows metrics for the following pods:

$ oc adm top pod -n openshift-monitoring
NAME                                           CPU(cores)   MEMORY(bytes)
cluster-monitoring-operator-8499bf9b58-m6dqk   1m           27Mi
node-exporter-2qfxz                            0m           21Mi
node-exporter-7v7sl                            3m           22Mi
node-exporter-vb7f5                            4m           23Mi

These pods are all on master nodes:

$ oc get pod -n openshift-monitoring -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
alertmanager-main-0 3/3 Running 0 5h7m 10.131.0.8 ip-10-0-138-64.us-east-2.compute.internal <none>
alertmanager-main-1 3/3 Running 0 5h7m 10.129.2.7 ip-10-0-152-149.us-east-2.compute.internal <none>
alertmanager-main-2 3/3 Running 0 5h7m 10.128.2.8 ip-10-0-171-54.us-east-2.compute.internal <none>
cluster-monitoring-operator-8499bf9b58-m6dqk 1/1 Running 0 5h12m 10.129.0.22 ip-10-0-37-21.us-east-2.compute.internal <none>
grafana-78765ddcc7-p4vhn 2/2 Running 0 5h11m 10.129.2.5 ip-10-0-152-149.us-east-2.compute.internal <none>
kube-state-metrics-67479bfb84-dmpb9 3/3 Running 0 5h6m 10.129.2.8 ip-10-0-152-149.us-east-2.compute.internal <none>
node-exporter-2qfxz 2/2 Running 0 5h6m 10.0.2.30 ip-10-0-2-30.us-east-2.compute.internal <none>
node-exporter-496nm 2/2 Running 0 5h6m 10.0.152.149 ip-10-0-152-149.us-east-2.compute.internal <none>
node-exporter-7v7sl 2/2 Running 0 5h6m 10.0.24.53 ip-10-0-24-53.us-east-2.compute.internal <none>
node-exporter-jm9sp 2/2 Running 0 5h6m 10.0.171.54 ip-10-0-171-54.us-east-2.compute.internal <none>
node-exporter-qrdlg 2/2 Running 0 5h6m 10.0.138.64 ip-10-0-138-64.us-east-2.compute.internal <none>
node-exporter-vb7f5 2/2 Running 0 5h6m 10.0.37.21 ip-10-0-37-21.us-east-2.compute.internal <none>
prometheus-adapter-78bd784f5d-n8xgd 1/1 Running 0 6m9s 10.128.2.11 ip-10-0-171-54.us-east-2.compute.internal <none>
prometheus-k8s-0 6/6 Running 1 5h9m 10.128.2.7 ip-10-0-171-54.us-east-2.compute.internal <none>
prometheus-k8s-1 6/6 Running 1 5h9m 10.129.2.6 ip-10-0-152-149.us-east-2.compute.internal <none>
prometheus-operator-6cbfc9949-kcllm 1/1 Running 0 5h12m 10.129.2.3 ip-10-0-152-149.us-east-2.compute.internal <none>
telemeter-client-6b7dd49d98-jxxrn 3/3 Running 0 5h6m 10.129.2.9 ip-10-0-152-149.us-east-2.compute.internal <none>

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-138-64.us-east-2.compute.internal    Ready    worker   3h53m   v1.12.4+bdfe8e3f3a
ip-10-0-152-149.us-east-2.compute.internal   Ready    worker   3h53m   v1.12.4+bdfe8e3f3a
ip-10-0-171-54.us-east-2.compute.internal    Ready    worker   3h53m   v1.12.4+bdfe8e3f3a
ip-10-0-2-30.us-east-2.compute.internal      Ready    master   4h7m    v1.12.4+bdfe8e3f3a
ip-10-0-24-53.us-east-2.compute.internal     Ready    master   4h7m    v1.12.4+bdfe8e3f3a
ip-10-0-37-21.us-east-2.compute.internal     Ready    master   4h7m    v1.12.4+bdfe8e3f3a

Version-Release number of selected component (if applicable):
$ oc version
oc v4.0.0-0.168.0
kubernetes v1.12.4+bdfe8e3f3a
features: Basic-Auth GSSAPI Kerberos SPNEGO

How reproducible:
Always

Steps to Reproduce:
1. Run `oc adm top pod`

Actual results:
`oc adm top pod` only calculates metrics for pods on master nodes.

Expected results:
It should calculate metrics for all pods.

Additional info:
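For reference, `oc adm top pod` reads from the resource metrics API (metrics.k8s.io), which in this cluster is served by prometheus-adapter. A minimal way to look at the raw data the CLI is given (a sketch, assuming the adapter is registered as the metrics.k8s.io/v1beta1 API service):

$ oc get --raw /apis/metrics.k8s.io/v1beta1/namespaces/openshift-monitoring/pods

Pods that are missing from this response will also be missing from the `oc adm top pod` output.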
The reason may be that the 10250/metrics and 10250/metrics/cadvisor targets for the worker nodes are all down; see the attached picture targets_down.png. The error is: x509: certificate signed by unknown authority
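A quick way to see which CA signed the kubelet serving certificate on a worker (a sketch, run from any host that can reach the node on port 10250; the IP is one of the worker IPs from the node list above):

$ openssl s_client -connect 10.0.138.64:10250 </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject

If the issuer is not covered by the CA bundle Prometheus uses for the kubelet scrape jobs, the scrape fails with "x509: certificate signed by unknown authority".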
Created attachment 1528904 [details] targets_down.png
This is a new regression after https://github.com/openshift/cluster-kube-controller-manager-operator/pull/132. The issue is that Prometheus is not being updated to trust the CA the kube-controller-manager uses to sign the kubelet server CSRs. Working to resolve this...
BTW: After running for a few hours:

$ oc adm top node
error: You must be logged in to the server (Unauthorized)

$ oc adm top po
error: You must be logged in to the server (Unauthorized)

I think this is also related to the x509 issue.
*** Bug 1674368 has been marked as a duplicate of this bug. ***
To fix this, Monitoring is going to have to add code to watch the csr-controller-ca configmap in the openshift-config-managed namespace, update the Prometheus configuration, and reload Prometheus.
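The CA bundle in question can be inspected directly (a sketch; the data key name ca-bundle.crt is an assumption about how the configmap is laid out):

$ oc -n openshift-config-managed get configmap csr-controller-ca -o jsonpath='{.data.ca-bundle\.crt}'

The operator would then need to sync this bundle into the CA file referenced by the kubelet scrape configuration and trigger a Prometheus reload whenever the configmap changes.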
*** Bug 1674341 has been marked as a duplicate of this bug. ***
PR: https://github.com/openshift/cluster-monitoring-operator/pull/250
We should test with OCP images. Tested with the following configuration and still see "x509: certificate signed by unknown authority" for worker nodes; see the attached picture. Assigning it back.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-18-224151   True        False         57m     Cluster version is 4.0.0-0.nightly-2019-02-18-224151

configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:24eb3125b5fec17e2db68b7fcd406d5aecba67ebe6da18fbd9c2c7e884ce00f8
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2d0d8d43b79fb970a7a090a759da06aebb1dec7e31fffd2d3ed455f92a998522
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31905d24b331859b99852c6f4ef916539508bfb61f443c94e0f46a83093f7dc0
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3f0b3aa9c8923c95233f2872a6d4842796ab202a91faa8595518ad6a154f1d87
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:580e5a5cd057e2c09ea132fed5c75b59423228587631dcd47f9471b0d1f9a872
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b4ba55ab5ec5bb1b4c024a7b99bc67fe108a28e564288734f9884bc1055d4ed
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5de207bf1cdbdcbe54fe97684d6b3aaf9d362a46f7d0a7af1e989cdf57b59599
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d8b88bd937ccf01b9cb2584ceb45b829406ebc3b35201f73eead00605b4fdfc
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b50f38e8f288fdba31527bfcb631d0a15bb2c9409631ef30275f5483946aba6f
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c6cbfe8c7034edf8d0df1df4208543fe5f37a8ad306eaf736bcd7c1cbb999ffc
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:efe301356a6f40679e27e6e8287ed6d8316e54410415f4f3744f3182c1d4e07e
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest

RHCOS build: 47.318

# oc get node -o wide | grep worker | awk '{print $3" "$6}'
worker 10.0.138.250
worker 10.0.154.163
worker 10.0.175.239
Created attachment 1536207 [details] "x509: certificate signed by unknown authority" for worker node
Tested with:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-19-024716   True        False         50m     Cluster version is 4.0.0-0.nightly-2019-02-19-024716

configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:037fa98f23ff812b6861675127d52eea43caa44bb138e7fe41c7199cb8d4d634
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:31905d24b331859b99852c6f4ef916539508bfb61f443c94e0f46a83093f7dc0
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:36f168dc7fc6ada9af0f2eeb88f394f2e7311340acc25f801830fe509fd93911
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534a71a355e3b9c79ef5a192a200730b8641f5e266abe290b6f7c6342210d8a0
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5b4ba55ab5ec5bb1b4c024a7b99bc67fe108a28e564288734f9884bc1055d4ed
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5bc582cfbe8b24935e4f9ee1fe6660e13353377473e09a63b51d4e3d24a7ade3
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6cb6cd27a308c2ae9e0c714c8633792cc151e17312bd74da45255980eabf5ecf
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8675adb4a2a367c9205e3879b986da69400b9187df7ac3f3fbf9882e6a356252
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9021d3e9ce028fc72301f8e0a40c37e488db658e1500a790c794bfd38903bef1
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba01869048bf44fc5e8c57f0a34369750ce27e3fb0b5eb47c78f42022640154c
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee79721af3078dfbcfaa75e9a47da1526464cf6685a7f4195ea214c840b59e9f
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest

The issue is partly fixed: for the openshift-monitoring/kubelet/1 target, the 10250/metrics/cadvisor endpoints are down for master nodes.

# oc get node -o wide | grep master | awk '{print $3" "$6}'
master 10.0.140.224
master 10.0.152.209
master 10.0.163.116

FYI, the masters do not expose a public IP now; not sure whether that is related.
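Target health can also be checked without the console (a sketch; it assumes Prometheus listens on 9090 inside the prometheus-k8s-0 pod and that curl is available on the local machine):

$ oc -n openshift-monitoring port-forward prometheus-k8s-0 9090 &
$ curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"down"' | wc -l

A non-zero count means some targets are still down; the full JSON from /api/v1/targets shows the scrapeUrl and lastError for each of them.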
Created attachment 1536277 [details] endpoint 10250/metrics/cadvisor are Down for master nodes
Great, this first PR was only expected to fix scraping for worker nodes. As noted in the PR: these changes fix scraping of all kubelets on worker nodes; however, scraping master kubelets will be broken until openshift/cluster-kube-apiserver-operator#247 lands and makes it into the installer. Once that is in, we can change the CA configmap to kubelet-serving-ca. I’ll post a follow-up shortly to fix scraping of master node kubelets and update this BZ.
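Once that lands, the new CA bundle can be verified before switching the monitoring configuration over (a sketch; the openshift-config-managed namespace is an assumption based on where csr-controller-ca lives):

$ oc -n openshift-config-managed get configmap kubelet-serving-ca -o yaml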
kubelet_running_pod_count cannot calculate the pod count on master nodes; this also has to wait for openshift/cluster-kube-apiserver-operator#247 to land and make it into the installer.

kubelet_running_pod_count{instance=~'.*10.0.163.116.*'} returns "No datapoints found."

# oc get node -o wide | grep master | awk '{print $3" "$6}'
master 10.0.140.224
master 10.0.152.209
master 10.0.163.116
#247 landed, and I created the follow-up PR to be able to scrape both master AND worker kubelets: https://github.com/openshift/cluster-monitoring-operator/pull/256
PR is merged. Please review, Junqi Zhao.
If pods are on master nodes, the metrics diagrams in the Pod Overview are empty for those pods. See the attached picture "empty metrics diagrams from Pod Overview for pods on master": cluster-monitoring-operator-549bdc94fb-svvpr is on a master node, and its Pod Overview shows empty metrics diagrams. This issue is also related to Comment 18.

$ oc -n openshift-monitoring get pod -o wide | grep cluster-monitoring-operator-549bdc94fb-svvpr
cluster-monitoring-operator-549bdc94fb-svvpr 1/1 Running 0 4h26m 10.130.0.19 ip-10-0-174-27.us-east-2.compute.internal <none>

$ oc get node -o wide | grep master | grep ip-10-0-174-27.us-east-2.compute.internal
ip-10-0-174-27.us-east-2.compute.internal Ready master 4h42m v1.12.4+ec459b84aa 10.0.174.27 <none> Red Hat CoreOS 4.0 3.10.0-957.5.1.el7.x86_64 cri-o://1.12.5-6.rhaos4.0.git80d1487.el7
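The empty diagrams can be cross-checked in Prometheus with a cadvisor container metric (a sketch; the pod_name label is an assumption based on the cadvisor metric labels in this Kubernetes version):

container_cpu_usage_seconds_total{pod_name="cluster-monitoring-operator-549bdc94fb-svvpr"}

If this returns no series, the /metrics/cadvisor endpoint on that master's kubelet is still not being scraped.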
Created attachment 1536600 [details] empty metrics diagrams from Pod Overview for pods on master
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-21-215247   True        False         3h44m   Cluster version is 4.0.0-0.nightly-2019-02-21-215247

The kubelet targets on masters are UP now and pod metrics on masters can be scraped. Attaching the targets file.
Created attachment 1537372 [details] kubelet targets on masters are UP
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758