Issue reported on bug #2013528 (see description below) could be reproduced in a fixed version: % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.9.15 True False 25h Cluster version is 4.9.15 % oc exec -n openshift-cluster-machine-approver -c machine-approver-controller machine-approver-74559b5df9-gdngs -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" https://localhost:9192/metrics | grep "mapi_current_pending_csr" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP mapi_current_pending_csr Count of pending CSRs at the cluster level # TYPE mapi_current_pending_csr gauge mapi_current_pending_csr 1 100 191k 0 191k 0 0 3996k 0 --:--:-- --:--:-- --:--:-- 3996k % oc get csr No resources found +++ This bug was initially created as a clone of Bug #2013528 +++ Description of problem: After installing or updating to OpenShift Container Platform 4.8 it was found that mapi_current_pending_csr metric from openshift-cluster-machine-approver is always reporting one, even though no pending CSR is reported: > $ oc get clusterversion > NAME VERSION AVAILABLE PROGRESSING SINCE STATUS > version 4.8.12 True False 43h Cluster version is 4.8.12 > $ oc get csr > No resources found > $ oc exec -c machine-approver-controller machine-approver-5fcfd56b9d-pc4g8 -- curl -H "Authorization: Bearer XXXXXX" -k https://localhost:9192/metrics | grep mapi_current_pending_csr > % Total % Received % Xferd Average Speed Time Time Time Current > Dload Upload Total Spent Left Speed > 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP mapi_current_pending_csr Count of pending CSRs at the cluster level > # TYPE mapi_current_pending_csr gauge > mapi_current_pending_csr 1 This is reported by a freshly installed OpenShift Container Platform 4.8 - Cluster on AWS using IPI. But the same can be seen with OpenShift Container Platform - Cluster running on OpenStack and using UPI installation mode. Version-Release number of selected component (if applicable): - OpenShift Container Platform 4.8.12 How reproducible: - So far always Steps to Reproduce: 1. Install OpenShift Container Platform 4.8, wait some time and check metrics for `mapi_current_pending_csr` 2. Also make sure that there is no pending CSR in the Cluster Actual results: mapi_current_pending_csr is reporting 1 even though no pending CSR is reported Expected results: mapi_current_pending_csr should report 0 if there is no pending CSR Additional info: --- Additional comment from Simon Reber on 2021-10-13 06:12:37 UTC --- Hi all, The Screenshot attached is showing the history of the mapi_current_pending_csr metric which is set to one shortly after OpenShift Container Platform 4.8 is installed. As shown there are no pending CSR and after installation there was no Node added, removed or changed. So the cluster did not have pending CSR for a long time. --- Additional comment from Simon Reber on 2021-10-13 06:16:58 UTC --- Hi all, I also have a `must-gather` from a customer OpenShift Container Platform - Cluster that is showing the same behavior (but which was updated and not fresh installed). In addition I also have logs from `machine-approver-controller` with log-level 9. But except for some CSR activity (which is expected when scaling Nodes, etc.) there is nothing as Metrics collection does not appear to provide more verbose data: <snip from customer `machine-approver-controller` log with log-level 9> I1007 10:43:38.912482 1 csr_check.go:442] retrieving serving cert from ip-10-241-28-223.eu-central-1.compute.internal (10.241.28.223:10250) I1007 10:43:38.914685 1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error I1007 10:43:38.914705 1 csr_check.go:191] Falling back to machine-api authorization for ip-10-241-28-223.eu-central-1.compute.internal I1007 10:43:38.914869 1 request.go:1123] Request Body: {"kind":"CertificateSigningRequest","apiVersion":"certificates.k8s.io/v1","metadata":{"name":"csr-8d54c","generateName":"csr-","uid":"8d434333-bb43-4bcb-bc28-080ec539666e","resourceVersion":"49470864","creationTimestamp":"2021-10-07T10:43:38Z","managedFields":[{"manager":"kubelet","operation":"Update","apiVersion":"certificates.k8s.io/v1","time":"2021-10-07T10:43:38Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{}}}}]},"spec":{"request":"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQmFqQ0NBUkFDQVFBd1hERVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVVNd1FRWURWUVFERXpwegplWE4wWlcwNmJtOWtaVHBwY0MweE1DMHlOREV0TWpndE1qSXpMbVYxTFdObGJuUnlZV3d0TVM1amIyMXdkWFJsCkxtbHVkR1Z5Ym1Gc01Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERBUWNEUWdBRUQvRXN3TWV6WFdEMERxZDgKeFN3VTdkWld3ZlZQRzgwSnVmamU3U3FCVEtDSWNSb3c0b1FCQTh0WmdVRVE2MnR0ZW92alhyd3BZN2xyb2R2SgpBUGF5UktCU01GQUdDU3FHU0liM0RRRUpEakZETUVFd1B3WURWUjBSQkRnd05vSXVhWEF0TVRBdE1qUXhMVEk0CkxUSXlNeTVsZFMxalpXNTBjbUZzTFRFdVkyOXRjSFYwWlM1cGJuUmxjbTVoYkljRUN2RWMzekFLQmdncWhrak8KUFFRREFnTklBREJGQWlBcmQxSklSQnBBVm5mVWwrUkV5Q2dUSW9lZTdjTTdidzhra3NFRm4xNjhFUUloQUxtNwphU1prRGtVcFpYbVZ3Ym1vZ3JtMEdFRzdCeXNFM2F0dGNQbFlwTEhLCi0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=","signerName":"kubernetes.io/kubelet-serving","usages":["digital signature","key encipherment","server auth"],"username":"system:node:ip-10-241-28-223.eu-central-1.compute.internal","groups":["system:nodes","system:authenticated"]},"status":{"conditions":[{"type":"Approved","status":"True","reason":"NodeCSRApprove","message":"This CSR was approved by the Node CSR Approver","lastUpdateTime":"2021-10-07T10:43:38Z","lastTransitionTime":null}]}} I1007 10:43:38.914970 1 round_trippers.go:435] curl -k -v -XPUT -H "User-Agent: machine-approver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Content-Type: application/json" -H "Authorization: Bearer <masked>" 'https://198.18.0.1:443/apis/certificates.k8s.io/v1/certificatesigningrequests/csr-8d54c/approval' I1007 10:43:38.929892 1 round_trippers.go:454] PUT https://198.18.0.1:443/apis/certificates.k8s.io/v1/certificatesigningrequests/csr-8d54c/approval 200 OK in 14 milliseconds I1007 10:43:38.929910 1 round_trippers.go:460] Response Headers: I1007 10:43:38.929916 1 round_trippers.go:463] Content-Length: 2020 I1007 10:43:38.929919 1 round_trippers.go:463] Date: Thu, 07 Oct 2021 10:43:38 GMT I1007 10:43:38.929922 1 round_trippers.go:463] Audit-Id: 434045ce-4250-4d54-b349-2555c797386f I1007 10:43:38.929926 1 round_trippers.go:463] Cache-Control: no-cache, private I1007 10:43:38.929928 1 round_trippers.go:463] Content-Type: application/json I1007 10:43:38.929931 1 round_trippers.go:463] X-Kubernetes-Pf-Flowschema-Uid: 60a5fc8f-0d1e-410f-bd9c-210b7ef205fa I1007 10:43:38.929934 1 round_trippers.go:463] X-Kubernetes-Pf-Prioritylevel-Uid: 027324f9-e660-4929-9b52-8f1c525c9f52 I1007 10:43:38.929981 1 request.go:1123] Response Body: {"kind":"CertificateSigningRequest","apiVersion":"certificates.k8s.io/v1","metadata":{"name":"csr-8d54c","generateName":"csr-","uid":"8d434333-bb43-4bcb-bc28-080ec539666e","resourceVersion":"49470872","creationTimestamp":"2021-10-07T10:43:38Z","managedFields":[{"manager":"kubelet","operation":"Update","apiVersion":"certificates.k8s.io/v1","time":"2021-10-07T10:43:38Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{}}}},{"manager":"machine-approver","operation":"Update","apiVersion":"certificates.k8s.io/v1","time":"2021-10-07T10:43:38Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{".":{},"k:{\"type\":\"Approved\"}":{".":{},"f:lastTransitionTime":{},"f:lastUpdateTime":{},"f:message":{},"f:reason":{},"f:status":{},"f:type":{}}}}}}]},"spec":{"request":"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQmFqQ0NBUkFDQVFBd1hERVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVVNd1FRWURWUVFERXpwegplWE4wWlcwNmJtOWtaVHBwY0MweE1DMHlOREV0TWpndE1qSXpMbVYxTFdObGJuUnlZV3d0TVM1amIyMXdkWFJsCkxtbHVkR1Z5Ym1Gc01Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERBUWNEUWdBRUQvRXN3TWV6WFdEMERxZDgKeFN3VTdkWld3ZlZQRzgwSnVmamU3U3FCVEtDSWNSb3c0b1FCQTh0WmdVRVE2MnR0ZW92alhyd3BZN2xyb2R2SgpBUGF5UktCU01GQUdDU3FHU0liM0RRRUpEakZETUVFd1B3WURWUjBSQkRnd05vSXVhWEF0TVRBdE1qUXhMVEk0CkxUSXlNeTVsZFMxalpXNTBjbUZzTFRFdVkyOXRjSFYwWlM1cGJuUmxjbTVoYkljRUN2RWMzekFLQmdncWhrak8KUFFRREFnTklBREJGQWlBcmQxSklSQnBBVm5mVWwrUkV5Q2dUSW9lZTdjTTdidzhra3NFRm4xNjhFUUloQUxtNwphU1prRGtVcFpYbVZ3Ym1vZ3JtMEdFRzdCeXNFM2F0dGNQbFlwTEhLCi0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=","signerName":"kubernetes.io/kubelet-serving","usages":["digital signature","key encipherment","server auth"],"username":"system:node:ip-10-241-28-223.eu-central-1.compute.internal","groups":["system:nodes","system:authenticated"]},"status":{"conditions":[{"type":"Approved","status":"True","reason":"NodeCSRApprove","message":"This CSR was approved by the Node CSR Approver","lastUpdateTime":"2021-10-07T10:43:38Z","lastTransitionTime":"2021-10-07T10:43:38Z"}]}} I1007 10:43:38.930104 1 controller.go:179] CSR csr-8d54c approved I1007 10:43:42.695100 1 reflector.go:381] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:241: forcing resync I1007 10:43:50.668919 1 metrics.go:44] collectMetrics exit I1007 10:43:50.800279 1 metrics.go:44] collectMetrics exit I1007 10:44:20.678445 1 metrics.go:44] collectMetrics exit I1007 10:44:20.802368 1 metrics.go:44] collectMetrics exit I1007 10:44:50.668943 1 metrics.go:44] collectMetrics exit I1007 10:44:50.804770 1 metrics.go:44] collectMetrics exit I1007 10:45:20.673141 1 metrics.go:44] collectMetrics exit I1007 10:45:20.801255 1 metrics.go:44] collectMetrics exit I1007 10:45:50.668982 1 metrics.go:44] collectMetrics exit I1007 10:45:50.799709 1 metrics.go:44] collectMetrics exit I1007 10:46:20.672427 1 metrics.go:44] collectMetrics exit I1007 10:46:20.800086 1 metrics.go:44] collectMetrics exit I1007 10:46:50.675045 1 metrics.go:44] collectMetrics exit I1007 10:46:50.802675 1 metrics.go:44] collectMetrics exit I1007 10:47:20.668899 1 metrics.go:44] collectMetrics exit I1007 10:47:20.800359 1 metrics.go:44] collectMetrics exit I1007 10:47:50.668449 1 metrics.go:44] collectMetrics exit I1007 10:47:50.800452 1 metrics.go:44] collectMetrics exit I1007 10:48:20.668760 1 metrics.go:44] collectMetrics exit I1007 10:48:20.799505 1 metrics.go:44] collectMetrics exit I1007 10:48:50.669279 1 metrics.go:44] collectMetrics exit I1007 10:48:50.799582 1 metrics.go:44] collectMetrics exit </snip> Also interesting, when restarting `machine-approver-controller` it goes down to 0 for a while but latest after 6 hours or so, it's again reporting 1 no matter if there is a pending CSR or not. It would be good to understanding why mapi_current_pending_csr is set to 1 even though no pending CSR is reported --- Additional comment from Joel Speed on 2021-10-13 11:10:36 UTC --- I think I can see why this is. We reconcile the limits (which the metrics are sourced from) at the beginning of the reconcile loop. We only reconcile when there is a pending CSR. So at the beginning of the reconcile loop, there is 1 pending CSR, we set the metric to 1. We then approve the CSR, now there are no pending CSRs, so we don't have anything to reconcile. With no reconciles happening, we never update the metric value, so it stays at 1. --- Additional comment from OpenShift Automated Release Tooling on 2021-10-25 16:41:52 UTC --- Elliott changed bug status from MODIFIED to ON_QA. This bug is expected to ship in the next 4.10 release created. --- Additional comment from sunzhaohua on 2021-10-27 06:52:54 UTC --- waiting for the new nightly build to test --- Additional comment from OpenShift BugZilla Robot on 2021-10-28 05:24:44 UTC --- Bug will not be automatically moved to VERIFIED for the following reasons: - PR openshift/cluster-machine-approver#135 not appoved by QA contact - PR openshift/cluster-machine-approver#135 does not have the qe-approved label - `Verified` field of this bug is not set to `Tested` --- Additional comment from sunzhaohua on 2021-10-29 03:23:15 UTC --- verified clusterversion: 4.10.0-0.nightly-2021-10-28-150422 $ oc get csr | grep Pending $ $ oc exec -c machine-approver-controller machine-approver-86bc4fc875-whdtc -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" https://10.0.139.161:9192/metrics | grep "mapi_current_pending_csr" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP mapi_current_pending_csr Count of pending CSRs at the cluster level # TYPE mapi_current_pending_csr gauge mapi_current_pending_csr 0
I think the issue here is that because the certificate is not approved by the Cluster Machine Approver (CMA), it is still pending at the end of the approval attempt. Then, something approves it out of band, and we don't re-reconcile at that point because the certificate is approved. This is a side effect of reconciling metrics over a group of items even though we reconcile single items. My suggestion for fixing this is that we change our filter (https://github.com/openshift/cluster-machine-approver/blob/49dd2dc8f511cdee7a846a0a6c49ca0caefeb902/pkg/controller/controller.go#L71-L74) so that it is `recentlyPendingCSRs`, ie, we look at the approval condition on the CSR and if it isn't set yet, or it was approved less than 30s ago, we reconcile the object. That will allow us to reconcile the metrics after any CSR is approved and, as we already filter out approved certificates straight after reconciling the metrics, should be a pretty low touch change. It will make the CMA chattier as it will reconcile CSRs more often (each approved cert was filtered before), but hopefully putting a short time period on this will mean it's not too bad. The alternative is to completely refactor how we do the metric collection and generate the metrics on each scrap from prometheus, but that is a much larger architectural change
Radek has been investigating the fix for this issue
[miyadav@miyadav ~]$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-03-29-152521 True False 8m24s Cluster version is 4.11.0-0.nightly-2022-03-29-152521 [miyadav@miyadav ~]$ oc get pods -n openshift-cluster-machine-approver NAME READY STATUS RESTARTS AGE machine-approver-5fbf8cbccb-dnpk6 2/2 Running 2 (16m ago) 32m [miyadav@miyadav ~]$ oc exec -n openshift-cluster-machine-approver -c machine-approver-controller machine-approver-5fbf8cbccb-dnpk6 -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" https://localhost:9192/metrics | grep "mapi_current_pending_csr" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 11321 0 11321 0 0 345k 0 --:--:-- --:--:-- --:--:-- 345k # HELP mapi_current_pending_csr Count of pending CSRs at the cluster level # TYPE mapi_current_pending_csr gauge mapi_current_pending_csr 0 Additional info : Moved to VERIFIED based on results.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069