Bug 2047702 - Issue described on bug #2013528 reproduced: mapi_current_pending_csr is always set to 1 on OpenShift Container Platform 4.8
Summary: Issue described on bug #2013528 reproduced: mapi_current_pending_csr is alway...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: ---
: 4.11.0
Assignee: Radek Maňák
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On: 2013528
Blocks: 2072928
TreeView+ depends on / blocked
 
Reported: 2022-01-28 11:28 UTC by Lucas López Montero
Modified: 2022-10-12 09:45 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: CSR renewal is handled by kube-controller-manager and correctly left pending by machine approver, increasing mapi_current_pending_csr to 1. Then kube-controller-manager approves the CSR, but machine approver ignores it. Leaving the metric unchanged. Consequence: mapi_current_pending_csr is stuck at 1 until another machine approver reconcile Fix: Reconcile CSR approvals from other controllers to update metrics Result: mapi_current_pending_csr is always up-to-date after every reconcile.
Clone Of: 2013528
: 2072928 (view as bug list)
Environment:
Last Closed: 2022-08-23 19:39:39 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-machine-approver pull 160 0 None open Bug 2047702: Reconcile CSRs approved by other controllers 2022-03-15 10:32:26 UTC
Red Hat Knowledge Base (Solution) 6411541 0 None None None 2022-01-28 11:29:53 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-23 19:39:57 UTC

Description Lucas López Montero 2022-01-28 11:28:47 UTC
Issue reported on bug #2013528 (see description below) could be reproduced in a fixed version:

% oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.15    True        False         25h     Cluster version is 4.9.15

%  oc exec -n openshift-cluster-machine-approver -c machine-approver-controller  machine-approver-74559b5df9-gdngs -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`"   -H "Content-type: application/json" https://localhost:9192/metrics | grep "mapi_current_pending_csr"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP mapi_current_pending_csr Count of pending CSRs at the cluster level
# TYPE mapi_current_pending_csr gauge
mapi_current_pending_csr 1
100  191k    0  191k    0     0  3996k      0 --:--:-- --:--:-- --:--:-- 3996k

% oc get csr
No resources found


+++ This bug was initially created as a clone of Bug #2013528 +++

Description of problem:

After installing or updating to OpenShift Container Platform 4.8 it was found that  mapi_current_pending_csr metric from openshift-cluster-machine-approver is always reporting one, even though no pending CSR is reported:

> $ oc get clusterversion
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.8.12    True        False         43h     Cluster version is 4.8.12

> $ oc get csr
> No resources found

> $ oc exec -c machine-approver-controller machine-approver-5fcfd56b9d-pc4g8 -- curl -H "Authorization: Bearer XXXXXX" -k https://localhost:9192/metrics | grep mapi_current_pending_csr
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP mapi_current_pending_csr Count of pending CSRs at the cluster level
> # TYPE mapi_current_pending_csr gauge
> mapi_current_pending_csr 1

This is reported by a freshly installed OpenShift Container Platform 4.8 - Cluster on AWS using IPI. But the same can be seen with OpenShift Container Platform - Cluster running on OpenStack and using UPI installation mode.

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.8.12

How reproducible:

 - So far always


Steps to Reproduce:
1. Install OpenShift Container Platform 4.8, wait some time and check metrics for `mapi_current_pending_csr`
2. Also make sure that there is no pending CSR in the Cluster

Actual results:

mapi_current_pending_csr is reporting 1 even though no pending CSR is reported

Expected results:

mapi_current_pending_csr should report 0 if there is no pending CSR

Additional info:

--- Additional comment from Simon Reber on 2021-10-13 06:12:37 UTC ---

Hi all,

The Screenshot attached is showing the history of the mapi_current_pending_csr metric which is set to one shortly after OpenShift Container Platform 4.8 is installed. As shown there are no pending CSR and after installation there was no Node added, removed or changed. So the cluster did not have pending CSR for a long time.

--- Additional comment from Simon Reber on 2021-10-13 06:16:58 UTC ---

Hi all,

I also have a `must-gather` from a customer OpenShift Container Platform - Cluster that is showing the same behavior (but which was updated and not fresh installed). In addition I also have logs from `machine-approver-controller` with log-level 9. But except for some CSR activity (which is expected when scaling Nodes, etc.) there is nothing as Metrics collection does not appear to provide more verbose data:

<snip from customer `machine-approver-controller` log with log-level 9>
I1007 10:43:38.912482       1 csr_check.go:442] retrieving serving cert from ip-10-241-28-223.eu-central-1.compute.internal (10.241.28.223:10250)
I1007 10:43:38.914685       1 csr_check.go:186] Failed to retrieve current serving cert: remote error: tls: internal error
I1007 10:43:38.914705       1 csr_check.go:191] Falling back to machine-api authorization for ip-10-241-28-223.eu-central-1.compute.internal
I1007 10:43:38.914869       1 request.go:1123] Request Body: {"kind":"CertificateSigningRequest","apiVersion":"certificates.k8s.io/v1","metadata":{"name":"csr-8d54c","generateName":"csr-","uid":"8d434333-bb43-4bcb-bc28-080ec539666e","resourceVersion":"49470864","creationTimestamp":"2021-10-07T10:43:38Z","managedFields":[{"manager":"kubelet","operation":"Update","apiVersion":"certificates.k8s.io/v1","time":"2021-10-07T10:43:38Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{}}}}]},"spec":{"request":"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQmFqQ0NBUkFDQVFBd1hERVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVVNd1FRWURWUVFERXpwegplWE4wWlcwNmJtOWtaVHBwY0MweE1DMHlOREV0TWpndE1qSXpMbVYxTFdObGJuUnlZV3d0TVM1amIyMXdkWFJsCkxtbHVkR1Z5Ym1Gc01Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERBUWNEUWdBRUQvRXN3TWV6WFdEMERxZDgKeFN3VTdkWld3ZlZQRzgwSnVmamU3U3FCVEtDSWNSb3c0b1FCQTh0WmdVRVE2MnR0ZW92alhyd3BZN2xyb2R2SgpBUGF5UktCU01GQUdDU3FHU0liM0RRRUpEakZETUVFd1B3WURWUjBSQkRnd05vSXVhWEF0TVRBdE1qUXhMVEk0CkxUSXlNeTVsZFMxalpXNTBjbUZzTFRFdVkyOXRjSFYwWlM1cGJuUmxjbTVoYkljRUN2RWMzekFLQmdncWhrak8KUFFRREFnTklBREJGQWlBcmQxSklSQnBBVm5mVWwrUkV5Q2dUSW9lZTdjTTdidzhra3NFRm4xNjhFUUloQUxtNwphU1prRGtVcFpYbVZ3Ym1vZ3JtMEdFRzdCeXNFM2F0dGNQbFlwTEhLCi0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=","signerName":"kubernetes.io/kubelet-serving","usages":["digital signature","key encipherment","server auth"],"username":"system:node:ip-10-241-28-223.eu-central-1.compute.internal","groups":["system:nodes","system:authenticated"]},"status":{"conditions":[{"type":"Approved","status":"True","reason":"NodeCSRApprove","message":"This CSR was approved by the Node CSR Approver","lastUpdateTime":"2021-10-07T10:43:38Z","lastTransitionTime":null}]}}
I1007 10:43:38.914970       1 round_trippers.go:435] curl -k -v -XPUT  -H "User-Agent: machine-approver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Content-Type: application/json" -H "Authorization: Bearer <masked>" 'https://198.18.0.1:443/apis/certificates.k8s.io/v1/certificatesigningrequests/csr-8d54c/approval'
I1007 10:43:38.929892       1 round_trippers.go:454] PUT https://198.18.0.1:443/apis/certificates.k8s.io/v1/certificatesigningrequests/csr-8d54c/approval 200 OK in 14 milliseconds
I1007 10:43:38.929910       1 round_trippers.go:460] Response Headers:
I1007 10:43:38.929916       1 round_trippers.go:463]     Content-Length: 2020
I1007 10:43:38.929919       1 round_trippers.go:463]     Date: Thu, 07 Oct 2021 10:43:38 GMT
I1007 10:43:38.929922       1 round_trippers.go:463]     Audit-Id: 434045ce-4250-4d54-b349-2555c797386f
I1007 10:43:38.929926       1 round_trippers.go:463]     Cache-Control: no-cache, private
I1007 10:43:38.929928       1 round_trippers.go:463]     Content-Type: application/json
I1007 10:43:38.929931       1 round_trippers.go:463]     X-Kubernetes-Pf-Flowschema-Uid: 60a5fc8f-0d1e-410f-bd9c-210b7ef205fa
I1007 10:43:38.929934       1 round_trippers.go:463]     X-Kubernetes-Pf-Prioritylevel-Uid: 027324f9-e660-4929-9b52-8f1c525c9f52
I1007 10:43:38.929981       1 request.go:1123] Response Body: {"kind":"CertificateSigningRequest","apiVersion":"certificates.k8s.io/v1","metadata":{"name":"csr-8d54c","generateName":"csr-","uid":"8d434333-bb43-4bcb-bc28-080ec539666e","resourceVersion":"49470872","creationTimestamp":"2021-10-07T10:43:38Z","managedFields":[{"manager":"kubelet","operation":"Update","apiVersion":"certificates.k8s.io/v1","time":"2021-10-07T10:43:38Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{}}}},{"manager":"machine-approver","operation":"Update","apiVersion":"certificates.k8s.io/v1","time":"2021-10-07T10:43:38Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{".":{},"k:{\"type\":\"Approved\"}":{".":{},"f:lastTransitionTime":{},"f:lastUpdateTime":{},"f:message":{},"f:reason":{},"f:status":{},"f:type":{}}}}}}]},"spec":{"request":"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQmFqQ0NBUkFDQVFBd1hERVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVVNd1FRWURWUVFERXpwegplWE4wWlcwNmJtOWtaVHBwY0MweE1DMHlOREV0TWpndE1qSXpMbVYxTFdObGJuUnlZV3d0TVM1amIyMXdkWFJsCkxtbHVkR1Z5Ym1Gc01Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERBUWNEUWdBRUQvRXN3TWV6WFdEMERxZDgKeFN3VTdkWld3ZlZQRzgwSnVmamU3U3FCVEtDSWNSb3c0b1FCQTh0WmdVRVE2MnR0ZW92alhyd3BZN2xyb2R2SgpBUGF5UktCU01GQUdDU3FHU0liM0RRRUpEakZETUVFd1B3WURWUjBSQkRnd05vSXVhWEF0TVRBdE1qUXhMVEk0CkxUSXlNeTVsZFMxalpXNTBjbUZzTFRFdVkyOXRjSFYwWlM1cGJuUmxjbTVoYkljRUN2RWMzekFLQmdncWhrak8KUFFRREFnTklBREJGQWlBcmQxSklSQnBBVm5mVWwrUkV5Q2dUSW9lZTdjTTdidzhra3NFRm4xNjhFUUloQUxtNwphU1prRGtVcFpYbVZ3Ym1vZ3JtMEdFRzdCeXNFM2F0dGNQbFlwTEhLCi0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=","signerName":"kubernetes.io/kubelet-serving","usages":["digital signature","key encipherment","server auth"],"username":"system:node:ip-10-241-28-223.eu-central-1.compute.internal","groups":["system:nodes","system:authenticated"]},"status":{"conditions":[{"type":"Approved","status":"True","reason":"NodeCSRApprove","message":"This CSR was approved by the Node CSR Approver","lastUpdateTime":"2021-10-07T10:43:38Z","lastTransitionTime":"2021-10-07T10:43:38Z"}]}}
I1007 10:43:38.930104       1 controller.go:179] CSR csr-8d54c approved
I1007 10:43:42.695100       1 reflector.go:381] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:241: forcing resync
I1007 10:43:50.668919       1 metrics.go:44] collectMetrics exit
I1007 10:43:50.800279       1 metrics.go:44] collectMetrics exit
I1007 10:44:20.678445       1 metrics.go:44] collectMetrics exit
I1007 10:44:20.802368       1 metrics.go:44] collectMetrics exit
I1007 10:44:50.668943       1 metrics.go:44] collectMetrics exit
I1007 10:44:50.804770       1 metrics.go:44] collectMetrics exit
I1007 10:45:20.673141       1 metrics.go:44] collectMetrics exit
I1007 10:45:20.801255       1 metrics.go:44] collectMetrics exit
I1007 10:45:50.668982       1 metrics.go:44] collectMetrics exit
I1007 10:45:50.799709       1 metrics.go:44] collectMetrics exit
I1007 10:46:20.672427       1 metrics.go:44] collectMetrics exit
I1007 10:46:20.800086       1 metrics.go:44] collectMetrics exit
I1007 10:46:50.675045       1 metrics.go:44] collectMetrics exit
I1007 10:46:50.802675       1 metrics.go:44] collectMetrics exit
I1007 10:47:20.668899       1 metrics.go:44] collectMetrics exit
I1007 10:47:20.800359       1 metrics.go:44] collectMetrics exit
I1007 10:47:50.668449       1 metrics.go:44] collectMetrics exit
I1007 10:47:50.800452       1 metrics.go:44] collectMetrics exit
I1007 10:48:20.668760       1 metrics.go:44] collectMetrics exit
I1007 10:48:20.799505       1 metrics.go:44] collectMetrics exit
I1007 10:48:50.669279       1 metrics.go:44] collectMetrics exit
I1007 10:48:50.799582       1 metrics.go:44] collectMetrics exit
</snip>

Also interesting, when restarting `machine-approver-controller` it goes down to 0 for a while but latest after 6 hours or so, it's again reporting 1 no matter if there is a pending CSR or not.

It would be good to understanding why mapi_current_pending_csr is set to 1 even though no pending CSR is reported

--- Additional comment from Joel Speed on 2021-10-13 11:10:36 UTC ---

I think I can see why this is.

We reconcile the limits (which the metrics are sourced from) at the beginning of the reconcile loop.

We only reconcile when there is a pending CSR.

So at the beginning of the reconcile loop, there is 1 pending CSR, we set the metric to 1.
We then approve the CSR, now there are no pending CSRs, so we don't have anything to reconcile.
With no reconciles happening, we never update the metric value, so it stays at 1.

--- Additional comment from OpenShift Automated Release Tooling on 2021-10-25 16:41:52 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.10 release created.

--- Additional comment from sunzhaohua on 2021-10-27 06:52:54 UTC ---

waiting for the new nightly build to test

--- Additional comment from OpenShift BugZilla Robot on 2021-10-28 05:24:44 UTC ---

Bug will not be automatically moved to VERIFIED for the following reasons:
- PR openshift/cluster-machine-approver#135 not appoved by QA contact
- PR openshift/cluster-machine-approver#135 does not have the qe-approved label
- `Verified` field of this bug is not set to `Tested`

--- Additional comment from sunzhaohua on 2021-10-29 03:23:15 UTC ---

verified
clusterversion: 4.10.0-0.nightly-2021-10-28-150422

$ oc get csr | grep Pending 
$

$ oc exec -c machine-approver-controller  machine-approver-86bc4fc875-whdtc  -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`"   -H "Content-type: application/json" https://10.0.139.161:9192/metrics | grep "mapi_current_pending_csr"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP mapi_current_pending_csr Count of pending CSRs at the cluster level
# TYPE mapi_current_pending_csr gauge
mapi_current_pending_csr 0

Comment 2 Joel Speed 2022-01-28 13:33:21 UTC
I think the issue here is that because the certificate is not approved by the Cluster Machine Approver (CMA), it is still pending at the end of the approval attempt.
Then, something approves it out of band, and we don't re-reconcile at that point because the certificate is approved.
This is a side effect of reconciling metrics over a group of items even though we reconcile single items.

My suggestion for fixing this is that we change our filter (https://github.com/openshift/cluster-machine-approver/blob/49dd2dc8f511cdee7a846a0a6c49ca0caefeb902/pkg/controller/controller.go#L71-L74) so that it is `recentlyPendingCSRs`, ie, we look at the approval condition on the CSR and if it isn't set yet, or it was approved less than 30s ago, we reconcile the object.
That will allow us to reconcile the metrics after any CSR is approved and, as we already filter out approved certificates straight after reconciling the metrics, should be a pretty low touch change.

It will make the CMA chattier as it will reconcile CSRs more often (each approved cert was filtered before), but hopefully putting a short time period on this will mean it's not too bad.

The alternative is to completely refactor how we do the metric collection and generate the metrics on each scrap from prometheus, but that is a much larger architectural change

Comment 8 Joel Speed 2022-03-14 10:38:28 UTC
Radek has been investigating the fix for this issue

Comment 11 Milind Yadav 2022-03-31 07:40:02 UTC
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-29-152521   True        False         8m24s   Cluster version is 4.11.0-0.nightly-2022-03-29-152521


[miyadav@miyadav ~]$ oc get pods -n openshift-cluster-machine-approver
NAME                                READY   STATUS    RESTARTS      AGE
machine-approver-5fbf8cbccb-dnpk6   2/2     Running   2 (16m ago)   32m


[miyadav@miyadav ~]$ oc exec -n openshift-cluster-machine-approver -c machine-approver-controller  machine-approver-5fbf8cbccb-dnpk6 -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`"   -H "Content-type: application/json" https://localhost:9192/metrics | grep "mapi_current_pending_csr"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11321    0 11321    0     0   345k      0 --:--:-- --:--:-- --:--:--  345k
# HELP mapi_current_pending_csr Count of pending CSRs at the cluster level
# TYPE mapi_current_pending_csr gauge
mapi_current_pending_csr 0



Additional info :
Moved to VERIFIED based on results.

Comment 22 errata-xmlrpc 2022-08-23 19:39:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069


Note You need to log in before you can comment on or make changes to this bug.