Bug 2047702
Summary: | Issue described on bug #2013528 reproduced: mapi_current_pending_csr is always set to 1 on OpenShift Container Platform 4.8 | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Lucas López Montero <llopezmo> | |
Component: | Cloud Compute | Assignee: | Radek Maňák <rmanak> | |
Cloud Compute sub component: | Cloud Controller Manager | QA Contact: | Milind Yadav <miyadav> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | medium | |||
Priority: | medium | CC: | dmoiseev, jspeed, miyadav, rmanak, sreber, zhsun | |
Version: | 4.9 | |||
Target Milestone: | --- | |||
Target Release: | 4.11.0 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: CSR renewal is handled by kube-controller-manager, so the machine approver correctly leaves the CSR pending, which raises mapi_current_pending_csr to 1. kube-controller-manager then approves the CSR, but the machine approver ignores that approval, leaving the metric unchanged.
Consequence: mapi_current_pending_csr is stuck at 1 until the next machine approver reconcile.
Fix: Reconcile CSR approvals made by other controllers so that the metric is updated.
Result: mapi_current_pending_csr is up to date after every reconcile.
|
Story Points: | --- | |
Clone Of: | 2013528 | |||
: | 2072928 (view as bug list) | Environment: | ||
Last Closed: | 2022-08-23 19:39:39 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 2013528 | |||
Bug Blocks: | 2072928 |
Description
Lucas López Montero
2022-01-28 11:28:47 UTC
I think the issue here is that, because the certificate is not approved by the Cluster Machine Approver (CMA), it is still pending at the end of the approval attempt. Then something approves it out of band, and we don't re-reconcile at that point because approved certificates are filtered out. This is a side effect of reconciling metrics over a group of items even though we reconcile single items.
My suggestion for fixing this is to change our filter (https://github.com/openshift/cluster-machine-approver/blob/49dd2dc8f511cdee7a846a0a6c49ca0caefeb902/pkg/controller/controller.go#L71-L74) so that it is `recentlyPendingCSRs`, i.e., we look at the approval condition on the CSR and, if it isn't set yet or the CSR was approved less than 30s ago, we reconcile the object. That will allow us to reconcile the metrics after any CSR is approved and, as we already filter out approved certificates straight after reconciling the metrics, it should be a fairly low-touch change. It will make the CMA chattier, since it will reconcile CSRs more often (previously every approved certificate was filtered out), but hopefully limiting this to a short time period will mean it's not too bad.
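To make the proposed filter concrete, here is a minimal sketch of what a `recentlyPendingCSRs`-style predicate could look like. It is not the code from cluster-machine-approver: the `CSR` and `Condition` types are simplified, hypothetical stand-ins for the Kubernetes CSR types, the function name `recentlyPending` is made up for illustration, and treating a denied CSR as "do not reconcile" is an assumption that the comment above does not spell out.

```go
package main

import (
	"fmt"
	"time"
)

// approvalGracePeriod is the 30s window suggested in the comment above.
const approvalGracePeriod = 30 * time.Second

// ConditionType, Condition and CSR are simplified, hypothetical stand-ins for
// the CSR status types; they are not the real Kubernetes API types.
type ConditionType string

const (
	Approved ConditionType = "Approved"
	Denied   ConditionType = "Denied"
)

type Condition struct {
	Type           ConditionType
	LastUpdateTime time.Time
}

type CSR struct {
	Name       string
	Conditions []Condition
}

// recentlyPending reports whether a CSR should still be reconciled: either it
// has no approval condition yet, or it was approved less than
// approvalGracePeriod before now.
func recentlyPending(csr CSR, now time.Time) bool {
	for _, c := range csr.Conditions {
		switch c.Type {
		case Approved:
			return now.Sub(c.LastUpdateTime) < approvalGracePeriod
		case Denied:
			// Assumption: denied CSRs are filtered out, as before.
			return false
		}
	}
	// No approval condition set yet: the CSR is still pending.
	return true
}

func main() {
	now := time.Now()
	csrs := []CSR{
		{Name: "pending"},
		{Name: "just-approved", Conditions: []Condition{{Approved, now.Add(-5 * time.Second)}}},
		{Name: "approved-long-ago", Conditions: []Condition{{Approved, now.Add(-10 * time.Minute)}}},
	}
	for _, csr := range csrs {
		fmt.Printf("%s: reconcile=%v\n", csr.Name, recentlyPending(csr, now))
	}
}
```

With a predicate like this, a CSR approved out of band would trigger one more reconcile within the grace period, which is enough to recount pending CSRs and reset the gauge; older approved CSRs would still be skipped, which bounds the extra work.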
The alternative is to completely refactor how we do the metric collection and generate the metrics on each scrape from Prometheus, but that is a much larger architectural change.

Radek has been investigating the fix for this issue.

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-29-152521   True        False         8m24s   Cluster version is 4.11.0-0.nightly-2022-03-29-152521

[miyadav@miyadav ~]$ oc get pods -n openshift-cluster-machine-approver
NAME                                READY   STATUS    RESTARTS      AGE
machine-approver-5fbf8cbccb-dnpk6   2/2     Running   2 (16m ago)   32m

[miyadav@miyadav ~]$ oc exec -n openshift-cluster-machine-approver -c machine-approver-controller machine-approver-5fbf8cbccb-dnpk6 -- curl -k -H "Authorization: Bearer `oc sa get-token prometheus-k8s -n openshift-monitoring`" -H "Content-type: application/json" https://localhost:9192/metrics | grep "mapi_current_pending_csr"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11321    0 11321    0     0   345k      0 --:--:-- --:--:-- --:--:--  345k
# HELP mapi_current_pending_csr Count of pending CSRs at the cluster level
# TYPE mapi_current_pending_csr gauge
mapi_current_pending_csr 0

Additional info: Moved to VERIFIED based on results.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069