Bug 2028019
Summary: | Max pending serving CSRs allowed in cluster machine approver is not right for UPI clusters | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Pablo Alonso Rodriguez <palonsor> |
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | low | ||
Priority: | low | CC: | jiewu, jspeed, nelluri, palonsor |
Version: | 4.7 | Flags: | miyadav: needinfo- |
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: |
Cause: A large number of Nodes created simultaneously on UPI clusters could lead to a large number of CSRs being generated, tripping the certificate approval short-circuit mechanism.
Consequence: Certificate renewals are not automated, as the approver stops approving certificates when there are more than 100 pending certificate requests.
Fix: Account for existing Nodes when calculating the short-circuit cut-off.
Result: UPI clusters can now benefit from automated certificate renewal even with large-scale refresh requests.
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-03-10 16:31:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Pablo Alonso Rodriguez
2021-12-01 10:25:29 UTC
I notice in the customer case that the linked log says they have 240 pending CSRs. Could we double-check how many Nodes they have? When a Node joins the cluster it creates 2 CSRs, one for serving and one for client, so I'm wondering if the 240 pending CSRs actually correspond to just 120 nodes, in which case we would need to account for 2 x num Nodes in our logic.

Hi, Joel. Not 120 nodes. The cluster has over 100 nodes; this one has 124. The "240 pending CSRs" was not the highest value: the highest in the must-gather logs was "Pending CSRs: 529", and the user described peaks of over 5000 pending CSRs when the issue happened. Expanding the Max Pending CSRs logic to 2 x num Nodes should be enough while the cluster is healthy and the openshift-cluster-machine-approver pod is running. However, in the bad case where the openshift-cluster-machine-approver pod is crashing, pending CSRs are not approved in time, the pending count grows beyond the expected Max Pending CSRs limit, and the cluster cannot recover to a healthy state automatically. Could we also adjust the monitoring alert rules? Do you have any ideas? https://github.com/openshift/cluster-machine-approver/blob/release-4.7/manifests/0000_90_cluster-machine-approver_04_alertrules.yaml#L28

must-gather logs:
------------------------------------------------
# omg get node | tail -n +2 | wc -l
124
# omg project openshift-cluster-machine-approver
# omg logs machine-approver-547f655977-2v8k7 -c machine-approver-controller | grep -o 'Pending CSRs:\ ...' | sort -r -n -k3 | head -n1
Pending CSRs: 529
# omg logs machine-approver-547f655977-2v8k7 -c machine-approver-controller | grep -o 'Pending CSRs:\ ...' | sort -r -n -k3 | tail -n1
Pending CSRs: 108
------------------------------------------------

I had a look into this a bit further after asking that question yesterday, and have a small update. As this is only an issue during renewal, we expect the machine approver to only approve renewals for serving certificates and not client certificates; client certificate renewals are handled by the KCM. So when a renewal wave comes up, the KCM should quickly approve the client CSRs, leaving just the serving CSRs for the machine approver. That means we won't need the double-nodes + 100 approach; nodes + 100 should be sufficient. Another thing to note is that the pending count which is displayed (240, 108, 529 in the various examples) is the count of CSRs which have not been approved and are less than 60 minutes old. Assuming the CSR approver is healthy, with the proposed fix I wouldn't expect to ever reach a situation where we have lots of older CSRs, but when it is unhealthy, the Kubelet on each node will request a new certificate every 15 minutes, which makes the problem worse. A workaround if the cluster does end up in this unhealthy state would be to either delete all CSRs (the kubelet will create new ones) or to delete all CSRs older than 15 minutes.
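To make the arithmetic above concrete, here is a minimal Go sketch of the adjusted short-circuit check: count CSRs that are unapproved and younger than 60 minutes, and compare that against 100 plus the node count. The function and variable names are illustrative only; this is a reading of the behaviour described in this thread, not the operator's actual code.

------------------------------------------------
package main

import (
	"fmt"
	"time"

	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// maxPendingBase is the fixed budget of 100 pending CSRs mentioned above;
// the proposed fix adds the current node count on top of it.
const maxPendingBase = 100

// isApproved reports whether the CSR already carries an Approved condition.
func isApproved(csr certificatesv1.CertificateSigningRequest) bool {
	for _, c := range csr.Status.Conditions {
		if c.Type == certificatesv1.CertificateApproved {
			return true
		}
	}
	return false
}

// countPending counts CSRs that are not yet approved and are younger than
// 60 minutes, matching the "Pending CSRs" figure quoted in the logs above.
func countPending(csrs []certificatesv1.CertificateSigningRequest) int {
	pending := 0
	for _, csr := range csrs {
		if !isApproved(csr) && time.Since(csr.CreationTimestamp.Time) < 60*time.Minute {
			pending++
		}
	}
	return pending
}

// shouldShortCircuit reports whether approval should stop because too many
// CSRs are pending relative to the cluster size (nodes + 100).
func shouldShortCircuit(csrs []certificatesv1.CertificateSigningRequest, nodeCount int) bool {
	return countPending(csrs) > maxPendingBase+nodeCount
}

func main() {
	// A fabricated renewal wave: 124 fresh serving CSRs, one per node,
	// as in the 124-node cluster from the must-gather above.
	var csrs []certificatesv1.CertificateSigningRequest
	for i := 0; i < 124; i++ {
		csrs = append(csrs, certificatesv1.CertificateSigningRequest{
			ObjectMeta: metav1.ObjectMeta{CreationTimestamp: metav1.Now()},
		})
	}
	fmt.Println(shouldShortCircuit(csrs, 124)) // false: 124 <= 100 + 124
}
------------------------------------------------

With the old fixed limit of 100, the same 124-CSR renewal wave would trip the short circuit; accounting for existing nodes is what lets large UPI clusters renew automatically.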
Hi, Joel. Noted, thanks for the confirmation and explanation. I wrote a workaround for deleting all CSRs older than 15 minutes:

------------------------------------------------
# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                               CONDITION
csr-44xbr   9m37s   kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-d7kp7   2m56s   kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-fshgx   10s     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-jlwbj   7m5s    kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-k9qc5   3m52s   kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-lpr98   15s     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-ngnlf   12m     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-p8r6q   6m8s    kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-rwb57   15m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-vl457   17m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-vnr2m   16m     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-wft4q   9m47s   kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-wt9j7   12m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued

# oc get csr -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .metadata.creationTimestamp | fromdate } | select(.startTime < (now | . - 900))]" | jq -r ".[].name"
csr-rwb57
csr-vl457
csr-vnr2m

# oc delete csr `oc get csr -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .metadata.creationTimestamp | fromdate } | select(.startTime < (now | . - 900))]" | jq -r ".[].name"`
certificatesigningrequest.certificates.k8s.io "csr-rwb57" deleted
certificatesigningrequest.certificates.k8s.io "csr-vl457" deleted
certificatesigningrequest.certificates.k8s.io "csr-vnr2m" deleted
------------------------------------------------

Hi, Joel. The customer has a question: could you please confirm whether the machine-approver pod can be scaled to multiple instances? They want to scale up from 1 to 3 replicas for high availability in their OCP 4.7 environment. It seems that the cluster operator doesn't restrict the machine-approver deployment's 'replicas' field, so it can be changed by the user.

The machine approver uses leader election, so I think it should be OK. Though I would expect that the CVO would reset the replica count to 1 after some period of time; I don't know how often it syncs.

Thanks, Joel, for the confirmation. I see the CVO will not reset the 'replicas' field: the manifest shipped in the cluster-version pod doesn't define a 'replicas' number, so it can be changed by the user.

------------------------------------------------
# oc project openshift-cluster-version
# oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-5ddd56bb7c-qhnl5   1/1     Running   0          37d
# oc rsh cluster-version-operator-5ddd56bb7c-qhnl5
$ cat /release-manifests/0000_50_cluster-machine-approver_04-deployment.yaml | grep -A 10 ^spec:
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: machine-approver
  template:
    metadata:
      name: machine-approver
      labels:
        app: machine-approver
------------------------------------------------
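For background on the leader-election point above, a controller-runtime based operator typically takes a leader-election lease so that only one of several replicas actively reconciles at a time. The sketch below shows that generic pattern, not the machine approver's actual code; the lock name is made up.

------------------------------------------------
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	// Enable leader election so only one replica runs the controllers.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-approver-leader-lock", // hypothetical lock name
		LeaderElectionNamespace: "openshift-cluster-machine-approver",
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered with mgr here before starting it.
	// Start blocks; replicas that do not hold the leader lease wait for it
	// before running their controllers.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
------------------------------------------------

With a lease like this in place, scaling the deployment to 3 replicas, as tried below, should not result in more than one active approver.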
Tried scaling up to 3 replicas for the machine-approver pods:

------------------------------------------------
# oc project openshift-cluster-machine-approver
# oc scale --replicas=3 deploy/machine-approver
# oc -n openshift-cluster-machine-approver get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP              NODE                        NOMINATED NODE   READINESS GATES
machine-approver-59c7c8d5d8-d84xv   2/2     Running   0          55s   192.168.14.12   master-0.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-ghbgn   2/2     Running   0          59d   192.168.14.13   master-1.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-pp799   2/2     Running   0          55s   192.168.14.14   master-2.ocp4.example.com   <none>           <none>
------------------------------------------------

Deleted the cluster-version pods to force a re-sync:

------------------------------------------------
# oc -n openshift-cluster-version delete pods --all
pod "cluster-version-operator-5ddd56bb7c-qhnl5" deleted
------------------------------------------------

Checked that the machine-approver pods are still at 3 replicas:

------------------------------------------------
# oc -n openshift-cluster-machine-approver get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP              NODE                        NOMINATED NODE   READINESS GATES
machine-approver-59c7c8d5d8-d84xv   2/2     Running   0          5m27s   192.168.14.12   master-0.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-ghbgn   2/2     Running   0          59d     192.168.14.13   master-1.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-pp799   2/2     Running   0          5m27s   192.168.14.14   master-2.ocp4.example.com   <none>           <none>
------------------------------------------------

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056