Description of problem:

The cluster machine approver normally approves CSRs for kubelet serving certificates that meet certain criteria; more details here [1]. It is important to note that, unlike other workflows, this one does not check Machine objects, i.e. it also runs on UPI clusters where the Machine API is not used.

As a safety measure, cluster-machine-approver stops approving when there are more than 100 + numberOfMachines pending CSRs. The problem arises when:
- The cluster is UPI, so there are no Machine objects, and cluster-machine-approver stops working as soon as there are more than 100 pending CSRs.
- The customer created more than 100 nodes in one shot (via their own automation), and those nodes need to renew their serving certificates around the same time.

Version-Release number of selected component (if applicable):
4.7, but it should be reproducible in the latest versions.

How reproducible:
Whenever kubelet serving certificates need to be renewed under the described conditions.

Steps to Reproduce:
1. Install a bare-metal UPI cluster.
2. Add more than 100 nodes to the cluster in the same time frame.
3. Wait until their certificates need to be renewed.

Actual results:
The cluster machine approver does not approve the CSRs due to the max-pending limit.

Expected results:
The cluster machine approver approves the CSRs regardless of the number of nodes.

Additional info:
I see two possible ways to fix this:
- Replace "100 + len(machines)" with "100 + max(len(machines), len(nodes))" in order to account for both the case of many nodes without machines (UPI) and many provisioned VMs that did not get a Node object yet.
- Make the limit tunable.

References:
[1] https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L277
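As an illustration of the first proposed fix, here is a minimal Go sketch (hypothetical function names, not the actual cluster-machine-approver code in csr_check.go):

```go
package main

import "fmt"

// maxPendingBuffer mirrors the fixed headroom of 100 CSRs that the
// approver allows on top of the machine count.
const maxPendingBuffer = 100

// recommendedMaxPending sketches the proposed "100 + max(machines, nodes)"
// limit: taking the larger of the two counts covers both UPI clusters
// (nodes but no Machine objects) and freshly provisioned VMs that have a
// Machine but no Node object yet.
func recommendedMaxPending(numMachines, numNodes int) int {
	limit := numMachines
	if numNodes > limit {
		limit = numNodes
	}
	return maxPendingBuffer + limit
}

func main() {
	fmt.Println(recommendedMaxPending(0, 124))   // UPI cluster: 124 nodes, no machines -> 224
	fmt.Println(recommendedMaxPending(124, 120)) // machines provisioned ahead of nodes -> 224
}
```

With the current "100 + len(machines)" logic the UPI case would cap at 100 pending CSRs; with the sketch above it would cap at 224.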
I notice in the customer case that the linked log says they have 240 pending CSRs; could we double-check how many Nodes they have? When a Node joins the cluster it creates 2 CSRs, one for serving and one for client. I'm wondering if the 240 pending CSRs actually correspond to just 120 nodes, in which case we would need to account for 2 x the number of Nodes in our logic.
Hi Joel,

Not 120 nodes. The cluster has 124 nodes. The "240 pending CSRs" was not the highest value; the highest in the must-gather logs was "Pending CSRs: 529", and the user reported over 5000 pending CSRs at the worst point when the issue happened.

It seems that expanding the max-pending logic to 2 x numNodes should be enough while the cluster is healthy and the openshift-cluster-machine-approver pod is running. However, in the bad case, if the openshift-cluster-machine-approver pod is crashing, pending CSRs are not approved in time, the pending count grows beyond any expected max-pending limit, and the cluster cannot recover to a healthy status automatically.

Could we also adjust the monitoring alert rules? Do you have any idea?
https://github.com/openshift/cluster-machine-approver/blob/release-4.7/manifests/0000_90_cluster-machine-approver_04_alertrules.yaml#L28

must-gather logs:
------------------------------------------------
# omg get node | tail -n +2 | wc -l
124
# omg project openshift-cluster-machine-approver
# omg logs machine-approver-547f655977-2v8k7 -c machine-approver-controller | grep -o 'Pending CSRs:\ ...' | sort -r -n -k3 | head -n1
Pending CSRs: 529
# omg logs machine-approver-547f655977-2v8k7 -c machine-approver-controller | grep -o 'Pending CSRs:\ ...' | sort -r -n -k3 | tail -n1
Pending CSRs: 108
------------------------------------------------
I had a look into this a bit further after asking that question yesterday, and have a small update.

As this is only an issue during renewal, we expect the machine approver to approve only renewals for serving certificates, not client certificates; client certificate renewals are handled by the KCM. So when a renewal comes up, the KCM should quickly approve the client CSRs, leaving just the serving CSRs for the machine approver. That means we don't need the 2 x nodes + 100 logic; nodes + 100 should be sufficient.

Another thing to note: the pending counts shown (240, 108, 529 in the various examples) count CSRs that have not been approved and are less than 60 minutes old. Assuming the CSR approver is healthy, with the proposed fix I wouldn't expect us ever to get into a situation with lots of older CSRs. But when it is unhealthy, the Kubelet on each node requests a new certificate every 15 minutes, which makes the problem worse.

A workaround, if the cluster does end up in this unhealthy state, is to either delete all CSRs (the kubelet will create new ones) or delete all CSRs older than 15 minutes.
Hi Joel,

Noted, thanks for the confirmation and explanation. I wrote a workaround for deleting all CSRs older than 15 minutes.

# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                               CONDITION
csr-44xbr   9m37s   kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-d7kp7   2m56s   kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-fshgx   10s     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-jlwbj   7m5s    kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-k9qc5   3m52s   kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-lpr98   15s     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-ngnlf   12m     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-p8r6q   6m8s    kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-rwb57   15m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-vl457   17m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-vnr2m   16m     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-wft4q   9m47s   kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-wt9j7   12m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued

# oc get csr -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .metadata.creationTimestamp | fromdate } | select(.startTime < (now | . - 900))]" | jq -r ".[].name"
csr-rwb57
csr-vl457
csr-vnr2m

# oc delete csr `oc get csr -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .metadata.creationTimestamp | fromdate } | select(.startTime < (now | . - 900))]" | jq -r ".[].name"`
certificatesigningrequest.certificates.k8s.io "csr-rwb57" deleted
certificatesigningrequest.certificates.k8s.io "csr-vl457" deleted
certificatesigningrequest.certificates.k8s.io "csr-vnr2m" deleted
Hi Joel,

The customer has a question: could you please help confirm whether they can scale the machine-approver pod to multiple instances? They want to scale up from 1 to 3 replicas for high availability in their OCP 4.7 environment. It seems that the cluster operator doesn't restrict the machine-approver deployment's 'replicas' field, so it can be changed by the user.
The machine approver uses leader election, so I think it should be OK. Though I would expect the CVO to reset the replica count to 1 after some period of time; I don't know how often it syncs.
Thanks Joel for the confirmation. I verified that the CVO will not reset the 'replicas' field: the machine-approver deployment manifest shipped with the cluster-version payload doesn't define a 'replicas' number, so it can be changed by the user.

# oc project openshift-cluster-version
# oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-5ddd56bb7c-qhnl5   1/1     Running   0          37d
# oc rsh cluster-version-operator-5ddd56bb7c-qhnl5
$ cat /release-manifests/0000_50_cluster-machine-approver_04-deployment.yaml | grep -A 10 ^spec:
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: machine-approver
  template:
    metadata:
      name: machine-approver
      labels:
        app: machine-approver

Tried scaling the machine-approver pods up to 3 replicas.

# oc project openshift-cluster-machine-approver
# oc scale --replicas=3 deploy/machine-approver
# oc -n openshift-cluster-machine-approver get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP              NODE                        NOMINATED NODE   READINESS GATES
machine-approver-59c7c8d5d8-d84xv   2/2     Running   0          55s   192.168.14.12   master-0.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-ghbgn   2/2     Running   0          59d   192.168.14.13   master-1.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-pp799   2/2     Running   0          55s   192.168.14.14   master-2.ocp4.example.com   <none>           <none>

Deleted the cluster-version pods to force a re-sync.

# oc -n openshift-cluster-version delete pods --all
pod "cluster-version-operator-5ddd56bb7c-qhnl5" deleted

Checked that the machine-approver pods are still at 3 replicas.

# oc -n openshift-cluster-machine-approver get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP              NODE                        NOMINATED NODE   READINESS GATES
machine-approver-59c7c8d5d8-d84xv   2/2     Running   0          5m27s   192.168.14.12   master-0.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-ghbgn   2/2     Running   0          59d     192.168.14.13   master-1.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-pp799   2/2     Running   0          5m27s   192.168.14.14   master-2.ocp4.example.com   <none>           <none>
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056