Bug 2028019
Summary: | Max pending serving CSRs allowed in cluster machine approver is not right for UPI clusters | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Pablo Alonso Rodriguez <palonsor> |
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | low | ||
Priority: | low | CC: | jiewu, jspeed, nelluri, palonsor |
Version: | 4.7 | Flags: | miyadav: needinfo- |
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: |
Cause: A large number of Nodes created simultaneously on UPI clusters could lead to a large number of CSRs being generated, tripping the certificate approval short-circuit mechanism.
Consequence: Certificate renewals are not automated, as the approver stops approving certificates when there are more than 100 pending certificate requests.
Fix: Account for existing Nodes when calculating the short-circuit cut-off.
Result: UPI clusters can now benefit from automated certificate renewal even with large-scale refresh requests.
|
Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-03-10 16:31:07 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Pablo Alonso Rodriguez
2021-12-01 10:25:29 UTC
I notice in the customer case that the linked log says they have 240 pending CSRs. Could we double-check how many Nodes they have? When a Node joins the cluster it creates 2 CSRs, one for serving and one for client, so I'm wondering if the 240 pending CSRs actually correspond to just 120 nodes, in which case we would need to account for 2 x num Nodes in our logic.

Hi, Joel. Not 120 nodes. The cluster has over 100 nodes; this one has 124. The "240 pending CSRs" was not the highest value: the highest in the must-gather logs was "Pending CSRs: 529", and the user described peaks of over 5000 pending CSRs when the issue happened. Expanding the Max Pending CSRs logic to 2 x num Nodes should be enough while the cluster is healthy and the openshift-cluster-machine-approver pod is running. However, in the bad case where the openshift-cluster-machine-approver pod is crashing, pending CSRs are not approved in time, the pending count grows beyond the expected Max Pending CSRs limit, and the cluster cannot recover to a healthy state automatically. Could we also adjust the monitoring alert rules? Do you have any ideas? https://github.com/openshift/cluster-machine-approver/blob/release-4.7/manifests/0000_90_cluster-machine-approver_04_alertrules.yaml#L28

must-gather logs:
------------------------------------------------
# omg get node | tail -n +2 | wc -l
124
# omg project openshift-cluster-machine-approver
# omg logs machine-approver-547f655977-2v8k7 -c machine-approver-controller | grep -o 'Pending CSRs:\ ...' | sort -r -n -k3 | head -n1
Pending CSRs: 529
# omg logs machine-approver-547f655977-2v8k7 -c machine-approver-controller | grep -o 'Pending CSRs:\ ...' | sort -r -n -k3 | tail -n1
Pending CSRs: 108
------------------------------------------------

I had a look into this a bit further after asking that question yesterday, and have a small update. As this is only an issue during renewal, we expect the machine approver to only approve renewals for serving certificates and not client certificates; client certificate renewals are handled by the KCM. So when a renewal wave comes up, the KCM should quickly approve the client CSRs, leaving just the serving CSRs for the machine approver. That means we won't need the double-nodes + 100 approach; nodes + 100 should be sufficient. Another thing to note is that the pending count which is displayed (240, 108, 529 in the various examples) is the count of CSRs which have not been approved and are less than 60 minutes old. Assuming the CSR approver is healthy, with the proposed fix I wouldn't expect to ever reach a situation where we have lots of older CSRs, but when it is unhealthy, the Kubelet on each node will request a new certificate every 15 minutes, which makes the problem worse. A workaround if the cluster does end up in this unhealthy state would be to either delete all CSRs (the kubelet will create new ones) or to delete all CSRs older than 15 minutes.
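To make the arithmetic above concrete, here is a minimal Go sketch of the adjusted short-circuit check: count CSRs that are unapproved and younger than 60 minutes, and compare that against 100 plus the node count. The function and variable names are illustrative only; this is a reading of the behaviour described in this thread, not the operator's actual code.

------------------------------------------------
package main

import (
	"fmt"
	"time"

	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// maxPendingBase is the fixed budget of 100 pending CSRs mentioned above;
// the proposed fix adds the current node count on top of it.
const maxPendingBase = 100

// isApproved reports whether the CSR already carries an Approved condition.
func isApproved(csr certificatesv1.CertificateSigningRequest) bool {
	for _, c := range csr.Status.Conditions {
		if c.Type == certificatesv1.CertificateApproved {
			return true
		}
	}
	return false
}

// countPending counts CSRs that are not yet approved and are younger than
// 60 minutes, matching the "Pending CSRs" figure quoted in the logs above.
func countPending(csrs []certificatesv1.CertificateSigningRequest) int {
	pending := 0
	for _, csr := range csrs {
		if !isApproved(csr) && time.Since(csr.CreationTimestamp.Time) < 60*time.Minute {
			pending++
		}
	}
	return pending
}

// shouldShortCircuit reports whether approval should stop because too many
// CSRs are pending relative to the cluster size (nodes + 100).
func shouldShortCircuit(csrs []certificatesv1.CertificateSigningRequest, nodeCount int) bool {
	return countPending(csrs) > maxPendingBase+nodeCount
}

func main() {
	// A fabricated renewal wave: 124 fresh serving CSRs, one per node,
	// as in the 124-node cluster from the must-gather above.
	var csrs []certificatesv1.CertificateSigningRequest
	for i := 0; i < 124; i++ {
		csrs = append(csrs, certificatesv1.CertificateSigningRequest{
			ObjectMeta: metav1.ObjectMeta{CreationTimestamp: metav1.Now()},
		})
	}
	fmt.Println(shouldShortCircuit(csrs, 124)) // false: 124 <= 100 + 124
}
------------------------------------------------

With the old fixed limit of 100, the same 124-CSR renewal wave would trip the short circuit; accounting for existing nodes is what lets large UPI clusters renew automatically.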
Hi, Joel. Noted, thanks for the confirmation and explanation. I wrote a workaround for deleting all CSRs older than 15 minutes:

------------------------------------------------
# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                               CONDITION
csr-44xbr   9m37s   kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-d7kp7   2m56s   kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-fshgx   10s     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-jlwbj   7m5s    kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-k9qc5   3m52s   kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-lpr98   15s     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-ngnlf   12m     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-p8r6q   6m8s    kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-rwb57   15m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-vl457   17m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-vnr2m   16m     kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-0.ocp4.example.com   Approved,Issued
csr-wft4q   9m47s   kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued
csr-wt9j7   12m     kubernetes.io/kubelet-serving                 system:node:worker-0.ocp4.example.com   Approved,Issued

# oc get csr -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .metadata.creationTimestamp | fromdate } | select(.startTime < (now | . - 900))]" | jq -r ".[].name"
csr-rwb57
csr-vl457
csr-vnr2m

# oc delete csr `oc get csr -o json | jq -r "[.items[] | {name: .metadata.name, startTime: .metadata.creationTimestamp | fromdate } | select(.startTime < (now | . - 900))]" | jq -r ".[].name"`
certificatesigningrequest.certificates.k8s.io "csr-rwb57" deleted
certificatesigningrequest.certificates.k8s.io "csr-vl457" deleted
certificatesigningrequest.certificates.k8s.io "csr-vnr2m" deleted
------------------------------------------------

Hi, Joel. The customer has a question: could you please confirm whether the machine-approver pod can be scaled to multiple instances? They want to scale up from 1 to 3 replicas for high availability in their OCP 4.7 environment. It seems that the cluster operator doesn't restrict the machine-approver deployment's 'replicas' field, so it can be changed by the user.

The machine approver uses leader election, so I think it should be OK. Though I would expect that the CVO would reset the replica count to 1 after some period of time; I don't know how often it syncs.

Thanks, Joel, for the confirmation. I see the CVO will not reset the 'replicas' field: the manifest shipped in the cluster-version pod doesn't define a 'replicas' number, so it can be changed by the user.

------------------------------------------------
# oc project openshift-cluster-version
# oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-5ddd56bb7c-qhnl5   1/1     Running   0          37d
# oc rsh cluster-version-operator-5ddd56bb7c-qhnl5
$ cat /release-manifests/0000_50_cluster-machine-approver_04-deployment.yaml | grep -A 10 ^spec:
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: machine-approver
  template:
    metadata:
      name: machine-approver
      labels:
        app: machine-approver
------------------------------------------------
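For background on the leader-election point above, a controller-runtime based operator typically takes a leader-election lease so that only one of several replicas actively reconciles at a time. The sketch below shows that generic pattern, not the machine approver's actual code; the lock name is made up.

------------------------------------------------
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	// Enable leader election so only one replica runs the controllers.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-approver-leader-lock", // hypothetical lock name
		LeaderElectionNamespace: "openshift-cluster-machine-approver",
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered with mgr here before starting it.
	// Start blocks; replicas that do not hold the leader lease wait for it
	// before running their controllers.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
------------------------------------------------

With a lease like this in place, scaling the deployment to 3 replicas, as tried below, should not result in more than one active approver.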
Tried scaling up to 3 replicas for the machine-approver pods:

------------------------------------------------
# oc project openshift-cluster-machine-approver
# oc scale --replicas=3 deploy/machine-approver
# oc -n openshift-cluster-machine-approver get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP              NODE                        NOMINATED NODE   READINESS GATES
machine-approver-59c7c8d5d8-d84xv   2/2     Running   0          55s   192.168.14.12   master-0.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-ghbgn   2/2     Running   0          59d   192.168.14.13   master-1.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-pp799   2/2     Running   0          55s   192.168.14.14   master-2.ocp4.example.com   <none>           <none>
------------------------------------------------

Deleted the cluster-version pods to force a re-sync:

------------------------------------------------
# oc -n openshift-cluster-version delete pods --all
pod "cluster-version-operator-5ddd56bb7c-qhnl5" deleted
------------------------------------------------

Checked that the machine-approver pods are still at 3 replicas:

------------------------------------------------
# oc -n openshift-cluster-machine-approver get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP              NODE                        NOMINATED NODE   READINESS GATES
machine-approver-59c7c8d5d8-d84xv   2/2     Running   0          5m27s   192.168.14.12   master-0.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-ghbgn   2/2     Running   0          59d     192.168.14.13   master-1.ocp4.example.com   <none>           <none>
machine-approver-59c7c8d5d8-pp799   2/2     Running   0          5m27s   192.168.14.14   master-2.ocp4.example.com   <none>           <none>
------------------------------------------------

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056