Description of problem:

After installing a cluster on OSP, the CSRs for the workers' system:node identities stay Pending and are never approved:

$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-4trj7   17m     system:node:morenod-ocp-25r78-master-1                                      Approved,Issued
csr-6fvjb   16m     system:node:morenod-ocp-25r78-master-0                                      Approved,Issued
csr-7tj7w   12m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-7tzvs   10m     system:node:morenod-ocp-25r78-worker-k6wbl                                  Pending
csr-8rz5d   12m     system:node:morenod-ocp-25r78-worker-kvkwk                                  Pending
csr-h4kxx   12m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qbjk7   17m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qzxpr   12m     system:node:morenod-ocp-25r78-worker-94l6q                                  Pending
csr-twlt9   17m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-vnbpm   17m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-vpqf6   16m     system:node:morenod-ocp-25r78-master-2                                      Approved,Issued
csr-xjv77   11m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

But if the cluster is scaled up with new workers, the new workers get their certificates approved automatically, with no manual intervention:

$ oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-2p7wj   6m3s    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-4trj7   38m     system:node:morenod-ocp-25r78-master-1                                      Approved,Issued
csr-6fvjb   38m     system:node:morenod-ocp-25r78-master-0                                      Approved,Issued
csr-7tj7w   34m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-7tzvs   32m     system:node:morenod-ocp-25r78-worker-k6wbl                                  Pending
csr-8rz5d   34m     system:node:morenod-ocp-25r78-worker-kvkwk                                  Pending
csr-bhcrv   19m     system:node:morenod-ocp-25r78-worker-kvkwk                                  Approved,Issued
csr-g6k4c   18m     system:node:morenod-ocp-25r78-worker-94l6q                                  Approved,Issued
csr-h4kxx   34m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mblhl   17m     system:node:morenod-ocp-25r78-worker-k6wbl                                  Approved,Issued
csr-pzg58   5m46s   system:node:morenod-ocp-25r78-worker-6fkdn                                  Approved,Issued
csr-qbjk7   38m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qzxpr   33m     system:node:morenod-ocp-25r78-worker-94l6q                                  Pending
csr-rkxqp   6m36s   system:node:morenod-ocp-25r78-worker-gsxqq                                  Approved,Issued
csr-twlt9   38m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-vnbpm   38m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-vpqf6   38m     system:node:morenod-ocp-25r78-master-2                                      Approved,Issued
csr-x4t5q   6m52s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-xjv77   32m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

$ oc get node
NAME                             STATUS   ROLES    AGE     VERSION
morenod-ocp-25r78-master-0       Ready    master   40m     v1.14.0+c4dfe254f
morenod-ocp-25r78-master-1       Ready    master   40m     v1.14.0+c4dfe254f
morenod-ocp-25r78-master-2       Ready    master   40m     v1.14.0+c4dfe254f
morenod-ocp-25r78-worker-6fkdn   Ready    worker   7m49s   v1.14.0+c4dfe254f
morenod-ocp-25r78-worker-94l6q   Ready    worker   36m     v1.14.0+c4dfe254f
morenod-ocp-25r78-worker-gsxqq   Ready    worker   8m39s   v1.14.0+c4dfe254f
morenod-ocp-25r78-worker-k6wbl   Ready    worker   34m     v1.14.0+c4dfe254f
morenod-ocp-25r78-worker-kvkwk   Ready    worker   36m     v1.14.0+c4dfe254f

$ oc get machine -n openshift-machine-api -o wide
NAME                             STATE   TYPE   REGION   ZONE   AGE   NODE                             PROVIDERID
morenod-ocp-25r78-master-0                                      46m   morenod-ocp-25r78-master-0
morenod-ocp-25r78-master-1                                      46m   morenod-ocp-25r78-master-1
morenod-ocp-25r78-master-2                                      46m   morenod-ocp-25r78-master-2
morenod-ocp-25r78-worker-6fkdn                                  18m   morenod-ocp-25r78-worker-6fkdn
morenod-ocp-25r78-worker-94l6q                                  46m   morenod-ocp-25r78-worker-94l6q
morenod-ocp-25r78-worker-gsxqq                                  18m   morenod-ocp-25r78-worker-gsxqq
morenod-ocp-25r78-worker-k6wbl                                  46m   morenod-ocp-25r78-worker-k6wbl
morenod-ocp-25r78-worker-kvkwk                                  46m   morenod-ocp-25r78-worker-kvkwk

$ oc logs pod/machine-approver-557755856f-x6hqj -n openshift-cluster-machine-approver
E0816 07:09:49.798878       1 main.go:174] No target machine
E0816 07:09:56.225508       1 main.go:174] No target machine
E0816 07:11:29.147787       1 main.go:174] No target machine
E0816 07:12:40.999599       1 reflector.go:126] github.com/openshift/cluster-machine-approver/main.go:185: Failed to list *v1beta1.CertificateSigningRequest: certificatesigningrequests.certificates.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-machine-approver:machine-approver-sa" cannot list resource "certificatesigningrequests" in API group "certificates.k8s.io" at the cluster scope: RBAC: [clusterrole.rbac.authorization.k8s.io "system:basic-user" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:discovery" not found, clusterrole.rbac.authorization.k8s.io "system:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "cluster-status" not found, clusterrole.rbac.authorization.k8s.io "system:oauth-token-deleter" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:controller:machine-approver" not found, clusterrole.rbac.authorization.k8s.io "system:webhook" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-docker" not found, clusterrole.rbac.authorization.k8s.io "console-extensions-reader" not found, clusterrole.rbac.authorization.k8s.io "system:scope-impersonation" not found, clusterrole.rbac.authorization.k8s.io "basic-user" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-jenkinspipeline" not found, clusterrole.rbac.authorization.k8s.io "system:discovery" not found, clusterrole.rbac.authorization.k8s.io "self-access-reviewer" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-source" not found]
E0816 07:14:06.610629       1 reflector.go:126] github.com/openshift/cluster-machine-approver/main.go:185: Failed to list *v1beta1.CertificateSigningRequest: certificatesigningrequests.certificates.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-machine-approver:machine-approver-sa" cannot list resource "certificatesigningrequests" in API group "certificates.k8s.io" at the cluster scope

Version-Release number of the following components:

$ ./openshift-install version
./openshift-install v4.2.0-201908151419-dirty
built from commit c3d44c216d73c373fc8ef401b541b002b5c98ed2
release image registry.svc.ci.openshift.org/ocp/release@sha256:7542ca7b3c8a7d57d13ecda3f7a0d1c67124edef84a3c9e8a2209a56d60986b3

Based on the 4.2.0-0.nightly-2019-08-16-033314 payload image on RHCOS 42.80.20190815.3

How reproducible:

Steps to Reproduce:
1. Install a cluster using IPI on OSP
2. Check the signing certificates using `oc get csr`
3. Scale up the cluster and check the certs again

Actual results:
Workers created during the cluster installation do not get their CSRs approved, but workers created during a scale-up do.

Expected results:
Worker CSRs are approved during the installation process.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
I'm having trouble reproducing this right now:

$ oc get csr -A
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-56kwj   7m32s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-ct456   14m     system:node:tsedovic-t9t9w-worker-brqhj                                     Approved,Issued
csr-dl4nm   23m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-dln7f   12m     system:node:tsedovic-t9t9w-worker-jrsfp                                     Approved,Issued
csr-f8w85   7m22s   system:node:tsedovic-t9t9w-worker-l6vpz                                     Approved,Issued
csr-f996v   12m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-m2d65   23m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-n9297   23m     system:node:tsedovic-t9t9w-master-1                                         Approved,Issued
csr-pdf87   23m     system:node:tsedovic-t9t9w-master-2                                         Approved,Issued
csr-rbblz   23m     system:node:tsedovic-t9t9w-master-0                                         Approved,Issued
csr-tczl4   23m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-trdfq   14m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

$ oc get node -A
NAME                          STATUS   ROLES    AGE     VERSION
tsedovic-t9t9w-master-0       Ready    master   24m     v1.14.0+b985ea310
tsedovic-t9t9w-master-1       Ready    master   24m     v1.14.0+b985ea310
tsedovic-t9t9w-master-2       Ready    master   23m     v1.14.0+b985ea310
tsedovic-t9t9w-worker-brqhj   Ready    worker   15m     v1.14.0+b985ea310
tsedovic-t9t9w-worker-jrsfp   Ready    worker   13m     v1.14.0+b985ea310
tsedovic-t9t9w-worker-l6vpz   Ready    worker   8m17s   v1.14.0+b985ea310

I will keep trying, but in the meantime could you please try this again and see whether it's still present (and how often it happens)? The installer merged a change that increased the validity of the CSR certificates, which could have fixed this.
I did a couple more deployments and this does happen to me, but not always. One time one CSR was pending, and another time two were:

$ oc get csr -A
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-29xv9   11m   system:node:tsedovic-698r5-worker-cvnvb                                     Pending
csr-2m9jp   12m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8rvmf   23m   system:node:tsedovic-698r5-master-1                                         Approved,Issued
csr-c6vft   12m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gs66l   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-h7qln   11m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kmjln   12m   system:node:tsedovic-698r5-worker-2mcbz                                     Pending
csr-mqrkr   23m   system:node:tsedovic-698r5-master-0                                         Approved,Issued
csr-s997c   12m   system:node:tsedovic-698r5-worker-2hgh8                                     Approved,Issued
csr-t9xcs   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-tcdpm   23m   system:node:tsedovic-698r5-master-2                                         Approved,Issued
csr-wwbxf   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

So even the number of CSRs that don't get approved varies. This is on master (commit 6318c72cd078fb26f835e28117d7d02bc50d81a5) from today, so it doesn't seem to depend on the CSR approval window.

If the CSRs always get approved on scale-up, then a workaround is to run this post-deployment:

$ oc get csr -o name --all-namespaces | xargs oc adm certificate approve
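The one-liner above approves every outstanding CSR indiscriminately. As a narrower variant (my own sketch, not part of the original report), the Pending ones can be picked out of the default `oc get csr` table first. The `pending_csrs` helper name is hypothetical, and the sketch assumes the default table layout (NAME AGE REQUESTOR CONDITION):

```shell
# Print the names of CSRs whose CONDITION column is "Pending".
# Assumes the default `oc get csr` table: NAME AGE REQUESTOR CONDITION.
pending_csrs() {
    # $1: the full text output of `oc get csr`
    # NR > 1 skips the header row; $NF is the last (CONDITION) column.
    printf '%s\n' "$1" | awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Against a live cluster this would be wired up roughly as:
#   oc get csr | awk 'NR > 1 && $NF == "Pending" { print $1 }' \
#     | xargs oc adm certificate approve
```

This avoids re-approving certificates that are already Approved,Issued.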
Checking the machine-approver logs, I see the following messages for both of the CSRs stuck in Pending:

$ oc logs -n openshift-cluster-machine-approver machine-approver-7487d96f9c-k28tj
...
I0828 10:43:14.365504       1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.380314       1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.380364       1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.385583       1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.407421       1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.407467       1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.417663       1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.447584       1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.447625       1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.467867       1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.481924       1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.481956       1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.522224       1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.530083       1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.530111       1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.610334       1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.634045       1 main.go:132] CSR csr-29xv9 not authorized: No target machine
E0828 10:43:14.634206       1 main.go:174] No target machine
I0828 10:43:14.634260       1 main.go:175] Dropping CSR "csr-29xv9" out of the queue: No target machine
...

In contrast, the approved CSRs have just:

...
I0828 10:41:54.708848       1 main.go:107] CSR csr-2m9jp added
I0828 10:41:54.741139       1 main.go:147] CSR csr-2m9jp approved
...
Not sure why the approver is failing to see the machines; they are all there:

$ oc get machine -A
NAMESPACE               NAME                          STATE    TYPE           REGION    ZONE   AGE
openshift-machine-api   tsedovic-698r5-master-0       ACTIVE   m1.s2.xlarge   moc-kzn   nova   76m
openshift-machine-api   tsedovic-698r5-master-1       ACTIVE   m1.s2.xlarge   moc-kzn   nova   76m
openshift-machine-api   tsedovic-698r5-master-2       ACTIVE   m1.s2.xlarge   moc-kzn   nova   76m
openshift-machine-api   tsedovic-698r5-worker-2hgh8   ACTIVE   m1.s2.large    moc-kzn   nova   74m
openshift-machine-api   tsedovic-698r5-worker-2mcbz   ACTIVE   m1.s2.large    moc-kzn   nova   74m
openshift-machine-api   tsedovic-698r5-worker-cvnvb   ACTIVE   m1.s2.large    moc-kzn   nova   74m
Looking at the machine objects and the approver code, everything seemed fine. What probably happened is that CAPO did not set the Machine's nodeRef in time: the machine-approver retried the CSR, eventually dropped it from its queue (and stopped looking at it completely), and only after that did the nodeRef get updated. In other words, we have a race condition.

I recreated the approver pod to see if that helps, and after that everything works fine:

$ oc delete pod -n openshift-cluster-machine-approver machine-approver-7487d96f9c-k28tj
pod "machine-approver-7487d96f9c-k28tj" deleted

$ oc logs -n openshift-cluster-machine-approver machine-approver-7487d96f9c-6d6dd
I0828 12:01:33.914998       1 main.go:107] CSR csr-kmjln added
I0828 12:01:33.934326       1 main.go:147] CSR csr-kmjln approved

$ oc get csr -A
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-29xv9   78m   system:node:tsedovic-698r5-worker-cvnvb                                     Approved,Issued
csr-2m9jp   79m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8rvmf   90m   system:node:tsedovic-698r5-master-1                                         Approved,Issued
csr-c6vft   79m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gs66l   90m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-h7qln   78m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kmjln   79m   system:node:tsedovic-698r5-worker-2mcbz                                     Approved,Issued
csr-mqrkr   90m   system:node:tsedovic-698r5-master-0                                         Approved,Issued
csr-prtnd   64m   system:node:tsedovic-698r5-worker-2mcbz                                     Approved,Issued
csr-s997c   79m   system:node:tsedovic-698r5-worker-2hgh8                                     Approved,Issued
csr-t9xcs   90m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-tcdpm   90m   system:node:tsedovic-698r5-master-2                                         Approved,Issued
csr-v6bkt   63m   system:node:tsedovic-698r5-worker-cvnvb                                     Approved,Issued
csr-wwbxf   90m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

I'll see what we can do about the CSRs being dropped from the approver queue (seemingly) too soon.
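The timing involved can be sketched as a toy simulation (entirely illustrative: the attempt numbers, the `NODEREF_READY_AT` variable, and the retry budget are my assumptions, not values from the approver code):

```shell
# Toy model of the race: the approver retries a CSR a fixed number of
# times and then drops it from its queue for good. If CAPO only sets
# the Machine's nodeRef after the last retry, the CSR stays Pending
# until something (like recreating the approver pod) resubmits it.
MAX_RETRIES=5        # illustrative retry budget
NODEREF_READY_AT=7   # attempt number at which the nodeRef would exist

approve_with_drop() {
    attempts=0
    while [ "$attempts" -lt "$MAX_RETRIES" ]; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge "$NODEREF_READY_AT" ]; then
            echo approved
            return 0
        fi
    done
    echo dropped   # CSR leaves the queue; later nodeRef updates are ignored
}
```

With these numbers `approve_with_drop` prints "dropped": all five attempts happen before the nodeRef exists. Raising the budget past `NODEREF_READY_AT` (or re-queuing the CSR whenever the Machine changes) makes it print "approved", which is the general direction a fix would take.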
This should be resolved by this pull request that just merged: https://github.com/openshift/cluster-machine-approver/pull/41
Tested on 4.2.0-0.nightly-2019-09-04-090826 after the PR merged. Even after waiting some time, there is still a worker CSR pending:

# oc get csr
NAME        AGE     REQUESTOR                                                                   CONDITION
csr-4q67q   6m36s   system:node:morenod-ocp-p2nls-worker-mw29m                                  Approved,Issued
csr-92h7c   22m     system:node:morenod-ocp-p2nls-worker-4zbfr                                  Approved,Issued
csr-9s4lt   21m     system:node:morenod-ocp-p2nls-worker-mw29m                                  Pending
csr-b2t2c   22m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-c2q7z   28m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-cl6xh   23m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-fb252   27m     system:node:morenod-ocp-p2nls-master-0                                      Approved,Issued
csr-fctgb   23m     system:node:morenod-ocp-p2nls-worker-wqnqx                                  Approved,Issued
csr-mb99t   28m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mx5v8   27m     system:node:morenod-ocp-p2nls-master-1                                      Approved,Issued
csr-s9zlf   23m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-tdvcq   28m     system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-wnnjr   27m     system:node:morenod-ocp-p2nls-master-2                                      Approved,Issued
Looks like the fix didn't make it into the nightlies yet. The latest one:

https://openshift-release.svc.ci.openshift.org/releasestream/4.2.0-0.nightly/release/4.2.0-0.nightly-2019-09-04-102339

still points to this commit:

https://github.com/openshift/cluster-machine-approver/commits/a3fe0bb76acddf86163629378d9df18601cdea9f

which does not contain https://github.com/openshift/cluster-machine-approver/pull/41.

And the cluster-machine-approver log still shows it timing out after five attempts. Would you please check this again once the commit is in your release image?
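A quick way to check which commit of a component a release image carries is `oc adm release info --commits <image>`. As a small helper for filtering captured output of that command offline (the `commit_in_release` name is hypothetical, and I'm assuming the `--commits` table has NAME, REPO, and COMMIT columns, so adjust the field index if the layout differs):

```shell
# Succeed (exit 0) if the given component is listed at the given commit
# in captured `oc adm release info --commits <image>` output.
commit_in_release() {
    # $1: captured release-info output; $2: component name; $3: commit sha
    # Matches the exact sha only; checking whether the listed commit is
    # a descendant of a fix would require the git repo itself.
    printf '%s\n' "$1" |
        awk -v name="$2" -v sha="$3" '$1 == name && $3 == sha { found = 1 } END { exit !found }'
}

# Live usage would look roughly like:
#   oc adm release info --commits "$RELEASE_IMAGE" | grep cluster-machine-approver
```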
Verified on 4.2.0-0.ci-2019-09-05-084944:

# oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-5llqn   15m   system:node:morenod-ocp-54lrt-worker-9jbrm                                  Approved,Issued
csr-8lzvj   16m   system:node:morenod-ocp-54lrt-worker-p7zbd                                  Approved,Issued
csr-8stn5   22m   system:node:morenod-ocp-54lrt-master-1                                      Approved,Issued
csr-lpg9g   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-m867n   23m   system:node:morenod-ocp-54lrt-master-2                                      Approved,Issued
csr-n5bf8   17m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-ndg4p   15m   system:node:morenod-ocp-54lrt-worker-9q6rp                                  Approved,Issued
csr-psnqz   15m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-w6sqf   23m   system:node:morenod-ocp-54lrt-master-0                                      Approved,Issued
csr-wkxb4   22m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-xbwbx   15m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-xrbk5   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922