Bug 1741829
| Summary: | [IPI] [OSP] system:node workers CSRs are pending to approve when cluster installation is finished | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Sanz <dsanzmor> |
| Component: | Installer | Assignee: | Tomas Sedovic <tsedovic> |
| Installer sub component: | openshift-installer | QA Contact: | David Sanz <dsanzmor> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | eduen, ppitonak, tsedovic |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:36:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Sanz
2019-08-16 08:06:39 UTC
I'm having trouble reproducing this right now:
$ oc get csr -A
NAME AGE REQUESTOR CONDITION
csr-56kwj 7m32s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-ct456 14m system:node:tsedovic-t9t9w-worker-brqhj Approved,Issued
csr-dl4nm 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-dln7f 12m system:node:tsedovic-t9t9w-worker-jrsfp Approved,Issued
csr-f8w85 7m22s system:node:tsedovic-t9t9w-worker-l6vpz Approved,Issued
csr-f996v 12m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-m2d65 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-n9297 23m system:node:tsedovic-t9t9w-master-1 Approved,Issued
csr-pdf87 23m system:node:tsedovic-t9t9w-master-2 Approved,Issued
csr-rbblz 23m system:node:tsedovic-t9t9w-master-0 Approved,Issued
csr-tczl4 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-trdfq 14m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
$ oc get node -A
NAME STATUS ROLES AGE VERSION
tsedovic-t9t9w-master-0 Ready master 24m v1.14.0+b985ea310
tsedovic-t9t9w-master-1 Ready master 24m v1.14.0+b985ea310
tsedovic-t9t9w-master-2 Ready master 23m v1.14.0+b985ea310
tsedovic-t9t9w-worker-brqhj Ready worker 15m v1.14.0+b985ea310
tsedovic-t9t9w-worker-jrsfp Ready worker 13m v1.14.0+b985ea310
tsedovic-t9t9w-worker-l6vpz Ready worker 8m17s v1.14.0+b985ea310
I will keep trying but in the meantime, could you please try this again and see if it's still present (or how often does it happen)? The installer merged a change that increased the validity of the CSR certificates which could have fixed this.
I did a couple more deployments and this does happen to me, but not always. One time one CSR was pending and another time two were:
$ oc get csr -A
NAME AGE REQUESTOR CONDITION
csr-29xv9 11m system:node:tsedovic-698r5-worker-cvnvb Pending
csr-2m9jp 12m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-8rvmf 23m system:node:tsedovic-698r5-master-1 Approved,Issued
csr-c6vft 12m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-gs66l 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-h7qln 11m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-kmjln 12m system:node:tsedovic-698r5-worker-2mcbz Pending
csr-mqrkr 23m system:node:tsedovic-698r5-master-0 Approved,Issued
csr-s997c 12m system:node:tsedovic-698r5-worker-2hgh8 Approved,Issued
csr-t9xcs 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-tcdpm 23m system:node:tsedovic-698r5-master-2 Approved,Issued
csr-wwbxf 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
Meaning even the number of CSRs that don't get approved is variable.
This is on master (commit 6318c72cd078fb26f835e28117d7d02bc50d81a5) from today so it doesn't seem to depend on the CSR approve window.
If the CSRs always get approved on scaleup, then a workaround is to run this post-deployment:
oc get csr -o name --all-namespaces | xargs oc adm certificate approve
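A narrower variant of that workaround would approve only the CSRs still listed as Pending. The Python sketch below is not from this report: the helper name `pending_csr_names` is hypothetical, and it assumes the plain-text `oc get csr` table layout shown above (NAME, AGE, REQUESTOR, CONDITION, whitespace separated).

```python
from typing import List


def pending_csr_names(oc_get_csr_output: str) -> List[str]:
    """Return the names of CSRs whose CONDITION column is exactly 'Pending'."""
    names = []
    for line in oc_get_csr_output.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/AGE/REQUESTOR/CONDITION header row.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        # CONDITION is the last column; "Approved,Issued" rows are ignored.
        if fields[-1] == "Pending":
            names.append(fields[0])
    return names


# Three rows copied from the listing above, used as sample input.
SAMPLE = """\
NAME AGE REQUESTOR CONDITION
csr-29xv9 11m system:node:tsedovic-698r5-worker-cvnvb Pending
csr-2m9jp 12m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-kmjln 12m system:node:tsedovic-698r5-worker-2mcbz Pending
"""

if __name__ == "__main__":
    print(pending_csr_names(SAMPLE))  # ['csr-29xv9', 'csr-kmjln']
```

Each returned name could then be passed to `oc adm certificate approve`, instead of approving every CSR in the cluster.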
Checking the machine-approver logs I see the following messages for both of the CSRs stuck in Pending:
$ oc logs -n openshift-cluster-machine-approver machine-approver-7487d96f9c-k28tj
...
I0828 10:43:14.365504 1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.380314 1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.380364 1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.385583 1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.407421 1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.407467 1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.417663 1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.447584 1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.447625 1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.467867 1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.481924 1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.481956 1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.522224 1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.530083 1 main.go:132] CSR csr-29xv9 not authorized: No target machine
I0828 10:43:14.530111 1 main.go:164] Error syncing csr csr-29xv9: No target machine
I0828 10:43:14.610334 1 main.go:107] CSR csr-29xv9 added
I0828 10:43:14.634045 1 main.go:132] CSR csr-29xv9 not authorized: No target machine
E0828 10:43:14.634206 1 main.go:174] No target machine
I0828 10:43:14.634260 1 main.go:175] Dropping CSR "csr-29xv9" out of the queue: No target machine
...
In contrast, the Approved CSRs have just:
...
I0828 10:41:54.708848 1 main.go:107] CSR csr-2m9jp added
I0828 10:41:54.741139 1 main.go:147] CSR csr-2m9jp approved
...
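In the failing log above, the gaps between each "Error syncing" line and the next "added" line are roughly 5, 10, 20, 40 and 80 ms, which looks consistent with a doubling backoff and a small fixed number of retries before the CSR is dropped. The Python sketch below is a toy model of that pattern on a virtual clock, not the machine-approver code: the function names, the 5 ms base delay and the limit of 5 retries are assumptions read off the log timestamps, and the time spent inside each attempt is ignored.

```python
from typing import List, Tuple


def backoff_delays_ms(base_ms: float, retries: int) -> List[float]:
    """Delay before each retry, doubling every time: base, 2*base, 4*base, ..."""
    return [base_ms * (2 ** n) for n in range(retries)]


def sync_with_bounded_retries(
    target_ready_at_ms: float,
    base_ms: float = 5.0,   # assumption: inferred from the ~5 ms first gap
    max_retries: int = 5,   # assumption: five retries after the first attempt
) -> Tuple[str, int]:
    """Simulate one queue item; no real sleeping is done.

    The lookup succeeds only once the virtual clock has reached
    target_ready_at_ms. Returns (outcome, attempts), where outcome is
    "approved" or "dropped".
    """
    now_ms = 0.0
    attempts = 0
    # The first attempt happens immediately, then one attempt per backoff delay.
    for delay in [0.0] + backoff_delays_ms(base_ms, max_retries):
        now_ms += delay
        attempts += 1
        if now_ms >= target_ready_at_ms:
            return "approved", attempts
    # Retry budget exhausted: the item is never looked at again.
    return "dropped", attempts


if __name__ == "__main__":
    print(backoff_delays_ms(5.0, 5))          # [5.0, 10.0, 20.0, 40.0, 80.0]
    print(sync_with_bounded_retries(30.0))    # ('approved', 4)
    print(sync_with_bounded_retries(1000.0))  # ('dropped', 6)
```

Under these assumptions the whole retry budget is only 155 ms of waiting, so anything the lookup depends on that shows up even a second late would be missed for good.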
Not sure why they're failing to see the machines, they are all there:
$ oc get machine -A
NAMESPACE NAME STATE TYPE REGION ZONE AGE
openshift-machine-api tsedovic-698r5-master-0 ACTIVE m1.s2.xlarge moc-kzn nova 76m
openshift-machine-api tsedovic-698r5-master-1 ACTIVE m1.s2.xlarge moc-kzn nova 76m
openshift-machine-api tsedovic-698r5-master-2 ACTIVE m1.s2.xlarge moc-kzn nova 76m
openshift-machine-api tsedovic-698r5-worker-2hgh8 ACTIVE m1.s2.large moc-kzn nova 74m
openshift-machine-api tsedovic-698r5-worker-2mcbz ACTIVE m1.s2.large moc-kzn nova 74m
openshift-machine-api tsedovic-698r5-worker-cvnvb ACTIVE m1.s2.large moc-kzn nova 74m
Looking inside the machine objects and the approver code, everything seemed to look fine. Which means what probably happened is that CAPO did not update the nodeRef in time, eventually machine-approver just dropped the CSR from the queue (and stopped looking at it completely) and after that happened the nodeRef got updated. I.e. we've got a race condition.
I've recreated the approver pod to see if that helps and after that everything works fine:
$ oc delete pod -n openshift-cluster-machine-approver machine-approver-7487d96f9c-k28tj
pod "machine-approver-7487d96f9c-k28tj" deleted
$ oc logs -n openshift-cluster-machine-approver machine-approver-7487d96f9c-6d6dd
I0828 12:01:33.914998 1 main.go:107] CSR csr-kmjln added
I0828 12:01:33.934326 1 main.go:147] CSR csr-kmjln approved
$ oc get csr -A
NAME AGE REQUESTOR CONDITION
csr-29xv9 78m system:node:tsedovic-698r5-worker-cvnvb Approved,Issued
csr-2m9jp 79m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-8rvmf 90m system:node:tsedovic-698r5-master-1 Approved,Issued
csr-c6vft 79m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-gs66l 90m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-h7qln 78m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-kmjln 79m system:node:tsedovic-698r5-worker-2mcbz Approved,Issued
csr-mqrkr 90m system:node:tsedovic-698r5-master-0 Approved,Issued
csr-prtnd 64m system:node:tsedovic-698r5-worker-2mcbz Approved,Issued
csr-s997c 79m system:node:tsedovic-698r5-worker-2hgh8 Approved,Issued
csr-t9xcs 90m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-tcdpm 90m system:node:tsedovic-698r5-master-2 Approved,Issued
csr-v6bkt 63m system:node:tsedovic-698r5-worker-cvnvb Approved,Issued
csr-wwbxf 90m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
I'll see what we can do about the CSRs being dropped from the approver queue (seemingly) too soon.
This should be resolved by this pull request that just merged: https://github.com/openshift/cluster-machine-approver/pull/41
Tested on 4.2.0-0.nightly-2019-09-04-090826 after PR is merged. After some time waiting, there is still a worker CSR to be signed:
# oc get csr
NAME AGE REQUESTOR CONDITION
csr-4q67q 6m36s system:node:morenod-ocp-p2nls-worker-mw29m Approved,Issued
csr-92h7c 22m system:node:morenod-ocp-p2nls-worker-4zbfr Approved,Issued
csr-9s4lt 21m system:node:morenod-ocp-p2nls-worker-mw29m Pending
csr-b2t2c 22m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-c2q7z 28m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-cl6xh 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-fb252 27m system:node:morenod-ocp-p2nls-master-0 Approved,Issued
csr-fctgb 23m system:node:morenod-ocp-p2nls-worker-wqnqx Approved,Issued
csr-mb99t 28m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-mx5v8 27m system:node:morenod-ocp-p2nls-master-1 Approved,Issued
csr-s9zlf 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-tdvcq 28m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-wnnjr 27m system:node:morenod-ocp-p2nls-master-2 Approved,Issued
Looks like the fix didn't make it to the nightlies yet. The latest one: https://openshift-release.svc.ci.openshift.org/releasestream/4.2.0-0.nightly/release/4.2.0-0.nightly-2019-09-04-102339
Still points to this commit: https://github.com/openshift/cluster-machine-approver/commits/a3fe0bb76acddf86163629378d9df18601cdea9f
Which doesn't contain https://github.com/openshift/cluster-machine-approver/pull/41
And the cluster-machine-approver log still times out after five attempts. Would you please check this again when the commit is in your release image?
Verified on 4.2.0-0.ci-2019-09-05-084944
# oc get csr
NAME AGE REQUESTOR CONDITION
csr-5llqn 15m system:node:morenod-ocp-54lrt-worker-9jbrm Approved,Issued
csr-8lzvj 16m system:node:morenod-ocp-54lrt-worker-p7zbd Approved,Issued
csr-8stn5 22m system:node:morenod-ocp-54lrt-master-1 Approved,Issued
csr-lpg9g 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-m867n 23m system:node:morenod-ocp-54lrt-master-2 Approved,Issued
csr-n5bf8 17m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-ndg4p 15m system:node:morenod-ocp-54lrt-worker-9q6rp Approved,Issued
csr-psnqz 15m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-w6sqf 23m system:node:morenod-ocp-54lrt-master-0 Approved,Issued
csr-wkxb4 22m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-xbwbx 15m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-xrbk5 23m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2922