Created attachment 1673136 [details]
node.log

Description of problem:

When deploying an OCP-4.4 cluster using UPI on RHOS, not all workers join the cluster. Sometimes only 1/3 join, other times 2/3, and in some cases zero workers join. I am not sure if it is some race condition. Note that this deployment has worked for us for a long time. Recently I tried a few of the latest nightly builds and cannot get a successful deployment.

Version-Release number of the following components:

My last attempt was with 4.4.0-rc.4; I also tried 4.4.0-0.nightly-2020-03-06-013414 and 4.4.0-0.nightly-2020-03-03-175205.

Please see the attached log of the node which did not come up (2/3 workers joined, but worker-1 did not). One of my colleagues saw the same problem with 4.3.5, but the issue is rarer there.

Here is an example of how it looks when none of the workers join:

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get nodes
NAME               STATUS   ROLES    AGE   VERSION
host-172-16-0-21   Ready    master   18m   v1.17.1
host-172-16-0-35   Ready    master   18m   v1.17.1
host-172-16-0-47   Ready    master   16m   v1.17.1
>>> HERE SHOULD BE THREE MORE WORKERS <<<

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          49m     Unable to apply 4.4.0-rc.4: some cluster operators have not yet rolled out

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get clusteroperators
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                          Unknown     Unknown       True       16m
cloud-credential                           4.4.0-rc.4   True        False         False      48m
cluster-autoscaler                         4.4.0-rc.4   True        False         False      10m
console                                    4.4.0-rc.4   Unknown     True          False      10m
dns                                        4.4.0-rc.4   True        False         False      14m
etcd                                       4.4.0-rc.4   True        False         False      9m59s
image-registry                             4.4.0-rc.4   False       True          False      8m11s
ingress                                    unknown      False       True          True       10m
insights                                   4.4.0-rc.4   True        False         False      10m
kube-apiserver                             4.4.0-rc.4   True        True          False      13m
kube-controller-manager                    4.4.0-rc.4   True        False         False      13m
kube-scheduler                             4.4.0-rc.4   True        False         False      14m
kube-storage-version-migrator              4.4.0-rc.4   False       False         False      16m
machine-api                                4.4.0-rc.4   True        False         False      15m
machine-config                             4.4.0-rc.4   True        False         False      13m
marketplace                                4.4.0-rc.4   True        False         False      10m
monitoring                                              False       True          True       7m4s
network                                    4.4.0-rc.4   True        False         False      14m
node-tuning                                4.4.0-rc.4   True        False         False      16m
openshift-apiserver                        4.4.0-rc.4   True        False         False      7m2s
openshift-controller-manager               4.4.0-rc.4   True        False         False      9m25s
openshift-samples                          4.4.0-rc.4   True        False         False      9m42s
operator-lifecycle-manager                 4.4.0-rc.4   True        False         False      15m
operator-lifecycle-manager-catalog         4.4.0-rc.4   True        False         False      15m
operator-lifecycle-manager-packageserver   4.4.0-rc.4   True        False         False      10m
service-ca                                 4.4.0-rc.4   True        False         False      16m
service-catalog-apiserver                  4.4.0-rc.4   True        False         False      16m
service-catalog-controller-manager         4.4.0-rc.4   True        False         False      16m
storage                                    4.4.0-rc.4   True        False         False      12m
I tried again with ocp-4.4.0-rc.4 and now only two workers joined :-/ ...

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get nodes
NAME               STATUS   ROLES    AGE   VERSION
host-172-16-0-16   Ready    master   21m   v1.17.1
host-172-16-0-19   Ready    master   19m   v1.17.1
host-172-16-0-27   Ready    master   19m   v1.17.1
host-172-16-0-41   Ready    worker   12m   v1.17.1
host-172-16-0-42   Ready    worker   12m   v1.17.1
It happened to me too with OCP-4.3.5. The cluster has 2 workers instead of 3.

[cnv-qe-jenkins@cnv-executor-ginger2 ~]$ oc get nodes
NAME               STATUS   ROLES    AGE   VERSION
host-172-16-0-13   Ready    master   26h   v1.16.2
host-172-16-0-20   Ready    master   26h   v1.16.2
host-172-16-0-25   Ready    master   26h   v1.16.2
host-172-16-0-40   Ready    worker   26h   v1.16.2
host-172-16-0-53   Ready    worker   26h   v1.16.2
Do you have pending CSR approvals? `oc get csr`

Please provide `oc adm must-gather` and/or details of your CSR approval process.
(In reply to Scott Dodson from comment #3)
> Do you have pending CSR approvals? `oc get csr`
> Please provide `oc adm must-gather` and/or details of your CSR approval
> process.

You are right, there were pending CSRs, but this wasn't happening before. We used to approve CSRs right after the installation was completed.

So now, as a workaround, we run this loop in the background while openshift-installer is running:

```
worker_num=3
while true ; do
    sleep 120
    for csr in $(oc get csr -ojson | jq -r '.items[] | select(.status == {}) | .metadata.name') ; do
        echo "Approving certificate ${csr}"
        oc adm certificate approve "${csr}" || true
    done
    worker_joined=$(oc get node -l node-role.kubernetes.io/worker --no-headers | wc -l)
    if [ "${worker_num}" -eq "${worker_joined}" ] ; then
        echo "All workers have joined the cluster"
        break
    fi
done
```
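For anyone reading later: the jq filter in the loop above works because a CSR that has not yet been approved or denied comes back from `oc get csr -ojson` with an empty `.status` object. A minimal offline illustration of that selection, using made-up CSR names and a hand-written sample file (not real cluster output):

```shell
# Hypothetical sample of what `oc get csr -ojson` might return: one
# pending CSR (empty status) and one already-approved CSR.
cat <<'EOF' > csr-sample.json
{
  "items": [
    { "metadata": { "name": "csr-pending"  }, "status": {} },
    { "metadata": { "name": "csr-approved" },
      "status": { "conditions": [ { "type": "Approved" } ] } }
  ]
}
EOF

# Same filter as in the workaround loop: keep only CSRs whose
# .status is still an empty object, i.e. the pending ones.
jq -r '.items[] | select(.status == {}) | .metadata.name' csr-sample.json
# prints: csr-pending
```

Against a live cluster you would feed `oc get csr -ojson` into the same filter, as the loop does.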
*** This bug has been marked as a duplicate of bug 1818961 ***