Created attachment 1673136 [details]
node.log

Description of problem:

When deploying an OCP-4.4 cluster using UPI on RHOS, not all workers join the cluster. Sometimes only 1/3 join, other times 2/3, and in some cases zero workers join. I am not sure if it is some race condition. Note that this deployment has worked for us for a long time. Recently I tried a few of the latest nightly builds and cannot get a successful deployment.

Version-Release number of the following components:

My last attempt was with 4.4.0-rc.4; I also tried 4.4.0-0.nightly-2020-03-06-013414 and 4.4.0-0.nightly-2020-03-03-175205.

Please see the attached log of the node which did not come up (2/3 workers joined, but worker-1 did not). One of my colleagues saw the same problem with 4.3.5, but the issue is rarer there.

Here is an example of how it looks when none of the workers join:

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get nodes
NAME               STATUS   ROLES    AGE   VERSION
host-172-16-0-21   Ready    master   18m   v1.17.1
host-172-16-0-35   Ready    master   18m   v1.17.1
host-172-16-0-47   Ready    master   16m   v1.17.1
>>> HERE SHOULD BE THREE MORE WORKERS <<<

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          49m     Unable to apply 4.4.0-rc.4: some cluster operators have not yet rolled out

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get clusteroperators
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                          Unknown     Unknown       True       16m
cloud-credential                           4.4.0-rc.4   True        False         False      48m
cluster-autoscaler                         4.4.0-rc.4   True        False         False      10m
console                                    4.4.0-rc.4   Unknown     True          False      10m
dns                                        4.4.0-rc.4   True        False         False      14m
etcd                                       4.4.0-rc.4   True        False         False      9m59s
image-registry                             4.4.0-rc.4   False       True          False      8m11s
ingress                                    unknown      False       True          True       10m
insights                                   4.4.0-rc.4   True        False         False      10m
kube-apiserver                             4.4.0-rc.4   True        True          False      13m
kube-controller-manager                    4.4.0-rc.4   True        False         False      13m
kube-scheduler                             4.4.0-rc.4   True        False         False      14m
kube-storage-version-migrator              4.4.0-rc.4   False       False         False      16m
machine-api                                4.4.0-rc.4   True        False         False      15m
machine-config                             4.4.0-rc.4   True        False         False      13m
marketplace                                4.4.0-rc.4   True        False         False      10m
monitoring                                              False       True          True       7m4s
network                                    4.4.0-rc.4   True        False         False      14m
node-tuning                                4.4.0-rc.4   True        False         False      16m
openshift-apiserver                        4.4.0-rc.4   True        False         False      7m2s
openshift-controller-manager               4.4.0-rc.4   True        False         False      9m25s
openshift-samples                          4.4.0-rc.4   True        False         False      9m42s
operator-lifecycle-manager                 4.4.0-rc.4   True        False         False      15m
operator-lifecycle-manager-catalog         4.4.0-rc.4   True        False         False      15m
operator-lifecycle-manager-packageserver   4.4.0-rc.4   True        False         False      10m
service-ca                                 4.4.0-rc.4   True        False         False      16m
service-catalog-apiserver                  4.4.0-rc.4   True        False         False      16m
service-catalog-controller-manager         4.4.0-rc.4   True        False         False      16m
storage                                    4.4.0-rc.4   True        False         False      12m
I tried again with ocp-4.4.0-rc.4 and now only two workers joined :-/ ...

[cnv-qe-jenkins@cnv-executor-lbednar ~]$ oc get nodes
NAME               STATUS   ROLES    AGE   VERSION
host-172-16-0-16   Ready    master   21m   v1.17.1
host-172-16-0-19   Ready    master   19m   v1.17.1
host-172-16-0-27   Ready    master   19m   v1.17.1
host-172-16-0-41   Ready    worker   12m   v1.17.1
host-172-16-0-42   Ready    worker   12m   v1.17.1
It happened to me too with OCP-4.3.5. The cluster has 2 workers instead of 3.

[cnv-qe-jenkins@cnv-executor-ginger2 ~]$ oc get nodes
NAME               STATUS   ROLES    AGE   VERSION
host-172-16-0-13   Ready    master   26h   v1.16.2
host-172-16-0-20   Ready    master   26h   v1.16.2
host-172-16-0-25   Ready    master   26h   v1.16.2
host-172-16-0-40   Ready    worker   26h   v1.16.2
host-172-16-0-53   Ready    worker   26h   v1.16.2
Do you have pending CSR approvals? `oc get csr`

Please provide `oc adm must-gather` and/or details of your CSR approval process.
(In reply to Scott Dodson from comment #3)
> Do you have pending CSR approvals? `oc get csr`
> Please provide `oc adm must-gather` and/or details of your CSR approval
> process.

You are right, there were pending CSRs, but this wasn't happening before. We used to approve CSRs right after the installation was completed.

So now, as a workaround, we run this loop in the background while openshift-installer is running:

```
worker_num=3
while true ; do
    sleep 120
    for csr in $(oc get csr -ojson | jq -r '.items[] | select(.status == {}) | .metadata.name') ; do
        echo "Approving certificate ${csr}"
        oc adm certificate approve "${csr}" || true
    done
    worker_joined=$(oc get node -l node-role.kubernetes.io/worker --no-headers | wc -l)
    if [ "${worker_num}" -eq "${worker_joined}" ] ; then
        echo "All workers have joined the cluster"
        break
    fi
done
```
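For anyone reading later: the jq filter in the loop above works because a CSR that has not yet been approved or denied comes back from `oc get csr -ojson` with an empty `.status` object. A minimal offline illustration of that selection, using made-up CSR names and a hand-written sample file (not real cluster output):

```shell
# Hypothetical sample of what `oc get csr -ojson` might return: one
# pending CSR (empty status) and one already-approved CSR.
cat <<'EOF' > csr-sample.json
{
  "items": [
    { "metadata": { "name": "csr-pending"  }, "status": {} },
    { "metadata": { "name": "csr-approved" },
      "status": { "conditions": [ { "type": "Approved" } ] } }
  ]
}
EOF

# Same filter as in the workaround loop: keep only CSRs whose
# .status is still an empty object, i.e. the pending ones.
jq -r '.items[] | select(.status == {}) | .metadata.name' csr-sample.json
# prints: csr-pending
```

Against a live cluster you would feed `oc get csr -ojson` into the same filter, as the loop does.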
*** This bug has been marked as a duplicate of bug 1818961 ***