Bug 1849387

Summary: [rhv] Upgrading to 4.5 or freshly installing 4.5 leaves worker machines stuck in state 'Provisioned' although nodes are 'Ready'
Product: OpenShift Container Platform
Reporter: daniel <dmoessne>
Component: Installer
Assignee: Roy Golan <rgolan>
Installer sub component: OpenShift on RHV
QA Contact: Lucie Leistnerova <lleistne>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: medium
Priority: unspecified
CC: dougsland
Version: 4.5
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-14 11:16:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description daniel 2020-06-21 08:53:18 UTC
Description of problem:
After a fresh install of 4.5.0-rc1 or 4.5.0-rc2, and also after upgrading from 4.4.6 to either release candidate,
worker machines stay stuck in state 'Provisioned' even though their nodes are 'Ready'.

Version-Release number of the following components:
- 4.5.0-rc1
- 4.5.0-rc2


How reproducible:
Always (6 out of 6 tries)

Steps to Reproduce:
1. Either install one of the 4.5.0-rc[12] versions directly, or upgrade to it from 4.4.6 (either via -rc1 first and then on to 4.5.0-rc2, or directly to 4.5.0-rc2).
2. As soon as the cluster is upgraded, or up and running on a 4.5.0-rc[12] version, check the node status and the machine phase of the workers.
3. The worker machines are still 'Provisioned'.


Actual results:
# oc get machines -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
cluster-fzgh5-master-0         Running                              104m
cluster-fzgh5-master-1         Running                              104m
cluster-fzgh5-master-2         Running                              104m
cluster-fzgh5-worker-0-4nc2g   Provisioned                          93m
cluster-fzgh5-worker-0-t8pxx   Provisioned                          93m
cluster-fzgh5-worker-0-wltcs   Provisioned                          93m
# oc get nodes 
NAME                           STATUS   ROLES    AGE    VERSION
cluster-fzgh5-master-0         Ready    master   102m   v1.18.3+a637491
cluster-fzgh5-master-1         Ready    master   102m   v1.18.3+a637491
cluster-fzgh5-master-2         Ready    master   103m   v1.18.3+a637491
cluster-fzgh5-worker-0-4nc2g   Ready    worker   85m    v1.18.3+a637491
cluster-fzgh5-worker-0-t8pxx   Ready    worker   87m    v1.18.3+a637491
cluster-fzgh5-worker-0-wltcs   Ready    worker   89m    v1.18.3+a637491
# 

Expected results:
worker machines also show state 'Running'
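
The mismatch between the two listings above can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the plain-text output of `oc get machines` and `oc get nodes` has been captured as strings, and that a machine and its node share the same name (as they do in the output above). The function names are hypothetical.

```python
def parse_second_column(table_text):
    """Map the NAME column to the second column of whitespace-separated `oc get` output."""
    rows = {}
    for line in table_text.strip().splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) >= 2:
            rows[fields[0]] = fields[1]
    return rows


def stuck_machines(machines_text, nodes_text):
    """Return machines that are not 'Running' although a same-named node is 'Ready'."""
    phases = parse_second_column(machines_text)   # NAME -> PHASE
    statuses = parse_second_column(nodes_text)    # NAME -> STATUS
    return sorted(
        name for name, phase in phases.items()
        if phase != "Running" and statuses.get(name) == "Ready"
    )


if __name__ == "__main__":
    # Abbreviated sample rows copied from the output above.
    machines = """\
NAME                           PHASE         TYPE   REGION   ZONE   AGE
cluster-fzgh5-master-0         Running                              104m
cluster-fzgh5-worker-0-4nc2g   Provisioned                          93m
"""
    nodes = """\
NAME                           STATUS   ROLES    AGE    VERSION
cluster-fzgh5-master-0         Ready    master   102m   v1.18.3+a637491
cluster-fzgh5-worker-0-4nc2g   Ready    worker   85m    v1.18.3+a637491
"""
    print(stuck_machines(machines, nodes))
```

The sketch only compares exact strings, so a node reported as e.g. 'Ready,SchedulingDisabled' would not be counted as 'Ready'.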

Additional info:
- RHV version is 4.3.9.4-11.el7
- The RHV cluster easily runs 10 worker nodes with 4.4.x, so resources do not seem to be the limit; as stated, the workers are 'Ready' and running pods
- When upgrading from 4.4.6 to 4.5.0-rc[12], the machines are 'Running' before the upgrade

- I would like to attach a current must-gather, but because of https://bugzilla.redhat.com/show_bug.cgi?id=1848977 I am unable to do so

- Attached file issue-20200620-1.tar.gz, which contains the output of the following commands:
  - oc get machinesets -n openshift-machine-api -o yaml >> machinesets.yaml
  - oc get machines -n openshift-machine-api -o yaml >> machines.yaml
  - oc get nodes -o yaml >> nodes.yaml
  - oc get all -A -o yaml >> all.yaml

- For me this is 100% reproducible (6 out of 6 tries)

Comment 2 Douglas Schilling Landgraf 2020-07-09 12:21:44 UTC
Due to capacity constraints, we will revisit this bug in the upcoming sprint.

Comment 3 daniel 2020-07-13 14:54:01 UTC
This could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1854787

I have now run tests with every version from 4.4.4 to 4.4.12; 4.4.10 is the first version in which machines get stuck in 'Provisioned'.
I also found CSRs that had not been approved; once I approved them, I was at least able to run a must-gather.
With 4.5.1 I see exactly the same issue:
# oc version 
Client Version: 4.5.1
Server Version: 4.5.1
Kubernetes Version: v1.18.3+8b0a82f
#
# oc get machines -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
cluster-gz9fc-master-0         Running                              30m
cluster-gz9fc-master-1         Running                              30m
cluster-gz9fc-master-2         Running                              30m
cluster-gz9fc-worker-0-5bmhf   Provisioned                          19m
cluster-gz9fc-worker-0-766x2   Provisioned                          19m
cluster-gz9fc-worker-0-h45mp   Provisioned                          19m
cluster-gz9fc-worker-0-tw8d4   Provisioned                          19m
#
# oc get csr 
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-2w5n6   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-4wkqj   10m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-5r9jr   29m   kubernetes.io/kubelet-serving                 system:node:cluster-gz9fc-master-2                                          Approved,Issued
csr-b7j5q   13m   kubernetes.io/kubelet-serving                 system:node:cluster-gz9fc-worker-0-tw8d4                                    Pending
csr-fxk6x   29m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-gqgkf   29m   kubernetes.io/kubelet-serving                 system:node:cluster-gz9fc-master-0                                          Approved,Issued
csr-gtdsp   29s   kubernetes.io/kubelet-serving                 system:node:cluster-gz9fc-worker-0-766x2                                    Pending
csr-kb4m4   11m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kjzbr   28m   kubernetes.io/kubelet-serving                 system:node:cluster-gz9fc-master-1                                          Approved,Issued
csr-l4q6m   16m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-m6rrn   28m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mk458   29m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-nfgc2   11m   kubernetes.io/kubelet-serving                 system:node:cluster-gz9fc-worker-0-h45mp                                    Pending
csr-twz2l   15m   kubernetes.io/kubelet-serving                 system:node:cluster-gz9fc-worker-0-766x2                                    Pending
csr-zslvg   10m   kubernetes.io/kubelet-serving                 system:node:cluster-gz9fc-worker-0-5bmhf                                    Pending
#
# oc get csr |awk '/Pending/ {print $1}'|xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-b7j5q approved
certificatesigningrequest.certificates.k8s.io/csr-gtdsp approved
certificatesigningrequest.certificates.k8s.io/csr-nfgc2 approved
certificatesigningrequest.certificates.k8s.io/csr-twz2l approved
certificatesigningrequest.certificates.k8s.io/csr-zslvg approved
#
# oc get nodes -o wide
NAME                           STATUS   ROLES    AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
cluster-gz9fc-master-0         Ready    master   29m   v1.18.3+6025c28   10.32.111.98    <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
cluster-gz9fc-master-1         Ready    master   29m   v1.18.3+6025c28   10.32.111.95    <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
cluster-gz9fc-master-2         Ready    master   29m   v1.18.3+6025c28   10.32.111.99    <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
cluster-gz9fc-worker-0-5bmhf   Ready    worker   10m   v1.18.3+6025c28   10.32.111.106   <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
cluster-gz9fc-worker-0-766x2   Ready    worker   16m   v1.18.3+6025c28   10.32.111.101   <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
cluster-gz9fc-worker-0-h45mp   Ready    worker   12m   v1.18.3+6025c28   10.32.111.105   <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
cluster-gz9fc-worker-0-tw8d4   Ready    worker   14m   v1.18.3+6025c28   10.32.111.102   <none>        Red Hat Enterprise Linux CoreOS 45.82.202007062333-0 (Ootpa)   4.18.0-193.12.1.el8_2.x86_64   cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
#
# oc get machines -n openshift-machine-api 
NAME                           PHASE         TYPE   REGION   ZONE   AGE
cluster-gz9fc-master-0         Running                              36m
cluster-gz9fc-master-1         Running                              36m
cluster-gz9fc-master-2         Running                              36m
cluster-gz9fc-worker-0-5bmhf   Provisioned                          25m
cluster-gz9fc-worker-0-766x2   Provisioned                          25m
cluster-gz9fc-worker-0-h45mp   Provisioned                          25m
cluster-gz9fc-worker-0-tw8d4   Provisioned                          25m
#
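
The `awk '/Pending/'` filter in the transcript above matches the word anywhere on a line. As a minimal sketch, not part of the original report, the same selection can be done on the CONDITION column only and grouped by requesting node. It assumes the five-column `oc get csr` layout shown above (NAME, AGE, SIGNERNAME, REQUESTOR, CONDITION); the function names are hypothetical.

```python
from collections import defaultdict

NODE_PREFIX = "system:node:"


def pending_csrs(csr_text):
    """Return (csr_name, requestor) pairs whose CONDITION column is exactly 'Pending'."""
    pending = []
    for line in csr_text.strip().splitlines()[1:]:  # first line is the header
        fields = line.split()
        # Expected columns: NAME AGE SIGNERNAME REQUESTOR CONDITION
        if len(fields) == 5 and fields[4] == "Pending":
            pending.append((fields[0], fields[3]))
    return pending


def pending_by_node(csr_text):
    """Group pending CSR names by the node that requested them."""
    grouped = defaultdict(list)
    for name, requestor in pending_csrs(csr_text):
        if requestor.startswith(NODE_PREFIX):
            grouped[requestor[len(NODE_PREFIX):]].append(name)
    return dict(grouped)


if __name__ == "__main__":
    # Abbreviated sample rows copied from the `oc get csr` output above.
    sample = """\
NAME        AGE   SIGNERNAME                      REQUESTOR                                  CONDITION
csr-5r9jr   29m   kubernetes.io/kubelet-serving   system:node:cluster-gz9fc-master-2         Approved,Issued
csr-gtdsp   29s   kubernetes.io/kubelet-serving   system:node:cluster-gz9fc-worker-0-766x2   Pending
csr-twz2l   15m   kubernetes.io/kubelet-serving   system:node:cluster-gz9fc-worker-0-766x2   Pending
"""
    for node, names in pending_by_node(sample).items():
        print(node, names)
```

Grouping by node makes it visible that a single worker can accumulate several pending kubelet-serving CSRs, as cluster-gz9fc-worker-0-766x2 does in the listing above.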


To sum it up:
- The issue starts with 4.4.10 (always tested with a fresh install)
- It also affects 4.5.1
- There are pending CSRs; they need to be approved, otherwise must-gather does not work (possibly other things as well)
- Even after approving the CSRs, the worker machines stay in 'Provisioned'

- This could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1854787
- I raised the severity, as this directly impacts new 4.4 installs as well as 4.5 installs

Comment 4 Roy Golan 2020-07-14 11:16:31 UTC

*** This bug has been marked as a duplicate of bug 1854787 ***