Bug 1824426

Summary: [osp] Machines stuck in Provisioned status and have no nodeRef after installation
Product: OpenShift Container Platform Reporter: sunzhaohua <zhsun>
Component: Cloud Compute Assignee: Mike Fedosin <mfedosin>
Cloud Compute sub component: OpenStack Provider QA Contact: David Sanz <dsanzmor>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: adduarte, agarcial, egarcia, m.andre, mfedosin, nsatsia, pprinett
Version: 4.5   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:27:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1839012    

Description sunzhaohua 2020-04-16 08:33:36 UTC
Description of problem:
Workers were added to an additional network by setting additionalNetworkIDs in install-config.yaml. After installation, machines are stuck in Provisioned status and have no nodeRef.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-15-223247

How reproducible:
Always

Steps to Reproduce:
1. Set up an environment with additional networks for the machines
2. Check the machines and nodes
3. Check the logs

Actual results:
Machines are stuck in Provisioned status and have no nodeRef after installation
$ oc get node
NAME                             STATUS   ROLES    AGE     VERSION
zhsun416osp-lhbg5-master-0       Ready    master   4h27m   v1.18.0-rc.1
zhsun416osp-lhbg5-master-1       Ready    master   4h27m   v1.18.0-rc.1
zhsun416osp-lhbg5-master-2       Ready    master   4h27m   v1.18.0-rc.1
zhsun416osp-lhbg5-worker-8242k   Ready    worker   4h12m   v1.18.0-rc.1
zhsun416osp-lhbg5-worker-9hc8x   Ready    worker   4h15m   v1.18.0-rc.1
zhsun416osp-lhbg5-worker-kb7ws   Ready    worker   4h12m   v1.18.0-rc.1
$ oc get machineset
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun416osp-lhbg5-worker   3         3                             4h28m
$ oc get machine
NAME                             PHASE         TYPE        REGION      ZONE   AGE
zhsun416osp-lhbg5-master-0       Provisioned   m1.xlarge   regionOne   nova   4h28m
zhsun416osp-lhbg5-master-1       Provisioned   m1.xlarge   regionOne   nova   4h28m
zhsun416osp-lhbg5-master-2       Running       m1.xlarge   regionOne   nova   4h28m
zhsun416osp-lhbg5-worker-8242k   Provisioned   m1.xlarge   regionOne   nova   4h19m
zhsun416osp-lhbg5-worker-9hc8x   Provisioned   m1.xlarge   regionOne   nova   4h19m
zhsun416osp-lhbg5-worker-kb7ws   Provisioned   m1.xlarge   regionOne   nova   4h19m
$ oc get machine -o yaml | grep "noAllowedAddressPairs: true"
          noAllowedAddressPairs: true
          noAllowedAddressPairs: true
          noAllowedAddressPairs: true
          noAllowedAddressPairs: true
          noAllowedAddressPairs: true
          noAllowedAddressPairs: true

$ oc get machine zhsun416osp-lhbg5-worker-kb7ws -o yaml
status:
  addresses:
  - address: 172.16.34.33
    type: InternalIP
  - address: zhsun416osp-lhbg5-worker-kb7ws
    type: Hostname
  - address: zhsun416osp-lhbg5-worker-kb7ws
    type: InternalDNS
  lastUpdated: "2020-04-16T03:00:12Z"
  phase: Provisioned

I0416 03:04:19.174776       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0416 03:04:19.793463       1 controller.go:284] Reconciling machine "zhsun416osp-lhbg5-worker-9hc8x" triggers idempotent update
I0416 03:04:19.793868       1 controller.go:164] Reconciling Machine "zhsun416osp-lhbg5-worker-kb7ws"
I0416 03:04:19.793905       1 controller.go:376] Machine "zhsun416osp-lhbg5-worker-kb7ws" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0416 03:04:19.802110       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0416 03:04:20.329441       1 controller.go:284] Reconciling machine "zhsun416osp-lhbg5-worker-kb7ws" triggers idempotent update
I0416 03:04:20.329700       1 controller.go:164] Reconciling Machine "zhsun416osp-lhbg5-master-0"
I0416 03:04:20.329708       1 controller.go:376] Machine "zhsun416osp-lhbg5-master-0" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0416 03:04:20.339747       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0416 03:04:20.843770       1 controller.go:284] Reconciling machine "zhsun416osp-lhbg5-master-0" triggers idempotent update
I0416 03:04:20.843977       1 controller.go:164] Reconciling Machine "zhsun416osp-lhbg5-master-1"
I0416 03:04:20.843990       1 controller.go:376] Machine "zhsun416osp-lhbg5-master-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0416 03:04:20.850918       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0416 03:04:21.688438       1 controller.go:284] Reconciling machine "zhsun416osp-lhbg5-master-1" triggers idempotent update
I0416 03:04:21.688711       1 controller.go:164] Reconciling Machine "zhsun416osp-lhbg5-master-2"
I0416 03:04:21.688725       1 controller.go:376] Machine "zhsun416osp-lhbg5-master-2" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0416 03:04:21.700064       1 machineservice.go:230] Cloud provider CA cert not provided, using system trust bundle
I0416 03:04:22.412165       1 controller.go:284] Reconciling machine "zhsun416osp-lhbg5-master-2" triggers idempotent update

Expected results:
Machine status should be Running, and every machine should have a nodeRef
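
The expected condition can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the machine list has the JSON shape of `oc get machine -o json` (a list object with `items`, each with `metadata.name`, `status.phase` and an optional `status.nodeRef`). The function name `stuck_machines` and the sample data are hypothetical; the sample mirrors two machines from the output above, and the nodeRef shown for master-2 is an assumption made for illustration only.

```python
def stuck_machines(machine_list):
    """Return (name, phase) for every machine that is not Running or has no nodeRef."""
    stuck = []
    for item in machine_list.get("items", []):
        status = item.get("status", {})
        phase = status.get("phase")
        # A healthy machine is in phase Running and is linked to a node via nodeRef.
        if phase != "Running" or "nodeRef" not in status:
            stuck.append((item["metadata"]["name"], phase))
    return stuck


# Hypothetical sample shaped like `oc get machine -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "zhsun416osp-lhbg5-worker-kb7ws"},
            "status": {"phase": "Provisioned"},
        },
        {
            "metadata": {"name": "zhsun416osp-lhbg5-master-2"},
            # Assumed for illustration: a Running machine carrying a nodeRef.
            "status": {
                "phase": "Running",
                "nodeRef": {"kind": "Node", "name": "zhsun416osp-lhbg5-master-2"},
            },
        },
    ]
}

for name, phase in stuck_machines(sample):
    print(name, phase)
```

Against a live cluster, the input could be taken from `oc get machine -n openshift-machine-api -o json` and parsed with `json.loads` before being passed to the function.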

Additional info:

Comment 1 Mike Fedosin 2020-04-20 20:05:35 UTC
Important: https://github.com/openshift/installer/pull/3483 is only part of the fix. The second part will be in CAPO.

Comment 4 David Sanz 2020-05-12 11:25:10 UTC
Verified on 4.5.0-0.nightly-2020-05-12-083345

[morenod@morenod-laptop ~]$ oc get nodes
NAME                            STATUS   ROLES    AGE     VERSION
mrnd-6nics-klfsc-master-0       Ready    master   14m     v1.18.2
mrnd-6nics-klfsc-master-1       Ready    master   15m     v1.18.2
mrnd-6nics-klfsc-master-2       Ready    master   14m     v1.18.2
mrnd-6nics-klfsc-worker-6846c   Ready    worker   60s     v1.18.2
mrnd-6nics-klfsc-worker-hzpp2   Ready    worker   3m16s   v1.18.2
[morenod@morenod-laptop ~]$ oc get machines
NAME                            PHASE     TYPE           REGION      ZONE   AGE
mrnd-6nics-klfsc-master-0       Running   ci.m1.xlarge   regionOne   nova   16m
mrnd-6nics-klfsc-master-1       Running   ci.m1.xlarge   regionOne   nova   16m
mrnd-6nics-klfsc-master-2       Running   ci.m1.xlarge   regionOne   nova   16m
mrnd-6nics-klfsc-worker-6846c   Running   ci.m1.xlarge   regionOne   nova   10m
mrnd-6nics-klfsc-worker-hzpp2   Running   ci.m1.xlarge   regionOne   nova   10m

Comment 5 Mike Fedosin 2020-05-14 14:26:00 UTC
*** Bug 1824425 has been marked as a duplicate of this bug. ***

Comment 6 errata-xmlrpc 2020-07-13 17:27:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409