Bug 1822345 - Vsphere: machine stuck in provisioning phase (failed to create machine: vm 'walt-45latest1-5vwzr-rhcos' not found) but nodes are created.
Keywords:
Status: CLOSED DUPLICATE of bug 1834966
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Patrick Dillon
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-08 19:02 UTC by krapohl
Modified: 2020-05-18 16:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-18 16:37:26 UTC
Target Upstream Version:
Embargoed:
agarcial: needinfo+


Attachments
oc get co (2.97 KB, text/plain), 2020-04-08 19:02 UTC, krapohl
oc get po (33.32 KB, text/plain), 2020-04-08 19:03 UTC, krapohl
part 2aa of must-gather (15.00 MB, application/gzip), 2020-04-08 19:09 UTC, krapohl
must-gather 2ab (15.00 MB, application/octet-stream), 2020-04-08 19:11 UTC, krapohl
must-gather 2ac (15.00 MB, application/octet-stream), 2020-04-08 19:11 UTC, krapohl
must-gather 2ad (5.86 MB, application/octet-stream), 2020-04-08 19:14 UTC, krapohl
Console output of issue (87.65 KB, image/png), 2020-04-10 18:43 UTC, krapohl
Version info (48.15 KB, image/png), 2020-04-10 18:44 UTC, krapohl
Console error on 4.4.40-rc.7 (77.29 KB, image/png), 2020-04-10 20:13 UTC, krapohl
Same failure rc8 on vmware (55.79 KB, image/png), 2020-04-14 19:35 UTC, krapohl
Failure on 4.5.0-0.nightly-2020-04-14-184903 (111.86 KB, image/png), 2020-04-15 01:20 UTC, krapohl
Same failure on 4.5.0-0.nightly-2020-04-14-184903 (88.16 KB, image/png), 2020-04-15 01:23 UTC, krapohl
Latest-4.5 nightly (155.38 KB, image/png), 2020-05-07 20:12 UTC, krapohl
From 4.5.0-0.nightly-2020-05-13-130344 (72.19 KB, image/png), 2020-05-13 16:28 UTC, krapohl
Part 1 of 4 must gather (15.00 MB, application/gzip), 2020-05-13 17:16 UTC, krapohl
Part 2 of 4 must gather (15.00 MB, application/octet-stream), 2020-05-13 17:17 UTC, krapohl
Part 3 of 4 must gather (15.00 MB, application/octet-stream), 2020-05-13 17:18 UTC, krapohl
Part 4 of 4 must gather (4.15 MB, application/octet-stream), 2020-05-13 17:19 UTC, krapohl

Description krapohl 2020-04-08 19:02:18 UTC
Created attachment 1677346 [details]
oc get co

Description of problem:
On the OCP 4.4.0-rc.6 console, I see messages to the effect of:

Machine test-pr-op3-ms6f9-master-0 does not have valid node reference

Machine test-pr-op3-ms6f9-master-1 does not have valid node reference

Machine test-pr-op3-ms6f9-master-2 does not have valid node reference

Version-Release number of selected component (if applicable):


How reproducible:
Every 4.4.0-rc.6 install. I do not see this on OCP 4.3 installs.

Steps to Reproduce:
1. Install OCP
2.
3.

Actual results:


Expected results:


Additional info:





Comment 1 krapohl 2020-04-08 19:03:52 UTC
Created attachment 1677348 [details]
oc get po

oc get po

Comment 2 krapohl 2020-04-08 19:09:13 UTC
Created attachment 1677349 [details]
part 2aa of must-gather

part 2aa of must-gather

Comment 3 krapohl 2020-04-08 19:11:06 UTC
Created attachment 1677350 [details]
must-gather 2ab

must-gather 2ab

Comment 4 krapohl 2020-04-08 19:11:57 UTC
Created attachment 1677351 [details]
must-gather 2ac

must-gather 2ac

Comment 5 krapohl 2020-04-08 19:14:09 UTC
Created attachment 1677352 [details]
must-gather 2ad

must-gather 2ad

To put the parts back together:
cat must-gather.tar.gz2a* > must-gather.tar.gz
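
For anyone reassembling the parts on a machine without a POSIX shell, the `cat` command above could be approximated with a short Python sketch. It assumes the four parts were saved in one directory under names that sort in the right order (as `2aa` to `2ad` do); `join_parts` and `count_members` are hypothetical helper names, not part of any OpenShift tooling.

```python
import glob
import os
import shutil
import tarfile


def join_parts(pattern: str, output: str) -> list[str]:
    """Concatenate all files matching `pattern`, in sorted name order, into `output`."""
    target = os.path.abspath(output)
    # Skip the output file itself in case the pattern happens to match it.
    parts = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != target)
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(output, "wb") as joined:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, joined)
    return parts


def count_members(archive_path: str) -> int:
    """Open the joined gzip tarball and count its entries as a cheap integrity check."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return len(archive.getnames())
```

Calling `join_parts("must-gather.tar.gz2a*", "must-gather.tar.gz")` and then `count_members("must-gather.tar.gz")` would mirror the shell one-liner; if the parts are incomplete or out of order, opening the archive should fail rather than return a count.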

Comment 6 bpeterse 2020-04-09 14:02:11 UTC
When submitting bug reports, please fill out the steps to reproduce completely, to ensure the assignee can quickly reproduce & fix your bug, thanks!

Steps to Reproduce:
1. Install OCP
2.
3.

Comment 7 bpeterse 2020-04-09 15:50:52 UTC
Setting fix version 4.5 for now while awaiting feedback.
Checking a current dev cluster, I don't immediately see anything problematic.

Comment 8 krapohl 2020-04-10 13:38:03 UTC
Setup
1. VMware 3 master, 3 worker setup. 
2. ESXi servers are set up with vSphere 6.7
3. SAN with 7 attached LUNs, SDRS turned off.


I assume much, if not all, of this information is already apparent from the must-gather.

Comment 9 krapohl 2020-04-10 18:43:46 UTC
Created attachment 1677839 [details]
Console output of issue

Image of Console displaying issue

Comment 10 krapohl 2020-04-10 18:44:18 UTC
Created attachment 1677840 [details]
Version info

Version info

Comment 11 krapohl 2020-04-10 20:13:04 UTC
Created attachment 1677843 [details]
Console error on 4.4.40-rc.7

Same error occurring with 4.4.0-rc.7

Comment 12 bpeterse 2020-04-13 17:22:45 UTC
The extra info may be in the must-gather, but we can triage bugs much faster if the basics of all bugs are consistently reported.
Appreciate it!

Comment 13 krapohl 2020-04-14 19:35:35 UTC
Created attachment 1678803 [details]
Same failure rc8 on vmware

Waiting on fix

Comment 14 krapohl 2020-04-15 01:20:10 UTC
Created attachment 1678856 [details]
Failure on 4.5.0-0.nightly-2020-04-14-184903

Same failure occurring on 4.5.0-0.nightly-2020-04-14-184903

Comment 15 krapohl 2020-04-15 01:23:14 UTC
Created attachment 1678857 [details]
Same failure on  4.5.0-0.nightly-2020-04-14-184903


Failures on VMware, OCP console Overview, 4.5.0-0.nightly-2020-04-14-184903

Comment 16 krapohl 2020-04-28 21:47:17 UTC
So what is the status of this issue?

Comment 17 bpeterse 2020-05-04 14:09:04 UTC
The console seems to be simply reporting on the data it is receiving, and that there is a problem with the machines. Passing this along to machine config.

Comment 19 Alexander Demicev 2020-05-05 12:41:36 UTC
Does this issue appear on newer builds?

Comment 20 krapohl 2020-05-07 20:12:31 UTC
Created attachment 1686329 [details]
Latest-4.5 nightly

Yes the latest-4.5 nightly is still showing the issue.

Comment 21 Alberto 2020-05-13 09:00:10 UTC
In 4.4, vSphere has no support for IPI or automated machine management. The machine API operator will no-op.

The screenshot shared for 4.5 in https://bugzilla.redhat.com/show_bug.cgi?id=1822345#c20 shows your worker machines stuck in provisioning. Can you reproduce that against the latest nightly? If so, can you share must-gather logs?
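
As a rough aid for that check, the affected machines could be listed from the output of `oc get machines -n openshift-machine-api -o json`. The sketch below assumes each Machine object carries `status.phase` and, once linked to a node, `status.nodeRef`; the helper name `machines_without_node` and the sample data are illustrative only.

```python
import json


def machines_without_node(machine_list: dict) -> list[tuple[str, str]]:
    """Return (name, phase) for every Machine that has no status.nodeRef."""
    stuck = []
    for machine in machine_list.get("items", []):
        status = machine.get("status") or {}
        if not status.get("nodeRef"):
            stuck.append((machine["metadata"]["name"], status.get("phase", "<unknown>")))
    return stuck


# Made-up input for illustration; real data would come from:
#   oc get machines -n openshift-machine-api -o json
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "example-master-0"},
   "status": {"phase": "Running", "nodeRef": {"kind": "Node", "name": "master-0"}}},
  {"metadata": {"name": "example-worker-abcde"},
   "status": {"phase": "Provisioning"}}
]}
""")

for name, phase in machines_without_node(SAMPLE):
    print(f"{name}: phase={phase}, no nodeRef")
```

A machine that shows up here with phase `Provisioning` would correspond to the "does not have valid node reference" messages on the console.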

Comment 22 krapohl 2020-05-13 16:28:22 UTC
Created attachment 1688131 [details]
From 4.5.0-0.nightly-2020-05-13-130344

Nightly console overview, 4.5.0-0.nightly-2020-05-13-130344

Comment 23 krapohl 2020-05-13 16:29:12 UTC
Now gathering must-gather logs.

Comment 24 krapohl 2020-05-13 17:16:07 UTC
Created attachment 1688138 [details]
Part 1 of 4 must gather

Part 1 of 4 must gather

Comment 25 krapohl 2020-05-13 17:17:24 UTC
Created attachment 1688139 [details]
Part 2 of 4 must gather

Part 2 of 4 must gather

Comment 26 krapohl 2020-05-13 17:18:36 UTC
Created attachment 1688141 [details]
Part 3 of 4 must gather

Part 3 of 4 must gather

Comment 27 krapohl 2020-05-13 17:19:45 UTC
Created attachment 1688144 [details]
Part 4 of 4 must gather

Part 4 of 4 must gather

Comment 28 krapohl 2020-05-13 17:21:33 UTC
To put the must-gather parts back together:

cat must-gather.tar.gz.parta*  >must-gather.tar.gz.joined

Comment 29 Alexander Demicev 2020-05-14 09:04:40 UTC
It seems the controller can't find the template for creating a VM. Can you verify that "walt-45latest1-5vwzr-rhcos" exists?
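
One way to see which template name the controller is looking for is to read it from the MachineSet objects (`oc get machinesets -n openshift-machine-api -o json`) and compare it with the VM and template names present in vCenter. The sketch below assumes the vSphere provider spec stores the name under `spec.template.spec.providerSpec.value.template`; the helper names, the MachineSet name, and the set of uploaded names are illustrative assumptions.

```python
def machineset_templates(machineset_list: dict) -> dict[str, str]:
    """Map each MachineSet name to the vSphere template named in its providerSpec."""
    templates = {}
    for machineset in machineset_list.get("items", []):
        provider_value = (
            machineset.get("spec", {})
            .get("template", {})
            .get("spec", {})
            .get("providerSpec", {})
            .get("value", {})
        )
        templates[machineset["metadata"]["name"]] = provider_value.get("template", "")
    return templates


def missing_templates(templates: dict[str, str], vcenter_names: set[str]) -> dict[str, str]:
    """Keep only the MachineSets whose template is not among the known vCenter names."""
    return {name: tpl for name, tpl in templates.items() if tpl not in vcenter_names}


# Illustrative data: the MachineSet name is assumed, the template name is the one
# from the error message, and UPLOADED stands in for the names visible in vCenter.
SAMPLE_MACHINESETS = {
    "items": [
        {
            "metadata": {"name": "walt-45latest1-5vwzr-worker"},
            "spec": {"template": {"spec": {"providerSpec": {"value": {
                "template": "walt-45latest1-5vwzr-rhcos"}}}}},
        }
    ]
}
UPLOADED = {"rhcos-45.81.202005130448-0-vmware.x86_64"}

print(missing_templates(machineset_templates(SAMPLE_MACHINESETS), UPLOADED))
```

Any MachineSet printed here references a template that was never uploaded under that name, which would be consistent with the "vm ... not found" error.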

Comment 30 krapohl 2020-05-14 13:39:23 UTC
I don't understand the comment .... obviously the VMs have been created successfully via the RHCOS template .... the cluster is up .... we would not have gotten to the point where I can log onto the OCP console without that having happened.

I can tell you how I find out which RHCOS template to use. I go to the release.txt file and pull the machine-os information. In this case that would have been https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.5.0-0.nightly-2020-05-13-130344/release.txt which says I need the 45.81.202005131029-0 version of RHCOS.

Then I go here to pull the ova for that rhcos: https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.5/45.81.202005130448-0/x86_64/rhcos-45.81.202005130448-0-vmware.x86_64.ova

Once I pull that OVA, I upload it to our vCenter using the vCenter "Deploy OVF Template" function so that it can be referenced by the VMware Terraform create function.
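
The release.txt lookup described above could be scripted roughly as follows. This is a sketch under two assumptions: that release.txt lists the OS on a line beginning with `machine-os`, and that the OVA URL follows the same pattern as the link quoted above. The function names and the sample text are hypothetical.

```python
import re

# URL pattern copied from the OVA link quoted above; the stream is a parameter.
OVA_URL = (
    "https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/"
    "{stream}/{version}/x86_64/rhcos-{version}-vmware.x86_64.ova"
)


def machine_os_version(release_txt: str) -> str:
    """Extract the version token that follows 'machine-os' in release.txt (assumed format)."""
    match = re.search(r"^\s*machine-os\s+(\S+)", release_txt, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no machine-os line found")
    return match.group(1)


def ova_url(version: str, stream: str = "rhcos-4.5") -> str:
    """Build the VMware OVA download URL for a given RHCOS version."""
    return OVA_URL.format(stream=stream, version=version)


# Sample line in the assumed release.txt format, using the version quoted above.
SAMPLE_RELEASE_TXT = "  machine-os 45.81.202005131029-0 Red Hat Enterprise Linux CoreOS\n"

version = machine_os_version(SAMPLE_RELEASE_TXT)
print(version)
print(ova_url(version))
```

Re-running the lookup immediately before downloading would also show whether the machine-os version in release.txt has changed since it was last read.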

As I said, the VMs are getting created successfully, Ignition files are being passed around, and the VMs are up using this RHCOS and are forming the cluster.

Now, if there is a function within the install process that somehow renames the RHCOS template I originally used to create the VMs to walt-45latest1-5vwzr-rhcos, I don't know anything about it. I usually name the RHCOS OVAs I upload to our vCenter after the image, rhcos-45.81.202005130448-0-vmware.x86_64 in this case. So this does not match the "walt-45latest1-5vwzr-rhcos" you are asking about.

I would not know where to look for walt-45latest1-5vwzr-rhcos, so I need very specific information on where to look, on the VM itself or in the vCenter, to answer your question on whether it exists. Do you want me to look on the VMs (masters/workers) themselves, or somewhere in the vCenter? If so, where?

Comment 31 krapohl 2020-05-14 13:56:51 UTC
I just noticed that the RHCOS version referenced in release.txt was updated from 45.81.202005130448 to 45.81.20200513102909 between the time I pulled it and when I looked at it again today. I'm not sure why your process does this, or how anyone can stay up to date if it does. I doubt this has anything to do with why these errors are coming up, since I've been seeing them for months with 4.5 nightly builds.

Anyway, my questions still stand. Please tell me (with very specific information) where I should be looking for the walt-45latest1-5vwzr-rhcos file?

Comment 32 krapohl 2020-05-18 12:59:54 UTC
I disagree with this being moved to 4.6. The problem originally happened on  4.4, and was not fixed and then moved to 4.5. It should be fixed in 4.5.

Comment 33 Alberto 2020-05-18 13:30:17 UTC
>I disagree with this being moved to 4.6. The problem originally happened on  4.4, and was not fixed and then moved to 4.5. It should be fixed in 4.5.

Hey krapohl, the initial bug as reported is a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1834966

We target any non-release-blocker bug against the release currently under feature development, i.e. 4.6. Then we evaluate backports to the version where it was first found.

I moved this one too fast, though. I'm moving it back to 4.5 to reevaluate severity, and renaming the bz according to https://bugzilla.redhat.com/show_bug.cgi?id=1822345#c21

Comment 34 Alberto 2020-05-18 13:45:57 UTC
Hey krapohl, to get the logs from https://bugzilla.redhat.com/show_bug.cgi?id=1822345#c23
can you clarify which steps you ran? Did you just run the IPI installer? Did you run the UPI steps?

Comment 35 krapohl 2020-05-18 15:19:37 UTC
I don't understand your first question about logs .... all I did was a must-gather ... there were no logs .. I attached whatever must-gather puts in the folder it creates.

For the second question .... this is a VMware install, as is indicated in previous information, so it must be UPI.

Comment 36 Alberto 2020-05-18 15:37:25 UTC
>For the second question .... this is a VMware install, as is indicated in previous information, so it must be UPI.

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1822345#c22 this is 4.5, so this might as well be IPI.

So this is a 4.5 UPI vSphere install which resulted in a running cluster by following the documented steps.
During the UPI install steps the installer instantiated a machineSet object with a bad input for the template?

Moving this to the installer to either prevent this object from being instantiated or ensure it uses the right input. Relates to https://bugzilla.redhat.com/show_bug.cgi?id=1834966.

Comment 37 Abhinav Dahiya 2020-05-18 16:37:26 UTC

*** This bug has been marked as a duplicate of bug 1834966 ***

