Bug 1870151 - VMware OCP 4.6 OpenShift bootstrapping not working.
Summary: VMware OCP 4.6 OpenShift bootstrapping not working.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-19 12:59 UTC by krapohl
Modified: 2020-09-09 09:33 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-09 09:33:25 UTC
Target Upstream Version:
Embargoed:


Attachments:
Bootnode Vmware 4.6 png (293.49 KB, image/png), 2020-08-25 22:54 UTC, krapohl
journalctl failed bootstrap node left side console screen (402.03 KB, image/png), 2020-08-28 20:37 UTC, krapohl
Right side console screen (271.35 KB, image/png), 2020-08-28 20:38 UTC, krapohl
Must-gather for VMware 4.6 install (19.00 MB, application/gzip), 2020-09-02 19:29 UTC, krapohl
Second part must-gather file (17.86 MB, application/octet-stream), 2020-09-02 19:30 UTC, krapohl
PublicIPbootnode1 (227.54 KB, image/png), 2020-09-03 17:47 UTC, krapohl
PublicIPbootnode2 (117.82 KB, image/png), 2020-09-03 17:48 UTC, krapohl
PublicIPbootnode3 (157.16 KB, image/png), 2020-09-03 17:48 UTC, krapohl
PublicIPbootnode4 (152.57 KB, image/png), 2020-09-03 17:50 UTC, krapohl
PublicIPbootnode5 (136.88 KB, image/png), 2020-09-03 17:50 UTC, krapohl

Description krapohl 2020-08-19 12:59:24 UTC
I have a Terraform configuration that I've been using since OCP 4.2 to create VMware OCP clusters. Most recently I've used it to create 4.5.4 clusters. I'm now trying to create 4.6 clusters using the latest nightly builds; specifically, it picked up "OCP version 4.6.0-0.nightly-2020-08-18-165040".

Using the RHCOS OVA from https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest-4.6/rhcos-4.6.0-0.nightly-2020-07-16-122837-x86_64-vmware.x86_64.ova


I keep getting this error:

module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): ERROR Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get "https://api.walt-v0818.brown-chesterfield.com:6443/apis/config.openshift.io/v1/clusteroperators": EOF
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Use the following commands to gather logs from the cluster
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO openshift-install gather bootstrap --help
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): FATAL failed waiting for Kubernetes API: Get "https://api.walt-v0818.brown-chesterfield.com:6443/version?timeout=32s": EOF
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): Cluster bootstraping has failed, exit code was: 1


I went and ran openshift-install gather bootstrap --help; however, this will not work since the bootstrap node has failed to boot. To me this suggests there are problems with the RHCOS OVA file that you are pointing everyone to for 4.6 here: https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest-4.6/rhcos-4.6.0-0.nightly-2020-07-16-122837-x86_64-vmware.x86_64.ova

Comment 1 krapohl 2020-08-21 13:28:12 UTC
Can we get an update? This is a blocker for any VMware OCP 4.6 setup.

Comment 2 krapohl 2020-08-25 16:16:04 UTC
Are there any updates on your VMware support?

Comment 3 Micah Abbott 2020-08-25 20:11:47 UTC
Could you provide any journal or console output from the bootstrap node that failed to boot?  The existing information is not enough to determine what is going wrong.

Additionally, is it possible to drop the `ibmconf` group from this BZ?  It makes it harder for the wider RHCOS team to investigate/triage this BZ if they are not part of the group (which is most of the team).

Comment 4 krapohl 2020-08-25 22:53:45 UTC
Attached is a picture of the console from the bootnode.

Comment 5 krapohl 2020-08-25 22:54:57 UTC
Created attachment 1712602 [details]
Bootnode Vmware 4.6 png

Bootnode vmware console png

Comment 6 Luca BRUNO 2020-08-26 07:39:27 UTC
Your bootstrap node is failing at some point in the initramfs, but your console screenshot is not enough to diagnose what's going on. Failure logs and an emergency shell are available on the node console, but the default console parameters need to be tweaked for your environment first, see https://docs.fedoraproject.org/en-US/fedora-coreos/emergency-shell/.

If you tweak the kernel boot arguments via GRUB as described above, you will see the full error logs. A recording or screenshots of those would be helpful. Additionally, once there, you can pop into the emergency shell to interactively check what's going on.

Comment 7 krapohl 2020-08-26 16:03:11 UTC
Unfortunately, https://docs.fedoraproject.org/en-US/fedora-coreos/emergency-shell/ does not give me enough direction for a VMware vCenter environment to gather the full error logs.

This is what I have available to me. I am using the version of the OVA file mentioned previously. I've got 8 VMs created in my vCenter. The only access I have to the console on the bootstrap VM is via the vCenter option to "Launch Web Console". It is not obvious from your reference to https://docs.fedoraproject.org/en-US/fedora-coreos/emergency-shell how to get to a GRUB console from the bootstrap Web Console view I have. Please give more detailed instructions on using the vCenter Web Console of a VM to collect the information you want.

Comment 8 Luca BRUNO 2020-08-27 12:05:39 UTC
> The only access I have to the console on the bootstrap VM is via the vCenter option to "Launch Web Console".

That's indeed where you can access the GRUB console for the bootstrap node. The GRUB menu shows up there when the node is booting (right after the BIOS/UEFI stage completes), with on-screen instructions on how to interrupt the boot countdown and adjust the kernel parameters to match the console arrangement of your setup (the semantics are explained in the doc page above).

> I have a terraform that I've been using since OCP 4.2 to create VMware OCP clusters.

As a sidenote, my guess is that some entries in your custom Terraform plan are causing this failure. In particular, OCP 4.6 is going to use the newer Ignition configuration schema (3.x); third-party customization using the old schema (2.x), while previously working up to 4.5, may now cause bootstrapping failures with the same symptoms as shown in this ticket.
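
For illustration only (the resource and variable names here are hypothetical, not taken from your plan), a stanza written against the older ignition provider, which renders a spec 2.x config that a 4.6 bootstrap node will reject, looks roughly like this:

data "ignition_file" "hostname" {
  filesystem = "root"                      # required by the 2.x-era provider
  path       = "/etc/hostname"
  mode       = 420
  content {
    content = "bootstrap-0"
  }
}

data "ignition_config" "bootstrap" {
  append {                                 # spec 2.x terminology
    source = var.bootstrap_ignition_url    # hypothetical variable
  }
  files = [data.ignition_file.hostname.id] # the old provider exposed ".id"
}

If your plan (or anything it pulls in remotely) still looks like this, the rendered config carries a 2.x version header and Ignition in 4.6 will refuse it.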

Comment 9 krapohl 2020-08-27 15:02:26 UTC
It seems I can now get to a GRUB command prompt. From the GRUB command prompt, can you give detailed, step-by-step instructions on how to get the information you need to debug?

We are using the Terraform "ignition" provider, which, per its documentation (https://www.terraform.io/docs/providers/ignition/index.html#ignition-versions), only supports Ignition up to 2.1.0. So this may be the problem.

Comment 10 krapohl 2020-08-27 19:27:51 UTC
I think this change in 4.6 to a new version of the Ignition configuration schema, without also making sure there are updates to the ignition provider in Terraform, is going to hit the OpenShift customer base very hard.

Is Red Hat going to continue to support VMware UPI?

Comment 11 krapohl 2020-08-27 21:01:46 UTC
Can you point me to information about the difference between Ignition configuration schema 3.x and 2.x?

Comment 12 Luca BRUNO 2020-08-28 08:57:05 UTC
> From the grub command prompt, can you give detailed, step by step, instructions on how to get the information you need to debug.

The specific details depend on your environment. In general, you can start editing the active entry by pressing 'e' before booting into Linux, then edit the parameters on the line starting with 'linux', and then finally boot it by pressing 'ctrl-x'.
In most cases, removing all `console=...` parameters will force kernel auto-detection, which may or may not work for your node too.

> Is Redhat going to continue to support VMware UPI?

I'm not in a position to answer this, and 4.6 is not yet out with official docs/release-notes.
Looking at the openshift/installer pending branch for 4.6, I can still see Terraform artifacts, https://github.com/openshift/installer/tree/release-4.6/upi/vsphere.

> Can you point me to information about the difference between Ignition configuration schema 3.x and 2.x?

For an overall view, see https://github.com/coreos/ignition/blob/v2.6.0/doc/migrating-configs.md#from-version-230-to-300.

Comment 13 krapohl 2020-08-28 20:37:26 UTC
Created attachment 1713009 [details]
journalctl failed bootstrap node left side console screen

I was able to go into GRUB and start the edit, removed all console= parameters from the linux line, then pressed Ctrl-X. I got to a # prompt, ran journalctl -n 100, and paged down until I came to what is shown in this image. This is the left side of the screen; the right side is in the next image.

Comment 14 krapohl 2020-08-28 20:38:21 UTC
Created attachment 1713012 [details]
Right side console screen

This is the right side of the console screen.

Comment 15 krapohl 2020-08-28 20:43:34 UTC
It seems the journalctl output agrees with what you suspected: we are using an unsupported version of the Ignition config. @lucab, do you agree?

Comment 16 Luca BRUNO 2020-08-31 07:27:06 UTC
Yes, the errors logged on your console seem to confirm my initial guess. Either the initial config stub or an additional remote config referenced by it is using a 2.x schema version. Adapting all of those to 3.x should fix this; I don't think there is anything specifically broken at the RHCOS/VMware level.

For reference, the vsphere UPI plan on the 4.6 branch at https://github.com/openshift/installer/tree/release-4.6/upi/vsphere seems to have a few changes for 3.x configs; you may want to look into it.

Additionally, the latest openshift/installer seems to be using the plugin at https://github.com/community-terraform-providers/terraform-provider-ignition to generate 3.x configs from Terraform.

Comment 17 krapohl 2020-08-31 12:37:50 UTC
If I have problems building the new ignition provider, whom do I contact?

Comment 18 krapohl 2020-08-31 17:17:13 UTC
How do I add the ignition provider built by the instructions in https://github.com/community-terraform-providers/terraform-provider-ignition into my Terraform configuration? Unfortunately, the README for this repo only gives instructions on how to build the provider, not how to get it into a Terraform configuration.

Comment 19 Luca BRUNO 2020-09-01 12:59:37 UTC
I think that development/maintenance of the new provider falls under the OCP "Installer" component.

Third-party providers can be consumed by Terraform as plugins, see https://www.terraform.io/docs/plugins/basics.html.
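
A minimal sketch, assuming Terraform 0.13 or newer and assuming the community provider is published on the registry under the community-terraform-providers namespace (please verify the exact source address), would be:

terraform {
  required_providers {
    ignition = {
      source  = "community-terraform-providers/ignition"  # assumed registry address
      version = "~> 2.1"
    }
  }
}

On older Terraform (0.12), you would instead copy the built binary into the plugin search path by hand.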

I'll close this ticket here as the RHCOS side seems fine.

Comment 20 krapohl 2020-09-02 15:19:02 UTC
I have successfully built a new Terraform ignition provider plugin, using the information previously referenced, which is supposed to support 3.x: https://github.com/community-terraform-providers/terraform-provider-ignition.

I have successfully used the new plugin in my terraform init procedure by copying the /root/go/bin/terraform-provider-ignition binary into my Terraform home directory /git0901/tf_openshift_4/vsphere-upi/.terraform/plugins/linux_amd64 and renaming it to terraform-provider-ignition_v2.1.1_x4.

I have successfully modified my Terraform code following the directions referenced above (https://github.com/coreos/ignition/blob/v2.6.0/doc/migrating-configs.md#from-version-230-to-300): 1) changed any data.ignition_x.id references to data.ignition_x.rendered, 2) changed append to merge, 3) removed any filesystem references in ignition_file usage. Doing all these things allowed terraform apply to run cleanly without error.
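
For reference, after those three changes my resources look roughly like this (names and contents simplified for illustration, not my exact code):

data "ignition_file" "hostname" {
  path = "/etc/hostname"          # the "filesystem" argument is gone in 3.x
  mode = 420
  content {
    content = "bootstrap-0"
  }
}

data "ignition_config" "bootstrap" {
  merge {                                          # was "append"
    source = var.bootstrap_ignition_url            # placeholder variable
  }
  files = [data.ignition_file.hostname.rendered]   # was ".id"
}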

However, the bootstrap node is still not booting. Analysis is showing the same issue with Ignition versioning. At this point I still think we have an RHCOS issue.

Comment 21 Luca BRUNO 2020-09-02 15:49:36 UTC
> However, the bootstrap node is still not booting. Analysis is showing the same issue with Ignition versioning. At this point I still think we have an RHCOS issue.

That's unlikely (but not impossible), as the vsphere UPI job in CI is exercising that path. Anyway, you can confirm that by grabbing the bootstrap configuration (plus all the configs it references) and checking their versions.

Additionally, please note that the OVA you referenced in your top comment does not exist anymore; at this point you are likely using an outdated 4.6-development version of the installer too.

Comment 22 Steve Milner 2020-09-02 18:52:48 UTC
(In reply to krapohl from comment #20)
> However, the bootstrap node is still not booting. Analysis is showing the
> same issue with Ignition versioning. At this point I still think we have an
> RHCOS issue.

From what I'm hearing, it sounds like you're saying the wrong Ignition version content is flowing and, thus, the OCP 4.6 boot image starting on the bootstrap fails. Is that accurate? Looking in the boot images, I see the referenced ignition version in 4.6.0-0.nightly-2020-07-16-122837 as ignition-0.35.1-14.rhaos4.6.gitb4d18ad.el8.x86_64. Checking /var/run/ignition.json, I see it is still using spec 2. Switching to the current nightly of 4.6.0-0.nightly-2020-08-26-093617 (which is also "latest"), I see ignition at ignition-2.6.0-3.rhaos4.6.git947598e.el8.x86_64, and I see it booting using a spec 3 Ignition config.

Have you been able to try the updated nightly?

Comment 23 krapohl 2020-09-02 19:16:38 UTC
I've re-run my Terraform, making only the following changes. I'm still using my new ignition v2.1.1 provider, which is supposed to support 3.x, built using the information provided here: https://github.com/community-terraform-providers/terraform-provider-ignition

1) I moved from using https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.6.0-0.nightly-2020-08-25-182234/ 

to using https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.6.0-0.nightly-2020-09-02-131630/

2) I moved from using rhcos-4.6.0-0.nightly-2020-07-16-122837-x86_64-vmware.x86_64 

to using 

https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/rhcos-4.6.0-0.nightly-2020-08-26-093617-x86_64-vmware.x86_64.ova

So, some progress: with only these changes the boot node is now coming up and the master nodes are coming up; however, the worker nodes are not.


[root@walt-v0902-keg ~]# oc get no
NAME                        STATUS   ROLES    AGE   VERSION
master-0.cluster.internal   Ready    master   62m   v1.19.0-rc.2+b5dc585-dirty
master-1.cluster.internal   Ready    master   62m   v1.19.0-rc.2+b5dc585-dirty
master-2.cluster.internal   Ready    master   62m   v1.19.0-rc.2+b5dc585-dirty


I'm attaching the must-gather now for the cluster.

Comment 24 krapohl 2020-09-02 19:29:19 UTC
Created attachment 1713500 [details]
Must-gather for VMware 4.6 install

This is the must-gather from using 

https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.6.0-0.nightly-2020-09-02-131630/


and 

the RHCOS OVA

https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/rhcos-4.6.0-0.nightly-2020-08-26-093617-x86_64-vmware.x86_64.ova

The bootstrap node and master nodes are booting; however, the workers are not coming up.

The must-gather is in two parts since it exceeds the maximum attachment size of 19.6 MB:

part 1 -> must-gather-aa
part 2 -> must-gather-ab


to put them back together

cat must-gather-a* > must-gather.tar.gz

Comment 25 krapohl 2020-09-02 19:30:16 UTC
Created attachment 1713502 [details]
Second part must-gather file

Second part must-gather

Comment 26 Luca BRUNO 2020-09-03 14:07:06 UTC
I did not spot anything relevant in your must-gather, but at the same time I didn't find any traces of worker node activity (though I may have missed some).

My feeling is that the workers are not completing the boot and thus not joining the cluster; if so, you can check their status on the consoles (as you previously did with the bootstrap node).

Comment 27 krapohl 2020-09-03 17:46:52 UTC
We have two cluster scenarios that we use on VMware. One uses VLAN networking, where the cluster nodes are on private IPs, and the other uses all public (9-dot) IPs.

Initially, and relative to the problems previously discussed in this case, I was building my cluster using a VMware VLAN for the cluster, with only two public IPs for the apps and API. The cluster nodes are on private IPs, including the bootnode. On my last try at building a VLAN cluster, the bootnode came up, and on my last two VLAN cluster build attempts all workers came up too. I'm not sure what the problem with the workers was on my first attempt.

The other cluster build we use has all public (9-dot) IPs on all the cluster nodes. This is our primary way of building clusters.
So with public IPs, the bootnode is not coming up now.

I've broken into the bootnode and looked at the journalctl output. I'm including some screenshots of what I'm seeing in the journalctl output on a bootnode with a public IP.

Comment 28 krapohl 2020-09-03 17:47:59 UTC
Created attachment 1713660 [details]
PublicIPbootnode1

journalctl output from the bootnode on a public IP, page 1. Each following image is one page down from the previous one.

Comment 29 krapohl 2020-09-03 17:48:33 UTC
Created attachment 1713661 [details]
PublicIPbootnode2

Page2

Comment 30 krapohl 2020-09-03 17:48:59 UTC
Created attachment 1713662 [details]
PublicIPbootnode3

page3

Comment 31 krapohl 2020-09-03 17:50:06 UTC
Created attachment 1713663 [details]
PublicIPbootnode4

page4

Comment 32 krapohl 2020-09-03 17:50:33 UTC
Created attachment 1713664 [details]
PublicIPbootnode5

page5

Comment 33 krapohl 2020-09-03 22:06:44 UTC
New update: I've gotten past the bootnode issue with public IPs. I had to add `overwrite = true` to all my ignition_file definitions; I saw in the doc here (https://github.com/coreos/ignition/blob/v2.6.0/doc/migrating-configs.md#from-version-230-to-300) that the new default is `overwrite = false`.
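
For anyone else hitting this, the change is one attribute per file definition, roughly like this (the file shown is only an illustration, not from my actual plan):

data "ignition_file" "chrony_conf" {
  path      = "/etc/chrony.conf"
  overwrite = true   # spec 3.x defaults overwrite to false, so Ignition will not replace an existing file
  mode      = 420
  content {
    content = file("${path.module}/chrony.conf")
  }
}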

Comment 34 Micah Abbott 2020-09-08 19:16:09 UTC
(In reply to krapohl from comment #33)
> New update: I've gotten past the bootnode issue with public IPs. I had to add
> `overwrite = true` to all my ignition_file definitions; I saw in the doc here
> (https://github.com/coreos/ignition/blob/v2.6.0/doc/migrating-configs.md#from-version-230-to-300)
> that the new default is `overwrite = false`.

Does this mean you have successfully installed a cluster?  I'm not certain what, if any, errors remain.

Comment 35 krapohl 2020-09-08 22:46:11 UTC
I'm successfully installing latest-4.6 clusters now with public IPs. I'm still having trouble with VLAN, but public IPs are our main method. I think we can close this; I don't have time for VLAN now.

