I have a Terraform configuration that I've been using since OCP 4.2 to create VMware OCP clusters. Most recently I've used it to create 4.5.4 clusters. I'm now trying to create 4.6 clusters using the latest nightly builds. Specifically, it picked up OCP version "4.6.0-0.nightly-2020-08-18-165040". Using the RHCOS OVA from https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest-4.6/rhcos-4.6.0-0.nightly-2020-07-16-122837-x86_64-vmware.x86_64.ova I keep getting this error:

module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): ERROR Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get "https://api.walt-v0818.brown-chesterfield.com:6443/apis/config.openshift.io/v1/clusteroperators": EOF
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Use the following commands to gather logs from the cluster
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO openshift-install gather bootstrap --help
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): FATAL failed waiting for Kubernetes API: Get "https://api.walt-v0818.brown-chesterfield.com:6443/version?timeout=32s": EOF
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): Cluster bootstraping has failed, exit code was: 1

I went and ran `openshift-install gather bootstrap --help`; however, this will not work since the bootstrap node has failed to boot. To me this suggests there are problems with the RHCOS OVA file that you are pointing everyone to for 4.6 here: https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest-4.6/rhcos-4.6.0-0.nightly-2020-07-16-122837-x86_64-vmware.x86_64.ova
Can we get an update? This is a blocker for any VMware OCP 4.6 setup.
Are there any updates on your VMware support?
Could you provide any journal or console output from the bootstrap node that failed to boot? The existing information is not enough to determine what is going wrong. Additionally, is it possible to drop the `ibmconf` group from this BZ? It makes it harder for the wider RHCOS team to investigate/triage this BZ if they are not part of the group (which is most of the team).
Attached is a picture of the console from the bootstrap node.
Created attachment 1712602 [details]
Bootnode VMware 4.6 png

Bootnode VMware console png
Your bootstrap node is failing at some point in the initramfs, but your console screenshot is not enough to diagnose what's going on. Failure logs and an emergency shell are available on the node console, but the default console parameters need to be tweaked for your environment first, see https://docs.fedoraproject.org/en-US/fedora-coreos/emergency-shell/. If you tweak the kernel boot arguments via GRUB as described above, you will see the full error logs. A recording or screenshots of those would be helpful. Additionally, once there, you can pop into the emergency shell to interactively check what's going on.
Unfortunately, there is not enough information in https://docs.fedoraproject.org/en-US/fedora-coreos/emergency-shell/ to give me direction, in a VMware vCenter environment, on gathering the full error logs. This is what I have available to me: I am using the version of the OVA file mentioned previously, and I've got 8 VMs created in my vCenter. The only access I have to the console on the bootstrap VM is via the vCenter option to "Launch Web Console". It is not obvious from your reference to https://docs.fedoraproject.org/en-US/fedora-coreos/emergency-shell how to get to a GRUB console from the bootstrap Web Console view I have. Please give more detailed instructions that pertain to using the vCenter Web Console of a VM to collect the information you want.
> The only access I have to the console on the bootstrap VM is via the vCenter option to "Launch Web Console".

That's indeed where you can access the GRUB console for the bootstrap node. The GRUB menu shows up there while the node is booting (right after BIOS/UEFI completes), with on-screen instructions on how to interrupt the boot countdown and adjust the kernel parameters to match the console arrangement of your setup (the semantics are explained in the doc page above).

> I have a Terraform configuration that I've been using since OCP 4.2 to create VMware OCP clusters.

As a side note, my guess is that some entries in your custom Terraform plan are causing this failure. In particular, OCP 4.6 is going to use the newer Ignition configuration schema (3.x); third-party customizations using the old schema (2.x), while previously working up to 4.5, may now cause bootstrapping failures with the same symptoms as shown in this ticket.
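To illustrate what I mean by old-schema customization, here is a minimal sketch of what a 2.x-era Terraform snippet typically looks like with the old ignition provider. This is purely illustrative, with placeholder resource names, paths, and URLs; it is not taken from your plan:

```
# Hypothetical 2.x-style sketch (old terraform-provider-ignition).
# Resource names, the path, and the append URL are placeholders.
data "ignition_file" "hostname" {
  filesystem = "root"            # the 2.x provider requires a filesystem reference
  path       = "/etc/hostname"
  mode       = 420

  content {
    content = "bootstrap-0"
  }
}

data "ignition_config" "bootstrap" {
  files = [data.ignition_file.hostname.id]   # 2.x provider exposes sub-configs via .id

  append {                                    # 2.x keyword for pulling in a remote config
    source = "https://example.internal:22623/config/master"
  }
}
```

Configs rendered this way declare a 2.x `ignition.version`, which Ignition releases that only support spec 3 will reject.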
It seems I can now get to a GRUB command prompt. From the GRUB command prompt, can you give detailed, step-by-step instructions on how to get the information you need to debug? We are using the Terraform "ignition" provider, which, from the documentation (https://www.terraform.io/docs/providers/ignition/index.html#ignition-versions), I see only supports up to Ignition 2.1.0. So this may be the problem.
I think this change in 4.6 to a new version of the Ignition configuration schema, without also making sure there are updates to the Terraform ignition provider, is going to hit the OpenShift customer base very hard. Is Red Hat going to continue to support VMware UPI?
Can you point me to information about the difference between Ignition configuration schema 3.x and 2.x?
> From the GRUB command prompt, can you give detailed, step-by-step instructions on how to get the information you need to debug?

The specific details depend on your environment. In general, you can start editing the active entry by pressing 'e' before booting into Linux, then edit the parameters on the line starting with 'linux', and then finally boot it by pressing 'ctrl-x'. In most cases, removing all `console=...` parameters will force kernel auto-detection, which may or may not work for your node too.

> Is Red Hat going to continue to support VMware UPI?

I'm not in a position to answer this, and 4.6 is not yet out with official docs/release-notes. Looking at the openshift/installer pending branch for 4.6, I can still see Terraform artifacts: https://github.com/openshift/installer/tree/release-4.6/upi/vsphere.

> Can you point me to information about the difference between Ignition configuration schema 3.x and 2.x?

For an overall view, see https://github.com/coreos/ignition/blob/v2.6.0/doc/migrating-configs.md#from-version-230-to-300.
Created attachment 1713009 [details]
journalctl from failed bootstrap node, left side of console screen

I was able to go into GRUB and start the edit, removed all `console=` parameters from the linux line, and then hit ctrl-x. I got to a # prompt, ran `journalctl -n 100`, and paged down until I came to what is shown in this image. This is the left side of the screen; the right side is in the next image.
Created attachment 1713012 [details]
Right side console screen

This is the right side of the console screen.
It seems the journalctl output agrees with what you suspected: we are using an unsupported version of the Ignition config. @lucab, do you agree?
Yes, the errors logged on your console seem to confirm my initial guess. Either the initial config stub or an additional remote config referenced by it is using a 2.x schema version. Adapting all of those to 3.x should fix this; I don't think there is anything specifically broken at the RHCOS/VMware level. For reference, the vSphere UPI plan on the 4.6 branch at https://github.com/openshift/installer/tree/release-4.6/upi/vsphere seems to have a few changes for 3.x configs; you may want to look into it. Additionally, the latest openshift/installer seems to be using the plugin at https://github.com/community-terraform-providers/terraform-provider-ignition to generate 3.x configs from Terraform.
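As a rough sketch of what 3.x-oriented provider usage looks like (placeholder names and URLs only; please double-check the exact attribute set against that provider's docs), the main visible differences are that file data sources drop `filesystem`, sub-configs are referenced via `rendered` instead of `id`, and `append` becomes `merge`:

```
# Hypothetical 3.x-style sketch (community terraform-provider-ignition).
# Resource names, the path, and the merge URL are placeholders.
data "ignition_file" "hostname" {
  path = "/etc/hostname"   # no "filesystem" attribute with the 3.x-oriented provider
  mode = 420

  content {
    content = "bootstrap-0"
  }
}

data "ignition_config" "bootstrap" {
  files = [data.ignition_file.hostname.rendered]   # .rendered replaces .id

  merge {                                           # replaces the 2.x "append" block
    source = "https://example.internal:22623/config/master"
  }
}
```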
If I have problems building the new ignition provider, who do I contact?
How do I add the ignition provider that gets built by the instructions in https://github.com/community-terraform-providers/terraform-provider-ignition into my Terraform configuration? Unfortunately, the README for this repo only gives instructions on how to build the provider, not how to get it into Terraform.
I think that development/maintenance of the new provider falls under the OCP "Installer" component. Third-party providers can be consumed by Terraform as plugins; see https://www.terraform.io/docs/plugins/basics.html. I'll close this ticket here as the RHCOS side seems fine.
I have successfully built a new Terraform ignition provider plugin, which is supposed to support 3.x, using the information previously referenced here: https://github.com/community-terraform-providers/terraform-provider-ignition. I have successfully used the new plugin in my terraform init procedure by copying the /root/go/bin/terraform-provider-ignition binary into my Terraform home directory /git0901/tf_openshift_4/vsphere-upi/.terraform/plugins/linux_amd64 and renaming it to terraform-provider-ignition_v2.1.1_x4. I have successfully modified my Terraform code following the directions referenced above, https://github.com/coreos/ignition/blob/v2.6.0/doc/migrating-configs.md#from-version-230-to-300:
1) changed any data.ignition_x.id references to data.ignition_x.rendered,
2) changed `append` to `merge`,
3) removed the `filesystem` references in ignition_file usage.
Doing all these things allowed the terraform apply to run cleanly without error. However, the bootstrap node is still not booting. Analysis is showing the same issue with Ignition versioning. At this point I still think we have an RHCOS issue.
> However, the bootstrap node is still not booting. Analysis is showing the same issue with Ignition versioning. At this point I still think we have an RHCOS issue.

That's unlikely (but not impossible), as the vSphere UPI job in CI is exercising that path. Anyway, you can confirm that by grabbing the bootstrap configuration (plus all the references) and checking their versions. Additionally, please note that the OVA you referenced in your top comment does not exist anymore; at this point you are likely using an outdated 4.6-development version of the installer too.
(In reply to krapohl from comment #20)
> However, the bootstrap node is still not booting. Analysis is showing the same issue with Ignition versioning. At this point I still think we have an RHCOS issue.

From what I'm hearing, it sounds like you're saying the wrong Ignition version content is flowing and, thus, the OCP 4.6 boot image starting on the bootstrap fails. Is that accurate? Looking in the boot images, I see the Ignition version referenced in 4.6.0-0.nightly-2020-07-16-122837 is ignition-0.35.1-14.rhaos4.6.gitb4d18ad.el8.x86_64. Checking /var/run/ignition.json, I see it's still using spec 2. Switching to the current nightly of 4.6.0-0.nightly-2020-08-26-093617 (which is also "latest"), I see Ignition at ignition-2.6.0-3.rhaos4.6.git947598e.el8.x86_64, and I see it booting using a spec 3 Ignition config. Have you been able to try the updated nightly?
I've re-run my Terraform making the following changes only. I'm still using my new ignition v2.1.1 provider that is supposed to support 3.x, built using the information provided here: https://github.com/community-terraform-providers/terraform-provider-ignition
1) I moved from using https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.6.0-0.nightly-2020-08-25-182234/ to using https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.6.0-0.nightly-2020-09-02-131630/
2) I moved from using rhcos-4.6.0-0.nightly-2020-07-16-122837-x86_64-vmware.x86_64 to using https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/rhcos-4.6.0-0.nightly-2020-08-26-093617-x86_64-vmware.x86_64.ova
So, some progress: with only these changes the bootstrap node is now coming up and the master nodes are coming up; however, the worker nodes are not coming up.

[root@walt-v0902-keg ~]# oc get no
NAME                        STATUS   ROLES    AGE   VERSION
master-0.cluster.internal   Ready    master   62m   v1.19.0-rc.2+b5dc585-dirty
master-1.cluster.internal   Ready    master   62m   v1.19.0-rc.2+b5dc585-dirty
master-2.cluster.internal   Ready    master   62m   v1.19.0-rc.2+b5dc585-dirty

I'm attaching the must-gather for the cluster now.
Created attachment 1713500 [details]
Must-gather for VMware 4.6 install

This is the must-gather from using https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/4.6.0-0.nightly-2020-09-02-131630/ and the RHCOS OVA https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/rhcos-4.6.0-0.nightly-2020-08-26-093617-x86_64-vmware.x86_64.ova. The bootstrap and master nodes are booting; however, the workers are not coming up. The must-gather is in two parts since it exceeds the max attachment size of 19.6M:
part 1 -> must-gather-aa
part 2 -> must-gather-ab
To put them back together: cat must-gather-a* > must-gather.tar.gz
Created attachment 1713502 [details]
Second part of must-gather file

Second part of the must-gather.
I did not spot anything relevant in your must-gather, but at the same time I didn't find any traces of worker node activity (though I may have missed some). My feeling is that in this case the workers are not completing the boot and thus not joining the cluster; if so, you can check their status on their consoles (as you previously did with the bootstrap node).
We have two cluster scenarios that we use on VMware. One uses VLAN networking, where the cluster nodes are on private IPs, and the other uses all public (9-dot) IPs. Initially, and relative to the problems previously discussed in this case, I was building my cluster using a VMware VLAN for the cluster, with only two public IPs for the apps and API; the cluster nodes, including the bootstrap node, are on private IPs. On my last try at building a VLAN cluster, the bootstrap node is coming up now, and on my last two VLAN cluster build attempts all workers came up too. I'm not sure what the problem with the workers was on my first attempt. The other cluster build we use has all public (9-dot) IPs on all the cluster nodes. This is our primary way of building clusters. Using public IPs, I'm now not getting the bootstrap node on a public IP to come up. So I've broken into the bootstrap node and looked at the journalctl output. I'm including some screenshots of what I'm seeing in journalctl on a bootstrap node on a public IP.
Created attachment 1713660 [details]
PublicIPbootnode1

journalctl from bootstrap node on a public IP, page 1. The following images are each one page down from the previous one.
Created attachment 1713661 [details]
PublicIPbootnode2

Page 2

Created attachment 1713662 [details]
PublicIPbootnode3

Page 3

Created attachment 1713663 [details]
PublicIPbootnode4

Page 4

Created attachment 1713664 [details]
PublicIPbootnode5

Page 5
New update: I've gotten past the bootstrap node issue with public IPs. I had to add `overwrite = true` to all my ignition_file definitions; I saw in the doc here, https://github.com/coreos/ignition/blob/v2.6.0/doc/migrating-configs.md#from-version-230-to-300, that the new default is `overwrite = false`.
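For anyone hitting the same thing, this is roughly what the change looks like in an ignition_file definition (a minimal sketch with a placeholder path and content, not my actual config):

```
# Sketch of the fix: set overwrite explicitly, since the 3.x default is false
# per the migration doc linked above. Path and content are placeholders.
data "ignition_file" "example" {
  path      = "/etc/example.conf"
  mode      = 420
  overwrite = true   # without this, Ignition will not replace an existing file at that path

  content {
    content = "key=value"
  }
}
```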
(In reply to krapohl from comment #33)
> New update: I've gotten past the bootstrap node issue with public IPs. I had to add `overwrite = true` to all my ignition_file definitions; I saw in the doc here, https://github.com/coreos/ignition/blob/v2.6.0/doc/migrating-configs.md#from-version-230-to-300, that the new default is `overwrite = false`.

Does this mean you have successfully installed a cluster? I'm not certain what, if any, errors remain.
I'm successfully installing latest-4.6 clusters now with public IPs. I'm still having trouble with VLANs, but public IPs are our main method. I think we can close this; I don't have time for the VLAN issue now.