Description of problem:
Doing a VMware UPI install, running the command "openshift-install wait-for bootstrap-complete" does not wait for the masters and bootstrapping to complete before exiting. With the new "Starting Ignition (fetch-offline)" stage, the boot of the bootstrap and master nodes takes much longer, and "openshift-install wait-for bootstrap-complete" does not wait and exits with rc=1. Sometimes it takes a very long time for the masters to complete the "Starting Ignition (fetch-offline)" step, and it is very apparent the command has not been updated to take this into account, so we get bootstrapping failures very often. Either give the server that is providing the Ignition config more bandwidth so the nodes do not take so long to boot, or fix "openshift-install wait-for bootstrap-complete" so it checks each node before exiting.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Very often, but not every time.

Steps to Reproduce:
1. VMware UPI install with 3 masters and 4 workers; when the automation runs "openshift-install wait-for bootstrap-complete", it exits with rc=1.
2. Using RHCOS version rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-vmware.x86_64.ova and latest-4.6 (4.6.0-0.nightly-2020-09-24-111253).

Actual results:

Expected results:

Additional info:
Can you provide the full logs from the bootstrap and master nodes? The fetch-offline stage shouldn't normally take long.
Here is some of the information dumped when it exits.

module.ocp4_bootstrap.null_resource.do_action[0]: Still creating... [42m0s elapsed]
module.ocp4_bootstrap.null_resource.do_action[0]: Still creating... [42m10s elapsed]
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthServerConfigObservation_Error::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError::RouterCerts_NoRouterCertSecret: OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.111.37:443/healthz": dial tcp 172.30.111.37:443: connect: connection refused
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): RouterCertsDegraded: secret/v4-0-config-system-router-certs -n openshift-authentication: could not be retrieved: secret "v4-0-config-system-router-certs" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): OAuthServerConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Progressing is Unknown with NoData:
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator authentication Available is False with OAuthServiceCheckEndpointAccessibleController_EndpointUnavailable::OAuthServiceEndpointsCheckEndpointAccessibleController_EndpointUnavailable::ReadyIngressNodes_NoReadyIngressNodes: ReadyIngressNodesAvailable: Authentication require functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): OAuthServiceCheckEndpointAccessibleControllerAvailable: Get "https://172.30.111.37:443/healthz": dial tcp 172.30.111.37:443: connect: connection refused
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator csi-snapshot-controller Progressing is True with _AsExpected: Progressing: Waiting for Deployment to deploy csi-snapshot-controller pods
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator csi-snapshot-controller Available is False with _AsExpected: Available: Waiting for Deployment to deploy csi-snapshot-controller pods
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): ERROR Cluster operator dns Degraded is True with DNSDegraded: DNS default is degraded
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): ERROR Cluster operator etcd Degraded is True with StaticPods_Error: StaticPodsDegraded: pods "etcd-ip-9-37-37-147.cluster.internal" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): StaticPodsDegraded: pods "etcd-ip-9-37-37-148.cluster.internal" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): ERROR Cluster operator kube-apiserver Degraded is True with StaticPods_Error: StaticPodsDegraded: pods "kube-apiserver-ip-9-37-37-147.cluster.internal" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): StaticPodsDegraded: pods "kube-apiserver-ip-9-37-37-146.cluster.internal" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): StaticPodsDegraded: pods "kube-apiserver-ip-9-37-37-148.cluster.internal" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 4
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 4
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): ERROR Cluster operator kube-controller-manager Degraded is True with NodeInstaller_InstallerPodFailed::StaticPods_Error: StaticPodsDegraded: pods "kube-controller-manager-ip-9-37-37-147.cluster.internal" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): StaticPodsDegraded: pods "kube-controller-manager-ip-9-37-37-148.cluster.internal" not found
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): NodeInstallerDegraded: 1 nodes are failing on revision 4:
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): NodeInstallerDegraded: static pod of revision 4 has been installed, but is not ready while new revision 5 is pending
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 5
module.ocp4_bootstrap.null_resource.do_action[0] (remote-exec): INFO Cluster operator kube-controller-manager Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 5
Created attachment 1716410 [details]
Boot node log

Log from running ./openshift-install gather bootstrap --bootstrap 9.37.37.168 --master 9.37.37.146
Created attachment 1716415 [details]
Additional bootstrap logs with debug

First split file for bootstrap log from running command: ./openshift-install gather bootstrap --bootstrap 9.37.37.168 --master 9.37.37.146 --log-level debug --key /root/.ssh/id_rsa
Created attachment 1716416 [details]
Part two of log bundle

Second part of log bundle from command: ./openshift-install gather bootstrap --bootstrap 9.37.37.168 --master 9.37.37.146 --log-level debug --key /root/.ssh/id_rsa
Created attachment 1716427 [details]
Part three of log-bundle

[root@walt-0924a-keg bin]# ./openshift-install gather bootstrap --bootstrap 9.37.37.168 --master 9.37.37.146 --log-level debug --key /root/.ssh/id_rsa
DEBUG OpenShift Installer 4.6.0-0.nightly-2020-09-24-111253
DEBUG Built from commit c5352b951b83a15db68530973b5b0a36634fdd07
DEBUG Fetching Bootstrap SSH Key Pair...
DEBUG Loading Bootstrap SSH Key Pair...
DEBUG Using Bootstrap SSH Key Pair loaded from state file
DEBUG Reusing previously-fetched Bootstrap SSH Key Pair
INFO Pulling debug logs from the bootstrap machine
DEBUG Added /root/.ssh/id_rsa to installer's internal agent
DEBUG Added /tmp/bootstrap-ssh471273692 to installer's internal agent
DEBUG Gathering bootstrap systemd summary ...
DEBUG Gathering bootstrap failed systemd unit status ...
DEBUG Gathering bootstrap journals ...
DEBUG Gathering bootstrap containers ...
DEBUG Gathering rendered assets...
DEBUG rm: cannot remove '/tmp/artifacts-20200924152649/rendered-assets/openshift/etcd-bootstrap/bootstrap-manifests/secrets': Is a directory
DEBUG Gathering cluster resources ...
DEBUG Waiting for logs ...
DEBUG Gather remote logs
DEBUG Collecting info from 9.37.37.146
DEBUG scp: ./installer-masters-gather.sh: Permission denied
DEBUG Warning: Permanently added '9.37.37.146' (ECDSA) to the list of known hosts.
DEBUG Gathering master systemd summary ...
DEBUG Gathering master failed systemd unit status ...
DEBUG Gathering master journals ...
DEBUG Gathering master containers ...
DEBUG Waiting for logs ...
DEBUG Log bundle written to /var/home/core/log-bundle-20200924152649.tar.gz

To put parts back together: cat log-bundle* > log-bundle.tar.gz
Hmm, looks like `openshift-install gather` doesn't gather the full journal logs, just some select services. Full journal logs (or even just serial console logs) would be useful.

> Sometimes it takes a very long time for the masters to complete the "Starting Ignition (fetch-offline)" step, and it is very apparent the command has not been updated to take this into account, so we get bootstrapping failures very often.

How are you monitoring this? Are you looking at the VM console output scroll by and seeing it hang at the fetch-offline stage for a while?

Can you also paste the Ignition configs you're passing to VMware?
Created attachment 1716466 [details]
Console Image From vCenter

I view the console output from vCenter. We use a modified Terraform Ignition provider, terraform-provider-ignition_v2.1.1_x4. See Bugzilla case https://bugzilla.redhat.com/show_bug.cgi?id=1870151, where they helped us work through how to build this provider.
Created attachment 1716467 [details]
Terraform ignition information passed to ignition provider

This is the Terraform file used to create the information passed to the Ignition Terraform provider.
Hmm, from that screenshot it looks like the machine isn't getting enough entropy. We nowadays ship with random.trust_cpu=on, though it's not much use if the CPU doesn't support RDRAND. Can you verify that the (v)CPU you're exposing supports it?

There are also the jitter entropy patches, which should be coming in 8.3: https://bugzilla.redhat.com/show_bug.cgi?id=1778762#c23. There are a bunch of other RHBZs related to early-boot entropy starvation, so once we confirm that's the issue, I'll likely close this one as a dupe.
Well, after talking it over with our VMware support personnel and running the following test against a few VMs on our various ESXi servers:

cat /proc/cpuinfo | grep rdrand

It shows no RDRAND support on the CPUs. Are we testing for RDRAND correctly?

So this must have been a change since the 8/26 RHCOS. This seems like a bad assumption and a big problem if you're now assuming that RDRAND needs to be supported on all ESXi servers that OCP 4.6 gets installed onto. I can't believe this; am I misunderstanding?
> This seems like a bad assumption and a big problem if you're now assuming that RDRAND needs to be supported on all ESXi servers that OCP 4.6 gets installed onto.
>
> I can't believe this; am I misunderstanding?

No, we don't assume RDRAND, but we do use it if it's available. If it isn't, and there isn't an RNG device attached to the VM like virtio-rng, then it's possible for the node to not have enough entropy during early boot.

> So this must have been a change since the 8/26 RHCOS.

Do you have the exact RHCOS/OCP version you're referring to here?
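In case it's useful for checking the affected guests, here's a small standalone diagnostic sketch (nothing shipped in RHCOS or the installer; it just reads the same information as `grep rdrand /proc/cpuinfo` and `/proc/sys/kernel/random/entropy_avail`):

// rdrandcheck: standalone diagnostic sketch (not part of RHCOS/Ignition).
// It reports whether the vCPU advertises RDRAND (what random.trust_cpu=on
// relies on) and the kernel's current entropy estimate.
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/cpu"
)

func main() {
	fmt.Println("RDRAND supported:", cpu.X86.HasRDRAND)

	// entropy_avail is the kernel's estimate of available entropy bits;
	// a very low value right after boot matches the stalls you're seeing.
	data, err := os.ReadFile("/proc/sys/kernel/random/entropy_avail")
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read entropy_avail:", err)
		return
	}
	fmt.Println("entropy_avail:", strings.TrimSpace(string(data)))
}

If RDRAND comes back false and entropy_avail is very low shortly after boot, that lines up with the entropy-starvation theory.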
While I was working through https://bugzilla.redhat.com/show_bug.cgi?id=1870151 to get to a version of the Ignition provider that would work for us, the rhcos-4.6.0-0.nightly-2020-08-26-093617-x86_64-vmware.x86_64 RHCOS build came out; with it we finally got something working, and it had been working for us on a regular basis. Now, with the latest RHCOS (rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-vmware.x86_64.ova), we get through bootstrapping about 1 in 10 tries.
> the rhcos-4.6.0-0.nightly-2020-08-26-093617-x86_64-vmware.x86_64 RHCOS build came out; with it we finally got something working, and it had been working for us on a regular basis.

The payload for that version has been GC'ed, but based on that date and looking at the bootimage history, the RHCOS that was used then was 46.82.202008181646-0, which shipped the 8.3 kernel that had the jitter entropy patches. That would explain why it worked then. But this was a transitional period before we pinned the kernel to the 8.2 one. I'll see whether there are any plans to backport those patches to 8.2 as well.

Two questions:
1. Is there no way in VMware to attach a virtualized RNG device similar to virtio-rng? (Or alternatively, expose a more modern vCPU which supports RDRAND, or does the hypervisor host itself not support it?)
2. Circling back, even if the `wait-for` eventually times out, does the installation eventually succeed?
(In reply to Jonathan Lebon from comment #14)
> Two questions:
> 1. Is there no way in VMware to attach a virtualized RNG device similar to virtio-rng? (Or alternatively, expose a more modern vCPU which supports RDRAND, or does the hypervisor host itself not support it?)

Not sure I understand your proposal with virtio-rng. How would this be injected into the bootstrapping process, and why is RH making this change if it is breaking everyone's VMware installation? Our ESXi servers do not support RDRAND. We have about 30 of them; it would be a huge cost to replace them all to support OCP 4.6 on our VMware. We don't see this as a customer solution.

> 2. Circling back, even if the `wait-for` eventually times out, does the installation eventually succeed?

The masters do eventually fall out of waiting for random data, I think; it may be happening because I did something on the console that created the events needed for entropy. I'd have to look at this more. However, in the end the OCP installation fails.
Can you provide more information about your setup? What version of ESXi? What hardware is it running on? What VM hardware version are you using?
Created attachment 1717317 [details]
ESXi Server Info

We have 27 ESXi servers, all with the same config as displayed in the attached image of the Summary tab from our vCenter. They are connected via SAN to 10 SAN LUNs. See the next attachment for info on the SAN LUNs.
Created attachment 1717318 [details]
ESXi SAN configuration

All our ESXi servers are attached to a common SAN that has 10 LUNs.
Additional comment from our VMware support group about the CPUs in our VMware environment: "Near as I can tell all your CPUs are Sandy Bridge, predecessor of Ivy Bridge when Intel introduced RDRAND."
Any updates on your fix for this problem?
> So this must have been a change since the 8/26 RHCOS. This seems like a bad assumption and a big problem if you're now assuming that RDRAND needs to be supported on all ESXi servers that OCP 4.6 gets installed onto.

As Jonathan said, nothing explicitly requires RDRAND or newer hardware, but it does seem like something may have started trying to gather entropy earlier. The RHEL 8.3 fix is really ideal for this... but I wouldn't be comfortable trying to introduce that in an 8.2.z update.
Do we have a fix yet for this issue? All VMware development is completely stopped right now!
Could you test another UPI install using this OVA: https://s3.amazonaws.com/rhcos-jlebon/rhcos-46.82.202009301446-0-vmware.x86_64.ova ?

I think I've narrowed down where we're requesting entropy in fetch-offline, but would like to double-check that, even with that addressed, the VM doesn't just block again for a while on entropy in the rest of the boot process (e.g. if it has to fetch things over TLS or when generating host keys).
Making a run now.
Created attachment 1718043 [details]
GoodOva46 for Sept 30th

Good news: this OVA worked fine. The master and worker nodes did not get stuck like they did with the rhcos-4.6.0-0.nightly-2020-09-10-195619-x86_64-vmware.x86_64.ova version. My install completed; see the attached image.
One more thought: we started seeing the issues with our bare metal installs (per https://bugzilla.redhat.com/show_bug.cgi?id=1879510) at the same time we started to have problems with VMware. Wondering if they may be the same issue. If you could provide the rhcos-live-installer-kernel, rhcos-live-installer-initramfs, and rhcos-metal artifacts from the same build as rhcos-46.82.202009301446-0-vmware.x86_64.ova, I might be able to try an install.
The culprit turned out to be a golang `init()` function in one of our vendored projects in Ignition. I've submitted a patch upstream to alleviate the problem. From https://bugzilla.redhat.com/show_bug.cgi?id=1882515#c26 and local testing, it seems like that should be enough to unblock the VMware-on-old-hardware case, but clearly we're only barely getting by here. Things like host key generation might still delay the boot a bit, but it's at least bearable enough for the install to succeed. So ideally we do get the jitter entropy patches backported, but this fix should be able to hold us over until 8.3 if that's not possible.
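For anyone following along, here's a minimal sketch of the failure mode only (not the actual vendored code; the real change is in the upstream PR): a package-level `init()` that seeds from crypto/rand runs as soon as the package is imported, and on Linux that read blocks until the kernel CRNG is initialized, so on entropy-starved hardware merely importing the package can stall the binary at early boot. Deferring the read until randomness is first needed is one way around the import-time stall:

// Sketch of the pattern only; names and the fallback are hypothetical.
package main

import (
	crand "crypto/rand"
	"encoding/binary"
	"fmt"
	mrand "math/rand"
	"sync"
)

// Problematic pattern (shown commented out): this would run at import time
// and can block for a long time on hosts without RDRAND or an attached RNG
// device, because crypto/rand waits for the kernel CRNG to be initialized.
//
//	func init() {
//		var seed int64
//		_ = binary.Read(crand.Reader, binary.LittleEndian, &seed)
//		rng = mrand.New(mrand.NewSource(seed))
//	}

var (
	seedOnce sync.Once
	rng      *mrand.Rand
)

// newID lazily seeds the generator the first time an ID is requested, so
// importing the package no longer blocks early boot. (A real implementation
// would also need a mutex around rng, since math/rand.Rand isn't safe for
// concurrent use.)
func newID() uint64 {
	seedOnce.Do(func() {
		var seed int64
		if err := binary.Read(crand.Reader, binary.LittleEndian, &seed); err != nil {
			seed = 1 // hypothetical fallback, for the sketch only
		}
		rng = mrand.New(mrand.NewSource(seed))
	})
	return rng.Uint64()
}

func main() {
	fmt.Println("example ID:", newID())
}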
Any projections on when we might see something in a regular https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest-4.6/ build?
Thank you to Jonathan for the excellent triage and analysis! Since we have identified this as an issue which could affect multiple RHCOS installs, we need to address it as part of 4.6. Thankfully, Jonathan has a patch proposed upstream that we can include as part of Ignition in 4.6: https://github.com/census-instrumentation/opencensus-go/pull/1228
(In reply to krapohl from comment #30)
> Any projections on when we might see something in a regular https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest-4.6/ build?

It will likely take 2-3 days.
The new version of Ignition (2.6.0-5.rhaos4.6.git947598e.el8) landed as part of RHCOS 46.82.202010011740-0; if you have access to the RHCOS builds at https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?stream=releases/rhcos-4.6&release=46.82.202010011740-0, you can try out that build again while waiting for the `openshift-install` PR to merge.
We don't have access; we'll be looking for it to show up on https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/ before we validate.
Dropping dep on https://bugzilla.redhat.com/show_bug.cgi?id=1778762 because it makes Prow unhappy: https://github.com/openshift/installer/pull/4234#issuecomment-703637259
Can we get an idea of when the fix is going to drop to https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release? The first ETA has passed.
We are waiting for https://github.com/openshift/installer/pull/4234 to merge; the CI system has been giving us headaches recently, so I don't have an accurate timetable to give. Hopefully it merges today and we'll get artifacts synced to the mirror. But we are at the mercy of our robust CI system and can't force the process.
Dropping `UpcomingSprint` because it makes queries unhappy
> We don't have access

Everyone in the world does; see https://github.com/openshift/installer/blob/release-4.6/data/data/rhcos.json
Notably, the BaseURI is public: https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.6/46.82.202010011740-0/x86_64
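For reference, here's a hypothetical sketch of how the full OVA URL can be assembled from that file. The field names ("baseURI", and "path" under the per-platform "images" entries) are assumptions based on the build metadata format, so double-check them against the actual JSON:

// rhcosurl: hypothetical sketch for assembling the VMware OVA URL from the
// installer's data/data/rhcos.json. Field names are assumed, not verified.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strings"
)

type rhcosMeta struct {
	BaseURI string `json:"baseURI"`
	Images  map[string]struct {
		Path string `json:"path"`
	} `json:"images"`
}

func main() {
	raw, err := os.ReadFile("rhcos.json") // download the file linked above first
	if err != nil {
		log.Fatal(err)
	}
	var meta rhcosMeta
	if err := json.Unmarshal(raw, &meta); err != nil {
		log.Fatal(err)
	}
	// Prints something like <BaseURI>/rhcos-46.82.202010011740-0-vmware.x86_64.ova
	fmt.Println(strings.TrimRight(meta.BaseURI, "/") + "/" + meta.Images["vmware"].Path)
}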
Created attachment 1719852 [details]
Access Denied

Getting AccessDenied at that location.
I see now you meant https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.6/46.82.202010011740-0/x86_64/rhcos-46.82.202010011740-0-vmware.x86_64.ova
This RHCOS is working for us: https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.6/46.82.202010011740-0/x86_64/rhcos-46.82.202010011740-0-vmware.x86_64.ova
Marking verified based on comment #43
When should we expect to see the image placed in https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.6/?
(In reply to Ronda Marshall from comment #45)
> When should we expect to see the image placed in https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/latest-4.6/?

The mirror was updated today with the latest RHCOS artifacts.
Thanks so much. We have taken that RHCOS, added it to vCenter, and successfully deployed OCP 4.6.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196