Created attachment 1819343 [details]
openshift_install.log

Description of problem:
When trying to deploy IPI OCP on IBM Cloud, the installation fails because one of the masters is slow to come up. The problem began when CoreOS was bumped to a newer version.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-30-070917
4.9.0-0.nightly-2021-08-30-192239

How reproducible:
100%

Steps to Reproduce:
Follow https://docs.openshift.com/container-platform/4.8/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html with the changes mentioned in https://docs.google.com/document/d/1tlOYAvdju_iTjF9dLUl2ZfR4LiR5tBPjQa3V-_1TMN8/edit to deploy IPI OCP on IBM Cloud.

Actual results:
Installation fails:
ERROR Bootstrap failed to complete: timed out waiting for the condition
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
FATAL Bootstrap failed to complete

Expected results:
The deployment should succeed.

Additional info:
Attaching the installation log and a screenshot of the console. Bootstrap logs: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/log-bundle-20210830214738.tar.gz
I don't think this is specific to ibmcloud. The problem appears to happen when there is a mismatch in POST times between servers (not likely to happen in virt environments). I believe the API server on the bootstrap node is shutting down before the slowest server manages to get its Ignition data, because one of the other master servers (with a quicker POST time) has signalled that it is ready. I've been able to reproduce this on virt by adding a 5-minute delay to one of the master VMs. This wasn't happening last week, before the RHCOS image version was bumped; I assume the longer time to pivot gave the slower master enough time to get its Ignition data.
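The race described above can be sketched with a toy timeline model. This is purely illustrative (not the installer's actual teardown logic); the timings, names, and the teardown-delay constant are all invented for the sketch:

```python
# Toy model of the bootstrap race (illustrative only; not the real
# installer logic -- all timings, names, and constants are invented).

# Assumed grace period: the bootstrap node stops serving shortly
# after the first master signals it is ready.
TEARDOWN_AFTER_FIRST_READY_S = 10

def simulate(post_times_s):
    """Each master fetches its Ignition data as soon as its POST
    finishes. A master succeeds only if its request arrives before the
    bootstrap API server is torn down; later requests fail."""
    first_ready = min(post_times_s.values())
    teardown_at = first_ready + TEARDOWN_AFTER_FIRST_READY_S
    return {name: t <= teardown_at for name, t in post_times_s.items()}

# master-2 has a 5-minute (300 s) slower POST, as in the virt reproducer
result = simulate({"master-0": 60, "master-1": 70, "master-2": 360})
# -> master-0 and master-1 get their Ignition data; master-2 does not
```

The model captures why the failure only shows on hardware with very uneven POST times: teardown is keyed to the *fastest* master, so the slowest one loses the race.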
Moving over to the installer component, where somebody familiar with the installer can investigate.
Removing Triaged to mark for re-triaging by the team.
https://bugzilla.redhat.com/show_bug.cgi?id=1998643 is an urgent bug that was opened at nearly the same time as this one (after the CoreOS bump). Although the root causes might be different (both suffered from bootstrap apiserver unhealthiness), the latter one was fixed. Could you please retry to confirm whether this bug is still relevant?
@derekh do we still have the slow machine in our setup?
(In reply to Lubov from comment #6)
> @derekh do we still have the slow machine in our setup?

We don't, but if I remember correctly, at the time we came to the conclusion that this problem happened as a result of the RHCOS version bump, so I'm happy to close this as a DUP. Alternatively, if you want to try a reproduce in virt, you could pause a master for 5+ minutes after it gets provisioned and rebooted, while it is still in POST, then unpause it and see if everything comes up.
(In reply to Derek Higgins from comment #7)
> (In reply to Lubov from comment #6)
> > @derekh do we still have the slow machine in our setup?
>
> We don't but if I remember correctly at the time we did come to the
> conclusion that this problem happened as a result of the RHCOS version
> bump. So I'm happy to close this as a DUP.

I'm happy with you closing the bz :)
I'm closing this bug as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1998643. Since the parent bug is fixed, this bug should also be fixed.

*** This bug has been marked as a duplicate of bug 1998643 ***