Bug 1999594

Summary:

IPI deployment fails when master POST times differ

Product:

OpenShift Container Platform

Reporter:

Lubov <lshilin>

Component:

Installer

Assignee:

Beth White <beth.white>

Installer sub component:

OpenShift on Bare Metal IPI

QA Contact:

Amit Ugol <augol>

Status:

CLOSED DUPLICATE

Docs Contact:

Severity:

high

Priority:

high

CC:

aguclu, bfournie, ccrum, derekh, tsedovic

Version:

4.9

Target Milestone:

---

Target Release:

4.9.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-10-18 10:42:52 UTC

Type:

Bug

Attachments:

Description	Flags
openshift_install.log	none

Description Lubov 2021-08-31 11:47:23 UTC

Created attachment 1819343 [details]
openshift_install.log

Description of problem:
When deploying IPI OCP on IBM Cloud, the installation fails because one of the masters is slow. The problem began when CoreOS was changed to a newer version.


Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-30-070917
4.9.0-0.nightly-2021-08-30-192239

How reproducible:
100%

Steps to Reproduce:
Follow https://docs.openshift.com/container-platform/4.8/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html with the changes described in https://docs.google.com/document/d/1tlOYAvdju_iTjF9dLUl2ZfR4LiR5tBPjQa3V-_1TMN8/edit to deploy IPI OCP on IBM Cloud.

Actual results:
Failure 
ERROR Bootstrap failed to complete: timed out waiting for the condition 
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane. 
FATAL Bootstrap failed to complete 

Expected results:
The deployment should succeed.

Additional info:
Attaching the installation log and a screenshot of the console.
Bootstrap logs: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/log-bundle-20210830214738.tar.gz

Comment 2 Derek Higgins 2021-08-31 15:08:44 UTC

I don't think this is specific to IBM Cloud. The problem appears to happen when there
is a mismatch in POST times between servers (not likely to happen in virt environments).

I believe that the API server on the bootstrap node is shutting down before the slowest server
manages to get its Ignition data, because one of the other master servers (with a quicker POST time)
has signalled that it is ready.

I've been able to reproduce this on virt by adding a 5 minute delay on one of the master VMs.

This wasn't happening last week, before the RHCOS image version was bumped. I assume the longer time
to pivot gave the slower master enough time to get its Ignition data.
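
The suspected race can be illustrated with a small timing model. This is only a sketch of the hypothesis above, not the installer's actual logic: the function name, host names, and all timings are made up for illustration. It assumes each master fetches its Ignition data as soon as POST finishes, signals readiness one pivot later, and that the bootstrap API server stops as soon as the first master is ready.

```python
def stranded_masters(post_seconds, pivot_seconds):
    """Return the masters that would miss their Ignition data.

    post_seconds:  dict of master name -> seconds spent in POST (hypothetical values)
    pivot_seconds: seconds between fetching Ignition and signalling ready (hypothetical value)
    """
    # Assumption: a master signals ready one pivot after it finishes POST.
    ready_at = {name: post + pivot_seconds for name, post in post_seconds.items()}
    # Assumption: the bootstrap API server shuts down when the first master is ready.
    shutdown_at = min(ready_at.values())
    # A master that finishes POST after the shutdown can no longer fetch Ignition.
    return sorted(name for name, post in post_seconds.items() if post > shutdown_at)


# Illustrative numbers: master-2 has a 5 minute (300 s) longer POST than master-0.
posts = {"master-0": 120, "master-1": 130, "master-2": 420}

slow_pivot = stranded_masters(posts, pivot_seconds=400)  # longer pivot, as before the bump
fast_pivot = stranded_masters(posts, pivot_seconds=200)  # shorter pivot, as after the bump

print("slow pivot:", slow_pivot)
print("fast pivot:", fast_pivot)
```

With these made-up numbers the longer pivot leaves no master stranded, while the shorter pivot strands master-2, which matches the behaviour described in this comment.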

Comment 3 Derek Higgins 2021-09-03 10:14:17 UTC

Moving over to installer where somebody familiar with installer can investigate.

Comment 4 Dmitry Tantsur 2021-10-15 16:06:00 UTC

Removing Triaged to mark for re-triaging by the team.

Comment 5 Arda Guclu 2021-10-18 06:58:11 UTC

The urgent bug https://bugzilla.redhat.com/show_bug.cgi?id=1998643 and this bug were opened at nearly the same time (after the CoreOS bump). Although the root causes might be different (both suffered from an unhealthy bootstrap apiserver), that bug has since been fixed.

Could you please retry to confirm whether this bug is still relevant?

Comment 6 Lubov 2021-10-18 07:41:31 UTC

@derekh do we still have the slow machine in our setup?

Comment 7 Derek Higgins 2021-10-18 07:54:46 UTC

(In reply to Lubov from comment #6)
> @derekh do we still have the slow machine in our setup?

We don't, but if I remember correctly, at the time we did come to the conclusion that this
problem happened as a result of the RHCOS version bump. So I'm happy to close this as a DUP.

Alternatively, if you want to try to reproduce it in virt, you could pause a master for 5+ minutes
after it gets provisioned and rebooted, while it is still in POST. Then unpause it and see if everything
comes up.
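
A minimal sketch of that pause and unpause step, assuming the virt masters are libvirt domains managed with `virsh`. The domain name `master-2` and the 330 second default are hypothetical, and the script has to be started by hand at the moment the master is in POST.

```python
import subprocess
import time


def pause_vm(domain, seconds=330, run=subprocess.run, sleep=time.sleep):
    """Suspend a libvirt domain, wait, then resume it.

    domain:  libvirt domain name (hypothetical, depends on the setup)
    seconds: how long to keep the VM paused; 330 s is just "5+ minutes"
    run, sleep: injectable so the function can be exercised without libvirt
    """
    run(["virsh", "suspend", domain], check=True)
    try:
        sleep(seconds)
    finally:
        # Always resume, even if the wait is interrupted.
        run(["virsh", "resume", domain], check=True)


if __name__ == "__main__":
    # Run this once the master has been provisioned and has rebooted into POST.
    pause_vm("master-2")
```

After the domain resumes, watch whether the installer still finishes bootstrapping.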

Comment 8 Lubov 2021-10-18 10:26:11 UTC

(In reply to Derek Higgins from comment #7)
> (In reply to Lubov from comment #6)
> > @derekh do we still have the slow machine in our setup?
> 
> We don't but if I remember correctly at the time we did come to the
> conclusion that this problem happened as a result of the RHCOS version
> bump. So I'm happy to close this as a DUP.

I'm happy with you closing the bz :)

Comment 9 Arda Guclu 2021-10-18 10:42:52 UTC

I'm closing this bug as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1998643. Since the parent bug is fixed, this bug should also be fixed.

*** This bug has been marked as a duplicate of bug 1998643 ***