Bug 1887744 - waitForInitializedCluster timeout reached very often when installing on RHOS PSI
Summary: waitForInitializedCluster timeout reached very often when installing on RHOS PSI
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
unspecified
low
Target Milestone: ---
Target Release: 4.6.0
Assignee: aos-install
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On: 1785399 1881868 1913411
Blocks:
 
Reported: 2020-10-13 09:16 UTC by Inbar Rose
Modified: 2021-05-05 12:25 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-15 18:11:22 UTC
Target Upstream Version:
Embargoed:



Description Inbar Rose 2020-10-13 09:16:08 UTC
This is an issue that the CNV (OpenShift Virtualization) team is facing very often.

We are reaching timeouts when installing OpenShift on our PSI RHOS-IPI environment.

The function waitForInitializedCluster defines a timeout which is not always long enough for us.

We suspect it happens when there is a lot of load on the PSI environment, so it manifests as a flaky install issue that occasionally fails our deployments.

We would like to be able to extend the timeout. Ideally we could specify a custom value, or there could be some mechanism to increase the timeout when the installer detects it is running on OpenStack.

This is a minor nuisance, but it disrupts our testing process to the point where it sometimes causes delays of up to 48 hours in verification.

Comment 2 Pierre Prinetti 2020-10-14 14:17:33 UTC
Unfortunately, the timeout is installer-wide and can't be changed or tuned for a specific installation process. Please refer to the Installer team for guidance.

We've been advised to introduce automation by wrapping the call to the installer in a retry script that would call `$installer wait-for install-complete` N times after the installation fails, where N is a reasonable number for your use case (each retry represents up to 40 more minutes of waiting).
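
For illustration only, a minimal sketch of such a wrapper could look like the following (the install directory, retry count, and script structure are assumptions, not an official tool):

#!/usr/bin/env bash
# Sketch of a retry wrapper: if the initial create times out, keep re-running
# `wait-for install-complete`, which waits up to ~40 more minutes per invocation.
set -u
INSTALL_DIR=./install-dir   # assumed path to the installation assets
RETRIES=3                   # N: number of additional waits

if ! openshift-install create cluster --dir "$INSTALL_DIR"; then
  for i in $(seq 1 "$RETRIES"); do
    echo "Retry $i/$RETRIES: waiting again for install-complete..."
    if openshift-install wait-for install-complete --dir "$INSTALL_DIR"; then
      exit 0
    fi
  done
  exit 1
fi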

Potentially linked: there is a current issue affecting PSI that is making it hit precisely these timeouts: https://bugzilla.redhat.com/show_bug.cgi?id=1881868

Comment 3 Inbar Rose 2020-10-15 05:18:25 UTC
(In reply to Pierre Prinetti from comment #2)
> Unfortunately, the timeout is installer-wide and can't be changed or tuned
> for a specific installation process. Please refer to the Installer team for
> guidance.
> 
> We've been advised to introduce automation by wrapping the call to the
> installer in a retry script that would call `$installer wait-for
> install-complete` N times after the installation fails, where N is a
> reasonable number for your use case (each retry represents up to 40 more
> minutes of waiting).
> 
> Potentially linked: there is a current issue affecting PSI that is making
> it hit precisely these timeouts:
> https://bugzilla.redhat.com/show_bug.cgi?id=1881868

I see there is a workaround for baremetal that increases the timeout.

https://github.com/openshift/installer/blob/release-4.6/cmd/openshift-install/create.go#L354

// Wait longer for baremetal, due to length of time it takes to boot
if assetStore, err := assetstore.NewStore(rootOpts.dir); err == nil {
	if installConfig, err := assetStore.Load(&installconfig.InstallConfig{}); err == nil && installConfig != nil {
		if installConfig.(*installconfig.InstallConfig).Config.Platform.Name() == baremetal.Name {
			timeout = 60 * time.Minute
		}
	}
}

If you don't want to add *additional* special cases, I can understand how that could become messy. But I think it is very reasonable that the timeout value should be adjustable by some method, for instance an environment variable. It at least warrants consideration.
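
For illustration, usage of such an override could look like this (the variable name is purely hypothetical; no such knob exists in the installer today):

# Hypothetical environment variable -- not an existing installer feature
OPENSHIFT_INSTALL_TIMEOUT_MINUTES=60 openshift-install create cluster --dir ./install-dir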

Comment 4 Dan Kenigsberg 2020-10-15 05:41:15 UTC
Inbar, the installer team's case against making the timeout a parameter is quite interesting and strong. They say that if your infrastructure is so weak that you cannot deploy OCP on time, you should not really use it in production. Or at the very least, you should know that something is wrong, fix it, and intentionally retry the installation.

Comment 5 Inbar Rose 2020-10-15 05:46:18 UTC
(In reply to Dan Kenigsberg from comment #4)
> Inbar, the installer team's case against making the timeout a parameter is
> quite interesting and strong. They say that if your infrastructure is so
> weak that you cannot deploy OCP on time, you should not really use it in
> production. Or at the very least, you should know that something is wrong,
> fix it, and intentionally retry the installation.

We have no control over this infrastructure, it is the PSI RHOS-IPI. All we can do is trigger the installation.

Comment 6 Dan Kenigsberg 2020-10-15 06:03:48 UTC
(In reply to Inbar Rose from comment #5)
> 
> We have no control over this infrastructure, it is the PSI RHOS-IPI. All we
> can do is trigger the installation.

This does not weaken their argument. You now know that there is a regression in PSI RHOS-IPI. You can wait for the bug to be solved, or consciously choose to ignore it. What you cannot do is file a generic customer case saying "I don't know why my cluster is broken" because you have a clue. It's a "shift-left" attitude.

Comment 7 Inbar Rose 2020-10-15 06:17:09 UTC
(In reply to Dan Kenigsberg from comment #6)
> (In reply to Inbar Rose from comment #5)
> > 
> > We have no control over this infrastructure, it is the PSI RHOS-IPI. All we
> > can do is trigger the installation.
> 
> This does not weaken their argument. You now know that there is a regression
> in PSI RHOS-IPI. You can wait for the bug to be solved, or consciously
> choose to ignore it. What you cannot do is file a generic customer case
> saying "I don't know why my cluster is broken" because you have a clue. It's
> a "shift-left" attitude.

We know that increasing the timeout will work because we used to have a patch that increased the timeout.

The default timeout used to be 30 minutes and our patch raised it to 60. The default has since been changed to 40 minutes and our patch no longer works.
Instead of making a new patch which might again stop working, I am trying to see whether the problem can be solved more elegantly.
For instance, if there is already special consideration for baremetal (increasing the timeout to 60 minutes), why not for RHOS-IPI as well?
Or, better yet, allow the timeout to be set when running the installer, for instance openshift-install --timeout=60.

Wrapping the install process in retries is not a good solution. If the issue is that the timeout is too short, then it will simply take longer (until retries are exhausted) for the deployment to fail. And if the problem is flaky infrastructure, then it at least solves the instability of our deployments.

A timeout should not be a limiting factor; it should exist to catch stuck processes. In this case we reach 85% completion and are only a few minutes away from a successful deployment. It is a real shame that we cannot just extend the timeout a little bit so that we can prevent this bottleneck in our pipeline.

Comment 8 Pierre Prinetti 2020-10-15 08:35:48 UTC
(In reply to Inbar Rose from comment #3)

> But I think it is very reasonable that the timeout value should be
> adjustable by some method, for instance an environment variable. It at
> least warrants consideration.

We (the OpenShift-on-OpenStack team) are not the decision makers in this particular instance. As much as I personally understand your arguments, I can't do anything about them.

If you want to raise awareness around the issue (or get further guidance), please refer to the Installer team.

Comment 9 Inbar Rose 2020-10-15 11:22:18 UTC
I have re-assigned this to the openshift-installer sub-component 
and re-opened this ticket.

Comment 10 Scott Dodson 2020-10-15 18:11:22 UTC
Right, if your cluster cannot complete the installation in the allotted time, then it's not likely that the cluster will deliver the necessary level of service, and that needs to be acknowledged. This doesn't prevent the installation from completing; the timeout only limits how long the installer waits for completion. You can wait additional time by running `openshift-install wait-for install-complete`.
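
For example, to resume waiting against the same installation assets (the directory path is assumed):

# Re-attach to the existing installation and wait again for completion
openshift-install wait-for install-complete --dir ./install-dir --log-level debug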

