Bug 2039965 - [IBMCLOUD] Poor network performance (mostly detected in NA-based regions) causes "wait-for install-complete" to fail, but the installation succeeds on its own after some minutes
Keywords:
Status: CLOSED DUPLICATE of bug 2045916
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: aos-install
QA Contact: Pedro Amoedo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-12 19:35 UTC by Pedro Amoedo
Modified: 2022-01-28 10:23 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-28 10:23:27 UTC
Target Upstream Version:
Embargoed:



Description Pedro Amoedo 2022-01-12 19:35:31 UTC
Version:

4.10.0-fc.0

Platform:

IPI IBMCloud

What happened?

- Installation fails due to poor network performance; the Ingress operator is degraded and unable to progress in time:

~~~
01-12 13:34:27.914  level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
01-12 13:34:27.915  level=info msg=Cluster operator insights Disabled is False with AsExpected: 
01-12 13:34:27.915  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
01-12 13:34:27.915  level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
01-12 13:34:27.915  level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
01-12 13:34:27.915  level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
01-12 13:34:27.915  level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
01-12 13:34:27.915  level=fatal msg=failed to initialize the cluster: Some cluster operators are still updating: authentication, console
~~~
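
When the installer exits like this, progress can be resumed/monitored with the subcommand mentioned in the log output above (shown as a sketch; the asset directory path `./install-dir` is an assumption, use whatever directory holds your installation assets):

~~~
$ openshift-install wait-for install-complete --dir ./install-dir --log-level debug
~~~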

However, after a few minutes, the cluster finishes the installation on its own:

~~~
$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-fc.0   True        False         False      6h28m   
baremetal                                  4.10.0-fc.0   True        False         False      7h38m   
cloud-controller-manager                   4.10.0-fc.0   True        False         False      7h46m   
cloud-credential                           4.10.0-fc.0   True        False         False      7h38m   
cluster-autoscaler                         4.10.0-fc.0   True        False         False      7h38m   
config-operator                            4.10.0-fc.0   True        False         False      7h39m   
console                                    4.10.0-fc.0   True        False         False      6h27m   
csi-snapshot-controller                    4.10.0-fc.0   True        False         False      7h39m   
dns                                        4.10.0-fc.0   True        False         False      7h38m   
etcd                                       4.10.0-fc.0   True        False         False      7h37m   
image-registry                             4.10.0-fc.0   True        False         False      7h28m   
ingress                                    4.10.0-fc.0   True        False         False      7h27m   
insights                                   4.10.0-fc.0   True        False         False      7h27m   
kube-apiserver                             4.10.0-fc.0   True        False         False      7h29m   
kube-controller-manager                    4.10.0-fc.0   True        False         False      7h36m   
kube-scheduler                             4.10.0-fc.0   True        False         False      7h36m   
kube-storage-version-migrator              4.10.0-fc.0   True        False         False      7h39m   
machine-api                                4.10.0-fc.0   True        False         False      7h35m   
machine-approver                           4.10.0-fc.0   True        False         False      7h39m   
machine-config                             4.10.0-fc.0   True        False         False      7h37m   
marketplace                                4.10.0-fc.0   True        False         False      7h38m   
monitoring                                 4.10.0-fc.0   True        False         False      7h26m   
network                                    4.10.0-fc.0   True        False         False      7h40m   
node-tuning                                4.10.0-fc.0   True        False         False      7h38m   
openshift-apiserver                        4.10.0-fc.0   True        False         False      7h29m   
openshift-controller-manager               4.10.0-fc.0   True        False         False      7h37m   
openshift-samples                          4.10.0-fc.0   True        False         False      7h26m   
operator-lifecycle-manager                 4.10.0-fc.0   True        False         False      7h39m   
operator-lifecycle-manager-catalog         4.10.0-fc.0   True        False         False      7h39m   
operator-lifecycle-manager-packageserver   4.10.0-fc.0   True        False         False      7h31m   
service-ca                                 4.10.0-fc.0   True        False         False      7h39m   
storage                                    4.10.0-fc.0   True        False         False      7h25m   

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-8e04ca308a07babafc3c40ef9f5c59d3   True      False      False      3              3                   3                     0                      7h44m
worker   rendered-worker-8bc271c2c622ad62fd00fb5db6169a47   True      False      False      3              3                   3                     0                      7h44m

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.0   True        False         6h30m   Cluster version is 4.10.0-fc.0
~~~

What did you expect to happen?

- Successful installation

How to reproduce it (as minimally and precisely as possible)?

- Deploy IPI IBMCloud in NA-based supported regions (ca-tor, us-east, us-south)

Anything else we need to know?

- Could this be partially related to BZ#2037276?

Comment 3 Christopher J Schaefer 2022-01-12 19:57:41 UTC
This same kind of issue, where network connectivity problems cause deployment delays and require a follow-up "wait-for install-complete", is not limited to NA regions; it has also been seen in EU regions.

IBM Cloud is investigating the issue and hopefully improving stability/reliability on related resources to help prevent this issue in the future.

Comment 4 Pedro Amoedo 2022-01-13 16:01:34 UTC
Thanks Christopher. I'm updating the summary to better reflect the situation; in my case I've only seen this behavior in NA-based regions, perhaps because those locations are more saturated.

Comment 5 Jeff Nowicki 2022-01-25 20:42:26 UTC
After switching to a different instance type (specifically bx2-4x16), we observed a high installation success rate in local testing and in CI. Previously the bx2d-4x16 instance type was used; it was unreliable/problematic because it provisions storage with limited availability.

We are working to ensure that bx2-4x16 is the default instance type.

Pedro - could you try your test again, ensuring bx2-4x16 is the instance type for the bootstrap, master, and worker nodes? I suspect you will see the described issue far less regularly going forward (if at all).
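
For reference, the instance type can be overridden per machine pool in install-config.yaml. A minimal sketch (field layout assumed from the installer's IBM Cloud machine-pool schema; see openshift/installer#5578 for the authoritative default):

~~~
compute:
- name: worker
  platform:
    ibmcloud:
      type: bx2-4x16
  replicas: 3
controlPlane:
  name: master
  platform:
    ibmcloud:
      type: bx2-4x16
  replicas: 3
~~~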

Comment 6 Pedro Amoedo 2022-01-26 15:19:04 UTC
Sure Jeff, I'll run some tests with that profile in the US-based regions, which show the problem at a higher ratio than the others. I'll keep you posted.

Best Regards.

Comment 7 Pedro Amoedo 2022-01-28 10:23:27 UTC
Hi Jeff, after switching back to the default "OpenShiftSDN" network type as discussed, and overriding the instance type to "bx2-4x16" in line with openshift/installer#5578 [1], the test results have improved significantly.

I have run 6 different installations in NA-based supported regions ("us-south", "us-east" & "ca-tor"). 4 of the 6 tests ran flawlessly from beginning to end; the other 2 failed with minor issues, not during the installation itself but during our post_action scripts (I used nightly builds, so this is not remarkable).

In summary, the new profile in combination with the default SDN network yields a better success ratio, even in the NA-based regions that showed the worst results initially. Thanks.

[1] - https://github.com/openshift/installer/pull/5578

NOTE: the PR is linked/tracked via BZ#2045916, so I'm closing this bug as a duplicate.

Best Regards.

*** This bug has been marked as a duplicate of bug 2045916 ***

