Version: 4.10.0-fc.0
Platform: IPI IBMCloud

What happened?
- Installation fails due to poor network performance; Ingress is degraded and not able to progress in time:

~~~
01-12 13:34:27.914 level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
01-12 13:34:27.915 level=info msg=Cluster operator insights Disabled is False with AsExpected:
01-12 13:34:27.915 level=info msg=Cluster operator network ManagementStateDegraded is False with :
01-12 13:34:27.915 level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
01-12 13:34:27.915 level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
01-12 13:34:27.915 level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
01-12 13:34:27.915 level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
01-12 13:34:27.915 level=fatal msg=failed to initialize the cluster: Some cluster operators are still updating: authentication, console
~~~

However, after a few minutes, the cluster finishes the installation on its own:

~~~
$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-fc.0   True        False         False      6h28m
baremetal                                  4.10.0-fc.0   True        False         False      7h38m
cloud-controller-manager                   4.10.0-fc.0   True        False         False      7h46m
cloud-credential                           4.10.0-fc.0   True        False         False      7h38m
cluster-autoscaler                         4.10.0-fc.0   True        False         False      7h38m
config-operator                            4.10.0-fc.0   True        False         False      7h39m
console                                    4.10.0-fc.0   True        False         False      6h27m
csi-snapshot-controller                    4.10.0-fc.0   True        False         False      7h39m
dns                                        4.10.0-fc.0   True        False         False      7h38m
etcd                                       4.10.0-fc.0   True        False         False      7h37m
image-registry                             4.10.0-fc.0   True        False         False      7h28m
ingress                                    4.10.0-fc.0   True        False         False      7h27m
insights                                   4.10.0-fc.0   True        False         False      7h27m
kube-apiserver                             4.10.0-fc.0   True        False         False      7h29m
kube-controller-manager                    4.10.0-fc.0   True        False         False      7h36m
kube-scheduler                             4.10.0-fc.0   True        False         False      7h36m
kube-storage-version-migrator              4.10.0-fc.0   True        False         False      7h39m
machine-api                                4.10.0-fc.0   True        False         False      7h35m
machine-approver                           4.10.0-fc.0   True        False         False      7h39m
machine-config                             4.10.0-fc.0   True        False         False      7h37m
marketplace                                4.10.0-fc.0   True        False         False      7h38m
monitoring                                 4.10.0-fc.0   True        False         False      7h26m
network                                    4.10.0-fc.0   True        False         False      7h40m
node-tuning                                4.10.0-fc.0   True        False         False      7h38m
openshift-apiserver                        4.10.0-fc.0   True        False         False      7h29m
openshift-controller-manager               4.10.0-fc.0   True        False         False      7h37m
openshift-samples                          4.10.0-fc.0   True        False         False      7h26m
operator-lifecycle-manager                 4.10.0-fc.0   True        False         False      7h39m
operator-lifecycle-manager-catalog         4.10.0-fc.0   True        False         False      7h39m
operator-lifecycle-manager-packageserver   4.10.0-fc.0   True        False         False      7h31m
service-ca                                 4.10.0-fc.0   True        False         False      7h39m
storage                                    4.10.0-fc.0   True        False         False      7h25m

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-8e04ca308a07babafc3c40ef9f5c59d3   True      False      False      3              3                   3                     0                      7h44m
worker   rendered-worker-8bc271c2c622ad62fd00fb5db6169a47   True      False      False      3              3                   3                     0                      7h44m

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.0   True        False         6h30m   Cluster version is 4.10.0-fc.0
~~~

What did you expect to happen?
- Successful installation

How to reproduce it (as minimally and precisely as possible)?
- Deploy IPI IBMCloud on US-based supported regions (ca-tor, us-east, us-south)

Anything else we need to know?
- Could be partially related with BZ#2037276?
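For reference, the installer log above points at the 'wait-for install-complete' subcommand as the way to keep waiting for the cluster to settle. A minimal invocation against the same install directory would look roughly like the following (the directory path is only a placeholder):

~~~
$ openshift-install wait-for install-complete --dir ./my-cluster-dir --log-level=debug
~~~

This subcommand only waits for and reports the installation result; it does not re-run any provisioning steps.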
This same kind of issue, where network connectivity problems cause deployment delays and require a follow-up "wait-for install-complete", is not limited to NA regions; it has also been seen in EU regions. IBM Cloud is investigating the issue and will hopefully improve the stability/reliability of the related resources to help prevent it in the future.
Thanks Christopher, I'm updating the summary to better reflect the situation. In my case I've only seen that behavior in NA-based regions; maybe those locations are more saturated.
After switching to a different instance type (specifically bx2-4x16) we observed a high installation success rate in local testing, as well as in CI. Previously the bx2d-4x16 instance type was being used, which was unreliable/problematic because it provisions storage with limited availability. We are working to make bx2-4x16 the default instance type. Pedro - could you try your test again, ensuring bx2-4x16 is the instance type for the bootstrap, master and worker nodes? I suspect you will not see the described issue as regularly going forward (if not at all).
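As an illustration (not taken from this report), the bx2-4x16 profile could be pinned in install-config.yaml roughly as sketched below, assuming the standard IPI IBM Cloud machine-pool "type" field and that the bootstrap node follows the control-plane profile; only the relevant excerpt is shown and the region/replica values are placeholders:

~~~
controlPlane:
  name: master
  replicas: 3
  platform:
    ibmcloud:
      type: bx2-4x16    # VSI profile for control-plane (and, by assumption, bootstrap) nodes
compute:
- name: worker
  replicas: 3
  platform:
    ibmcloud:
      type: bx2-4x16    # VSI profile for worker nodes
platform:
  ibmcloud:
    region: us-south    # placeholder region
~~~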
Sure Jeff, I'll run some tests with that profile on the US-based regions, which show the problem at a higher rate than the others. I'll keep you posted. Best Regards.
Hi Jeff, after switching back to the default "OpenShiftSDN" network type as discussed, and overriding the instance type to "bx2-4x16" in line with openshift/installer#5578 [1], the tests have improved significantly. I ran 6 installations across the US-based supported regions ("us-south", "us-east" & "ca-tor"): 4 of the 6 ran flawlessly from beginning to end, and the other 2 failed with minor issues that occurred not during the installation itself but in our post_action scripts (I used nightly builds, so this is not significant). In summary, the new profile in combination with the default SDN network yields a better success ratio, even in the US-based regions that initially showed the worst results, thanks.

[1] - https://github.com/openshift/installer/pull/5578

NOTE: the PR is linked/tracked via BZ#2045916, therefore I'm closing this one as a duplicate. Best Regards.

*** This bug has been marked as a duplicate of bug 2045916 ***
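For completeness, a minimal sketch of how the network type mentioned above is selected in install-config.yaml; the networkType field is the standard installer setting, and the CIDR values are the usual defaults, included here purely for illustration:

~~~
networking:
  networkType: OpenShiftSDN   # default SDN plugin referenced in the comment above
  clusterNetwork:
  - cidr: 10.128.0.0/14       # usual default cluster network CIDR (illustrative)
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16             # usual default service network CIDR (illustrative)
~~~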