- Installation fails due to poor network performance; the ingress cluster operator is degraded and unable to progress in time:
01-12 13:34:27.914 level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
01-12 13:34:27.915 level=info msg=Cluster operator insights Disabled is False with AsExpected:
01-12 13:34:27.915 level=info msg=Cluster operator network ManagementStateDegraded is False with :
01-12 13:34:27.915 level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
01-12 13:34:27.915 level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
01-12 13:34:27.915 level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
01-12 13:34:27.915 level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
01-12 13:34:27.915 level=fatal msg=failed to initialize the cluster: Some cluster operators are still updating: authentication, console
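The recovery path named in the log above can be sketched as follows; `<install-dir>` is a placeholder for the asset directory that was passed to the original `openshift-install create cluster` invocation:

```shell
# Resume waiting for a cluster whose bootstrap completed but whose
# operators had not finished settling before the installer timed out.
# This re-attaches to the existing cluster; it does not re-install.
openshift-install wait-for install-complete --dir <install-dir> --log-level debug
```

If the remaining operators (here authentication and console) eventually become Available, the subcommand exits successfully and prints the kubeadmin credentials again, which matches the behavior observed below.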
However, after a few minutes, the cluster finishes the installation on its own:
$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-fc.0   True        False         False      6h28m
baremetal                                  4.10.0-fc.0   True        False         False      7h38m
cloud-controller-manager                   4.10.0-fc.0   True        False         False      7h46m
cloud-credential                           4.10.0-fc.0   True        False         False      7h38m
cluster-autoscaler                         4.10.0-fc.0   True        False         False      7h38m
config-operator                            4.10.0-fc.0   True        False         False      7h39m
console                                    4.10.0-fc.0   True        False         False      6h27m
csi-snapshot-controller                    4.10.0-fc.0   True        False         False      7h39m
dns                                        4.10.0-fc.0   True        False         False      7h38m
etcd                                       4.10.0-fc.0   True        False         False      7h37m
image-registry                             4.10.0-fc.0   True        False         False      7h28m
ingress                                    4.10.0-fc.0   True        False         False      7h27m
insights                                   4.10.0-fc.0   True        False         False      7h27m
kube-apiserver                             4.10.0-fc.0   True        False         False      7h29m
kube-controller-manager                    4.10.0-fc.0   True        False         False      7h36m
kube-scheduler                             4.10.0-fc.0   True        False         False      7h36m
kube-storage-version-migrator              4.10.0-fc.0   True        False         False      7h39m
machine-api                                4.10.0-fc.0   True        False         False      7h35m
machine-approver                           4.10.0-fc.0   True        False         False      7h39m
machine-config                             4.10.0-fc.0   True        False         False      7h37m
marketplace                                4.10.0-fc.0   True        False         False      7h38m
monitoring                                 4.10.0-fc.0   True        False         False      7h26m
network                                    4.10.0-fc.0   True        False         False      7h40m
node-tuning                                4.10.0-fc.0   True        False         False      7h38m
openshift-apiserver                        4.10.0-fc.0   True        False         False      7h29m
openshift-controller-manager               4.10.0-fc.0   True        False         False      7h37m
openshift-samples                          4.10.0-fc.0   True        False         False      7h26m
operator-lifecycle-manager                 4.10.0-fc.0   True        False         False      7h39m
operator-lifecycle-manager-catalog         4.10.0-fc.0   True        False         False      7h39m
operator-lifecycle-manager-packageserver   4.10.0-fc.0   True        False         False      7h31m
service-ca                                 4.10.0-fc.0   True        False         False      7h39m
storage                                    4.10.0-fc.0   True        False         False      7h25m
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-8e04ca308a07babafc3c40ef9f5c59d3   True      False      False      3              3                   3                     0                      7h44m
worker   rendered-worker-8bc271c2c622ad62fd00fb5db6169a47   True      False      False      3              3                   3                     0                      7h44m
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.0   True        False         6h30m   Cluster version is 4.10.0-fc.0
What did you expect to happen?
- Successful installation
How to reproduce it (as minimally and precisely as possible)?
- Deploy IPI IBMCloud on US-based supported regions (ca-tor, us-east, us-south)
Anything else we need to know?
- Could this be partially related to BZ#2037276?
This same kind of issue, where network connectivity problems cause deployment delays and require a follow-up "wait-for install-complete", is not limited to NA regions; it has also been seen in EU regions.
IBM Cloud is investigating the issue and hopefully improving stability/reliability on related resources to help prevent this issue in the future.
Thanks Christopher, I'm updating the summary to better reflect the situation. In my case I've only seen that behavior in NA-based regions; maybe those locations are more saturated.
After switching to a different instance type (specifically bx2-4x16) we observed a high installation success rate in local testing, as well as in CI. Previously the bx2d-4x16 instance type was being used, which was unreliable/problematic because it provisions storage of limited availability.
We are working to ensure that bx2-4x16 is the default instance type.
Pedro - could you try your test again, ensuring bx2-4x16 is the instance type for bootstrap, master and worker nodes? I suspect you will not be seeing the described issue as regularly going forward (if not at all).
Sure Jeff, I'll run some tests with that profile in the US-based regions, which show the problem at a higher rate than others. I'll keep you posted.
Hi Jeff, after switching back to the default "OpenShiftSDN" network type as discussed, and overriding the instance type to "bx2-4x16" in line with openshift/installer#5578, the tests have improved significantly.
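For reference, a minimal sketch of the relevant install-config.yaml overrides. This is my reading of the IBM Cloud machine-pool schema (the `type` field for the instance profile), not a complete install-config; `mycluster` and `us-south` are placeholder values, and the remaining required fields are elided:

```yaml
apiVersion: v1
metadata:
  name: mycluster            # placeholder cluster name
networking:
  networkType: OpenShiftSDN  # back to the default SDN, per the discussion above
controlPlane:
  name: master
  platform:
    ibmcloud:
      type: bx2-4x16         # instance profile override (was bx2d-4x16)
compute:
- name: worker
  platform:
    ibmcloud:
      type: bx2-4x16
platform:
  ibmcloud:
    region: us-south         # placeholder; one of the regions tested below
```

The bootstrap node follows the control-plane machine pool, so setting the profile on controlPlane and compute should cover all three node roles mentioned above.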
I have run 6 installations across the US-based supported regions ("us-south", "us-east" & "ca-tor"). 4 of the 6 ran flawlessly from beginning to end; the other 2 hit minor issues, not during the installation itself but in our post_action scripts (I used nightly builds, so this is not remarkable).
In summary, the new profile in combination with the default SDN network yields a better success rate, even in the US-based regions that initially showed the worst results. Thanks.
 - https://github.com/openshift/installer/pull/5578
NOTE: the PR is linked/tracked via BZ#2045916, so I'm closing this one as a duplicate.
*** This bug has been marked as a duplicate of bug 2045916 ***