Version:

$ openshift-install version
openshift-install 4.10.0-0.nightly-2021-12-23-193744
built from commit 94a3ed9cbe4db66dc50dab8b85d2abf40fb56426
release image registry.ci.openshift.org/ocp/release@sha256:c23a323ae7d5f788cc5f24d7281d745be6e1282b9f269e699530987be2b67e21
release architecture amd64

Platform: alibabacloud

Please specify:
* IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?

Two worker machines are stuck in the Failed phase because connections to 'ecs-cn-hangzhou.aliyuncs.com' time out, even though the specified region is 'us-east-1'. Because 2 of the 3 worker nodes never become ready, the installation ultimately fails.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          54m     Unable to apply 4.10.0-0.nightly-2021-12-23-193744: the cluster operator monitoring has not yet successfully rolled out

$ oc get nodes
NAME                                     STATUS   ROLES    AGE   VERSION
jiwei-aa-km22m-master-0                  Ready    master   49m   v1.22.1+6859754
jiwei-aa-km22m-master-1                  Ready    master   49m   v1.22.1+6859754
jiwei-aa-km22m-master-2                  Ready    master   49m   v1.22.1+6859754
jiwei-aa-km22m-worker-us-east-1b-vhqsl   Ready    worker   38m   v1.22.1+6859754

$ oc get machines -n openshift-machine-api
NAME                                     PHASE     TYPE            REGION      ZONE         AGE
jiwei-aa-km22m-master-0                  Running   ecs.g6.xlarge   us-east-1   us-east-1b   53m
jiwei-aa-km22m-master-1                  Running   ecs.g6.xlarge   us-east-1   us-east-1a   53m
jiwei-aa-km22m-master-2                  Running   ecs.g6.xlarge   us-east-1   us-east-1b   53m
jiwei-aa-km22m-worker-us-east-1a-cklfp   Failed                                             45m
jiwei-aa-km22m-worker-us-east-1b-pg5k8   Failed                                             45m
jiwei-aa-km22m-worker-us-east-1b-vhqsl   Running   ecs.g6.large    us-east-1   us-east-1b   45m

$ oc describe machines jiwei-aa-km22m-worker-us-east-1a-cklfp -n openshift-machine-api | grep Message
        f:errorMessage:
  Message:        Instance has not been created
  Error Message:  failed to reconcile machine "jiwei-aa-km22m-worker-us-east-1a-cklfp": failed to create instance: error getting ImageID: error describing Images: Post "https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=LTAI5tGRJGBwVwBNbX2EWKkv&Action=DescribeImages&Format=JSON&ImageId=m-0xi5sh1y90lsbo8ayl0a&RegionId=us-east-1&ShowExpired=true&Signature=u49QmF3CR%2BKca3LHCv5JDfq5XlE%3D&SignatureMethod=HMAC-SHA1&SignatureNonce=e6b2d3fb930f2579a7ec982c008b07ed&SignatureType=&SignatureVersion=1.0&Timestamp=2021-12-24T04%3A50%3A12Z&Version=2014-05-26": dial tcp 106.11.172.2:443: i/o timeout
  Type     Reason   Age   From   Message

$ oc describe machines jiwei-aa-km22m-worker-us-east-1b-pg5k8 -n openshift-machine-api | grep Message
        f:errorMessage:
  Message:        Failed to check if machine exists: Post "https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=LTAI5tGRJGBwVwBNbX2EWKkv&Action=DescribeInstances&Format=JSON&RegionId=us-east-1&Signature=g5lC%2F4Fa8EqpRWOXV2z9z881QQY%3D&SignatureMethod=HMAC-SHA1&SignatureNonce=7bfdec53ecb8238281b19a51aa6a2629&SignatureType=&SignatureVersion=1.0&Tag.1.Key=kubernetes.io%2Fcluster%2Fjiwei-aa-km22m&Tag.1.value=owned&Tag.2.Key=Name&Tag.2.value=jiwei-aa-km22m-worker-us-east-1b-pg5k8&Timestamp=2021-12-24T04%3A50%3A17Z&Version=2014-05-26": dial tcp 106.11.172.2:443: i/o timeout
  Error Message:  failed to reconcile machine "jiwei-aa-km22m-worker-us-east-1b-pg5k8": failed to create instance: error getting ImageID: error describing Images: Post "https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=LTAI5tGRJGBwVwBNbX2EWKkv&Action=DescribeImages&Format=JSON&ImageId=m-0xi5sh1y90lsbo8ayl0a&RegionId=us-east-1&ShowExpired=true&Signature=MVpabQUqxi1%2B6ZvYxCYsQlZFLls%3D&SignatureMethod=HMAC-SHA1&SignatureNonce=576757e89a62237ea5ff26b66eba1526&SignatureType=&SignatureVersion=1.0&Timestamp=2021-12-24T04%3A51%3A04Z&Version=2014-05-26": dial tcp 106.11.172.2:443: i/o timeout
  Type     Reason   Age   From   Message
$
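The timeout appears to be specific to the central endpoint rather than a general loss of connectivity: the third worker machine, created at the same time, did reach Running. A small probe like the following (a hypothetical diagnostic, not part of the installer; run from a host inside the us-east-1 VPC) can show whether the central endpoint times out while the regional one stays reachable:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Endpoints taken from the error messages above: the one the provider
	// actually dialed, and the regional one it was expected to dial.
	for _, host := range []string{
		"ecs-cn-hangzhou.aliyuncs.com:443",
		"ecs-us-east-1.aliyuncs.com:443",
	} {
		conn, err := net.DialTimeout("tcp", host, 5*time.Second)
		if err != nil {
			// For the central endpoint this should mirror the
			// "dial tcp 106.11.172.2:443: i/o timeout" seen above.
			fmt.Printf("%s: %v\n", host, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: reachable\n", host)
	}
}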
"https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=LTAI5tGRJGBwVwBNbX2EWKkv&Action=DescribeImages&Format=JSON&ImageId=m-0xi5sh1y90lsbo8ayl0a&RegionId=us-east-1&ShowExpired=true&Signature=MVpabQUqxi1%2B6ZvYxCYsQlZFLls%3D&SignatureMethod=HMAC-SHA1&SignatureNonce=576757e89a62237ea5ff26b66eba1526&SignatureType=&SignatureVersion=1.0&Timestamp=2021-12-24T04%3A51%3A04Z&Version=2014-05-26": dial tcp 106.11.172.2:443: i/o timeout Type Reason Age From Message $ What did you expect to happen? It should connect to https://ecs-us-east-1.aliyuncs.com/ instead, and all machines should turn into Running. How to reproduce it (as minimally and precisely as possible)? We met the issue twice this week. Anything else we need to know? $ openshift-install create cluster --dir work --log-level info INFO Consuming Worker Machines from target directory INFO Consuming OpenShift Install (Manifests) from target directory INFO Consuming Master Machines from target directory INFO Consuming Common Manifests from target directory INFO Consuming Openshift Manifests from target directory INFO Creating infrastructure resources... INFO Waiting up to 20m0s (until 4:54AM) for the Kubernetes API at https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443... INFO API v1.22.1+6859754 up INFO Waiting up to 30m0s (until 5:06AM) for bootstrapping to complete... INFO Destroying the bootstrap resources... INFO Waiting up to 40m0s (until 5:29AM) for the cluster at https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443 to initialize... E1224 04:49:22.374800 283572 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected INFO Cluster operator cloud-controller-manager CloudControllerOwner is True with AsExpected: Cluster Cloud Controller Manager Operator owns cloud controllers at 4.10.0-0.nightly-2021-12-23-193744 INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: INFO Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed ERROR Cluster operator image-registry Degraded is True with ProgressDeadlineExceeded: Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-767d89c8df" has timed out progressing. 
How to reproduce it (as minimally and precisely as possible)?

No reliable reproducer; we have hit this issue twice this week.

Anything else we need to know?

$ openshift-install create cluster --dir work --log-level info
INFO Consuming Worker Machines from target directory
INFO Consuming OpenShift Install (Manifests) from target directory
INFO Consuming Master Machines from target directory
INFO Consuming Common Manifests from target directory
INFO Consuming Openshift Manifests from target directory
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 4:54AM) for the Kubernetes API at https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443...
INFO API v1.22.1+6859754 up
INFO Waiting up to 30m0s (until 5:06AM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 5:29AM) for the cluster at https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443 to initialize...
E1224 04:49:22.374800  283572 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudControllerOwner is True with AsExpected: Cluster Cloud Controller Manager Operator owns cloud controllers at 4.10.0-0.nightly-2021-12-23-193744
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted:
INFO Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed
ERROR Cluster operator image-registry Degraded is True with ProgressDeadlineExceeded: Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-767d89c8df" has timed out progressing.
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6ffddc694b-7nwjt" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.)
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is True with MultipleTasksFailed: Failed to rollout the stack. Error: updating prometheus-adapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: got 1 unavailable replicas
ERROR updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 1 updated replicas
ERROR updating thanos querier: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: got 1 unavailable replicas
ERROR updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas
INFO Cluster operator monitoring Available is False with MultipleTasksFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Cluster operator monitoring is not available
$

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2021-12-23-193744   True        False         False      31m
baremetal                                  4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
cloud-controller-manager                                                        True        False         False      50m
cloud-credential                           4.10.0-0.nightly-2021-12-23-193744   True        False         False      46m
cluster-autoscaler                         4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
config-operator                            4.10.0-0.nightly-2021-12-23-193744   True        False         False      48m
console                                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      29m
csi-snapshot-controller                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
dns                                        4.10.0-0.nightly-2021-12-23-193744   True        False         False      46m
etcd                                       4.10.0-0.nightly-2021-12-23-193744   True        False         False      46m
image-registry                             4.10.0-0.nightly-2021-12-23-193744   True        True          True       38m     Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-767d89c8df" has timed out progressing.
ingress                                    4.10.0-0.nightly-2021-12-23-193744   True        False         True       37m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6ffddc694b-7nwjt" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.)
insights                                   4.10.0-0.nightly-2021-12-23-193744   True        False         False      41m
kube-apiserver                             4.10.0-0.nightly-2021-12-23-193744   True        False         False      43m
kube-controller-manager                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      45m
kube-scheduler                             4.10.0-0.nightly-2021-12-23-193744   True        False         False      45m
kube-storage-version-migrator              4.10.0-0.nightly-2021-12-23-193744   True        False         False      48m
machine-api                                4.10.0-0.nightly-2021-12-23-193744   True        False         False      43m
machine-approver                           4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
machine-config                             4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
marketplace                                4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
monitoring                                                                      False       True          True       27m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
node-tuning                                4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
openshift-apiserver                        4.10.0-0.nightly-2021-12-23-193744   True        False         False      42m
openshift-controller-manager               4.10.0-0.nightly-2021-12-23-193744   True        False         False      45m
openshift-samples                          4.10.0-0.nightly-2021-12-23-193744   True        False         False      42m
operator-lifecycle-manager                 4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2021-12-23-193744   True        False         False      43m
service-ca                                 4.10.0-0.nightly-2021-12-23-193744   True        False         False      48m
storage                                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
$
Tested together with bug 2035757 (see https://bugzilla.redhat.com/show_bug.cgi?id=2035757#c17); the issue no longer occurs. Marking as VERIFIED for now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056