Bug 2035453 - [IPI on Alibabacloud] 2 worker machines stuck in Failed phase due to connection to 'ecs-cn-hangzhou.aliyuncs.com' timeout, although the specified region is 'us-east-1'
Summary: [IPI on Alibabacloud] 2 worker machines stuck in Failed phase due to connecti...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: aos-install
QA Contact: Jianli Wei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-24 05:40 UTC by Jianli Wei
Modified: 2022-03-12 04:40 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:40:05 UTC
Target Upstream Version:
Embargoed:


Links
Github openshift cluster-api-provider-alibaba pull 24 (open): Bug 2035453: [Alibaba] fix api endpoint (last updated 2022-01-19 09:31:07 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:40:21 UTC)

Description Jianli Wei 2021-12-24 05:40:44 UTC
Version:
$ openshift-install version
openshift-install 4.10.0-0.nightly-2021-12-23-193744
built from commit 94a3ed9cbe4db66dc50dab8b85d2abf40fb56426
release image registry.ci.openshift.org/ocp/release@sha256:c23a323ae7d5f788cc5f24d7281d745be6e1282b9f269e699530987be2b67e21
release architecture amd64

Platform: alibabacloud

Please specify:
* IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?
2 worker machines are stuck in the Failed phase because connections to 'ecs-cn-hangzhou.aliyuncs.com' time out, even though the specified region is 'us-east-1'. With 2 of 3 worker nodes not ready, the installation eventually failed.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          54m     Unable to apply 4.10.0-0.nightly-2021-12-23-193744: the cluster operator monitoring has not yet successfully rolled out
$ oc get nodes
NAME                                     STATUS   ROLES    AGE   VERSION
jiwei-aa-km22m-master-0                  Ready    master   49m   v1.22.1+6859754
jiwei-aa-km22m-master-1                  Ready    master   49m   v1.22.1+6859754
jiwei-aa-km22m-master-2                  Ready    master   49m   v1.22.1+6859754
jiwei-aa-km22m-worker-us-east-1b-vhqsl   Ready    worker   38m   v1.22.1+6859754
$ oc get machines -n openshift-machine-api
NAME                                     PHASE     TYPE            REGION      ZONE         AGE
jiwei-aa-km22m-master-0                  Running   ecs.g6.xlarge   us-east-1   us-east-1b   53m
jiwei-aa-km22m-master-1                  Running   ecs.g6.xlarge   us-east-1   us-east-1a   53m
jiwei-aa-km22m-master-2                  Running   ecs.g6.xlarge   us-east-1   us-east-1b   53m
jiwei-aa-km22m-worker-us-east-1a-cklfp   Failed                                             45m
jiwei-aa-km22m-worker-us-east-1b-pg5k8   Failed                                             45m
jiwei-aa-km22m-worker-us-east-1b-vhqsl   Running   ecs.g6.large    us-east-1   us-east-1b   45m
$ oc describe machines jiwei-aa-km22m-worker-us-east-1a-cklfp -n openshift-machine-api | grep Message
        f:errorMessage:
    Message:               Instance has not been created
  Error Message:           failed to reconcile machine "jiwei-aa-km22m-worker-us-east-1a-cklfp": failed to create instance: error getting ImageID: error describing Images: Post "https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=LTAI5tGRJGBwVwBNbX2EWKkv&Action=DescribeImages&Format=JSON&ImageId=m-0xi5sh1y90lsbo8ayl0a&RegionId=us-east-1&ShowExpired=true&Signature=u49QmF3CR%2BKca3LHCv5JDfq5XlE%3D&SignatureMethod=HMAC-SHA1&SignatureNonce=e6b2d3fb930f2579a7ec982c008b07ed&SignatureType=&SignatureVersion=1.0&Timestamp=2021-12-24T04%3A50%3A12Z&Version=2014-05-26": dial tcp 106.11.172.2:443: i/o timeout
  Type     Reason        Age   From                     Message
$ oc describe machines jiwei-aa-km22m-worker-us-east-1b-pg5k8 -n openshift-machine-api | grep Message
        f:errorMessage:
    Message:               Failed to check if machine exists: Post "https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=LTAI5tGRJGBwVwBNbX2EWKkv&Action=DescribeInstances&Format=JSON&RegionId=us-east-1&Signature=g5lC%2F4Fa8EqpRWOXV2z9z881QQY%3D&SignatureMethod=HMAC-SHA1&SignatureNonce=7bfdec53ecb8238281b19a51aa6a2629&SignatureType=&SignatureVersion=1.0&Tag.1.Key=kubernetes.io%2Fcluster%2Fjiwei-aa-km22m&Tag.1.value=owned&Tag.2.Key=Name&Tag.2.value=jiwei-aa-km22m-worker-us-east-1b-pg5k8&Timestamp=2021-12-24T04%3A50%3A17Z&Version=2014-05-26": dial tcp 106.11.172.2:443: i/o timeout
  Error Message:           failed to reconcile machine "jiwei-aa-km22m-worker-us-east-1b-pg5k8": failed to create instance: error getting ImageID: error describing Images: Post "https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=LTAI5tGRJGBwVwBNbX2EWKkv&Action=DescribeImages&Format=JSON&ImageId=m-0xi5sh1y90lsbo8ayl0a&RegionId=us-east-1&ShowExpired=true&Signature=MVpabQUqxi1%2B6ZvYxCYsQlZFLls%3D&SignatureMethod=HMAC-SHA1&SignatureNonce=576757e89a62237ea5ff26b66eba1526&SignatureType=&SignatureVersion=1.0&Timestamp=2021-12-24T04%3A51%3A04Z&Version=2014-05-26": dial tcp 106.11.172.2:443: i/o timeout
  Type     Reason        Age   From                     Message
$ 
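For reference, the mismatch is visible directly in the failing requests above: the query string carries RegionId=us-east-1 while the request host is the central cn-hangzhou ECS endpoint. The following standalone Go snippet (illustrative only, using a truncated copy of the URL; it is not part of the installer or the machine controller) makes the comparison explicit:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Truncated copy of the failing DescribeImages request from the machine
	// status above (credentials and signature parameters omitted).
	raw := "https://ecs-cn-hangzhou.aliyuncs.com/?Action=DescribeImages&Format=JSON&ImageId=m-0xi5sh1y90lsbo8ayl0a&RegionId=us-east-1"

	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}

	region := u.Query().Get("RegionId")
	expectedHost := fmt.Sprintf("ecs-%s.aliyuncs.com", region)

	fmt.Println("request host:            ", u.Host)       // ecs-cn-hangzhou.aliyuncs.com
	fmt.Println("expected host for region:", expectedHost) // ecs-us-east-1.aliyuncs.com
	fmt.Println("endpoint matches region: ", u.Host == expectedHost) // false
}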

What did you expect to happen?
The machine controller should connect to the regional endpoint https://ecs-us-east-1.aliyuncs.com/ instead, and all machines should reach the Running phase.
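In case it helps, below is a minimal sketch (Go, against github.com/aliyun/alibaba-cloud-sdk-go) of how the ECS endpoint could be pinned to the cluster's region before calling DescribeImages. This is only one possible approach and an assumption on my side, not the actual change in cluster-api-provider-alibaba pull 24; the credentials, the "Ecs" product code, and the regional hostname are placeholders taken from or inferred from this report.

package main

import (
	"fmt"
	"log"

	"github.com/aliyun/alibaba-cloud-sdk-go/sdk/endpoints"
	"github.com/aliyun/alibaba-cloud-sdk-go/services/ecs"
)

func main() {
	const region = "us-east-1"

	// Register a region-to-endpoint mapping for the ECS product so the SDK
	// does not fall back to the central endpoint. Product code "Ecs" and the
	// hostname are assumptions based on the expected endpoint above.
	if err := endpoints.AddEndpointMapping(region, "Ecs", "ecs-us-east-1.aliyuncs.com"); err != nil {
		log.Fatal(err)
	}

	// Placeholder credentials for illustration only.
	client, err := ecs.NewClientWithAccessKey(region, "ACCESS_KEY_ID", "ACCESS_KEY_SECRET")
	if err != nil {
		log.Fatal(err)
	}

	request := ecs.CreateDescribeImagesRequest()
	request.RegionId = region
	request.ImageId = "m-0xi5sh1y90lsbo8ayl0a" // image ID taken from the failing request above

	response, err := client.DescribeImages(request)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("images found:", response.TotalCount)
}

With a mapping like this in place, the same DescribeImages call from the error above would be sent to ecs-us-east-1.aliyuncs.com rather than the central endpoint.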

How to reproduce it (as minimally and precisely as possible)?
We have hit this issue twice this week.

Anything else we need to know?
$ openshift-install create cluster --dir work --log-level info
INFO Consuming Worker Machines from target directory
INFO Consuming OpenShift Install (Manifests) from target directory
INFO Consuming Master Machines from target directory
INFO Consuming Common Manifests from target directory
INFO Consuming Openshift Manifests from target directory
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 4:54AM) for the Kubernetes API at https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443...
INFO API v1.22.1+6859754 up
INFO Waiting up to 30m0s (until 5:06AM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 5:29AM) for the cluster at https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443 to initialize...
E1224 04:49:22.374800  283572 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.jiwei-aa.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudControllerOwner is True with AsExpected: Cluster Cloud Controller Manager Operator owns cloud controllers at 4.10.0-0.nightly-2021-12-23-193744
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted:
INFO Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed
ERROR Cluster operator image-registry Degraded is True with ProgressDeadlineExceeded: Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-767d89c8df" has timed out progressing.
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6ffddc694b-7nwjt" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.)
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is True with MultipleTasksFailed: Failed to rollout the stack. Error: updating prometheus-adapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: got 1 unavailable replicas
ERROR updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 1 updated replicas
ERROR updating thanos querier: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: got 1 unavailable replicas
ERROR updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas
INFO Cluster operator monitoring Available is False with MultipleTasksFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Cluster operator monitoring is not available
$ 
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2021-12-23-193744   True        False         False      31m
baremetal                                  4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
cloud-controller-manager                                                        True        False         False      50m
cloud-credential                           4.10.0-0.nightly-2021-12-23-193744   True        False         False      46m
cluster-autoscaler                         4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
config-operator                            4.10.0-0.nightly-2021-12-23-193744   True        False         False      48m
console                                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      29m
csi-snapshot-controller                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
dns                                        4.10.0-0.nightly-2021-12-23-193744   True        False         False      46m
etcd                                       4.10.0-0.nightly-2021-12-23-193744   True        False         False      46m
image-registry                             4.10.0-0.nightly-2021-12-23-193744   True        True          True       38m     Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-767d89c8df" has timed out progressing.
ingress                                    4.10.0-0.nightly-2021-12-23-193744   True        False         True       37m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6ffddc694b-7nwjt" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.)
insights                                   4.10.0-0.nightly-2021-12-23-193744   True        False         False      41m
kube-apiserver                             4.10.0-0.nightly-2021-12-23-193744   True        False         False      43m
kube-controller-manager                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      45m
kube-scheduler                             4.10.0-0.nightly-2021-12-23-193744   True        False         False      45m
kube-storage-version-migrator              4.10.0-0.nightly-2021-12-23-193744   True        False         False      48m
machine-api                                4.10.0-0.nightly-2021-12-23-193744   True        False         False      43m
machine-approver                           4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
machine-config                             4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
marketplace                                4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
monitoring                                                                      False       True          True       27m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
node-tuning                                4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
openshift-apiserver                        4.10.0-0.nightly-2021-12-23-193744   True        False         False      42m
openshift-controller-manager               4.10.0-0.nightly-2021-12-23-193744   True        False         False      45m
openshift-samples                          4.10.0-0.nightly-2021-12-23-193744   True        False         False      42m
operator-lifecycle-manager                 4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2021-12-23-193744   True        False         False      43m
service-ca                                 4.10.0-0.nightly-2021-12-23-193744   True        False         False      48m
storage                                    4.10.0-0.nightly-2021-12-23-193744   True        False         False      47m
$

Comment 6 Jianli Wei 2022-01-21 03:54:32 UTC
Tested along with bug 2035757 (see https://bugzilla.redhat.com/show_bug.cgi?id=2035757#c17); the issue no longer occurs. Marking as verified for now.

Comment 9 errata-xmlrpc 2022-03-12 04:40:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

