Bug 2041694
| Summary: | [IPI on Alibabacloud] installation fails when region does not support the cloud_essd disk category | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jianli Wei <jiwei> |
| Component: | Installer | Assignee: | aos-install |
| Installer sub component: | openshift-installer | QA Contact: | Jianli Wei <jiwei> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | brlu, gpei, husun, jialiu, mstaeble |
| Version: | 4.10 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.10.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Last Closed: | 2022-03-10 16:40:08 UTC | Type: | Bug |
The region "cn-qingdao (China (Qingdao))" has a similar issue.
$ yq e '.controlPlane' work/install-config.yaml
architecture: amd64
hyperthreading: Enabled
name: master
platform:
  alibabacloud:
    systemDiskCategory: cloud_efficiency
replicas: 3
$ yq e '.compute' work/install-config.yaml
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    alibabacloud:
      systemDiskCategory: cloud_efficiency
  replicas: 3
$ yq e '.platform' work/install-config.yaml
alibabacloud:
  region: cn-qingdao
  resourceGroupID: rg-aek2wky7lxk4f5y
$
$ openshift-install create cluster --dir work --log-level info
INFO Consuming Common Manifests from target directory
INFO Consuming Worker Machines from target directory
INFO Consuming OpenShift Install (Manifests) from target directory
INFO Consuming Master Machines from target directory
INFO Consuming Openshift Manifests from target directory
INFO Creating infrastructure resources...
ERROR
ERROR Error: [ERROR] terraform-provider-alicloud/alicloud/resource_alicloud_instance.go:452: Resource alicloud_instance RunInstances Failed!!! [SDK alibaba-cloud-sdk-go ERROR]:
ERROR SDK.ServerError
ERROR ErrorCode: InvalidResourceType.NotSupported
ERROR Recommend: https://error-center.aliyun.com/status/search?Keyword=InvalidResourceType.NotSupported&source=PopGw
ERROR RequestId: 374A4CC9-2370-5998-899D-7C54C39A9533
ERROR Message: user order resource type [[cloud_essd]] not exists in [cn-qingdao-b]
ERROR
ERROR on ../../../tmp/openshift-install-bootstrap-2812370522/main.tf line 133, in resource "alicloud_instance" "bootstrap":
ERROR 133: resource "alicloud_instance" "bootstrap" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
$
$ aliyun ecs DescribeAvailableResource --DestinationResource 'SystemDisk' --RegionId cn-qingdao --InstanceType 'ecs.g6.xlarge' --endpoint ecs.cn-qingdao.aliyuncs.com --output cols=ZoneId,AvailableResources.AvailableResource[].SupportedResources.SupportedResource[] rows=AvailableZones.AvailableZone[]
ZoneId | AvailableResources.AvailableResource[].SupportedResources.SupportedResource[]
------ | -----------------------------------------------------------------------------
cn-qingdao-b | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]
$ aliyun ecs DescribeAvailableResource --DestinationResource 'SystemDisk' --RegionId cn-qingdao --InstanceType 'ecs.g6.large' --endpoint ecs.cn-qingdao.aliyuncs.com --output cols=ZoneId,AvailableResources.AvailableResource[].SupportedResources.SupportedResource[] rows=AvailableZones.AvailableZone[]
ZoneId | AvailableResources.AvailableResource[].SupportedResources.SupportedResource[]
------ | -----------------------------------------------------------------------------
cn-qingdao-c | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_essd] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]
cn-qingdao-b | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]
$
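The per-zone listings above explain the failure: cn-qingdao-b offers only cloud_efficiency and cloud_ssd, so a request for cloud_essd in that zone is rejected. The same check can be sketched in Python. This is a minimal sketch, assuming the JSON response of DescribeAvailableResource nests the way the `cols`/`rows` paths in the command suggest (AvailableZones.AvailableZone[], then AvailableResources.AvailableResource[], then SupportedResources.SupportedResource[]). The function name `zones_supporting` and the `SAMPLE` data are hypothetical; the sample is built by hand from the ecs.g6.large table above, not captured from the API.

```python
from typing import Any, Dict, List


def zones_supporting(response: Dict[str, Any], category: str) -> List[str]:
    """Return zone IDs whose available system disk categories include `category`."""
    zones: List[str] = []
    for zone in response.get("AvailableZones", {}).get("AvailableZone", []):
        resources = zone.get("AvailableResources", {}).get("AvailableResource", [])
        for resource in resources:
            supported = resource.get("SupportedResources", {}).get("SupportedResource", [])
            # A category counts only if it is listed and marked Available.
            if any(
                item.get("Value") == category and item.get("Status") == "Available"
                for item in supported
            ):
                zones.append(zone["ZoneId"])
                break
    return zones


def _disk(value: str) -> Dict[str, Any]:
    """Build one SupportedResource entry shaped like the rows in the table above."""
    return {"Max": 500, "Min": 20, "Status": "Available", "Unit": "GiB", "Value": value}


def _zone(zone_id: str, categories: List[str]) -> Dict[str, Any]:
    """Build one AvailableZone entry (hypothetical shape, see the lead-in)."""
    return {
        "ZoneId": zone_id,
        "AvailableResources": {
            "AvailableResource": [
                {"SupportedResources": {"SupportedResource": [_disk(c) for c in categories]}}
            ]
        },
    }


# Hand-built sample mirroring the ecs.g6.large output for cn-qingdao.
SAMPLE = {
    "AvailableZones": {
        "AvailableZone": [
            _zone("cn-qingdao-c", ["cloud_essd", "cloud_efficiency", "cloud_ssd"]),
            _zone("cn-qingdao-b", ["cloud_efficiency", "cloud_ssd"]),
        ]
    }
}

if __name__ == "__main__":
    print(zones_supporting(SAMPLE, "cloud_essd"))
    print(zones_supporting(SAMPLE, "cloud_efficiency"))
```

Run against a real response, an empty result for cloud_essd in the target zone would indicate that an install relying on that category is likely to fail there.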
@husun The bootstrap VM is hard-coded to use cloud_essd. Is that intentional? I am setting this as a non-blocker for now, as it only affects regions that do not support cloud_essd.
The root cause has been found; sunhui is working on it, and a PR will be submitted soon.
I have fixed it in PR https://github.com/openshift/installer/pull/5564
$ openshift-install create install-config --dir work
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform alibabacloud
? Region ap-southeast-3
? Base Domain alicloud-qe.devcluster.openshift.com
? Cluster Name jiwei-408
? Pull Secret [? for help] ********
INFO Install-Config created in: work
$ vim work/install-config.yaml
$ yq e .platform work/install-config.yaml
alibabacloud:
  region: ap-southeast-3
  resourceGroupID: rg-aek2wky7lxk4f5y
  defaultMachinePlatform:
    instanceType: ecs.g6.xlarge
    systemDiskCategory: cloud_efficiency
    systemDiskSize: 200
$ yq e .metadata work/install-config.yaml
creationTimestamp: null
name: jiwei-408
$ yq e .credentialsMode work/install-config.yaml
Manual
$ openshift-install create manifests --dir work
INFO Consuming Install Config from target directory
INFO Manifests created in: work/manifests and work/openshift
$
$ openshift-install create cluster --dir work --log-level info
INFO Consuming Master Machines from target directory
INFO Consuming Worker Machines from target directory
INFO Consuming OpenShift Install (Manifests) from target directory
INFO Consuming Common Manifests from target directory
INFO Consuming Openshift Manifests from target directory
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 11:57AM) for the Kubernetes API at https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443...
INFO API v1.23.0+2135ac2 up
INFO Waiting up to 30m0s (until 12:11PM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 12:31PM) for the cluster at https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443 to initialize...
W0127 11:52:08.550078 430110 reflector.go:324] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost
I0127 11:52:08.550251 430110 trace.go:205] Trace[1248183454]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (27-Jan-2022 11:51:51.019) (total time: 17530ms):
Trace[1248183454]: ---"Objects listed" error:Get "https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost 17530ms (11:52:08.550)
Trace[1248183454]: [17.530476537s] [17.530476537s] END
E0127 11:52:08.550279 430110 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost
INFO Waiting up to 10m0s (until 12:12PM) for the openshift-console route to be created...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/fedora/work/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.jiwei-408.alicloud-qe.devcluster.openshift.com
INFO Login to the console with user: "kubeadmin", and password: "3iUbd-G5R5G-skw2e-9LxZ9"
INFO Time elapsed: 27m5s
$
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2022-01-26-234447 True False 2m17s Cluster version is 4.10.0-0.nightly-2022-01-26-234447
$ oc get nodes
NAME STATUS ROLES AGE VERSION
jiwei-408-hv4sp-master-0 Ready master 21m v1.23.0+2135ac2
jiwei-408-hv4sp-master-1 Ready master 19m v1.23.0+2135ac2
jiwei-408-hv4sp-master-2 Ready master 19m v1.23.0+2135ac2
jiwei-408-hv4sp-worker-ap-southeast-3a-rnmd7 Ready worker 8m49s v1.23.0+2135ac2
jiwei-408-hv4sp-worker-ap-southeast-3a-zhmkj Ready worker 8m45s v1.23.0+2135ac2
jiwei-408-hv4sp-worker-ap-southeast-3b-8j2ws Ready worker 10m v1.23.0+2135ac2
$
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.10.0-0.nightly-2022-01-26-234447 True False False 2m48s
baremetal 4.10.0-0.nightly-2022-01-26-234447 True False False 18m
cloud-controller-manager 4.10.0-0.nightly-2022-01-26-234447 True False False 21m
cloud-credential 4.10.0-0.nightly-2022-01-26-234447 True False False 17m
cluster-autoscaler 4.10.0-0.nightly-2022-01-26-234447 True False False 17m
config-operator 4.10.0-0.nightly-2022-01-26-234447 True False False 19m
console 4.10.0-0.nightly-2022-01-26-234447 True False False 4m41s
csi-snapshot-controller 4.10.0-0.nightly-2022-01-26-234447 True False False 18m
dns 4.10.0-0.nightly-2022-01-26-234447 True False False 17m
etcd 4.10.0-0.nightly-2022-01-26-234447 True False False 16m
image-registry 4.10.0-0.nightly-2022-01-26-234447 True False False 10m
ingress 4.10.0-0.nightly-2022-01-26-234447 True False False 9m32s
insights 4.10.0-0.nightly-2022-01-26-234447 True False False 12m
kube-apiserver 4.10.0-0.nightly-2022-01-26-234447 True False False 15m
kube-controller-manager 4.10.0-0.nightly-2022-01-26-234447 True False False 16m
kube-scheduler 4.10.0-0.nightly-2022-01-26-234447 True False False 15m
kube-storage-version-migrator 4.10.0-0.nightly-2022-01-26-234447 True False False 18m
machine-api 4.10.0-0.nightly-2022-01-26-234447 True False False 13m
machine-approver 4.10.0-0.nightly-2022-01-26-234447 True False False 17m
machine-config 4.10.0-0.nightly-2022-01-26-234447 True False False 16m
marketplace 4.10.0-0.nightly-2022-01-26-234447 True False False 17m
monitoring 4.10.0-0.nightly-2022-01-26-234447 True False False 7m8s
network 4.10.0-0.nightly-2022-01-26-234447 True False False 18m
node-tuning 4.10.0-0.nightly-2022-01-26-234447 True False False 8m9s
openshift-apiserver 4.10.0-0.nightly-2022-01-26-234447 True False False 12m
openshift-controller-manager 4.10.0-0.nightly-2022-01-26-234447 True False False 17m
openshift-samples 4.10.0-0.nightly-2022-01-26-234447 True False False 12m
operator-lifecycle-manager 4.10.0-0.nightly-2022-01-26-234447 True False False 18m
operator-lifecycle-manager-catalog 4.10.0-0.nightly-2022-01-26-234447 True False False 17m
operator-lifecycle-manager-packageserver 4.10.0-0.nightly-2022-01-26-234447 True False False 12m
service-ca 4.10.0-0.nightly-2022-01-26-234447 True False False 19m
storage 4.10.0-0.nightly-2022-01-26-234447 True False True 15m AlibabaDiskCSIDriverOperatorCRDegraded: AlibabaCloudDriverStaticResourcesControllerDegraded: "rbac/snapshotter_role.yaml" (string): clusterroles.rbac.authorization.k8s.io "alibaba-disk-external-snapshotter-role" is forbidden: user "system:serviceaccount:openshift-cluster-csi-drivers:alibaba-disk-csi-driver-operator" (groups=["system:serviceaccounts" "system:serviceaccounts:openshift-cluster-csi-drivers" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:...
$
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:0056
Version:
$ openshift-install version
openshift-install 4.10.0-0.ci.test-2022-01-18-015330-ci-ln-c2rvwfb-latest
built from commit c4bc155f6de2494b9baca767cd74dc665e2ec468
release image registry.build01.ci.openshift.org/ci-ln-c2rvwfb/release@sha256:105a191b4183a002f36cd4421a8db27ccb1e352d20a428e3899b0da491859451
release architecture amd64
Platform: alibabacloud
Please specify:
* IPI
What happened?
IPI installation failed, due to 'resource type [[cloud_essd]] not exists in [ap-southeast-3a]', although the specified 'systemDiskCategory' is 'cloud_efficiency'.
What did you expect to happen?
The installer should use the specified 'defaultMachinePlatform' when launching any ECS instance.
How to reproduce it (as minimally and precisely as possible)?
Always.
Anything else we need to know?
$ openshift-install create install-config --dir work
? SSH Public Key /home/jiwei/.ssh/openshift-qe.pub
? Platform alibabacloud
? Region ap-southeast-3
? Base Domain alicloud-qe.devcluster.openshift.com
? Cluster Name jiwei-204
? Pull Secret [? for help] *******
$ echo 'credentialsMode: Manual' >> work/install-config.yaml
$ vim work/install-config.yaml
$ yq e '.platform' work/install-config.yaml
alibabacloud:
  region: ap-southeast-3
  resourceGroupID: rg-aek2wky7lxk4f5y
  defaultMachinePlatform:
    instanceType: ecs.g6.xlarge
    systemDiskCategory: cloud_efficiency
    systemDiskSize: 200
$
$ openshift-install create manifests --dir work
INFO Consuming Install Config from target directory
INFO Manifests created in: work/manifests and work/openshift
$
$ openshift-install create cluster --dir work --log-level info
INFO Consuming Master Machines from target directory
INFO Consuming Openshift Manifests from target directory
INFO Consuming Worker Machines from target directory
INFO Consuming OpenShift Install (Manifests) from target directory
INFO Consuming Common Manifests from target directory
INFO Creating infrastructure resources...
ERROR
ERROR Error: [ERROR] terraform-provider-alicloud/alicloud/resource_alicloud_instance.go:452: Resource alicloud_instance RunInstances Failed!!! [SDK alibaba-cloud-sdk-go ERROR]:
ERROR SDK.ServerError
ERROR ErrorCode: InvalidResourceType.NotSupported
ERROR Recommend: https://error-center.aliyun.com/status/search?Keyword=InvalidResourceType.NotSupported&source=PopGw
ERROR RequestId: 961BAEA3-3F36-3C09-AC48-14BB985902A0
ERROR Message: user order resource type [[cloud_essd]] not exists in [ap-southeast-3a]
ERROR
ERROR on ../../../tmp/openshift-install-bootstrap-3799552535/main.tf line 133, in resource "alicloud_instance" "bootstrap":
ERROR 133: resource "alicloud_instance" "bootstrap" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
$
$ aliyun ecs DescribeAvailableResource --DestinationResource 'SystemDisk' --RegionId ap-southeast-3 --InstanceType 'ecs.g6.xlarge' --endpoint ecs.ap-southeast-3.aliyuncs.com --output cols=ZoneId,AvailableResources.AvailableResource[].SupportedResources.SupportedResource[] rows=AvailableZones.AvailableZone[]
ZoneId | AvailableResources.AvailableResource[].SupportedResources.SupportedResource[]
------ | -----------------------------------------------------------------------------
ap-southeast-3a | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]
ap-southeast-3b | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_essd] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]
$
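The expected behavior stated in this report, that every ECS instance including the bootstrap VM honors the configured disk category, can be illustrated with a small resolution function. This is only a sketch of the idea, not the installer's code and not the change made in the linked PR. The precedence (a machine pool's own setting first, then `defaultMachinePlatform`, then a built-in default) is an assumption, and so is the fallback value `cloud_essd`, which is taken from the error message rather than from the installer source. The names `effective_system_disk_category` and `ASSUMED_FALLBACK` are hypothetical.

```python
from typing import Any, Dict, Optional

# Assumption: mirrors the category seen in the error message, not a confirmed default.
ASSUMED_FALLBACK = "cloud_essd"


def _alibaba_settings(section: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the `platform.alibabacloud` mapping of a section, or {} if absent."""
    if not section:
        return {}
    return (section.get("platform") or {}).get("alibabacloud") or {}


def effective_system_disk_category(
    install_config: Dict[str, Any], pool_name: str = "controlPlane"
) -> str:
    """Resolve the system disk category for a pool from a parsed install-config."""
    pool = install_config.get(pool_name)
    if isinstance(pool, list):  # `compute` is a list of pools; use the first one
        pool = pool[0] if pool else None

    # 1. A value set directly on the machine pool wins (assumed precedence).
    pool_value = _alibaba_settings(pool).get("systemDiskCategory")
    if pool_value:
        return pool_value

    # 2. Otherwise use platform.alibabacloud.defaultMachinePlatform.
    defaults = _alibaba_settings(install_config).get("defaultMachinePlatform") or {}
    default_value = defaults.get("systemDiskCategory")
    if default_value:
        return default_value

    # 3. Fall back to the assumed built-in default.
    return ASSUMED_FALLBACK


# Parsed form of the install-config from this report (only the relevant keys).
REPORTED_CONFIG = {
    "platform": {
        "alibabacloud": {
            "region": "ap-southeast-3",
            "defaultMachinePlatform": {
                "instanceType": "ecs.g6.xlarge",
                "systemDiskCategory": "cloud_efficiency",
                "systemDiskSize": 200,
            },
        }
    }
}

if __name__ == "__main__":
    print(effective_system_disk_category(REPORTED_CONFIG))
```

Under these assumptions, the configuration from this report resolves to cloud_efficiency for the control plane, which is the category the bootstrap VM would also need to use in a zone such as ap-southeast-3a.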