2041694 – [IPI on Alibabacloud] installation fails when region does not support the cloud_essd disk category

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	aos-install
QA Contact:	Jianli Wei
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-01-18 06:23 UTC by Jianli Wei
Modified:	2022-03-10 16:40 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-10 16:40:08 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift installer pull 5564	0	None	open	Bug 2041694: [Alibaba] fix system disk category of bootstrap	2022-01-22 16:59:04 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:40:32 UTC

Description Jianli Wei 2022-01-18 06:23:35 UTC

Version:
$ openshift-install version
openshift-install 4.10.0-0.ci.test-2022-01-18-015330-ci-ln-c2rvwfb-latest
built from commit c4bc155f6de2494b9baca767cd74dc665e2ec468
release image registry.build01.ci.openshift.org/ci-ln-c2rvwfb/release@sha256:105a191b4183a002f36cd4421a8db27ccb1e352d20a428e3899b0da491859451
release architecture amd64

Platform: alibabacloud

Please specify:
* IPI

What happened?
The IPI installation failed with 'resource type [[cloud_essd]] not exists in [ap-southeast-3a]', even though the install-config specifies 'systemDiskCategory: cloud_efficiency'.

What did you expect to happen?
The installer should apply the settings in 'defaultMachinePlatform' when launching any ECS instance, including the bootstrap instance.
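
The expected precedence can be sketched as follows. This is a minimal illustration, not the installer's actual code (the installer is written in Go): the function name `resolve_system_disk_category` and the `FALLBACK_CATEGORY` constant are hypothetical, and the assumed rule is that a per-pool setting wins over `defaultMachinePlatform`, which wins over a built-in default.

```python
from typing import Optional

# Hypothetical built-in fallback; the failure in this report shows
# cloud_essd being requested when nothing else is applied.
FALLBACK_CATEGORY = "cloud_essd"


def resolve_system_disk_category(
    pool_platform: Optional[dict],
    default_machine_platform: Optional[dict],
) -> str:
    """Pick the system disk category: pool setting, then default, then fallback."""
    for source in (pool_platform, default_machine_platform):
        if source and source.get("systemDiskCategory"):
            return source["systemDiskCategory"]
    return FALLBACK_CATEGORY


# defaultMachinePlatform values taken from the install-config in this report.
default_platform = {
    "instanceType": "ecs.g6.xlarge",
    "systemDiskCategory": "cloud_efficiency",
    "systemDiskSize": 200,
}

# The bootstrap instance has no pool-specific platform settings in this sketch.
print(resolve_system_disk_category(None, default_platform))  # cloud_efficiency
```

Under this rule the bootstrap instance would request cloud_efficiency rather than cloud_essd.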

How to reproduce it (as minimally and precisely as possible)?
Always.

Anything else we need to know?
$ openshift-install create install-config --dir work
? SSH Public Key /home/jiwei/.ssh/openshift-qe.pub
? Platform alibabacloud
? Region ap-southeast-3
? Base Domain alicloud-qe.devcluster.openshift.com
? Cluster Name jiwei-204
? Pull Secret [? for help] *******
$ echo 'credentialsMode: Manual' >> work/install-config.yaml
$ vim work/install-config.yaml
$ yq e '.platform' work/install-config.yaml
alibabacloud:
  region: ap-southeast-3
  resourceGroupID: rg-aek2wky7lxk4f5y
  defaultMachinePlatform:
    instanceType: ecs.g6.xlarge
    systemDiskCategory: cloud_efficiency
    systemDiskSize: 200
$ 
$ openshift-install create manifests --dir work
INFO Consuming Install Config from target directory 
INFO Manifests created in: work/manifests and work/openshift 
$ 
$ openshift-install create cluster --dir work --log-level info
INFO Consuming Master Machines from target directory 
INFO Consuming Openshift Manifests from target directory 
INFO Consuming Worker Machines from target directory 
INFO Consuming OpenShift Install (Manifests) from target directory 
INFO Consuming Common Manifests from target directory 
INFO Creating infrastructure resources...         
ERROR                                              
ERROR Error: [ERROR] terraform-provider-alicloud/alicloud/resource_alicloud_instance.go:452: Resource alicloud_instance RunInstances Failed!!! [SDK alibaba-cloud-sdk-go ERROR]: 
ERROR SDK.ServerError                              
ERROR ErrorCode: InvalidResourceType.NotSupported  
ERROR Recommend: https://error-center.aliyun.com/status/search?Keyword=InvalidResourceType.NotSupported&source=PopGw 
ERROR RequestId: 961BAEA3-3F36-3C09-AC48-14BB985902A0 
ERROR Message: user order resource type [[cloud_essd]] not exists in [ap-southeast-3a] 
ERROR                                              
ERROR   on ../../../tmp/openshift-install-bootstrap-3799552535/main.tf line 133, in resource "alicloud_instance" "bootstrap": 
ERROR  133: resource "alicloud_instance" "bootstrap" { 
ERROR                                              
ERROR                                              
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change 
$ 
$ aliyun ecs DescribeAvailableResource --DestinationResource 'SystemDisk' --RegionId ap-southeast-3 --InstanceType 'ecs.g6.xlarge' --endpoint ecs.ap-southeast-3.aliyuncs.com --output cols=ZoneId,AvailableResources.AvailableResource[].SupportedResources.SupportedResource[] rows=AvailableZones.AvailableZone[]
ZoneId          | AvailableResources.AvailableResource[].SupportedResources.SupportedResource[]
------          | -----------------------------------------------------------------------------
ap-southeast-3a | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]
ap-southeast-3b | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_essd] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]

$
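
The availability check above can also be done programmatically. The sketch below assumes the JSON response of DescribeAvailableResource nests its data along the same path used in the `--output cols=...` expression above; the function name `zones_missing_category` is hypothetical, and the sample data is transcribed from the CLI output in this report (only the fields used here).

```python
def zones_missing_category(response: dict, category: str) -> list:
    """Return zone IDs where `category` is not an available system disk category."""
    missing = []
    for zone in response.get("AvailableZones", {}).get("AvailableZone", []):
        supported = set()
        resources = zone.get("AvailableResources", {}).get("AvailableResource", [])
        for resource in resources:
            items = resource.get("SupportedResources", {}).get("SupportedResource", [])
            for item in items:
                if item.get("Status") == "Available":
                    supported.add(item.get("Value"))
        if category not in supported:
            missing.append(zone.get("ZoneId"))
    return missing


def _zone(zone_id: str, categories: list) -> dict:
    """Build one zone entry in the assumed response shape (helper for the sample)."""
    return {
        "ZoneId": zone_id,
        "AvailableResources": {"AvailableResource": [{
            "SupportedResources": {"SupportedResource": [
                {"Value": c, "Status": "Available"} for c in categories
            ]},
        }]},
    }


sample = {"AvailableZones": {"AvailableZone": [
    _zone("ap-southeast-3a", ["cloud_efficiency", "cloud_ssd"]),
    _zone("ap-southeast-3b", ["cloud_essd", "cloud_efficiency", "cloud_ssd"]),
]}}

print(zones_missing_category(sample, "cloud_essd"))        # ['ap-southeast-3a']
print(zones_missing_category(sample, "cloud_efficiency"))  # []
```

A non-empty result for the requested category means at least one zone would reject an instance using it.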

Comment 1 Jianli Wei 2022-01-18 10:28:54 UTC

The region "cn-qingdao (China (Qingdao))" has a similar issue.

$ yq e '.controlPlane' work/install-config.yaml
architecture: amd64
hyperthreading: Enabled
name: master
platform:
  alibabacloud:
    systemDiskCategory: cloud_efficiency
replicas: 3
$ yq e '.compute' work/install-config.yaml
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    alibabacloud:
      systemDiskCategory: cloud_efficiency
  replicas: 3
$ yq e '.platform' work/install-config.yaml
alibabacloud:
  region: cn-qingdao
  resourceGroupID: rg-aek2wky7lxk4f5y
$ 
$ openshift-install create cluster --dir work --log-level info
INFO Consuming Common Manifests from target directory 
INFO Consuming Worker Machines from target directory 
INFO Consuming OpenShift Install (Manifests) from target directory 
INFO Consuming Master Machines from target directory 
INFO Consuming Openshift Manifests from target directory 
INFO Creating infrastructure resources...         
ERROR                                              
ERROR Error: [ERROR] terraform-provider-alicloud/alicloud/resource_alicloud_instance.go:452: Resource alicloud_instance RunInstances Failed!!! [SDK alibaba-cloud-sdk-go ERROR]: 
ERROR SDK.ServerError                              
ERROR ErrorCode: InvalidResourceType.NotSupported  
ERROR Recommend: https://error-center.aliyun.com/status/search?Keyword=InvalidResourceType.NotSupported&source=PopGw 
ERROR RequestId: 374A4CC9-2370-5998-899D-7C54C39A9533 
ERROR Message: user order resource type [[cloud_essd]] not exists in [cn-qingdao-b] 
ERROR                                              
ERROR   on ../../../tmp/openshift-install-bootstrap-2812370522/main.tf line 133, in resource "alicloud_instance" "bootstrap": 
ERROR  133: resource "alicloud_instance" "bootstrap" { 
ERROR                                              
ERROR                                              
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change 
$ 

$ aliyun ecs DescribeAvailableResource --DestinationResource 'SystemDisk' --RegionId cn-qingdao --InstanceType 'ecs.g6.xlarge' --endpoint ecs.cn-qingdao.aliyuncs.com --output cols=ZoneId,AvailableResources.AvailableResource[].SupportedResources.SupportedResource[] rows=AvailableZones.AvailableZone[]
ZoneId       | AvailableResources.AvailableResource[].SupportedResources.SupportedResource[]
------       | -----------------------------------------------------------------------------
cn-qingdao-b | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]

$ aliyun ecs DescribeAvailableResource --DestinationResource 'SystemDisk' --RegionId cn-qingdao --InstanceType 'ecs.g6.large' --endpoint ecs.cn-qingdao.aliyuncs.com --output cols=ZoneId,AvailableResources.AvailableResource[].SupportedResources.SupportedResource[] rows=AvailableZones.AvailableZone[]
ZoneId       | AvailableResources.AvailableResource[].SupportedResources.SupportedResource[]
------       | -----------------------------------------------------------------------------
cn-qingdao-c | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_essd] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]
cn-qingdao-b | [map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_efficiency] map[Max:500 Min:20 Status:Available Unit:GiB Value:cloud_ssd]]

$
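
Both failures report the same message shape, so the rejected disk category and zone can be pulled out of an installer log with a regular expression. This is a sketch only: the pattern is modeled on the two messages quoted in this report, other error texts from the provider may differ, and the name `parse_unsupported_resource` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern modeled on the two error messages quoted in this report.
_PATTERN = re.compile(
    r"resource type \[\[(?P<category>[^\]]+)\]\] not exists in \[(?P<zone>[^\]]+)\]"
)


def parse_unsupported_resource(message: str) -> Optional[Tuple[str, str]]:
    """Return (category, zone) if the message matches, otherwise None."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return match.group("category"), match.group("zone")


line = "ERROR Message: user order resource type [[cloud_essd]] not exists in [cn-qingdao-b] "
print(parse_unsupported_resource(line))  # ('cloud_essd', 'cn-qingdao-b')
```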

Comment 2 Matthew Staebler 2022-01-19 10:38:00 UTC

@husun The bootstrap VM is hard-coded to use cloud_essd. Is that intentional?

Comment 3 Matthew Staebler 2022-01-19 10:39:13 UTC

I am setting this as a non-blocker for now as it only affects regions that do not support cloud_essd.

Comment 4 Brian Lu 2022-01-21 01:08:46 UTC

The root cause has been found. sunhui is working on it, and a PR will be submitted soon.

Comment 5 husun 2022-01-24 09:05:19 UTC

I have fixed it in PR https://github.com/openshift/installer/pull/5564

Comment 8 Jianli Wei 2022-01-27 12:10:56 UTC

$ openshift-install create install-config --dir work
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform alibabacloud
? Region ap-southeast-3
? Base Domain alicloud-qe.devcluster.openshift.com
? Cluster Name jiwei-408
? Pull Secret [? for help] ********
INFO Install-Config created in: work
$ vim work/install-config.yaml
$ yq e .platform work/install-config.yaml
alibabacloud:
  region: ap-southeast-3
  resourceGroupID: rg-aek2wky7lxk4f5y
  defaultMachinePlatform:
    instanceType: ecs.g6.xlarge
    systemDiskCategory: cloud_efficiency
    systemDiskSize: 200
$ yq e .metadata work/install-config.yaml 
creationTimestamp: null
name: jiwei-408
$ yq e .credentialsMode work/install-config.yaml 
Manual
$ openshift-install create manifests --dir work
INFO Consuming Install Config from target directory 
INFO Manifests created in: work/manifests and work/openshift 
$ 
$ openshift-install create cluster --dir work --log-level info
INFO Consuming Master Machines from target directory 
INFO Consuming Worker Machines from target directory 
INFO Consuming OpenShift Install (Manifests) from target directory 
INFO Consuming Common Manifests from target directory 
INFO Consuming Openshift Manifests from target directory 
INFO Creating infrastructure resources...         
INFO Waiting up to 20m0s (until 11:57AM) for the Kubernetes API at https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443... 
INFO API v1.23.0+2135ac2 up                       
INFO Waiting up to 30m0s (until 12:11PM) for bootstrapping to complete... 
INFO Destroying the bootstrap resources...        
INFO Waiting up to 40m0s (until 12:31PM) for the cluster at https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443 to initialize... 
W0127 11:52:08.550078  430110 reflector.go:324] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost
I0127 11:52:08.550251  430110 trace.go:205] Trace[1248183454]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (27-Jan-2022 11:51:51.019) (total time: 17530ms):
Trace[1248183454]: ---"Objects listed" error:Get "https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost 17530ms (11:52:08.550)
Trace[1248183454]: [17.530476537s] [17.530476537s] END
E0127 11:52:08.550279  430110 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.jiwei-408.alicloud-qe.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": http2: client connection lost
INFO Waiting up to 10m0s (until 12:12PM) for the openshift-console route to be created... 
INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/fedora/work/auth/kubeconfig' 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.jiwei-408.alicloud-qe.devcluster.openshift.com 
INFO Login to the console with user: "kubeadmin", and password: "3iUbd-G5R5G-skw2e-9LxZ9" 
INFO Time elapsed: 27m5s                          
$ 
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-26-234447   True        False         2m17s   Cluster version is 4.10.0-0.nightly-2022-01-26-234447
$ oc get nodes
NAME                                           STATUS   ROLES    AGE     VERSION
jiwei-408-hv4sp-master-0                       Ready    master   21m     v1.23.0+2135ac2
jiwei-408-hv4sp-master-1                       Ready    master   19m     v1.23.0+2135ac2
jiwei-408-hv4sp-master-2                       Ready    master   19m     v1.23.0+2135ac2
jiwei-408-hv4sp-worker-ap-southeast-3a-rnmd7   Ready    worker   8m49s   v1.23.0+2135ac2
jiwei-408-hv4sp-worker-ap-southeast-3a-zhmkj   Ready    worker   8m45s   v1.23.0+2135ac2
jiwei-408-hv4sp-worker-ap-southeast-3b-8j2ws   Ready    worker   10m     v1.23.0+2135ac2
$ 
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2022-01-26-234447   True        False         False      2m48s
baremetal                                  4.10.0-0.nightly-2022-01-26-234447   True        False         False      18m
cloud-controller-manager                   4.10.0-0.nightly-2022-01-26-234447   True        False         False      21m
cloud-credential                           4.10.0-0.nightly-2022-01-26-234447   True        False         False      17m
cluster-autoscaler                         4.10.0-0.nightly-2022-01-26-234447   True        False         False      17m
config-operator                            4.10.0-0.nightly-2022-01-26-234447   True        False         False      19m
console                                    4.10.0-0.nightly-2022-01-26-234447   True        False         False      4m41s
csi-snapshot-controller                    4.10.0-0.nightly-2022-01-26-234447   True        False         False      18m     
dns                                        4.10.0-0.nightly-2022-01-26-234447   True        False         False      17m     
etcd                                       4.10.0-0.nightly-2022-01-26-234447   True        False         False      16m     
image-registry                             4.10.0-0.nightly-2022-01-26-234447   True        False         False      10m     
ingress                                    4.10.0-0.nightly-2022-01-26-234447   True        False         False      9m32s   
insights                                   4.10.0-0.nightly-2022-01-26-234447   True        False         False      12m     
kube-apiserver                             4.10.0-0.nightly-2022-01-26-234447   True        False         False      15m     
kube-controller-manager                    4.10.0-0.nightly-2022-01-26-234447   True        False         False      16m     
kube-scheduler                             4.10.0-0.nightly-2022-01-26-234447   True        False         False      15m     
kube-storage-version-migrator              4.10.0-0.nightly-2022-01-26-234447   True        False         False      18m     
machine-api                                4.10.0-0.nightly-2022-01-26-234447   True        False         False      13m     
machine-approver                           4.10.0-0.nightly-2022-01-26-234447   True        False         False      17m     
machine-config                             4.10.0-0.nightly-2022-01-26-234447   True        False         False      16m     
marketplace                                4.10.0-0.nightly-2022-01-26-234447   True        False         False      17m     
monitoring                                 4.10.0-0.nightly-2022-01-26-234447   True        False         False      7m8s    
network                                    4.10.0-0.nightly-2022-01-26-234447   True        False         False      18m     
node-tuning                                4.10.0-0.nightly-2022-01-26-234447   True        False         False      8m9s    
openshift-apiserver                        4.10.0-0.nightly-2022-01-26-234447   True        False         False      12m     
openshift-controller-manager               4.10.0-0.nightly-2022-01-26-234447   True        False         False      17m     
openshift-samples                          4.10.0-0.nightly-2022-01-26-234447   True        False         False      12m
operator-lifecycle-manager                 4.10.0-0.nightly-2022-01-26-234447   True        False         False      18m
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2022-01-26-234447   True        False         False      17m
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2022-01-26-234447   True        False         False      12m
service-ca                                 4.10.0-0.nightly-2022-01-26-234447   True        False         False      19m
storage                                    4.10.0-0.nightly-2022-01-26-234447   True        False         True       15m     AlibabaDiskCSIDriverOperatorCRDegraded: AlibabaCloudDriverStaticResourcesControllerDegraded: "rbac/snapshotter_role.yaml" (string): clusterroles.rbac.authorization.k8s.io "alibaba-disk-external-snapshotter-role" is forbidden: user "system:serviceaccount:openshift-cluster-csi-drivers:alibaba-disk-csi-driver-operator" (groups=["system:serviceaccounts" "system:serviceaccounts:openshift-cluster-csi-drivers" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:...
$
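
In the output above, every operator is available, but `storage` reports DEGRADED as True. A quick way to spot such rows in a long table is sketched below. It assumes the whitespace-separated column layout shown in this report; the function name `degraded_operators` is hypothetical, and the sample rows are abbreviated from the output above.

```python
def degraded_operators(table: str) -> list:
    """Return names of operators whose DEGRADED column reads True."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    degraded_index = header.index("DEGRADED")
    names = []
    for line in lines[1:]:
        # Split only up to the DEGRADED column so the free-text MESSAGE stays in one piece.
        fields = line.split(None, degraded_index + 1)
        if len(fields) > degraded_index and fields[degraded_index] == "True":
            names.append(fields[0])
    return names


sample_table = """\
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd      4.10.0-0.nightly-2022-01-26-234447   True        False         False      16m
storage   4.10.0-0.nightly-2022-01-26-234447   True        False         True       15m     AlibabaDiskCSIDriverOperatorCRDegraded: ...
"""

print(degraded_operators(sample_table))  # ['storage']
```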

Comment 11 errata-xmlrpc 2022-03-10 16:40:08 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
