Bug 2043003 - [IPI on Alibabacloud] 'destroy cluster' of a failed installation (bug2041694) stuck after 'stage=Nat gateways'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: aos-install
QA Contact: Jianli Wei
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-20 13:15 UTC by Jianli Wei
Modified: 2022-03-10 16:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:40:58 UTC
Target Upstream Version:
Embargoed:


Attachments
.openshift_install.log (403.38 KB, text/plain)
2022-01-20 13:15 UTC, Jianli Wei


Links
System                 | ID                            | Private | Priority | Status | Summary                                                        | Last Updated
------                 | --                            | ------- | -------- | ------ | -------                                                        | ------------
Github                 | openshift installer pull 5580 | 0       | None     | open   | Bug 2043003: [Alibaba] fix destroy not exist security group    | 2022-01-27 03:01:08 UTC
Red Hat Product Errata | RHSA-2022:0056                | 0       | None     | None   | None                                                           | 2022-03-10 16:41:09 UTC

Description Jianli Wei 2022-01-20 13:15:50 UTC
Created attachment 1852159
.openshift_install.log

Version:
$ openshift-install version
openshift-install 4.10.0-0.nightly-2022-01-20-082726
built from commit 9eade28a9ce4862a6ef092bc5f5fcfb499342d4d
release image registry.ci.openshift.org/ocp/release@sha256:bdc27b9ff4a1a482d00fc08924f1157d782ded9f3e91af09fe9f3596bcea877c
release architecture amd64
$ 

Platform: alibabacloud

Please specify:
* IPI

What happened?
(1) Although the installation failed due to https://bugzilla.redhat.com/show_bug.cgi?id=2041694, 'destroy cluster' is still expected to work, but it hung after 'stage=Nat gateways'. The stage expected after 'stage=Nat gateways' is 'stage=EIPs', but the EIP was not deleted.

(2) Although the output reports 'INFO ECS instances deleted', one master instance ('jiwei-405-bbrrc-master-1') is still running.

(3) '.openshift_install.log' keeps repeating the following:
time="2022-01-20T12:29:59Z" level=debug msg="Revoking dependency for security groups" securityGroupIDs="[sg-a2dc4z6pxx3pz3rpueyi sg-a2d9yj0vthluex0ro9bg sg-a2d9yj0vthlueyzssbal]" stage="ECS security groups"
time="2022-01-20T12:29:59Z" level=debug msg=Revoking securityGroupID=sg-a2dc4z6pxx3pz3rpueyi stage="ECS security groups"
time="2022-01-20T12:29:59Z" level=debug msg="Error executing stage" error="SDK.ServerError\nErrorCode: InvalidSecurityGroupId.NotFound\nRecommend: https://error-center.aliyun.com/status/search?Keyword=InvalidSecurityGroupId.NotFound&source=PopGw\nRequestId: 73B85607-90C9-36A5-900B-726609EB5A8E\nMessage: The specified SecurityGroupId does not exist." stage="ECS security groups"

What did you expect to happen?
Destroying the cluster should succeed. 

How to reproduce it (as minimally and precisely as possible)?
Always.

Anything else we need to know?
$ yq e '.compute[].platform' work/install-config.yaml 
alibabacloud:
  systemDiskCategory: cloud_efficiency
$ yq e '.controlPlane.platform' work/install-config.yaml 
alibabacloud:
  systemDiskCategory: cloud_efficiency
$ yq e '.platform' work/install-config.yaml 
alibabacloud:
  region: ap-south-1
  resourceGroupID: rg-aek2c4huej7f3ni
$ yq e '.credentialsMode' work/install-config.yaml 
Manual
$ 
$ openshift-install create manifests --dir work
INFO Consuming Install Config from target directory
INFO Manifests created in: work/manifests and work/openshift
$ openshift-install create cluster --dir work --log-level info
INFO Consuming OpenShift Install (Manifests) from target directory
INFO Consuming Common Manifests from target directory
INFO Consuming Worker Machines from target directory
INFO Consuming Openshift Manifests from target directory
INFO Consuming Master Machines from target directory
INFO Creating infrastructure resources...
ERROR
ERROR Error: [ERROR] terraform-provider-alicloud/alicloud/resource_alicloud_instance.go:452: Resource alicloud_instance RunInstances Failed!!! [SDK alibaba-cloud-sdk-go ERROR]:
ERROR SDK.ServerError
ERROR ErrorCode: InvalidResourceType.NotSupported
ERROR Recommend: https://error-center.aliyun.com/status/search?Keyword=InvalidResourceType.NotSupported&source=PopGw
ERROR RequestId: 8AB5BF0F-C4D1-3866-AC4A-3890388D12EB
ERROR Message: user order resource type [[cloud_essd]] not exists in [ap-south-1a]
ERROR
ERROR   on ../../tmp/openshift-install-bootstrap-608796246/main.tf line 133, in resource "alicloud_instance" "bootstrap":
ERROR  133: resource "alicloud_instance" "bootstrap" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
$ aliyun vpc DescribeEipAddresses --RegionId ap-south-1 --EipName jiwei-405-bbrrc-eip --output cols=AllocationId,InstanceRegionId,InstanceType,Description,InternetChargeType,IpAddress,Tags.Tag[] rows=EipAddresses.EipAddress[]
AllocationId              | InstanceRegionId | InstanceType | Description                    | InternetChargeType | IpAddress       | Tags.Tag[]
------------              | ---------------- | ------------ | -----------                    | ------------------ | ---------       | ----------
eip-a2dkmtv16me8rv4sr16wb | ap-south-1       | Nat          | Created By OpenShift Installer | PayByTraffic       | 149.129.184.209 | [map[Key:Name Value:jiwei-405-bbrrc-eip] map[Key:sigs.k8s.io/cloud-provider-alibaba/origin Value:ocp] map[Key:GISV Value:ocp] map[Key:kubernetes.io/cluster/jiwei-405-bbrrc Value:owned]]

$ 
$ openshift-install destroy cluster --dir work --log-level info
INFO ECS instances deleted                         stage=ECS instances
INFO OSS bucket deleted                            bucketName=jiwei-405-bbrrc-bootstrap stage=OSS buckets
INFO OSS buckets deleted                           stage=OSS buckets
INFO RAM roles deleted                             stage=RAM roles
INFO Private zones deleted                         stage=private zones
INFO SLB instances deleted                         stage=SLBs
INFO NAT gateways deleted                          stage=Nat gateways
^C
$ 
$ aliyun vpc DescribeEipAddresses --RegionId ap-south-1 --EipName jiwei-405-bbrrc-eip --output cols=AllocationId,InstanceRegionId,InstanceType,Description,InternetChargeType,IpAddress,Tags.Tag[] rows=EipAddresses.EipAddress[]
AllocationId              | InstanceRegionId | InstanceType | Description                    | InternetChargeType | IpAddress       | Tags.Tag[]
------------              | ---------------- | ------------ | -----------                    | ------------------ | ---------       | ----------
eip-a2dkmtv16me8rv4sr16wb |                  |              | Created By OpenShift Installer | PayByTraffic       | 149.129.184.209 | [map[Key:Name Value:jiwei-405-bbrrc-eip] map[Key:sigs.k8s.io/cloud-provider-alibaba/origin Value:ocp] map[Key:GISV Value:ocp] map[Key:kubernetes.io/cluster/jiwei-405-bbrrc Value:owned]]

$ 
$ aliyun ecs DescribeInstances --RegionId ap-south-1 --InstanceName jiwei-405-bbrrc-master-1 --endpoint ecs.ap-south-1.aliyuncs.com --output cols=ZoneId,InstanceId,Status,SecurityGroupIds.SecurityGroupId[],VpcAttributes.VpcId rows=Instances.Instance[]
ZoneId      | InstanceId             | Status  | SecurityGroupIds.SecurityGroupId[] | VpcAttributes.VpcId
------      | ----------             | ------  | ---------------------------------- | -------------------
ap-south-1a | i-a2dc4z6pxx3pz3rpf5tk | Running | [sg-a2d9yj0vthluex0ro9bg]          | vpc-a2d05l0atni35cloe8u6h

$ aliyun ecs DescribeSecurityGroups --RegionId ap-south-1 --VpcId vpc-a2d05l0atni35cloe8u6h --endpoint ecs.ap-south-1.aliyuncs.com --output cols=SecurityGroupName,SecurityGroupId,Description rows=SecurityGroups.SecurityGroup[]
SecurityGroupName            | SecurityGroupId         | Description
-----------------            | ---------------         | -----------
jiwei-405-bbrrc_sg_bootstrap | sg-a2d9yj0vthlueyzssbal | Created By OpenShift Installer
jiwei-405-bbrrc-sg-master    | sg-a2d9yj0vthluex0ro9bg | Created By OpenShift Installer

$

Comment 1 Matthew Staebler 2022-01-20 14:24:28 UTC
The destroyer should not return an error when it attempts to delete a resource that does not exist. It should treat a not-found error as a successful deletion.

See https://github.com/openshift/installer/blob/303a3c7adcc718f48e1aa372acd92f31b4685642/pkg/destroy/alibabacloud/alibabacloud.go#L832 as the problematic area for the specific case outlined in this BZ.

Comment 4 husun 2022-01-25 18:26:19 UTC
@jiwei
This scenario does not always occur, so I wonder whether your cluster resources and logs are still available.
1. Can the following record be found in the log?
  Deleting ECS instances ecsIDs=[<the ID of jiwei-405-bbrrc-master-1>] stage=ECS instances
2. Does the master instance ('jiwei-405-bbrrc-master-1') have tags?

Comment 8 Jianli Wei 2022-01-29 14:35:59 UTC
I did not hit the issue during yesterday's and today's testing, so I am marking this as verified for now.

Comment 11 errata-xmlrpc 2022-03-10 16:40:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

