Bug 1976016

Summary: Azure: Destroy cluster eventually fails when trying to delete a cluster while other resources (not related to the cluster) are present in the resource group
Product: OpenShift Container Platform Reporter: To Hung Sze <tsze>
Component: Installer Assignee: Aditya Narayanaswamy <anarayan>
Installer sub component: openshift-installer QA Contact: To Hung Sze <tsze>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: low CC: anarayan, esimard, mstaeble
Version: 4.8   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Destroying a cluster on Azure works as expected, but the error message is not user friendly when other resources are present in the resource group that the installer tries to delete. The fix changes the installer to produce a user-friendly error message.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:36:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description To Hung Sze 2021-06-25 01:44:25 UTC
Version:
openshift-install-linux-4.8.0-0.nightly-2021-06-23-232238

$ openshift-install version
built from commit a5ddd2dd6c72d8a5ea0a5f17acd8b964b6a3d1be
release image registry.ci.openshift.org/ocp/release@sha256:1e612a1bd83bfa79e9aa2404d6967321cf31cd36e89b02d7a85ff6b77b238417


Platform:
Azure

Please specify:
IPI

What happened?
When destroying a cluster while other resources are still using resources in the cluster's resource group, the installer loops over an error message and eventually fails without a user-friendly message.

To reproduce:
1. Install an IPI cluster on Azure (Cluster A).
2. Install another IPI cluster (Cluster B) on the same vnet used by Cluster A.
3. Try to destroy Cluster A (while Cluster B is still there).

The installer loops over the "deleting resource group" step and eventually fails:

DEBUG deleting resource group                      
DEBUG failed to delete tszeaz062421d-rrzxw-rg: Future#WaitForCompletion: the number of retries has been exceeded: StatusCode=409 -- Original Error: Code="ResourceGroupDeletionBlocked" Message="Deletion of resource group 'tszeaz062421d-rrzxw-rg' failed as resources with identifiers 'Microsoft.Network/virtualNetworks/tszeaz062421d-rrzxw-vnet,Microsoft.Network/networkSecurityGroups/tszeaz062421d-rrzxw-nsg' could not be deleted. The provisioning state of the resource group will be rolled back. The tracking Id is 'ce130605-4777-470a-82ca-d66a0b68a0ea'. Please check audit logs for more details." Details=[{"AdditionalInfo":null,"Code":null,"Details":null,"Message":"{\"error\":{\"code\":\"InUseSubnetCannotBeDeleted\",\"message\":\"Subnet tszeaz062421d-rrzxw-master-subnet is in use by /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz062421f-d9js6-rg/providers/Microsoft.Network/loadBalancers/tszeaz062421f-d9js6-internal/frontendIPConfigurations/internal-lb-ip-v4 and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. 
See aka.ms/deletesubnet.\",\"details\":[]}}","Target":"/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz062421d-rrzxw-rg/providers/Microsoft.Network/virtualNetworks/tszeaz062421d-rrzxw-vnet"},{"AdditionalInfo":null,"Code":null,"Details":null,"Message":"{\"error\":{\"code\":\"InUseNetworkSecurityGroupCannotBeDeleted\",\"message\":\"Network security group /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz062421d-rrzxw-rg/providers/Microsoft.Network/networkSecurityGroups/tszeaz062421d-rrzxw-nsg cannot be deleted because it is in use by the following resources: /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz062421d-rrzxw-rg/providers/Microsoft.Network/virtualNetworks/tszeaz062421d-rrzxw-vnet/subnets/tszeaz062421d-rrzxw-master-subnet, /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz062421d-rrzxw-rg/providers/Microsoft.Network/virtualNetworks/tszeaz062421d-rrzxw-vnet/subnets/tszeaz062421d-rrzxw-worker-subnet. In order to delete the Network security group, remove the association with the resource(s). To learn how to do this, see aka.ms/deletensg.\",\"details\":[]}}","Target":"/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz062421d-rrzxw-rg/providers/Microsoft.Network/networkSecurityGroups/tszeaz062421d-rrzxw-nsg"}] 
DEBUG deleting resource group                      
DEBUG failed to delete tszeaz062421d-rrzxw-rg: Future#WaitForCompletion: context has been cancelled: StatusCode=409 -- Original Error: context deadline exceeded 
DEBUG context deadline exceeded                    
DEBUG context deadline exceeded                    
FATAL Failed to destroy cluster: [failed to delete resource group: context deadline exceeded, failed to delete application registrations and their service principals: context deadline exceeded] 
 

Note
Installer behaves similarly in GCP until the following is implemented:
https://issues.redhat.com/browse/CORS-1380

Comment 1 Etienne Simard 2021-06-25 16:25:32 UTC
Can you please attach your install-config.yaml for both Cluster A and Cluster B?

Comment 2 To Hung Sze 2021-06-25 17:09:15 UTC
Cluster A:
metadata:
  creationTimestamp: null
  name: tszeaz062421d
....
platform:
  azure:
    baseDomainResourceGroupName: os4-common
    cloudName: AzurePublicCloud
    outboundType: Loadbalancer
    region: centralus
(skipping a bunch of lines that are standard)

Cluster B was then installed manually using this:
platform:
  azure:
    baseDomainResourceGroupName: os4-common
    cloudName: AzurePublicCloud
    outboundType: Loadbalancer
    region: centralus
    networkResourceGroupName: tszeaz062421d-rrzxw-rg
    virtualNetwork: tszeaz062421d-rrzxw-vnet
    controlPlaneSubnet: tszeaz062421d-rrzxw-master-subnet
    computeSubnet: tszeaz062421d-rrzxw-worker-subnet

Let me know if you need anything additional. Thanks.

Comment 3 Matthew Staebler 2021-06-25 17:30:17 UTC
So is the ask here to improve the output to the user? Because the behavior is correct. We cannot delete the resource group for cluster A because there are resources in that resource group that are in use by resources for cluster B.

Comment 4 Matthew Staebler 2021-06-25 17:32:40 UTC
Also, the title of the BZ is very misleading. The title implies that there is a problem deleting cluster A when it is using a network resource group that is shared with cluster B. However, what you really have is a problem deleting cluster A when you have other resources not related to cluster A using the resource group of cluster A.

Comment 5 To Hung Sze 2021-06-25 18:36:18 UTC
Yes, it would be great if we display a message in the log (as opposed to an error from Azure) similar to CORS-1380.

I tried to make the title clear but I guess I didn't do a good enough job.
Tried again. Please review and feel free to edit it.

Thanks.

Comment 6 Russell Teague 2021-08-02 18:01:12 UTC
Needs prioritization.

Comment 8 To Hung Sze 2021-08-31 18:56:02 UTC
Using openshift-install-linux-4.9.0-0.nightly-2021-08-30-070917

When I destroy cluster A, I get:

FATAL Failed to destroy cluster: unable to delete resource group, resources in the group are in use by others: failed to delete tszeaz083121a-d6lm9-rg: Future#WaitForCompletion: the number of retries has been exceeded: StatusCode=409 -- Original Error: Code="ResourceGroupDeletionBlocked" Message="Deletion of resource group 'tszeaz083121a-d6lm9-rg' failed as resources with identifiers 'Microsoft.Network/virtualNetworks/tszeaz083121a-d6lm9-vnet,Microsoft.Network/networkSecurityGroups/tszeaz083121a-d6lm9-nsg' could not be deleted. The provisioning state of the resource group will be rolled back. The tracking Id is '4aaa5038-ffcf-4cc6-8405-8035e71878a5'. Please check audit logs for more details." Details=[{"AdditionalInfo":null,"Code":null,"Details":null,"Message":"{\"error\":{\"code\":\"InUseSubnetCannotBeDeleted\",\"message\":\"Subnet tszeaz083121a-d6lm9-worker-subnet is in use by /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz083121d-rdl56-rg/providers/Microsoft.Network/networkInterfaces/tszeaz083121d-rdl56-worker-northcentralus-sjhrj-nic/ipConfigurations/pipConfig and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. 
See aka.ms/deletesubnet.\",\"details\":[]}}","Target":"/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz083121a-d6lm9-rg/providers/Microsoft.Network/virtualNetworks/tszeaz083121a-d6lm9-vnet"},{"AdditionalInfo":null,"Code":null,"Details":null,"Message":"{\"error\":{\"code\":\"InUseNetworkSecurityGroupCannotBeDeleted\",\"message\":\"Network security group /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz083121a-d6lm9-rg/providers/Microsoft.Network/networkSecurityGroups/tszeaz083121a-d6lm9-nsg cannot be deleted because it is in use by the following resources: /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz083121a-d6lm9-rg/providers/Microsoft.Network/virtualNetworks/tszeaz083121a-d6lm9-vnet/subnets/tszeaz083121a-d6lm9-worker-subnet, /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz083121a-d6lm9-rg/providers/Microsoft.Network/virtualNetworks/tszeaz083121a-d6lm9-vnet/subnets/tszeaz083121a-d6lm9-master-subnet. In order to delete the Network security group, remove the association with the resource(s). To learn how to do this, see aka.ms/deletensg.\",\"details\":[]}}","Target":"/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/tszeaz083121a-d6lm9-rg/providers/Microsoft.Network/networkSecurityGroups/tszeaz083121a-d6lm9-nsg"}] 
Destroy cluster failed; please double check.

Comment 9 Aditya Narayanaswamy 2021-08-31 19:11:58 UTC
This is the part I added (the first part of your error message).

FATAL Failed to destroy cluster: unable to delete resource group, resources in the group are in use by others:

I think the user should know which resource is causing the problem, so I kept the error message received from Azure and display it after the sentence above.

Comment 10 To Hung Sze 2021-09-02 14:41:07 UTC
Thanks @anarayan

Comment 13 errata-xmlrpc 2021-10-18 17:36:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759