Bug 2101736

Summary: Finalizers can't be removed for machines
Product: OpenShift Container Platform Reporter: sunzhaohua <zhsun>
Component: Cloud Compute Assignee: Mike Fedosin <mfedosin>
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA Docs Contact: Jeana Routh <jrouth>
Severity: low    
Priority: high    
Version: 4.11   
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
* Previously, machines created in early versions of {product-title} with invalid configurations could not be deleted. With this release, the webhooks that prevent the creation of machines with invalid configurations no longer prevent the deletion of existing invalid machines. Users can now successfully remove these machines from their cluster by manually removing the finalizers on these machines. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2101736[*BZ#2101736*])
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-17 19:50:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description sunzhaohua 2022-06-28 09:39:29 UTC
Description of problem:
I set up a 4.5 vSphere cluster and created a machineset with an AWS provider spec, then deleted the machineset; the machine got stuck in Deleting status. I then upgraded the cluster from 4.5 all the way to 4.11 and tried to remove the finalizer, but the machine still couldn't be deleted because the validating webhook rejected the update.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-18-171831

How reproducible:
Always

Steps to Reproduce:
1. Set up a 4.5 vSphere cluster.
2. Create a new machineset in the vSphere cluster with an AWS provider spec.
3. Delete the machineset; the machine gets stuck in Deleting status.
4. Upgrade the cluster from 4.5 all the way to 4.11, then remove the finalizer; the machine still can't be deleted.

$ oc edit machine zhsunvs2-g6mkw-worker1-rxf47                 
Warning: providerSpec.numCPUs: 0 is less than the minimum value (2): the minimum value will be used instead
Warning: providerSpec.memoryMiB: 0 is less than the recommended minimum value (2048): nodes may not boot correctly
Warning: providerSpec.diskGiB: 0 is less than the recommended minimum (120): nodes may fail to start if disk size is too low
error: machines.machine.openshift.io "zhsunvs2-g6mkw-worker1-rxf47" could not be patched: admission webhook "validation.machine.machine.openshift.io" denied the request: [providerSpec.template: Required value: template must be provided, providerSpec.workspace: Required value: workspace must be provided, providerSpec.network.devices: Required value: at least 1 network device must be provided]

$ oc get machine                                                                                                 
NAME                            PHASE      TYPE   REGION   ZONE   AGE
zhsunvs19-bdttx-master-0        Running                           10h
zhsunvs19-bdttx-master-1        Running                           10h
zhsunvs19-bdttx-master-2        Running                           10h
zhsunvs19-bdttx-worker-d99jf    Running                           10h
zhsunvs19-bdttx-worker-mtjgx    Running                           10h
zhsunvs19-bdttx-worker2-7wfh2   Deleting                          9h
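
To pick out only the stuck machines from a listing like the one above, a small filter can be used. This is a minimal sketch, not part of the original report: the function name `stuck_machines` is hypothetical, and it assumes the default `oc get machine` table layout, in which PHASE is the second column.

```shell
# Hypothetical helper: print the names of machines whose PHASE is "Deleting".
# Reads the table printed by `oc get machine` on stdin.
# Rows with an empty PHASE are skipped, because their second field is the AGE.
stuck_machines() {
  awk 'NR > 1 && $2 == "Deleting" { print $1 }'
}
```

For example: `oc get machine -n openshift-machine-api | stuck_machines`.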

Actual results:
Removing the finalizer is rejected by the validating webhook, so the machine can't be deleted.


Expected results:
The finalizer can be removed, and the machine is then deleted.

Additional info:
must-gather: https://drive.google.com/file/d/1ulYQI5yR2LTgnnRnC4GcWzkxmk1Hin6x/view?usp=sharing
This is for https://issues.redhat.com/browse/OCPCLOUD-1426

Comment 2 sunzhaohua 2022-07-20 13:39:58 UTC
This still doesn't work, using the same steps as in the bug description. After upgrading the cluster to 4.12 and trying to remove the finalizer, the machine's finalizers couldn't be updated.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-07-20-030220   True        False         157m    Cluster version is 4.12.0-0.nightly-2022-07-20-030220

$ oc get machine
NAME                           PHASE      TYPE   REGION   ZONE   AGE
zhsunvs4-ntvg6-master-0        Running                           12h
zhsunvs4-ntvg6-master-1        Running                           12h
zhsunvs4-ntvg6-master-2        Running                           12h
zhsunvs4-ntvg6-worker-mf2gz    Running                           12h
zhsunvs4-ntvg6-worker-v26s8    Running                           12h
zhsunvs4-ntvg6-worker1-tjwtt   Deleting                          11h

$ oc edit machine zhsunvs4-ntvg6-worker1-tjwtt
Warning: providerSpec.numCPUs: 0 is missing or less than the minimum value (2): nodes may not boot correctly
Warning: providerSpec.memoryMiB: 0 is missing or less than the recommended minimum value (2048): nodes may not boot correctly
Warning: providerSpec.diskGiB: 0 is missing or less than the recommended minimum (120): nodes may fail to start if disk size is too low
Warning: providerSpec.credentialsSecret: Invalid value: "aws-cloud-credentials": not found. Expected CredentialsSecret to exist
error: machines.machine.openshift.io "zhsunvs4-ntvg6-worker1-tjwtt" could not be patched: admission webhook "validation.machine.machine.openshift.io" denied the request: [providerSpec.template: Required value: template must be provided, providerSpec.workspace: Required value: workspace must be provided, providerSpec.network.devices: Required value: at least 1 network device must be provided]
You can run `oc replace -f /var/folders/0m/7xwxpmks77n3dm5rr8x8g92r0000gn/T/oc-edit-3841744707.yaml` to try this update again.

Comment 3 Mike Fedosin 2022-08-23 15:48:29 UTC
Hmmm... I was able to delete the finalizer using these commands:

❯ oc delete machines -nopenshift-machine-api   ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-5wl29
machine.machine.openshift.io "ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-5wl29" deleted
^C
❯ oc get machines -nopenshift-machine-api   ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-5wl29 -oyaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: running
  creationTimestamp: "2022-08-23T15:05:26Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-08-23T15:41:05Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-
  generation: 3
  ...
❯ oc patch machines -nopenshift-machine-api   ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-5wl29 -p '{"metadata":{"finalizers":null}}' --type=merge
machine.machine.openshift.io/ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-5wl29 patched
❯ oc get machines -nopenshift-machine-api   ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-5wl29 -oyaml
Error from server (NotFound): machines.machine.openshift.io "ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-5wl29" not found
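
The patch used above can be generated per machine with a small helper. This is a sketch under assumptions: `clear_finalizers_cmd` is a hypothetical name, and it only prints the `oc patch` command so that it can be reviewed before being run. Note that clearing finalizers skips any cleanup the machine controller would otherwise perform.

```shell
# Hypothetical helper: print (do not run) the command that clears all
# finalizers on one Machine via a JSON merge patch.
# Usage: clear_finalizers_cmd <namespace> <machine-name>
clear_finalizers_cmd() {
  printf "oc patch machines -n %s %s --type=merge -p '%s'\n" \
    "$1" "$2" '{"metadata":{"finalizers":null}}'
}
```

For example, `clear_finalizers_cmd openshift-machine-api ci-ln-sc52m12-76ef8-2c4ss-worker-us-west-2b-5wl29` prints the same patch command shown above.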

Comment 4 sunzhaohua 2022-08-26 04:39:19 UTC
@mfedosin I tried again and still couldn't delete the finalizer. Please take a look and let us know whether we need to support this case.

Steps:
1. Set up a 4.5 vSphere cluster (only in 4.5 can we create a machineset with an AWS providerSpec; from 4.6 on, the webhook doesn't allow creating one).
2. Create a new machineset in the vSphere cluster with an AWS provider spec.
3. Delete the machine; it gets stuck in Deleting status.
4. Upgrade the cluster from 4.5 all the way to 4.12, then remove the finalizer.

After upgrading to 4.12, remove the finalizer:
$ oc patch machines -nopenshift-machine-api  zhsun-v25-gl6hs-worker1-jfnrm -p '{"metadata":{"finalizers":null}}' --type=merge
Warning: providerSpec.numCPUs: 0 is missing or less than the minimum value (2): nodes may not boot correctly
Warning: providerSpec.memoryMiB: 0 is missing or less than the recommended minimum value (2048): nodes may not boot correctly
Warning: providerSpec.diskGiB: 0 is missing or less than the recommended minimum (120): nodes may fail to start if disk size is too low
Warning: providerSpec.credentialsSecret: Invalid value: "aws-cloud-credentials": not found. Expected CredentialsSecret to exist
Error from server ([providerSpec.template: Required value: template must be provided, providerSpec.workspace: Required value: workspace must be provided, providerSpec.network.devices: Required value: at least 1 network device must be provided]): admission webhook "validation.machine.machine.openshift.io" denied the request: [providerSpec.template: Required value: template must be provided, providerSpec.workspace: Required value: workspace must be provided, providerSpec.network.devices: Required value: at least 1 network device must be provided]

$ oc get machine                                                                                  
NAME                            PHASE      TYPE   REGION   ZONE   AGE
zhsun-v25-gl6hs-master-0        Running                           28h
zhsun-v25-gl6hs-master-1        Running                           28h
zhsun-v25-gl6hs-master-2        Running                           28h
zhsun-v25-gl6hs-worker-cnw54    Running                           27h
zhsun-v25-gl6hs-worker-jc2cl    Running                           27h
zhsun-v25-gl6hs-worker1-jfnrm   Deleting                          27h
zhsun-v25-gl6hs-worker1-v8mqq                                     27h
zhsun-v25-gl6hs-worker2-6cq9n                                     27h
zhsun-v25-gl6hs-worker2-k96pd   Deleting                          27h
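
The denial message above is long and repeats itself; the following sketch reduces it to the list of fields the webhook reported as missing. The function name `missing_fields` is hypothetical, and the pattern assumes the `<field>: Required value` wording seen in this report.

```shell
# Hypothetical helper: list the providerSpec fields that the webhook
# reported as "Required value", one per line, de-duplicated and sorted.
# Reads the command's error output on stdin.
missing_fields() {
  grep -oE 'providerSpec(\.[A-Za-z]+)+: Required value' \
    | sed 's/: Required value$//' \
    | LC_ALL=C sort -u
}
```

Because `oc` writes the denial to stderr, the output has to be redirected first, for example: `oc patch ... 2>&1 | missing_fields`.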

Comment 6 sunzhaohua 2022-09-09 15:21:01 UTC
Verified
clusterversion: 4.12.0-0.ci-2022-09-09-121216
Following the same steps as in Comment 4, the machine could be deleted.

$ oc get machine                                                                                      
NAME                            PHASE      TYPE   REGION   ZONE   AGE
zhsunvs99-vkmns-master-0        Running                           13h
zhsunvs99-vkmns-master-1        Running                           13h
zhsunvs99-vkmns-master-2        Running                           13h
zhsunvs99-vkmns-worker-8cqnc    Running                           13h
zhsunvs99-vkmns-worker-99v99    Running                           13h
zhsunvs99-vkmns-worker1-2swvx                                     13h
zhsunvs99-vkmns-worker1-h5nn2   Deleting                          13h
zhsunvs99-vkmns-worker2-2dbmd   Deleting                          13h
zhsunvs99-vkmns-worker2-ffxwp                                     13h

$ oc patch machines -nopenshift-machine-api  zhsunvs99-vkmns-worker1-h5nn2 -p '{"metadata":{"finalizers":null}}' --type=merge
machine.machine.openshift.io/zhsunvs99-vkmns-worker1-h5nn2 patched

$ oc get machine                                                                                    
NAME                            PHASE      TYPE   REGION   ZONE   AGE
zhsunvs99-vkmns-master-0        Running                           13h
zhsunvs99-vkmns-master-1        Running                           13h
zhsunvs99-vkmns-master-2        Running                           13h
zhsunvs99-vkmns-worker-8cqnc    Running                           13h
zhsunvs99-vkmns-worker-99v99    Running                           13h
zhsunvs99-vkmns-worker1-2swvx                                     13h
zhsunvs99-vkmns-worker2-2dbmd   Deleting                          13h
zhsunvs99-vkmns-worker2-ffxwp                                     13h

Comment 9 errata-xmlrpc 2023-01-17 19:50:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399