Description of problem:
A newly created machine got stuck in the Provisioning phase because of "Insufficient disk space on datastore". Deleting that machine then took about 12 minutes.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-18-164445

How reproducible:
Always

Steps to Reproduce:
1. Create a new machine; it gets stuck in the Provisioning phase because of "Insufficient disk space on datastore".
2. Delete this machine.

Actual results:
It took about 12 minutes to delete the machine stuck in Provisioning.

$ ./oc get machine
NAME                              PHASE          TYPE   REGION   ZONE   AGE
jstuevervcsa-72g5s-master-0       Running                               148m
jstuevervcsa-72g5s-master-1       Running                               148m
jstuevervcsa-72g5s-master-2       Running                               148m
jstuevervcsa-72g5s-worker-4cgzd   Running                               140m
jstuevervcsa-72g5s-worker-bsh6l   Running                               140m
jstuevervcsa-72g5s-worker-cjpkx   Provisioning                          32m
jstuevervcsa-72g5s-worker-l9r5q   Running                               81m
jstuevervcsa-72g5s-worker-qgl87   Provisioning                          29m
jstuevervcsa-72g5s-worker-twhzg   Running                               140m

$ ./oc delete machine jstuevervcsa-72g5s-worker-cjpkx

$ ./oc logs -f machine-api-controllers-64ffb8bcd-z27zb -c machine-controller | grep jstuevervcsa-72g5s-worker-cjpkx
I0119 03:58:02.122325 1 controller.go:312] jstuevervcsa-72g5s-worker-cjpkx: reconciling machine triggers idempotent create
I0119 03:58:02.122330 1 actuator.go:66] jstuevervcsa-72g5s-worker-cjpkx: actuator creating machine
E0119 03:58:02.140637 1 actuator.go:57] jstuevervcsa-72g5s-worker-cjpkx error: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Create machine: Insufficient disk space on datastore ''.
I0119 03:58:02.140680 1 machine_scope.go:102] jstuevervcsa-72g5s-worker-cjpkx: patching machine
W0119 03:58:02.180596 1 controller.go:314] jstuevervcsa-72g5s-worker-cjpkx: failed to create machine: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Create machine: Insufficient disk space on datastore ''.
I0119 03:58:45.298585 1 controller.go:168] jstuevervcsa-72g5s-worker-cjpkx: reconciling Machine
I0119 03:58:45.298726 1 controller.go:426] jstuevervcsa-72g5s-worker-cjpkx: going into phase "Deleting"
I0119 03:58:45.314565 1 controller.go:208] jstuevervcsa-72g5s-worker-cjpkx: reconciling machine triggers delete
I0119 03:58:45.314657 1 actuator.go:150] jstuevervcsa-72g5s-worker-cjpkx: actuator deleting machine
I0119 03:58:45.339695 1 machine_scope.go:102] jstuevervcsa-72g5s-worker-cjpkx: patching machine
E0119 03:58:45.420267 1 actuator.go:57] jstuevervcsa-72g5s-worker-cjpkx error: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Delete machine: Insufficient disk space on datastore ''.
E0119 03:58:45.420501 1 controller.go:229] jstuevervcsa-72g5s-worker-cjpkx: failed to delete machine: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Delete machine: Insufficient disk space on datastore ''.
I0119 03:58:45.452358 1 controller.go:168] jstuevervcsa-72g5s-worker-cjpkx: reconciling Machine
I0119 03:58:45.452497 1 controller.go:208] jstuevervcsa-72g5s-worker-cjpkx: reconciling machine triggers delete
I0119 03:58:45.452532 1 actuator.go:150] jstuevervcsa-72g5s-worker-cjpkx: actuator deleting machine
I0119 03:58:45.475338 1 machine_scope.go:102] jstuevervcsa-72g5s-worker-cjpkx: patching machine
E0119 03:58:45.529736 1 actuator.go:57] jstuevervcsa-72g5s-worker-cjpkx error: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Delete machine: Insufficient disk space on datastore ''.
E0119 03:58:45.529859 1 controller.go:229] jstuevervcsa-72g5s-worker-cjpkx: failed to delete machine: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Delete machine: Insufficient disk space on datastore ''.
I0119 03:59:24.101804 1 controller.go:168] jstuevervcsa-72g5s-worker-cjpkx: reconciling Machine
I0119 03:59:24.101841 1 controller.go:208] jstuevervcsa-72g5s-worker-cjpkx: reconciling machine triggers delete
I0119 03:59:24.101847 1 actuator.go:150] jstuevervcsa-72g5s-worker-cjpkx: actuator deleting machine
I0119 03:59:24.119039 1 machine_scope.go:102] jstuevervcsa-72g5s-worker-cjpkx: patching machine
E0119 03:59:24.156785 1 actuator.go:57] jstuevervcsa-72g5s-worker-cjpkx error: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Delete machine: Insufficient disk space on datastore ''.
E0119 03:59:24.157008 1 controller.go:229] jstuevervcsa-72g5s-worker-cjpkx: failed to delete machine: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Delete machine: Insufficient disk space on datastore ''.
I0119 04:00:09.955183 1 controller.go:168] jstuevervcsa-72g5s-worker-cjpkx: reconciling Machine
I0119 04:00:09.955195 1 controller.go:208] jstuevervcsa-72g5s-worker-cjpkx: reconciling machine triggers delete
I0119 04:00:09.955201 1 actuator.go:150] jstuevervcsa-72g5s-worker-cjpkx: actuator deleting machine
I0119 04:00:09.968497 1 machine_scope.go:102] jstuevervcsa-72g5s-worker-cjpkx: patching machine
E0119 04:00:09.994062 1 actuator.go:57] jstuevervcsa-72g5s-worker-cjpkx error: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Delete machine: Insufficient disk space on datastore ''.
E0119 04:00:09.994147 1 controller.go:229] jstuevervcsa-72g5s-worker-cjpkx: failed to delete machine: jstuevervcsa-72g5s-worker-cjpkx: reconciler failed to Delete machine: Insufficient disk space on datastore ''.
I0119 04:10:19.517488 1 controller.go:168] jstuevervcsa-72g5s-worker-cjpkx: reconciling Machine
I0119 04:10:19.517629 1 controller.go:208] jstuevervcsa-72g5s-worker-cjpkx: reconciling machine triggers delete
I0119 04:10:19.517635 1 actuator.go:150] jstuevervcsa-72g5s-worker-cjpkx: actuator deleting machine
I0119 04:10:19.573565 1 reconciler.go:240] jstuevervcsa-72g5s-worker-cjpkx: vm does not exist
I0119 04:10:19.573690 1 machine_scope.go:102] jstuevervcsa-72g5s-worker-cjpkx: patching machine
I0119 04:10:19.608631 1 actuator.go:109] jstuevervcsa-72g5s-worker-cjpkx: actuator checking if machine exists
I0119 04:10:19.629500 1 reconciler.go:199] jstuevervcsa-72g5s-worker-cjpkx: does not exist
I0119 04:10:19.660077 1 controller.go:260] jstuevervcsa-72g5s-worker-cjpkx: machine deletion successful

Expected results:
Machine could be deleted quickly.

Additional info:
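The "about 12 minutes" figure can be checked against the log prefixes above. The following is only a minimal sketch for that arithmetic: the helper names `parseKlogPrefix` and `elapsedBetween` are made up for illustration, and it assumes the prefixes have the shape seen in this log (severity letter, then MMDD, then a time with microseconds).

```go
package main

import (
	"fmt"
	"time"
)

// klogLayout matches the "MMDD hh:mm:ss.uuuuuu" part of the log prefixes
// in this report (assumption: no year, six fractional digits).
const klogLayout = "0102 15:04:05.000000"

// parseKlogPrefix (hypothetical helper) parses a prefix such as
// "I0119 03:58:45.298726" by dropping the leading severity letter.
func parseKlogPrefix(prefix string) (time.Time, error) {
	if len(prefix) < 2 {
		return time.Time{}, fmt.Errorf("prefix too short: %q", prefix)
	}
	return time.Parse(klogLayout, prefix[1:])
}

// elapsedBetween (hypothetical helper) returns end minus start for two prefixes.
func elapsedBetween(start, end string) (time.Duration, error) {
	s, err := parseKlogPrefix(start)
	if err != nil {
		return 0, err
	}
	e, err := parseKlogPrefix(end)
	if err != nil {
		return 0, err
	}
	return e.Sub(s), nil
}

func main() {
	// First: the line that goes into phase "Deleting".
	// Second: the line that reports "machine deletion successful".
	d, err := elapsedBetween("I0119 03:58:45.298726", "I0119 04:10:19.660077")
	if err != nil {
		panic(err)
	}
	fmt.Println(d)
}
```

For the two prefixes used in `main`, this comes to a little over eleven and a half minutes, which matches the reported duration.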
It's interesting that we can't delete a machine if there's no space on the datastore. Perhaps we need to check that there's space before we attempt to create a VM, and go into Failed if not.
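To make the suggestion concrete, here is a rough sketch of what such a pre-flight check could look like. It is not taken from the machine-api code: `Datastore`, `ErrInsufficientSpace`, `checkDatastoreSpace` and `phaseAfterPreflight` are all hypothetical names, and a real check would have to read the free space from vSphere rather than from a struct field.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientSpace is a hypothetical sentinel error for this sketch.
var ErrInsufficientSpace = errors.New("insufficient disk space on datastore")

// Datastore is a hypothetical, simplified view of a datastore.
type Datastore struct {
	Name      string
	FreeBytes int64
}

// checkDatastoreSpace returns an error wrapping ErrInsufficientSpace when
// the datastore has less free space than the VM would need.
func checkDatastoreSpace(ds Datastore, requiredBytes int64) error {
	if ds.FreeBytes < requiredBytes {
		return fmt.Errorf("datastore %q: need %d bytes, have %d: %w",
			ds.Name, requiredBytes, ds.FreeBytes, ErrInsufficientSpace)
	}
	return nil
}

// phaseAfterPreflight maps the pre-flight result to a machine phase:
// "Failed" when space is known to be insufficient, otherwise "Provisioning".
func phaseAfterPreflight(err error) string {
	if errors.Is(err, ErrInsufficientSpace) {
		return "Failed"
	}
	return "Provisioning"
}

func main() {
	ds := Datastore{Name: "example-datastore", FreeBytes: 10 << 30} // 10 GiB free
	err := checkDatastoreSpace(ds, 120<<30)                          // VM needs 120 GiB
	fmt.Println(phaseAfterPreflight(err), err)
}
```

With a check like this, no create task would be started on a full datastore, so there would be nothing left over for a later delete to trip on.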
The bug is a case where we should set some status on the machine object if we receive an error, rather than a 'machine still exists' response. Aside from that, cluster owners are required to ensure their infrastructure is healthy. I don't think we should be accountable for ensuring enough space exists on the infrastructure. The API will tell us when there isn't, and that's the check.
I think adding some health checks on the datacenter, to prevent us from trying to create machines on unhealthy datacenters, may be useful. I will see if someone has time to look at this next sprint.
Verified on clusterversion: 4.8.0-0.nightly-2021-05-21-233425
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438