Description of problem:
CNV occasionally cannot be removed due to leftover DataImportCrons.

Workflow:
1. On installation, HCO creates the SSP and CDI CRs, and adds the DataImportCronTemplates to the SSP CR.
2. CDI deploys the DataImportCron CRD.
3. CDI deploys the DataImportCron CRs with a finalizer.
...
4. The cluster admin deletes the HyperConverged CR.
5. HCO deletes the CDI and SSP CRs.
6. CDI removes the DataImportCron CRD => the Kubernetes API server tries to remove the DataImportCron CRs.
7. Everything is stuck waiting for the CDI finalizers on the DataImportCron CRs: CDI can't complete the deletion, so HCO can't either, and so on.

Symptoms:
The DataImportCron CRD has a deletionTimestamp, but its status also contains:

  - lastTransitionTime: "2022-01-05T06:12:17Z"
    message: CustomResource deletion is in progress
    reason: InstanceDeletionInProgress
    status: "True"
    type: Terminating

and indeed we still see CRs like this on the cluster:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataImportCron
metadata:
  annotations:
    cdi.kubevirt.io/storage.import.sourceDesiredDigest: sha256:3b3fc310abeadd38655f2886feffcab5a37c9c3272d95b3147090c3710d8dd72
    operator-sdk/primary-resource: kubevirt-hyperconverged/ssp-kubevirt-hyperconverged
    operator-sdk/primary-resource-type: SSP.ssp.kubevirt.io
  creationTimestamp: "2022-01-05T06:12:17Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-01-05T06:12:18Z"
  finalizers:
  - cdi.kubevirt.io/dataImportCronFinalizer
  generation: 3
  labels:
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/name: data-sources
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 100.0.0
  name: fedora-image-cron
  namespace: kubevirt-os-images
  resourceVersion: "47236"
  uid: 9ad4f031-fea4-43de-a98f-681de80b2e67
spec:
  garbageCollect: Outdated
  managedDataSource: fedora
  schedule: 46 5/12 * * *
  template:
    metadata: {}
    spec:
      source:
        registry:
          url: docker://quay.io/containerdisks/fedora:35
      storage:
        resources:
          requests:
            storage: 5Gi
    status: {}
status:
  conditions:
  - lastHeartbeatTime: "2022-01-05T06:12:17Z"
    lastTransitionTime: "2022-01-05T06:12:17Z"
    message: No current import
    reason: NoImport
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2022-01-05T06:12:17Z"
    lastTransitionTime: "2022-01-05T06:12:17Z"
    message: No source digest
    reason: NoDigest
    status: "False"
    type: UpToDate
  lastExecutionTimestamp: "2022-01-05T06:12:17Z"

These CRs also carry a deletionTimestamp, but they cannot be deleted due to:

  finalizers:
  - cdi.kubevirt.io/dataImportCronFinalizer

In the CDI operator logs we see:

2022-01-05T06:12:20.849536340Z {"level":"info","ts":1641363140.8482242,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:12:20.849536340Z {"level":"info","ts":1641363140.848256,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:12:20.888525153Z {"level":"info","ts":1641363140.888466,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:12:20.888554595Z {"level":"info","ts":1641363140.888534,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:12:22.738656544Z {"level":"info","ts":1641363142.7385948,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
...
2022-01-05T06:22:19.066036239Z {"level":"info","ts":1641363739.0659866,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:22:19.066036239Z {"level":"info","ts":1641363739.06603,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:22:19.157080691Z {"level":"info","ts":1641363739.1570394,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:22:19.157130708Z {"level":"info","ts":1641363739.157093,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}

looping forever.

Version-Release number of selected component (if applicable):
CDI: v1.42.1

How reproducible:
Not systematic; roughly 75% (?). Maybe it depends on the status of a pending import job; still unclear.

Steps to Reproduce:
1. Deploy CNV.
2. Remove the HCO CR.

Actual results:
The HCO CR gets stuck being deleted, because the CDI CR is stuck due to stuck DataImportCrons.

Expected results:
CNV can be successfully removed.

Additional info:
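The deadlock in steps 6-7 hinges on standard Kubernetes finalizer semantics: a delete request only sets deletionTimestamp, and the object is actually removed from storage only once its finalizers list is empty; if the controller responsible for clearing the finalizer is already gone, the object hangs forever. The toy model below (hypothetical names, not the real API machinery) illustrates just that mechanism:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Obj:
    name: str
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[str] = None

class FakeAPIServer:
    """Toy stand-in for the API server's finalizer handling (illustration only)."""

    def __init__(self) -> None:
        self.store: Dict[str, Obj] = {}

    def create(self, obj: Obj) -> None:
        self.store[obj.name] = obj

    def delete(self, name: str) -> None:
        # A delete request only *marks* the object for deletion.
        obj = self.store.get(name)
        if obj is not None:
            obj.deletion_timestamp = "2022-01-05T06:12:18Z"
            self._finalize(obj)

    def remove_finalizer(self, name: str, finalizer: str) -> None:
        # Normally done by the owning controller (here: the CDI controller).
        obj = self.store.get(name)
        if obj is not None and finalizer in obj.finalizers:
            obj.finalizers.remove(finalizer)
            self._finalize(obj)

    def _finalize(self, obj: Obj) -> None:
        # Actual removal happens only once no finalizers remain.
        if obj.deletion_timestamp is not None and not obj.finalizers:
            del self.store[obj.name]

api = FakeAPIServer()
api.create(Obj("fedora-image-cron", ["cdi.kubevirt.io/dataImportCronFinalizer"]))

api.delete("fedora-image-cron")
stuck = "fedora-image-cron" in api.store   # True: finalizer never cleared

api.remove_finalizer("fedora-image-cron", "cdi.kubevirt.io/dataImportCronFinalizer")
gone = "fedora-image-cron" not in api.store  # True: now actually removed
```

In the bug, the CDI controller that would have called the equivalent of `remove_finalizer` is torn down before the DataImportCron CRs are deleted, so the cluster stays in the `stuck` state indefinitely.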
Another comment: on the SSP CR we see

  deletionTimestamp: "2022-01-05T06:12:17Z"

but in the SSP operator logs we see:

2022-01-05T06:12:18.666513886Z 2022-01-05T06:12:18.666Z DEBUG controllers.SSP Resource create/update failed: create not allowed while custom resource definition is terminating {"ssp": "kubevirt-hyperconverged/ssp-kubevirt-hyperconverged"}
2022-01-05T06:12:18.666513886Z 2022-01-05T06:12:18.666Z DEBUG controllers.SSP Operand reconciliation failed: create not allowed while custom resource definition is terminating {"ssp": "kubevirt-hyperconverged/ssp-kubevirt-hyperconverged"}
2022-01-05T06:12:18.987728266Z 2022-01-05T06:12:18.986Z ERROR controllers.SSP Error updating SSP status. {"ssp": "kubevirt-hyperconverged/ssp-kubevirt-hyperconverged", "error": "Operation cannot be fulfilled on ssps.ssp.kubevirt.io \"ssp-kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again"}
2022-01-05T06:12:18.987728266Z kubevirt.io/ssp-operator/controllers.(*sspReconciler).Reconcile
2022-01-05T06:12:18.987728266Z     /workspace/controllers/ssp_controller.go:200
2022-01-05T06:12:18.987728266Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2022-01-05T06:12:18.987728266Z     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298
2022-01-05T06:12:18.987728266Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2022-01-05T06:12:18.987728266Z     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
2022-01-05T06:12:18.987728266Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2022-01-05T06:12:18.987728266Z     /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
2022-01-05T06:12:18.987728266Z 2022-01-05T06:12:18.986Z ERROR controller-runtime.manager.controller.ssp Reconciler error {"reconciler group": "ssp.kubevirt.io", "reconciler kind": "SSP", "name": "ssp-kubevirt-hyperconverged", "namespace": "kubevirt-hyperconverged", "error": "create not allowed while custom resource definition is terminating"}

after that. On the "problematic" DataImportCron CR we see:

  creationTimestamp: "2022-01-05T06:12:17Z"
  deletionTimestamp: "2022-01-05T06:12:18Z"

while the deletion process on the HCO CR started at:

  deletionTimestamp: "2022-01-05T06:12:17Z"

So I see at least two different issues here:
1. SSP is continuously trying to recreate the DataImportCron CRs even though it is already in the deletion phase.
2. CDI removed the finalizer on the initial DataImportCron CR, which indeed got successfully deleted; but then SSP created another one, CDI added its finalizer back (even though this one was already in the deleting phase as well), and then never removed it.
Arnon, please take a look.
Tested on CNV v4.10.0-605:

1) After deleting the CDI CR, the DataImportCrons could be cleaned up. After a few minutes CDI is re-created, and the DataImportCrons come back.
2) Deleting the HCO CR still gets stuck; the CDI operator logs show:

{"level":"info","ts":1643033095.3532662,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1643033095.3558974,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1643033095.5176296,"logger":"cdi-operator","msg":"Deleting CRDs and verifing no finalizers","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"error","ts":1643033095.984524,"logger":"cdi-operator","msg":"error invoking callback for","type":"*v1.Deployment","error":"DataImportCron centos-stream8-image-cron has Finalizers [cdi.kubevirt.io/dataImportCronFinalizer]","stacktrace":"kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/callbacks.(*CallbackDispatcher).InvokeCallbacks\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/callbacks/callback-dispatcher.go:157\nkubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler.(*Reconciler).InvokeCallbacks\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler/reconciler.go:549\nkubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler.(*Reconciler).InvokeDeleteCallbacks\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler/reconciler.go:515\nkubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler.(*Reconciler).ReconcileDelete\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler/reconciler.go:713\nkubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler.(*Reconciler).Reconcile\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler/reconciler.go:136\nkubevirt.io/containerized-data-importer/pkg/operator/controller.(*ReconcileCDI).Reconcile\n\t/remote-source/app/pkg/operator/controller/controller.go:233\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":1643033095.984651,"logger":"controller.cdi-operator-controller","msg":"Reconciler error","name":"cdi-kubevirt-hyperconverged","namespace":"","error":"DataImportCron centos-stream8-image-cron has Finalizers [cdi.kubevirt.io/dataImportCronFinalizer]","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
We are still able to reproduce this bug with CDI operator v1.44.0, see: https://github.com/kubevirt/hyperconverged-cluster-operator/pull/1764
Following Simone's comment 4, reopening.
*** Bug 2046429 has been marked as a duplicate of this bug. ***
@nunnatsa and I are having a debug session reproducing it on his cluster. We see there is no need to disable and re-enable the feature; it's enough to just try deleting HCO - we get the same result.
We saw that when we delete the HCO CRD, most openshift-cnv pods (not only the CDI pods; everything except cdi-operator) are deleted by the kubelet before we get to the CDI CR deletion, so there is no controller left to handle the deletion of the DataImportCrons (with their finalizers), which is what causes the stuck state. So far it seems the issue is outside of CDI. As a next step, mhenriks suggested scaling cdi-operator to 0 before deleting the HCO CRD and checking whether the CDI resources are deleted, then scaling cdi-operator back up and seeing what happens. We'll continue debugging it on Sunday, Feb 6.
According to our findings, when hco-operator deletes the CDI CR, Kubernetes deletes its dependent resources, including "cdi-deployment", which is in charge of cleaning up (among other things) the DataImportCron CRs; see https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#ownership-and-finalizers. Thus nothing is left to delete the DIC objects, and the CDI finalizer prevents the CR from being deleted. One potential option is to ask Kubernetes not to delete the dependent objects by using the orphan deletion policy; then cdi-deployment will perform the cleanup procedure, and it will eventually be deleted by cdi-operator.
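The cascading-deletion behavior described above can be sketched with a toy model (not the real kube-controller-manager garbage collector; names are illustrative). With the default cascading policy, deleting an owner also deletes everything holding an ownerReference to it, which is how cdi-deployment disappears along with the CDI CR; with the Orphan policy, the dependent survives and can finish its cleanup:

```python
from typing import Dict, Optional

def delete(store: Dict[str, Optional[str]], name: str,
           propagation_policy: str = "Background") -> None:
    """Toy delete for a name -> owner-name mapping.

    With a cascading policy (the default), dependents of `name` are
    recursively deleted first; with "Orphan", they are left in place.
    """
    if propagation_policy != "Orphan":
        # Cascade: remove everything owned by `name` before `name` itself.
        for dep in [n for n, owner in store.items() if owner == name]:
            delete(store, dep, propagation_policy)
    store.pop(name, None)

# Default cascade: deleting the CDI CR takes cdi-deployment down with it,
# leaving no controller to clean up the DataImportCrons.
cascade = {"cdi-cr": None, "cdi-deployment": "cdi-cr"}
delete(cascade, "cdi-cr")
# cascade is now empty

# Orphan policy: cdi-deployment survives the owner's deletion and can
# perform its cleanup before cdi-operator removes it later.
orphan = {"cdi-cr": None, "cdi-deployment": "cdi-cr"}
delete(orphan, "cdi-cr", propagation_policy="Orphan")
# orphan still contains cdi-deployment
```

For reference, the real equivalent of the second call is an orphaning delete (e.g. `kubectl delete --cascade=orphan`, or a delete request with `propagationPolicy: Orphan`).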
@nunnatsa please take this bz, as it's clearly an HCO issue, out of CDI scope.
Posted a PR to get rid of the DataImportCron finalizer, which is the safest solution here.
The cherry-pick is waiting for approval (already merged in the main branch). I guess the D/S build will be there a few hours after its merge. https://github.com/kubevirt/containerized-data-importer/pull/2157
Tested on CNV-4.10.0-680. The issue has been fixed: deletion of the HCO CR no longer gets stuck, and the DataImportCron, DataSource, and DV resources are all cleaned up successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947