Bug 2037312 - CNV occasionally cannot be removed due to leftovers dataImportCrons
Summary: CNV occasionally cannot be removed due to leftovers dataImportCrons
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: 4.10.0
Assignee: Arnon Gilboa
QA Contact: Yan Du
URL:
Whiteboard:
: 2046429 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-01-05 12:24 UTC by Simone Tiraboschi
Modified: 2022-03-16 16:06 UTC (History)
5 users (show)

Fixed In Version: CNV v4.10.0-676
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 16:06:09 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt containerized-data-importer pull 2097 0 None Merged Delete CRDs & check finalizers in operator delete 2022-01-20 10:31:11 UTC
Github kubevirt containerized-data-importer pull 2114 0 None Merged [release-v1.43] Delete CRDs & check finalizers in operator delete 2022-01-20 10:33:40 UTC
Github kubevirt containerized-data-importer pull 2129 0 None Merged Fix CRDs deletion in operator deletion 2022-01-29 20:28:44 UTC
Github kubevirt containerized-data-importer pull 2130 0 None Merged [release-v1.43] Fix CRDs deletion in operator deletion 2022-01-29 20:33:15 UTC
Github kubevirt containerized-data-importer pull 2144 0 None Merged Get rid of DataImportCron finalizer 2022-02-13 11:17:39 UTC
Github kubevirt containerized-data-importer pull 2157 0 None Merged [release-v1.43] Get rid of DataImportCron finalizer 2022-02-14 13:17:33 UTC
Red Hat Product Errata RHSA-2022:0947 0 None Closed RHEL EUS Errata Documentation 2022-06-16 06:26:55 UTC

Description Simone Tiraboschi 2022-01-05 12:24:26 UTC
Description of problem:
CNV occasionally cannot be removed due to leftovers dataImportCrons:

workflow:
1. On installation, HCO creates the SSP and the CDI CRs, and add the DataImportCronTemplates to the SSP CRs.
2. CDI deploys the DataImportCron CRDs
3. CDI deploys the DataImportCron CRs with a finalizer
...
4. The cluster admin deletes the HyperConverged CR
5. HCO deletes the CDI and SSP CRs.
6. CDI removes the DataImportCron CRD => K8s APIServer tries to remove the DataImportCron CRs
7. everything is stuck waiting for the CDI finalizers on DataImportCron CRs. CDI can't complete the deletion, HCO can't too and so on


Symptoms:
DataImportCron CRD contains a deletionTimestamp
but is also contains:

  - lastTransitionTime: "2022-01-05T06:12:17Z"
    message: CustomResource deletion is in progress
    reason: InstanceDeletionInProgress
    status: "True"
    type: Terminating

and indeed we still see on the cluster CRs like:

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataImportCron
metadata:
  annotations:
    cdi.kubevirt.io/storage.import.sourceDesiredDigest: sha256:3b3fc310abeadd38655f2886feffcab5a37c9c3272d95b3147090c3710d8dd72
    operator-sdk/primary-resource: kubevirt-hyperconverged/ssp-kubevirt-hyperconverged
    operator-sdk/primary-resource-type: SSP.ssp.kubevirt.io
  creationTimestamp: "2022-01-05T06:12:17Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-01-05T06:12:18Z"
  finalizers:
  - cdi.kubevirt.io/dataImportCronFinalizer
  generation: 3
  labels:
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/name: data-sources
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 100.0.0
  name: fedora-image-cron
  namespace: kubevirt-os-images
  resourceVersion: "47236"
  uid: 9ad4f031-fea4-43de-a98f-681de80b2e67
spec:
  garbageCollect: Outdated
  managedDataSource: fedora
  schedule: 46 5/12 * * *
  template:
    metadata: {}
    spec:
      source:
        registry:
          url: docker://quay.io/containerdisks/fedora:35
      storage:
        resources:
          requests:
            storage: 5Gi
    status: {}
status:
  conditions:
  - lastHeartbeatTime: "2022-01-05T06:12:17Z"
    lastTransitionTime: "2022-01-05T06:12:17Z"
    message: No current import
    reason: NoImport
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2022-01-05T06:12:17Z"
    lastTransitionTime: "2022-01-05T06:12:17Z"
    message: No source digest
    reason: NoDigest
    status: "False"
    type: UpToDate
  lastExecutionTimestamp: "2022-01-05T06:12:17Z"


that also contains a deletionTimestamp but they cannot be deleted due to:

  finalizers:
  - cdi.kubevirt.io/dataImportCronFinalizer

on CDI operator logs we see:

2022-01-05T06:12:20.849536340Z {"level":"info","ts":1641363140.8482242,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:12:20.849536340Z {"level":"info","ts":1641363140.848256,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:12:20.888525153Z {"level":"info","ts":1641363140.888466,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:12:20.888554595Z {"level":"info","ts":1641363140.888534,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:12:22.738656544Z {"level":"info","ts":1641363142.7385948,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
...
2022-01-05T06:22:19.066036239Z {"level":"info","ts":1641363739.0659866,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:22:19.066036239Z {"level":"info","ts":1641363739.06603,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:22:19.157080691Z {"level":"info","ts":1641363739.1570394,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
2022-01-05T06:22:19.157130708Z {"level":"info","ts":1641363739.157093,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}

looping forever.



Version-Release number of selected component (if applicable):
CDI: v1.42.1

How reproducible:
no systematic, 75% (?)
maybe it depends on the status of a pending import job? still unclear

Steps to Reproduce:
1. deploy CNV
2. remove HCO CR
3.

Actual results:
HCO CR gets stuck being deleted because CDI CR is stuck due to stuck dataImportCrons  

Expected results:
CNV can be successfully removed

Additional info:

Comment 1 Simone Tiraboschi 2022-01-05 13:30:06 UTC
Another comment,
on the SSP CR we see deletionTimestamp: "2022-01-05T06:12:17Z"

but on the SSP operator logs we see:

2022-01-05T06:12:18.666513886Z 2022-01-05T06:12:18.666Z	DEBUG	controllers.SSP	Resource create/update failed: create not allowed while custom resource definition is terminating	{"ssp": "kubevirt-hyperconverged/ssp-kubevirt-hyperconverged"}
2022-01-05T06:12:18.666513886Z 2022-01-05T06:12:18.666Z	DEBUG	controllers.SSP	Operand reconciliation failed: create not allowed while custom resource definition is terminating	{"ssp": "kubevirt-hyperconverged/ssp-kubevirt-hyperconverged"}
2022-01-05T06:12:18.987728266Z 2022-01-05T06:12:18.986Z	ERROR	controllers.SSP	Error updating SSP status.	{"ssp": "kubevirt-hyperconverged/ssp-kubevirt-hyperconverged", "error": "Operation cannot be fulfilled on ssps.ssp.kubevirt.io \"ssp-kubevirt-hyperconverged\": the object has been modified; please apply your changes to the latest version and try again"}
2022-01-05T06:12:18.987728266Z kubevirt.io/ssp-operator/controllers.(*sspReconciler).Reconcile
2022-01-05T06:12:18.987728266Z 	/workspace/controllers/ssp_controller.go:200
2022-01-05T06:12:18.987728266Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2022-01-05T06:12:18.987728266Z 	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298
2022-01-05T06:12:18.987728266Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2022-01-05T06:12:18.987728266Z 	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
2022-01-05T06:12:18.987728266Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2022-01-05T06:12:18.987728266Z 	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
2022-01-05T06:12:18.987728266Z 2022-01-05T06:12:18.986Z	ERROR	controller-runtime.manager.controller.ssp	Reconciler error	{"reconciler group": "ssp.kubevirt.io", "reconciler kind": "SSP", "name": "ssp-kubevirt-hyperconverged", "namespace": "kubevirt-hyperconverged", "error": "create not allowed while custom resource definition is terminating"}

after that.

And on the "problematic" DataImportCron CR we see:

  creationTimestamp: "2022-01-05T06:12:17Z"
  deletionTimestamp: "2022-01-05T06:12:18Z"

and the deletion process on HCO CR started at:
    deletionTimestamp: "2022-01-05T06:12:17Z"

So I see at least two different issues here:
1. SSP is continuously trying to recreate the DataImportCron CRs even if already in the deletion phase
2. CDI removed the finalizer on the initial DataImportCron CR that indeed got successfully deleted, but then SSP created another one, CDI added back its finalizer on it (even if already in the deleting phase as well) but then never removes it.

Comment 2 Adam Litke 2022-01-10 12:39:23 UTC
Arnon, please take a look.

Comment 3 Yan Du 2022-01-24 14:59:42 UTC
Test on CNV v4.10.0-605

1) Deleting CDI CR, dataimportcrons could be cleaned. After a few minutes, CDI will be re-created, and the dataimportcrons will be back.

2) Deleting HCO CR, still got stuck, got below log from CDI operator:

{"level":"info","ts":1643033095.3532662,"logger":"cdi-operator","msg":"Reconciling CDI","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1643033095.3558974,"logger":"cdi-operator","msg":"Doing reconcile delete","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"info","ts":1643033095.5176296,"logger":"cdi-operator","msg":"Deleting CRDs and verifing no finalizers","Request.Namespace":"","Request.Name":"cdi-kubevirt-hyperconverged"}
{"level":"error","ts":1643033095.984524,"logger":"cdi-operator","msg":"error invoking callback for","type":"*v1.Deployment","error":"DataImportCron centos-stream8-image-cron has Finalizers [cdi.kubevirt.io/dataImportCronFinalizer]","stacktrace":"kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/callbacks.(*CallbackDispatcher).InvokeCallbacks\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/callbacks/callback-dispatcher.go:157\nkubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler.(*Reconciler).InvokeCallbacks\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler/reconciler.go:549\nkubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler.(*Reconciler).InvokeDeleteCallbacks\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler/reconciler.go:515\nkubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler.(*Reconciler).ReconcileDelete\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler/reconciler.go:713\nkubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler.(*Reconciler).Reconcile\n\t/remote-source/app/vendor/kubevirt.io/controller-lifecycle-operator-sdk/pkg/sdk/reconciler/reconciler.go:136\nkubevirt.io/containerized-data-importer/pkg/operator/controller.(*ReconcileCDI).Reconcile\n\t/remote-source/app/pkg/operator/controller/controller.go:233\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":1643033095.984651,"logger":"controller.cdi-operator-controller","msg":"Reconciler error","name":"cdi-kubevirt-hyperconverged","namespace":"","error":"DataImportCron centos-stream8-image-cron has Finalizers [cdi.kubevirt.io/dataImportCronFinalizer]","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

Comment 4 Simone Tiraboschi 2022-02-02 08:36:20 UTC
We are still able to reproduce this bug with CDI operator v1.44.0, see: https://github.com/kubevirt/hyperconverged-cluster-operator/pull/1764

Comment 5 Natalie Gavrielov 2022-02-03 11:40:14 UTC
Following Simone's comment 4, reopening.

Comment 6 Arnon Gilboa 2022-02-03 15:04:33 UTC
*** Bug 2046429 has been marked as a duplicate of this bug. ***

Comment 7 Arnon Gilboa 2022-02-03 15:25:22 UTC
@nunnatsa and me are having a debug session reproducing it on his cluster.
We see there is no need to disable and re-enable the feature.
It's enough to just try deleteing HCO - getting the same result.

Comment 8 Arnon Gilboa 2022-02-04 09:22:16 UTC
We saw that when we delete HCO CRD, most openshift-cnv pods (not only CDI pods except cdi-operator) are deleted by the kubelet before we get to CDI CR deletion, so there is no controller to handle the DataImportCrons (with finalizers) deletion, which causes the stuck. So far, it seems the issue is out of CDI.

As next step, mhenriks suggested to scale cdi-operator to 0 before deleting HCO CRD and see if the CDI resources are deleted. Then try scaling up cdi-operator and see what happens.

We'll continue debugging it in Sunday Feb 6

Comment 9 Oren Cohen 2022-02-07 11:31:22 UTC
According to our findings, when hco-operator deletes the CDI CR, kubernetes deletes its dependents resources, including "cdi-deployment" which is in-charged of cleaning up (among other things) the DataImportCron CRs.
https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#ownership-and-finalizers

Thus, nothing will then delete the DIC objects and the CDI finalizer will prevent the CR to be deleted.

One potential option is to request kubernetes not to delete the dependent objects, using orphan deletion policy, then cdi-deployment will perform the cleanup procedure, and it will be eventually deleted by cdi-operator.

Comment 10 Arnon Gilboa 2022-02-07 13:00:29 UTC
@nunnatsa please take this bz, as it's clearly an HCO issue, out of CDI scope.

Comment 11 Arnon Gilboa 2022-02-08 16:08:38 UTC
Posted PR to get rid of the DataImportCron finalizer, which is safest solution here.

Comment 12 Arnon Gilboa 2022-02-14 10:42:32 UTC
Cherry-pick is waiting for approval (already merged in main branch).
I guess D/S build will be there a few hours after its merge.

https://github.com/kubevirt/containerized-data-importer/pull/2157

Comment 13 Yan Du 2022-02-16 12:13:11 UTC
Test on CNV-4.10.0-680, 
Issue have been fixed, didn't meet stuck when deleting HCO CR, 
and dataimportcron, datasource, dv are all cleaned successfully.

Comment 16 errata-xmlrpc 2022-03-16 16:06:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947


Note You need to log in before you can comment on or make changes to this bug.