Created attachment 1714686 [details]
Logs

Description of problem:
Deleting the scratch space PVC of an import operation can break the operation.

Version-Release number of selected component (if applicable):
2.4.1

How reproducible:
Hard to reproduce, timing-related (initially hit this in tier 2 automation, then reproduced manually)

Steps to Reproduce:
1. Import a DV that requires a scratch space PVC
2. Delete the scratch space PVC that was created (a sketch of the timing-sensitive part is included at the end of this description)

Actual results:
Import operation freezes and never completes

Expected results:
Import operation still succeeds

Additional info:
Logs attached to the bug as a file.

dv.yaml:

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: deleting-scratch-pvc
spec:
  source:
    http:
      url: "http://PATH/cirros-0.4.0-x86_64-disk.qcow2.xz"
  pvc:
    volumeMode: Filesystem
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: hostpath-provisioner

We also have these TCs:
https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-2328
https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-2327
They cover the deletion of the scratch space PVC and expect the operation to complete successfully regardless.
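For reference, a minimal Go/client-go sketch of the timing-sensitive part of step 2: wait for CDI to create the scratch PVC and delete it immediately, before the importer pod has been scheduled. This is not the tier 2 automation code; the "default" namespace and the scratch PVC name "deleting-scratch-pvc-scratch" (assuming CDI's <target PVC>-scratch naming convention) are assumptions for the dv.yaml above.

    package main

    import (
            "context"
            "fmt"
            "time"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
            cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
            if err != nil {
                    panic(err)
            }
            cs, err := kubernetes.NewForConfig(cfg)
            if err != nil {
                    panic(err)
            }

            const ns = "default"                              // assumption: namespace the DV was created in
            const scratchPVC = "deleting-scratch-pvc-scratch" // assumption: <target PVC>-scratch

            // Wait for CDI to create the scratch PVC, then delete it right away,
            // before the importer pod has been scheduled.
            for {
                    if _, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(context.TODO(), scratchPVC, metav1.GetOptions{}); err == nil {
                            break
                    }
                    time.Sleep(100 * time.Millisecond)
            }
            if err := cs.CoreV1().PersistentVolumeClaims(ns).Delete(context.TODO(), scratchPVC, metav1.DeleteOptions{}); err != nil {
                    panic(err)
            }
            fmt.Println("scratch PVC deleted while the importer pod was still pending")
    }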
@Bartosz please take a look.
I am on it, trying to recreate.
Reproduced it, but the scratch PVC has to be removed just after it is created, while the pod has not been scheduled or started yet. The pod becomes Unschedulable and the controller does not handle that state correctly:

"status": {
    "conditions": [
        {
            "lastProbeTime": null,
            "lastTransitionTime": "2020-09-22T11:01:03Z",
            "message": "persistentvolumeclaim \"scratch-space-delete-scratch\" not found",
            "reason": "Unschedulable",
            "status": "False",
            "type": "PodScheduled"
        }
    ],
    "phase": "Pending",
    "qosClass": "BestEffort"
}
After the creation of the pod and the scratch space PVC has been requested:

1. The pod is being created, so it shows up in the system as Pending (it waits for all its PVCs to be available). The controller observes this and tries to create the scratch PVC again, but the scratch PVC already exists (also Pending), so the controller sets the "Claim Pending" condition and returns.

2. Some external action removes the scratch PVC (if I am not mistaken, the "kubernetes.io/pvc-protection" finalizer does not apply while the pod is Pending/not yet scheduled). Now the only thing the controller sees is a PVC event, but the scratch PVC is not found, so the controller returns. There are no further events for the pod (it stays Pending with no changes), so nothing triggers another reconcile.
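To make the failure mode concrete, here is a minimal controller-runtime sketch of the behavior described above. It is not the actual CDI import controller code; the importReconciler type and the "-scratch" name derivation are illustrative. The point is the early return with an empty Result when the scratch PVC is missing: with the importer pod stuck in Pending, no further watch events arrive, so the reconcile does not run again until the resync period.

    package controller

    import (
            "context"

            corev1 "k8s.io/api/core/v1"
            apierrors "k8s.io/apimachinery/pkg/api/errors"
            "k8s.io/apimachinery/pkg/types"
            "sigs.k8s.io/controller-runtime/pkg/client"
            "sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    // importReconciler is an illustrative stand-in for the import controller.
    type importReconciler struct {
            client client.Client
    }

    func (r *importReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
            // Derive the scratch PVC name from the target PVC in the request
            // (illustrative; assumes the <target PVC>-scratch convention).
            scratchKey := types.NamespacedName{Namespace: req.Namespace, Name: req.Name + "-scratch"}
            scratch := &corev1.PersistentVolumeClaim{}
            err := r.client.Get(ctx, scratchKey, scratch)
            if apierrors.IsNotFound(err) {
                    // Scratch PVC was deleted out from under us. Returning an empty
                    // Result means nothing re-triggers this loop: the importer pod
                    // stays Pending, so no new pod/PVC events arrive until the
                    // resync period (~10 hours).
                    return reconcile.Result{}, nil
            }
            if err != nil {
                    return reconcile.Result{}, err
            }
            // ... normal handling: track the scratch PVC, update DV conditions ...
            return reconcile.Result{}, nil
    }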
https://github.com/kubevirt/containerized-data-importer/pull/1424
Not a blocker for 2.5. Pushing out.
Had some problem with downstream builds. There's no -11 (or newer) available.
Build should work now (thanks to Gal Ben Haim!)
I've got information from @akalenyu that the original problem no longer shows. The original problem was that whenever the scratch PVC was deleted while the pod was still Pending, the system ended up with a pending importer pod and no scratch space, and the import controller was not reconciling this situation (it would only do so after the resync period - 10 hours). With the fix applied, the import controller requeues the reconcile loop until the DV reaches the Succeeded or Failed state, so in this situation the scratch PVC is recreated. This was confirmed by running the tests.

Now we discovered that the test fails once in many runs. Analyzing the logs shows this situation:

PVC: test-scratch, status=Terminating
POD: importer-test, status=ContainerCreating

and the last event shows:

Type     Reason       Age                 From     Message
Warning  FailedMount  4m (x8 over 8m28s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[cdi-scratch-vol], unattached volumes=[cdi-scratch-vol]: error processing PVC test-scratch: PVC is being deleted

This looks exactly like https://bugzilla.redhat.com/show_bug.cgi?id=1570606. To resolve this, the user can recreate the DV. I am not sure we can/should detect this situation and try to resolve it automatically.
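For clarity, a minimal sketch of the requeue pattern described above (a stand-in, not the actual change from the PR): as long as the DV has not reached Succeeded or Failed, the reconcile result asks controller-runtime to run the loop again, so a deleted scratch PVC is noticed and recreated even if no new pod/PVC event ever arrives. The requeueUntilDone helper and the 2-second delay are illustrative.

    package controller

    import (
            "time"

            "sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    // requeueUntilDone decides whether the reconcile loop should run again,
    // based on the DataVolume phase (passed in as a plain string here).
    func requeueUntilDone(dvPhase string) reconcile.Result {
            switch dvPhase {
            case "Succeeded", "Failed":
                    // Terminal phase: nothing left to do, stop requeuing.
                    return reconcile.Result{}
            default:
                    // Still in progress: requeue so a deleted scratch PVC (or any
                    // other missing dependency) is detected and recreated even
                    // when no further pod/PVC watch event arrives.
                    return reconcile.Result{RequeueAfter: 2 * time.Second}
            }
    }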
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0799