Bug 1878499 - DV import doesn't recover from scratch space PVC deletion
Summary: DV import doesn't recover from scratch space PVC deletion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 2.6.0
Assignee: Bartosz Rybacki
QA Contact: Alex Kalenyuk
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-13 13:52 UTC by Alex Kalenyuk
Modified: 2021-03-10 11:19 UTC
CC List: 4 users

Fixed In Version: virt-cdi-importer 2.6.0-14
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-10 11:18:00 UTC
Target Upstream Version:
Embargoed:


Attachments
Logs (26.16 KB, text/plain)
2020-09-13 13:52 UTC, Alex Kalenyuk


Links:
Red Hat Product Errata RHSA-2021:0799 (last updated 2021-03-10 11:19:14 UTC)

Description Alex Kalenyuk 2020-09-13 13:52:13 UTC
Created attachment 1714686 [details]
Logs

Description of problem:
Deleting the scratch space PVC used by an import operation can break the import so that it never completes.

Version-Release number of selected component (if applicable):
2.4.1

How reproducible:
Hard to reproduce, timing-related
(Initially hit this in tier 2 automation, then reproduced manually)

Steps to Reproduce:
1. Import DV that requires scratch space PVC
2. Delete the scratch space PVC that was created

Actual results:
Import operation freezes, does not complete

Expected results:
Import operation still succeeds

Additional info:
Logs attached to bug as a file


dv.yaml:
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: deleting-scratch-pvc
spec:
  source:
    http:
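      # xz-compressed source image; this import requires a scratch space PVC (see "Steps to Reproduce")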
      url: "http://PATH/cirros-0.4.0-x86_64-disk.qcow2.xz"
  pvc:
    volumeMode: Filesystem
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: hostpath-provisioner
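
For reference, a minimal client-go sketch of step 2 from "Steps to Reproduce". The namespace and the "<dv-name>-scratch" naming are assumptions for illustration, not values taken from the attached logs:

// delete-scratch.go (illustrative only): delete the scratch PVC that CDI
// creates for the import. The bug is timing-related, so the deletion has to
// land while the importer pod is still Pending / not yet scheduled.
package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the default kubeconfig (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    namespace := "default"                       // assumption: namespace of the DV
    scratchPVC := "deleting-scratch-pvc-scratch" // assumption: scratch PVC is named "<dv-name>-scratch"

    // Delete the scratch PVC as early as possible after CDI creates it.
    if err := client.CoreV1().PersistentVolumeClaims(namespace).Delete(
        context.TODO(), scratchPVC, metav1.DeleteOptions{}); err != nil {
        log.Fatal(err)
    }
    fmt.Println("deleted scratch PVC", scratchPVC)
}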


We also have these TCs:
https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-2328
https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-2327 
These cover the deletion of the scratch space PVC and expect the operation to complete successfully regardless.

Comment 1 Adam Litke 2020-09-17 13:34:27 UTC
@Bartosz please take a look.

Comment 2 Bartosz Rybacki 2020-09-21 09:12:00 UTC
I am on it, trying to recreate.

Comment 3 Bartosz Rybacki 2020-09-22 11:07:15 UTC
Recreated successfully, but the scratch PVC needs to be removed right after it is created, while the pod has not yet been scheduled/started. The pod then becomes Unschedulable and the controller does not handle that state correctly.

 "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-09-22T11:01:03Z",
                "message": "persistentvolumeclaim \"scratch-space-delete-scratch\" not found",
                "reason": "Unschedulable",
                "status": "False",
                "type": "PodScheduled"
            }
        ],
        "phase": "Pending",
        "qosClass": "BestEffort"
    }
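
For clarity, a minimal Go sketch (not actual CDI code; the helper name is made up) of detecting exactly this state, i.e. an importer pod stuck Pending with PodScheduled=False and reason Unschedulable:

package importcontroller

import (
    corev1 "k8s.io/api/core/v1"
)

// isUnschedulablePending reports whether a pod is Pending with the
// PodScheduled condition False and reason Unschedulable, which is the state
// the 2.4.1 controller never reacted to after the scratch PVC was deleted.
func isUnschedulablePending(pod *corev1.Pod) bool {
    if pod.Status.Phase != corev1.PodPending {
        return false
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodScheduled &&
            cond.Status == corev1.ConditionFalse &&
            cond.Reason == corev1.PodReasonUnschedulable {
            return true
        }
    }
    return false
}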

Comment 4 Bartosz Rybacki 2020-10-01 11:59:50 UTC
After requesting the creation of the pod and the scratch space PVC:
1. The Pod is being created, so it shows in the system as Pending (waiting for all of its PVCs to become available). The controller observes this and tries to create the scratch PVC again, but it already exists (also Pending), so the controller sets the "Claim Pending" condition and returns.

2. Some external action removes the scratch PVC (if I am not mistaken, the "kubernetes.io/pvc-protection" finalizer does not protect it while the pod is Pending/not yet scheduled). Now the only thing the controller sees is a PVC event, but the scratch PVC is not found, so the controller returns.

No further events arrive for the pod (it stays Pending, no changes).

Comment 6 Adam Litke 2020-10-12 18:57:03 UTC
Not a blocker for 2.5.  Pushing out.

Comment 7 Maya Rashish 2020-11-29 15:17:57 UTC
We had some problems with the downstream builds; there is no -11 (or newer) build available.

Comment 8 Maya Rashish 2020-12-01 11:32:16 UTC
Build should work now (thanks to Gal Ben Haim!)

Comment 9 Bartosz Rybacki 2020-12-15 17:19:31 UTC
I've received information from @akalenyu that the original problem no longer shows up.

The original problem was that whenever the scratch PVC was deleted while the pod was still Pending, the system ended up with a pending importer pod, no scratch space PVC, and an import controller that was not reconciling the situation (it would only do so after the resync period of 10 hours).

After the fix is applied, the import controller keeps requeueing the reconcile loop until the DV reaches the Succeeded or Failed state, so in this situation the scratch PVC is recreated. This was proven by running the tests.
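
For illustration, a minimal sketch of that requeue behaviour using the controller-runtime types the import controller is built on. This is not the actual CDI code; the phase constants and function name are stand-ins:

package importcontroller

import (
    "time"

    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Phase is a stand-in for the DataVolume phase string.
type Phase string

const (
    PhaseSucceeded Phase = "Succeeded"
    PhaseFailed    Phase = "Failed"
)

// requeueUntilTerminal keeps the DV in the work queue until it reaches a
// terminal phase, so a scratch PVC deleted mid-import is noticed and
// recreated on a later pass instead of waiting for the 10h resync.
func requeueUntilTerminal(phase Phase) reconcile.Result {
    if phase == PhaseSucceeded || phase == PhaseFailed {
        // Terminal: stop requeueing.
        return reconcile.Result{}
    }
    // Not terminal yet: ask controller-runtime to run Reconcile again soon.
    return reconcile.Result{RequeueAfter: 2 * time.Second}
}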


We have now discovered that the test fails once in many runs. Analyzing the logs shows the following situation:

PVC:
test-scratch status=Terminating

POD:
importer-test status=ContainerCreating, and the last event shows:
  Type     Reason                  Age                    From                     Message
  Warning  FailedMount             4m (x8 over 8m28s)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[cdi-scratch-vol], unattached volumes=[cdi-scratch-vol]: error processing PVC test-scratch: PVC is being deleted

This looks exactly like:
https://bugzilla.redhat.com/show_bug.cgi?id=1570606

To resolve this, the user can recreate the DV. I am not sure we can/should detect this situation and try to resolve it automatically.
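
As a small illustrative helper (not part of CDI; the function name is hypothetical), spotting that state amounts to checking for a deletion timestamp plus remaining finalizers on the scratch PVC:

package importcontroller

import (
    corev1 "k8s.io/api/core/v1"
)

// scratchPVCTerminating reports whether the scratch PVC has been marked for
// deletion but is still held back by finalizers (typically
// kubernetes.io/pvc-protection), which kubectl shows as status "Terminating".
func scratchPVCTerminating(pvc *corev1.PersistentVolumeClaim) bool {
    return pvc.DeletionTimestamp != nil && len(pvc.Finalizers) > 0
}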

Comment 13 errata-xmlrpc 2021-03-10 11:18:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799

