Bug 1878499 - DV import doesn't recover from scratch space PVC deletion
Summary: DV import doesn't recover from scratch space PVC deletion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 2.6.0
Assignee: Bartosz Rybacki
QA Contact: Alex Kalenyuk
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-13 13:52 UTC by Alex Kalenyuk
Modified: 2021-03-10 11:19 UTC
CC List: 4 users

Fixed In Version: virt-cdi-importer 2.6.0-14
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-10 11:18:00 UTC
Target Upstream Version:
Embargoed:


Attachments
Logs (26.16 KB, text/plain)
2020-09-13 13:52 UTC, Alex Kalenyuk


Links:
Red Hat Product Errata RHSA-2021:0799 (last updated 2021-03-10 11:19:14 UTC)

Description Alex Kalenyuk 2020-09-13 13:52:13 UTC
Created attachment 1714686 [details]
Logs

Description of problem:
Deleting the scratch space PVC used by an import operation can break the import so that it never completes.

Version-Release number of selected component (if applicable):
2.4.1

How reproducible:
Hard to reproduce, timing-related
(Initially hit this in tier 2 automation, then reproduced manually)

Steps to Reproduce:
1. Import DV that requires scratch space PVC
2. Delete the scratch space PVC that was created

Actual results:
Import operation freezes, does not complete

Expected results:
Import operation still succeeds

Additional info:
Logs attached to bug as a file


dv.yaml:
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: deleting-scratch-pvc
spec:
  source:
    http:
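      # xz-compressed source image; this import requires a scratch space PVC (see "Steps to Reproduce")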
      url: "http://PATH/cirros-0.4.0-x86_64-disk.qcow2.xz"
  pvc:
    volumeMode: Filesystem
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 2Gi
    storageClassName: hostpath-provisioner
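
For reference, a minimal client-go sketch of step 2 from "Steps to Reproduce". The namespace and the "<dv-name>-scratch" naming are assumptions for illustration, not values taken from the attached logs:

// delete-scratch.go (illustrative only): delete the scratch PVC that CDI
// creates for the import. The bug is timing-related, so the deletion has to
// land while the importer pod is still Pending / not yet scheduled.
package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the default kubeconfig (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    namespace := "default"                       // assumption: namespace of the DV
    scratchPVC := "deleting-scratch-pvc-scratch" // assumption: scratch PVC is named "<dv-name>-scratch"

    // Delete the scratch PVC as early as possible after CDI creates it.
    if err := client.CoreV1().PersistentVolumeClaims(namespace).Delete(
        context.TODO(), scratchPVC, metav1.DeleteOptions{}); err != nil {
        log.Fatal(err)
    }
    fmt.Println("deleted scratch PVC", scratchPVC)
}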


We also have these TCs:
https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-2328
https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-2327 
These cover the deletion of the scratch space PVC and expect the operation to complete successfully regardless.

Comment 1 Adam Litke 2020-09-17 13:34:27 UTC
@Bartosz please take a look.

Comment 2 Bartosz Rybacki 2020-09-21 09:12:00 UTC
I am on it, trying to recreate.

Comment 3 Bartosz Rybacki 2020-09-22 11:07:15 UTC
Recreated successfully, but the scratch PVC needs to be removed right after it is created, while the pod has not yet been scheduled/started. The pod then becomes Unschedulable and the controller does not handle that state correctly.

 "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-09-22T11:01:03Z",
                "message": "persistentvolumeclaim \"scratch-space-delete-scratch\" not found",
                "reason": "Unschedulable",
                "status": "False",
                "type": "PodScheduled"
            }
        ],
        "phase": "Pending",
        "qosClass": "BestEffort"
    }
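
For clarity, a minimal Go sketch (not actual CDI code; the helper name is made up) of detecting exactly this state, i.e. an importer pod stuck Pending with PodScheduled=False and reason Unschedulable:

package importcontroller

import (
    corev1 "k8s.io/api/core/v1"
)

// isUnschedulablePending reports whether a pod is Pending with the
// PodScheduled condition False and reason Unschedulable, which is the state
// the 2.4.1 controller never reacted to after the scratch PVC was deleted.
func isUnschedulablePending(pod *corev1.Pod) bool {
    if pod.Status.Phase != corev1.PodPending {
        return false
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodScheduled &&
            cond.Status == corev1.ConditionFalse &&
            cond.Reason == corev1.PodReasonUnschedulable {
            return true
        }
    }
    return false
}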

Comment 4 Bartosz Rybacki 2020-10-01 11:59:50 UTC
After requesting the creation of the pod and the scratch space PVC:
1. The Pod is being created, so it shows in the system as Pending (waiting for all of its PVCs to become available). The controller observes this and tries to create the scratch PVC again, but it already exists (also Pending), so the controller sets the "Claim Pending" condition and returns.

2. Some external action removes the scratch PVC (if I am not mistaken, the "kubernetes.io/pvc-protection" finalizer does not protect it while the pod is Pending/not yet scheduled). Now the only thing the controller sees is a PVC event, but the scratch PVC is not found, so the controller returns.

No further events arrive for the pod (it stays Pending, no changes).

Comment 6 Adam Litke 2020-10-12 18:57:03 UTC
Not a blocker for 2.5.  Pushing out.

Comment 7 Maya Rashish 2020-11-29 15:17:57 UTC
We had some problems with the downstream builds; there is no -11 (or newer) build available.

Comment 8 Maya Rashish 2020-12-01 11:32:16 UTC
Build should work now (thanks to Gal Ben Haim!)

Comment 9 Bartosz Rybacki 2020-12-15 17:19:31 UTC
I've received information from @akalenyu that the original problem no longer shows up.

The original problem was that whenever the scratch PVC was deleted while the pod was still Pending, the system ended up with a pending importer pod, no scratch space PVC, and an import controller that was not reconciling the situation (it would only do so after the resync period of 10 hours).

After the fix is applied, the import controller keeps requeueing the reconcile loop until the DV reaches the Succeeded or Failed state, so in this situation the scratch PVC is recreated. This was proven by running the tests.
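
For illustration, a minimal sketch of that requeue behaviour using the controller-runtime types the import controller is built on. This is not the actual CDI code; the phase constants and function name are stand-ins:

package importcontroller

import (
    "time"

    "sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Phase is a stand-in for the DataVolume phase string.
type Phase string

const (
    PhaseSucceeded Phase = "Succeeded"
    PhaseFailed    Phase = "Failed"
)

// requeueUntilTerminal keeps the DV in the work queue until it reaches a
// terminal phase, so a scratch PVC deleted mid-import is noticed and
// recreated on a later pass instead of waiting for the 10h resync.
func requeueUntilTerminal(phase Phase) reconcile.Result {
    if phase == PhaseSucceeded || phase == PhaseFailed {
        // Terminal: stop requeueing.
        return reconcile.Result{}
    }
    // Not terminal yet: ask controller-runtime to run Reconcile again soon.
    return reconcile.Result{RequeueAfter: 2 * time.Second}
}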


We have now discovered that the test fails once in many runs. Analyzing the logs shows the following situation:

PVC:
test-scratch status=Terminating

POD:
importer-test status=ContainerCreating, and the last event shows:
  Type     Reason                  Age                    From                     Message
  Warning  FailedMount             4m (x8 over 8m28s)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[cdi-scratch-vol], unattached volumes=[cdi-scratch-vol]: error processing PVC test-scratch: PVC is being deleted

This looks exactly like:
https://bugzilla.redhat.com/show_bug.cgi?id=1570606

To resolve this, the user can recreate the DV. I am not sure we can/should detect this situation and try to resolve it automatically.
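
As a small illustrative helper (not part of CDI; the function name is hypothetical), spotting that state amounts to checking for a deletion timestamp plus remaining finalizers on the scratch PVC:

package importcontroller

import (
    corev1 "k8s.io/api/core/v1"
)

// scratchPVCTerminating reports whether the scratch PVC has been marked for
// deletion but is still held back by finalizers (typically
// kubernetes.io/pvc-protection), which kubectl shows as status "Terminating".
func scratchPVCTerminating(pvc *corev1.PersistentVolumeClaim) bool {
    return pvc.DeletionTimestamp != nil && len(pvc.Finalizers) > 0
}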

Comment 13 errata-xmlrpc 2021-03-10 11:18:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799

