Bug 2235694

Summary: Creating a DataVolume gets stuck in CloneInProgress
Product: Container Native Virtualization (CNV)
Reporter: Yash <ymotiyel>
Component: Storage
Assignee: Michael Henriksen <mhenriks>
Status: CLOSED WONTFIX
QA Contact: Natalie Gavrielov <ngavrilo>
Severity: high
Priority: high
Version: 2.6.10
CC: alitke, dafrank, mhenriks
Last Closed: 2023-09-11 18:04:06 UTC
Type: Bug

Description Yash 2023-08-29 13:31:21 UTC
Description of problem:

Creating a VM from a DataVolume with the following definition gets stuck in the CloneInProgress status.

---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: splunk-standalone-vm-image
  namespace: clspcoykvzwctcm-l-vz-dev-000
spec:
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
  source:
    registry:
      secretRef: XXXX-XXXX-XXXX-XXX
      url: docker://XXXX-XXXX-XXXX-XXX
---

$ oc get pods -A | grep upload
clspcoykvzwctcm-l-vz-dev-000  cdi-upload-splunk-standalone-vm-data                       1/1    Running    0         2d

$ oc logs cdi-upload-splunk-standalone-vm-data
2023-08-18T16:50:38.410619610Z I0818 16:50:38.410523       1 uploadserver.go:70] Upload destination: /data/disk.img
2023-08-18T16:50:38.410619610Z I0818 16:50:38.410591       1 uploadserver.go:72] Running server on 0.0.0.0:8443


Looking at the CDI pod logs:

~~~
$ oc logs -n openshift-cnv cdi-deployment-6fdf9cc794-p85nz | grep "splunk-standalone-vm-data"
~~~

The output shows a "token is expired" error from the clone controller:

~~~
2023-08-22T03:09:15.817715998Z {"level":"error","ts":1692673755.817503,"logger":"controller","msg":"Reconciler error","controller":"clone-controller","name":"splunk-standalone-vm-data","namespace":"clspcoykvzwctcm-l-vz-dev-000","error":"error verifying token: square/go-jose/jwt: validation failed, token is expired (exp)"
~~~


Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.v2.6.10

How reproducible:


Steps to Reproduce:
1. Create a DataVolume using the definition above.

Actual results:
The clone stays stuck in CloneInProgress indefinitely.

Expected results:
The DV should be successfully created


Additional info:

Comment 1 Michael Henriksen 2023-08-30 13:26:13 UTC
Looking at the info here and the logs attached to the support ticket, there seem to be two issues. There are also two DataVolumes at play.

1) DataVolume "clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image":  The import of the image from the registry is consistently failing.  There are lots of these in the cdi-deployment log:

2023-08-21T19:41:25.048441203Z {"level":"info","ts":1692646885.0483322,"logger":"controller.import-controller","msg":"Pod termination code","PVC":"clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image","pod.Name":"importer-splunk-standalone-vm-image","ExitCode":1}

Would it be possible to capture the log of the "clspcoykvzwctcm-l-vz-dodev-000/importer-splunk-standalone-vm-image" pod?

What is the size of the image in the registry?  Is it close to 50Gi?  If so, the import may be running out of scratch space because of filesystem overhead.  Try increasing the target PVC size.
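
To make the overhead concern concrete, here is a rough Python sketch of the arithmetic. The 5.5% overhead fraction is an assumed value for illustration only; the value that actually applies on this CNV 2.6.10 cluster has not been confirmed in this report. The function names are hypothetical helpers, not part of CDI.

```python
import math


def usable_gib(pvc_gib: float, fs_overhead: float = 0.055) -> float:
    """Space left for the disk image after reserving filesystem overhead.

    fs_overhead is an ASSUMED fraction (5.5%), not a value taken from this cluster.
    """
    if not 0 <= fs_overhead < 1:
        raise ValueError("fs_overhead must be in [0, 1)")
    return pvc_gib * (1 - fs_overhead)


def min_pvc_size_gib(image_gib: float, fs_overhead: float = 0.055) -> int:
    """Smallest whole-GiB PVC request whose usable space still fits the image."""
    if not 0 <= fs_overhead < 1:
        raise ValueError("fs_overhead must be in [0, 1)")
    return math.ceil(image_gib / (1 - fs_overhead))


if __name__ == "__main__":
    # A 50Gi PVC leaves less than 50 GiB for the image under the assumed overhead.
    print(round(usable_gib(50), 2))   # 47.25
    # An image that really needs 50 GiB would need a larger request.
    print(min_pvc_size_gib(50))       # 53
```

Under that assumption, an image whose size is close to 50 GiB would not fit in a 50Gi PVC, which is consistent with the suggestion to increase the target PVC size.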

2) DataVolume "clspcoykvzwctcm-l-vz-dev-000/splunk-standalone-vm-data":  This DataVolume is attempting to clone "clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image" but it cannot because the PVC is in use by the importer pod that keeps crashing.  cdi-deployment keeps retrying until the clone auth token (valid for 5 minutes) eventually expires.

In this old version of CNV, the only way to avoid a token timeout is to wait for the registry import to complete before creating the second DataVolume.  If a token timeout does occur, the user simply has to recreate the DataVolume.  In this scenario, however, the root problem is that the registry import is failing.
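
One way to apply the "wait for the import first" workaround is to poll the source DataVolume's phase and only create the clone once it reports Succeeded. The following Python sketch assumes the `oc` CLI is logged in to the cluster; `dv_phase` and `wait_for_succeeded` are hypothetical helper names, and the timeout and polling interval are arbitrary illustrative values.

```python
import subprocess
import time
from typing import Callable


def dv_phase(name: str, namespace: str) -> str:
    """Return the DataVolume's .status.phase via the oc CLI (empty string if unset)."""
    result = subprocess.run(
        ["oc", "get", "datavolume", name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_succeeded(
    get_phase: Callable[[], str],
    timeout_s: float = 1800,      # illustrative value, not a CDI setting
    interval_s: float = 10,       # illustrative value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_phase() until it returns "Succeeded" or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if get_phase() == "Succeeded":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Names taken from this report; create the cloning DataVolume only if this prints True.
    ready = wait_for_succeeded(
        lambda: dv_phase("splunk-standalone-vm-image",
                         "clspcoykvzwctcm-l-vz-dodev-000")
    )
    print(ready)
```

In this particular case the loop would time out and return False, because the import never succeeds; that points back at the failing registry import as the thing to fix first.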