Description of problem:

Creating a VM from a DataVolume with the following definition gets stuck in CloneInProgress status.

---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: splunk-standalone-vm-image
  namespace: clspcoykvzwctcm-l-vz-dev-000
spec:
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
  source:
    registry:
      secretRef: XXXX-XXXX-XXXX-XXX
      url: docker://XXXX-XXXX-XXXX-XXX
---

$ oc get pods -A | grep upload
clspcoykvzwctcm-l-vz-dev-000   cdi-upload-splunk-standalone-vm-data   1/1   Running   0   2d

$ oc logs cdi-upload-splunk-standalone-vm-data
2023-08-18T16:50:38.410619610Z I0818 16:50:38.410523 1 uploadserver.go:70] Upload destination: /data/disk.img
2023-08-18T16:50:38.410619610Z I0818 16:50:38.410591 1 uploadserver.go:72] Running server on 0.0.0.0:8443

Looking at the CDI pod logs:

~~~
$ oc logs -n openshift-cnv cdi-deployment-6fdf9cc794-p85nz | grep "splunk-standalone-vm-data"
~~~

shows the "Storage Clone token is expired" error message:

~~~
2023-08-22T03:09:15.817715998Z {"level":"error","ts":1692673755.817503,"logger":"controller","msg":"Reconciler error","controller":"clone-controller","name":"splunk-standalone-vm-data","namespace":"clspcoykvzwctcm-l-vz-dev-000","error":"error verifying token: square/go-jose/jwt: validation failed, token is expired (exp)"
~~~

Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.v2.6.10

How reproducible:

Steps to Reproduce:
1. Create a DataVolume using the above definition file
2.
3.

Actual results:
The clone gets stuck in CloneInProgress forever.

Expected results:
The DV should be created successfully.

Additional info:
Looking at the info here and the logs attached to the support ticket, there seem to be two issues. There are also two DataVolumes at play.

1) DataVolume "clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image":

The import of the image from the registry is consistently failing. There are lots of these in the cdi-deployment log:

2023-08-21T19:41:25.048441203Z {"level":"info","ts":1692646885.0483322,"logger":"controller.import-controller","msg":"Pod termination code","PVC":"clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image","pod.Name":"importer-splunk-standalone-vm-image","ExitCode":1}

Would it be possible to capture the log of the "clspcoykvzwctcm-l-vz-dodev-000/importer-splunk-standalone-vm-image" pod?

What is the size of the image in the registry? Is it close to 50Gi? If so, the import may be running out of scratch space because of filesystem overhead. Try increasing the target PVC size.

2) DataVolume "clspcoykvzwctcm-l-vz-dev-000/splunk-standalone-vm-data":

This DataVolume is attempting to clone "clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image", but it cannot because the PVC is in use by the importer pod that keeps crashing. cdi-deployment keeps retrying until the clone auth token (valid for 5 minutes) eventually expires.

In this old version of CNV, the only way to avoid a token timeout is to wait for the registry import to complete before creating the second DataVolume. If a token timeout does occur, the user simply has to recreate the DataVolume. In this scenario, however, the root problem is that the registry import is failing.
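
To illustrate why the retries can never succeed once the window has passed, here is a rough sketch of the expiry check using the timestamps from the logs above. It is not CDI's actual validation code. The 5 minute lifetime comes from the comment above. The token issue time is not in the logs, so the upload pod's start time is used as a stand-in, and that is purely an assumption. The helper name token_expired is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Clone auth token lifetime as stated in the comment above (5 minutes).
TOKEN_LIFETIME = timedelta(minutes=5)


def token_expired(issued_at, checked_at, lifetime=TOKEN_LIFETIME):
    """Return True if a token issued at issued_at is past its lifetime at checked_at."""
    return checked_at > issued_at + lifetime


# ASSUMPTION: the real issue time of the token is unknown; the start time of
# the cdi-upload pod (first uploadserver.go log line) is used as a proxy.
issued_at = datetime(2023, 8, 18, 16, 50, 38, tzinfo=timezone.utc)

# "ts" field of the "Reconciler error" entry in the cdi-deployment log.
checked_at = datetime.fromtimestamp(1692673755.817503, tz=timezone.utc)

print("error logged at:", checked_at.isoformat())
print("time since assumed issue:", checked_at - issued_at)
print("token expired:", token_expired(issued_at, checked_at))
```

Under that assumption the reconcile error is logged days after the token was issued, so every later retry fails the same way until the DataVolume is recreated.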