Bug 2235694 - Creating Datavolume gets stuck CloneInProgress
Summary: Creating Datavolume gets stuck CloneInProgress
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.6.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Michael Henriksen
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-08-29 13:31 UTC by Yash
Modified: 2023-09-20 17:03 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-11 18:04:06 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CNV-32677 0 None None None 2023-09-06 12:34:09 UTC

Description Yash 2023-08-29 13:31:21 UTC
Description of problem:

Creating a VM from a DataVolume with the following definition gets stuck in the CloneInProgress status.

---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: splunk-standalone-vm-image
  namespace: clspcoykvzwctcm-l-vz-dev-000
spec:
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
  source:
    registry:
      secretRef: XXXX-XXXX-XXXX-XXX
      url: docker://XXXX-XXXX-XXXX-XXX
---

$ oc get pods -A | grep upload
clspcoykvzwctcm-l-vz-dev-000  cdi-upload-splunk-standalone-vm-data                       1/1    Running    0         2d

$ oc logs cdi-upload-splunk-standalone-vm-data
2023-08-18T16:50:38.410619610Z I0818 16:50:38.410523       1 uploadserver.go:70] Upload destination: /data/disk.img
2023-08-18T16:50:38.410619610Z I0818 16:50:38.410591       1 uploadserver.go:72] Running server on 0.0.0.0:8443


Looking at the CDI pod logs:

~~~
$ oc logs -n openshift-cnv cdi-deployment-6fdf9cc794-p85nz | grep "splunk-standalone-vm-data"
~~~

The output shows that the clone token has expired:

~~~
2023-08-22T03:09:15.817715998Z {"level":"error","ts":1692673755.817503,"logger":"controller","msg":"Reconciler error","controller":"clone-controller","name":"splunk-standalone-vm-data","namespace":"clspcoykvzwctcm-l-vz-dev-000","error":"error verifying token: square/go-jose/jwt: validation failed, token is expired (exp)"
~~~
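
When correlating entries like this one with other logs, it can help to convert the "ts" field (a Unix epoch timestamp) to UTC. The following is a minimal Python sketch, not part of CDI. The `entry` string is a shortened copy of the log entry above with a closing brace added so that it parses, and `entry_time` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone

# Shortened copy of the log entry above. The closing brace is added here
# (an assumption) so that the string is valid JSON.
entry = (
    '{"level":"error","ts":1692673755.817503,'
    '"controller":"clone-controller","name":"splunk-standalone-vm-data"}'
)


def entry_time(line: str) -> datetime:
    """Return the entry's "ts" field as a timezone-aware UTC datetime."""
    record = json.loads(line)
    return datetime.fromtimestamp(record["ts"], tz=timezone.utc)


print(entry_time(entry).isoformat())
```

The converted value should agree with the timestamp prefix that precedes the JSON payload on the same log line.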


Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.v2.6.10

How reproducible:


Steps to Reproduce:
1. Create a DataVolume using the above definition file

Actual results:
The clone stays stuck in CloneInProgress indefinitely

Expected results:
The DV should be successfully created


Additional info:

Comment 1 Michael Henriksen 2023-08-30 13:26:13 UTC
Looking at the info here and the logs attached to the support ticket, there seem to be two issues.  There are also two DataVolumes at play.

1) DataVolume "clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image":  The import of the image from the registry is consistently failing.  Lots of these in the cdi-deployment log:

2023-08-21T19:41:25.048441203Z {"level":"info","ts":1692646885.0483322,"logger":"controller.import-controller","msg":"Pod termination code","PVC":"clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image","pod.Name":"importer-splunk-standalone-vm-image","ExitCode":1}

Would it be possible to capture the log of the "clspcoykvzwctcm-l-vz-dodev-000/importer-splunk-standalone-vm-image" pod?

What is the size of the image in the registry?  Is it close to 50Gi?  If so, the import may be running out of scratch space because of filesystem overhead.  Try increasing the target PVC size.
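
To make the overhead concern concrete, here is a rough Python sketch of the arithmetic. The 5.5% overhead fraction and the 48 GiB image size are illustrative assumptions, not values taken from this cluster or confirmed for this CNV version, and `usable_bytes` and `fits` are hypothetical helper names.

```python
GIB = 1024 ** 3


def usable_bytes(pvc_bytes: int, overhead_fraction: float) -> int:
    """Space left for the disk image after reserving filesystem overhead."""
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return int(pvc_bytes * (1 - overhead_fraction))


def fits(image_bytes: int, pvc_bytes: int, overhead_fraction: float) -> bool:
    """True if the image fits in the PVC once overhead is subtracted."""
    return image_bytes <= usable_bytes(pvc_bytes, overhead_fraction)


pvc = 50 * GIB      # requested size from the DataVolume definition
overhead = 0.055    # ASSUMPTION: illustrative overhead fraction
image = 48 * GIB    # ASSUMPTION: hypothetical image size

print(f"usable: {usable_bytes(pvc, overhead) / GIB:.2f} GiB")
print(f"48 GiB image fits: {fits(image, pvc, overhead)}")
```

Under these assumed numbers, a 50Gi PVC leaves roughly 47.25 GiB for the image, so an image only slightly smaller than the PVC would not fit.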

2) DataVolume "clspcoykvzwctcm-l-vz-dev-000/splunk-standalone-vm-data":  This DataVolume is attempting to clone "clspcoykvzwctcm-l-vz-dodev-000/splunk-standalone-vm-image" but it cannot because the PVC is in use by the importer pod that keeps crashing.  cdi-deployment keeps retrying until the clone auth token (5 min) eventually expires.
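
The interaction between the retries and the 5-minute token lifetime can be illustrated with a small Python sketch. It does not reproduce CDI's actual token verification. The issue time is hypothetical and `token_expired` is an illustrative helper name.

```python
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(minutes=5)  # lifetime mentioned above


def token_expired(issued_at: datetime, now: datetime,
                  lifetime: timedelta = TOKEN_LIFETIME) -> bool:
    """True once `now` is past the token's expiry time (issued_at + lifetime)."""
    return now > issued_at + lifetime


# ASSUMPTION: hypothetical issue time, chosen only for illustration.
issued = datetime(2023, 8, 22, 3, 0, 0, tzinfo=timezone.utc)

# Retries that happen while the source PVC is still in use.
for minutes in (1, 4, 6):
    attempt = issued + timedelta(minutes=minutes)
    print(f"retry at +{minutes} min, expired: {token_expired(issued, attempt)}")
```

Any retry that lands after the expiry time fails verification, which matches the "token is expired (exp)" error reported in the description.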

In this old version of CNV, the only way to avoid a token timeout is to wait for the registry import to complete before creating the second DataVolume.  If a token timeout does occur, the user simply has to recreate the DataVolume.  In this scenario, however, the root problem is that the registry import is failing.
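
The "wait for the import before creating the clone" workaround could be scripted along these lines. This is a minimal Python sketch under stated assumptions: `wait_for_phase` is a hypothetical helper, and the `get_phase` callable is left to the caller so that the sketch does not depend on any particular client.

```python
import time
from typing import Callable


def wait_for_phase(get_phase: Callable[[], str],
                   target: str = "Succeeded",
                   timeout_s: float = 3600.0,
                   interval_s: float = 10.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_phase() until it returns `target` or the timeout elapses.

    Returns True if the target phase was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_phase() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Simulated phase sequence for the source DataVolume (illustration only).
phases = iter(["ImportScheduled", "ImportInProgress", "Succeeded"])
ready = wait_for_phase(lambda: next(phases), sleep=lambda _s: None)
print("safe to create the clone DataVolume:", ready)
```

In practice `get_phase` might wrap something like `oc get dv <name> -n <namespace> -o jsonpath='{.status.phase}'`, and the second DataVolume would be created only after the function returns True.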

