This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2216038 - Datavolumes take a long time to be created, jwt token expires
Summary: Datavolumes take a long time to be created, jwt token expires
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.10.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.14.2
Assignee: Michael Henriksen
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-06-19 20:54 UTC by Sean Haselden
Modified: 2024-04-13 04:25 UTC
CC List: 9 users

Fixed In Version: v4.14.0.rhel9-1781
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-14 16:15:17 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt containerized-data-importer pull 2862 0 None Merged increase controller concurrency and cpu requests 2023-09-27 18:03:38 UTC
Github kubevirt containerized-data-importer pull 2867 0 None Merged [release-v1.57] increase controller concurrency and cpu requests 2023-09-27 18:03:44 UTC
Red Hat Issue Tracker   CNV-30083 0 None None None 2023-12-14 16:15:16 UTC

Description Sean Haselden 2023-06-19 20:54:59 UTC
Description of problem:


Frequently when creating virtual machines we see that they are stuck in Pending forever. During investigation we saw DataVolumes that fail to create and are stuck indefinitely with no status update. Looking at the logs of the cdi-deployment pod in openshift-cnv we see entries like this:
```
{"level":"error","ts":<timestamp>,"logger":"controller.datavolume-controller","msg":"Reconciler error", "name":<dv name>,"namespace":<namespace>,"error":"error verifying token: square/go-jose/jwt: validation failed, token is expired (exp)","errorVerbose":"square/go-jose/jwt validation failed, token is expired\nerror verifying token\nkubevirt.io/containerized-data-importer/pkg/controller/.validateCloneTokenDW <snip>"}
```
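
A minimal way to pull the same controller logs and filter for the expiry error (a sketch; assumes the default openshift-cnv namespace used by OpenShift Virtualization, and the grep pattern simply matches the message above):
```
# Tail the CDI controller logs and look for expired clone tokens
oc -n openshift-cnv logs deployment/cdi-deployment --tail=200 | grep "token is expired"
```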
After some investigation we found the clone token in the DV annotations:
```
metadata:
  annotations:
    cdi.kubevirt.io/storage.clone.token: <jwt>
```

The JWT is in fact expired; it is only valid for 5 minutes.
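
One way to confirm the expiry on an affected DV is to decode the exp claim of the annotation (a minimal sketch; the DV name and namespace are placeholders, and python3 is assumed to be available on the client):
```
# Extract the clone token from the DV annotation and print its exp claim
# next to the current time; exp < now means the token has expired.
oc get dv <dv-name> -n <namespace> \
  -o jsonpath='{.metadata.annotations.cdi\.kubevirt\.io/storage\.clone\.token}' \
  | python3 -c 'import sys,base64,json,time; p=sys.stdin.read().strip().split(".")[1]; p+="="*(-len(p)%4); c=json.loads(base64.urlsafe_b64decode(p)); print("exp:", c["exp"], "now:", int(time.time()))'
```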
The status of the DV is just empty like this:
```
status: {}
```


This doesn't reproduce all the time.

In the initial set of logs it took a little more than 8 minutes from when the DV was created (as can be seen in the creationTimestamp) until the first log entry: 2023-06-05T06:44:58Z -> 1685948026.3881886 (2023-06-05T06:53:46Z).
Regarding the time sync issue suggested, I verified there is no difference in the time between different nodes in the cluster, and they are all connected to the same NTP server.
The logs from this time are already attached, including the datavolume yaml.

dv-expire.tar.gz/dv.yaml:  creationTimestamp: "2023-06-05T06:44:58Z"
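
For reference, the epoch timestamp from the controller log can be converted to UTC to confirm the ~8 minute gap (assumes GNU date on the client):
```
# Convert the controller log timestamp (seconds since epoch) to UTC
date -u -d @1685948026
# Mon Jun  5 06:53:46 UTC 2023  (vs. creationTimestamp 2023-06-05T06:44:58Z)
```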

===
cdi-extended.tar.gz/cdi-deployment.log
===


{"level":"info","ts":1685948026.3881886,"logger":"controller.datavolume-controller","msg":"Initializing transfer","Datavolume":"mongodb/affected-vm-1-rootdisk"}
{"level":"error","ts":1685948026.3953767,"logger":"controller.datavolume-controller","msg":"Reconciler error","name":"affected-vm-1-rootdisk","namespace":"mongodb","error":"error verifying token: square/go-jose/jwt: validation failed, token is expired (exp)","errorVerbose":"square/go-jose/jwt: validation failed, token is expired (exp)\nerror verifying token\nkubevirt.io/containerized-data-importer/pkg/controller.validateCloneTokenDV\n\t/remote-source/app/pkg/controller/util.go:876\nkubevirt.io/containerized-data-importer/pkg/controller.(*DatavolumeReconciler).initTransfer\n\t/remote-source/app/pkg/controller/datavolume-controller.go:1156\nkubevirt.io/containerized-data-importer/pkg/controller.(*DatavolumeReconciler).doCrossNamespaceClone\n\t/remote-source/app/pkg/controller/datavolume-controller.go:896\nkubevirt.io/containerized-data-importer/pkg/controller.(*DatavolumeReconciler).reconcileSmartClonePvc\n\t/remote-source/app/pkg/


The customer then managed to reproduce this issue in a pre-production online environment by creating a few hundred VMs. 100 of the VMs have a DataVolume configuration that doesn't work - it tries to copy the PVC from a different storage class. The rest of the VMs are completely regular - they should be created normally, but as can be seen there are 292 DataVolumes that took more than 5 minutes to be acknowledged and are stuck in limbo because the JWT expired.
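
A quick way to count how many DataVolumes are stuck with an empty status across the cluster (a minimal sketch; assumes cluster-wide list permission and python3 on the client):
```
# DVs whose status is still empty ({}) have never been acknowledged by the controller
oc get dv -A -o json \
  | python3 -c 'import sys,json; items=json.load(sys.stdin)["items"]; print(sum(1 for d in items if not d.get("status")))'
```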

Attached to the case are the CNV must-gather as well as the OpenShift must-gather from the reproduction.

Comment 6 Michael Henriksen 2023-08-22 17:29:14 UTC
This is much less of an issue with OpenShift Virtualization 4.14 because we now add a long-term token to DataVolumes right away. But it can still happen when CPU is artificially limited to 10m and a large number of clones are generated. The issue is more easily reproduced in 4.10 (the reported version).

The simplest and most direct solution is to increase the cdi-deployment CPU resource request from 10m (1% of a single core) to 100m (10% of a single core) and increase controller concurrency from 1 to 3. In my testing, this ensured that tokens were validated in single-digit seconds, as opposed to minutes, when the system is under heavy load (creating 500 concurrent cross-namespace clones).
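
To check which defaults a given cluster is running with, the current request can be read off the deployment directly (a minimal sketch; assumes the default openshift-cnv namespace and that the controller is the first container in the pod spec):
```
# Print the CPU request of the CDI controller deployment
oc -n openshift-cnv get deployment cdi-deployment \
  -o jsonpath='{.spec.template.spec.containers[0].resources.requests.cpu}'
# 10m before this change, 100m with it
```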

Comment 7 Adam Litke 2023-09-27 18:13:45 UTC
Build is d/s but not included in errata. Pushing to 4.14.1.

Comment 8 Adam Litke 2023-10-16 12:48:43 UTC
Kevin,

Let's make sure to retest this in a scale environment with the following parameters:
- Clones should be cross-namespace CSI clones (not CSI snapshots)
- 300 concurrent clones
- We can use the ODF RBD storage class
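
For reference, a sketch of how such a scale test might stamp out cross-namespace clone DVs (all names, namespaces, the size, and the ocs-storagecluster-ceph-rbd storage class name are placeholders/assumptions, not values taken from this bug):
```
# Create 300 cross-namespace clone DataVolumes from one source PVC
for i in $(seq -w 1 300); do
oc apply -f - <<EOF
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: scale-clone-${i}
  namespace: target-ns
spec:
  source:
    pvc:
      namespace: source-ns        # different namespace => cross-namespace clone
      name: golden-image-pvc      # placeholder source PVC
  storage:
    storageClassName: ocs-storagecluster-ceph-rbd   # assumed ODF RBD class name
    resources:
      requests:
        storage: 30Gi
EOF
done
```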

Comment 13 Red Hat Bugzilla 2024-04-13 04:25:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.

