Description of problem:
virtctl image-upload dv fails

Version-Release number of selected component (if applicable):
CNV v4.14.0.rhel9-1869 (in v4.14.0.rhel9-1867 it worked fine)

How reproducible:
Always, on one deployed cluster with v4.14.0.rhel9-1869

Steps to Reproduce:
1. $ virtctl image-upload dv dv-ocs --image-path=./cirros-0.4.0-x86_64-disk.qcow2 --size=1Gi --storage-class=ocs-storagecluster-ceph-rbd --access-mode=ReadWriteMany --insecure --block-volume
PVC default/dv-ocs not found
DataVolume default/dv-ocs created
Waiting for PVC dv-ocs upload pod to be ready...
Pod now ready
uploadproxy URL not found

Actual results:
error: uploadproxy URL not found

Expected results:
Uploading completed successfully

Additional info:
[cloud-user@ocp-psi-executor ~]$ oc get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
cdi-upload-prime-f88af4ef-9064-4762-a8c9-5f6138b868d1   1/1     Running   0          88s

[cloud-user@ocp-psi-executor ~]$ oc logs cdi-upload-prime-f88af4ef-9064-4762-a8c9-5f6138b868d1
I0905 14:57:32.246000       1 uploadserver.go:74] Running server on 0.0.0.0:8443

[cloud-user@ocp-psi-executor ~]$ oc get dv
NAME     PHASE         PROGRESS   RESTARTS   AGE
dv-ocs   UploadReady   N/A                   9m29s

[cloud-user@ocp-psi-executor ~]$ oc get dv dv-ocs -oyaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  annotations:
    cdi.kubevirt.io/storage.usePopulator: "true"
  creationTimestamp: "2023-09-05T14:57:20Z"
  generation: 1
  name: dv-ocs
  namespace: default
  resourceVersion: "2232200"
  uid: 77a93e07-79ba-4b13-ac1b-348595681d3b
spec:
  contentType: kubevirt
  source:
    upload: {}
  storage:
    accessModes:
    - ReadWriteMany
    resources:
      requests:
        storage: 1Gi
    storageClassName: ocs-storagecluster-ceph-rbd
    volumeMode: Block
status:
  claimName: dv-ocs
  conditions:
  - lastHeartbeatTime: "2023-09-05T14:57:20Z"
    lastTransitionTime: "2023-09-05T14:57:20Z"
    message: PVC dv-ocs Pending
    reason: Pending
    status: "False"
    type: Bound
  - lastHeartbeatTime: "2023-09-05T14:57:33Z"
    lastTransitionTime: "2023-09-05T14:57:20Z"
    reason: TransferRunning
    status: "False"
    type: Ready
  - lastHeartbeatTime: "2023-09-05T14:57:32Z"
    lastTransitionTime: "2023-09-05T14:57:32Z"
    reason: Pod is running
    status: "True"
    type: Running
  phase: UploadReady
  progress: N/A

[cloud-user@ocp-psi-executor ~]$ oc get pvc
NAME                                                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
dv-ocs                                               Pending                                                                        ocs-storagecluster-ceph-rbd   10m
prime-f88af4ef-9064-4762-a8c9-5f6138b868d1           Bound     pvc-825ee54d-a879-44fe-9df1-e93acea238ea   1Gi        RWX            ocs-storagecluster-ceph-rbd   10m
prime-f88af4ef-9064-4762-a8c9-5f6138b868d1-scratch   Bound     pvc-c306db5c-ef80-470a-b7ed-a63277d74727   129Gi      RWO            hostpath-csi-basic            10m

[cloud-user@ocp-psi-executor ~]$ oc get pvc dv-ocs -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    cdi.kubevirt.io/storage.condition.running: "true"
    cdi.kubevirt.io/storage.condition.running.message: ""
    cdi.kubevirt.io/storage.condition.running.reason: Pod is running
    cdi.kubevirt.io/storage.contentType: kubevirt
    cdi.kubevirt.io/storage.pod.phase: Running
    cdi.kubevirt.io/storage.pod.ready: "true"
    cdi.kubevirt.io/storage.pod.restarts: "0"
    cdi.kubevirt.io/storage.populator.pvcPrime: prime-f88af4ef-9064-4762-a8c9-5f6138b868d1
    cdi.kubevirt.io/storage.preallocation.requested: "false"
    cdi.kubevirt.io/storage.usePopulator: "true"
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-09-05T14:57:20Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
  name: dv-ocs
  namespace: default
  ownerReferences:
  - apiVersion: cdi.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: DataVolume
    name: dv-ocs
    uid: 77a93e07-79ba-4b13-ac1b-348595681d3b
  resourceVersion: "2232199"
  uid: f88af4ef-9064-4762-a8c9-5f6138b868d1
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: cdi.kubevirt.io
    kind: VolumeUploadSource
    name: volume-upload-source-77a93e07-79ba-4b13-ac1b-348595681d3b
  dataSourceRef:
    apiGroup: cdi.kubevirt.io
    kind: VolumeUploadSource
    name: volume-upload-source-77a93e07-79ba-4b13-ac1b-348595681d3b
  resources:
    requests:
      storage: "1073741824"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
status:
  phase: Pending

[cloud-user@ocp-psi-executor ~]$ oc get pvc prime-f88af4ef-9064-4762-a8c9-5f6138b868d1 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: ""
    cdi.kubevirt.io/storage.condition.bound: "true"
    cdi.kubevirt.io/storage.condition.bound.message: ""
    cdi.kubevirt.io/storage.condition.bound.reason: ""
    cdi.kubevirt.io/storage.condition.running: "true"
    cdi.kubevirt.io/storage.condition.running.message: ""
    cdi.kubevirt.io/storage.condition.running.reason: Pod is running
    cdi.kubevirt.io/storage.contentType: kubevirt
    cdi.kubevirt.io/storage.pod.phase: Running
    cdi.kubevirt.io/storage.pod.ready: "true"
    cdi.kubevirt.io/storage.pod.restarts: "0"
    cdi.kubevirt.io/storage.populator.kind: VolumeUploadSource
    cdi.kubevirt.io/storage.preallocation.requested: "false"
    cdi.kubevirt.io/storage.upload.target: ""
    cdi.kubevirt.io/storage.uploadPodName: cdi-upload-prime-f88af4ef-9064-4762-a8c9-5f6138b868d1
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    sidecar.istio.io/inject: "false"
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-09-05T14:57:20Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
  name: prime-f88af4ef-9064-4762-a8c9-5f6138b868d1
  namespace: default
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: PersistentVolumeClaim
    name: dv-ocs
    uid: f88af4ef-9064-4762-a8c9-5f6138b868d1
  resourceVersion: "2232198"
  uid: 825ee54d-a879-44fe-9df1-e93acea238ea
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: "1073741824"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
  volumeName: pvc-825ee54d-a879-44fe-9df1-e93acea238ea
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 1Gi
  phase: Bound

[cloud-user@ocp-psi-executor ~]$ oc get pvc prime-f88af4ef-9064-4762-a8c9-5f6138b868d1-scratch -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubevirt.io.hostpath-provisioner
    volume.kubernetes.io/selected-node: infd-vrf-414t0-w7msb-master-1
    volume.kubernetes.io/storage-provisioner: kubevirt.io.hostpath-provisioner
  creationTimestamp: "2023-09-05T14:57:21Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
  name: prime-f88af4ef-9064-4762-a8c9-5f6138b868d1-scratch
  namespace: default
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: cdi-upload-prime-f88af4ef-9064-4762-a8c9-5f6138b868d1
    uid: 5b7bfab5-f7ba-4629-b102-8aa058477f9d
  resourceVersion: "2232006"
  uid: c306db5c-ef80-470a-b7ed-a63277d74727
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "1073741824"
  storageClassName: hostpath-csi-basic
  volumeMode: Filesystem
  volumeName: pvc-c306db5c-ef80-470a-b7ed-a63277d74727
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 129Gi
  phase: Bound
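For reference, virtctl resolves the upload endpoint from the CDIConfig status, so a quick way to check what it would use is (field path per the cdi.kubevirt.io/v1beta1 API; cdi-uploadproxy is the route name seen on this CNV deployment):

$ oc get cdiconfig config -o jsonpath='{.status.uploadProxyURL}'
$ oc get route -n openshift-cnv cdi-uploadproxy

On the affected cluster the first command is expected to come back empty, matching the CDIConfig dump in the next comment.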
Dalia @dafrank pointed out that we don't have the 'uploadProxyURL' in the CDIConfig:

[cloud-user@ocp-psi-executor jenia]$ oc get cdiconfig -oyaml
apiVersion: v1
items:
- apiVersion: cdi.kubevirt.io/v1beta1
  kind: CDIConfig
  metadata:
    creationTimestamp: "2023-09-05T12:20:14Z"
    generation: 6
    labels:
      app: containerized-data-importer
      app.kubernetes.io/component: storage
      app.kubernetes.io/managed-by: cdi-controller
      app.kubernetes.io/part-of: hyperconverged-cluster
      app.kubernetes.io/version: 4.14.0
      cdi.kubevirt.io: ""
    name: config
    ownerReferences:
    - apiVersion: cdi.kubevirt.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CDI
      name: cdi-kubevirt-hyperconverged
      uid: 41523fe3-c526-4597-aa5f-5320087833cb
    resourceVersion: "2116654"
    uid: 28c20eb6-0932-4f9a-9739-b942f6377d8e
  spec:
    featureGates:
    - HonorWaitForFirstConsumer
    tlsSecurityProfile:
      intermediate: {}
      type: Intermediate
  status:
    defaultPodResourceRequirements:
      limits:
        cpu: 750m
        memory: 600M
      requests:
        cpu: 100m
        memory: 60M
    filesystemOverhead:
      global: "0.055"
      storageClass:
        csi-manila-ceph: "0.055"
        hostpath-csi-basic: "0.055"
        hostpath-csi-pvc-block: "0.055"
        local-block-hpp: "0.055"
        local-block-ocs: "0.055"
        nfs: "0.055"
        ocs-storagecluster-ceph-rbd: "0.055"
        ocs-storagecluster-ceph-rbd-virtualization: "0.055"
        ocs-storagecluster-ceph-rbd-wffc: "0.055"
        ocs-storagecluster-cephfs: "0.055"
        standard-csi: "0.055"
    importProxy:
      HTTPProxy: ""
      HTTPSProxy: ""
      noProxy: ""
      trustedCAProxy: ""
    scratchSpaceStorageClass: hostpath-csi-basic
kind: List
metadata:
  resourceVersion: ""
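Since the CDIConfig status never gets populated, a possible client-side workaround is to hand virtctl the proxy address directly via its --uploadproxy-url flag, reading the host off the route (a sketch; the route namespace/name are the ones seen on this cluster):

$ virtctl image-upload dv dv-ocs --image-path=./cirros-0.4.0-x86_64-disk.qcow2 \
    --size=1Gi --storage-class=ocs-storagecluster-ceph-rbd --access-mode=ReadWriteMany \
    --insecure --block-volume \
    --uploadproxy-url=https://$(oc get route -n openshift-cnv cdi-uploadproxy -o jsonpath='{.spec.host}')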
Interesting one... looks like Routes were not available when the controller started, so that may have messed up the cache?

[cloud-user@ocp-psi-executor ~]$ oc logs -n openshift-cnv cdi-deployment-5dcbf47b-kkb6l | grep config-controller
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1beta1.CDIConfig"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1beta1.CDI"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.StorageClass"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Ingress"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Proxy"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting Controller","controller":"config-controller"}

Versus a new pod after killing the original:

[cloud-user@ocp-psi-executor eof-registry-import]$ oc logs -n openshift-cnv cdi-deployment-5ccd54779f-jtpss | grep config-controller
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1beta1.CDIConfig"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1beta1.CDI"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.StorageClass"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Ingress"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Route"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Proxy"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting Controller","controller":"config-controller"}

Note the extra event source in the new pod:
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Route"}
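That also suggests a mitigation on an affected cluster: restart the CDI controller so it rebuilds its cache now that openshift-apiserver is serving Routes. For example (deployment name as seen on this cluster; simply deleting the pod works too, since the CDI operator recreates it):

$ oc rollout restart -n openshift-cnv deployment/cdi-deployment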
Peter, can you please approve the blocker?
Some thoughts after looking into this extensively:

If openshift-apiserver wasn't there during CDI startup (there should be no reason for this), or was briefly unavailable, our cached client never recovers and keeps failing to get Route objects, mistakenly returning a "NoMatchError". The fact that this reproduced more than a few times suggests the candidate we're testing has a bug around the openshift-apiserver/infra setup.

This is the current behavior in controller-runtime:
https://github.com/kubernetes-sigs/controller-runtime/issues/2456#issuecomment-1695918922
https://github.com/kubernetes-sigs/controller-runtime/issues/2176#issuecomment-1712809826

Another problem is dynamically setting up a watch, which AFAIK is not yet supported:
https://github.com/kubernetes-sigs/controller-runtime/issues/2176#issuecomment-1635414429

On OpenShift, Route requests are served by openshift-apiserver; there is no Route CRD. (Random fact: this isn't the case on MicroShift, which only has kube-apiserver, so there Routes are CRDs.) This makes the limitation harder to work around, because we can't watch for the creation of a Route CRD.

Maybe we should reconsider this being a blocker, or at least see if it reproduces on RCs?

Meanwhile, some possible workarounds in CDI (a sketch of the first option follows below):
- Use an uncached client
- Delay the Route informer setup until Routes become available

@alitke @dafrank
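For illustration, a minimal sketch of the uncached-client option, assuming a controller-runtime setup like CDI's (the function name, namespace, and route name are illustrative, not CDI's actual code):

package workaround

import (
	"context"

	routev1 "github.com/openshift/api/route/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getUploadProxyRoute fetches the upload proxy Route with a live, uncached
// client. client.New builds a fresh REST mapper from the config, so a stale
// NoMatchError cached at startup should not stick the way it does with the
// manager's cached client.
func getUploadProxyRoute(ctx context.Context, cfg *rest.Config, scheme *runtime.Scheme) (*routev1.Route, error) {
	// The scheme must have routev1 registered, e.g. via routev1.Install(scheme).
	liveClient, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		return nil, err
	}
	route := &routev1.Route{}
	// Namespace/name here match this cluster's defaults; adjust as needed.
	key := types.NamespacedName{Namespace: "openshift-cnv", Name: "cdi-uploadproxy"}
	if err := liveClient.Get(ctx, key, route); err != nil {
		return nil, err
	}
	return route, nil
}

The tradeoff is an API round trip per lookup instead of a cache hit, which seems acceptable for the low-frequency CDIConfig reconciles.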
Agreed that this bug shouldn't be a blocker; pushing it to 4.14.1.
Smoke tests already passed for the 5 builds after CNV v4.14.1.rhel9-18. Since the bug wasn't reproducible on every installation, I would like to see some more smoke results, but so far it looks fine.
Verified in 4.14.1:

$ virtctl image-upload dv dv-ocs --image-path=./cirros-0.4.0-x86_64-disk.qcow2 --size=1Gi --storage-class=ocs-storagecluster-ceph-rbd --access-mode=ReadWriteMany --insecure --block-volume
PVC default/dv-ocs not found
DataVolume default/dv-ocs created
Waiting for PVC dv-ocs upload pod to be ready...
Pod now ready
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.stg-hmeir414.rhos-psi.cnv-qe.rhood.us

12.13 MiB / 12.13 MiB [========================================================================] 100.00% 0s

Uploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress
Processing completed successfully
Uploading ./cirros-0.4.0-x86_64-disk.qcow2 completed successfully

The DV Succeeded and the PVC is Bound:

$ oc get dv
NAME     PHASE       PROGRESS   RESTARTS   AGE
dv-ocs   Succeeded   N/A                   75s

$ oc get pvc
NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
dv-ocs   Bound    pvc-af501356-cee9-4e3d-a16a-0b5c33e97edc   1Gi        RWX            ocs-storagecluster-ceph-rbd   80s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7704
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days