Bug 2237470

Summary: | virtctl image-upload fails with "uploadproxy URL not found" | | |
---|---|---|---|
Product: | Container Native Virtualization (CNV) | Reporter: | Jenia Peimer <jpeimer> |
Component: | Storage | Assignee: | Alex Kalenyuk <akalenyu> |
Status: | CLOSED ERRATA | QA Contact: | Harel Meir <hmeir> |
Severity: | high | Priority: | high |
Version: | 4.14.0 | Target Release: | 4.14.1 |
Keywords: | Regression | Flags: | dafrank: needinfo- |
Fixed In Version: | CNV v4.14.1.rhel9-18 | Last Closed: | 2023-12-07 15:00:42 UTC |
Hardware: | Unspecified | OS: | Unspecified |
CC: | akalenyu, alitke, dafrank, pelauter, ycui | Type: | Bug |
Description Jenia Peimer 2023-09-05 15:10:54 UTC
Dalia @dafrank pointed out that we don't have the 'uploadProxyURL' in the CDIConfig status:

```console
[cloud-user@ocp-psi-executor jenia]$ oc get cdiconfig -oyaml
apiVersion: v1
items:
- apiVersion: cdi.kubevirt.io/v1beta1
  kind: CDIConfig
  metadata:
    creationTimestamp: "2023-09-05T12:20:14Z"
    generation: 6
    labels:
      app: containerized-data-importer
      app.kubernetes.io/component: storage
      app.kubernetes.io/managed-by: cdi-controller
      app.kubernetes.io/part-of: hyperconverged-cluster
      app.kubernetes.io/version: 4.14.0
      cdi.kubevirt.io: ""
    name: config
    ownerReferences:
    - apiVersion: cdi.kubevirt.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CDI
      name: cdi-kubevirt-hyperconverged
      uid: 41523fe3-c526-4597-aa5f-5320087833cb
    resourceVersion: "2116654"
    uid: 28c20eb6-0932-4f9a-9739-b942f6377d8e
  spec:
    featureGates:
    - HonorWaitForFirstConsumer
    tlsSecurityProfile:
      intermediate: {}
      type: Intermediate
  status:
    defaultPodResourceRequirements:
      limits:
        cpu: 750m
        memory: 600M
      requests:
        cpu: 100m
        memory: 60M
    filesystemOverhead:
      global: "0.055"
      storageClass:
        csi-manila-ceph: "0.055"
        hostpath-csi-basic: "0.055"
        hostpath-csi-pvc-block: "0.055"
        local-block-hpp: "0.055"
        local-block-ocs: "0.055"
        nfs: "0.055"
        ocs-storagecluster-ceph-rbd: "0.055"
        ocs-storagecluster-ceph-rbd-virtualization: "0.055"
        ocs-storagecluster-ceph-rbd-wffc: "0.055"
        ocs-storagecluster-cephfs: "0.055"
        standard-csi: "0.055"
    importProxy:
      HTTPProxy: ""
      HTTPSProxy: ""
      noProxy: ""
      trustedCAProxy: ""
    scratchSpaceStorageClass: hostpath-csi-basic
kind: List
metadata:
  resourceVersion: ""
```

Interesting one.. looks like Routes were not available when the controller started, so that may mess up the cache? The original pod's config-controller never started a Route EventSource:

```console
[cloud-user@ocp-psi-executor ~]$ oc logs -n openshift-cnv cdi-deployment-5dcbf47b-kkb6l | grep config-controller
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1beta1.CDIConfig"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1beta1.CDI"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.StorageClass"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Ingress"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Proxy"}
{"level":"info","ts":"2023-09-05T12:42:53Z","msg":"Starting Controller","controller":"config-controller"}
```

Versus a new pod after killing the original:

```console
[cloud-user@ocp-psi-executor eof-registry-import]$ oc logs -n openshift-cnv cdi-deployment-5ccd54779f-jtpss | grep config-controller
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1beta1.CDIConfig"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1beta1.CDI"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.StorageClass"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Ingress"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Route"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Proxy"}
{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting Controller","controller":"config-controller"}
```

(Note the '*v1.Route' EventSource line, which is missing from the first pod's log.)
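Until the fix lands (or the cdi-deployment pod is restarted so the cache is rebuilt), a hedged workaround sketch is to bypass CDIConfig entirely and hand virtctl the upload proxy URL directly. The route name and namespace below are the CNV defaults seen on this cluster and are worth double-checking on yours:

```console
# Does the upload proxy Route exist even though CDIConfig doesn't report it?
$ oc get route cdi-uploadproxy -n openshift-cnv -o jsonpath='{.spec.host}'

# Confirm the symptom: uploadProxyURL is empty in the CDIConfig status.
$ oc get cdiconfig config -o jsonpath='{.status.uploadProxyURL}'

# Point virtctl at the route host explicitly instead of relying on CDIConfig.
$ virtctl image-upload dv my-dv --image-path=./disk.qcow2 --size=1Gi \
    --uploadproxy-url=https://<route-host-from-above> --insecure
```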
EventSource","controller":"config-controller","source":"kind source: *v1.Proxy"} {"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting Controller","controller":"config-controller"} (Note '{"level":"info","ts":"2023-09-05T16:55:31Z","msg":"Starting EventSource","controller":"config-controller","source":"kind source: *v1.Route"}') Peter can you please approve the blocker Some thoughts after looking into this extensively: If the openshift apiserver wasn't there during CDI startup (there should be no reason for this), or, was briefly unavailable, our cached client will never recover and will keep failing to get route objects, mistakenly returning "NoMatchError". I think that being able to reproduce this more than a few times suggests the candidate we're testing has a bug around the openshift apiserver/infra setup. This is the current behavior in ctrl-runtime: https://github.com/kubernetes-sigs/controller-runtime/issues/2456#issuecomment-1695918922 https://github.com/kubernetes-sigs/controller-runtime/issues/2176#issuecomment-1712809826 Another problem is dynamically trying to set up a watch, which AFAIK is not yet supported: https://github.com/kubernetes-sigs/controller-runtime/issues/2176#issuecomment-1635414429 On OpenShift, route requests are served by openshift-apiserver. There is no route CRD (random fact, this isn't the case for MicroShift, which only has kube-apiserver and routes are CRDs). This makes it harder to work around this limitation because we can't watch for route CRD creation. Maybe we should reconsider this being a blocker, or at least see if it reproduces on RCs? Meanwhile, some workarounds in CDI could be: - Using uncached client - Delaying route informer setup until they become available @alitke @dafrank Agreed that this bug shouldn't be a blocker, pushing to 4.14.1 Smoke tests already passed for the 5 builds after CNV v4.14.1.rhel9-18. Since the bug wasn't reproducible on every installation, I would like to see some more smoke results, but so far looks fine. Verified in 4.14.1: $ virtctl image-upload dv dv-ocs --image-path=./cirros-0.4.0-x86_64-disk.qcow2 --size=1Gi --storage-class=ocs-storagecluster-ceph-rbd --access-mode=ReadWriteMany --insecure --block-volume PVC default/dv-ocs not found DataVolume default/dv-ocs created Waiting for PVC dv-ocs upload pod to be ready... Pod now ready Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.stg-hmeir414.rhos-psi.cnv-qe.rhood.us 12.13 MiB / 12.13 MiB [==========================================================================================================================================================================================================] 100.00% 0s Uploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress Processing completed successfully Uploading ./cirros-0.4.0-x86_64-disk.qcow2 completed successfully The dv Succeeded and PVC is bound $ oc get dv NAME PHASE PROGRESS RESTARTS AGE dv-ocs Succeeded N/A 75s $ oc get pvc NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE dv-ocs Bound pvc-af501356-cee9-4e3d-a16a-0b5c33e97edc 1Gi RWX ocs-storagecluster-ceph-rbd 80s Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. 
@alitke @dafrank Agreed that this bug shouldn't be a blocker, pushing to 4.14.1.

Smoke tests have already passed for the 5 builds after CNV v4.14.1.rhel9-18. Since the bug wasn't reproducible on every installation, I would like to see some more smoke results, but so far it looks fine.

Verified in 4.14.1:

```console
$ virtctl image-upload dv dv-ocs --image-path=./cirros-0.4.0-x86_64-disk.qcow2 --size=1Gi --storage-class=ocs-storagecluster-ceph-rbd --access-mode=ReadWriteMany --insecure --block-volume
PVC default/dv-ocs not found
DataVolume default/dv-ocs created
Waiting for PVC dv-ocs upload pod to be ready...
Pod now ready
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.stg-hmeir414.rhos-psi.cnv-qe.rhood.us
12.13 MiB / 12.13 MiB [==========================================================] 100.00% 0s
Uploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress
Processing completed successfully
Uploading ./cirros-0.4.0-x86_64-disk.qcow2 completed successfully
```

The DV Succeeded and the PVC is Bound:

```console
$ oc get dv
NAME     PHASE       PROGRESS   RESTARTS   AGE
dv-ocs   Succeeded   N/A                   75s

$ oc get pvc
NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
dv-ocs   Bound    pvc-af501356-cee9-4e3d-a16a-0b5c33e97edc   1Gi        RWX            ocs-storagecluster-ceph-rbd   80s
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.14.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7704

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.