Bug 1754920
| Summary: | Large images met timeout or EOF when uploading a qcow2 image | |||
|---|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Yan Du <yadu> | |
| Component: | Documentation | Assignee: | Bob Gaydos <bgaydos> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Irina Gulina <igulina> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 2.1.0 | CC: | alitke, aspauldi, awels, cnv-qe-bugs, fdeutsch, mhenriks, ncredi, ngavrilo, sgordon, ycui | |
| Target Milestone: | --- | Flags: | alitke:
needinfo-
alitke: needinfo- |
|
| Target Release: | 2.2.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | 2.2.0 | Doc Type: | Known Issue | |
| Doc Text: |
Cause: An idle upload connection is closed before server-side processing can complete.
Consequence: An EOF error is returned to the user even though the operation continues normally on the server
Workaround (if any): Check the status of the operation by examining the CDI annotations on the affected PVC
Result: A successful upload is confirmed
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1789093 (view as bug list) | Environment: | ||
| Last Closed: | 2020-01-30 20:02:40 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1789093 | |||
|
Description
Yan Du
2019-09-24 11:21:41 UTC
Could this be a timeout during conversion to raw? Can you provide the uploadserver logs? The EOF happened during validating the image:
$ oc logs cdi-upload-winpv --follow
I0926 02:27:57.144845 1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume
I0926 02:27:57.145301 1 uploadserver.go:63] Running server on 0.0.0.0:8443
2019/09/26 02:28:06 http: TLS handshake error from 10.129.1.172:43478: EOF
I0926 02:28:06.449258 1 uploadserver.go:244] Content type header is ""
I0926 02:28:06.449278 1 data-processor.go:252] Calculating available size
I0926 02:28:06.471530 1 data-processor.go:256] Checking out block volume size.
I0926 02:28:06.473668 1 data-processor.go:264] Request image size not empty.
I0926 02:28:06.473690 1 data-processor.go:269] Target size 10Gi.
I0926 02:28:06.473873 1 data-processor.go:182] New phase: TransferScratch
I0926 02:28:06.476689 1 util.go:143] Writing data...
I0926 02:30:31.957120 1 data-processor.go:182] New phase: Process
I0926 02:30:31.957174 1 data-processor.go:182] New phase: Convert
I0926 02:30:31.957179 1 data-processor.go:188] Validating image
error: unexpected EOF
$ oc describe pvc
Name: winpv
Namespace: default
StorageClass: rook-ceph-block
Status: Bound
Volume: pvc-27b5186e-e005-11e9-b28e-fa163e6997df
Labels: <none>
Annotations: cdi.kubevirt.io/storage.pod.phase: Succeeded
cdi.kubevirt.io/storage.pod.ready: false
cdi.kubevirt.io/storage.upload.target:
pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 10Gi
Access Modes: RWO
VolumeMode: Block
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Provisioning 6m51s rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-fd98c466-28944_29df07b3-dfc2-11e9-9fe5-0a580a81000e External provisioner is provisioning volume for claim "default/winpv"
Normal ExternalProvisioning 6m48s (x2 over 6m50s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
Mounted By: <none>
The above qcow2 image info:
$ qemu-img info disk.qcow2
image: disk.qcow2
file format: qcow2
virtual size: 6.6G (7140155392 bytes)
disk size: 5.7G
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
If I convert the same qcow2 image to raw by qemu-img, then upload succeed.
$ qemu-img info disk_raw.img
image: disk_raw.img
file format: raw
virtual size: 6.6G (7140155392 bytes)
disk size: 5.7G
$ virtctl image-upload --pvc-name=winpv --pvc-size=10Gi --image-path=/mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk_raw.img --insecure --block-volume
PVC default/winpv created
Waiting for PVC winpv upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4
6.65 GiB / 6.65 GiB [=======================================================================================================================================================================] 100.00% 1m47s
Uploading /mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk_raw.img completed successfully
$ oc logs cdi-upload-winpv --follow
I0926 05:50:53.540103 1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume
I0926 05:50:53.540435 1 uploadserver.go:63] Running server on 0.0.0.0:8443
2019/09/26 05:51:02 http: TLS handshake error from 10.129.3.227:44558: EOF
I0926 05:51:02.534025 1 uploadserver.go:244] Content type header is ""
I0926 05:51:02.534043 1 data-processor.go:252] Calculating available size
I0926 05:51:02.556587 1 data-processor.go:256] Checking out block volume size.
I0926 05:51:02.559810 1 data-processor.go:264] Request image size not empty.
I0926 05:51:02.559846 1 data-processor.go:269] Target size 10Gi.
I0926 05:51:02.560289 1 data-processor.go:182] New phase: TransferDataFile
I0926 05:51:02.564028 1 util.go:143] Writing data...
I0926 05:52:50.305765 1 data-processor.go:182] New phase: Resize
I0926 05:52:50.424873 1 data-processor.go:182] New phase: Complete
I0926 05:52:50.424953 1 uploadserver.go:263] Wrote data to /dev/cdi-block-volume
I0926 05:52:50.425208 1 uploadserver.go:151] Shutting down http server after successful upload
I0926 05:52:50.425399 1 uploadserver.go:71] UploadServer successfully exited
I also got it when I used curl with token:
$ cat dv-upload.yaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
name: upload-datavolume
spec:
source:
upload: {}
pvc:
volumeMode: Block
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 500Mi
$ cat token.yaml
apiVersion: upload.cdi.kubevirt.io/v1alpha1
kind: UploadTokenRequest
metadata:
name: upload-datavolume-token
namespace: src
spec:
pvcName: upload-datavolume
1. oc create -f dv-upload.yaml
2. TOKEN=$(kubectl apply -f token.yaml -o="jsonpath={.status.token}")
3. curl -v --insecure -H "Authorization: Bearer $TOKEN" --data-binary @cirros-0.4.0-x86_64-disk.qcow2 https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload
$ oc get pods
NAME READY STATUS RESTARTS AGE
cdi-upload-upload-datavolume 0/1 Pending 0 11m
$ oc logs -f cdi-uploadproxy-5f7c458c79-jlmkh -n openshift-cnv
I0923 11:28:13.801613 1 uploadproxy.go:46] Note: increase the -v level in the api deployment for more detailed logging, eg. -v=2 or -v=3
W0923 11:28:13.802283 1 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2019/09/26 03:01:25 http: TLS handshake error from 10.128.0.1:56988: EOF
I0926 12:32:12.003289 1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume, namespace: src
E0926 12:32:22.114270 1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
/usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
/usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
/usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
/usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
/usr/lib/golang/src/runtime/asm_amd64.s:1333
I0926 12:33:54.625050 1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume-file, namespace: src
E0926 12:34:03.695972 1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
/usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
/usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
/usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
/usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
/usr/lib/golang/src/runtime/asm_amd64.s:1333
I0926 13:09:49.192674 1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume-file, namespace: src
E0926 13:09:58.248112 1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
/usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
/usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
/usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
/usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
/usr/lib/golang/src/runtime/asm_amd64.s:1333
I0926 13:11:57.457918 1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume, namespace: src
E0926 13:12:06.504110 1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
/usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
/usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
/usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
/usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
/usr/lib/golang/src/runtime/asm_amd64.s:1333
error: unexpected EOF
Try to reproduce it with a BM 1. I wasn't able to get a real BM, so I got a libvirt env, the upload there failed with: "timed out waiting for the condition", upload pod was in ContainerCreating (I suspect rook isn't stable on this env, will try to investigate). 2. I tried testing again on PSI, and when using small qcow2 files (hundreds of MB) - the upload works fine (no errors) 3. When I used a larger image: 6.24 GiB, it uploaded 100% of the file and then issued: Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF. In the pod logs it seems to happen when "Validating image", after waiting patiently for another 10 minutes the upload completed and the scratch pvc was removed: $ virtctl image-upload --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume PVC default/test-pvc created Waiting for PVC test-pvc upload pod to be running... Pod now running Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4 6.24 GiB / 6.24 GiB [=============================================================================================================================================] 100.00% 12m18s Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF upload pod log: $ oc logs -f cdi-upload-test-pvc I1128 15:13:35.995401 1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume I1128 15:13:35.996194 1 uploadserver.go:63] Running server on 0.0.0.0:8443 2019/11/28 15:13:43 http: TLS handshake error from 10.131.0.32:54908: EOF I1128 15:13:43.304401 1 uploadserver.go:244] Content type header is "" I1128 15:13:43.304451 1 data-processor.go:252] Calculating available size I1128 15:13:43.324150 1 data-processor.go:256] Checking out block volume size. I1128 15:13:43.327674 1 data-processor.go:264] Request image size not empty. I1128 15:13:43.327718 1 data-processor.go:269] Target size 36Gi. I1128 15:13:43.327962 1 util.go:37] deleting file: /scratch/lost+found I1128 15:13:43.330062 1 data-processor.go:182] New phase: TransferScratch I1128 15:13:43.333064 1 util.go:169] Writing data... I1128 15:25:30.028058 1 data-processor.go:182] New phase: Process I1128 15:25:30.034149 1 data-processor.go:182] New phase: Convert I1128 15:25:30.034166 1 data-processor.go:188] Validating image I1128 15:35:08.057641 1 data-processor.go:182] New phase: Resize I1128 15:35:08.117081 1 data-processor.go:182] New phase: Complete I1128 15:35:08.197618 1 util.go:37] deleting file: /scratch/tmpimage I1128 15:35:08.481357 1 uploadserver.go:263] Wrote data to /dev/cdi-block-volume I1128 15:35:08.481851 1 uploadserver.go:151] Shutting down http server after successful upload I1128 15:35:08.982421 1 uploadserver.go:71] UploadServer successfully exited @Michael, is it possible for virtctl image-upload to wait until the UploadServer exits? In normal cases (no EOF), virtctl image-upload will complete right before the UploadServer pod exits. Although possible, I don't think it would be wise to ignore errors such as this and instead wait for the UploadServer pod to exit. Apparently, this error was benign, but most probably are not. Then the program will be waiting forever. Approximately how long was "virtctl image-upload" running before the EOF? Michael, below the progress bar, it shows 12m18s - so let's add a few more seconds for creating the pvc, in total it will be around 12.5-13 minutes. @michael, any updates? The issue from https://bugzilla.redhat.com/show_bug.cgi?id=1754920#c5 was addressed in the following PR: https://github.com/kubevirt/containerized-data-importer/pull/996 What's described in https://bugzilla.redhat.com/show_bug.cgi?id=1754920#c11 is different and I'm not sure what is going on. Should we make a separate issue for it? Adam: In response to c16, let's split the issue reported in c11 into a new bug and resolve this one according to the status of PR#996. I used the latest CDI 2.2 build, and tried to upload a large image (10G). The first try resulted in: $ virtctl image-upload --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume PVC default/test-pvc created Waiting for PVC test-pvc upload pod to be running... timed out waiting for the condition $ oc get pvc NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE test-pvc Bound pvc-51daa9e8-29d1-4657-b261-307692667513 36Gi RWO rook-ceph-block 27m test-pvc-scratch Bound pvc-773b3a70-74d9-4b00-9cf9-45d746a67fe3 36Gi RWO rook-ceph-block 27m The second attempt (didn't change a thing!) resulted with: $ virtctl image-upload --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume PVC default/test-pvc created Waiting for PVC test-pvc upload pod to be running... Pod now running Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4 10.33 GiB [=======================================================================================================================================================================================================================================] 100.00% 57m6s Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF So both issues still relevant Natalie are you sure the first issue is still relevant? Did you see the same stacktrace in uploadproxy log? The output in comment 18 looks like the second issue. There are a few different issues making this bug confusing so I am going to summarize our plan to resolve this: There were two distinct cases mentioned in the original bug description: * Timed out waiting for the condition: - Natalie will split this into its own bug and it will be handled with lower priority in 2.3 because it does not reproduce regularly. * EOF during conversion to RAW - The upload eventually succeeds despite the confusing user experience (EOF error). - I'm happy to hear that adjusting haproxy configuration works around the problem but I think it's too heavy-handed to ask CNV users to update their k8s configuration to work around this. - For 2.2 we will address this with a release note - For 2.3 we will change the upload state machine so that the conversion happens asynchronously after the client finishes uploading data re: "I think it's too heavy-handed to ask CNV users to update their k8s configuration to work around this" This is not a k8s configuration, it is a CNV QE test environment configuration. (In reply to Michael Henriksen from comment #22) > re: "I think it's too heavy-handed to ask CNV users to update their k8s > configuration to work around this" > > This is not a k8s configuration, it is a CNV QE test environment > configuration. How would a regular user avoid hitting this issue with very large images in their default OCP installation? Opened bug #1789259 that addresses the "timed out waiting for the condition" issue Docs team: Could we add a release note to 2.2 with text similar to the following: ``` When using 'virtctl image-upload' to upload certain large qcow2 VM disk images an EOF error may be reported after all data has been transmitted even though the upload is progressing normally or has completed. The following command can be used to check the status of the upload on a given PVC: oc describe pvc <pvc-name> | grep cdi.kubevirt.io/storage.pod.phase ``` The underlying issue will be fixed in 2.3 with an enhancement to the upload process. Moving this to ON_QA. See https://github.com/openshift/openshift-docs/pull/18680 for links to current test builds. Removing completely from 2.2 Release Notes. It wasn't reported in 2.1 RN so there is no mention of the problem now in 2.2. Correct me if that's not right. Thanks, Bob If QE agrees with comment 30, can QE change state to VERIFIED? Thanks. Bob @nelly I think this is me confusing the "fixed in 2.2" as the real bug and not this bug. Sorry, adding this content back for 2.2: * When running `virtctl image-upload` to upload large VM disk images in `qcow2` format, an end-of-file (EOF) error may be reported after the data is transmitted, even though the upload is either progressing normally or completed. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1754920[*BZ#1754920*]) + Run the following command to check the status of an upload on a given PVC: + ---- $ oc describe pvc <pvc-name> | grep cdi.kubevirt.io/storage.pod.phase ---- Verified the section referring to this issue in https://github.com/openshift/openshift-docs/pull/18680/files closing this bug manually; it should have been included in https://access.redhat.com/errata/RHEA-2020:0307 |