Bug 1789093 - Large images hit a timeout or EOF when uploading a qcow2 image
Summary: Large images hit a timeout or EOF when uploading a qcow2 image
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.3.0
Assignee: Alexander Wels
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Duplicates: 1789259
Depends On: 1754920
Blocks:
 
Reported: 2020-01-08 17:55 UTC by Adam Litke
Modified: 2020-05-04 19:10 UTC
CC List: 11 users

Fixed In Version: hco-bundle-registry-container-v2.3.0-23 virt-cdi-operator-container-v2.3.0-36
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1754920
Environment:
Last Closed: 2020-05-04 19:10:37 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt containerized-data-importer pull 1095 0 None closed Add async endpoint for upload that closes connection immediately 2020-12-21 05:19:45 UTC
Github kubevirt kubevirt pull 3055 0 None closed Add asynchronous endpoint usage for virtctl upload. 2020-12-21 05:20:16 UTC
Red Hat Product Errata RHEA-2020:2011 0 None None None 2020-05-04 19:10:48 UTC

Description Adam Litke 2020-01-08 17:55:34 UTC
+++ This bug was initially created as a clone of Bug #1754920 +++

Description of problem:
The upload always hits a timeout or EOF when uploading a qcow2 image.

Version-Release number of selected component (if applicable):
cnv2.1

How reproducible:
Always

Steps to Reproduce:
1. Upload an image with virtctl image-upload

Actual results:
$ virtctl image-upload  --pvc-name=winpv --pvc-size=10Gi --image-path=/mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk.qcow2 --insecure --block-volume
PVC default/winpv created
Waiting for PVC winpv upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4

 5.73 GiB / 5.73 GiB [=========================================================================================================================================================] 100.00% 3m28s

Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF


Expected results:
Upload succeeds.

Additional info:

--- Additional comment from Natalie Gavrielov on 2019-09-25 12:26:00 UTC ---

Could this be a timeout during conversion to raw?

--- Additional comment from Alexander Wels on 2019-09-25 12:29:37 UTC ---

Can you provide the uploadserver logs?

--- Additional comment from Yan Du on 2019-09-26 02:42:39 UTC ---

The EOF happened while validating the image:

$ oc logs cdi-upload-winpv  --follow
I0926 02:27:57.144845       1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume
I0926 02:27:57.145301       1 uploadserver.go:63] Running server on 0.0.0.0:8443
2019/09/26 02:28:06 http: TLS handshake error from 10.129.1.172:43478: EOF
I0926 02:28:06.449258       1 uploadserver.go:244] Content type header is ""
I0926 02:28:06.449278       1 data-processor.go:252] Calculating available size
I0926 02:28:06.471530       1 data-processor.go:256] Checking out block volume size.
I0926 02:28:06.473668       1 data-processor.go:264] Request image size not empty.
I0926 02:28:06.473690       1 data-processor.go:269] Target size 10Gi.
I0926 02:28:06.473873       1 data-processor.go:182] New phase: TransferScratch
I0926 02:28:06.476689       1 util.go:143] Writing data...
I0926 02:30:31.957120       1 data-processor.go:182] New phase: Process
I0926 02:30:31.957174       1 data-processor.go:182] New phase: Convert
I0926 02:30:31.957179       1 data-processor.go:188] Validating image
error: unexpected EOF


$ oc describe pvc
Name:          winpv
Namespace:     default
StorageClass:  rook-ceph-block
Status:        Bound
Volume:        pvc-27b5186e-e005-11e9-b28e-fa163e6997df
Labels:        <none>
Annotations:   cdi.kubevirt.io/storage.pod.phase: Succeeded
               cdi.kubevirt.io/storage.pod.ready: false
               cdi.kubevirt.io/storage.upload.target: 
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      10Gi
Access Modes:  RWO
VolumeMode:    Block
Events:
  Type       Reason                Age                    From                                                                                                      Message
  ----       ------                ----                   ----                                                                                                      -------
  Normal     Provisioning          6m51s                  rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-fd98c466-28944_29df07b3-dfc2-11e9-9fe5-0a580a81000e  External provisioner is provisioning volume for claim "default/winpv"
  Normal     ExternalProvisioning  6m48s (x2 over 6m50s)  persistentvolume-controller                                                                               waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
Mounted By:  <none>

--- Additional comment from Yan Du on 2019-09-26 05:57:01 UTC ---

The above qcow2 image info:

$ qemu-img info disk.qcow2 
image: disk.qcow2
file format: qcow2
virtual size: 6.6G (7140155392 bytes)
disk size: 5.7G
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false

If I convert the same qcow2 image to raw with qemu-img, then the upload succeeds.
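
(For reference, the conversion step was presumably something like the following; the output filename matches the raw image shown below:)

$ qemu-img convert -f qcow2 -O raw disk.qcow2 disk_raw.img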

$ qemu-img info disk_raw.img
image: disk_raw.img
file format: raw
virtual size: 6.6G (7140155392 bytes)
disk size: 5.7G


$ virtctl image-upload  --pvc-name=winpv --pvc-size=10Gi --image-path=/mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk_raw.img --insecure --block-volume
PVC default/winpv created
Waiting for PVC winpv upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4

 6.65 GiB / 6.65 GiB [=======================================================================================================================================================================] 100.00% 1m47s

Uploading /mnt/cnv-netapp/WindowsImages/qe-images/WINDOWS_2019_X64/disk_raw.img completed successfully

$ oc logs cdi-upload-winpv --follow
I0926 05:50:53.540103       1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume
I0926 05:50:53.540435       1 uploadserver.go:63] Running server on 0.0.0.0:8443
2019/09/26 05:51:02 http: TLS handshake error from 10.129.3.227:44558: EOF
I0926 05:51:02.534025       1 uploadserver.go:244] Content type header is ""
I0926 05:51:02.534043       1 data-processor.go:252] Calculating available size
I0926 05:51:02.556587       1 data-processor.go:256] Checking out block volume size.
I0926 05:51:02.559810       1 data-processor.go:264] Request image size not empty.
I0926 05:51:02.559846       1 data-processor.go:269] Target size 10Gi.
I0926 05:51:02.560289       1 data-processor.go:182] New phase: TransferDataFile
I0926 05:51:02.564028       1 util.go:143] Writing data...
I0926 05:52:50.305765       1 data-processor.go:182] New phase: Resize
I0926 05:52:50.424873       1 data-processor.go:182] New phase: Complete
I0926 05:52:50.424953       1 uploadserver.go:263] Wrote data to /dev/cdi-block-volume
I0926 05:52:50.425208       1 uploadserver.go:151] Shutting down http server after successful upload
I0926 05:52:50.425399       1 uploadserver.go:71] UploadServer successfully exited

--- Additional comment from Natalie Gavrielov on 2019-09-26 13:23:22 UTC ---

I also got it when I used curl with a token:
$ cat dv-upload.yaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: upload-datavolume
spec:
  source:
      upload: {}
  pvc:
    volumeMode: Block
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 500Mi

$ cat token.yaml
apiVersion: upload.cdi.kubevirt.io/v1alpha1
kind: UploadTokenRequest
metadata:
   name: upload-datavolume-token
   namespace: src
spec:
   pvcName: upload-datavolume


1. oc create -f dv-upload.yaml
2. TOKEN=$(kubectl apply -f token.yaml -o="jsonpath={.status.token}")
3. curl -v --insecure -H "Authorization: Bearer $TOKEN" --data-binary @cirros-0.4.0-x86_64-disk.qcow2  https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload


$ oc get pods
NAME                           READY   STATUS    RESTARTS   AGE
cdi-upload-upload-datavolume   0/1     Pending   0          11m

$ oc logs -f cdi-uploadproxy-5f7c458c79-jlmkh -n openshift-cnv
I0923 11:28:13.801613       1 uploadproxy.go:46] Note: increase the -v level in the api deployment for more detailed logging, eg. -v=2 or -v=3
W0923 11:28:13.802283       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2019/09/26 03:01:25 http: TLS handshake error from 10.128.0.1:56988: EOF
I0926 12:32:12.003289       1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume, namespace: src
E0926 12:32:22.114270       1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
	/usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
	/usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
	/usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
	/usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1333
I0926 12:33:54.625050       1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume-file, namespace: src
E0926 12:34:03.695972       1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
	/usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
	/usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
	/usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
	/usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1333
I0926 13:09:49.192674       1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume-file, namespace: src
E0926 13:09:58.248112       1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume-file.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
	/usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
	/usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
	/usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
	/usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1333
I0926 13:11:57.457918       1 uploadproxy.go:179] Received valid token: pvc: upload-datavolume, namespace: src
E0926 13:12:06.504110       1 uploadproxy.go:211] Error connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload: timed out 5 times connecting to https://cdi-upload-upload-datavolume.src.svc/v1alpha1/upload
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).testConnect
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:274
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).proxyUploadRequest
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:210
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:188
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).handleUploadRequest-fm
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:140
net/http.HandlerFunc.ServeHTTP
	/usr/lib/golang/src/net/http/server.go:1964
net/http.(*ServeMux).ServeHTTP
	/usr/lib/golang/src/net/http/server.go:2361
kubevirt.io/containerized-data-importer/pkg/uploadproxy.(*uploadProxyApp).ServeHTTP
	/go/src/kubevirt.io/containerized-data-importer/pkg/uploadproxy/uploadproxy.go:144
net/http.serverHandler.ServeHTTP
	/usr/lib/golang/src/net/http/server.go:2741
net/http.(*conn).serve
	/usr/lib/golang/src/net/http/server.go:1847
runtime.goexit
	/usr/lib/golang/src/runtime/asm_amd64.s:1333
error: unexpected EOF

--- Additional comment from Fabian Deutsch on 2019-09-30 10:31:33 UTC ---

To me this is not a blocker, as we still support raw upload.
If needed, a user can do a client-side conversion and then upload the raw image.

Proposing to move this out to 2.1.1

Adam?

--- Additional comment from Adam Litke on 2019-09-30 17:44:40 UTC ---

I suspect it's PSI instability/over-subscription.  I agree with moving it out.

--- Additional comment from Ying Cui on 2019-10-09 12:30:52 UTC ---

@Natalie, could you provide the exact image cirros-0.4.0-x86_64-disk.qcow2? Could you try setting a smaller PVC size, just enough to reproduce this issue?

--- Additional comment from Adam Litke on 2019-10-15 11:44:05 UTC ---

Ying, can you ensure that we have, in Polarion, a tier 2 test case that tests virtctl image-upload with a large Windows image? Once we have this case and automate it, we can close this bug and handle it with regression testing.

--- Additional comment from Natalie Gavrielov on 2019-10-30 14:15:44 UTC ---

Will try to reproduce it on a bare-metal (BM) environment.

--- Additional comment from Natalie Gavrielov on 2019-11-28 15:57:27 UTC ---

1. I wasn't able to get a real BM, so I used a libvirt env. The upload there failed with "timed out waiting for the condition"; the upload pod was stuck in ContainerCreating (I suspect rook isn't stable on this env, will try to investigate).

2. I tried testing again on PSI; when using small qcow2 files (hundreds of MB), the upload works fine (no errors).

3. When I used a larger image (6.24 GiB), it uploaded 100% of the file and then reported: Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF.
   In the pod logs this seems to happen during "Validating image"; after waiting patiently for another 10 minutes the upload completed and the scratch PVC was removed:

$ virtctl image-upload  --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume
PVC default/test-pvc created
Waiting for PVC test-pvc upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4

 6.24 GiB / 6.24 GiB [=============================================================================================================================================] 100.00% 12m18s

Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF
upload pod log:
$ oc logs -f cdi-upload-test-pvc
I1128 15:13:35.995401       1 uploadserver.go:61] Upload destination: /dev/cdi-block-volume
I1128 15:13:35.996194       1 uploadserver.go:63] Running server on 0.0.0.0:8443
2019/11/28 15:13:43 http: TLS handshake error from 10.131.0.32:54908: EOF
I1128 15:13:43.304401       1 uploadserver.go:244] Content type header is ""
I1128 15:13:43.304451       1 data-processor.go:252] Calculating available size
I1128 15:13:43.324150       1 data-processor.go:256] Checking out block volume size.
I1128 15:13:43.327674       1 data-processor.go:264] Request image size not empty.
I1128 15:13:43.327718       1 data-processor.go:269] Target size 36Gi.
I1128 15:13:43.327962       1 util.go:37] deleting file: /scratch/lost+found
I1128 15:13:43.330062       1 data-processor.go:182] New phase: TransferScratch
I1128 15:13:43.333064       1 util.go:169] Writing data...
I1128 15:25:30.028058       1 data-processor.go:182] New phase: Process
I1128 15:25:30.034149       1 data-processor.go:182] New phase: Convert
I1128 15:25:30.034166       1 data-processor.go:188] Validating image
I1128 15:35:08.057641       1 data-processor.go:182] New phase: Resize
I1128 15:35:08.117081       1 data-processor.go:182] New phase: Complete
I1128 15:35:08.197618       1 util.go:37] deleting file: /scratch/tmpimage
I1128 15:35:08.481357       1 uploadserver.go:263] Wrote data to /dev/cdi-block-volume
I1128 15:35:08.481851       1 uploadserver.go:151] Shutting down http server after successful upload
I1128 15:35:08.982421       1 uploadserver.go:71] UploadServer successfully exited 


@Michael, is it possible for virtctl image-upload to wait until the UploadServer exits?

--- Additional comment from Michael Henriksen on 2019-12-02 14:25:24 UTC ---

In normal cases (no EOF), virtctl image-upload will complete right before the UploadServer pod exits.  Although possible, I don't think it would be wise to ignore errors such as this and instead wait for the UploadServer pod to exit.  Apparently this error was benign, but most probably are not, and then the program would be waiting forever.

--- Additional comment from Michael Henriksen on 2019-12-02 14:26:25 UTC ---

Approximately how long was "virtctl image-upload" running before the EOF?

--- Additional comment from Natalie Gavrielov on 2019-12-04 11:38:03 UTC ---

Michael, below the progress bar it shows 12m18s; adding a few more seconds for creating the PVC, in total it was around 12.5-13 minutes.

--- Additional comment from Adam Litke on 2019-12-10 20:56:09 UTC ---

@michael, any updates?

--- Additional comment from Michael Henriksen on 2019-12-17 02:28:32 UTC ---

The issue from https://bugzilla.redhat.com/show_bug.cgi?id=1754920#c5 was addressed in the following PR:

https://github.com/kubevirt/containerized-data-importer/pull/996

What's described in https://bugzilla.redhat.com/show_bug.cgi?id=1754920#c11 is different and I'm not sure what is going on.  Should we make a separate issue for it?

--- Additional comment from Adam Litke on 2019-12-18 14:09:50 UTC ---

Adam: In response to c16, let's split the issue reported in c11 into a new bug and resolve this one according to the status of PR#996.

--- Additional comment from Natalie Gavrielov on 2020-01-01 18:09:04 UTC ---

I used the latest CDI 2.2 build, and tried to upload a large image (10G).
The first try resulted in:

$ virtctl image-upload  --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume
PVC default/test-pvc created
Waiting for PVC test-pvc upload pod to be running...
timed out waiting for the condition

$ oc get pvc
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
test-pvc           Bound    pvc-51daa9e8-29d1-4657-b261-307692667513   36Gi       RWO            rook-ceph-block   27m
test-pvc-scratch   Bound    pvc-773b3a70-74d9-4b00-9cf9-45d746a67fe3   36Gi       RWO            rook-ceph-block   27m


The second attempt (didn't change a thing!) resulted in:
$ virtctl image-upload  --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --insecure --block-volume
PVC default/test-pvc created
Waiting for PVC test-pvc upload pod to be running...
Pod now running
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.working.oc4

10.33 GiB [=======================================================================================================================================================================================================================================] 100.00% 57m6s

Post https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload: EOF

So both issues are still relevant.

--- Additional comment from Michael Henriksen on 2020-01-06 14:46:17 UTC ---

Natalie, are you sure the first issue is still relevant?  Did you see the same stack trace in the uploadproxy log?

The output in comment 18 looks like the second issue.

--- Additional comment from Michael Henriksen on 2020-01-07 22:56:24 UTC ---

I was unable to reproduce the EOF issue locally, so Natalie hooked me up with a QA environment.

It appears that idle connections are being closed after 1 minute by the haproxy load balancer that sits in front of the k8s apiserver/router.  This is an environment-specific issue.

Uploading the same file through a NodePort service (virtctl connecting directly to worker node) succeeds.
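
(A minimal sketch of that workaround, assuming the cdi-uploadproxy service is switched to NodePort and a worker node is reachable on the allocated port; the hostname and port below are placeholders:)

$ oc -n openshift-cnv patch service cdi-uploadproxy -p '{"spec":{"type":"NodePort"}}'
$ oc -n openshift-cnv get service cdi-uploadproxy    # note the allocated node port
$ virtctl image-upload --pvc-name=test-pvc --pvc-size=36Gi --image-path=win_16.qcow2 --uploadproxy-url=https://<worker-node>:<node-port> --insecure --block-volume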

I also edited the haproxy config to increase timeouts from 1m to 60m.  After this, upload also succeeded.

Here are the config changes I made in /etc/haproxy/haproxy.cfg on cnv-executor-kev4322e:

frontend frontend_443
    bind *:443
    mode tcp
    option tcplog
    # timeout client 1m  
    timeout client 60m
    default_backend backend_443

backend backend_443
    mode tcp
    option tcplog
    option log-health-checks
    option redispatch
    log global
    balance roundrobin
    timeout connect 10s
    #timeout server 1m
    timeout server 60m
    server rhel-worker-0 cnv-executor-kev4322e-rhel-worker-0.working.oc4:443 check
    server rhel-worker-1 cnv-executor-kev4322e-rhel-worker-1.working.oc4:443 check
    server rhel-worker-2 cnv-executor-kev4322e-rhel-worker-2.working.oc4:443 check
    server worker-0 worker-0.working.oc4:443 check
    server worker-1 worker-1.working.oc4:443 check

--- Additional comment from Adam Litke on 2020-01-08 14:00:23 UTC ---

There are a few different issues making this bug confusing, so I am going to summarize our plan to resolve it:

There were two distinct cases mentioned in the original bug description:

* Timed out waiting for the condition:
- Natalie will split this into its own bug and it will be handled with lower priority in 2.3 because it does not reproduce regularly.

* EOF during conversion to RAW
- The upload eventually succeeds despite the confusing user experience (EOF error).
- I'm happy to hear that adjusting haproxy configuration works around the problem but I think it's too heavy-handed to ask CNV users to update their k8s configuration to work around this.
- For 2.2 we will address this with a release note
- For 2.3 we will change the upload state machine so that the conversion happens asynchronously after the client finishes uploading data (see the sketch below)
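
(A hedged sketch of the intended client-side flow. The async endpoint path shown here is hypothetical and is defined by the upstream PR; the DataVolume name and namespace are taken from the earlier curl example:)

$ TOKEN=$(kubectl apply -f token.yaml -o="jsonpath={.status.token}")
# POST to the assumed async path; the server would acknowledge as soon as the data is received
$ curl -v --insecure -H "Authorization: Bearer $TOKEN" --data-binary @cirros-0.4.0-x86_64-disk.qcow2 https://cdi-uploadproxy-openshift-cnv.apps.working.oc4/v1alpha1/upload-async
# conversion/validation then continues server-side; the DataVolume phase reports completion
$ oc -n src get dv upload-datavolume -o jsonpath='{.status.phase}'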

--- Additional comment from Michael Henriksen on 2020-01-08 14:09:13 UTC ---

re: "I think it's too heavy-handed to ask CNV users to update their k8s configuration to work around this"

This is not a k8s configuration, it is a CNV QE test environment configuration.

Comment 2 Adam Litke 2020-01-14 19:03:32 UTC
*** Bug 1789259 has been marked as a duplicate of this bug. ***

Comment 3 Nelly Credi 2020-03-16 08:23:12 UTC
Please add the Fixed In Version.

Comment 4 Natalie Gavrielov 2020-03-24 09:46:58 UTC
Verified with CNV 2.3: uploads of large qcow2 images finish successfully with no errors.

$ virtctl image-upload dv dv-win --size=35Gi --storage-class=nfs --insecure --image-path=win_12.qcow2
DataVolume default/dv-win created
Waiting for PVC dv-win upload pod to be ready...
Pod now ready
Uploading data to https://cdi-uploadproxy-openshift-cnv.apps.natalie-ipi.cnv-qe.rhcloud.com

 9.58 GiB / 9.58 GiB [===============================================================================================================================================] 100.00% 2m3s

Uploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress
Processing completed successfully
Uploading win_12.qcow2 completed successfully

Comment 7 errata-xmlrpc 2020-05-04 19:10:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:2011

