Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1430243

Summary: disk image upload fails (network error)
Product: Red Hat Enterprise Virtualization Manager Reporter: Marcus West <mwest>
Component: ovirt-imageio-proxyAssignee: Daniel Erez <derez>
Status: CLOSED ERRATA QA Contact: Natalie Gavrielov <ngavrilo>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.0.6CC: derez, mwest, nsoffer, tnisan
Target Milestone: ovirt-4.2.0Keywords: ZStream
Target Release: 4.2.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of:
: 1446922 (view as bug list) Environment:
Last Closed: 2018-05-15 17:56:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1446922    
Attachments:
Description Flags
logs: engine, imageio-proxy, imageio-daemon, vdsm none

Description Marcus West 2017-03-08 06:54:30 UTC
## Description of problem:

When I try to upload a disk image, it fails with the error:

WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-29) [cor-id] Correlation ID: [cor-id], Call Stack: null, Custom Event ID: -1, Message: Unable to upload image to disk <disk-img> due to a network error. Make sure ovirt-imageio-proxy service is installed and configured, and ovirt-engine's certificate is registered as a valid CA in the browser.


## Version-Release number of selected component (if applicable):

rhevm-4.0.6.3-0.1.el7ev.noarch
ovirt-engine-4.0.6.3-0.1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-imageio-proxy-0.4.0-0.el7ev.noarch
ovirt-imageio-proxy-setup-0.4.0-0.el7ev.noarch


## How reproducible:

Always

## Steps to Reproduce:

1. import CA
2. click 'disks' -> 'upload' -> start
3. select image to upload

## Actual results:

Some traffic can be seen over port 54323 (between client and rhevm), but then the status goes to 'paused by system'.  After a log time, originally stated error appears in the event tab

## Expected results:

disk image to import, or a more specific error provided as to the cause of the problem.

## Additional info

required ports are open, and 'ImageProxyAddress' is correct

Comment 4 Daniel Erez 2017-03-08 09:19:24 UTC
Hi Marcus,

Please make sure:
* The *engine* ca is imported, i.e.:
  http://<engine_url>/ovirt-engine/services/pki-resource?resource=ca-certificate&format=X509-PEM-CA
* Network isn't blocked (firewall, etc)
* You're logged to the WebAdmin using https and fqdn.

Comment 14 Nir Soffer 2017-03-10 12:33:26 UTC
(In reply to Marcus West from comment #0)
Which browser are you using? did you try with another browser?

Comment 15 Daniel Erez 2017-03-13 08:04:19 UTC
@Marcus:
* Does the client machine have any limitation on upload transfer speed?
* Can you please try to reduce the value of UploadImageChunkSizeKB (in vdc_options table) and check whether it mitigates the issue.

Comment 16 Daniel Erez 2017-03-13 15:58:52 UTC
Another solution: try increasing the 'imaged_read_timeout_sec' value (e.g. 300), located in the proxy conf file (/etc/ovirt-imageio-proxy/ovirt-imageio-proxy.conf),
and restart the proxy service..

Comment 17 Marcus West 2017-03-14 06:43:36 UTC
> which version of rhel and webob are installed in the working 4.0 env?

on rhevmhe1:

Red Hat Enterprise Linux Server release 7.3 (Maipo)
python-webob-1.2.3-6.el7.noarch
eap7-jboss-metadata-web-10.0.0-6.Final_redhat_1.1.ep7.el7.noarch
eap7-jboss-websocket-api_1.1_spec-1.1.1-1.Final_redhat_1.1.ep7.el7.noarch
ovirt-engine-4.0.6.3-0.1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-imageio-proxy-0.5.0-0.el7ev.noarch
ovirt-imageio-proxy-setup-0.5.0-0.el7ev.noarch

> Which browser are you using? did you try with another browser?

browser: Firefox 45.4.0, on RHEL7 CSB  
I also have Chrome 54.0.2840.59, will try that also

Comment 26 Natalie Gavrielov 2017-04-30 11:15:29 UTC
Nir / Daniel,

Anything special in the scenario that should be tested? 
It seems to me just a regular UI upload..

Comment 28 Daniel Erez 2017-05-07 08:41:29 UTC
(In reply to Natalie Gavrielov from comment #26)
> Nir / Daniel,
> 
> Anything special in the scenario that should be tested? 
> It seems to me just a regular UI upload..

The addressed issue here is network bandwidth. The fix was to increase the request timeout to a more 'reasonable' one (from 10 seconds to 120 seconds). So upload should now succeed with less than optimal network conditions (depends on the UploadImageXhrTimeoutInSeconds configuration value).

Comment 29 Natalie Gavrielov 2017-05-15 18:18:39 UTC
Created attachment 1279065 [details]
logs: engine, imageio-proxy, imageio-daemon, vdsm

Daniel,

I tried to verify this issue the following way:
1. start a UI upload of a disk.
2. configure: tc qdisk add dev eno1 root netem delay 40s
(This means there should be a delay of 40 seconds of each chunk that us being sent to the engine)
3. view log and UI to see the progress of the upload.

Note: I tried a few values: 115, 100, 80, 40, 20 ,1 second. the only value that is not being paused by the system is 1 (all the rest gets paused).

engine.log shows the following warnings (a warning for a test that was paused):
-------------------------------------------------------------------------------
2017-05-15 19:52:49,793+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler1) [b36c5aac-68bd-4efe-bfbe-1f458df96dea] Transfer paused due to no updates in 65 seconds. Upload disk 'test-upload-with-delay' (id '257639b8-3c35-4d68-8c0a-fc02294fce37')
2017-05-15 20:15:32,702+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler6) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 66 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
2017-05-15 20:18:20,943+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler2) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 66 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
2017-05-15 20:28:25,883+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler10) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 66 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
2017-05-15 20:44:45,336+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler10) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 67 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
2017-05-15 20:52:11,883+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler4) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 65 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
-------------------------------------------------------------------------------

I was expecting the warning to be generated after ~120 seconds, because UploadImageXhrTimeoutInSeconds is configured with 120..

engine=# SELECT * FROM vdc_options where option_name ilike '%UploadImageXhrTimeoutInSeconds%';
 option_id |          option_name           | option_value | version 
-----------+--------------------------------+--------------+---------
       445 | UploadImageXhrTimeoutInSeconds | 120          | general
(1 row)


Any idea why this warning pops up after approximately a minute?

Comment 30 Daniel Erez 2017-05-16 07:50:17 UTC
The 'Transfer paused due to no updates in 65 seconds' message only means that the ImageTransfer entity wasn't updated for about a minute. The entity update should be triggered by the client every few seconds by invoking TransferImageStatus command (for keepalive reasons). I.e. the status wasn't polled for about a minute, probably during the traffic control execution. Any way, the UploadImageXhrTimeoutInSeconds value is only the timeout value of the request, which can be estimated by the logs [*]. Indeed about 2 minutes were elapsed between resuming and pausing the transfer. 

[*}:

2017-05-15 20:50:19,590+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler6) [c18d551d-a489-43f4-92dc-b79d458be981] Resuming session for Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')

2017-05-15 20:52:11,883+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler4) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 65 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')

Comment 31 Natalie Gavrielov 2017-05-16 08:34:10 UTC
(In reply to Daniel Erez from comment #30)
> The 'Transfer paused due to no updates in 65 seconds' message only means
> that the ImageTransfer entity wasn't updated for about a minute. The entity
> update should be triggered by the client every few seconds by invoking
> TransferImageStatus command (for keepalive reasons). I.e. the status wasn't
> polled for about a minute, probably during the traffic control execution.
> Any way, the UploadImageXhrTimeoutInSeconds value is only the timeout value
> of the request, which can be estimated by the logs [*]. Indeed about 2
> minutes were elapsed between resuming and pausing the transfer. 
> 
> [*}:
> 
> 2017-05-15 20:50:19,590+03 INFO 
> [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand]
> (DefaultQuartzScheduler6) [c18d551d-a489-43f4-92dc-b79d458be981] Resuming
> session for Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
> 
> 2017-05-15 20:52:11,883+03 WARN 
> [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand]
> (DefaultQuartzScheduler4) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer
> paused due to no updates in 65 seconds. Upload disk 'test-2' (id
> '9916ab7b-b3c4-4d77-a06d-376a519406f8')

I may have described the steps in comment 28 a bit different than how it really was.
After the first test (as described in comment 28), I continued the test the following way:
1. remove the delay: sudo tc qdisk del dev eno1 root
2. resume upload.
3. *wait for the upload to continue*, configure the delay again.
4. view log and UI to see the progress of the upload.

Now, regarding step 3 - I in some cases I allowed upload to work for some time (a few minutes), before configuring the delay. So basically, these two log messages don't really say much, for example: 
2017-05-15 20:37:43,804+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler9) [c18d551d-a489-43f4-92dc-b79d458be981] Resuming session for Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')

2017-05-15 20:44:45,336+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler10) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 67 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')

Here we have a 7 minutes..

Any suggestions on how to test this?

Comment 32 Daniel Erez 2017-05-16 09:56:56 UTC
It could be a bit tricky to measure the exact time as the client might retry the request a few times, but you could try using the console log in the browser. However, the issue in this bug is not really about the exact timeout period but rather to simply verify that the upload won't fail when network bandwidth isn't optimal. Just an example: upload rate less than 1MB/s. So I think that measuring the exact timing isn't relevant as we rely on both chunk size and timeout values, i.e. we can only roughly estimate the supported network conditions. In short, simply check upload rate of ~1MB/s should do it for this bug.

Comment 33 Natalie Gavrielov 2017-05-17 08:24:38 UTC
Verified using:
rhevm-4.1.2.2-0.1.el7.noarch
ovirt-imageio-proxy-1.0.0-0.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
vdsm-4.19.14-1.el7ev.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch

Scenario tested:
Upload images using python-sdk:
   a. rate: 1MB/s
   b. rate 512KB/s

Both uploads finished successfully.

Comment 34 Natalie Gavrielov 2017-05-24 12:02:52 UTC
(In reply to Natalie Gavrielov from comment #33)
> Verified using:
> rhevm-4.1.2.2-0.1.el7.noarch
> ovirt-imageio-proxy-1.0.0-0.el7ev.noarch
> ovirt-imageio-common-1.0.0-0.el7ev.noarch
> vdsm-4.19.14-1.el7ev.x86_64
> ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
> 
> Scenario tested:
> Upload images using python-sdk:
>    a. rate: 1MB/s
>    b. rate 512KB/s
> 
> Both uploads finished successfully.

Wrong builds, retested with the following builds:
ovirt-engine-4.2.0-0.0.master.20170523140304.git04be891.el7.centos.noarch
ovirt-imageio-proxy-1.0.0-0.201705230845.gitff244bb.el7.centos.noarch
ovirt-imageio-common-1.0.0-0.201705230845.gitff244bb.el7.centos.noarch
vdsm-4.20.0-886.gitf9accf8.el7.centos.x86_64
ovirt-imageio-daemon-1.0.0-0.201705230845.gitff244bb.el7.centos.noarch

Comment 39 errata-xmlrpc 2018-05-15 17:56:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1519

Comment 40 Franta Kust 2019-05-16 13:09:28 UTC
BZ<2>Jira Resync