1430243 – disk image upload fails (network error)

Bug 1430243 - disk image upload fails (network error)

Summary: disk image upload fails (network error)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-imageio-proxy
Sub Component:
Version:	4.0.6
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-4.2.0
Target Release:	4.2.0
Assignee:	Daniel Erez
QA Contact:	Natalie Gavrielov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1446922
TreeView+	depends on / blocked

Reported:	2017-03-08 06:54 UTC by Marcus West
Modified:	2021-08-30 13:00 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Clones:	1446922 (view as bug list)
Environment:
Last Closed:	2018-05-15 17:56:55 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
logs: engine, imageio-proxy, imageio-daemon, vdsm (1.39 MB, application/zip) 2017-05-15 18:18 UTC, Natalie Gavrielov	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHV-43250	None	None	None	2021-08-30 12:32:57 UTC
Red Hat Knowledge Base (Solution)	2730191	None	None	None	2017-03-09 04:13:29 UTC
Red Hat Product Errata	RHBA-2018:1519	None	None	None	2018-05-15 17:57:22 UTC
oVirt gerrit	74329	'None'	MERGED	core: increase default value of UploadImageXhrTimeoutInSeconds	2020-07-08 12:55:08 UTC
oVirt gerrit	74423	'None'	MERGED	core: increase default value of UploadImageXhrTimeoutInSeconds	2020-07-08 12:55:08 UTC

Description Marcus West 2017-03-08 06:54:30 UTC

## Description of problem:

When I try to upload a disk image, it fails with the error:

WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-29) [cor-id] Correlation ID: [cor-id], Call Stack: null, Custom Event ID: -1, Message: Unable to upload image to disk <disk-img> due to a network error. Make sure ovirt-imageio-proxy service is installed and configured, and ovirt-engine's certificate is registered as a valid CA in the browser.


## Version-Release number of selected component (if applicable):

rhevm-4.0.6.3-0.1.el7ev.noarch
ovirt-engine-4.0.6.3-0.1.el7ev.noarch
ovirt-imageio-common-0.3.0-0.el7ev.noarch
ovirt-imageio-proxy-0.4.0-0.el7ev.noarch
ovirt-imageio-proxy-setup-0.4.0-0.el7ev.noarch


## How reproducible:

Always

## Steps to Reproduce:

1. import CA
2. click 'disks' -> 'upload' -> start
3. select image to upload

## Actual results:

Some traffic can be seen over port 54323 (between client and rhevm), but then the status goes to 'paused by system'.  After a log time, originally stated error appears in the event tab

## Expected results:

disk image to import, or a more specific error provided as to the cause of the problem.

## Additional info

required ports are open, and 'ImageProxyAddress' is correct

Comment 4 Daniel Erez 2017-03-08 09:19:24 UTC

Hi Marcus,

Please make sure:
* The *engine* ca is imported, i.e.:
  http://<engine_url>/ovirt-engine/services/pki-resource?resource=ca-certificate&format=X509-PEM-CA
* Network isn't blocked (firewall, etc)
* You're logged to the WebAdmin using https and fqdn.

Comment 14 Nir Soffer 2017-03-10 12:33:26 UTC

(In reply to Marcus West from comment #0)
Which browser are you using? did you try with another browser?

Comment 15 Daniel Erez 2017-03-13 08:04:19 UTC

@Marcus:
* Does the client machine have any limitation on upload transfer speed?
* Can you please try to reduce the value of UploadImageChunkSizeKB (in vdc_options table) and check whether it mitigates the issue.

Comment 16 Daniel Erez 2017-03-13 15:58:52 UTC

Another solution: try increasing the 'imaged_read_timeout_sec' value (e.g. 300), located in the proxy conf file (/etc/ovirt-imageio-proxy/ovirt-imageio-proxy.conf),
and restart the proxy service..

Comment 17 Marcus West 2017-03-14 06:43:36 UTC

> which version of rhel and webob are installed in the working 4.0 env?

on rhevmhe1:

Red Hat Enterprise Linux Server release 7.3 (Maipo)
python-webob-1.2.3-6.el7.noarch
eap7-jboss-metadata-web-10.0.0-6.Final_redhat_1.1.ep7.el7.noarch
eap7-jboss-websocket-api_1.1_spec-1.1.1-1.Final_redhat_1.1.ep7.el7.noarch
ovirt-engine-4.0.6.3-0.1.el7ev.noarch
ovirt-imageio-common-0.5.0-0.el7ev.noarch
ovirt-imageio-proxy-0.5.0-0.el7ev.noarch
ovirt-imageio-proxy-setup-0.5.0-0.el7ev.noarch

> Which browser are you using? did you try with another browser?

browser: Firefox 45.4.0, on RHEL7 CSB  
I also have Chrome 54.0.2840.59, will try that also

Comment 26 Natalie Gavrielov 2017-04-30 11:15:29 UTC

Nir / Daniel,

Anything special in the scenario that should be tested? 
It seems to me just a regular UI upload..

Comment 28 Daniel Erez 2017-05-07 08:41:29 UTC

(In reply to Natalie Gavrielov from comment #26)
> Nir / Daniel,
> 
> Anything special in the scenario that should be tested? 
> It seems to me just a regular UI upload..

The addressed issue here is network bandwidth. The fix was to increase the request timeout to a more 'reasonable' one (from 10 seconds to 120 seconds). So upload should now succeed with less than optimal network conditions (depends on the UploadImageXhrTimeoutInSeconds configuration value).

Comment 29 Natalie Gavrielov 2017-05-15 18:18:39 UTC

Created attachment 1279065 [details]
logs: engine, imageio-proxy, imageio-daemon, vdsm

Daniel,

I tried to verify this issue the following way:
1. start a UI upload of a disk.
2. configure: tc qdisk add dev eno1 root netem delay 40s
(This means there should be a delay of 40 seconds of each chunk that us being sent to the engine)
3. view log and UI to see the progress of the upload.

Note: I tried a few values: 115, 100, 80, 40, 20 ,1 second. the only value that is not being paused by the system is 1 (all the rest gets paused).

engine.log shows the following warnings (a warning for a test that was paused):
-------------------------------------------------------------------------------
2017-05-15 19:52:49,793+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler1) [b36c5aac-68bd-4efe-bfbe-1f458df96dea] Transfer paused due to no updates in 65 seconds. Upload disk 'test-upload-with-delay' (id '257639b8-3c35-4d68-8c0a-fc02294fce37')
2017-05-15 20:15:32,702+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler6) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 66 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
2017-05-15 20:18:20,943+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler2) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 66 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
2017-05-15 20:28:25,883+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler10) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 66 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
2017-05-15 20:44:45,336+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler10) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 67 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
2017-05-15 20:52:11,883+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler4) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 65 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
-------------------------------------------------------------------------------

I was expecting the warning to be generated after ~120 seconds, because UploadImageXhrTimeoutInSeconds is configured with 120..

engine=# SELECT * FROM vdc_options where option_name ilike '%UploadImageXhrTimeoutInSeconds%';
 option_id |          option_name           | option_value | version 
-----------+--------------------------------+--------------+---------
       445 | UploadImageXhrTimeoutInSeconds | 120          | general
(1 row)


Any idea why this warning pops up after approximately a minute?

Comment 30 Daniel Erez 2017-05-16 07:50:17 UTC

The 'Transfer paused due to no updates in 65 seconds' message only means that the ImageTransfer entity wasn't updated for about a minute. The entity update should be triggered by the client every few seconds by invoking TransferImageStatus command (for keepalive reasons). I.e. the status wasn't polled for about a minute, probably during the traffic control execution. Any way, the UploadImageXhrTimeoutInSeconds value is only the timeout value of the request, which can be estimated by the logs [*]. Indeed about 2 minutes were elapsed between resuming and pausing the transfer. 

[*}:

2017-05-15 20:50:19,590+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler6) [c18d551d-a489-43f4-92dc-b79d458be981] Resuming session for Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')

2017-05-15 20:52:11,883+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler4) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 65 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')

Comment 31 Natalie Gavrielov 2017-05-16 08:34:10 UTC

(In reply to Daniel Erez from comment #30)
> The 'Transfer paused due to no updates in 65 seconds' message only means
> that the ImageTransfer entity wasn't updated for about a minute. The entity
> update should be triggered by the client every few seconds by invoking
> TransferImageStatus command (for keepalive reasons). I.e. the status wasn't
> polled for about a minute, probably during the traffic control execution.
> Any way, the UploadImageXhrTimeoutInSeconds value is only the timeout value
> of the request, which can be estimated by the logs [*]. Indeed about 2
> minutes were elapsed between resuming and pausing the transfer. 
> 
> [*}:
> 
> 2017-05-15 20:50:19,590+03 INFO 
> [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand]
> (DefaultQuartzScheduler6) [c18d551d-a489-43f4-92dc-b79d458be981] Resuming
> session for Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')
> 
> 2017-05-15 20:52:11,883+03 WARN 
> [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand]
> (DefaultQuartzScheduler4) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer
> paused due to no updates in 65 seconds. Upload disk 'test-2' (id
> '9916ab7b-b3c4-4d77-a06d-376a519406f8')

I may have described the steps in comment 28 a bit different than how it really was.
After the first test (as described in comment 28), I continued the test the following way:
1. remove the delay: sudo tc qdisk del dev eno1 root
2. resume upload.
3. *wait for the upload to continue*, configure the delay again.
4. view log and UI to see the progress of the upload.

Now, regarding step 3 - I in some cases I allowed upload to work for some time (a few minutes), before configuring the delay. So basically, these two log messages don't really say much, for example: 
2017-05-15 20:37:43,804+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler9) [c18d551d-a489-43f4-92dc-b79d458be981] Resuming session for Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')

2017-05-15 20:44:45,336+03 WARN  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (DefaultQuartzScheduler10) [c18d551d-a489-43f4-92dc-b79d458be981] Transfer paused due to no updates in 67 seconds. Upload disk 'test-2' (id '9916ab7b-b3c4-4d77-a06d-376a519406f8')

Here we have a 7 minutes..

Any suggestions on how to test this?

Comment 32 Daniel Erez 2017-05-16 09:56:56 UTC

It could be a bit tricky to measure the exact time as the client might retry the request a few times, but you could try using the console log in the browser. However, the issue in this bug is not really about the exact timeout period but rather to simply verify that the upload won't fail when network bandwidth isn't optimal. Just an example: upload rate less than 1MB/s. So I think that measuring the exact timing isn't relevant as we rely on both chunk size and timeout values, i.e. we can only roughly estimate the supported network conditions. In short, simply check upload rate of ~1MB/s should do it for this bug.

Comment 33 Natalie Gavrielov 2017-05-17 08:24:38 UTC

Verified using:
rhevm-4.1.2.2-0.1.el7.noarch
ovirt-imageio-proxy-1.0.0-0.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
vdsm-4.19.14-1.el7ev.x86_64
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch

Scenario tested:
Upload images using python-sdk:
   a. rate: 1MB/s
   b. rate 512KB/s

Both uploads finished successfully.

Comment 34 Natalie Gavrielov 2017-05-24 12:02:52 UTC

(In reply to Natalie Gavrielov from comment #33)
> Verified using:
> rhevm-4.1.2.2-0.1.el7.noarch
> ovirt-imageio-proxy-1.0.0-0.el7ev.noarch
> ovirt-imageio-common-1.0.0-0.el7ev.noarch
> vdsm-4.19.14-1.el7ev.x86_64
> ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
> 
> Scenario tested:
> Upload images using python-sdk:
>    a. rate: 1MB/s
>    b. rate 512KB/s
> 
> Both uploads finished successfully.

Wrong builds, retested with the following builds:
ovirt-engine-4.2.0-0.0.master.20170523140304.git04be891.el7.centos.noarch
ovirt-imageio-proxy-1.0.0-0.201705230845.gitff244bb.el7.centos.noarch
ovirt-imageio-common-1.0.0-0.201705230845.gitff244bb.el7.centos.noarch
vdsm-4.20.0-886.gitf9accf8.el7.centos.x86_64
ovirt-imageio-daemon-1.0.0-0.201705230845.gitff244bb.el7.centos.noarch

Comment 39 errata-xmlrpc 2018-05-15 17:56:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1519

Comment 40 Franta Kust 2019-05-16 13:09:28 UTC

BZ<2>Jira Resync

Note You need to log in before you can comment on or make changes to this bug.