1565681 – Engine doesn't track transfers in progress correctly

Bug 1565681 - Engine doesn't track transfers in progress correctly

Summary: Engine doesn't track transfers in progress correctly

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.2.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	ovirt-4.2.3
Target Release:	---
Assignee:	Daniel Erez
QA Contact:	Avihai
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-04-10 14:26 UTC by Richard W.M. Jones
Modified:	2019-05-16 13:09 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-05-15 17:48:41 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:
Flags:	derez: needinfo- derez: needinfo- derez: needinfo- derez: needinfo-

Attachments	(Terms of Use)
upload_with_inactivity_script (5.78 KB, text/x-python) 2018-05-14 05:23 UTC, Avihai	no flags	Details
download with inactivity script (5.56 KB, text/x-python) 2018-05-14 12:43 UTC, Avihai	no flags	Details
Upload test script (5.78 KB, text/plain) 2018-05-14 12:52 UTC, Nir Soffer	no flags	Details
download script (5.56 KB, text/plain) 2018-05-14 12:57 UTC, Nir Soffer	no flags	Details
Show Obsolete (2) View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2018:1488	None	None	None	2018-05-15 17:49:12 UTC
oVirt gerrit	90076	master	POST	daemon: Track and report ticket idle time	2018-04-10 19:44:31 UTC
oVirt gerrit	90275	master	MERGED	core: handle image ticket idle_time	2018-04-22 11:46:47 UTC
oVirt gerrit	90555	ovirt-engine-4.2	MERGED	core: handle image ticket idle_time	2018-04-25 08:32:50 UTC

Description Richard W.M. Jones 2018-04-10 14:26:29 UTC

Description of problem:

As explained in the following thread from Nir Soffer, the way
that engine tracks in progress transfers is wrong:

> I think this is another bug Daniel and me discussed the other day.
>
> 1. you start by zeroing the entire image
>
> 2018-04-10 12:05:04,619 INFO    (Thread-9608) [images] Zeroing 33554432
> bytes at offset 0 flush False to
>
/rhev/data-center/mnt/ham0.home.annexia.org:_mnt_ovirt-data/7b4ae32d-bc79-4368-b
683-3e0b58b64bf6/images/0c96a8f1-4730-49bc-ab4b-7bb59e988ced/5fc7b730-a779-4c3d-
bb91-0652646c7d9f
> for ticket 2eb03c48-433d-4a12-9894-b09eb1ddcb01
> ...
> 2018-04-10 12:26:29,217 INFO    (Thread-10888) [images] Zeroing 10240 bytes
> at offset 42949662720 flush False to
> /rhev/data-center/mnt/ham0.home.annexia.org:
> _mnt_ovirt-data/7b4ae32d-bc79-436
>
8-b683-3e0b58b64bf6/images/0c96a8f1-4730-49bc-ab4b-7bb59e988ced/5fc7b730-a779-4c
3d-bb91-0652646c7d9f
> for ticket 2eb03c48-433d-4a12-9894-b09eb1ddcb01
>
> At this point, when engine queries ticket status, it gets:
>
> {
>    "transfered": 42949663744,
>    ...
> }
>
> Engine storage the ticket transferred value the database each time the
> ticket
> state changes.
>
> 2. You start to write data
>
> 2018-04-10 12:26:29,290 INFO    (Thread-10889) [images] Writing 512 bytes
> at offset 0 flush False to
/rhev/data-center/mnt/ham0.home.annexia.org:_mnt_ovirt-data/7b4ae32d-bc79-4368-b
683-3e0b58b64bf6/images/0c96a8f1-4730-49bc-ab4b-7bb59e988ced/5fc7b730-a779-4c3d-
bb91-0652646c7d9f
> for ticket 2eb03c48-433d-4a12-9894-b09eb1ddcb01
>
> Your writes do not change the "transfered" value, since the same range
> was already written by the zero requests.
>
> 3. Engine queries ticket state
>
> 2018-04-10 12:26:29,116 INFO    (ticket.server) [tickets] Retrieving ticket
> 2eb03c48-433d-4a12-9894-b09eb1ddcb01
> [repeated every 10 seconds]
> 2018-04-10 12:29:19,727 INFO    (ticket.server) [tickets] Retrieving ticket
> 2eb03c48-433d-4a12-9894-b09eb1ddcb01
>
> Since ticket transferred value did not change, engine thinks the ticket is
> inactive.
>
> We need to fix the way engine tracks inactivity.

Version-Release number of selected component (if applicable):

ovirt-imageio-daemon-1.3.0-0.201804031158.git8c388d1.el7.centos.noarch

How reproducible:

100%

Steps to Reproduce:
1. Do a large virt-v2v conversion.

Comment 1 Nir Soffer 2018-04-10 14:53:36 UTC

This issue effects only virt-v2v because it try to zero the entire image before 
writing the data. This zeroing is not needed when using file based disks, so it
can be solve by skipping the unneeded zero requests on virt-v2v side, but for 
block based disks we cannot skip the zero, so this must be solved on oVirt side.

I think what we need to do is:

# daemon side

- add access_time timestamp to Ticket
- every request (GET,PUT,PATCH,OPTIONS) will update the ticket access time on the
  end of the request
- report access_time in GET /tickets/ticket-uuid

# engine side

- consider last_access time when checking if ticket is inactive
- consider ticket.active status when checking if a ticket is inactive

# typical flows - large PUT/PATCH request

user submit PUT or PATCH request with big paylod

- While the the request is processed, ticket "active" value will be True.
- When the request is done, ticket "active" value becomes False, and the "access_time" will be updated to the end time.

Comment 2 Nir Soffer 2018-04-10 15:54:53 UTC

Reporting access time for a ticket is problematic, since now the caller need to 
know which clock is used (daemon uses only monotonic clock), and engine clock
may not be synchronized with host clock.

We can simplify the solution by reporting "idle_time" for a ticket.

- if a ticket is active, report 0
- if a ticket is not active, report time since last request ended

So the complete solution:

# daemon side

- keep access_time attribute per ticket
- update access_time when an operation is completed
- report idle_time in Ticket.info()

# engine side
- pause transfer if idle_time > inactivity_timeout

One issue, we don't have an easy way to update the access time when operation
completes yet, but we need for fixing inefficient way we report progress. So this
will not an easy fix.

Comment 3 Nir Soffer 2018-04-10 19:48:53 UTC

The attached patch adds idle time tracking to ovirt-imageio daemon. Engine can use
the new ticket "idle_time" value to pause inactive transfers.

Comment 4 Nir Soffer 2018-04-10 20:26:03 UTC

To test ticket idle time, use:

    watch curl -s --unix-socket /run/vdsm/ovirt-imageio-daemon.sock \
        http://localhost/tickets/<ticket-uuid>

When no operation is active, "idle_time" should increase. When uploading or
downloading, "idle_time" is always 0.

Comment 6 Nir Soffer 2018-04-11 18:50:50 UTC

The ovirt-imageio-daemon patch was merged. Now we need the engine patch using new
idle_time for pausing inactive transfers.

Comment 11 Avihai 2018-05-14 05:23:14 UTC

Created attachment 1436011 [details]
upload_with_inactivity_script

Comment 13 Avihai 2018-05-14 09:06:26 UTC

Hi Daniel, I checked the following on 

To summarize, one issue was seen -> during download inactivity 'idle_time' = 0.
I tried with both SDK script & manually pausing disk download via browser(chrome).

Tests done:
1) During upload/download see 'idle_time' = 0  -  Looks good !
upload -  'idle_time' = 0 
download -  'idle_time' = 0 


2) During upload/download inactivity 'idle_time' should increase .
   Upload looks good.
   Download does NOT !'idle_time' = 0 during inactivity.

Upload + 700 sec's inactivity - See increasing  'idle_time' ->  'idle_time' increases .
Donwload + 700 sec's inactivity - See increasing  'idle_time'  ->  'idle_time' = 0

Comment 14 Avihai 2018-05-14 09:07:19 UTC

Tests in C#13 was done on build 4.2.3.5-0.1.el7

Comment 15 Richard W.M. Jones 2018-05-14 09:09:04 UTC

If you mean have we removed the restriction on -of/-oa parameters
then no.  I can do that today, but I'm still catching up
with all my email from last week.

Comment 21 Nir Soffer 2018-05-14 12:21:50 UTC

Adding back needinfo for Avihay, see comment 19.

Comment 23 Avihai 2018-05-14 12:43:33 UTC

Created attachment 1436161 [details]
download with inactivity script

Comment 25 Nir Soffer 2018-05-14 12:52:49 UTC

Created attachment 1436173 [details]
Upload test script

Fixing content type to test/plain so we can view the script in the browser.

Comment 26 Nir Soffer 2018-05-14 12:57:30 UTC

Created attachment 1436174 [details]
download script

Comment 27 Nir Soffer 2018-05-14 13:20:18 UTC

(In reply to Avihai from comment #22)

Avihay, I looked the upload and download scripts (which I had to download since
you uploaded them without specifying content type, next time please specify
"text/plain" content type so we can view the code in the browser directly).

Upload is based on the old example code, sending a new request for each chunk:

# Simplified
with open(path, "rb") as disk:
    ...
    while pos < size:
        ...
        proxy_connection.request(
            'PUT',
            proxy_url.path,
            disk.read(chunk_size),
            headers=upload_headers,
        )
        ...
        sleep(SLEEP)

This code create an operation on the daemon side for every chunk (8MiB). When
the PUT request is completed, the ticket becomes inactive, so idle_time is 
expected to grow while you sleep.

The current example code send one request for entire image, you should use this
code for testing:
https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/upload_disk.py

When using this code idle_time will not grow while the request is not completed.
We don't support idle_time for single request, only for multiple requests (what 
virt-v2v is using).


The download script uses newer code, that make one request and loop to read
chunks and write to the file.

# Simplified
    # Send the request
    proxy_connection.request(
        'GET',
        proxy_url.path,
        headers=transfer_headers,
    )
    ...
    with open(path, "wb") as mydisk:
        ...
        while pos < image_size:
            ...
            chunk = r.read(to_read)
            ...
            mydisk.write(chunk)
            ...
            time.sleep(SLEEP)

So from daemon point of view you have an active operation, even when you sleep.

To test you want to do (both upload and download):

- Start transfer without downloading or uploading, idle_time should grow
- Or use multiple requests, sleep between the requests, idle_time should grow after
  every request was finished, and should be zero once a new request was started.

Comment 28 Avihai 2018-05-14 14:35:05 UTC

Change scripts as suggested by Nir& Daniel & test looks good meaning  "idle_time" increases when there is transfer inactivity and  "idle_time"=0 when the transfer is active. 

Also, After discussing this with Nir & Daniel it looks like the new V2V that usage ImageIO transfer is not implemented yet in the build going to GA so this cannot be tested in this bug scope.

Verified at ovirt-engine-4.2.3.5-0.1.el7.noarch
vdsm-4.20.27.1-1.el7ev.x86_64
ovirt-imageio-proxy-1.3.1.2-0.el7ev.noarch
ovirt-imageio-daemon-1.3.1.2-0.el7ev.noarch

Comment 32 errata-xmlrpc 2018-05-15 17:48:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 33 Richard W.M. Jones 2018-06-06 10:28:49 UTC

The erratum says this is fixed in  ovirt-engine-4.2.3.5-0.1.el7

I have ovirt-engine-4.2.4.1-1.el7 (on CentOS), but I see
an error indicating that inactivity_timeout is not supported.
I'll get the exact error message in a minute ...

What do I need to upgrade to get this fix?

Comment 34 Richard W.M. Jones 2018-06-06 10:47:07 UTC

nbdkit: error: /home/rjones/d/libguestfs/tmp/rhvupload.DHDuFp/rhv-upload-plugin.py: open: error: __init__() got an unexpected keyword argument 'inactivity_timeout'

I wonder if I need to upgrade the client side as well?

Comment 35 Richard W.M. Jones 2018-06-06 11:32:09 UTC

It turns out upgrading the client was necessary
(to python3-ovirt-engine-sdk4-4.3.0-0.a0.20180523git6961094.fc28).

Comment 36 Franta Kust 2019-05-16 13:09:40 UTC

BZ<2>Jira Resync

Note You need to log in before you can comment on or make changes to this bug.