Description of problem:

As explained in the following thread from Nir Soffer, the way that engine tracks in-progress transfers is wrong:

> I think this is another bug Daniel and me discussed the other day.
>
> 1. You start by zeroing the entire image
>
>     2018-04-10 12:05:04,619 INFO (Thread-9608) [images] Zeroing 33554432 bytes at offset 0 flush False to /rhev/data-center/mnt/ham0.home.annexia.org:_mnt_ovirt-data/7b4ae32d-bc79-4368-b683-3e0b58b64bf6/images/0c96a8f1-4730-49bc-ab4b-7bb59e988ced/5fc7b730-a779-4c3d-bb91-0652646c7d9f for ticket 2eb03c48-433d-4a12-9894-b09eb1ddcb01
>     ...
>     2018-04-10 12:26:29,217 INFO (Thread-10888) [images] Zeroing 10240 bytes at offset 42949662720 flush False to /rhev/data-center/mnt/ham0.home.annexia.org:_mnt_ovirt-data/7b4ae32d-bc79-4368-b683-3e0b58b64bf6/images/0c96a8f1-4730-49bc-ab4b-7bb59e988ced/5fc7b730-a779-4c3d-bb91-0652646c7d9f for ticket 2eb03c48-433d-4a12-9894-b09eb1ddcb01
>
> At this point, when engine queries ticket status, it gets:
>
>     {
>         "transfered": 42949663744,
>         ...
>     }
>
> Engine stores the ticket "transfered" value in the database each time the ticket state changes.
>
> 2. You start to write data
>
>     2018-04-10 12:26:29,290 INFO (Thread-10889) [images] Writing 512 bytes at offset 0 flush False to /rhev/data-center/mnt/ham0.home.annexia.org:_mnt_ovirt-data/7b4ae32d-bc79-4368-b683-3e0b58b64bf6/images/0c96a8f1-4730-49bc-ab4b-7bb59e988ced/5fc7b730-a779-4c3d-bb91-0652646c7d9f for ticket 2eb03c48-433d-4a12-9894-b09eb1ddcb01
>
> Your writes do not change the "transfered" value, since the same range was already written by the zero requests.
>
> 3. Engine queries ticket state
>
>     2018-04-10 12:26:29,116 INFO (ticket.server) [tickets] Retrieving ticket 2eb03c48-433d-4a12-9894-b09eb1ddcb01
>     [repeated every 10 seconds]
>     2018-04-10 12:29:19,727 INFO (ticket.server) [tickets] Retrieving ticket 2eb03c48-433d-4a12-9894-b09eb1ddcb01
>
> Since the ticket "transfered" value did not change, engine thinks the ticket is inactive.
>
> We need to fix the way engine tracks inactivity.

Version-Release number of selected component (if applicable):
ovirt-imageio-daemon-1.3.0-0.201804031158.git8c388d1.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1. Do a large virt-v2v conversion.
This issue affects only virt-v2v, because it tries to zero the entire image before writing the data. This zeroing is not needed when using file-based disks, so it could be solved by skipping the unneeded zero requests on the virt-v2v side; but for block-based disks we cannot skip the zeroing, so this must be solved on the oVirt side.

I think what we need to do is:

# daemon side
- add an access_time timestamp to Ticket
- every request (GET, PUT, PATCH, OPTIONS) will update the ticket access time at the end of the request
- report access_time in GET /tickets/<ticket-uuid>

# engine side
- consider the last access time when checking if a ticket is inactive
- consider the ticket "active" status when checking if a ticket is inactive

# typical flows
- large PUT/PATCH request: user submits a PUT or PATCH request with a big payload
  - While the request is processed, the ticket "active" value will be True.
  - When the request is done, the ticket "active" value becomes False, and the access_time will be updated to the end time.
Reporting access time for a ticket is problematic, since the caller then needs to know which clock is used (the daemon uses only the monotonic clock), and the engine clock may not be synchronized with the host clock.

We can simplify the solution by reporting "idle_time" for a ticket:
- if a ticket is active, report 0
- if a ticket is not active, report the time since the last request ended

So the complete solution:

# daemon side
- keep an access_time attribute per ticket
- update access_time when an operation completes
- report idle_time in Ticket.info()

# engine side
- pause the transfer if idle_time > inactivity_timeout

One issue: we don't have an easy way to update the access time when an operation completes yet, but we need that anyway for fixing the inefficient way we report progress. So this will not be an easy fix.
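A minimal sketch of the daemon-side bookkeeping described above, assuming a per-ticket count of in-flight operations (the class and method names here are illustrative, not the actual ovirt-imageio code):

```python
import threading
import time

class Ticket:
    """Illustrative sketch of per-ticket idle time tracking."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ongoing = 0  # operations currently in flight
        self._access_time = time.monotonic()

    def begin_operation(self):
        with self._lock:
            self._ongoing += 1

    def end_operation(self):
        # Update access_time when an operation completes, per the plan above.
        with self._lock:
            self._ongoing -= 1
            self._access_time = time.monotonic()

    def idle_time(self):
        # Using the monotonic clock avoids reporting wall-clock times that
        # the engine would have to correlate with the host clock.
        with self._lock:
            if self._ongoing:
                return 0
            return time.monotonic() - self._access_time
```

While any operation is active, idle_time is 0; once the last operation ends, idle_time grows from the moment the request finished.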
The attached patch adds idle time tracking to ovirt-imageio daemon. Engine can use the new ticket "idle_time" value to pause inactive transfers.
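A sketch of the engine-side check this enables, assuming the engine polls GET /tickets/&lt;ticket-uuid&gt; and reads the reported "idle_time" value (the function name and timeout value are illustrative):

```python
INACTIVITY_TIMEOUT = 60  # seconds; illustrative value, engine-configurable

def should_pause(ticket_info, timeout=INACTIVITY_TIMEOUT):
    """Pause the transfer when the daemon reports that the ticket has
    been idle longer than the inactivity timeout. An active ticket
    reports idle_time 0, so it is never paused by this check."""
    return ticket_info.get("idle_time", 0) > timeout
```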
To test ticket idle time, use:

    watch curl -s --unix-socket /run/vdsm/ovirt-imageio-daemon.sock \
        http://localhost/tickets/<ticket-uuid>

When no operation is active, "idle_time" should increase. When uploading or downloading, "idle_time" is always 0.
The ovirt-imageio-daemon patch was merged. Now we need the engine patch that uses the new idle_time value for pausing inactive transfers.
Created attachment 1436011 [details] upload_with_inactivity_script
Hi Daniel,

To summarize, one issue was seen: during download inactivity, 'idle_time' = 0. I tried with both the SDK script and manually pausing a disk download via browser (Chrome).

Tests done:

1) During upload/download, 'idle_time' = 0 - looks good!
   upload - 'idle_time' = 0
   download - 'idle_time' = 0

2) During upload/download inactivity, 'idle_time' should increase. Upload looks good; download does NOT - 'idle_time' = 0 during inactivity.
   Upload + 700 seconds of inactivity - 'idle_time' increases.
   Download + 700 seconds of inactivity - 'idle_time' = 0.
The tests in comment #13 were done on build 4.2.3.5-0.1.el7.
If you mean have we removed the restriction on -of/-oa parameters then no. I can do that today, but I'm still catching up with all my email from last week.
Adding back needinfo for Avihay, see comment 19.
Created attachment 1436161 [details] download with inactivity script
Created attachment 1436173 [details] Upload test script Fixing content type to text/plain so we can view the script in the browser.
Created attachment 1436174 [details] download script
(In reply to Avihai from comment #22)

Avihay, I looked at the upload and download scripts (which I had to download since you uploaded them without specifying a content type; next time please specify the "text/plain" content type so we can view the code in the browser directly).

The upload script is based on the old example code, sending a new request for each chunk:

    # Simplified
    with open(path, "rb") as disk:
        ...
        while pos < size:
            ...
            proxy_connection.request(
                'PUT',
                proxy_url.path,
                disk.read(chunk_size),
                headers=upload_headers,
            )
            ...
            sleep(SLEEP)

This code creates an operation on the daemon side for every chunk (8 MiB). When the PUT request is completed, the ticket becomes inactive, so idle_time is expected to grow while you sleep.

The current example code sends one request for the entire image; you should use this code for testing:
https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/upload_disk.py

When using this code, idle_time will not grow while the request is not completed. We don't support idle_time within a single request, only between multiple requests (which is what virt-v2v is using).

The download script uses newer code that makes one request and loops to read chunks and write them to the file:

    # Simplified
    # Send the request
    proxy_connection.request(
        'GET',
        proxy_url.path,
        headers=transfer_headers,
    )
    ...
    with open(path, "wb") as mydisk:
        ...
        while pos < image_size:
            ...
            chunk = r.read(to_read)
            ...
            mydisk.write(chunk)
            ...
            time.sleep(SLEEP)

So from the daemon's point of view you have an active operation, even while you sleep.

To test, you want to do (for both upload and download):
- Start a transfer without downloading or uploading; idle_time should grow.
- Or use multiple requests, sleeping between the requests; idle_time should grow after every request has finished, and should be zero once a new request is started.
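The multi-request pattern recommended above (one PUT per chunk, as virt-v2v uses) can be sketched with plain http.client; between chunks no operation is active on the daemon side, so idle_time would grow during any sleep. The host, port, and ticket path below are placeholders, not the real daemon endpoint:

```python
import http.client

def upload_in_chunks(host, port, ticket_path, data, chunk_size):
    """Sketch of a multi-request upload: one PUT per chunk, each with a
    Content-Range header describing the byte range it covers. The ticket
    becomes inactive after each request completes."""
    size = len(data)
    conn = http.client.HTTPConnection(host, port)
    pos = 0
    while pos < size:
        chunk = data[pos:pos + chunk_size]
        end = pos + len(chunk) - 1
        conn.request(
            "PUT",
            ticket_path,
            chunk,
            headers={"Content-Range": "bytes %d-%d/%d" % (pos, end, size)},
        )
        resp = conn.getresponse()
        resp.read()  # drain the response before sending the next chunk
        pos += len(chunk)
    conn.close()
```

A test could insert a sleep between chunks and poll the ticket; idle_time should grow during the sleep and drop back to 0 when the next PUT starts.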
The scripts were changed as suggested by Nir and Daniel, and the test looks good, meaning "idle_time" increases when there is transfer inactivity and "idle_time" = 0 when the transfer is active.

Also, after discussing this with Nir and Daniel, it looks like the new virt-v2v that uses the imageio transfer is not yet implemented in the build going to GA, so it cannot be tested within the scope of this bug.

Verified on:
ovirt-engine-4.2.3.5-0.1.el7.noarch
vdsm-4.20.27.1-1.el7ev.x86_64
ovirt-imageio-proxy-1.3.1.2-0.el7ev.noarch
ovirt-imageio-daemon-1.3.1.2-0.el7ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488
The erratum says this is fixed in ovirt-engine-4.2.3.5-0.1.el7 I have ovirt-engine-4.2.4.1-1.el7 (on CentOS), but I see an error indicating that inactivity_timeout is not supported. I'll get the exact error message in a minute ... What do I need to upgrade to get this fix?
nbdkit: error: /home/rjones/d/libguestfs/tmp/rhvupload.DHDuFp/rhv-upload-plugin.py: open: error: __init__() got an unexpected keyword argument 'inactivity_timeout' I wonder if I need to upgrade the client side as well?
It turns out upgrading the client was necessary (to python3-ovirt-engine-sdk4-4.3.0-0.a0.20180523git6961094.fc28).