Bug 1565681
| Summary: | Engine doesn't track transfers in progress correctly | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Richard W.M. Jones <rjones> |
| Component: | ovirt-engine | Assignee: | Daniel Erez <derez> |
| Status: | CLOSED ERRATA | QA Contact: | Avihai <aefrat> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.2 | CC: | aefrat, derez, ebenahar, lsurette, mavital, nsoffer, ratamir, rbalakri, Rhev-m-bugs, rjones, srevivo, tnisan, ykaul, ylavi |
| Target Milestone: | ovirt-4.2.3 | Flags: | derez: needinfo- (×4) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-15 17:48:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Richard W.M. Jones
2018-04-10 14:26:29 UTC
This issue affects only virt-v2v, because it tries to zero the entire image before writing the data. This zeroing is not needed for file-based disks, so it could be solved by skipping the unneeded zero requests on the virt-v2v side, but for block-based disks we cannot skip the zeroing, so this must be solved on the oVirt side.

I think what we need to do is:

Daemon side:
- add an access_time timestamp to Ticket
- have every request (GET, PUT, PATCH, OPTIONS) update the ticket access time at the end of the request
- report access_time in GET /tickets/ticket-uuid

Engine side:
- consider the last access time when checking whether a ticket is inactive
- consider the ticket.active status when checking whether a ticket is inactive

Typical flow for a large PUT/PATCH request:
- the user submits a PUT or PATCH request with a big payload
- while the request is processed, the ticket "active" value is True
- when the request is done, the ticket "active" value becomes False, and "access_time" is updated to the end time

Reporting access time for a ticket is problematic, since the caller now needs to know which clock is used (the daemon uses only a monotonic clock), and the engine clock may not be synchronized with the host clock. We can simplify the solution by reporting "idle_time" for a ticket instead:
- if a ticket is active, report 0
- if a ticket is not active, report the time since the last request ended

So the complete solution is:

Daemon side:
- keep an access_time attribute per ticket
- update access_time when an operation completes
- report idle_time in Ticket.info()

Engine side:
- pause the transfer if idle_time > inactivity_timeout

One issue: we don't yet have an easy way to update the access time when an operation completes, but we need that anyway to fix the inefficient way we report progress. So this will not be an easy fix.

The attached patch adds idle time tracking to the ovirt-imageio daemon. The engine can use the new ticket "idle_time" value to pause inactive transfers. To test ticket idle time, use:
```
watch curl -s --unix-socket /run/vdsm/ovirt-imageio-daemon.sock \
    http://localhost/tickets/<ticket-uuid>
```
When no operation is active, "idle_time" should increase. When uploading or
downloading, "idle_time" is always 0.
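To make the reporting scheme concrete, here is a minimal, self-contained sketch of per-ticket idle-time tracking. It follows the design described above but is not the actual ovirt-imageio code; only Ticket.info(), the "idle_time" key, and the monotonic-clock choice come from this bug, while the other names are assumptions:

```python
# Minimal sketch of per-ticket idle-time tracking, following the design
# above; NOT the actual ovirt-imageio implementation.
import time


class Ticket:

    def __init__(self, uuid):
        self.uuid = uuid
        self.active = False
        # Monotonic clock: unaffected by wall-clock adjustments, but
        # meaningless to other hosts - hence we report a relative
        # idle_time instead of an absolute access_time.
        self._access_time = time.monotonic()

    def begin_operation(self):
        self.active = True

    def end_operation(self):
        # Every request (GET, PUT, PATCH, OPTIONS) updates the access
        # time when it finishes.
        self.active = False
        self._access_time = time.monotonic()

    def info(self):
        # idle_time is 0 while an operation runs, otherwise the time
        # since the last request ended.
        idle = 0 if self.active else time.monotonic() - self._access_time
        return {"uuid": self.uuid, "active": self.active, "idle_time": idle}
```

The daemon would serve this dict as JSON from GET /tickets/ticket-uuid, which is what the curl command above polls.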
The ovirt-imageio-daemon patch was merged. Now we need the engine patch that uses the new idle_time for pausing inactive transfers.

Created attachment 1436011 [details]
upload_with_inactivity_script
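As a rough illustration of what that engine-side patch has to do, the check could look like the sketch below. This is hypothetical Python (the real ovirt-engine is Java); the socket path and the query mirror the curl example above, while the code structure and pause_transfer() are assumptions:

```python
# Hypothetical sketch of the engine-side inactivity check. The real
# ovirt-engine is Java; this only illustrates the logic.
import http.client
import json
import socket

INACTIVITY_TIMEOUT = 60  # seconds; assumed to be configurable


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over the imageio daemon's unix control socket."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def ticket_idle_time(ticket_uuid):
    # Same query as the curl example earlier in this bug.
    con = UnixHTTPConnection("/run/vdsm/ovirt-imageio-daemon.sock")
    try:
        con.request("GET", "/tickets/" + ticket_uuid)
        ticket = json.loads(con.getresponse().read())
        return ticket["idle_time"]
    finally:
        con.close()


# Monitoring step (pause_transfer() is a made-up placeholder):
#
#     if ticket_idle_time(transfer.ticket_uuid) > INACTIVITY_TIMEOUT:
#         pause_transfer(transfer)
```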
Hi Daniel, I checked the following. To summarize, one issue was seen: during download inactivity, 'idle_time' = 0. I tried both the SDK script and manually pausing a disk download via the browser (Chrome).

Tests done:

1) During upload/download, 'idle_time' = 0 - looks good!
   - upload: 'idle_time' = 0
   - download: 'idle_time' = 0

2) During upload/download inactivity, 'idle_time' should increase. Upload looks good; download does NOT: 'idle_time' = 0 during inactivity.
   - upload + 700 seconds of inactivity: 'idle_time' increases
   - download + 700 seconds of inactivity: 'idle_time' = 0

The tests in comment 13 were done on build 4.2.3.5-0.1.el7.

If you mean have we removed the restriction on the -of/-oa parameters, then no. I can do that today, but I'm still catching up with all my email from last week.

Adding back needinfo for Avihai, see comment 19.

Created attachment 1436161 [details]
download with inactivity script
Created attachment 1436173 [details]
Upload test script
Fixing content type to text/plain so we can view the script in the browser.
Created attachment 1436174 [details]
download script
(In reply to Avihai from comment #22)

Avihai, I looked at the upload and download scripts (which I had to download, since you uploaded them without specifying a content type; next time please specify the "text/plain" content type so we can view the code in the browser directly).

Upload is based on the old example code, sending a new request for each chunk:

```python
# Simplified
with open(path, "rb") as disk:
    ...
    while pos < size:
        ...
        proxy_connection.request(
            'PUT',
            proxy_url.path,
            disk.read(chunk_size),
            headers=upload_headers,
        )
        ...
        sleep(SLEEP)
```

This code creates an operation on the daemon side for every chunk (8 MiB). When the PUT request completes, the ticket becomes inactive, so idle_time is expected to grow while you sleep.

The current example code sends one request for the entire image; you should use this code for testing:
https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/upload_disk.py

When using this code, idle_time will not grow while the request is not completed. We don't support idle_time within a single request, only across multiple requests (which is what virt-v2v uses).

The download script uses newer code that makes one request and loops, reading chunks and writing them to the file:

```python
# Simplified

# Send the request
proxy_connection.request(
    'GET',
    proxy_url.path,
    headers=transfer_headers,
)
...
with open(path, "wb") as mydisk:
    ...
    while pos < image_size:
        ...
        chunk = r.read(to_read)
        ...
        mydisk.write(chunk)
        ...
        time.sleep(SLEEP)
```

So from the daemon's point of view you have an active operation, even while you sleep.

To test, do one of these (for both upload and download):
- start the transfer without downloading or uploading; idle_time should grow
- or use multiple requests, sleeping between the requests; idle_time should grow after every request finishes, and should be zero once a new request starts

I changed the scripts as suggested by Nir & Daniel, and the test looks good: "idle_time" increases when there is transfer inactivity, and "idle_time" = 0 when the transfer is active.

Also, after discussing this with Nir & Daniel, it looks like the new virt-v2v that uses ImageIO transfers is not implemented yet in the build going to GA, so this cannot be tested within the scope of this bug.

Verified at:
ovirt-engine-4.2.3.5-0.1.el7.noarch
vdsm-4.20.27.1-1.el7ev.x86_64
ovirt-imageio-proxy-1.3.1.2-0.el7ev.noarch
ovirt-imageio-daemon-1.3.1.2-0.el7ev.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

The erratum says this is fixed in ovirt-engine-4.2.3.5-0.1.el7. I have ovirt-engine-4.2.4.1-1.el7 (on CentOS), but I see an error indicating that inactivity_timeout is not supported. I'll get the exact error message in a minute ... What do I need to upgrade to get this fix?

```
nbdkit: error: /home/rjones/d/libguestfs/tmp/rhvupload.DHDuFp/rhv-upload-plugin.py: open: error: __init__() got an unexpected keyword argument 'inactivity_timeout'
```

I wonder if I need to upgrade the client side as well? It turns out upgrading the client was necessary (to python3-ovirt-engine-sdk4-4.3.0-0.a0.20180523git6961094.fc28).
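For anyone hitting the same error, a minimal sketch of passing the new inactivity_timeout through the Python SDK follows. This is the keyword argument the old client rejected above; the pattern follows the SDK's upload_disk.py example linked earlier, and the URL, credentials, disk id, and timeout value are placeholders:

```python
# Sketch: starting an image transfer with the new inactivity_timeout,
# the keyword the old SDK rejected above. Requires the newer
# ovirt-engine-sdk4 client. URL, credentials and disk id are
# placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="redacted",
    insecure=True,
)

transfers_service = connection.system_service().image_transfers_service()
transfer = transfers_service.add(
    types.ImageTransfer(
        image=types.Image(id="disk-uuid"),
        # Engine pauses the transfer when the ticket's idle_time
        # exceeds this many seconds.
        inactivity_timeout=3600,
    )
)
print("transfer phase:", transfer.phase)
connection.close()
```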