1524184 – [RFE] Support transfer cancellation - abort active transfers without client cooperation

Bug 1524184 - [RFE] Support transfer cancellation - abort active transfers without client cooperation

Summary: [RFE] Support transfer cancellation - abort active transfers without client c...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-imageio
Classification:	oVirt
Component:	Daemon
Sub Component:
Version:	1.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	ovirt-4.4.3
Target Release:	2.0.10
Assignee:	Nir Soffer
QA Contact:	Ilan Zuckerman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-12-10 14:48 UTC by Idan Shaby
Modified:	2020-12-27 06:47 UTC (History)
CC List:	8 users (show)
Fixed In Version:	ovirt-imageio-2.0.10
Clone Of:
Environment:
Last Closed:	2020-11-11 06:41:49 UTC
oVirt Team:	Storage
Embargoed:
Flags:	sbonazzo: ovirt-4.4? ylavi: exception+ izuckerm: testing_plan_complete+ pm-rhel: planning_ack? pm-rhel: devel_ack+ pm-rhel: testing_ack+

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	110139	master	MERGED	ops: Support cancelling operations	2021-01-14 10:43:26 UTC
oVirt gerrit	110140	master	MERGED	auth: Support canceling tickets	2021-01-14 10:42:48 UTC
oVirt gerrit	110141	master	MERGED	http: Add connection id	2021-01-14 10:42:47 UTC
oVirt gerrit	110142	master	MERGED	backends: Keep connection context in the ticket	2021-01-14 10:42:48 UTC
oVirt gerrit	110143	master	MERGED	auth: Enable ticket cancellation	2021-01-14 10:42:48 UTC

Description Idan Shaby 2017-12-10 14:48:02 UTC

Description of problem:
When calling to Tickets.delete, current active operations are still served and can be used by the client until the timeout is reached.

Version-Release number of selected component (if applicable):
0062fd1cdd02b3aca740121504edf814cb2d9b4b

How reproducible:
100%

Steps to Reproduce:
1. Start an upload flow using the sdk.
2. Cancel the upload via the ui.

Actual results:
The transfer and disk entities are removed from the db, the ticket is removed from the daemon, but the upload is still in progress. If it makes it before the timeout, the upload completes and the disk remains untouched in the storage.

Expected results:
The upload should fail in the middle (the next chunk that the client sends after the ticket is removed should not be accepted).

Additional info:
Since the sdk's upload/download examples use a single request (which is legitimate), no further authntications are required but the first one, which occurs before the upload starts (and when extending the ticket, but this bug is not about this case). Therefore, the client is not rejected by the daemon in this case.

Comment 1 Nir Soffer 2017-12-10 16:00:44 UTC

I think we need to abort the current operations when a ticket expires, otherwise expiration has no meaning.

On the other hand, this is a minor issue, since the client had a valid ticket, and was allowed to use it. The purpose of the expiration is to protect the owner of the
ticket.

I think the best way to fix it is:

1. Add the schedule module from vdsm to ovirt_imageio_common/schedule.py
2. schedule ticket expiration when adding a ticket
   see how vdsm uses the schedule module in the executor module.
3. when a ticket expires, abort the current operations, and remove the ticket
4. aborting an operation should set a flag in the operation object
5. operation object should check the "aborted" flag on each iteration, and raise
   a special OperationAborted exception when the flag is set.

Same solution is needed in the proxy.

Comment 2 Idan Shaby 2017-12-11 05:17:54 UTC

Nir, this can be a great idea, but it's irrelevant at all for this bug.
We can open another bug to implement what you suggested, but in this bug we simply need to abort the operations when the ticket is removed, not when it expired.

Comment 3 Nir Soffer 2018-01-29 15:31:30 UTC

We want to fix this sometimes, but we don't have a real need to do it now. Touching
this code now will make it harder to change the code to support the new sparse
format, needed for backup solution. I suggest to defer this to 4.3 or later 4.2.z.

Comment 4 Tal Nisan 2018-01-29 15:47:49 UTC

How does adding a logic to stop running actions upon a ticket removal will block us from handling the sparse format?

Comment 5 Nir Soffer 2018-01-29 16:58:59 UTC

(In reply to Tal Nisan from comment #4)
> How does adding a logic to stop running actions upon a ticket removal will
> block us from handling the sparse format?

It requires implementing abort in the new code now. We can implement abort later
since there is no real need to do it now.

Comment 7 Sandro Bonazzola 2019-01-28 09:36:33 UTC

This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 8 Michal Skrivanek 2020-03-19 15:41:14 UTC

We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed so I'm suggesting to close it.
If you feel this needs to be addressed and want to work on it please remove cond nack and target accordingly.

Comment 9 Nir Soffer 2020-03-27 21:04:55 UTC

This is important technical debt that we need to address.

Current code is not correct, removing a ticket and assuming that an image
is not used after that.

Because the image is still open, deactivating a volume on block storage
will fail, and removing the image will also fail, leaving stale volume that
engine does not know about on storage.

We need at least to fix engine to wait until an image is actually unused
after removing a ticket, but we cannot do this because we remove the ticket
in the daemon.

The minimal fix for this is:

- Fail DELETE ticket if the ticket is active (has active connections)
- Fix engine to retry ticket removal if the removing a ticket fail because
  the ticket was active.

Supporting abort needs more work, not sure that we need it. Most storage operations
do not support abort.

Because imageio is used for backup, this issue is more important in 4.4.

Re-targeting to 4.4.z.

Comment 10 Nir Soffer 2020-06-17 12:19:41 UTC

Increasing priority, since this is the worst issue in imageio right now.

Note this this is a big task requiring non-trivial changes in imageio
and in engine.

We need a new flow for canceling a ticket:

1. engine sends Host.patch_image_ticket with payload of {"canceled": true}
2. vdsm sends PATCH request to imageio with the payload above
3. imageio marks the ticket as canceled
4. imageio closes inactive backends (e.g. open file descriptor or nbd connection)
5. engine polls Host.get_image_ticket until ticket is inactive
6. engine send Host.delete_image_ticket when ticket is inactive


Changes needed:

- Changing the way backends are managed, currently kept in a connection,
  but should be kept in a ticket.

We should be able to close all backends associated with a ticket when
a ticket is canceled.

- Adding a pause or cancel API such as:

    PATCH /tickets/{ticket-id}

    {
        "canceled": true
    }

This should mark a ticket as canceled.

Active requests will be canceled as soon as they finish the current I/O.
This can take minutes if a request is blocked on inaccessible storage.

Requests blocked reading from client socket will be canceled when the 
read returns, either because of a timeout (60 seconds), or because the 
client sent a new request.

Inactive backends associated with the ticket should be closed at this point
to minimize the cancellation time. Otherwise we have to wait until
a connection time out when reading from client socket.

Buffers associated with a connection can be closed only by the connection
thread since the buffer is used for reading from the socket.

- Change imageio DELETE /ticket to fail if ticket is active

Current code will delete the ticket, leaving open backends that will fail
deactivation of the volume.

There is no way to delete a ticket safely now, if the client did not close
all connections using the ticket.

- Adding vdsm API for patching a ticket

    Host.patch_image_ticket(ticket_id, patch)

This api can be used to extend a ticket, deprecating Host.extend_image_ticket.

- Changing engine flow to send a pause/cancel request when cancelling
  a transfer and wait until ticket is inactive.

- Change engine flow to send DELETE request only after ticket was
  paused/canceled and become inactive.

Comment 11 Nir Soffer 2020-07-05 16:41:30 UTC

I worked on a POC, and I think we can simplify the solution:

## Normal flow

Reporting ticket status:
- report canceled status
- report number of connections using the ticket 

Removing a ticket:
- mark the ticket as canceled
- if ticket is used, wait until ticket unused
- remove the ticket

Completing operations:
- close the backend if ticket was canceled

Authorizing operations:
- close the backend if ticket was canceled

All backends are closed in less than 0.3-2.5 seconds, testing uploads
using 4 connections.

## Negative flows

Removing a ticket:
- Time out after 60 seconds or user selected timeout
- Fail with "409 Conflict"

Dead client keeping connection open without sending anything:
- reading from connection will time out after 60 seconds, closing
  backend

Operation blocked on inaccessible storage
- backend will be closed when operation is unblocked. Can take several
  minutes on NFS.

Recovery from timeouts:

We have 2 options:
- retry the remove ticket several times, each retry may time out
- poll ticket status until number of connections is 0, then remove
  the ticket (cannot time out).

The first option should be easier, I think we don't need to add new
phase, just a retry mechanism like we have in other case. But maybe
adding new phase is needed anyway for the retry.

The important thing is that the normal case will just work, removing
a ticket will cancel it without introducing new API in imageio or vdsm.

## Changes needed

vdsm:
- add specific error for removing ticket timeout
- handle 409 Conflict, raise new error for engine

engine:
- handle remove ticket timeout by retrying or polling

Daniel, what do you think?

Comment 12 Daniel Erez 2020-07-05 18:07:15 UTC

(In reply to Nir Soffer from comment #11)
> I worked on a POC, and I think we can simplify the solution:
> 
> ## Normal flow
> 
> Reporting ticket status:
> - report canceled status
> - report number of connections using the ticket 
> 
> Removing a ticket:
> - mark the ticket as canceled
> - if ticket is used, wait until ticket unused
> - remove the ticket
> 
> Completing operations:
> - close the backend if ticket was canceled
> 
> Authorizing operations:
> - close the backend if ticket was canceled
> 
> All backends are closed in less than 0.3-2.5 seconds, testing uploads
> using 4 connections.
> 
> ## Negative flows
> 
> Removing a ticket:
> - Time out after 60 seconds or user selected timeout
> - Fail with "409 Conflict"
> 
> Dead client keeping connection open without sending anything:
> - reading from connection will time out after 60 seconds, closing
>   backend
> 
> Operation blocked on inaccessible storage
> - backend will be closed when operation is unblocked. Can take several
>   minutes on NFS.
> 
> Recovery from timeouts:
> 
> We have 2 options:
> - retry the remove ticket several times, each retry may time out
> - poll ticket status until number of connections is 0, then remove
>   the ticket (cannot time out).
> 
> The first option should be easier, I think we don't need to add new
> phase, just a retry mechanism like we have in other case. But maybe
> adding new phase is needed anyway for the retry.
> 
> The important thing is that the normal case will just work, removing
> a ticket will cancel it without introducing new API in imageio or vdsm.
> 
> ## Changes needed
> 
> vdsm:
> - add specific error for removing ticket timeout
> - handle 409 Conflict, raise new error for engine
> 
> engine:
> - handle remove ticket timeout by retrying or polling
> 
> Daniel, what do you think?

RemoveImageTicket is invoked on FINALIZING_SUCCESS phase.
So upon failure, we just keep the transfer in the same phase,
and retry every command poll interval (iirc, 10 sec by default).
So we can either let it retry until we successfully remove the ticket;
which means that we need to allow cancel in this phase. 
Or, just move the transfer to paused, so it will be retried next
time cancel is invoked.

Comment 13 Nir Soffer 2020-07-22 12:41:23 UTC

The imageio part is done.

- We need to fix vdsm error handling - bug 1858956
- We need to fix engine error handling - Daniel posted:
  https://gerrit.ovirt.org/c/110401/

Comment 14 Nir Soffer 2020-07-22 15:48:47 UTC

How ticket cancellation works?

When removing a ticket:
- Ticket is marked as canceled
- All active operations are marked as canceled
- When active operations complete, they will fail with "403 Forbidden"
  error with the error message: "Ticket {id} was canceled"
- New requests for this ticket will fail with "403 Forbidden"
  error with the error message: "Ticket {id} was canceled"
- Wait until no connections are using the ticket, the ticket is removed
  and the request will succeed with "200 OK"
- If there is an idle connection or operation blocked on storage, the
  REMOVE request will fail with "409 Conflict" with error message 
  "Timeout canceling ticket {id}". The caller will need to retry the
  operation until the operation succeeds.

How to test:

## Normal flow

Cancel from UI:
1. Upload big enough disk from API/SDK
2. Cancel the upload from the UI

Cancel from SDK:
1. Start a download or upload from UI or from SDK with big enough disk.
2. Cancel the transfer using the transfer ID

Expected behavior:
1. Engine will send a REMOVE ticket request
2. The ticket will be canceled quickly (30-600 milliseconds in my tests)
3. Engine will clean up the transfer normally without errors.

## Negative flows

Idle connection:
- Change imageio remove_timeout to 10 seconds:

$ cat /etc/ovirt-imageio/conf.d/99-test.conf
[control]
remove_timeout = 10

$ systemctl restart ovirt-imageio

- Create new disk for the upload in engine UI

- Start upload transfer using image_transfer.py example from sdk:
  https://gerrit.ovirt.org/c/110392/

$ ./image_transfer.py --engine-url https://engine3 \
    --username admin@internal \
    --password-file engine3-password \
    --cafile engine3.pem \
    upload {disk-uuid}
Connecting...
Looking up disk 58daea80-1229-4c6b-b33c-1a4e568c8ad7
Creating image transfer for download
Transfer ID: f0978307-7aea-4694-bdbe-673beffe05eb
Transfer host name: host4
Transfer URL: https://host4:54322/images/817d6390-1e33-4459-b202-984d31f83441
Proxy URL: https://engine3:54323/images/817d6390-1e33-4459-b202-984d31f83441
Conneted to imageio server

Wait until the script say:

   Transfer is alive

And cancel the operation from the UI or the SDK.

You can also amend the script to cancel the transfer - but continue to keep
the connection open.


Blocked storage:
1. Start a regular upload or download
2. Block access to storage, so upload or download operation get stuck on storage
   (this may be tricky to do)
3. Cancel the transfer using UI (upload only) or SDK.


Expected behavior:
1. Engine will send REMOVE ticket request
2. The request will fail after the remove_timeout with "409 Conflict"
   and error "Timeout canceling ticket {id}"
3. Engine will send one or more REMOVE ticket request
4. Once the image_transfer send the next request, it will fail with
   "409 Conflict" with error "Ticket {id} was canceled"
5. The next REMOVE request after the client fail will succeed and engine
   will complete the normal cleanup flow.

Comment 16 Ilan Zuckerman 2020-10-12 08:13:21 UTC

(In reply to Nir Soffer from comment #14)

> How to test:
> 
> ## Normal flow
> 
> Cancel from UI:
> 1. Upload big enough disk from API/SDK
> 2. Cancel the upload from the UI

Actual result:
I can see that the engine sends REMOVE ticket:
2020-10-12 11:03:34,002 INFO    (Thread-4) [tickets] [127.0.0.1] REMOVE ticket=e5bf0314-bb66-40bb-882f-dee968fd5c4f

The engine cleans up the transfer, but not without errors:

2020-10-12 11:03:45,829+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-35) [178660b4] Failed to transfer disk '00000000-0000-0000-0000-000000000000' (command id '20196067-16cb-4d02-bb30-bbbf012e6f56')

2020-10-12 11:03:45,910+03 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-35) [d53c1ca9-1f9f-4c35-8e40-cca0884b519a] Error during log command: org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand. Exception null


Can you please confirm that those ERRORs are expected to appear?
Tested on ovirt-imageio-daemon-2.0.10-1.el8ev.x86_64
rhv-release-4.4.3-8-001.noarch


> 
> Expected behavior:
> 1. Engine will send a REMOVE ticket request
> 2. The ticket will be canceled quickly (30-600 milliseconds in my tests)
> 3. Engine will clean up the transfer normally without errors.


Also, how to 2. Cancel the transfer using the transfer ID?
I cant find   image_transfer.py   script anywhere on the host.

Comment 17 Nir Soffer 2020-10-13 08:59:40 UTC

(In reply to Ilan Zuckerman from comment #16)
> (In reply to Nir Soffer from comment #14)
> 
> > How to test:
> > 
> > ## Normal flow
> > 
> > Cancel from UI:
> > 1. Upload big enough disk from API/SDK
> > 2. Cancel the upload from the UI
> 
> Actual result:
> I can see that the engine sends REMOVE ticket:
> 2020-10-12 11:03:34,002 INFO    (Thread-4) [tickets] [127.0.0.1] REMOVE
> ticket=e5bf0314-bb66-40bb-882f-dee968fd5c4f
> 
> The engine cleans up the transfer, but not without errors:
> 
> 2020-10-12 11:03:45,829+03 ERROR
> [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-35)
> [178660b4] Failed to transfer disk '00000000-0000-0000-0000-000000000000'
> (command id '20196067-16cb-4d02-bb30-bbbf012e6f56')
> 
> 2020-10-12 11:03:45,910+03 ERROR
> [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-35)
> [d53c1ca9-1f9f-4c35-8e40-cca0884b519a] Error during log command:
> org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.
> Exception null
> 
> 
> Can you please confirm that those ERRORs are expected to appear?

I don't know it they are expected. If you think they are not file a new
bug for these errors.

Canceling a transfer should result in:
- engine remove the ticket
- if removing ticket failed, engine retry the operation until
  the ticket is removed
- if a client is connected to imageio daemon or proxy,
  the current request will fail with an error about canceling the ticket
- engine report an event that the transfer was canceled by user

Comment 27 Ilan Zuckerman 2020-10-22 05:21:23 UTC

Verified on rhv-4.4.3-9 according guidelines from comment #14

Comment 28 Sandro Bonazzola 2020-11-11 06:41:49 UTC

This bugzilla is included in oVirt 4.4.3 release, published on November 10th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Note You need to log in before you can comment on or make changes to this bug.