Bug 1540539 - ImageIO - Cancel download disk cause VDSM exception Cannot deactivate Logical Volume as lv is in use
Summary: ImageIO - Cancel download disk cause VDSM exception Cannot deactivate Logical...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.1.3
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: ovirt-4.2.2
: ---
Assignee: Idan Shaby
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-01-31 10:55 UTC by Avihai
Modified: 2018-03-29 11:10 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-29 11:10:21 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+
ylavi: exception+


Attachments (Terms of Use)
engine , image_proxy ,vdsm,deamon logs (1.36 MB, application/x-gzip)
2018-01-31 10:55 UTC, Avihai
no flags Details
relevant engine , image_proxy ,vdsm,deamon logs (750.23 KB, application/x-gzip)
2018-02-22 12:43 UTC, Avihai
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 87813 0 master MERGED frontend: block download cancellation when in progress 2018-02-19 11:48:27 UTC
oVirt gerrit 87841 0 ovirt-engine-4.2 MERGED frontend: block download cancellation when in progress 2018-02-21 12:15:47 UTC

Description Avihai 2018-01-31 10:55:48 UTC
Created attachment 1388905 [details]
engine , image_proxy ,vdsm,deamon logs

Description of problem:
After Canceling download VDSM tried to deactivate the LV that the download disk image is on (why?) & exception is raised the VDSM , see below. 

VDSM log:
2018-01-31 12:33:35,246+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH teardownImage error=Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\'  Logical volume 5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb
/42607b3f-e253-41a7-9143-7e7b7324bc59 in use.\']\\n5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/[\'42607b3f-e253-41a7-9143-7e7b7324bc59\']",)',) from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54cedd
d, task_id=62c47765-b392-411e-aebb-63721c2f0299 (api:50)
2018-01-31 12:33:35,248+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='62c47765-b392-411e-aebb-63721c2f0299') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in teardownImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3225, in teardownImage
    dom.deactivateImage(imgUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1296, in deactivateImage
    lvm.deactivateLVs(self.sdUUID, volUUIDs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1303, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 834, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\'  Logical volume 5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/42607b3f-e253-41a7-9143-7e7b7324bc59 in use.\']\\n5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/[\'42607b3f-e253-41a7-9143-7e7b7324bc59\']",)',)


Version-Release number of selected component (if applicable):
Engine: 4.2.1.3-0.1.el7
VDSM: 4.20.17-1

How reproducible:
100%

Steps to Reproduce:
1.Via webadmin start disk download & wait until it starts transferring data.
2.Cancel disk download & watch VDSM log of the host that handles the download 

Actual results:
VDSM tried to deactivate the LV of the downloaded disk & exception is raised as the disk is in use.

Expected results:
VDSM should not try to deactivate the LV of the downloaded disk & exception should not be raised.

Additional info:

Comment 1 Idan Shaby 2018-02-01 06:47:00 UTC
Nir/Tal,

I believe that bug 1524184 that Nir asked to defer to 4.3 a few days ago blocks this one, so either we defer this one as well or retarget bug 1524184 to 4.2.2.

Comment 2 Nir Soffer 2018-02-01 10:33:05 UTC
(In reply to Idan Shaby from comment #1)
> I believe that bug 1524184 that Nir asked to defer to 4.3 a few days ago
> blocks this one.

Please show logs proving that the image tear down failed because an operation was
active after the ticket was removed.

If this is the case we may want to handle this now, leaving active lvs is bad.

Comment 3 Idan Shaby 2018-02-01 13:55:34 UTC
There you go:

2018-01-31 12:33:24,207+0200 INFO  (jsonrpc/6) [vdsm.api] START remove_image_ticket(uuid='ffd42646-df8c-4537-98e7-5203631e32a7') from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54ceddd, task_id=e2ba7f83-9175-49fa-8f7a-77338a6c6e32 (api:46)
2018-01-31 12:33:24,208+0200 DEBUG (jsonrpc/6) [storage.imagetickets] Sending request method='DELETE', ticket='ffd42646-df8c-4537-98e7-5203631e32a7', body=None (imagetickets:78)
2018-01-31 12:33:24,219+0200 INFO  (jsonrpc/6) [vdsm.api] FINISH remove_image_ticket return=None from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54ceddd, task_id=e2ba7f83-9175-49fa-8f7a-77338a6c6e32 (api:52)
2018-01-31 12:33:24,221+0200 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='e2ba7f83-9175-49fa-8f7a-77338a6c6e32') finished: None (task:1201)
2018-01-31 12:33:24,221+0200 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='e2ba7f83-9175-49fa-8f7a-77338a6c6e32') moving from state preparing -> state finished (task:602)
2018-01-31 12:33:24,222+0200 DEBUG (jsonrpc/6) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:910)
2018-01-31 12:33:24,223+0200 DEBUG (jsonrpc/6) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:947)
2018-01-31 12:33:24,224+0200 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='e2ba7f83-9175-49fa-8f7a-77338a6c6e32') ref 0 aborting False (task:1002)
2018-01-31 12:33:24,224+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.remove_image_ticket succeeded in 0.02 seconds (__init__:573)
2018-01-31 12:33:24,352+0200 DEBUG (jsonrpc/1) [storage.TaskManager.Task] (Task='62c47765-b392-411e-aebb-63721c2f0299') moving from state init -> state preparing (task:602)
2018-01-31 12:33:24,353+0200 INFO  (jsonrpc/1) [vdsm.api] START teardownImage(sdUUID='5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb', spUUID='5c31ee48-6da5-4dea-9f0d-2c091081445d', imgUUID='b63a51ea-9142-44dd-8821-11bc508d0285', volUUID=None) from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54ceddd, task_id=62c47765-b392-411e-aebb-63721c2f0299 (api:46)

...

2018-01-31 12:33:35,246+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH teardownImage error=Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\'  Logical volume 5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/42607b3f-e253-41a7-9143-7e7b7324bc59 in use.\']\\n5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/[\'42607b3f-e253-41a7-9143-7e7b7324bc59\']",)',) from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54ceddd, task_id=62c47765-b392-411e-aebb-63721c2f0299 (api:50)
2018-01-31 12:33:35,248+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='62c47765-b392-411e-aebb-63721c2f0299') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in teardownImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3225, in teardownImage
    dom.deactivateImage(imgUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1296, in deactivateImage
    lvm.deactivateLVs(self.sdUUID, volUUIDs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1303, in deactivateLVs
    _setLVAvailability(vgName, toDeactivate, "n")
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 834, in _setLVAvailability
    raise error(str(e))
CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\'  Logical volume 5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/42607b3f-e253-41a7-9143-7e7b7324bc59 in use.\']\\n5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/[\'42607b3f-e253-41a7-9143-7e7b7324bc59\']",)',)


Tal, please retarget Bug 1524184.

Comment 4 Tal Nisan 2018-02-02 09:28:38 UTC
Nir was strongly objecting having fixing bug 1524184 for 4.2.2, Nir does this change given this bug?

Comment 5 Nir Soffer 2018-02-07 08:41:44 UTC
(In reply to Idan Shaby from comment #3)
> There you go:
Where are daemon logs?

We need to understand why the lv is used. Do we have an active operation when we 
cancel the download?

According to comment 0 - we cancel the download in the ui, while the browser is 
downloading the data. I assume that there is an active operation sending data to 
the browser in the daemon when we remove the ticket and deactivate the lv, but I
did not check the logs. Please confirm that this is the case.

If this is the case, we need to:
- invalidate the ticket, so new requests are rejected
- cancel the active operations
- wait until the operations is inactive
  this should be very quick normally, but if storage is non-responsive, can take
  minutes
- delete the ticket
- deactivate the lv

bug 1524184 suggest to abort active download when deleting a ticket. This will not
solve this issue, since aborting operations is async. You flag the operation as
canceled, and the operation thread will abort the read-write loop on the next 
iteration.

Since we must wait for active operations, we cannot delete a ticket.

I think we should do this:

1. Cancel the ticket by patching it

PATCH /tickets/ticket-id

{
    "timeout": -1
}

This request should also mark all active operations as canceled.

2. Implement operation canceling

- When canceling an operation, set a canceled=True flag
- The operation should check this flag on every cycle, after returning from
  blocking operations (read/write socket/file).
- If the operation was canceled, abort the operation copy loop

We can use bug 1524184 for this.

3. Engine should continue to watch the ticket until all operations are inactive

This will normally take one iteration.

4. delete the ticket

DELETE must fail with 403 (or similar error) if a ticket has active operations.

5. deactivate the lv

Daniel, what do you think?

Comment 6 Daniel Erez 2018-02-07 09:41:21 UTC
Cancel is now invoked automatically when download is finished or after a timeout (when inactivity is detected) - will be covered by bug 1539363.
So, it seems that we don't really need the cancel button anymore. Or, maybe we can just keep it for failure scenarios (to release stuck transferred), but simply block the cancel operation while download is in progress (we can detect it easily as we monitor the download by using information from 'get_image_ticket').
As download is entirely handled by the client (browser/sdk/wget/etc), I guess we can just remove the button for now, what do you think?

Comment 7 Nir Soffer 2018-02-07 09:46:24 UTC
(In reply to Daniel Erez from comment #6)
Disabling the cancel button is an easy fix for now.

I think the button can be useful if you want to cancel another download started
from another machine, so for the long term, we should have a way to cancel an
active download. Without this you cannot move to host to maintenance.

I would disable the button for now, and file RFE to allow canceling active 
operations.

Comment 8 Daniel Erez 2018-02-07 10:06:14 UTC
(In reply to Nir Soffer from comment #7)
> (In reply to Daniel Erez from comment #6)
> Disabling the cancel button is an easy fix for now.

OK. @Idan - can you just hide the button for now? (after the fix for bug 1539363 is merged)

> 
> I think the button can be useful if you want to cancel another download
> started
> from another machine, so for the long term, we should have a way to cancel an
> active download. Without this you cannot move to host to maintenance.
> 
> I would disable the button for now, and file RFE to allow canceling active 
> operations.

Sounds good. Though you can always just restart the daemon to remove the ticket, which will invoke cancel by timeout eventually ;)

Comment 9 Idan Shaby 2018-02-07 12:12:45 UTC
I can, but it won't solve this issue..
As you said, inactivity because of network issues, daemon or proxy restart will cause the exact same errors. Oh and you can also cancel a download from the sdk, IINM.
Anyway, this will not solve anything.
Can you guys remind me why we don't want to solve this bug and bug 1524184 appropriately?

Comment 10 Daniel Erez 2018-02-07 12:28:27 UTC
(In reply to Idan Shaby from comment #9)
> I can, but it won't solve this issue..
> As you said, inactivity because of network issues, daemon or proxy restart
> will cause the exact same errors. Oh and you can also cancel a download from
> the sdk, IINM.

Any scenario of inactivity should cancel the download after a timeout (when bug 1539363 is solved). So we don't really need the cancel button any more.

> Anyway, this will not solve anything.
> Can you guys remind me why we don't want to solve this bug and bug 1524184
> appropriately?

Comment 11 Daniel Erez 2018-02-07 12:40:06 UTC
(In reply to Daniel Erez from comment #10)
> (In reply to Idan Shaby from comment #9)
> > I can, but it won't solve this issue..
> > As you said, inactivity because of network issues, daemon or proxy restart
> > will cause the exact same errors. Oh and you can also cancel a download from
> > the sdk, IINM.
> 
> Any scenario of inactivity should cancel the download after a timeout (when
> bug 1539363 is solved). So we don't really need the cancel button any more.
> 
> > Anyway, this will not solve anything.
> > Can you guys remind me why we don't want to solve this bug and bug 1524184
> > appropriately?

Isn't it actually the "RFE to allow canceling active operations" Nir suggested in comment #7?

Comment 12 Idan Shaby 2018-02-07 12:45:44 UTC
Yes, it is.
And as we discussed before, and Nir wrote down in comment 5, it's a part of a larger solution that will solve these kinds of bugs.
Nir asked to defer bug 1524184, so I am asking why we don't want to solve it now.

Comment 13 Daniel Erez 2018-02-07 12:59:30 UTC
(In reply to Idan Shaby from comment #12)
> Yes, it is.
> And as we discussed before, and Nir wrote down in comment 5, it's a part of
> a larger solution that will solve these kinds of bugs.
> Nir asked to defer bug 1524184, so I am asking why we don't want to solve it
> now.

To me it sounds like a material for 4.3. Not sure its priority or amount of work. 
As manual cancel shouldn't be a common use-case, if at all. Do you think it's easy enough and safe to make it for 4.2?

Comment 14 Idan Shaby 2018-02-07 13:07:32 UTC
I am not sure, you may be right.
I'll make myself more clear - it's not a problem to disable the button, but we should understand that it won't solve this bug. If we decide to defer this bug,  let's retarget it (4.3?).

Comment 15 Daniel Erez 2018-02-07 13:22:30 UTC
(In reply to Idan Shaby from comment #14)
> I am not sure, you may be right.
> I'll make myself more clear - it's not a problem to disable the button, but
> we should understand that it won't solve this bug. If we decide to defer
> this bug,  let's retarget it (4.3?).

If it's too much work to fix, let's just keep this bug for hiding the button (for 4.2), and leave the complete solution for bug 1524184

Comment 17 Avihai 2018-02-22 12:43:56 UTC
Created attachment 1399362 [details]
relevant engine , image_proxy ,vdsm,deamon logs

Comment 18 Idan Shaby 2018-02-22 14:32:37 UTC
Talked to Tal and we saw that the bug was accidentally moved to ON_QA, as the patch was not included in the build.
Thanks!

Comment 23 Avihai 2018-03-19 09:58:43 UTC
Verified on 4.2.2.4-0.1 due to the fact that the scenario cannot be executed as cancel download operation is now blocked.

Comment 24 Sandro Bonazzola 2018-03-29 11:10:21 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.