Created attachment 1388905 [details] engine , image_proxy ,vdsm,deamon logs Description of problem: After Canceling download VDSM tried to deactivate the LV that the download disk image is on (why?) & exception is raised the VDSM , see below. VDSM log: 2018-01-31 12:33:35,246+0200 INFO (jsonrpc/1) [vdsm.api] FINISH teardownImage error=Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\' Logical volume 5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb /42607b3f-e253-41a7-9143-7e7b7324bc59 in use.\']\\n5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/[\'42607b3f-e253-41a7-9143-7e7b7324bc59\']",)',) from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54cedd d, task_id=62c47765-b392-411e-aebb-63721c2f0299 (api:50) 2018-01-31 12:33:35,248+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='62c47765-b392-411e-aebb-63721c2f0299') Unexpected error (task:875) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run return fn(*args, **kargs) File "<string>", line 2, in teardownImage File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3225, in teardownImage dom.deactivateImage(imgUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1296, in deactivateImage lvm.deactivateLVs(self.sdUUID, volUUIDs) File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1303, in deactivateLVs _setLVAvailability(vgName, toDeactivate, "n") File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 834, in _setLVAvailability raise error(str(e)) CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\' Logical volume 5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/42607b3f-e253-41a7-9143-7e7b7324bc59 in use.\']\\n5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/[\'42607b3f-e253-41a7-9143-7e7b7324bc59\']",)',) Version-Release number of selected component (if applicable): Engine: 4.2.1.3-0.1.el7 VDSM: 4.20.17-1 How reproducible: 100% Steps to Reproduce: 1.Via webadmin start disk download & wait until it starts transferring data. 2.Cancel disk download & watch VDSM log of the host that handles the download Actual results: VDSM tried to deactivate the LV of the downloaded disk & exception is raised as the disk is in use. Expected results: VDSM should not try to deactivate the LV of the downloaded disk & exception should not be raised. Additional info:
Nir/Tal, I believe that bug 1524184 that Nir asked to defer to 4.3 a few days ago blocks this one, so either we defer this one as well or retarget bug 1524184 to 4.2.2.
(In reply to Idan Shaby from comment #1) > I believe that bug 1524184 that Nir asked to defer to 4.3 a few days ago > blocks this one. Please show logs proving that the image tear down failed because an operation was active after the ticket was removed. If this is the case we may want to handle this now, leaving active lvs is bad.
There you go: 2018-01-31 12:33:24,207+0200 INFO (jsonrpc/6) [vdsm.api] START remove_image_ticket(uuid='ffd42646-df8c-4537-98e7-5203631e32a7') from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54ceddd, task_id=e2ba7f83-9175-49fa-8f7a-77338a6c6e32 (api:46) 2018-01-31 12:33:24,208+0200 DEBUG (jsonrpc/6) [storage.imagetickets] Sending request method='DELETE', ticket='ffd42646-df8c-4537-98e7-5203631e32a7', body=None (imagetickets:78) 2018-01-31 12:33:24,219+0200 INFO (jsonrpc/6) [vdsm.api] FINISH remove_image_ticket return=None from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54ceddd, task_id=e2ba7f83-9175-49fa-8f7a-77338a6c6e32 (api:52) 2018-01-31 12:33:24,221+0200 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='e2ba7f83-9175-49fa-8f7a-77338a6c6e32') finished: None (task:1201) 2018-01-31 12:33:24,221+0200 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='e2ba7f83-9175-49fa-8f7a-77338a6c6e32') moving from state preparing -> state finished (task:602) 2018-01-31 12:33:24,222+0200 DEBUG (jsonrpc/6) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:910) 2018-01-31 12:33:24,223+0200 DEBUG (jsonrpc/6) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:947) 2018-01-31 12:33:24,224+0200 DEBUG (jsonrpc/6) [storage.TaskManager.Task] (Task='e2ba7f83-9175-49fa-8f7a-77338a6c6e32') ref 0 aborting False (task:1002) 2018-01-31 12:33:24,224+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.remove_image_ticket succeeded in 0.02 seconds (__init__:573) 2018-01-31 12:33:24,352+0200 DEBUG (jsonrpc/1) [storage.TaskManager.Task] (Task='62c47765-b392-411e-aebb-63721c2f0299') moving from state init -> state preparing (task:602) 2018-01-31 12:33:24,353+0200 INFO (jsonrpc/1) [vdsm.api] START teardownImage(sdUUID='5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb', spUUID='5c31ee48-6da5-4dea-9f0d-2c091081445d', imgUUID='b63a51ea-9142-44dd-8821-11bc508d0285', volUUID=None) from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54ceddd, task_id=62c47765-b392-411e-aebb-63721c2f0299 (api:46) ... 2018-01-31 12:33:35,246+0200 INFO (jsonrpc/1) [vdsm.api] FINISH teardownImage error=Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\' Logical volume 5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/42607b3f-e253-41a7-9143-7e7b7324bc59 in use.\']\\n5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/[\'42607b3f-e253-41a7-9143-7e7b7324bc59\']",)',) from=::ffff:10.35.162.7,56324, flow_id=87a9ac0f-de7f-4382-81c1-694bd54ceddd, task_id=62c47765-b392-411e-aebb-63721c2f0299 (api:50) 2018-01-31 12:33:35,248+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='62c47765-b392-411e-aebb-63721c2f0299') Unexpected error (task:875) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run return fn(*args, **kargs) File "<string>", line 2, in teardownImage File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method ret = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3225, in teardownImage dom.deactivateImage(imgUUID) File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1296, in deactivateImage lvm.deactivateLVs(self.sdUUID, volUUIDs) File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1303, in deactivateLVs _setLVAvailability(vgName, toDeactivate, "n") File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 834, in _setLVAvailability raise error(str(e)) CannotDeactivateLogicalVolume: Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\' Logical volume 5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/42607b3f-e253-41a7-9143-7e7b7324bc59 in use.\']\\n5c32f78a-0fd3-46cf-81a7-58fa58c4fbfb/[\'42607b3f-e253-41a7-9143-7e7b7324bc59\']",)',) Tal, please retarget Bug 1524184.
Nir was strongly objecting having fixing bug 1524184 for 4.2.2, Nir does this change given this bug?
(In reply to Idan Shaby from comment #3) > There you go: Where are daemon logs? We need to understand why the lv is used. Do we have an active operation when we cancel the download? According to comment 0 - we cancel the download in the ui, while the browser is downloading the data. I assume that there is an active operation sending data to the browser in the daemon when we remove the ticket and deactivate the lv, but I did not check the logs. Please confirm that this is the case. If this is the case, we need to: - invalidate the ticket, so new requests are rejected - cancel the active operations - wait until the operations is inactive this should be very quick normally, but if storage is non-responsive, can take minutes - delete the ticket - deactivate the lv bug 1524184 suggest to abort active download when deleting a ticket. This will not solve this issue, since aborting operations is async. You flag the operation as canceled, and the operation thread will abort the read-write loop on the next iteration. Since we must wait for active operations, we cannot delete a ticket. I think we should do this: 1. Cancel the ticket by patching it PATCH /tickets/ticket-id { "timeout": -1 } This request should also mark all active operations as canceled. 2. Implement operation canceling - When canceling an operation, set a canceled=True flag - The operation should check this flag on every cycle, after returning from blocking operations (read/write socket/file). - If the operation was canceled, abort the operation copy loop We can use bug 1524184 for this. 3. Engine should continue to watch the ticket until all operations are inactive This will normally take one iteration. 4. delete the ticket DELETE must fail with 403 (or similar error) if a ticket has active operations. 5. deactivate the lv Daniel, what do you think?
Cancel is now invoked automatically when download is finished or after a timeout (when inactivity is detected) - will be covered by bug 1539363. So, it seems that we don't really need the cancel button anymore. Or, maybe we can just keep it for failure scenarios (to release stuck transferred), but simply block the cancel operation while download is in progress (we can detect it easily as we monitor the download by using information from 'get_image_ticket'). As download is entirely handled by the client (browser/sdk/wget/etc), I guess we can just remove the button for now, what do you think?
(In reply to Daniel Erez from comment #6) Disabling the cancel button is an easy fix for now. I think the button can be useful if you want to cancel another download started from another machine, so for the long term, we should have a way to cancel an active download. Without this you cannot move to host to maintenance. I would disable the button for now, and file RFE to allow canceling active operations.
(In reply to Nir Soffer from comment #7) > (In reply to Daniel Erez from comment #6) > Disabling the cancel button is an easy fix for now. OK. @Idan - can you just hide the button for now? (after the fix for bug 1539363 is merged) > > I think the button can be useful if you want to cancel another download > started > from another machine, so for the long term, we should have a way to cancel an > active download. Without this you cannot move to host to maintenance. > > I would disable the button for now, and file RFE to allow canceling active > operations. Sounds good. Though you can always just restart the daemon to remove the ticket, which will invoke cancel by timeout eventually ;)
I can, but it won't solve this issue.. As you said, inactivity because of network issues, daemon or proxy restart will cause the exact same errors. Oh and you can also cancel a download from the sdk, IINM. Anyway, this will not solve anything. Can you guys remind me why we don't want to solve this bug and bug 1524184 appropriately?
(In reply to Idan Shaby from comment #9) > I can, but it won't solve this issue.. > As you said, inactivity because of network issues, daemon or proxy restart > will cause the exact same errors. Oh and you can also cancel a download from > the sdk, IINM. Any scenario of inactivity should cancel the download after a timeout (when bug 1539363 is solved). So we don't really need the cancel button any more. > Anyway, this will not solve anything. > Can you guys remind me why we don't want to solve this bug and bug 1524184 > appropriately?
(In reply to Daniel Erez from comment #10) > (In reply to Idan Shaby from comment #9) > > I can, but it won't solve this issue.. > > As you said, inactivity because of network issues, daemon or proxy restart > > will cause the exact same errors. Oh and you can also cancel a download from > > the sdk, IINM. > > Any scenario of inactivity should cancel the download after a timeout (when > bug 1539363 is solved). So we don't really need the cancel button any more. > > > Anyway, this will not solve anything. > > Can you guys remind me why we don't want to solve this bug and bug 1524184 > > appropriately? Isn't it actually the "RFE to allow canceling active operations" Nir suggested in comment #7?
Yes, it is. And as we discussed before, and Nir wrote down in comment 5, it's a part of a larger solution that will solve these kinds of bugs. Nir asked to defer bug 1524184, so I am asking why we don't want to solve it now.
(In reply to Idan Shaby from comment #12) > Yes, it is. > And as we discussed before, and Nir wrote down in comment 5, it's a part of > a larger solution that will solve these kinds of bugs. > Nir asked to defer bug 1524184, so I am asking why we don't want to solve it > now. To me it sounds like a material for 4.3. Not sure its priority or amount of work. As manual cancel shouldn't be a common use-case, if at all. Do you think it's easy enough and safe to make it for 4.2?
I am not sure, you may be right. I'll make myself more clear - it's not a problem to disable the button, but we should understand that it won't solve this bug. If we decide to defer this bug, let's retarget it (4.3?).
(In reply to Idan Shaby from comment #14) > I am not sure, you may be right. > I'll make myself more clear - it's not a problem to disable the button, but > we should understand that it won't solve this bug. If we decide to defer > this bug, let's retarget it (4.3?). If it's too much work to fix, let's just keep this bug for hiding the button (for 4.2), and leave the complete solution for bug 1524184
Created attachment 1399362 [details] relevant engine , image_proxy ,vdsm,deamon logs
Talked to Tal and we saw that the bug was accidentally moved to ON_QA, as the patch was not included in the build. Thanks!
Verified on 4.2.2.4-0.1 due to the fact that the scenario cannot be executed as cancel download operation is now blocked.
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.