Description of problem:

Customer tried to remove a block disk that had wipe-after-delete set, and the engine reported the removal as completing successfully ("successfully removed") when it actually failed with a timeout.

Version-Release number of selected component (if applicable):
RHV 4.3.7

How reproducible:
Unknown

Steps to Reproduce:
1. Deactivate the disk
2. Issue remove disk
3. Engine reports success

Actual results:
The VDSM task times out and the disk is deleted from the engine DB, but the volume still exists on storage.

Expected results:
The engine should correctly report that the remove disk failed.

Additional info:

ENGINE LOG:

2020-05-12 12:50:43,899+05 INFO [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] Running command: RemoveDiskCommand internal: false. Entities affected : ID: 8d5779f9-a5a6-4d1d-a705-000cceef73d0 Type: DiskAction group DELETE_DISK with role type USER

2020-05-12 12:50:43,933+05 INFO [org.ovirt.engine.core.bll.storage.disk.image.RemoveImageCommand] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] Running command: RemoveImageCommand internal: true. Entities affected : ID: 81dbbab0-46a1-4999-acfb-0f593fa05b58 Type: Storage

2020-05-12 12:50:43,994+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='352b829e-4d72-11ea-aa35-00163e452e33', ignoreFailoverLimit='false', storageDomainId='81dbbab0-46a1-4999-acfb-0f593fa05b58', imageGroupId='8d5779f9-a5a6-4d1d-a705-000cceef73d0', postZeros='true', discard='false', forceDelete='false'}), log id: 11a2d7b3

2020-05-12 12:50:44,257+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] FINISH, DeleteImageGroupVDSCommand, return: , log id: 11a2d7b3

2020-05-12 12:50:44,262+05 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] CommandAsyncTask::Adding CommandMultiAsyncTasks object for command '9000a374-3fae-4110-936a-13bbc6297a74'

2020-05-12 12:50:44,262+05 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] CommandMultiAsyncTasks::attachTask: Attaching task 'c2041ad5-b98e-490c-bf8f-52671536189c' to command '9000a374-3fae-4110-936a-13bbc6297a74'.

2020-05-12 12:50:44,271+05 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] Adding task 'c2041ad5-b98e-490c-bf8f-52671536189c' (Parent Command 'RemoveDisk', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..

2020-05-12 12:50:44,371+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] EVENT_ID: USER_FINISHED_REMOVE_DISK_ATTACHED_TO_VMS(2,042), Disk VM-Storage_Disk2 associated to the VMs Sip-Proxy was successfully removed from domain mvpl_data (User admin@internal-authz).
2020-05-12 12:50:44,371+05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] BaseAsyncTask::startPollingTask: Starting to poll task 'c2041ad5-b98e-490c-bf8f-52671536189c'.

2020-05-12 12:50:44,372+05 INFO [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engine-Thread-88609) [6b6cebe6-6368-4278-aa3f-40163da8c7e4] Lock freed to object 'EngineLock:{exclusiveLocks='[8d5779f9-a5a6-4d1d-a705-000cceef73d0=DISK]', sharedLocks='[e89c01f5-411a-4a73-b1d6-0392bb7ceec8=VM]'}'

2020-05-14 14:50:53,675+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] EVENT_ID: TASK_STOPPING_ASYNC_TASK(9,500), Stopping async task RemoveDisk that started at Tue May 12 12:50:43 IST 2020

2020-05-14 14:50:53,675+05 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] Cleaning zombie tasks: Stopping async task 'RemoveDisk' that started at 'Tue May 12 12:50:43 IST 2020' since it reached a timeout of 3000 minutes

2020-05-14 14:50:53,675+05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] SPMAsyncTask::StopTask: Attempting to stop task 'c2041ad5-b98e-490c-bf8f-52671536189c' (Parent Command 'RemoveDisk', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters').

2020-05-14 14:50:53,682+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMStopTaskVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] START, SPMStopTaskVDSCommand( SPMTaskGuidBaseVDSCommandParameters:{storagePoolId='352b829e-4d72-11ea-aa35-00163e452e33', ignoreFailoverLimit='false', taskId='c2041ad5-b98e-490c-bf8f-52671536189c'}), log id: 98b806b

2020-05-14 14:50:53,684+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMStopTaskVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] START, HSMStopTaskVDSCommand(HostName = mpokket-host1, HSMTaskGuidBaseVDSCommandParameters:{hostId='d43be6b5-9d04-4967-bacd-606419f37198', taskId='c2041ad5-b98e-490c-bf8f-52671536189c'}), log id: 23f3f2c6

2020-05-14 14:50:53,687+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMStopTaskVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] FINISH, HSMStopTaskVDSCommand, return: , log id: 23f3f2c6

2020-05-14 14:50:53,687+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMStopTaskVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] FINISH, SPMStopTaskVDSCommand, return: , log id: 98b806b

2020-05-14 14:51:03,705+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM mpokket-host1 command HSMGetAllTasksStatusesVDS failed: c2041ad5-b98e-490c-bf8f-52671536189c

2020-05-14 14:51:03,712+05 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [] BaseAsyncTask::logEndTaskFailure: Task 'c2041ad5-b98e-490c-bf8f-52671536189c' (Parent Command 'RemoveDisk', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'success'
-- Message: '',
-- Exception: '[null]'

2020-05-14 14:51:03,724+05 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [] CommandAsyncTask::endActionIfNecessary: All tasks of command '9000a374-3fae-4110-936a-13bbc6297a74' has ended -> executing 'endAction'

2020-05-14 14:51:03,724+05 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [] CommandAsyncTask::endAction: Ending action for '1' tasks (command ID: '9000a374-3fae-4110-936a-13bbc6297a74'): calling endAction '.

2020-05-14 14:51:03,725+05 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-136752) [] CommandAsyncTask::endCommandAction [within thread] context: Attempting to endAction 'RemoveDisk',

2020-05-14 14:51:03,727+05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [] BaseAsyncTask::onTaskEndSuccess: Task 'c2041ad5-b98e-490c-bf8f-52671536189c' (Parent Command 'RemoveDisk', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended successfully.

2020-05-14 14:51:03,727+05 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [] Task with DB Task ID '72bcc259-78ee-4b46-96e0-01dd6a0bca4a' and VDSM Task ID 'c2041ad5-b98e-490c-bf8f-52671536189c' is in state AttemptingEndAction. End action for command 9000a374-3fae-4110-936a-13bbc6297a74 will proceed when all the entity's tasks are completed.

>>> LVS:

07df6929-7c6b-427d-90fb-2cb6315b0d62 81dbbab0-46a1-4999-acfb-0f593fa05b58 204 -wi-a----- <5.63t -1 -1 253 28 RDWBzk-HOgi-91Qh-Asdd-qrLN-6bSb-pZf01P IU__remove_me_ZERO_8d5779f9-a5a6-4d1d-a705-000cceef73d0,MD_11,PU_00000000-0000-0000-0000-000000000000

/rhev/data-center/mnt/blockSD/81dbbab0-46a1-4999-acfb-0f593fa05b58/images/8d5779f9-a5a6-4d1d-a705-000cceef73d0:
total 4
lrwxrwxrwx. 1 vdsm kvm 78 Apr 14 14:38 07df6929-7c6b-427d-90fb-2cb6315b0d62 -> /dev/81dbbab0-46a1-4999-acfb-0f593fa05b58/07df6929-7c6b-427d-90fb-2cb6315b0d62

>>> VDSM LOGS

2020-05-12 12:48:04,905+0530 INFO (jsonrpc/2) [api.virt] START hotunplugDisk(params={u'xml': u'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><hotunplug><devices><disk><alias name="ua-8d5779f9-a5a6-4d1d-a705-000cceef73d0"/></disk></devices></hotunplug>', u'vmId': u'e89c01f5-411a-4a73-b1d6-0392bb7ceec8'}) from=::ffff:192.168.0.55,39268, flow_id=049dfdfb-bf07-48d1-8610-286d4dc06253, vmId=e89c01f5-411a-4a73-b1d6-0392bb7ceec8 (api:48)

2020-05-12 12:48:04,907+0530 INFO (jsonrpc/2) [virt.vm] (vmId='e89c01f5-411a-4a73-b1d6-0392bb7ceec8') Hotunplug disk xml: <?xml version='1.0' encoding='utf-8'?>
<disk device="disk" snapshot="no" type="block">
    <address bus="0" controller="0" target="0" type="drive" unit="2" />
    <source dev="/rhev/data-center/mnt/blockSD/81dbbab0-46a1-4999-acfb-0f593fa05b58/images/8d5779f9-a5a6-4d1d-a705-000cceef73d0/07df6929-7c6b-427d-90fb-2cb6315b0d62">
        <seclabel model="dac" relabel="no" type="none" />
    </source>
    <target bus="scsi" dev="sdc" />
    <serial>8d5779f9-a5a6-4d1d-a705-000cceef73d0</serial>
    <driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2" />
    <alias name="ua-8d5779f9-a5a6-4d1d-a705-000cceef73d0" />
</disk> (vm:3862)

2020-05-12 12:48:04,949+0530 INFO (libvirt/events) [virt.vm] (vmId='e89c01f5-411a-4a73-b1d6-0392bb7ceec8') Device removal reported: ua-8d5779f9-a5a6-4d1d-a705-000cceef73d0 (vm:6046)

2020-05-12 12:48:04,969+0530 INFO (jsonrpc/2) [vdsm.api] START teardownImage(sdUUID='81dbbab0-46a1-4999-acfb-0f593fa05b58', spUUID='352b829e-4d72-11ea-aa35-00163e452e33', imgUUID='8d5779f9-a5a6-4d1d-a705-000cceef73d0', volUUID=None) from=::ffff:192.168.0.55,39268, flow_id=049dfdfb-bf07-48d1-8610-286d4dc06253, task_id=bf659180-7bb7-4139-8032-ea2bc4eb7540 (api:48)

2020-05-12 12:48:04,970+0530 INFO (jsonrpc/2) [storage.StorageDomain] Removing image run directory '/var/run/vdsm/storage/81dbbab0-46a1-4999-acfb-0f593fa05b58/8d5779f9-a5a6-4d1d-a705-000cceef73d0' (blockSD:1448)

2020-05-12 12:48:04,970+0530 INFO (jsonrpc/2) [storage.fileUtils] Removing directory: /var/run/vdsm/storage/81dbbab0-46a1-4999-acfb-0f593fa05b58/8d5779f9-a5a6-4d1d-a705-000cceef73d0 (fileUtils:180)

2020-05-12 12:48:30,521+0530 INFO (jsonrpc/5) [vdsm.api] START deleteImage(sdUUID=u'81dbbab0-46a1-4999-acfb-0f593fa05b58', spUUID=u'352b829e-4d72-11ea-aa35-00163e452e33', imgUUID=u'8d5779f9-a5a6-4d1d-a705-000cceef73d0', postZero=u'true', force=u'false', discard=False) from=::ffff:192.168.0.55,39286, flow_id=6b6cebe6-6368-4278-aa3f-40163da8c7e4, task_id=c2041ad5-b98e-490c-bf8f-52671536189c (api:48)

2020-05-12 12:48:30,766+0530 INFO (jsonrpc/5) [vdsm.api] FINISH deleteImage return=None from=::ffff:192.168.0.55,39286, flow_id=6b6cebe6-6368-4278-aa3f-40163da8c7e4, task_id=c2041ad5-b98e-490c-bf8f-52671536189c (api:54)

2020-05-12 12:48:30,780+0530 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Image.delete succeeded in 0.26 seconds (__init__:312)

2020-05-12 12:48:30,780+0530 INFO (tasks/6) [storage.ThreadPool.WorkerThread] START task c2041ad5-b98e-490c-bf8f-52671536189c (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f301818fb48>>, args=None) (threadPool:208)

2020-05-12 12:48:31,316+0530 INFO (jsonrpc/1) [vdsm.api] FINISH getAllTasksStatuses return={'allTasksStatus': {'c2041ad5-b98e-490c-bf8f-52671536189c': {'code': 0, 'message': 'running job 1 of 1', 'taskState': 'running', 'taskResult': '', 'taskID': 'c2041ad5-b98e-490c-bf8f-52671536189c'}}} from=::ffff:192.168.0.55,39268, task_id=62490b0a-fb24-4753-9623-2d6f850e2e40 (api:54)

2020-05-14 14:01:05,164+0530 INFO (jsonrpc/6) [vdsm.api] FINISH getAllTasksStatuses return={'allTasksStatus': {'c2041ad5-b98e-490c-bf8f-52671536189c': {'code': 0, 'message': 'running job 1 of 1', 'taskState': 'running', 'taskResult': '', 'taskID': 'c2041ad5-b98e-490c-bf8f-52671536189c'}}} from=::ffff:192.168.0.55,39268, task_id=33cd3664-e9b8-4fca-8065-b6ad7f22b18d (api:54)

2020-05-14 14:48:37,075+0530 INFO (jsonrpc/2) [vdsm.api] FINISH getAllTasksStatuses return={'allTasksStatus': {'c2041ad5-b98e-490c-bf8f-52671536189c': {'code': 0, 'message': 'running job 1 of 1', 'taskState': 'running', 'taskResult': '', 'taskID': 'c2041ad5-b98e-490c-bf8f-52671536189c'}}} from=::ffff:192.168.0.55,39268, task_id=bca1aed4-4cb5-4fdf-801d-33bb6042c637 (api:54)

2020-05-14 14:48:37,075+0530 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllTasksStatuses succeeded in 0.00 seconds (__init__:312)

2020-05-14 14:48:37,093+0530 INFO (jsonrpc/4) [vdsm.api] START stopTask(taskID=u'c2041ad5-b98e-490c-bf8f-52671536189c', spUUID=None, options=None) from=::ffff:192.168.0.55,39268, task_id=dae564d4-354c-4a22-bb05-bda97f870187 (api:48)

2020-05-14 14:48:37,093+0530 INFO (jsonrpc/4) [storage.operation] Aborting <Command ['/sbin/blkdiscard', '--zeroout', '--step', '2097152', '--length', '6190121615360', u'/dev/81dbbab0-46a1-4999-acfb-0f593fa05b58/07df6929-7c6b-427d-90fb-2cb6315b0d62'] aborting, cwd=None at 0x7f2ff8081d90> (operation:127)

2020-05-14 14:48:37,093+0530 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Task.stop succeeded in 0.00 seconds (__init__:312)

2020-05-14 14:48:37,190+0530 ERROR (tasks/6) [storage.TaskManager.Task] (Task='c2041ad5-b98e-490c-bf8f-52671536189c') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1375, in zeroImage
    zeroImgVolumes(sdUUID, imgUUID, toZero, discard)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 302, in zeroImgVolumes
    raise se.VolumesZeroingError(errors)
VolumesZeroingError: Cannot zero out volume: (['Action was stopped: ()'],)

2020-05-14 14:48:37,228+0530 INFO (tasks/6) [storage.ThreadPool.WorkerThread] FINISH task c2041ad5-b98e-490c-bf8f-52671536189c (threadPool:210)

2020-05-14 14:48:47,105+0530 INFO (jsonrpc/7) [vdsm.api] FINISH getAllTasksStatuses return={'allTasksStatus': {'c2041ad5-b98e-490c-bf8f-52671536189c': {'code': 411, 'message': u'c2041ad5-b98e-490c-bf8f-52671536189c', 'taskState': 'finished', 'taskResult': 'cleanSuccess', 'taskID': 'c2041ad5-b98e-490c-bf8f-52671536189c'}}} from=::ffff:192.168.0.55,39268, task_id=bd026500-baac-4df9-8956-66d544669a22 (api:54)

2020-05-14 14:48:47,105+0530 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllTasksStatuses succeeded in 0.00 seconds (__init__:312)

2020-05-14 14:48:47,139+0530 INFO (jsonrpc/6) [vdsm.api] START clearTask(taskID=u'c2041ad5-b98e-490c-bf8f-52671536189c', spUUID=None, options=None) from=::ffff:192.168.0.55,39268, task_id=36dc0bac-9f08-4a26-bc67-507bfd7d82de (api:48)

2020-05-14 14:48:47,141+0530 INFO (jsonrpc/6) [vdsm.api] FINISH clearTask return=None from=::ffff:192.168.0.55,39268, task_id=36dc0bac-9f08-4a26-bc67-507bfd7d82de (api:54)
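For reference, the leftover volume can be spotted on the storage side by the "remove_me" tag visible in the LVS output above. A minimal sketch of such a check (assumptions: it runs on a host that sees the domain's VG; the VG name below is just the storage domain UUID from this report):

# Sketch: list LVs on a block storage domain VG that still carry a "remove_me"
# tag, i.e. volumes whose removal/wipe-after-delete never completed.
import subprocess

VG = "81dbbab0-46a1-4999-acfb-0f593fa05b58"  # storage domain UUID == VG name (example from this report)

out = subprocess.check_output(
    ["lvs", "--noheadings", "-o", "lv_name,lv_tags", VG])
for line in out.decode().splitlines():
    fields = line.split()
    if not fields:
        continue
    name = fields[0]
    tags = fields[1] if len(fields) > 1 else ""
    if "remove_me" in tags:
        print("leftover volume: %s (tags: %s)" % (name, tags))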
Created attachment 1689002 [details] Engine log from timeframe
Created attachment 1689004 [details] VDSM logs from start of remove disk
Created attachment 1689005 [details] VDSM logs from timeout failure
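For completeness, the reported flow can be driven through the Python SDK roughly as follows (a sketch only, assuming ovirtsdk4; the connection setup is omitted, the IDs are the examples from this report, and the disk is assumed to already have wipe_after_delete enabled):

# Sketch of the reproduction steps using ovirtsdk4 (oVirt/RHV Python SDK).
# "connection" is an already-open ovirtsdk4.Connection.
import ovirtsdk4.types as types

VM_ID = "e89c01f5-411a-4a73-b1d6-0392bb7ceec8"    # example from this report
DISK_ID = "8d5779f9-a5a6-4d1d-a705-000cceef73d0"  # example from this report

system = connection.system_service()

# 1. Deactivate the disk on the VM (hot-unplug).
attachment_service = (system.vms_service()
                      .vm_service(VM_ID)
                      .disk_attachments_service()
                      .attachment_service(DISK_ID))
attachment_service.update(types.DiskAttachment(active=False))

# 2. Remove the disk; with wipe-after-delete set, VDSM zeroes the volume first.
system.disks_service().disk_service(DISK_ID).remove()

# 3. The engine reports success immediately, while the zeroing task may still be
#    running (or later time out) on the SPM host - which is the bug described here.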
This is a known issue. The engine must track deleted disks until the disk is actually deleted from storage. We have other bugs about the same issue. Deferring to 4.5, since this requires a redesign of the way disks are tracked in the engine and the 4.4 development phase has ended.
This bug/RFE didn't get enough attention so far and is now flagged as pending close. Please review if it is still relevant and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.
This should not be closed, as it can still happen in RHV 4.4. Nir stated that this bug is a known issue and that there may be other bugs tracking it, so if this one is to be closed, the other bugs should be noted so we can still track the fix for this problem. Frank
This didn't get into 4.5 either; it's time to close this unless there is new information or a reproduction scenario.
no new information, didn't make it into 4.5, closing
The implication of ignoring this one is that we remain with junk on the storage side, which is maybe not that bad. But we are able to identify the failure on the VDSM side (according to comment 1), so let's try to monitor that on the engine side instead of dropping the disk from the database.
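In other words, something along these lines (a rough sketch of the intended behaviour only, not actual engine code; all names and the dict standing in for the engine DB are illustrative):

# Rough sketch: keep the disk in the DB until VDSM confirms the deletion succeeded.
def end_remove_disk(task_state, task_code, disk_id, db):
    """db is a plain dict of disk_id -> status, standing in for the engine DB."""
    if task_state == "finished" and task_code == 0:
        # Deletion (including zeroing) really succeeded on storage:
        # only now is it safe to drop the disk from the database.
        db.pop(disk_id, None)
        return "removed"
    # Task was stopped / timed out / failed (code 411 in the logs above):
    # keep the disk and mark it instead of silently dropping it.
    db[disk_id] = "ILLEGAL"
    return "failed"

# Example with the task result seen in this report (code 411, cleanSuccess):
db = {"8d5779f9-a5a6-4d1d-a705-000cceef73d0": "OK"}
print(end_remove_disk("finished", 411, "8d5779f9-a5a6-4d1d-a705-000cceef73d0", db))
print(db)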
We noticed in engine.log that RemoveDisk failed, but probably only after the relevant parts were already removed from the database.
Verified. The disk was successfully removed and doesn't exist.

Versions:
engine-4.5.3.1-2.el8ev
vdsm-4.50.3.2-1.el8ev.x86_64
RemoveDisk finishes before the disk is removed from the DB

Description:
The job for RemoveDisk finishes before the disk is actually removed from the DB. This causes failures in our automation tests in all tiers during teardown, when removing a disk and afterward removing a VM - the VM removal fails because the disk is still being removed, and a VM can't be removed while its disks are locked.

How reproducible: 100%

Versions:
ovirt-engine-4.5.3.1-2.el8ev
vdsm-4.50.3.4-1.el8ev
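A possible client-side workaround for such teardown flows (a sketch only, assuming the ovirtsdk4 Python SDK; the helper name and timeout values are arbitrary) is to wait until the disk is really gone before removing the VM:

# Sketch of a teardown workaround using ovirtsdk4: after issuing the disk
# removal, poll until the disk no longer exists before removing the VM.
# "connection" is an already-open ovirtsdk4.Connection.
import time

def wait_until_disk_gone(connection, disk_id, timeout=300, interval=5):
    disks_service = connection.system_service().disks_service()
    deadline = time.time() + timeout
    while time.time() < deadline:
        if disk_id not in (d.id for d in disks_service.list()):
            return True
        time.sleep(interval)
    return False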
The priority of *this bug* should not be urgent, and the issue described in comment 19 is unrelated to the reported scenario - it's a consequence of the fix that affects our automation tests and deserves its own bug (where we can discuss its severity).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.3] bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:8502