Created attachment 1472879 [details]
engine and VDSM logs

Description of problem:
A VM with 4 disks has 4 matching volumes (one per disk) before the snapshot. Create a snapshot and block the connection from the SPM to the disks' storage domain, so the snapshot creation fails. The snapshot itself is indeed deleted, but when I check the number of volumes (LVs) per disk, one disk has 2 LVs instead of one - a leftover from the failed snapshot operation.

VDSM ERRORS appearing in the rollback:

2018-08-02 22:07:55,384+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') running recovery Create volume metadata rollback: BlockVolume->createVolumeMetadataRollback(ca5552b7-305a-4667-9688-a42c180806ff,12) (task:812)
2018-08-02 22:07:55,384+0300 INFO  (tasks/5) [storage.Volume] Metadata rollback for sdUUID=ca5552b7-305a-4667-9688-a42c180806ff offs=12 (blockVolume:489)
2018-08-02 22:07:55,389+0300 ERROR (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 405, in run
    return function(self.task, *argslist)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 491, in createVolumeMetadataRollback
    {"NONE": "#" * (sc.METADATA_SIZE - 10)})
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 873, in _putMetadata
    cls.manifestClass._putMetadata(metaId, meta)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 244, in _putMetadata
    f.write(data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/directio.py", line 161, in write
    raise OSError(err, msg)
OSError: [Errno 5] Input/output error
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') Task._run: f838b186-7ee4-4643-a0b2-3addf43d5482 () {} failed - stopping task (task:894)
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') stopping in state raborting (force False) (task:1256)
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') ref 1 aborting True (task:1002)
2018-08-02 22:07:55,389+0300 WARN  (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') task f838b186-7ee4-4643-a0b2-3addf43d5482: recovery failed: Task is aborted: u'[Errno 5] Input/output error' - code 100 (task:818)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 815, in _recover
    self._run(rec.run)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 896, in _run
    raise se.TaskAborted(message, code)
TaskAborted: Task is aborted: u'[Errno 5] Input/output error' - code 100
2018-08-02 22:07:55,390+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') moving from state raborting -> state raborting (task:602)
2018-08-02 22:08:03,412+0300 ERROR (monitor/ca5552b) [storage.Monitor] Error checking domain ca5552b7-305a-4667-9688-a42c180806ff (monitor:424)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 405, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1171, in selftest
    elif lvm.getVG(self.sdUUID).partial != lvm.VG_OK:
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 81, in __getattr__
    raise AttributeError("Failed reload: %s" % self.name)
AttributeError: Failed reload: ca5552b7-305a-4667-9688-a42c180806ff

ENGINE log:

2018-08-02 22:08:08,672+03 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' with failure.
2018-08-02 22:08:08,692+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='289db27b-7067-4c88-8311-88e0092e692d', ignoreFailoverLimit='false', storageDomainId='ca5552b7-305a-4667-9688-a42c180806ff', imageGroupId='b4f0b2c4-e99f-47b4-a581-2fd19bd19feb', imageId='00000000-0000-0000-0000-000000000000', imageList='[9dfe49e0-ec55-4e7d-b16c-2c45dce3d6e9]', postZero='false', force='true'}), log id: 32b1981b
2018-08-02 22:08:09,812+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] FINISH, DestroyImageVDSCommand, log id: 32b1981b
2018-08-02 22:08:09,858+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] Lock freed to object 'EngineLock:{exclusiveLocks='[42cb2951-a16b-4231-be68-7129c39e8dae=VM]', sharedLocks=''}'
2018-08-02 22:08:09,883+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snap_TestCase6015_0222065805' creation for VM 'vm_TestCase6015_0222022658'.
2018-08-02 22:08:09,884+03 WARN  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] Command 'CreateSnapshotForVm' id: '8d8028d7-58c1-4043-878a-f8a476a701ab' end method execution failed, as the command isn't marked for endAction() retries silently ignoring
2018-08-02 22:08:30,893+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'd2fb82b9-7545-49be-88e0-988f9ce9c46e:iscsi_0' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,893+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'd2fb82b9-7545-49be-88e0-988f9ce9c46e:iscsi_0' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'ca5552b7-305a-4667-9688-a42c180806ff:iscsi_1' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'ca5552b7-305a-4667-9688-a42c180806ff:iscsi_1' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain '3367b82a-21bc-47e0-83d4-741e0745b241:iscsi_2' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain '3367b82a-21bc-47e0-83d4-741e0745b241:iscsi_2' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.

Version-Release number of selected component (if applicable):
4.2.6_SNAPSHOT-84.gad3de30.0.scratch.master.el7ev

How reproducible:
Happened once so far

Steps to Reproduce:
1) VM with 4 disks attached, each disk with 1 volume. The disk image IDs (as checked below) are:
b4f0b2c4-e99f-47b4-a581-2fd19bd19feb
2035083e-d5af-4145-9b2b-71b9b77a2873
7ed04b54-e589-4a1d-bccb-912d53e298e2
df895e8d-5f44-447c-9331-a18516efc2b9
2) Check on the SPM the volumes matching the created disks. As each disk has 1 volume, we get a total of 4 volumes.
3) Fail the snapshot creation by blocking the connection from the SPM to the storage domain the disks reside on (in this case SD 'iscsi_1') -> snapshot creation fails:
Aug 2, 2018, 10:08:09 PM Failed to complete snapshot 'snap_TestCase6015_0222065805' creation for VM 'vm_TestCase6015_0222022658'.
4) Unblock the connection from the SPM to the SD (2018-08-02 22:08:14).
5) Check the matching volumes on the SPM again (2018-08-02 22:08:37).

Actual results:
We expect 4 volumes (LVs) but instead we get 5 - a leftover on 1 disk image, seen by running the following commands on the SPM host:

[root@storage-ge8-vdsm1 ~]# lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
=> gives 2 volumes instead of one, meaning a leftover was found.

[root@storage-ge8-vdsm1 ~]# lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873
1) 43f313e1-7679-44fd-8903-078f2cf740b5 IU_2035083e-d5af-4145-9b2b-71b9b77a2873,MD_9,PU_00000000-0000-0000-0000-000000000000
2) 70a9c30e-1845-4278-8685-159b78f2d051 IU_2035083e-d5af-4145-9b2b-71b9b77a2873,MD_12,PU_43f313e1-7679-44fd-8903-078f2cf740b5
=> The second LV is the leftover from the failed snapshot; you can see its parent (the PU_ tag) is the original disk volume.

2018-08-03 06:58:00,873 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep b4f0b2c4-e99f-47b4-a581-2fd19bd19feb | wc -l
2018-08-03 06:58:01,762 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep b4f0b2c4-e99f-47b4-a581-2fd19bd19feb | wc -l
2018-08-03 06:58:01,763 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1
--
2018-08-03 06:58:04,484 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
2018-08-03 06:58:05,477 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
2018-08-03 06:58:05,478 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 2  => LEFTOVER FOUND!
--
2018-08-03 06:58:08,492 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep 7ed04b54-e589-4a1d-bccb-912d53e298e2 | wc -l
2018-08-03 06:58:09,496 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep 7ed04b54-e589-4a1d-bccb-912d53e298e2 | wc -l
2018-08-03 06:58:09,498 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1
--
2018-08-03 06:58:12,392 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep df895e8d-5f44-447c-9331-a18516efc2b9 | wc -l
2018-08-03 06:58:13,201 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep df895e8d-5f44-447c-9331-a18516efc2b9 | wc -l
2018-08-03 06:58:13,202 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1

Expected results:
The same number of volumes as before the snapshot creation.

Additional info:
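For reference, the per-disk check above can be automated. Below is a minimal sketch (assumptions: run as root on the SPM host, and the IU_<image-uuid> LV tag convention shown in the lvs output above; EXPECTED_VOLUMES=1 matches this reproduction, where every disk had a single volume before the snapshot):

#!/usr/bin/env python
# Minimal leftover check: count LVs per image via their IU_<image-uuid>
# tag, as seen in the lvs output above. Run as root on the SPM host.
import subprocess

EXPECTED_VOLUMES = 1  # each disk had a single volume before the snapshot

out = subprocess.check_output(
    ["lvs", "--noheadings", "-o", "lv_name,lv_tags"])

counts = {}
for line in out.decode().splitlines():
    fields = line.split()
    if len(fields) < 2:  # LV without tags - not a VDSM volume
        continue
    for tag in fields[1].split(","):
        if tag.startswith("IU_"):
            img = tag[len("IU_"):]
            counts[img] = counts.get(img, 0) + 1

for img, n in sorted(counts.items()):
    flag = "  <-- possible leftover" if n > EXPECTED_VOLUMES else ""
    print("%s: %d volume(s)%s" % (img, n, flag))

Note that snapshots legitimately add volumes, so the script only flags images with more LVs than expected for this specific scenario.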
I see a deletion of the leftover was attempted, as expected, here:

2018-08-02 22:08:08,046+0300 INFO (jsonrpc/5) [vdsm.api] START deleteVolume(sdUUID='ca5552b7-305a-4667-9688-a42c180806ff', spUUID='289db27b-7067-4c88-8311-88e0092e692d', imgUUID='2035083e-d5af-4145-9b2b-71b9b77a2873', volumes=['70a9c30e-1845-4278-8685-159b78f2d051'], postZero='false', force='true', discard=True) from=::ffff:10.35.162.7,45088, flow_id=snapshots_create_cbd1c26b-d708-47f7, task_id=f5c436cb-5d86-44fb-a18c-86024a857d2e (api:46)

But it failed, probably because the connection was still blocked:

2018-08-02 22:08:09,453+0300 ERROR (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1966, in deleteVolume
    with rm.acquireResource(img_ns, imgUUID, rm.EXCLUSIVE):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1025, in acquireResource
    return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 475, in acquireResource
    raise se.ResourceAcqusitionFailed()
ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()
2018-08-02 22:08:09,453+0300 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') Task._run: f5c436cb-5d86-44fb-a18c-86024a857d2e () {} failed - stopping task (task:894)
2018-08-02 22:08:09,454+0300 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') stopping in state running (force False) (task:1256)
2018-08-02 22:08:09,454+0300 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') ref 1 aborting True (task:1002)
2018-08-02 22:08:09,454+0300 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') aborting: Task is aborted: u'Could not acquire resource. Probably resource factory threw an exception.: ()' - code 100 (task:928)

Unfortunately, there is not much we can do if the storage is blocked when we attempt to clean up.

Regardless, there is a patch posted for this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1598996
which enables tracking of the deletion task, which is currently not tracked.
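To make the failure above concrete: acquiring the image resource runs a resource factory which, I assume, has to read the image's state from storage, so with the domain unreachable the acquisition itself throws and the delete never gets as far as removing the LV. A purely illustrative sketch of that ordering (not the actual VDSM resourceManager code; all names are stand-ins):

# Illustrative only - not actual VDSM code. Shows why a blocked domain
# aborts deleteVolume before any LV is removed: the exclusive resource
# acquisition runs a factory step that must touch storage first.

class ResourceAcquisitionFailed(Exception):
    pass

def acquire_image_resource(img_uuid, read_image_state):
    try:
        # factory step: requires the storage domain to be reachable
        return read_image_state(img_uuid)
    except OSError as e:
        # e.g. [Errno 5] Input/output error while storage is blocked
        raise ResourceAcquisitionFailed(
            "Could not acquire resource. "
            "Probably resource factory threw an exception.: %s" % e)

def delete_volume(img_uuid, read_image_state, remove_lv):
    # acquisition happens first, so the leftover LV survives the cleanup
    state = acquire_image_resource(img_uuid, read_image_state)
    remove_lv(img_uuid, state)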
(In reply to Benny Zlotnik from comment #1)
> I see a deletion of the leftover was attempted, as expected, here:
>
> 2018-08-02 22:08:08,046+0300 INFO (jsonrpc/5) [vdsm.api] START
> deleteVolume(sdUUID='ca5552b7-305a-4667-9688-a42c180806ff',
> spUUID='289db27b-7067-4c88-8311-88e0092e692d',
> imgUUID='2035083e-d5af-4145-9b2b-71b9b77a2873',
> volumes=['70a9c30e-1845-4278-8685-159b78f2d051'], postZero='false',
> force='true', discard=True) from=::ffff:10.35.162.7,45088,
> flow_id=snapshots_create_cbd1c26b-d708-47f7,
> task_id=f5c436cb-5d86-44fb-a18c-86024a857d2e (api:46)
>
> But it failed, probably because the connection was still blocked:
>
> [...]
>
> Unfortunately, there is not much we can do if the storage is blocked when
> we attempt to clean up.

I would expect that we either:

1) wait until the storage domain is up before trying to delete the leftover volume,

or

2) re-attempt removing leftover volumes every X seconds, so that once the storage is available again we remove them (a sketch of this follows below).

If we keep the current behavior we can end up with multiple leftover volumes after a failed snapshot creation (think of VMs with multiple disks).

> Regardless, there is a patch posted for this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1598996
> which enables tracking of the deletion task, which is currently not
> tracked.
Yes, but that bug only clears the deleteImage task after a create-snapshot failure; it will not take care of leftover volumes.
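A minimal sketch of option 2 - periodically retrying the cleanup until the domain is reachable again. This is only an illustration of the proposed behavior; delete_volume and domain_is_up are hypothetical stand-ins, not existing engine/VDSM APIs:

# Sketch of option 2: retry leftover-volume cleanup until the storage
# domain is back. All names are hypothetical stand-ins.
import time

RETRY_INTERVAL = 60  # seconds between cleanup attempts
MAX_ATTEMPTS = 60    # give up after about an hour

def cleanup_leftovers(leftovers, delete_volume, domain_is_up):
    """Retry deletion of leftover volumes until none remain.

    leftovers     -- list of (sd_uuid, img_uuid, vol_uuid) tuples
    delete_volume -- callable performing the actual deletion
    domain_is_up  -- callable reporting storage domain availability
    """
    remaining = list(leftovers)
    for _ in range(MAX_ATTEMPTS):
        still_left = []
        for sd, img, vol in remaining:
            if not domain_is_up(sd):
                still_left.append((sd, img, vol))
                continue
            try:
                delete_volume(sd, img, vol)
            except Exception:
                # storage may have gone away again; retry next round
                still_left.append((sd, img, vol))
        remaining = still_left
        if not remaining:
            return True   # all leftovers cleaned up
        time.sleep(RETRY_INTERVAL)
    return False          # leftovers remain; surface for manual cleanup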
This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
(In reply to Avihai from comment #2)
> (In reply to Benny Zlotnik from comment #1)
> > I see a deletion of the leftover was attempted, as expected, here:
> >
> > [...]
> >
> > Unfortunately, there is not much we can do if the storage is blocked
> > when we attempt to clean up.
>
> I would expect that we either:
>
> 1) wait until the storage domain is up before trying to delete the
> leftover volume,
>
> or
>
> 2) re-attempt removing leftover volumes every X seconds, so that once the
> storage is available again we remove them.
>
> If we keep the current behavior we can end up with multiple leftover
> volumes after a failed snapshot creation (think of VMs with multiple
> disks).
>
> > Regardless, there is a patch posted for this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1598996
> > which enables tracking of the deletion task, which is currently not
> > tracked.
> Yes, but that bug only clears the deleteImage task after a create-snapshot
> failure; it will not take care of leftover volumes.

We do not have the infra to do something like this; I'm closing for now. Please open an RFE if it's important.
(In reply to Benny Zlotnik from comment #4)
> (In reply to Avihai from comment #2)
> > (In reply to Benny Zlotnik from comment #1)
> > > I see a deletion of the leftover was attempted, as expected, here:
> > >
> > > [...]
> > >
> > > Unfortunately, there is not much we can do if the storage is blocked
> > > when we attempt to clean up.
> >
> > I would expect that we either:
> >
> > 1) wait until the storage domain is up before trying to delete the
> > leftover volume,
> >
> > or
> >
> > 2) re-attempt removing leftover volumes every X seconds, so that once
> > the storage is available again we remove them.
> >
> > If we keep the current behavior we can end up with multiple leftover
> > volumes after a failed snapshot creation (think of VMs with multiple
> > disks).
> >
> > > Regardless, there is a patch posted for this bug:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1598996
> > > which enables tracking of the deletion task, which is currently not
> > > tracked.
> > Yes, but that bug only clears the deleteImage task after a
> > create-snapshot failure; it will not take care of leftover volumes.
>
> We do not have the infra to do something like this; I'm closing for now.
> Please open an RFE if it's important.

Opened RFE bug 1695585 on this issue, as I think it is important: an environment with many snapshots can accumulate many leftover LVs, since cleanup fails while the SD is unavailable and is never retried after the first failure.
*** Bug 1650985 has been marked as a duplicate of this bug. ***