Created attachment 1472879 [details]
engine and VDSM logs

Description of problem:
A VM with 4 disks has 4 matching volumes (one per disk) before the snapshot. Create a snapshot and block the connection from the SPM to the disks' storage domain, so the snapshot creation fails. The snapshot itself is indeed deleted, but when I check the number of volumes (LVs) per disk, one disk has 2 LVs instead of one - a leftover from the failed snapshot operation.

VDSM ERRORS appearing in the rollback:

2018-08-02 22:07:55,384+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') running recovery Create volume metadata rollback: BlockVolume->createVolumeMetadataRollback(ca5552b7-305a-4667-9688-a42c180806ff,12) (task:812)
2018-08-02 22:07:55,384+0300 INFO  (tasks/5) [storage.Volume] Metadata rollback for sdUUID=ca5552b7-305a-4667-9688-a42c180806ff offs=12 (blockVolume:489)
2018-08-02 22:07:55,389+0300 ERROR (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 405, in run
    return function(self.task, *argslist)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 491, in createVolumeMetadataRollback
    {"NONE": "#" * (sc.METADATA_SIZE - 10)})
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 873, in _putMetadata
    cls.manifestClass._putMetadata(metaId, meta)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 244, in _putMetadata
    f.write(data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/directio.py", line 161, in write
    raise OSError(err, msg)
OSError: [Errno 5] Input/output error
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') Task._run: f838b186-7ee4-4643-a0b2-3addf43d5482 () {} failed - stopping task (task:894)
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') stopping in state raborting (force False) (task:1256)
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') ref 1 aborting True (task:1002)
2018-08-02 22:07:55,389+0300 WARN  (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') task f838b186-7ee4-4643-a0b2-3addf43d5482: recovery failed: Task is aborted: u'[Errno 5] Input/output error' - code 100 (task:818)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 815, in _recover
    self._run(rec.run)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 896, in _run
    raise se.TaskAborted(message, code)
TaskAborted: Task is aborted: u'[Errno 5] Input/output error' - code 100
2018-08-02 22:07:55,390+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') moving from state raborting -> state raborting (task:602)
2018-08-02 22:08:03,412+0300 ERROR (monitor/ca5552b) [storage.Monitor] Error checking domain ca5552b7-305a-4667-9688-a42c180806ff (monitor:424)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 405, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1171, in selftest
    elif lvm.getVG(self.sdUUID).partial != lvm.VG_OK:
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 81, in __getattr__
    raise AttributeError("Failed reload: %s" % self.name)
AttributeError: Failed reload: ca5552b7-305a-4667-9688-a42c180806ff

ENGINE log:

2018-08-02 22:08:08,672+03 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' with failure.
2018-08-02 22:08:08,692+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='289db27b-7067-4c88-8311-88e0092e692d', ignoreFailoverLimit='false', storageDomainId='ca5552b7-305a-4667-9688-a42c180806ff', imageGroupId='b4f0b2c4-e99f-47b4-a581-2fd19bd19feb', imageId='00000000-0000-0000-0000-000000000000', imageList='[9dfe49e0-ec55-4e7d-b16c-2c45dce3d6e9]', postZero='false', force='true'}), log id: 32b1981b
2018-08-02 22:08:09,812+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] FINISH, DestroyImageVDSCommand, log id: 32b1981b
2018-08-02 22:08:09,858+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] Lock freed to object 'EngineLock:{exclusiveLocks='[42cb2951-a16b-4231-be68-7129c39e8dae=VM]', sharedLocks=''}'
2018-08-02 22:08:09,883+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snap_TestCase6015_0222065805' creation for VM 'vm_TestCase6015_0222022658'.
2018-08-02 22:08:09,884+03 WARN  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] Command 'CreateSnapshotForVm' id: '8d8028d7-58c1-4043-878a-f8a476a701ab' end method execution failed, as the command isn't marked for endAction() retries silently ignoring
2018-08-02 22:08:30,893+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'd2fb82b9-7545-49be-88e0-988f9ce9c46e:iscsi_0' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,893+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'd2fb82b9-7545-49be-88e0-988f9ce9c46e:iscsi_0' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'ca5552b7-305a-4667-9688-a42c180806ff:iscsi_1' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'ca5552b7-305a-4667-9688-a42c180806ff:iscsi_1' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain '3367b82a-21bc-47e0-83d4-741e0745b241:iscsi_2' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain '3367b82a-21bc-47e0-83d4-741e0745b241:iscsi_2' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.

Version-Release number of selected component (if applicable):
4.2.6_SNAPSHOT-84.gad3de30.0.scratch.master.el7ev

How reproducible:
Happened once so far

Steps to Reproduce:
1) VM with 4 disks attached, each disk with 1 volume. The disk image IDs (as checked below) are:
b4f0b2c4-e99f-47b4-a581-2fd19bd19feb
2035083e-d5af-4145-9b2b-71b9b77a2873
7ed04b54-e589-4a1d-bccb-912d53e298e2
df895e8d-5f44-447c-9331-a18516efc2b9
2) Check on the SPM the volumes matching the created disks. As each disk has 1 volume, we get a total of 4 volumes.
3) Fail the snapshot creation by blocking the connection from the SPM to the storage domain the disks reside on (in this case SD 'iscsi_1') -> snapshot creation fails:
Aug 2, 2018, 10:08:09 PM Failed to complete snapshot 'snap_TestCase6015_0222065805' creation for VM 'vm_TestCase6015_0222022658'.
4) Unblock the connection from the SPM to the SD (2018-08-02 22:08:14).
5) Check the matching volumes on the SPM again (2018-08-02 22:08:37).

Actual results:
We expect 4 volumes (LVs) but instead we get 5 - a leftover on 1 disk image, seen by running the following commands on the SPM host:

[root@storage-ge8-vdsm1 ~]# lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
=> gives 2 volumes instead of one, meaning a leftover was found.

[root@storage-ge8-vdsm1 ~]# lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873
1) 43f313e1-7679-44fd-8903-078f2cf740b5 IU_2035083e-d5af-4145-9b2b-71b9b77a2873,MD_9,PU_00000000-0000-0000-0000-000000000000
2) 70a9c30e-1845-4278-8685-159b78f2d051 IU_2035083e-d5af-4145-9b2b-71b9b77a2873,MD_12,PU_43f313e1-7679-44fd-8903-078f2cf740b5
=> The second LV is the leftover from the failed snapshot; you can see its parent (the PU_ tag) is the original disk volume.

2018-08-03 06:58:00,873 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep b4f0b2c4-e99f-47b4-a581-2fd19bd19feb | wc -l
2018-08-03 06:58:01,762 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep b4f0b2c4-e99f-47b4-a581-2fd19bd19feb | wc -l
2018-08-03 06:58:01,763 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1
--
2018-08-03 06:58:04,484 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
2018-08-03 06:58:05,477 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
2018-08-03 06:58:05,478 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 2  => LEFTOVER FOUND!
--
2018-08-03 06:58:08,492 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep 7ed04b54-e589-4a1d-bccb-912d53e298e2 | wc -l
2018-08-03 06:58:09,496 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep 7ed04b54-e589-4a1d-bccb-912d53e298e2 | wc -l
2018-08-03 06:58:09,498 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1
--
2018-08-03 06:58:12,392 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep df895e8d-5f44-447c-9331-a18516efc2b9 | wc -l
2018-08-03 06:58:13,201 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep df895e8d-5f44-447c-9331-a18516efc2b9 | wc -l
2018-08-03 06:58:13,202 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1

Expected results:
The same number of volumes as before the snapshot creation.

Additional info:
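For reference, the per-disk check above can be automated. Below is a minimal sketch (assumptions: run as root on the SPM host, and the IU_<image-uuid> LV tag convention shown in the lvs output above; EXPECTED_VOLUMES=1 matches this reproduction, where every disk had a single volume before the snapshot):

#!/usr/bin/env python
# Minimal leftover check: count LVs per image via their IU_<image-uuid>
# tag, as seen in the lvs output above. Run as root on the SPM host.
import subprocess

EXPECTED_VOLUMES = 1  # each disk had a single volume before the snapshot

out = subprocess.check_output(
    ["lvs", "--noheadings", "-o", "lv_name,lv_tags"])

counts = {}
for line in out.decode().splitlines():
    fields = line.split()
    if len(fields) < 2:  # LV without tags - not a VDSM volume
        continue
    for tag in fields[1].split(","):
        if tag.startswith("IU_"):
            img = tag[len("IU_"):]
            counts[img] = counts.get(img, 0) + 1

for img, n in sorted(counts.items()):
    flag = "  <-- possible leftover" if n > EXPECTED_VOLUMES else ""
    print("%s: %d volume(s)%s" % (img, n, flag))

Note that snapshots legitimately add volumes, so the script only flags images with more LVs than expected for this specific scenario.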
I see a deletion of the leftover was attempted, as expected, here:

2018-08-02 22:08:08,046+0300 INFO (jsonrpc/5) [vdsm.api] START deleteVolume(sdUUID='ca5552b7-305a-4667-9688-a42c180806ff', spUUID='289db27b-7067-4c88-8311-88e0092e692d', imgUUID='2035083e-d5af-4145-9b2b-71b9b77a2873', volumes=['70a9c30e-1845-4278-8685-159b78f2d051'], postZero='false', force='true', discard=True) from=::ffff:10.35.162.7,45088, flow_id=snapshots_create_cbd1c26b-d708-47f7, task_id=f5c436cb-5d86-44fb-a18c-86024a857d2e (api:46)

But it failed, probably because the connection was still blocked:

2018-08-02 22:08:09,453+0300 ERROR (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1966, in deleteVolume
    with rm.acquireResource(img_ns, imgUUID, rm.EXCLUSIVE):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 1025, in acquireResource
    return _manager.acquireResource(namespace, name, lockType, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/resourceManager.py", line 475, in acquireResource
    raise se.ResourceAcqusitionFailed()
ResourceAcqusitionFailed: Could not acquire resource. Probably resource factory threw an exception.: ()
2018-08-02 22:08:09,453+0300 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') Task._run: f5c436cb-5d86-44fb-a18c-86024a857d2e () {} failed - stopping task (task:894)
2018-08-02 22:08:09,454+0300 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') stopping in state running (force False) (task:1256)
2018-08-02 22:08:09,454+0300 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') ref 1 aborting True (task:1002)
2018-08-02 22:08:09,454+0300 DEBUG (tasks/1) [storage.TaskManager.Task] (Task='f5c436cb-5d86-44fb-a18c-86024a857d2e') aborting: Task is aborted: u'Could not acquire resource. Probably resource factory threw an exception.: ()' - code 100 (task:928)

Unfortunately, there is not much we can do if the storage is blocked when we attempt to clean up.

Regardless, there is a patch posted for this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1598996
which enables tracking of the deletion task, which is currently not tracked.
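To make the failure above concrete: acquiring the image resource runs a resource factory which, I assume, has to read the image's state from storage, so with the domain unreachable the acquisition itself throws and the delete never gets as far as removing the LV. A purely illustrative sketch of that ordering (not the actual VDSM resourceManager code; all names are stand-ins):

# Illustrative only - not actual VDSM code. Shows why a blocked domain
# aborts deleteVolume before any LV is removed: the exclusive resource
# acquisition runs a factory step that must touch storage first.

class ResourceAcquisitionFailed(Exception):
    pass

def acquire_image_resource(img_uuid, read_image_state):
    try:
        # factory step: requires the storage domain to be reachable
        return read_image_state(img_uuid)
    except OSError as e:
        # e.g. [Errno 5] Input/output error while storage is blocked
        raise ResourceAcquisitionFailed(
            "Could not acquire resource. "
            "Probably resource factory threw an exception.: %s" % e)

def delete_volume(img_uuid, read_image_state, remove_lv):
    # acquisition happens first, so the leftover LV survives the cleanup
    state = acquire_image_resource(img_uuid, read_image_state)
    remove_lv(img_uuid, state)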
(In reply to Benny Zlotnik from comment #1)
> I see a deletion of the leftover was attempted, as expected, here:
>
> 2018-08-02 22:08:08,046+0300 INFO (jsonrpc/5) [vdsm.api] START
> deleteVolume(sdUUID='ca5552b7-305a-4667-9688-a42c180806ff',
> spUUID='289db27b-7067-4c88-8311-88e0092e692d',
> imgUUID='2035083e-d5af-4145-9b2b-71b9b77a2873',
> volumes=['70a9c30e-1845-4278-8685-159b78f2d051'], postZero='false',
> force='true', discard=True) from=::ffff:10.35.162.7,45088,
> flow_id=snapshots_create_cbd1c26b-d708-47f7,
> task_id=f5c436cb-5d86-44fb-a18c-86024a857d2e (api:46)
>
> But it failed, probably because the connection was still blocked:
>
> [...]
>
> Unfortunately, there is not much we can do if the storage is blocked when
> we attempt to clean up.

I would expect that we either:

1) wait until the storage domain is up before trying to delete the leftover volume,

or

2) re-attempt removing leftover volumes every X seconds, so that once the storage is available again we remove them (a sketch of this follows below).

If we keep the current behavior we can end up with multiple leftover volumes after a failed snapshot creation (think of VMs with multiple disks).

> Regardless, there is a patch posted for this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1598996
> which enables tracking of the deletion task, which is currently not
> tracked.
Yes, but that bug only clears the deleteImage task after a create-snapshot failure; it will not take care of leftover volumes.
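A minimal sketch of option 2 - periodically retrying the cleanup until the domain is reachable again. This is only an illustration of the proposed behavior; delete_volume and domain_is_up are hypothetical stand-ins, not existing engine/VDSM APIs:

# Sketch of option 2: retry leftover-volume cleanup until the storage
# domain is back. All names are hypothetical stand-ins.
import time

RETRY_INTERVAL = 60  # seconds between cleanup attempts
MAX_ATTEMPTS = 60    # give up after about an hour

def cleanup_leftovers(leftovers, delete_volume, domain_is_up):
    """Retry deletion of leftover volumes until none remain.

    leftovers     -- list of (sd_uuid, img_uuid, vol_uuid) tuples
    delete_volume -- callable performing the actual deletion
    domain_is_up  -- callable reporting storage domain availability
    """
    remaining = list(leftovers)
    for _ in range(MAX_ATTEMPTS):
        still_left = []
        for sd, img, vol in remaining:
            if not domain_is_up(sd):
                still_left.append((sd, img, vol))
                continue
            try:
                delete_volume(sd, img, vol)
            except Exception:
                # storage may have gone away again; retry next round
                still_left.append((sd, img, vol))
        remaining = still_left
        if not remaining:
            return True   # all leftovers cleaned up
        time.sleep(RETRY_INTERVAL)
    return False          # leftovers remain; surface for manual cleanup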
This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
(In reply to Avihai from comment #2)
> (In reply to Benny Zlotnik from comment #1)
> > I see a deletion of the leftover was attempted, as expected, here:
> >
> > [...]
> >
> > Unfortunately, there is not much we can do if the storage is blocked
> > when we attempt to clean up.
>
> I would expect that we either:
>
> 1) wait until the storage domain is up before trying to delete the
> leftover volume,
>
> or
>
> 2) re-attempt removing leftover volumes every X seconds, so that once the
> storage is available again we remove them.
>
> If we keep the current behavior we can end up with multiple leftover
> volumes after a failed snapshot creation (think of VMs with multiple
> disks).
>
> > Regardless, there is a patch posted for this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1598996
> > which enables tracking of the deletion task, which is currently not
> > tracked.
> Yes, but that bug only clears the deleteImage task after a create-snapshot
> failure; it will not take care of leftover volumes.

We do not have the infra to do something like this; I'm closing for now. Please open an RFE if it's important.
(In reply to Benny Zlotnik from comment #4)
> (In reply to Avihai from comment #2)
> > (In reply to Benny Zlotnik from comment #1)
> > > I see a deletion of the leftover was attempted, as expected, here:
> > >
> > > [...]
> > >
> > > Unfortunately, there is not much we can do if the storage is blocked
> > > when we attempt to clean up.
> >
> > I would expect that we either:
> >
> > 1) wait until the storage domain is up before trying to delete the
> > leftover volume,
> >
> > or
> >
> > 2) re-attempt removing leftover volumes every X seconds, so that once
> > the storage is available again we remove them.
> >
> > If we keep the current behavior we can end up with multiple leftover
> > volumes after a failed snapshot creation (think of VMs with multiple
> > disks).
> >
> > > Regardless, there is a patch posted for this bug:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1598996
> > > which enables tracking of the deletion task, which is currently not
> > > tracked.
> > Yes, but that bug only clears the deleteImage task after a
> > create-snapshot failure; it will not take care of leftover volumes.
>
> We do not have the infra to do something like this; I'm closing for now.
> Please open an RFE if it's important.

Opened RFE bug 1695585 on this issue, as I think it is important: an environment with many snapshots can accumulate many leftover LVs, since cleanup fails while the SD is unavailable and is never retried after the first failure.
*** Bug 1650985 has been marked as a duplicate of this bug. ***