Bug 1695585 - [RFE] - Clean leftover volumes even if cleanup failed the first time after a failed snapshot creation (due to an unavailable SD)
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.3.2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Nobody
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks: 1565685
 
Reported: 2019-04-03 11:53 UTC by Avihai
Modified: 2022-01-18 11:36 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-29 11:33:06 UTC
oVirt Team: Storage
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-37817 0 None None None 2021-08-24 12:19:16 UTC

Description Avihai 2019-04-03 11:53:40 UTC
Description of problem:
This is taken from Bug 1611929; all the logs are there.

From Benny's remark in that bug, you can see that cleanup is attempted a first time and fails (due to the blocked storage).

The expected result:
Add logic so that, when a leftover LV remains, we wait for the SD to become available again and then delete the leftover LV.
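A minimal sketch of this requested behavior, in Python for illustration only (the engine side is actually Java); is_domain_available and delete_volume are hypothetical helpers standing in for the real engine/VDSM verbs, not existing oVirt APIs:

    import time

    RETRY_INTERVAL = 10   # seconds between SD availability checks (illustrative)
    MAX_WAIT = 600        # stop waiting after 10 minutes (illustrative)

    def cleanup_leftover_volume(sd_uuid, lv_name, is_domain_available, delete_volume):
        # Wait for the storage domain to come back, then delete the leftover LV.
        deadline = time.monotonic() + MAX_WAIT
        while time.monotonic() < deadline:
            if is_domain_available(sd_uuid):
                delete_volume(sd_uuid, lv_name)
                return True
            time.sleep(RETRY_INTERVAL)
        # SD never recovered; leave the LV for a later cleanup pass.
        return False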

Description of problem:
A VM with 4 disks has 4 matching volumes (LVs) before the snapshot.
Create a snapshot and block the connection from the SPM to the disks' storage domain, thus failing the snapshot creation.

The snapshot is indeed deleted, but when I check the number of volumes (LVs) per disk, one disk has 2 LVs instead of one: a leftover from the failed snapshot operation.

VDSM errors appearing in the rollback:
2018-08-02 22:07:55,384+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') running recovery Create volume metadata rollback: BlockVolume->createVolumeMetadataRollback(ca5552b7-305a-4667-9688-a42c180806ff,12) (task:812)
2018-08-02 22:07:55,384+0300 INFO  (tasks/5) [storage.Volume] Metadata rollback for sdUUID=ca5552b7-305a-4667-9688-a42c180806ff offs=12 (blockVolume:489)
2018-08-02 22:07:55,389+0300 ERROR (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 405, in run
    return function(self.task, *argslist)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 491, in createVolumeMetadataRollback
    {"NONE": "#" * (sc.METADATA_SIZE - 10)})
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 873, in _putMetadata
    cls.manifestClass._putMetadata(metaId, meta)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 244, in _putMetadata
    f.write(data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/directio.py", line 161, in write
    raise OSError(err, msg)
OSError: [Errno 5] Input/output error
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') Task._run: f838b186-7ee4-4643-a0b2-3addf43d5482 () {} failed - stopping task (task:894)
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') stopping in state raborting (force False) (task:1256)
2018-08-02 22:07:55,389+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') ref 1 aborting True (task:1002)
2018-08-02 22:07:55,389+0300 WARN  (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') task f838b186-7ee4-4643-a0b2-3addf43d5482: recovery failed: Task is aborted: u'[Errno 5] Input/output error' - code 100 (task:818)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 815, in _recover
    self._run(rec.run)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 896, in _run
    raise se.TaskAborted(message, code)
TaskAborted: Task is aborted: u'[Errno 5] Input/output error' - code 100
2018-08-02 22:07:55,390+0300 DEBUG (tasks/5) [storage.TaskManager.Task] (Task='f838b186-7ee4-4643-a0b2-3addf43d5482') moving from state raborting -> state raborting (task:602)

2018-08-02 22:08:03,412+0300 ERROR (monitor/ca5552b) [storage.Monitor] Error checking domain ca5552b7-305a-4667-9688-a42c180806ff (monitor:424)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 405, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1171, in selftest
    elif lvm.getVG(self.sdUUID).partial != lvm.VG_OK:
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 81, in __getattr__
    raise AttributeError("Failed reload: %s" % self.name)
AttributeError: Failed reload: ca5552b7-305a-4667-9688-a42c180806ff

ENGINE log:

2018-08-02 22:08:08,672+03 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' with failure.
2018-08-02 22:08:08,692+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] START, DestroyImageVDSCommand( DestroyImageVDSCommandParameters:{storagePoolId='289db27b-7067-4c88-8311-88e0092e692d', ignoreFailoverLimit='false', storageDomainId='ca5552b7-305a-4667-9688-a42c180806ff', imageGroupId='b4f0b2c4-e99f-47b4-a581-2fd19bd19feb', imageId='00000000-0000-0000-0000-000000000000', imageList='[9dfe49e0-ec55-4e7d-b16c-2c45dce3d6e9]', postZero='false', force='true'}), log id: 32b1981b
2018-08-02 22:08:09,812+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DestroyImageVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [snapshots_create_cbd1c26b-d708-47f7] FINISH, DestroyImageVDSCommand, log id: 32b1981b
2018-08-02 22:08:09,858+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] Lock freed to object 'EngineLock:{exclusiveLocks='[42cb2951-a16b-4231-be68-7129c39e8dae=VM]', sharedLocks=''}'
2018-08-02 22:08:09,883+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snap_TestCase6015_0222065805' creation for VM 'vm_TestCase6015_0222022658'.
2018-08-02 22:08:09,884+03 WARN  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] Command 'CreateSnapshotForVm' id: '8d8028d7-58c1-4043-878a-f8a476a701ab' end method execution failed, as the command isn't marked for endAction() retries silently ignoring
2018-08-02 22:08:30,893+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'd2fb82b9-7545-49be-88e0-988f9ce9c46e:iscsi_0' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,893+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'd2fb82b9-7545-49be-88e0-988f9ce9c46e:iscsi_0' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'ca5552b7-305a-4667-9688-a42c180806ff:iscsi_1' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain 'ca5552b7-305a-4667-9688-a42c180806ff:iscsi_1' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain '3367b82a-21bc-47e0-83d4-741e0745b241:iscsi_2' recovered from problem. vds: 'host_mixed_1'
2018-08-02 22:08:30,894+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-52059) [] Domain '3367b82a-21bc-47e0-83d4-741e0745b241:iscsi_2' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.




Version-Release number of selected component (if applicable):
4.2.6_SNAPSHOT-84.gad3de30.0.scratch.master.el7ev

How reproducible:
Happened once so far

Steps to Reproduce:
1) A VM with 4 disks attached; each disk has 1 volume.

Disk IDs (the four image group IDs checked in the lvs output below):
b4f0b2c4-e99f-47b4-a581-2fd19bd19feb
2035083e-d5af-4145-9b2b-71b9b77a2873
7ed04b54-e589-4a1d-bccb-912d53e298e2
df895e8d-5f44-447c-9331-a18516efc2b9

2) Check on the SPM the volumes matching the created disks. As each disk has 1 volume, we get a total of 4 volumes.

3) Fail snapshot creation by blocking the connection from the SPM to the storage domain the disks reside on (in this case SD 'iscsi_1') -> snapshot creation fails:
Aug 2, 2018, 10:08:09 PM Failed to complete snapshot 'snap_TestCase6015_0222065805' creation for VM 'vm_TestCase6015_0222022658'.

4) Unblock the connection from the SPM to the SD.
2018-08-02 22:08:14

5) Now check the matching volumes on the SPM again (a small illustrative helper for this check is sketched after these steps).
2018-08-02 22:08:37
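For illustration, a small Python helper equivalent to the lvs checks shown below: it counts the LVs whose tags reference a given image UUID (the IU_<uuid> tag). This is a sketch only; it assumes it runs on the SPM host with the lvs binary available, and does the filtering in Python instead of piping through grep and wc -l:

    import subprocess

    def count_volumes(image_uuid):
        # Count LVs whose tags reference the given image (IU_<uuid> tag).
        out = subprocess.check_output(
            ["lvs", "--noheadings", "-o", "lv_name,lv_tags"],
            universal_newlines=True)
        return sum(1 for line in out.splitlines() if image_uuid in line)

    # 1 LV per disk is expected before and after the failed snapshot;
    # a count of 2 for a disk means a leftover volume was left behind.
    print(count_volumes("2035083e-d5af-4145-9b2b-71b9b77a2873"))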


Actual results:
We expect 4 volumes (LVs) but instead we get 5!

The leftover on 1 disk image is seen by running the following commands on the SPM host:

[root@storage-ge8-vdsm1 ~]# lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
=> gives 2 volumes instead of 1, meaning a leftover is found.

[root@storage-ge8-vdsm1 ~]# lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873

1) 43f313e1-7679-44fd-8903-078f2cf740b5 IU_2035083e-d5af-4145-9b2b-71b9b77a2873,MD_9,PU_00000000-0000-0000-0000-000000000000

2) 70a9c30e-1845-4278-8685-159b78f2d051 IU_2035083e-d5af-4145-9b2b-71b9b77a2873,MD_12,PU_43f313e1-7679-44fd-8903-078f2cf740b5
=> This is the leftover from the snapshot! You can see that its parent (the PU_ tag) is the original disk volume.


2018-08-03 06:58:00,873 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep b4f0b2c4-e99f-47b4-a581-2fd19bd19feb | wc -l
2018-08-03 06:58:01,762 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep b4f0b2c4-e99f-47b4-a581-2fd19bd19feb | wc -l
2018-08-03 06:58:01,763 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1

--
2018-08-03 06:58:04,484 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
2018-08-03 06:58:05,477 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep 2035083e-d5af-4145-9b2b-71b9b77a2873 | wc -l
2018-08-03 06:58:05,478 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 2 => LEFTOVER FOUND !!!

--
2018-08-03 06:58:08,492 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep 7ed04b54-e589-4a1d-bccb-912d53e298e2 | wc -l
2018-08-03 06:58:09,496 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep 7ed04b54-e589-4a1d-bccb-912d53e298e2 | wc -l
2018-08-03 06:58:09,498 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1

--
2018-08-03 06:58:12,392 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Executing: lvs -o lv_name,lv_tags | grep df895e8d-5f44-447c-9331-a18516efc2b9 | wc -l
2018-08-03 06:58:13,201 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net] Results of command: lvs -o lv_name,lv_tags | grep df895e8d-5f44-447c-9331-a18516efc2b9 | wc -l
2018-08-03 06:58:13,202 - MainThread - RemoteExecutor - DEBUG - [root.82.79/qum5net]   OUT: 1

Expected results:
We expect the same number of volumes as before the snapshot creation.


Additional info:

Comment 1 Avihai 2019-04-04 06:11:55 UTC
I would expect that we either:
1) Wait until the storage domain is up before trying to delete the leftover volume,

or

2) Re-attempt removing leftover volumes every X seconds, so that once the storage is available again we remove them (a sketch of this option follows below).

If we keep the current behavior, we can be left with multiple leftover volumes after a failed snapshot creation (think of VMs with multiple disks).
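A minimal sketch of option 2, again in Python for illustration; remove_volume is a hypothetical stand-in for the real removal verb, and a real implementation would also cap the number of retries and persist the pending list across engine restarts:

    import threading

    RETRY_SECONDS = 30  # the "every X seconds" from option 2; illustrative value

    def retry_remove(sd_uuid, lv_name, remove_volume):
        # Re-schedule removal until it succeeds (e.g. once the SD is back).
        try:
            remove_volume(sd_uuid, lv_name)  # hypothetical removal verb
        except OSError:                      # storage still unreachable
            timer = threading.Timer(RETRY_SECONDS, retry_remove,
                                    args=(sd_uuid, lv_name, remove_volume))
            timer.daemon = True
            timer.start()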

Comment 2 Eyal Shenitzky 2021-08-24 12:18:03 UTC
This bug/RFE is more than 2 years old and hasn't received enough attention so far, so it is now flagged as pending close.
Please review whether it is still relevant, and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.

Comment 3 Michal Skrivanek 2021-09-29 11:33:06 UTC
This bug hasn't received any attention in a long time and is not planned for the foreseeable future; the oVirt development team has no plans to work on it.
Please feel free to reopen if you have a plan for contributing this feature/bug fix.

