Bug 1399688 - Snapshot delete on HSM fails
Summary: Snapshot delete on HSM fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.18.15.2
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.0.6
Target Release: 4.18.20
Assignee: Ala Hino
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard: storage
Depends On:
Blocks: 1321018
 
Reported: 2016-11-29 14:57 UTC by Kevin Alon Goldblatt
Modified: 2017-01-18 07:27 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 07:27:21 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: blocker+


Attachments
server, vdsm, engine.log (1.36 MB, application/x-gzip), attached 2016-11-29 14:58 UTC by Kevin Alon Goldblatt
server, vdsm, engine.log (2.01 MB, application/x-gzip), attached 2016-12-06 18:27 UTC by Kevin Alon Goldblatt


Links
oVirt gerrit 67982 (ovirt-4.0, MERGED): Live Merge: Fix teardown volume support - last updated 2016-12-08 14:39:42 UTC
oVirt gerrit 68396 (ovirt-4.0, MERGED): sd: Move teardownVolume to StorageDomain - last updated 2016-12-14 08:37:36 UTC

Description Kevin Alon Goldblatt 2016-11-29 14:57:24 UTC
Description of problem:
Snapshot delete on HSM fails - works fine on SPM

Version-Release number of selected component (if applicable):
vdsm-4.18.999-759.git435a852.el7.centos.x86_64 SPM
vdsm-4.18.17-1.el7ev.x86_64 HSM
rhevm-4.0.6-0.1.el7ev.noarch

How reproducible:
Every time
Steps to Reproduce:
1. Create a VM with a disk on the SPM host and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on an HSM host
6. Create a new snapshot
7. Delete the snapshot - the snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center

Actual results:
The snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center


Expected results:
The snapshot is deleted and the snapshot link is removed in /rhev/data-center

Additional info:
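A quick way to confirm the symptom on the host is to check whether the snapshot volume symlink is still present. This is a minimal sketch, assuming the block-domain layout under /rhev/data-center/mnt/blockSD, and using the storage domain, image, and volume UUIDs from the HSM tree output below (the volume the reporter marks as SHOULD HAVE BEEN DELETED); adjust the path for your setup:

import os

SD_UUID = "38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e"
IMG_UUID = "2df48dca-c4d1-431d-81e6-08676f7a469f"
VOL_UUID = "a0a14c70-2b3b-453d-a94d-9da790b7f036"  # volume that should be gone after the delete

# Assumed path layout for a block storage domain; the link points to /dev/<sd_uuid>/<vol_uuid>.
link = os.path.join("/rhev/data-center/mnt/blockSD", SD_UUID, "images", IMG_UUID, VOL_UUID)

if os.path.islink(link):
    print("snapshot link still present: %s" % link)
else:
    print("snapshot link removed as expected")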


SPM
Before
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036


With snapshots
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   ├── 79463314-d607-4f93-82e9-372c84697c1f -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/79463314-d607-4f93-82e9-372c84697c1f
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036



After deleting snapshot
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036



HSM
Before
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6


With snapshots
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036


After deleting snapshots
 ├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036  SHOULD HAVE BEEN DELETED



From engine.log
-----------------------------------------------------------------
2016-11-24 16:42:40,181 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [] Correlation ID: 5e2a8d2d, Job ID: f03fe7bf-9e3a-433b-8356-8c11d6bd9cb1, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snapper1' deletion for VM 'bbbb' was initiated by admin@internal-authz.
2016-11-24 16:42:40,184 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-3) [21f25490] Running command: RemoveDiskCommand internal: true. Entities affected :  ID: aab7e02f-229f-4bf3-8942-634858e873f5 Type: DiskAction group DELETE_DISK with role type USER
2016-11-24 16:42:40,191 INFO  [org.ovirt.engine.core.bll.storage.disk.image.RemoveImageCommand] (pool-5-thread-3) [7ea6c4ea] Running command: RemoveImageCommand internal: true. Entities affected :  ID: 38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e Type: Storage
2016-11-24 16:42:40,228 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-3) [7ea6c4ea] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='f30058ff-5799-45b2-a272-22b2d198aa16', ignoreFailoverLimit='false', storageDomainId='38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e', imageGroupId='aab7e02f-229f-4bf3-8942-634858e873f5', postZeros='false', forceDelete='false'}), log id: 223961f5

Comment 1 Kevin Alon Goldblatt 2016-11-29 14:58:56 UTC
Created attachment 1225888 [details]
server, vdsm, engine.log

Added logs

Comment 2 Richard Shaw 2016-11-29 16:12:28 UTC
Umm... I'm thinking you got the wrong bug as 1321018 is a review request...

Comment 3 Allon Mureinik 2016-11-29 20:40:11 UTC
I see quite a few of these in the HSM's log, not sure what to make of them ATM:

periodic/1::ERROR::2016-11-24 18:04:48,962::executor::232::Executor::(_execute_task) Unhandled exception in Task(callable=<BlockjobMonitor vm=20a5e29e-5576-4e2b-8901-4147bfba13ab at 0x2f8fe50>, timeout=7.5)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 230, in _execute_task
    task.callable()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 302, in __call__
    self._execute()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 373, in _execute
    self._vm.updateVmJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4506, in updateVmJobs
    self._vmJobs = self.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4529, in queryBlockJobs
    drive = self._findDriveByUUIDs(storedJob['disk'])
  File "/usr/share/vdsm/virt/vm.py", line 3248, in _findDriveByUUIDs
    raise LookupError("No such drive: '%s'" % drive)
LookupError: No such drive: '{'domainID': '38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e', 'imageID': '2df48dca-c4d1-431d-81e6-08676f7a469f', 'volumeID': '1ac5ad49-3e63-454a-954b-6f074acd4ed6', 'poolID': 'f30058ff-5799-45b2-a272-22b2d198aa16'}'

Comment 4 Francesco Romani 2016-12-05 11:51:46 UTC
(In reply to Allon Mureinik from comment #3)
> I see quite a few of these in the HSM's log, not sure what to make of them
> ATM:
> 
> periodic/1::ERROR::2016-11-24
> 18:04:48,962::executor::232::Executor::(_execute_task) Unhandled exception
> in Task(callable=<BlockjobMonitor vm=20a5e29e-5576-4e2b-8901-4147bfba13ab at
> 0x2f8fe50>, timeout=7.5)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 230, in
> _execute_task
>     task.callable()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 302,
> in __call__
>     self._execute()
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 373,
> in _execute
>     self._vm.updateVmJobs()
>   File "/usr/share/vdsm/virt/vm.py", line 4506, in updateVmJobs
>     self._vmJobs = self.queryBlockJobs()
>   File "/usr/share/vdsm/virt/vm.py", line 4529, in queryBlockJobs
>     drive = self._findDriveByUUIDs(storedJob['disk'])
>   File "/usr/share/vdsm/virt/vm.py", line 3248, in _findDriveByUUIDs
>     raise LookupError("No such drive: '%s'" % drive)
> LookupError: No such drive: '{'domainID':
> '38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e', 'imageID':
> '2df48dca-c4d1-431d-81e6-08676f7a469f', 'volumeID':
> '1ac5ad49-3e63-454a-954b-6f074acd4ed6', 'poolID':
> 'f30058ff-5799-45b2-a272-22b2d198aa16'}'

This is related, but it should not be the root cause.
This task periodically updates the block job info that is reported to Engine using getVmStats() and friends.

This error means that the block job tracking information is stale. This is bad news, but it is consistent with the reported state.
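To illustrate the failure mode, here is a minimal sketch (not vdsm's actual code) of the lookup pattern visible in the traceback: the periodic monitor resolves the drive for every stored block job, and a stale entry whose UUIDs no longer match any drive raises the LookupError seen above. The class and function names are simplified stand-ins for vm.py's queryBlockJobs/_findDriveByUUIDs.

# Simplified, illustrative sketch of the lookup pattern from the traceback above.
class Drive(object):
    def __init__(self, domainID, imageID, volumeID, poolID):
        self.domainID = domainID
        self.imageID = imageID
        self.volumeID = volumeID
        self.poolID = poolID


def find_drive_by_uuids(drives, disk):
    # Each stored block job references its disk by a UUID tuple. If the VM no
    # longer has a drive matching that tuple (i.e. the tracking info is stale),
    # the lookup fails, which is exactly the LookupError in the traceback.
    for d in drives:
        if (d.domainID, d.imageID, d.volumeID, d.poolID) == (
                disk['domainID'], disk['imageID'],
                disk['volumeID'], disk['poolID']):
            return d
    raise LookupError("No such drive: '%s'" % disk)

When a stored job still references a volume the VM no longer reports for any drive, every periodic pass hits this error, which matches the repeated entries in the HSM log.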

Comment 5 Michal Skrivanek 2016-12-05 12:54:18 UTC
Can you please include more of the logs? Even at the beginning of the attached logs the error is already happening.

Comment 6 Kevin Alon Goldblatt 2016-12-06 18:26:22 UTC
Added a new set of logs.
NB: to make the scenario easier to find, markers were written into the engine log with:
echo "testone_start" >> engine.log
echo "testone_end" >> engine.log

and into the SPM and HSM vdsm logs with:
echo "testone_start" >> vdsm.log
echo "testone_end" >> vdsm.log




Verified again with the following versions:
---------------------------------------------------
vdsm-4.18.999-759.git435a852.el7.centos.x86_64 SPM
vdsm-4.18.17-1.el7ev.x86_64 HSM
ovirt-engine-4.0.6-0.1.el7ev.noarch


How reproducible:
Every time
Steps to Reproduce:
1. Create a VM with a disk on the SPM host and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on an HSM host
6. Create a new snapshot
7. Delete the snapshot - the snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center

Actual results:
The snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center


Expected results:
The snapshot is deleted and the snapshot link is removed in /rhev/data-center

Additional info:


HSM
Before
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
    │   │       │   ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462



With snapshots
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
    │   │       │   ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462
    │   │       │   └── 93d39a43-7d81-40dd-9060-f33857d7bfd7 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/93d39a43-7d81-40dd-9060-f33857d7bfd7



After deleting snapshots
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
    │   │       │   ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462
    │   │       │   └── 93d39a43-7d81-40dd-9060-f33857d7bfd7 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/93d39a43-7d81-40dd-9060-f33857d7bfd7

^^^^^^SHOULD HAVE BEEN DELETED^^^^^^^





From Engine.log
-------------------------------
ternal: true. Entities affected :  ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2016-12-06 19:34:10,181 INFO  [org.ovirt.engine.core.bll.RemoveMemoryVolumesCommand] (pool-5-thread-7) [1489e1f1] Running command: RemoveMemoryVolumesCommand internal: true.
2016-12-06 19:34:10,217 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [] Correlation ID: 5cab2dba, Job ID: 9056c6db-b868-4807-8776-ba8c48183a0b, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'sns_hms_testone' deletion for VM 'testone' was initiated by admin@internal-authz.
2016-12-06 19:34:10,254 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-7) [65a94a76] Lock Acquired to object 'EngineLock:{exclusiveLocks='[40c3733e-b0b4-4f72-aa09-a903dacba9cd=<DISK, ACTION_TYPE_FAILED_DISK_IS_BEING_REMOVED$DiskName snapshot_memory>]', sharedLocks='null'}'
2016-12-06 19:34:10,273 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-7) [65a94a76] Running command: RemoveDiskCommand internal: true. Entities affected :  ID: 40c3733e-b0b4-4f72-aa09-a903dacba9cd Type: DiskAction group DELETE_DISK with role type USER

Comment 7 Kevin Alon Goldblatt 2016-12-06 18:27:51 UTC
Created attachment 1228662 [details]
server, vdsm, engine.log

added new logs

Comment 8 Tal Nisan 2016-12-12 12:33:35 UTC
This will only be included in v4.18.19 and onwards, moving back to MODIFIED

Comment 9 Sandro Bonazzola 2016-12-12 13:27:39 UTC
(In reply to Tal Nisan from comment #8)
> This will only be included in v4.18.19 and onwards, moving back to MODIFIED

Tal, 4.18.20 has been released in 4.0.6 RC4

Comment 10 Tal Nisan 2016-12-12 15:48:40 UTC
Wasn't it supposed to be released today? It was moved to ON_QA and QE had 4.18.18 as the latest version.

Comment 11 Kevin Alon Goldblatt 2016-12-13 18:11:33 UTC
Verified with the following version:
----------------------------------------------
vdsm-4.18.20-1.el7ev.x86_64 SPM
vdsm-4.18.20-1.el7ev.x86_64 HSM
rhevm-4.0.6.3-0.1.el7ev.noarch

How reproducible:
Every time

Verified with the following scenario:
----------------------------------------------
Steps to Reproduce:
1. Create a VM with a disk on the SPM host and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on an HSM host
6. Create a new snapshot
7. Delete the snapshot - the snapshot is successfully deleted and the links are removed


Actual results:
The snapshot is deleted and the link is removed in /rhev/data-center

