Bug 1400137 - Snapshot remains locked when performing live-merge on HSM
Summary: Snapshot remains locked when performing live-merge on HSM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.18.15.2
Hardware: Unspecified
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.0.6
Target Release: ---
Assignee: Ala Hino
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2016-11-30 14:09 UTC by Kevin Alon Goldblatt
Modified: 2017-01-18 07:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 07:29:00 UTC
oVirt Team: Storage
rule-engine: ovirt-4.0.z+
rule-engine: blocker+


Attachments
server, vdsm, engine.log (1.36 MB, application/x-gzip), 2016-11-30 14:10 UTC, Kevin Alon Goldblatt
server, vdsm, engine.log (2.01 MB, application/x-gzip), 2016-12-06 18:28 UTC, Kevin Alon Goldblatt
engine and hsm logs (2.30 MB, application/x-gzip), 2016-12-13 16:51 UTC, Raz Tamir


Links
oVirt gerrit 67902 (master, MERGED): vm: storage: fix _diskXMLGetVolumeChainInfo (last updated 2020-08-18 08:19:02 UTC)
oVirt gerrit 67982 (ovirt-4.0, MERGED): Live Merge: Fix teardown volume support (last updated 2020-08-18 08:19:02 UTC)
oVirt gerrit 68396 (ovirt-4.0, MERGED): sd: Move teardownVolume to StorageDomain (last updated 2020-08-18 08:19:02 UTC)

Description Kevin Alon Goldblatt 2016-11-30 14:09:46 UTC
Description of problem:
Snapshot delete on HSM fails - works fine on SPM

Version-Release number of selected component (if applicable):
vdsm-4.18.999-759.git435a852.el7.centos.x86_64 SPM
vdsm-4.18.17-1.el7ev.x86_64 HSM
rhevm-4.0.6-0.1.el7ev.noarch

How reproducible:
Every time
Steps to Reproduce:
1. Create a VM with a disk on the SPM host and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on the HSM host
6. Create a new snapshot
7. Delete the snapshot - the snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center

Actual results:
The snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center.


Expected results:
The snapshot is deleted and its link is removed from /rhev/data-center.

Additional info:


SPM
Before
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036


With snapshots
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   ├── 79463314-d607-4f93-82e9-372c84697c1f -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/79463314-d607-4f93-82e9-372c84697c1f
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036



After deleting snapshot
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036



HSM
Before
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6


With snapshots
├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036


After deleting snapshots
 ├── 2df48dca-c4d1-431d-81e6-08676f7a469f
    │   │       │   ├── 1ac5ad49-3e63-454a-954b-6f074acd4ed6 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/1ac5ad49-3e63-454a-954b-6f074acd4ed6
    │   │       │   └── a0a14c70-2b3b-453d-a94d-9da790b7f036 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/a0a14c70-2b3b-453d-a94d-9da790b7f036  SHOULD HAVE BEEN DELETED
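
For reference, a minimal sketch of how the link check shown in the trees above can be scripted (the pool/domain/image UUIDs are taken from this report and serve only as an example path; adjust them for the environment):

import os

# Example layout: /rhev/data-center/<pool>/<domain>/images/<image>/<volume>
IMAGE_DIR = ('/rhev/data-center/f30058ff-5799-45b2-a272-22b2d198aa16/'
             '38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/images/'
             '2df48dca-c4d1-431d-81e6-08676f7a469f')

for name in sorted(os.listdir(IMAGE_DIR)):
    path = os.path.join(IMAGE_DIR, name)
    if os.path.islink(path):
        # After a successful live merge only the remaining volumes should be
        # linked here; a leftover link matches the LOCKED snapshot symptom.
        print('%s -> %s' % (name, os.readlink(path)))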



From engine.log
-----------------------------------------------------------------
2016-11-24 16:42:40,181 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [] Correlation ID: 5e2a8d2d, Job ID: f03fe7bf-9e3a-433b-8356-8c11d6bd9cb1, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snapper1' deletion for VM 'bbbb' was initiated by admin@internal-authz.
2016-11-24 16:42:40,184 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-3) [21f25490] Running command: RemoveDiskCommand internal: true. Entities affected :  ID: aab7e02f-229f-4bf3-8942-634858e873f5 Type: DiskAction group DELETE_DISK with role type USER
2016-11-24 16:42:40,191 INFO  [org.ovirt.engine.core.bll.storage.disk.image.RemoveImageCommand] (pool-5-thread-3) [7ea6c4ea] Running command: RemoveImageCommand internal: true. Entities affected :  ID: 38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e Type: Storage
2016-11-24 16:42:40,228 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-5-thread-3) [7ea6c4ea] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{runAsync='true', storagePoolId='f30058ff-5799-45b2-a272-22b2d198aa16', ignoreFailoverLimit='false', storageDomainId='38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e', imageGroupId='aab7e02f-229f-4bf3-8942-634858e873f5', postZeros='false', forceDelete='false'}), log id: 223961f5

Comment 1 Kevin Alon Goldblatt 2016-11-30 14:10:44 UTC
Created attachment 1226330 [details]
server, vdsm, engine.log

Added Logs

Comment 2 Ala Hino 2016-12-04 19:32:43 UTC
Francesco,

While looking into this BZ, I found that the merge command wasn't received by vdsm at all and the engine got a network timeout. When running top on the host, I found that vdsm consumes almost 100% CPU. I double-checked and found that before the live merge vdsm consumed almost no CPU (as expected, since the host isn't really doing anything), and after starting the live merge vdsm immediately jumps to ~100% CPU and keeps consuming 100% even after the engine receives the timeout.
The VM I use to test this is a really 'thin' VM with a 1 GB disk and no OS installed on it.

In Vdsm log I see the following error:

2016-12-04 11:18:57,008 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.35.0.131', 59896, 0, 0) at 0x3ca1710>: unexpected eof (betterAsyncore:113)

Dan mentioned that recently Mor Kalfon reported something similar when he tested OVN migration. Dan suggested consulting you, since there were changes made in the virt code to accommodate post-copy migration.

Would appreciate it if you could take a look and advise.

Comment 3 Francesco Romani 2016-12-05 11:33:16 UTC
(In reply to Ala Hino from comment #2)
> Francesco,
[...]ore:113)
> 
> Dan mentioned that recently, Mor Kalfon reported something similar when he
> tested OVN migration. Dan suggested consulting you since there were changes
> made in
> the virt code to accommodate post-copy migration.

True, but those changes were made in the master branch, with no intention of backporting, while here it seems we are talking about 4.0.z, aren't we?

> Would appreciate if you can a look and advice.

On it

Comment 4 Francesco Romani 2016-12-05 11:58:27 UTC
(In reply to Francesco Romani from comment #3)
> (In reply to Ala Hino from comment #2)
> > Francesco,
> [...]ore:113)
> > 
> > Dan mentioned that recently, Mor Kalfon reported something similar when he
> > tested OVN migration. Dan suggested consulting you since there were changes
> > made in
> > the virt code to accommodate post-copy migration.
> 
> True, but those changes are been made in master branch, with no intentions
> of backporting, while here it seems we are talking about 4.0.z, aren't we?
> 
> > Would appreciate if you can a look and advice.
> 
> On it

We need more Vdsm logs to understand what happened on the Vdsm side. I don't see the beginning of the operation. I can tell that somehow Vdsm lost its state:

periodic/2::ERROR::2016-11-24 16:43:33,868::executor::232::Executor::(_execute_task) Unhandled exception in Task(callable=<BlockjobMonitor vm=20a5e29e-5576-4e2b-8901-4147bfba13ab at 0x29f74d0>, timeout=7.5)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 230, in _execute_task
    task.callable()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 302, in __call__
    self._execute()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 373, in _execute
    self._vm.updateVmJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4506, in updateVmJobs
    self._vmJobs = self.queryBlockJobs()
  File "/usr/share/vdsm/virt/vm.py", line 4529, in queryBlockJobs
    drive = self._findDriveByUUIDs(storedJob['disk'])
  File "/usr/share/vdsm/virt/vm.py", line 3248, in _findDriveByUUIDs
    raise LookupError("No such drive: '%s'" % drive)
LookupError: No such drive: '{'domainID': '38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e', 'imageID': '2df48dca-c4d1-431d-81e6-08676f7a469f', 'volumeID': '1ac5ad49-3e63-454a-954b-6f074acd4ed6', 'poolID': 'f30058ff-5799-45b2-a272-22b2d198aa16'}'

But this is not the cause; it's the periodic update that fails because Vdsm now has stale information.

Please make sure both Vdsm and Engine are updated properly and that the communication between the two works fine.
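
To illustrate the failure mode in the traceback above only (a minimal, hypothetical sketch; Drive and find_drive_by_uuids are stand-ins, not the actual vdsm code): the stored block-job metadata still identifies a drive by its domain/image/volume UUIDs, but the VM's in-memory drive list no longer has a matching entry, so the lookup raises the LookupError seen in the periodic task. The UUIDs below are truncated and purely illustrative.

class Drive(object):
    def __init__(self, domainID, imageID, volumeID):
        self.domainID = domainID
        self.imageID = imageID
        self.volumeID = volumeID


def find_drive_by_uuids(drives, disk):
    # Hypothetical stand-in for the lookup in vm.py: return the drive whose
    # identifiers match the stored job, or fail if the state is stale.
    for drive in drives:
        if (drive.domainID == disk['domainID'] and
                drive.imageID == disk['imageID'] and
                drive.volumeID == disk['volumeID']):
            return drive
    raise LookupError("No such drive: '%s'" % disk)


# The in-memory drive list and the stored block job disagree on the volume,
# so the periodic BlockjobMonitor update fails with LookupError.
drives = [Drive('38b9ece5-...', '2df48dca-...', 'a0a14c70-...')]
stored_job_disk = {'domainID': '38b9ece5-...',
                   'imageID': '2df48dca-...',
                   'volumeID': '1ac5ad49-...'}
try:
    find_drive_by_uuids(drives, stored_job_disk)
except LookupError as exc:
    print(exc)  # symptom of the stale state, not the root cause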

Comment 5 Kevin Alon Goldblatt 2016-12-06 18:26:46 UTC
Added a new set of logs:
NB! To help locate the scenario in the engine log, I marked it with:
echo "testone_start" >> engine.log
echo "testone_end" >> engine.log

NB! To help locate the scenario in the SPM and HSM vdsm logs, I marked them with:
echo "testone_start" >> vdsm.log
echo "testone_end" >> vdsm.log




Verified again with the following versions:
---------------------------------------------------
vdsm-4.18.999-759.git435a852.el7.centos.x86_64 SPM
vdsm-4.18.17-1.el7ev.x86_64 HSM
ovirt-engine-4.0.6-0.1.el7ev.noarch


How reproducible:
Every time
Steps to Reproduce:
1. Create a VM with a disk on the SPM host and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on the HSM host
6. Create a new snapshot
7. Delete the snapshot - the snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center

Actual results:
The snapshot remains LOCKED and the snapshot link is NOT deleted in /rhev/data-center.


Expected results:
The snapshot is deleted and its link is removed from /rhev/data-center.

Additional info:


HSM
Before
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
    │   │       │   ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462



With snapshots
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
    │   │       │   ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462
    │   │       │   └── 93d39a43-7d81-40dd-9060-f33857d7bfd7 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/93d39a43-7d81-40dd-9060-f33857d7bfd7



After deleting snapshots
├── adabaeb2-09c7-407c-a93d-5144f52b2a4f
    │   │       │   ├── 7ee69716-54b1-4be1-828b-fbc6760a8462 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/7ee69716-54b1-4be1-828b-fbc6760a8462
    │   │       │   └── 93d39a43-7d81-40dd-9060-f33857d7bfd7 -> /dev/38b9ece5-6e9b-47c8-8f4c-3edc3e20d72e/93d39a43-7d81-40dd-9060-f33857d7bfd7

^^^^^^SHOULD HAVE BEEN DELETED^^^^^^^





From Engine.log
-------------------------------
ternal: true. Entities affected :  ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2016-12-06 19:34:10,181 INFO  [org.ovirt.engine.core.bll.RemoveMemoryVolumesCommand] (pool-5-thread-7) [1489e1f1] Running command: RemoveMemoryVolumesCommand internal: true.
2016-12-06 19:34:10,217 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [] Correlation ID: 5cab2dba, Job ID: 9056c6db-b868-4807-8776-ba8c48183a0b, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'sns_hms_testone' deletion for VM 'testone' was initiated by admin@internal-authz.
2016-12-06 19:34:10,254 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-7) [65a94a76] Lock Acquired to object 'EngineLock:{exclusiveLocks='[40c3733e-b0b4-4f72-aa09-a903dacba9cd=<DISK, ACTION_TYPE_FAILED_DISK_IS_BEING_REMOVED$DiskName snapshot_memory>]', sharedLocks='null'}'
2016-12-06 19:34:10,273 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (pool-5-thread-7) [65a94a76] Running command: RemoveDiskCommand internal: true. Entities affected :  ID: 40c3733e-b0b4-4f72-aa09-a903dacba9cd Type: DiskAction group DELETE_DISK with role type USER

Comment 6 Kevin Alon Goldblatt 2016-12-06 18:28:29 UTC
Created attachment 1228663 [details]
server, vdsm, engine.log

added new logs

Comment 7 Francesco Romani 2016-12-07 09:23:38 UTC
Turns out https://bugzilla.redhat.com/show_bug.cgi?id=1400137#c2 is the first blocker, and https://gerrit.ovirt.org/#/c/67902/ will fix it.

While 67902 is surely needed, I'm not sure yet it's the only fix we need for this BZ.

Comment 8 Francesco Romani 2016-12-07 09:25:08 UTC
Please note that 67902 is master only.
The patch that broke master never landed in 4.0.z (and there are no plans for it to), and this BZ refers to 4.0.5.

Comment 9 Roy Golan 2016-12-08 09:32:37 UTC
Ala, you wrote that the engine misses a task, and that could really be explained by Bug 1401585, which specifically fixes this missing-task scenario.

Comment 10 Tal Nisan 2016-12-08 15:01:39 UTC
*** Bug 1400707 has been marked as a duplicate of this bug. ***

Comment 11 Tal Nisan 2016-12-12 12:32:28 UTC
This will only be included in v4.18.19 and onwards, moving back to MODIFIED

Comment 12 Ala Hino 2016-12-13 09:05:08 UTC
(In reply to Roy Golan from comment #9)
> Ala you wrote that the engine misses a task and that really could be
> explained by Bug 1401585 which specifically fixes this missing task scenario.

Thanks to Roy for raising this, but eventually the root cause wasn't related to BZ 1401585.

Comment 13 Raz Tamir 2016-12-13 16:51:27 UTC
Created attachment 1231283 [details]
engine and hsm logs

Still not working - the snapshot remains in a locked state, and in the logs of the HSM host running the VM I see:
Thread-2783::ERROR::2016-12-13 18:03:08,878::utils::375::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 372, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5003, in run
    self.teardown_top_volume()
  File "/usr/share/vdsm/virt/vm.py", line 4989, in teardown_top_volume
    sd_manifest = sdc.sdCache.produce_manifest(self.drive.domainID)
AttributeError: StorageDomainCache instance has no attribute 'produce_manifest'


* attached

Comment 14 Raz Tamir 2016-12-13 17:05:29 UTC
Seems like this host was running the wrong vdsm version, vdsm-4.18.18-1.el7ev.x86_64.
Moving back to ON_QA.
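
As a quick diagnostic sketch only (assuming an RPM-based host such as the el7 hosts in this report; the 4.18.19 threshold comes from comment 11, which says the fix ships in v4.18.19 and onwards), checking each host's installed vdsm version would have caught the mismatch:

import subprocess

MINIMUM = (4, 18, 19)  # per comment 11, the fix is included from v4.18.19 on

def installed_vdsm_version():
    # Query the installed vdsm package version, e.g. '4.18.18' or '4.18.20'.
    out = subprocess.check_output(
        ['rpm', '-q', '--queryformat', '%{VERSION}', 'vdsm'])
    return tuple(int(part) for part in out.decode().split('.')[:3])

if __name__ == '__main__':
    version = installed_vdsm_version()
    label = '.'.join(str(part) for part in version)
    if version < MINIMUM:
        print('vdsm %s predates the live-merge teardown fix' % label)
    else:
        print('vdsm %s should include the fix' % label)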

Comment 15 Kevin Alon Goldblatt 2016-12-13 18:12:06 UTC
Verified with the following version:
----------------------------------------------
vdsm-4.18.20-1.el7ev.x86_64 SPM
vdsm-4.18.20-1.el7ev.x86_64 HSM
rhevm-4.0.6.3-0.1.el7ev.noarch

How reproducible:
Every time

Verified with the following scenario:
----------------------------------------------
Steps to Reproduce:
1. Create a VM with a disk on the SPM host and run it
2. Create a snapshot
3. Delete the snapshot - the snapshot link is deleted in /rhev/data-center
4. Stop the VM
5. Start the VM on the HSM host
6. Create a new snapshot
7. Delete the snapshot - the snapshot is successfully deleted and its link is removed


Actual results:
The snapshot is deleted and its link is removed in /rhev/data-center.

