Bug 961038 - vdsm: after delete of a snapshot that was created during a failed LSM, vm fails to run with "Meta Data key not found" error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Eduardo Warszawski
QA Contact: Leonid Natapov
URL:
Whiteboard: storage
Depends On: 960952
Blocks:
 
Reported: 2013-05-08 15:14 UTC by Dafna Ron
Modified: 2016-02-10 17:21 UTC
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
After deleting a snapshot that was created during a failed live storage migration, the virtual machine failed to run with a "Meta Data key not found" error. This was caused by merging snapshots on the SPM and then trying to run the virtual machine on a host without refreshing its knowledge of the changes, rendering the LVM cache out of sync. This has now been fixed, so virtual machines can run after snapshot deletion.
Clone Of:
Environment:
Last Closed: 2014-01-21 16:07:07 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
abaron: Triaged+


Attachments
logs (3.72 MB, application/x-gzip)
2013-05-08 15:14 UTC, Dafna Ron


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:0040 0 normal SHIPPED_LIVE vdsm bug fix and enhancement update 2014-01-21 20:26:21 UTC

Description Dafna Ron 2013-05-08 15:14:41 UTC
Created attachment 745284 [details]
logs

Description of problem:

I failed a live storage migration by stopping vdsm on the hsm host (where the vm was running).
The LSM live snapshot had already been created, so I shut down the vm.
After I deleted the LSM snapshots I started the vm again.
The vm tried to start on the hsm host and failed with a Meta Data key not found error in prepareVolume.

Version-Release number of selected component (if applicable):

sf15
vdsm-4.10.2-17.0.el6ev.x86_64

How reproducible:

100%

Steps to Reproduce:
1. create and run a Desktop-type vm from a template (i.e. a thin-provision copy) on iscsi storage (the template must be copied to all domains) - make sure the vm runs on the hsm
2. live migrate the vm -> stop vdsm on the hsm
3. start vdsm after the migration fails
4. shut down the vm and delete the snapshot
5. run the vm on the hsm
  
Actual results:

The vm fails to run with a Meta Data key not found error in prepareVolume.

Expected results:

vm should run. 

Additional info: logs

Thread-103::DEBUG::2013-05-08 18:10:39,811::misc::344::Storage.Misc::(validateDDBytes) err: ['1+0 records in', '1+0 records out', '512 bytes (512 B) copied, 0.000438394 s, 1.2 MB/s'], size: 512
Thread-103::ERROR::2013-05-08 18:10:39,811::task::850::TaskManager.Task::(_setError) Task=`18266d6b-4843-464a-a6db-e03bc4991c44`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3135, in prepareImage
    imgVolumes = img.prepare(sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/image.py", line 364, in prepare
    chain = self.getChain(sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/image.py", line 336, in getChain
    while not srcVol.isShared():
  File "/usr/share/vdsm/storage/volume.py", line 662, in isShared
    return self.getVolType() == type2name(SHARED_VOL)
  File "/usr/share/vdsm/storage/volume.py", line 623, in getVolType
    self.voltype = self.getMetaParam(VOLTYPE)
  File "/usr/share/vdsm/storage/volume.py", line 867, in getMetaParam
    raise se.MetaDataKeyNotFoundError(str(meta) + ":" + str(key))
MetaDataKeyNotFoundError: Meta Data key not found error: ("{'NONE': '#######################################################################################################################################################################
############################################################################################################################################################################################################################################
###################################################################################################'}:VOLTYPE",)


Thread-103::ERROR::2013-05-08 18:10:39,816::dispatcher::66::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Meta Data key not found error: ("{\'NONE\': \'########################################################################
############################################################################################################################################################################################################################################
##################################################################################################################################################################################################\'}:VOLTYPE",)', 'code': 751}}
Thread-103::DEBUG::2013-05-08 18:10:39,816::vm::676::vm.Vm::(_startUnderlyingVm) vmId=`af5bcf57-751c-4d7f-9ded-4ebac4d48be4`::_ongoingCreations released
Thread-103::ERROR::2013-05-08 18:10:39,817::vm::700::vm.Vm::(_startUnderlyingVm) vmId=`af5bcf57-751c-4d7f-9ded-4ebac4d48be4`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 662, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1447, in _run
    self.preparePaths(devices[vm.DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 721, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 275, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'address': {'bus': '0', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'index': 0, 'iface': 'ide', 'apparentsize': '1073741824', 'specParams': {}, 'imageID': '350b1e5a-4a61-44b1-9ad0-859d0e0dbf4a', 'readonly': 'False', 'shared': 'false', 'truesize': '1073741824', 'type': 'disk', 'domainID': 'd326916a-89b6-41c8-9c14-4b9cf8a1c979', 'reqsize': '0', 'format': 'cow', 'deviceId': '350b1e5a-4a61-44b1-9ad0-859d0e0dbf4a', 'poolID': '7fd33b43-a9f4-4eb7-a885-e9583a929ceb', 'device': 'disk', 'path': '/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/d326916a-89b6-41c8-9c14-4b9cf8a1c979/images/350b1e5a-4a61-44b1-9ad0-859d0e0dbf4a/e0f5ad87-638d-4038-865f-170a0f4c1392', 'propagateErrors': 'off', 'optional': 'false', 'name': 'hda', 'bootOrder': '1', 'volumeID': 'e0f5ad87-638d-4038-865f-170a0f4c1392', 'alias': 'ide0-0-0', 'volumeChain': [{'path': '/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/d326916a-89b6-41c8-9c14-4b9cf8a1c979/images/350b1e5a-4a61-44b1-9ad0-859d0e0dbf4a/e886fb6e-f91f-46ae-a488-0f24e47f5731', 'domainID': 'd326916a-89b6-41c8-9c14-4b9cf8a1c979', 'volumeID': 'e886fb6e-f91f-46ae-a488-0f24e47f5731', 'imageID': '350b1e5a-4a61-44b1-9ad0-859d0e0dbf4a'}, {'path': '/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/d326916a-89b6-41c8-9c14-4b9cf8a1c979/images/350b1e5a-4a61-44b1-9ad0-859d0e0dbf4a/e0f5ad87-638d-4038-865f-170a0f4c1392', 'domainID': 'd326916a-89b6-41c8-9c14-4b9cf8a1c979', 'volumeID': 'e0f5ad87-638d-4038-865f-170a0f4c1392', 'imageID': '350b1e5a-4a61-44b1-9ad0-859d0e0dbf4a'}]}
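
(For reference, a minimal sketch of why this error is raised; the class and function names below are illustrative, not vdsm's actual code. Each block volume owns a 512-byte slot in the domain's metadata LV holding KEY=VALUE lines; a slot that has been wiped contains only '#' padding under a NONE key, so looking up a real key such as VOLTYPE fails.)

class MetaDataKeyNotFoundError(Exception):
    pass

def parse_slot(raw):
    # Parse one 512-byte metadata slot into a dict of KEY=VALUE pairs.
    meta = {}
    for line in raw.decode("ascii", errors="replace").splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        meta[key.strip()] = value.strip()
    return meta

def get_meta_param(raw, key):
    meta = parse_slot(raw)
    if key not in meta:
        # Mirrors the traceback above: str(meta) + ":" + str(key)
        raise MetaDataKeyNotFoundError(str(meta) + ":" + str(key))
    return meta[key]

# A wiped slot looks like the one in the log: only NONE=### padding.
wiped_slot = b"NONE=" + b"#" * 500
get_meta_param(wiped_slot, "VOLTYPE")   # raises MetaDataKeyNotFoundError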

Comment 1 Ayal Baron 2013-05-08 17:48:14 UTC
> Steps to Reproduce:
> 1. create and run a Desktop type vm from template (i.e thin provision copy),
> on iscsi storage (template must be copied to all domains) - make sure that
> the vm is run on hsm
2. live migrate the disk
3. stop vdsm on the host running the VM during the migration
4. start vdsm after failure of migration 
5. shut down the vm
6. delete the autogenerated snapshot
7. run the vm on the hsm

Comment 2 Federico Simoncelli 2013-05-09 11:07:04 UTC
I have the feeling that this might be a problem in the lvm cache.
I saw the same error reported at 16:40 too:

Thread-148::DEBUG::2013-05-08 16:40:02,623::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct skip=9 bs=512 if=/dev/d326916a-89b6-41c8-9c14-4b9cf8a1c979/metadata count=1' (cwd None)
Thread-148::DEBUG::2013-05-08 16:40:02,635::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '1+0 records in\n1+0 records out\n512 bytes (512 B) copied, 0.000525346 s, 975 kB/s\n'; <rc> = 0
Thread-148::DEBUG::2013-05-08 16:40:02,635::misc::344::Storage.Misc::(validateDDBytes) err: ['1+0 records in', '1+0 records out', '512 bytes (512 B) copied, 0.000525346 s, 975 kB/s'], size: 512
Thread-148::DEBUG::2013-05-08 16:40:02,636::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct skip=8 bs=512 if=/dev/d326916a-89b6-41c8-9c14-4b9cf8a1c979/metadata count=1' (cwd None)
Thread-148::DEBUG::2013-05-08 16:40:02,645::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '1+0 records in\n1+0 records out\n512 bytes (512 B) copied, 0.000524717 s, 976 kB/s\n'; <rc> = 0
Thread-148::DEBUG::2013-05-08 16:40:02,645::misc::344::Storage.Misc::(validateDDBytes) err: ['1+0 records in', '1+0 records out', '512 bytes (512 B) copied, 0.000524717 s, 976 kB/s'], size: 512
Thread-148::ERROR::2013-05-08 16:40:02,645::task::850::TaskManager.Task::(_setError) Task=`ef0f027c-2602-4aac-ae67-8864b048a96e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3135, in prepareImage
    imgVolumes = img.prepare(sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/image.py", line 364, in prepare
    chain = self.getChain(sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/image.py", line 336, in getChain
    while not srcVol.isShared():
  File "/usr/share/vdsm/storage/volume.py", line 662, in isShared
    return self.getVolType() == type2name(SHARED_VOL)
  File "/usr/share/vdsm/storage/volume.py", line 623, in getVolType
    self.voltype = self.getMetaParam(VOLTYPE)
  File "/usr/share/vdsm/storage/volume.py", line 867, in getMetaParam
    raise se.MetaDataKeyNotFoundError(str(meta) + ":" + str(key))
MetaDataKeyNotFoundError: Meta Data key not found error: ("{'NONE': '######################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################'}:VOLTYPE",)

Analyzing it I discovered that the history of the image is:

e886fb6e-f91f-46ae-a488-0f24e47f5731(MD:4?) <- 2d8d1def-27f7-43a9-9850-68fc7bb3f9f9(MD:7?)
e886fb6e-f91f-46ae-a488-0f24e47f5731(MD:4?) <- 2d8d1def-27f7-43a9-9850-68fc7bb3f9f9(MD:7?) <- 854ecd26-1fc8-4901-9313-fcb65897c5c6(MD:8)
e886fb6e-f91f-46ae-a488-0f24e47f5731(MD:4?) <- 2d8d1def-27f7-43a9-9850-68fc7bb3f9f9(MD:7?) <- 854ecd26-1fc8-4901-9313-fcb65897c5c6(MD:8) <- 72c6c45b-92ab-4cf0-b92a-93ac33c6717c(MD:9)

Then 854ecd26-1fc8-4901-9313-fcb65897c5c6 and 72c6c45b-92ab-4cf0-b92a-93ac33c6717c got merged:

Thread-15955::INFO::2013-05-08 16:36:11,991::logUtils::40::dispatcher::(wrapper) Run and protect: mergeSnapshots(sdUUID='d326916a-89b6-41c8-9c14-4b9cf8a1c979', spUUID='7fd33b43-a9f4-4eb7-a885-e9583a929ceb', vmUUID='', imgUUID='ba4c27ad-f297-4202-b28c-9b5b2eb038ca', ancestor='854ecd26-1fc8-4901-9313-fcb65897c5c6', successor='72c6c45b-92ab-4cf0-b92a-93ac33c6717c', postZero='false')

e886fb6e-f91f-46ae-a488-0f24e47f5731(MD:4?) <- 2d8d1def-27f7-43a9-9850-68fc7bb3f9f9(MD:7?) <- 72c6c45b-92ab-4cf0-b92a-93ac33c6717c(MD:9)

However, when prepareImage is called on the HSM we get an error trying to read metadata slot 8 (the old 854ecd26-1fc8-4901-9313-fcb65897c5c6 volume that got merged):

Thread-148::DEBUG::2013-05-08 16:40:02,636::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct skip=8 bs=512 if=/dev/d326916a-89b6-41c8-9c14-4b9cf8a1c979/metadata count=1' (cwd None)
...
Thread-148::ERROR::2013-05-08 16:40:02,645::task::850::TaskManager.Task::(_setError) Task=`ef0f027c-2602-4aac-ae67-8864b048a96e`::Unexpected error
...

This doesn't seem related to live storage migration.
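
(A small helper to inspect the stale slot by hand; sketch only. The device path and slot numbers are taken from the log above, and the dd invocation matches the one vdsm logs.)

import subprocess

METADATA_DEV = "/dev/d326916a-89b6-41c8-9c14-4b9cf8a1c979/metadata"  # from the log

def read_slot(slot, dev=METADATA_DEV):
    # Read one 512-byte metadata slot with direct I/O, exactly as the
    # dd commands in the log do.
    return subprocess.check_output([
        "/bin/dd", "iflag=direct", "skip=%d" % slot,
        "bs=512", "if=%s" % dev, "count=1",
    ])

# After the merge, slot 9 should describe 72c6c45b-...; slot 8 (the merged
# 854ecd26-... volume) is expected to contain only the NONE/'#' padding.
print(read_slot(8)[:64])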

Comment 3 Dafna Ron 2013-05-09 11:48:16 UTC
The error appears in the log more than once because after I encountered it I debugged and reproduced it twice.
The reproduction is part of the LSM flow (failure + merge), but I agree with Fede; it was the exact thought I had when I opened the bug, since the vm re-ran fine on the spm but failed to run on the hsm, which suggests a cache issue.

Comment 5 Ayal Baron 2013-07-08 21:19:59 UTC
The issue is that we merge on the SPM and then try to run the VM on the hsm without refreshing its knowledge of the changes (so indeed the lvm cache is out of sync).
Anyway, this is being taken care of as part of bug 960952.
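
(Sketch of the general idea behind the fix tracked in bug 960952, not the actual vdsm patch: before building the volume chain on the host that will run the VM, drop any cached view of the domain's LVs and re-read them from the shared VG, so a merge done on the SPM becomes visible. The helper below only uses plain lvs and is illustrative.)

import subprocess

def reread_lvs(vg_name):
    # Re-read LV names and tags straight from the VG metadata on disk.
    out = subprocess.check_output([
        "lvs", "--noheadings", "--separator", "|",
        "-o", "lv_name,lv_tags", vg_name,
    ])
    lvs = {}
    for line in out.decode().splitlines():
        name, _, tags = line.strip().partition("|")
        lvs[name] = tags.split(",") if tags else []
    return lvs

class HostStorageCache(object):
    # Toy stand-in for the per-host cache that went stale in this bug.
    def __init__(self, vg_name):
        self.vg_name = vg_name
        self._lvs = None

    def invalidate(self):
        self._lvs = None

    def lvs(self):
        if self._lvs is None:        # re-read only after invalidation
            self._lvs = reread_lvs(self.vg_name)
        return self._lvs

# After the SPM merges snapshots, the HSM must invalidate() this cache before
# building the chain for prepareImage; otherwise it still sees the pre-merge
# chain and hits MetaDataKeyNotFoundError.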

Comment 6 Leonid Natapov 2013-11-19 09:56:28 UTC
Fixed in vdsm-4.13.0-0.8.beta1.el6ev.x86_64.

Comment 7 Charlie 2013-11-28 00:34:19 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 8 errata-xmlrpc 2014-01-21 16:07:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

