Bug 1000816 - [vdsm] lvremove fails when vdsm tries to delete leftover volumes from dst SD after a failure in LSM during syncImageData
Summary: [vdsm] lvremove fails when vdsm tries to delete leftover volumes from dst SD after a failure in LSM during syncImageData
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Federico Simoncelli
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-08-25 15:36 UTC by Elad
Modified: 2016-02-10 20:24 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-21 16:13:44 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs (6.15 MB, application/x-gzip), 2013-08-25 15:36 UTC, Elad


Links
System: Red Hat Product Errata
ID: RHBA-2014:0040
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: vdsm bug fix and enhancement update
Last Updated: 2014-01-21 20:26:21 UTC

Description Elad 2013-08-25 15:36:35 UTC
Created attachment 790099 [details]
logs

Description of problem:
After vdsm crashes during the syncImageData phase of live storage migration, vdsm tries to remove the leftover volume from the destination SD. The removal fails because lvremove cannot remove the LV: the LV is still open.
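
For anyone triaging this, a quick way to confirm that a leftover LV is still open (and so that lvremove is expected to fail on it) is to look at its LVM attribute string: the 6th character is 'o' while the device is open. A minimal diagnostic sketch, not vdsm code, using the VG/LV UUIDs from the traceback below (run as root on the SPM host):

# Diagnostic sketch only: report whether an LV is open according to LVM.
# An open LV is exactly what makes lvremove refuse to remove it.
import subprocess

def lv_is_open(vg_name, lv_name):
    # lvs prints a 10-character attribute string; the 6th character is 'o'
    # while the device is open.
    proc = subprocess.Popen(
        ["lvs", "--noheadings", "-o", "lv_attr", "%s/%s" % (vg_name, lv_name)],
        stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    attrs = out.decode().strip()
    return len(attrs) >= 6 and attrs[5] == "o"

if __name__ == "__main__":
    vg = "4700d87c-0616-4051-880e-6f3d15357bc4"  # dst SD VG from the traceback
    lv = "114f06ed-8288-4955-955f-d4708b55c885"  # one of the leftover volumes
    print("LV is open" if lv_is_open(vg, lv) else "LV is not open")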

Version-Release number of selected component (if applicable):
vdsm-4.12.0-72.git287bb7e.el6ev.x86_64
sanlock-2.8-1.el6.x86_64
libvirt-0.10.2-18.el6_4.9.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
rhevm-3.3.0-0.16.master.el6ev.noarch

How reproducible:
unknown

Steps to Reproduce:
On a block pool with more than 1 SD:
- Create and run a VM
- Live migrate the VM's disk
- Restart vdsm during the syncImageData task on the SPM (see the sketch after this list)
- After vdsm comes up, it tries to roll back and remove the leftover volumes from the destination SD
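
A rough illustration of the restart step above, purely a sketch: it assumes the el6 service name (vdsmd), the default /var/log/vdsm/vdsm.log location, and that "syncImageData" shows up in the log while the task runs on the SPM.

# Reproduction helper sketch (assumptions: 'service vdsmd restart' on el6,
# default vdsm.log path, 'syncImageData' appearing in the log). Not vdsm code.
import subprocess
import time

LOG = "/var/log/vdsm/vdsm.log"

def wait_for_sync_image_data(log_path=LOG):
    # Tail the log from its current end and return the first matching line.
    with open(log_path) as log:
        log.seek(0, 2)
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)
                continue
            if "syncImageData" in line:
                return line

if __name__ == "__main__":
    print("Waiting for syncImageData on the SPM...")
    print(wait_for_sync_image_data())
    subprocess.check_call(["service", "vdsmd", "restart"])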

Actual results:
vdsm fails in lvremove:

Thread-138::ERROR::2013-08-21 15:30:51,464::task::850::TaskManager.Task::(_setError) Task=`40fe8a7f-0e57-4bd3-a180-3543ec5bb79f`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1546, in deleteImage
    dom.deleteImage(sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/blockSD.py", line 978, in deleteImage
    deleteVolumes(sdUUID, toDel)
  File "/usr/share/vdsm/storage/blockSD.py", line 189, in deleteVolumes
    lvm.removeLVs(sdUUID, vols)
  File "/usr/share/vdsm/storage/lvm.py", line 1073, in removeLVs
    raise se.CannotRemoveLogicalVolume(vgName, str(lvNames))
CannotRemoveLogicalVolume: Cannot remove Logical Volume: ('4700d87c-0616-4051-880e-6f3d15357bc4', "{'114f06ed-8288-4955-955f-d4708b55c885': ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'], parent='a3befb04-1a36-4f16-a9da-3a1d691b6833'), 'a3befb04-1a36-4f16-a9da-3a1d691b6833': ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'], parent='00000000-0000-0000-0000-000000000000'), '7ca6a8c5-16d1-4ea9-b184-4a0003865929': ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'], parent='114f06ed-8288-4955-955f-d4708b55c885')}")


Expected results:
vdsm should close the open LV after the failure in LSM so that the leftover volumes can be removed.
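
As a manual workaround in the meantime, deactivating the leftover LVs before retrying the removal usually releases them. A minimal sketch (plain lvchange/lvremove through subprocess rather than the vdsm lvm wrapper; the VG/LV names are the ones from the traceback above; if a process still holds the device open, the deactivation itself will fail):

# Workaround sketch, not vdsm code: deactivate the leftover LVs so that a
# following lvremove can succeed. Run as root on the SPM host.
import subprocess

VG = "4700d87c-0616-4051-880e-6f3d15357bc4"
LEFTOVER_LVS = [
    "114f06ed-8288-4955-955f-d4708b55c885",
    "a3befb04-1a36-4f16-a9da-3a1d691b6833",
    "7ca6a8c5-16d1-4ea9-b184-4a0003865929",
]

for lv in LEFTOVER_LVS:
    path = "%s/%s" % (VG, lv)
    subprocess.check_call(["lvchange", "-an", path])  # close (deactivate) the LV
    subprocess.check_call(["lvremove", "-f", path])   # removal should now succeed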


Additional info:
logs

Comment 1 Federico Simoncelli 2013-08-26 14:06:54 UTC
(In reply to Elad from comment #0)
> Actual results:
> vdsm fails in lvremove:
> 
> Thread-138::ERROR::2013-08-21
> 15:30:51,464::task::850::TaskManager.Task::(_setError)
> Task=`40fe8a7f-0e57-4bd3-a180-3543ec5bb79f`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/storage/hsm.py", line 1546, in deleteImage
>     dom.deleteImage(sdUUID, imgUUID, volsByImg)
>   File "/usr/share/vdsm/storage/blockSD.py", line 978, in deleteImage
>     deleteVolumes(sdUUID, toDel)
>   File "/usr/share/vdsm/storage/blockSD.py", line 189, in deleteVolumes
>     lvm.removeLVs(sdUUID, vols)
>   File "/usr/share/vdsm/storage/lvm.py", line 1073, in removeLVs
>     raise se.CannotRemoveLogicalVolume(vgName, str(lvNames))
> CannotRemoveLogicalVolume: Cannot remove Logical Volume:
> ('4700d87c-0616-4051-880e-6f3d15357bc4',
> "{'114f06ed-8288-4955-955f-d4708b55c885':
> ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'],
> parent='a3befb04-1a36-4f16-a9da-3a1d691b6833'),
> 'a3befb04-1a36-4f16-a9da-3a1d691b6833':
> ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'],
> parent='00000000-0000-0000-0000-000000000000'),
> '7ca6a8c5-16d1-4ea9-b184-4a0003865929':
> ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'],
> parent='114f06ed-8288-4955-955f-d4708b55c885')}")
> 
> 
> Expected results:
> vdsm should close the open LV after the failure in LSM 

deleteImage is not the rollback for syncImageData; in fact, the request comes from the engine:

Thread-138::DEBUG::2013-08-21 15:30:41,500::BindingXMLRPC::177::vds::(wrapper) client [10.35.161.52] flowID [72a210fa]
Thread-138::DEBUG::2013-08-21 15:30:41,501::task::579::TaskManager.Task::(_updateState) Task=`40fe8a7f-0e57-4bd3-a180-3543ec5bb79f`::moving from state init -> state preparing
Thread-138::INFO::2013-08-21 15:30:41,501::logUtils::44::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='4700d87c-0616-4051-880e-6f3d15357bc4', spUUID='e639b114-5960-43e2-b85e-9c1444d3af63', imgUUID='f74dd0fe-150f-4037-8992-71d7ea0ac982', postZero='false', force='false')
...
Thread-138::ERROR::2013-08-21 15:30:51,464::task::850::TaskManager.Task::(_setError) Task=`40fe8a7f-0e57-4bd3-a180-3543ec5bb79f`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1546, in deleteImage
    dom.deleteImage(sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/blockSD.py", line 978, in deleteImage
    deleteVolumes(sdUUID, toDel)
  File "/usr/share/vdsm/storage/blockSD.py", line 189, in deleteVolumes
    lvm.removeLVs(sdUUID, vols)
  File "/usr/share/vdsm/storage/lvm.py", line 1073, in removeLVs
    raise se.CannotRemoveLogicalVolume(vgName, str(lvNames))
CannotRemoveLogicalVolume: Cannot remove Logical Volume: ('4700d87c-0616-4051-880e-6f3d15357bc4', "{'114f06ed-8288-4955-955f-d4708b55c885': ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'], ...
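
For reference, the flowID in the BindingXMLRPC line above is the handle that ties this request to the engine; a small sketch to pull every line carrying that flowID out of vdsm.log (assuming the default /var/log/vdsm/vdsm.log path):

# Log-correlation sketch: print every vdsm.log line with the engine flowID,
# to confirm the deleteImage call was engine-initiated.
import sys

FLOW_ID = "72a210fa"               # flowID from the BindingXMLRPC line above
LOG = "/var/log/vdsm/vdsm.log"

def flow_lines(path, flow_id=FLOW_ID):
    with open(path) as log:
        for line in log:
            if flow_id in line:
                yield line.rstrip("\n")

if __name__ == "__main__":
    for line in flow_lines(sys.argv[1] if len(sys.argv) > 1 else LOG):
        print(line)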


I have the feeling that the engine didn't send the vmReplicateDiskFinish command to the VM before trying to remove the new image.
Daniel, can you take a look?

If the problem instead is related to a running dd process (syncData) then I think this might be dup of:

https://bugzilla.redhat.com/show_bug.cgi?id=713434

Anyway the fix would be the same:

http://gerrit.ovirt.org/#/c/18074/
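
If it does turn out to be a lingering dd (syncData) keeping the device open, something along these lines can show which process holds the LV; just a sketch around fuser, using the usual /dev/<vg>/<lv> device path and the UUIDs from the traceback (run as root):

# Sketch: list the processes that keep the leftover LV's device node open
# (e.g. a dd left behind by syncData). fuser exits non-zero if none is found.
import subprocess

VG = "4700d87c-0616-4051-880e-6f3d15357bc4"
LV = "114f06ed-8288-4955-955f-d4708b55c885"

dev = "/dev/%s/%s" % (VG, LV)
try:
    subprocess.check_call(["fuser", "-v", dev])
except subprocess.CalledProcessError:
    print("no process is holding %s open" % dev)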

Comment 2 Daniel Erez 2013-08-26 14:47:00 UTC
Looks like the engine log has been rotated out; I can't find any reference to the LSM commands.

Comment 3 Daniel Erez 2013-08-26 15:43:12 UTC
(In reply to Federico Simoncelli from comment #1)
> (In reply to Elad from comment #0)
> > Actual results:
> > vdsm fails in lvremove:
> > 
> > Thread-138::ERROR::2013-08-21
> > 15:30:51,464::task::850::TaskManager.Task::(_setError)
> > Task=`40fe8a7f-0e57-4bd3-a180-3543ec5bb79f`::Unexpected error
> > Traceback (most recent call last):
> >   File "/usr/share/vdsm/storage/task.py", line 857, in _run
> >     return fn(*args, **kargs)
> >   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
> >     res = f(*args, **kwargs)
> >   File "/usr/share/vdsm/storage/hsm.py", line 1546, in deleteImage
> >     dom.deleteImage(sdUUID, imgUUID, volsByImg)
> >   File "/usr/share/vdsm/storage/blockSD.py", line 978, in deleteImage
> >     deleteVolumes(sdUUID, toDel)
> >   File "/usr/share/vdsm/storage/blockSD.py", line 189, in deleteVolumes
> >     lvm.removeLVs(sdUUID, vols)
> >   File "/usr/share/vdsm/storage/lvm.py", line 1073, in removeLVs
> >     raise se.CannotRemoveLogicalVolume(vgName, str(lvNames))
> > CannotRemoveLogicalVolume: Cannot remove Logical Volume:
> > ('4700d87c-0616-4051-880e-6f3d15357bc4',
> > "{'114f06ed-8288-4955-955f-d4708b55c885':
> > ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'],
> > parent='a3befb04-1a36-4f16-a9da-3a1d691b6833'),
> > 'a3befb04-1a36-4f16-a9da-3a1d691b6833':
> > ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'],
> > parent='00000000-0000-0000-0000-000000000000'),
> > '7ca6a8c5-16d1-4ea9-b184-4a0003865929':
> > ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'],
> > parent='114f06ed-8288-4955-955f-d4708b55c885')}")
> > 
> > 
> > Expected results:
> > vdsm should close the open LV after the failure in LSM 
> 
> deleteImage is not the rollback for syncImageData, in fact the request comes
> from the engine:
> 
> Thread-138::DEBUG::2013-08-21
> 15:30:41,500::BindingXMLRPC::177::vds::(wrapper) client [10.35.161.52]
> flowID [72a210fa]
> Thread-138::DEBUG::2013-08-21
> 15:30:41,501::task::579::TaskManager.Task::(_updateState)
> Task=`40fe8a7f-0e57-4bd3-a180-3543ec5bb79f`::moving from state init -> state
> preparing
> Thread-138::INFO::2013-08-21
> 15:30:41,501::logUtils::44::dispatcher::(wrapper) Run and protect:
> deleteImage(sdUUID='4700d87c-0616-4051-880e-6f3d15357bc4',
> spUUID='e639b114-5960-43e2-b85e-9c1444d3af63',
> imgUUID='f74dd0fe-150f-4037-8992-71d7ea0ac982', postZero='false',
> force='false')
> ...
> Thread-138::ERROR::2013-08-21
> 15:30:51,464::task::850::TaskManager.Task::(_setError)
> Task=`40fe8a7f-0e57-4bd3-a180-3543ec5bb79f`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/storage/hsm.py", line 1546, in deleteImage
>     dom.deleteImage(sdUUID, imgUUID, volsByImg)
>   File "/usr/share/vdsm/storage/blockSD.py", line 978, in deleteImage
>     deleteVolumes(sdUUID, toDel)
>   File "/usr/share/vdsm/storage/blockSD.py", line 189, in deleteVolumes
>     lvm.removeLVs(sdUUID, vols)
>   File "/usr/share/vdsm/storage/lvm.py", line 1073, in removeLVs
>     raise se.CannotRemoveLogicalVolume(vgName, str(lvNames))
> CannotRemoveLogicalVolume: Cannot remove Logical Volume:
> ('4700d87c-0616-4051-880e-6f3d15357bc4',
> "{'114f06ed-8288-4955-955f-d4708b55c885':
> ImgsPar(imgs=['f74dd0fe-150f-4037-8992-71d7ea0ac982'], ...
> 
> 
> I have the feeling that engine didn't send the vmReplicateDiskFinish command
> to the VM before trying to remove the new image.
> Daniel can you take a look?

According to the engine log, VmReplicateDiskFinish has been sent:

2013-08-21 15:30:42,085 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-47) [72a210fa] FINISH, VmReplicateDiskFinishVDSCommand, log id: 645a2707


> 
> If the problem instead is related to a running dd process (syncData) then I
> think this might be dup of:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=713434
> 
> Anyway the fix would be the same:
> 
> http://gerrit.ovirt.org/#/c/18074/

Comment 4 Federico Simoncelli 2013-09-05 09:10:10 UTC
Please try to reproduce once http://gerrit.ovirt.org/#/c/18074/ has reached the build you're testing.

Comment 5 Federico Simoncelli 2013-09-06 07:35:12 UTC
Let's keep in mind that this could be a duplicate of bug 961400

Comment 6 Elad 2013-09-23 10:49:12 UTC
Removal of volumes from the dst domain completes successfully after a failure in live storage migration during syncImageData.

Checked on is15:
vdsm-4.12.0-138.gitab256be.el6ev.x86_64
rhevm-3.3.0-0.21.master.el6ev.noarch

Comment 7 Charlie 2013-11-28 00:32:53 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 9 errata-xmlrpc 2014-01-21 16:13:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

