1697496 – "attach_volume error=Managed Volume is already attached." when migrating VM with Managed Block Storage (Ceph RBD)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	General
Sub Component:
Version:	4.3.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	ovirt-4.3.5
Target Release:	4.3.5.2
Assignee:	Benny Zlotnik
QA Contact:	Shir Fishbain
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-08 14:07 UTC by matthias.leopold
Modified:	2019-07-30 14:08 UTC
CC List:	4 users
Fixed In Version:	ovirt-engine-4.3.5.2
Clone Of:
Environment:
Last Closed:	2019-07-30 14:08:26 UTC
oVirt Team:	Storage
Embargoed:
Dependent Products:
Flags:	pm-rhel: ovirt-4.3+

Attachments
engine, vdsm, supervdsm logs (1.03 MB, application/x-tar) 2019-04-08 14:07 UTC, matthias.leopold	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	99436	0	'None'	MERGED	core: detach MBS disks upon live migration failure	2021-01-29 12:59:49 UTC
oVirt gerrit	100910	0	'None'	MERGED	core: detach MBS disks upon live migration failure	2021-01-29 12:59:06 UTC

Description matthias.leopold 2019-04-08 14:07:54 UTC

Created attachment 1553646
engine, vdsm, supervdsm logs

Description of problem:
Migration of VM with Managed Block Storage disk (Ceph RBD, non OS disk) fails with "EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)" in engine.log.
Reason seems to be "error=Managed Volume is already attached." in vdsm.log of receiving host.

Version-Release number of selected component (if applicable):
vdsm-4.30.11-1.el7.x86_64

How reproducible:
Start migration of VM with Managed Block Storage disk in oVirt 4.3.2

Steps to Reproduce:
1. set up oVirt 4.3.2 with ManagedBlockDomainSupported=true
2. install openstack-cinder + cinderlib on engine host
3. install python2-os-brick on hypervisor hosts
4. create "Managed Block Storage" domain
5. create VM with OS disk on iSCSI storage and secondary disk on Managed Block Storage
6. start VM
7. try to migrate the VM

Actual results:
Migration fails

Expected results:
Migration succeeds

Additional info:
CentOS 7
cinderlib (0.3.9)
openstack-cinder-13.0.4-1.el7.noarch
ceph 12.2.11

Comment 1 Benny Zlotnik 2019-04-08 14:27:40 UTC

Thanks for the report!

We already fixed some of the error handling to provide a clearer error, instead of NPEs


If you need to work around this, you can manually detach the volume by running the following on the relevant host:

$ vdsm-client ManagedVolume detach_volume vol_id=<vol_id>
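
When several volumes are left attached, the command above can be wrapped in a small script. The following is only a sketch: the helper names `build_detach_command` and `detach_volumes` are hypothetical, and the only real interface used is the `vdsm-client ManagedVolume detach_volume vol_id=<vol_id>` invocation quoted above. It assumes it is run on the host where the volume is still attached.

```python
import subprocess


def build_detach_command(vol_id):
    """Return the argv for the detach command quoted in this comment."""
    if not vol_id:
        raise ValueError("vol_id must be a non-empty string")
    return ["vdsm-client", "ManagedVolume", "detach_volume", f"vol_id={vol_id}"]


def detach_volumes(vol_ids, run=subprocess.run):
    """Run the detach command once per volume ID.

    Returns the list of volume IDs for which the command exited non-zero.
    The `run` parameter exists only so the function can be exercised
    without a real vdsm installation.
    """
    failed = []
    for vol_id in vol_ids:
        result = run(build_detach_command(vol_id), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(vol_id)
    return failed
```

For example, `detach_volumes(["2f053070-f5b7-4f04-856c-87a56d70cd75"])` would run the command for the volume mentioned in this report and return an empty list if it exited successfully.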


I couldn't find how the 2f053070-f5b7-4f04-856c-87a56d70cd75 volume was already attached to target ov-test-04-01, was it attached previously?

Comment 2 matthias.leopold 2019-04-08 14:35:37 UTC

Yes, the log starts when VM is already running. I didn't think of that (just used timestamp 15:** because it was handy at that time). Do you need logs for the whole process? I can only provide them tomorrow.

Comment 3 Benny Zlotnik 2019-04-08 14:42:07 UTC

(In reply to matthias.leopold from comment #2)
> Yes, the log starts when VM is already running. I didn't think of that (just
> used timestamp 15:** because it was handy at that time). Do you need logs
> for the whole process? I can only provide them tomorrow.

yes, it will be useful

Comment 4 matthias.leopold 2019-04-09 12:16:06 UTC

It turns out that the original migration error was

libvirtError: Unsafe migration: Migration may lead to data corruption if disks use cache != none or cache != directsync

This stems from our using "viodiskcache=writeback" custom VM property. If I understand this correctly this isn't needed anymore with kernel rbd devices as used with cinderlib.
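
To see which disks of a VM would trip this check, the domain XML (for example the output of `virsh dumpxml`) can be scanned for the `cache` attribute on each disk's `<driver>` element. The sketch below is a simplification, not libvirt's actual logic: the function name is hypothetical, disks without an explicit `cache` attribute are skipped, and libvirt applies further conditions (such as shared or read-only disks) that are ignored here.

```python
import xml.etree.ElementTree as ET

# Cache modes named as acceptable in the libvirt error message above.
SAFE_CACHE_MODES = {"none", "directsync"}


def disks_with_unsafe_cache(domain_xml):
    """Return (target_dev, cache_mode) for disks with an explicit unsafe cache mode."""
    root = ET.fromstring(domain_xml)
    unsafe = []
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        cache = driver.get("cache") if driver is not None else None
        # Assumption: disks with no explicit cache attribute are not reported.
        if cache is not None and cache not in SAFE_CACHE_MODES:
            dev = target.get("dev") if target is not None else "?"
            unsafe.append((dev, cache))
    return unsafe


SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

if __name__ == "__main__":
    print(disks_with_unsafe_cache(SAMPLE_XML))
```

In the made-up sample, only `vdb` is reported, because it is the disk configured with `cache='writeback'`.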

The error "attach_volume error=Managed Volume is already attached." is a follow-up after the first failed migration, when there is a leftover rbdmapped device.
The problem is resolved, this ticket can be closed.

thank you
Matthias
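
The sequence described in this comment can be illustrated with a toy model. This is not engine or vdsm code: `Host`, `migrate`, and the `cleanup_on_failure` flag are hypothetical names used only to show why a volume left attached on the target host makes the next migration attempt fail, and why detaching it after a failed migration avoids that.

```python
class Host:
    """Toy stand-in for a hypervisor host that tracks attached volume IDs."""

    def __init__(self, name):
        self.name = name
        self.attached = set()

    def attach_volume(self, vol_id):
        if vol_id in self.attached:
            # Mirrors the error text seen in vdsm.log on the receiving host.
            raise RuntimeError("Managed Volume is already attached.")
        self.attached.add(vol_id)

    def detach_volume(self, vol_id):
        self.attached.discard(vol_id)


def migrate(target, vol_id, migration_ok, cleanup_on_failure):
    """Attach the volume on the target, then simulate the migration outcome.

    Returns True if the simulated migration succeeded. When it fails and
    cleanup_on_failure is set, the volume is detached from the target again.
    """
    target.attach_volume(vol_id)
    if migration_ok:
        return True
    if cleanup_on_failure:
        target.detach_volume(vol_id)
    return False
```

Without cleanup, a failed `migrate(...)` leaves the volume in `target.attached`, so a second call raises the "already attached" error even if the migration itself would now succeed. With cleanup, the second attempt starts from a clean target.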

Comment 5 Benny Zlotnik 2019-04-10 08:30:51 UTC

Thanks Matthias!

Feel free to report any issue you encounter

Comment 6 Nir Soffer 2019-04-12 15:31:08 UTC

Benny, why did we have leftover managed volume after migration failure?

Smells like a bug in engine cleanup after migration.

Comment 7 Benny Zlotnik 2019-04-15 12:24:16 UTC

(In reply to Nir Soffer from comment #6)
> Benny, why did we have leftover managed volume after migration failure?
> 
> Smells like a bug in engine cleanup after migration.
we have a bug for this issue

Comment 8 Benny Zlotnik 2019-05-05 11:21:34 UTC

Reopening, I must have confused this with another bug since I can't find it

Comment 9 RHEL Program Management 2019-06-18 11:10:21 UTC

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 10 Shir Fishbain 2019-07-03 16:32:04 UTC

I can't verify this bug because it's impossible to start a VM with the Ceph driver. I opened a new bug for this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1726758

Comment 11 Benny Zlotnik 2019-07-04 08:39:21 UTC

As Freddy stated in comment #2, it's a host configuration issue
I posted a fix for the error handling, but it should not block the verification of this bug

Comment 12 Shir Fishbain 2019-07-04 12:06:34 UTC

Verified - Migration succeeds

The storage domain was missing the rbd_ceph_conf property. Benny fixed it in the ceph.conf file of QE env.

ovirt-engine-4.3.5.2-0.1.el7.noarch
vdsm-4.30.22-1.el7ev.x86_64
Cinderlib version : 0.9.0

Comment 13 Sandro Bonazzola 2019-07-30 14:08:26 UTC

This bugzilla is included in oVirt 4.3.5 release, published on July 30th 2019.

Since the problem described in this bug report should be resolved in oVirt 4.3.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
