Created attachment 1553646 [details]
engine, vdsm, supervdsm logs
Description of problem:
Migration of a VM with a Managed Block Storage disk (Ceph RBD, non-OS disk) fails with "EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)" in engine.log.
The reason seems to be "error=Managed Volume is already attached." in vdsm.log on the receiving host.
Version-Release number of selected component (if applicable):
Start migration of VM with Managed Block Storage disk in oVirt 4.3.2
Steps to Reproduce:
1. set up oVirt 4.3.2 with ManagedBlockDomainSupported=true
2. install openstack-cinder + cinderlib on engine host
3. install python2-os-brick on hypervisor hosts
4. create "Managed Block Storage" domain
5. create VM with OS disk on iSCSI storage and secondary disk on Managed Block Storage
6. start VM
7. try to migrate the VM
Thanks for the report!
We already fixed some of the error handling to provide a clearer error instead of NPEs.
If you need to work around this, you can manually detach the volume by running the following on the relevant host:
$ vdsm-client ManagedVolume detach_volume vol_id=<vol_id>
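If the workaround has to be scripted for more than one volume, a minimal Python sketch could look like the one below. It only wraps the vdsm-client call quoted above. The helper names build_detach_command and detach_volume are hypothetical, and the UUID check is an assumption based on the volume ID that appears in this report.

```python
import subprocess
import uuid


def build_detach_command(vol_id):
    """Return the argv list for detaching one managed volume."""
    # Assumption: volume IDs are UUIDs, so anything else is rejected early
    # (uuid.UUID raises ValueError on malformed input).
    canonical = str(uuid.UUID(vol_id))
    return ["vdsm-client", "ManagedVolume", "detach_volume", "vol_id=" + canonical]


def detach_volume(vol_id):
    """Run the detach command on the local host and return its stdout."""
    result = subprocess.run(
        build_detach_command(vol_id),
        check=True,  # raise CalledProcessError if vdsm-client fails
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout
```

detach_volume must run on the host that still holds the leftover attachment; build_detach_command can be used on its own to preview the command before running it.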
I couldn't find how the 2f053070-f5b7-4f04-856c-87a56d70cd75 volume was already attached to target ov-test-04-01. Was it attached previously?
Yes, the log starts when VM is already running. I didn't think of that (just used timestamp 15:** because it was handy at that time). Do you need logs for the whole process? I can only provide them tomorrow.
(In reply to matthias.leopold from comment #2)
> Yes, the log starts when VM is already running. I didn't think of that (just
> used timestamp 15:** because it was handy at that time). Do you need logs
> for the whole process? I can only provide them tomorrow.
Yes, that would be useful.
It turns out that the original migration error was:
libvirtError: Unsafe migration: Migration may lead to data corruption if disks use cache != none or cache != directsync
This stems from our use of the "viodiskcache=writeback" custom VM property. If I understand correctly, this is no longer needed with kernel rbd devices as used with cinderlib.
The error "attach_volume error=Managed Volume is already attached." is a follow-up error after the first failed migration, when a leftover rbd-mapped device remains on the target host.
The problem is resolved, this ticket can be closed.
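To spot disks that could trigger the "Unsafe migration" error before attempting a migration, a rough Python sketch like the following could be run against the domain XML. It is a simplified approximation, not libvirt's actual check: it only flags disks whose driver element sets an explicit cache mode other than "none" or "directsync", and it ignores the other conditions libvirt takes into account. The function name unsafe_cache_disks is hypothetical.

```python
import xml.etree.ElementTree as ET

# Cache modes named as safe in the libvirt error message quoted above.
SAFE_CACHE_MODES = {"none", "directsync"}


def unsafe_cache_disks(domain_xml):
    """Return (target_dev, cache_mode) for disks with a non-safe explicit cache mode."""
    root = ET.fromstring(domain_xml)
    unsafe = []
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        cache = driver.get("cache") if driver is not None else None
        # Simplification: disks without an explicit cache attribute are skipped.
        if cache is not None and cache not in SAFE_CACHE_MODES:
            target = disk.find("target")
            dev = target.get("dev") if target is not None else "?"
            unsafe.append((dev, cache))
    return unsafe
```

The XML can be obtained with "virsh dumpxml" on the source host; a VM using the viodiskcache=writeback property would be expected to show up in the returned list.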
Feel free to report any issue you encounter.
Benny, why did we have leftover managed volume after migration failure?
Smells like a bug in engine cleanup after migration.
(In reply to Nir Soffer from comment #6)
> Benny, why did we have leftover managed volume after migration failure?
> Smells like a bug in engine cleanup after migration.
We have a bug for this issue.
Reopening. I must have confused this with another bug, since I can't find it.
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
I can't verify this bug because it's impossible to start a VM with the Ceph driver. I opened a new bug for this issue:
As Freddy stated in comment #2, it's a host configuration issue.
I posted a fix for the error handling, but it should not block the verification of this bug.
Verified - Migration succeeds
The storage domain was missing the rbd_ceph_conf property. Benny fixed it in the ceph.conf file of the QE environment.
Cinderlib version: 0.9.0
This bugzilla is included in oVirt 4.3.5 release, published on July 30th 2019.
Since the problem described in this bug report should be
resolved in oVirt 4.3.5 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.