Bug 1096529 - Live disk migration fails [NEEDINFO]
Summary: Live disk migration fails
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.5.1
Assignee: Maor
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks: 1193195
Reported: 2014-05-11 18:22 UTC by Maurice James
Modified: 2016-02-10 18:39 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-09-16 07:13:12 UTC
oVirt Team: Storage
mlipchuk: needinfo? (midnightsteel)


Attachments
VDSM Log (69.03 KB, application/gzip), 2014-05-11 18:23 UTC, Maurice James
destination vdsm log (181.80 KB, application/gzip), 2014-05-11 18:24 UTC, Maurice James
EngineSPM (636.58 KB, application/gzip), 2014-05-14 23:40 UTC, Maurice James
EngineSPM (358.86 KB, application/gzip), 2014-05-17 17:56 UTC, Maurice James
Engine VDSM Log (930.63 KB, application/gzip), 2014-05-17 17:56 UTC, Maurice James
destination vdsm log (242.94 KB, application/gzip), 2014-05-17 17:57 UTC, Maurice James
SourceVDSM (409.10 KB, application/gzip), 2014-05-18 16:05 UTC, Maurice James
New logs (2.63 MB, application/octet-stream), 2014-05-28 12:35 UTC, Maurice James

Description Maurice James 2014-05-11 18:22:55 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Attempt to move the disk of a running VM to a different storage domain.

Actual results:
The automatic snapshot and the disk move both fail.

Expected results:
Disk moved

Additional info:

Comment 1 Maurice James 2014-05-11 18:23:31 UTC
Created attachment 894450 [details]
VDSM Log

Comment 2 Maurice James 2014-05-11 18:24:04 UTC
Created attachment 894451 [details]
destination vdsm log

Comment 3 Maor 2014-05-11 19:05:04 UTC
Maurice, can you please also add the relevant engine log?
Please also add the output of "ls -l" on /rhev/data-center/e0e65e47-52c8-41bd-8499-a3e025831215/21484146-1a6c-4a31-896e-da1156888dfc/images
on the SPM host,
and the output of the tree command under /rhev/data-center on the SPM host.

Comment 4 Maor 2014-05-11 19:46:06 UTC
It seems that the VM did not contain the device with 'domainID':
'e0e65e47-52c8-41bd-8499-a3e025831215', 'volumeID':
'deae7162-1eb7-423e-9115-3e7de542c89c', 'imageID':
'21484146-1a6c-4a31-896e-da1156888dfc' (at def _findDriveByUUIDs(self, drive) in vdsm/vm.py)
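
To make the failure mode concrete, here is a minimal Python sketch of a lookup keyed on the three UUIDs above. It is not the vdsm code: find_drive_by_uuids, the plain-dict drive representation, and the all-zero volume ID are assumptions made purely for illustration.

```python
# Hypothetical sketch; NOT the vdsm implementation of _findDriveByUUIDs.
UUID_KEYS = ("domainID", "imageID", "volumeID")


def find_drive_by_uuids(drives, wanted):
    """Return the first drive whose three UUIDs all match `wanted`."""
    for drive in drives:
        if all(drive.get(key) == wanted[key] for key in UUID_KEYS):
            return drive
    raise LookupError("No such drive: %r" % (wanted,))


# The UUIDs reported in this comment.
wanted = {
    "domainID": "e0e65e47-52c8-41bd-8499-a3e025831215",
    "imageID": "21484146-1a6c-4a31-896e-da1156888dfc",
    "volumeID": "deae7162-1eb7-423e-9115-3e7de542c89c",
}

# A VM whose only disk sits on a different (made-up) volume, so the
# lookup cannot find the requested device.
vm_drives = [dict(wanted, volumeID="00000000-0000-0000-0000-000000000000")]

try:
    find_drive_by_uuids(vm_drives, wanted)
    found = True
except LookupError:
    found = False
```

If a single one of the three IDs differs from what the VM actually has attached, a lookup of this shape fails, which would be consistent with the snapshot request naming a volume the running VM does not know about.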

Comment 5 Allon Mureinik 2014-05-14 13:48:18 UTC
(In reply to Maor from comment #4)
> It seems that the VM did not contain the device with 'domainID':
> 'e0e65e47-52c8-41bd-8499-a3e025831215', 'volumeID':
> 'deae7162-1eb7-423e-9115-3e7de542c89c', 'imageID':
> '21484146-1a6c-4a31-896e-da1156888dfc' (at def _findDriveByUUIDs(self,
> drive) in vdsm/vm.py)

So where did this operation come from?

Comment 6 Maurice James 2014-05-14 23:40:39 UTC
Created attachment 895663 [details]
EngineSPM

Comment 7 Maurice James 2014-05-15 00:39:43 UTC
This looks related https://bugzilla.redhat.com/show_bug.cgi?id=1009100

Comment 8 Maor 2014-05-15 08:20:39 UTC
(In reply to Allon Mureinik from comment #5)
> (In reply to Maor from comment #4)
> > It seems that the VM did not contain the device with 'domainID':
> > 'e0e65e47-52c8-41bd-8499-a3e025831215', 'volumeID':
> > 'deae7162-1eb7-423e-9115-3e7de542c89c', 'imageID':
> > '21484146-1a6c-4a31-896e-da1156888dfc' (at def _findDriveByUUIDs(self,
> > drive) in vdsm/vm.py)
> 
> So where did this operation come from?
The operation came from the live snapshot.

Comment 9 Maor 2014-05-15 08:40:38 UTC
Maurice, you are right, this does look very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1009100

Just to be sure, can you attach older VDSM logs?
The attached ones seem to start from 2014-05-10, but the engine error is from 2014-05-09.

Comment 10 Maurice James 2014-05-17 17:56:15 UTC
Created attachment 896629 [details]
EngineSPM

Comment 11 Maurice James 2014-05-17 17:56:51 UTC
Created attachment 896630 [details]
Engine VDSM Log

Comment 12 Maurice James 2014-05-17 17:57:16 UTC
Created attachment 896631 [details]
destination vdsm log

Comment 13 Maurice James 2014-05-17 17:57:49 UTC
I attached a fresh set of logs from the source and destination

Comment 14 Maor 2014-05-18 09:59:55 UTC
(In reply to Maurice James from comment #13)
> I attached a fresh set of logs from the source and destination

Maurice, from the engine logs it looks like the VM was running on host Titan:
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=48, mMessage=Snapshot failed]]
2014-05-17 13:50:52,700 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (org.ovirt.thread.pool-6-thread-45) HostName = Titan
2014-05-17 13:50:52,700 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (org.ovirt.thread.pool-6-thread-45) Command SnapshotVDSCommand(HostName = Titan, HostId = 5869805e-5b95-485a-bd8a-07b472d3fcaf, vmId=7f341f92-134a-47e7-b7ed-e7df772806f3) execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = Snapshot failed, code = 48

I don't see any log from Titan in the attached files.
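
As a rough aid for working out which host's VDSM log is needed, the following Python sketch pulls the host name, VM ID, and error code out of the ERROR line above. The regular expression is inferred from that one line only, and parse_snapshot_failure is a hypothetical helper, not part of the engine or VDSM.

```python
import re

# Field layout inferred from the single ERROR line quoted in this comment;
# other engine versions may format the message differently (assumption).
SNAPSHOT_FAILURE = re.compile(
    r"Command SnapshotVDSCommand\(HostName = (?P<host>[^,]+), "
    r"HostId = (?P<host_id>[0-9a-f-]+), "
    r"vmId=(?P<vm_id>[0-9a-f-]+)\) execution failed\."
    r".*code = (?P<code>\d+)"
)


def parse_snapshot_failure(line):
    """Return a dict of the captured fields, or None if the line does not match."""
    match = SNAPSHOT_FAILURE.search(line)
    return match.groupdict() if match else None


line = (
    "2014-05-17 13:50:52,700 ERROR "
    "[org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] "
    "(org.ovirt.thread.pool-6-thread-45) Command SnapshotVDSCommand("
    "HostName = Titan, HostId = 5869805e-5b95-485a-bd8a-07b472d3fcaf, "
    "vmId=7f341f92-134a-47e7-b7ed-e7df772806f3) execution failed. "
    "Exception: VDSErrorException: VDSGenericException: VDSErrorException: "
    "Failed to SnapshotVDS, error = Snapshot failed, code = 48"
)

info = parse_snapshot_failure(line)
print(info["host"], info["vm_id"], info["code"])
```

The host field is the one that matters here: it names the host whose vdsm.log should contain the matching snapshot error.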

Comment 15 Maurice James 2014-05-18 16:05:27 UTC
Created attachment 896807 [details]
SourceVDSM

Comment 16 Maurice James 2014-05-18 16:07:00 UTC
The VM was running on Titan, but the disk is on beetlejuice. VMs move fine; it's the disk I'm having issues with.



(In reply to Maor from comment #14)
> (In reply to Maurice James from comment #13)
> > I attached a fresh set of logs from the source and destination
> 
> Maurice from the engine logs it look that the VM was running on Host Titan:
>  StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=48,
> mMessage=Snapshot failed]]
> 2014-05-17 13:50:52,700 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand]
> (org.ovirt.thread.pool-6-thread-45) HostName = Titan
> 2014-05-17 13:50:52,700 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand]
> (org.ovirt.thread.pool-6-thread-45) Command SnapshotVDSCommand(HostName =
> Titan, HostId = 5869805e-5b95-485a-bd8a-07b472d3fcaf,
> vmId=7f341f92-134a-47e7-b7ed-e7df772806f3) execution failed. Exception:
> VDSErrorException: VDSGenericException: VDSErrorException: Failed to
> SnapshotVDS, error = Snapshot failed, code = 48
> 
> I don't see any log of Titan from the attached files.

Comment 17 Maor 2014-05-19 03:16:49 UTC
As part of live storage migration of a disk, we create a live snapshot of the VM.
The live snapshot is taken on the HSM host that the VM is running on.
The error I see in the logs comes from the snapshot command executed for the VM, so we need to see the error for this operation in VDSM.

(In reply to Maurice James from comment #16)
> The vm was running on Titan but the disk is on beetlejuice. VMs move fine,
> its the disk I'm having issues with

Comment 18 Maurice James 2014-05-19 14:48:52 UTC
I uploaded the logs from all 3 servers. Do you need anything else?



(In reply to Maor from comment #17)
> As part of the disk live storage migration we create a live snapshot for the
> VM.
> The live snapshot operation is being done on the HSM that the VM is running
> on.
> The error that I see in the logs is of the snapshot command being executed
> for the VM, so we need to see the error of this operation in VDSM.

Comment 19 Maor 2014-05-27 11:57:57 UTC
Hi Maurice, I don't see the uploaded logs in the bug

(In reply to Maurice James from comment #18)
> I uploaded the logs from all 3 servers. Do you need anything else?

Comment 20 Maurice James 2014-05-27 15:10:35 UTC
(In reply to Maor from comment #19)
> Hi Maurice, I don't see the uploaded logs in the bug
They are listed as


EngineSPM (358.86 KB, application/gzip), 2014-05-17 13:56 EDT, Maurice James
Engine VDSM Log (930.63 KB, application/gzip), 2014-05-17 13:56 EDT, Maurice James
destination vdsm log (242.94 KB, application/gzip), 2014-05-17 13:57 EDT, Maurice James
SourceVDSM (409.10 KB, application/gzip), 2014-05-18 12:05 EDT, Maurice James


They were uploaded on the 17th and 18th of May.



Comment 21 Maor 2014-05-27 20:40:08 UTC
Hi Francesco, can you please take a look?
Do you think this is the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1009100?

Thanks,
Maor

Comment 22 Maurice James 2014-05-28 12:15:45 UTC
(In reply to Maor from comment #21)
> Hi Francesco, can you please take a look.
> Do you think this is the same issue as
> https://bugzilla.redhat.com/show_bug.cgi?id=1009100
> 
> Thanks,
> Maor

Does live migration work fine in your 3.4.1 test environment?

Comment 23 Maurice James 2014-05-28 12:35:11 UTC
Created attachment 899965 [details]
New logs

Here is a fresh set of logs from a different system that is having the same problem.

Comment 24 Allon Mureinik 2014-05-28 16:26:51 UTC
(In reply to Maurice James from comment #23)
> Created attachment 899965 [details]
> New logs
> 
> Here are a fresh set of logs from a different system that is having the same
> problem
Maor/Francesco, please take a look at this.

Comment 25 Francesco Romani 2014-05-29 07:17:33 UTC
I had a look at the logs provided in https://bugzilla.redhat.com/show_bug.cgi?id=1096529#c23, and this definitely doesn't seem to be the same case as bz1009100.

ashtivh04_vdsm.log:Thread-582288::ERROR::2014-05-28 08:20:29,333::vm::3915::vm.Vm::(snapshot) vmId=`508f2275-50d3-4fb2-a8e6-06e50c87d0d1`::The base volume doesn't exist: {'device': 'disk', 'domainID': 'b7663d70-e658-41fa-b9f0-8da83c9eddce', 'volumeID': '9e298151-23f7-4e46-8bec-71d644967f96', 'imageID': 'babe7494-bce9-4695-b341-fae61715f9e6'}

At a glance, this looks like a storage-related issue.
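
For anyone who wants to check other logs for the same symptom, here is a small Python sketch that extracts the VM ID and the drive description from a line like the one above. The pattern is inferred from that single log line, and parse_missing_base_volume is a hypothetical helper name, not something VDSM provides.

```python
import ast
import re

# Pattern inferred from the single vdsm.log line quoted in this comment
# (assumption: other vdsm versions may word the message differently).
BASE_VOLUME_MISSING = re.compile(
    r"::ERROR::(?P<when>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+)::.*?"
    r"vmId=`(?P<vm_id>[0-9a-f-]+)`::"
    r"The base volume doesn't exist: (?P<drive>\{.*\})"
)


def parse_missing_base_volume(line):
    """Return (timestamp, vm_id, drive_dict), or None if the line does not match."""
    match = BASE_VOLUME_MISSING.search(line)
    if match is None:
        return None
    # The drive is logged as a Python dict literal, so it can be parsed safely.
    drive = ast.literal_eval(match.group("drive"))
    return match.group("when"), match.group("vm_id"), drive


line = (
    "Thread-582288::ERROR::2014-05-28 08:20:29,333::vm::3915::vm.Vm::(snapshot) "
    "vmId=`508f2275-50d3-4fb2-a8e6-06e50c87d0d1`::The base volume doesn't exist: "
    "{'device': 'disk', 'domainID': 'b7663d70-e658-41fa-b9f0-8da83c9eddce', "
    "'volumeID': '9e298151-23f7-4e46-8bec-71d644967f96', "
    "'imageID': 'babe7494-bce9-4695-b341-fae61715f9e6'}"
)

when, vm_id, drive = parse_missing_base_volume(line)
print(when, vm_id, drive["domainID"], drive["imageID"], drive["volumeID"])
```

The three IDs in the parsed drive are the ones to compare against what actually exists on the storage domain.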

Comment 26 Francesco Romani 2014-05-29 08:39:49 UTC
forgot to clear my NEEDINFO

Comment 27 Maurice James 2014-06-02 13:09:31 UTC
I found that all of the VMs I created from the out-of-the-box "Blank" template fail live disk migration. I created a new "Default" template, created a VM based on it, and was able to live migrate its disk. I need to be able to change the template a VM is based on, because I already have close to 30 VMs created from the old "Blank" template. Is this possible?

Comment 28 Allon Mureinik 2014-06-02 15:38:29 UTC
(In reply to Maurice James from comment #27)
> I found that all of the VMs that I created that were based on the out of the
> box "Blank" template would fail live disk migration. I created a new
> "Default" template and created a VM based on it and was able to live migrate
> the disk. I need to be able to change the template its based on because I
> already have close to 30 VMs created based on the old "Blank" template. Is
> this possible?
AFAIK, this is not possible.

Daniel/Maor, let's look into why migrating a disk based on the default template fails.

Comment 29 Maurice James 2014-06-04 16:04:16 UTC
OK, after repeating the steps that I followed in my prior posts:
1. Stop the VM
2. Export it
3. Delete it (the VM not the export)
4. Re-import the VM

After following those steps, I was able to live migrate the disks without error.
I'm not sure why this fixed the problem, but what I can say is that there was a problem with the default "Blank" template. The problem started when I upgraded from 3.3.3 to 3.3.4, and it carried over to version 3.4.x.

Comment 30 Maurice James 2014-06-05 11:50:59 UTC
Perhaps there was a problem with the upgrade script from 3.3.3 -> 3.3.4?

Comment 31 Maurice James 2014-06-05 23:58:50 UTC
I have one more VM left to export. Is there anything that I should compare against the ones that I've already fixed? Maybe we can get a root cause out of it.

Comment 32 Sandro Bonazzola 2014-06-11 06:50:52 UTC
This is an automated message:
This bug has been re-targeted from 3.4.2 to 3.5.0 since neither priority nor severity were high or urgent. Please re-target to 3.4.3 if relevant.

Comment 33 Allon Mureinik 2014-08-19 14:12:56 UTC
Maor, Maurice, what's up with this BZ? Are we going anywhere with it?

Comment 34 Maor 2014-08-27 11:22:07 UTC
It could be that the problem is related to what Maurice described regarding the upgrade process which failed.

What was the origin of the problem during the upgrade from 3.3.3 to 3.3.4?

If you still have a VM whose disks you have trouble migrating, can you please try to move the disk again and also attach /var/log/messages and /var/log/libvirt/libvirt.log?

Does it also reproduce on new VMs?

Comment 35 Allon Mureinik 2014-09-11 12:12:36 UTC
This won't make oVirt 3.5.0, pushing out to 3.5.1.

Any news on the needinfo?

Comment 36 Maor 2014-09-16 07:13:12 UTC
I verified moving a disk in different scenarios.
I suspect this is an upgrade that went wrong, but for now, without the requested info, we can't do much about it.

I'm closing the bug for now; please feel free to re-open it once it reproduces and you have the right info.

