Description of problem: VM doesn't start after storage migration

Version-Release number of selected component (if applicable): 3.5.1

Hello,

I performed a storage migration of a CentOS 6 VM. While the migration was successful (based on the output in the Tasks tab at the bottom of the webadmin portal), the disks in the VM's Disks tab weren't refreshed with the new storage domain information and remained locked. I removed the lock by running "update images set imagestatus = 1 where imagestatus = 2;" on the database, but the storage domain wasn't updated to the new one and kept the original location.

Now I'm unable to start the VM, probably because the disks have been migrated but the information in the database still points to the old storage domain.

The vdsm log throws the following errors at VM startup:

Thread-887::ERROR::2015-04-17 10:08:11,831::task::866::Storage.TaskManager.Task::(_setError) Task=`7b51e15b-b04f-4f17-8096-882beffc1a6f`::Unexpected error
Thread-887::ERROR::2015-04-17 10:08:11,953::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Logical volume does not exist: ('d3ec9e08-1e83-449e-b09d-dd50a4f7102f/d0870749-36b0-4ab2-91a5-4743c690551a',)", 'code': 610}}
Thread-887::ERROR::2015-04-17 10:08:11,975::vm::2331::vm.Vm::(_startUnderlyingVm) vmId=`1282c0ed-f753-4484-bb4e-82d41febbba4`::The vm start process failed

Similarly, the engine log throws the following error:

2015-04-17 09:39:42,174 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-94) [4c6a284] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM testvm is down with error. Exit message: ('Failed to get size for volume %s', 'd0870749-36b0-4ab2-91a5-4743c690551a').
2015-04-17 09:39:42,175 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-94) [4c6a284] Running on vds during rerun failed vm: null
2015-04-17 09:39:42,177 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-94) [4c6a284] VM testvm (1282c0ed-f753-4484-bb4e-82d41febbba4) is running in db and not running in VDS node01
2015-04-17 09:39:42,178 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-94) [4c6a284] add VM testvm to HA rerun treatment
2015-04-17 09:39:42,199 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-94) [4c6a284] Rerun vm 1282c0ed-f753-4484-bb4e-82d41febbba4. Called from vds node01

How can I fix this? Do I have to edit the database? How can I find the correct logical volume associated with the VM?

Thank you,
Sokratis
Hi Sokratis, can you please attach the engine logs and the logs from the SPM host covering the operation, starting from before the storage migration?
Created attachment 1016721 [details] engine log during storage migration
I'm afraid I can only upload the engine log, as I switched SPM hosts a few times to check whether that would fix the issue.
Hi Sokratis,

I couldn't find any errors in the engine logs, but it seems that the operation hadn't fully completed, so indeed the storage domain probably wasn't updated accordingly yet. It could have been just a refresh issue; it's hard to tell without the full logs. In any case, in such scenarios you shouldn't update the DB manually; instead, wait for the disk to be unlocked automatically upon operation completion. To fix the current situation, you can try editing the 'image_storage_domain_map' table in the DB, i.e. update the value of the 'storage_domain_id' column on the relevant 'image_id' record.
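A sketch of the suggested fix, run against the engine database (take a DB backup first). The UUIDs below are the destination domain and volume ids discussed later in this thread, shown purely for illustration; substitute your own:

```sql
-- Illustration only: point the volume's record at the new storage domain.
-- storage_domain_id = destination domain, image_id = the volume (image_guid).
UPDATE image_storage_domain_map
   SET storage_domain_id = '8166e147-3d08-41d5-984a-5fd6ff7b65e9'
 WHERE image_id = 'd0870749-36b0-4ab2-91a5-4743c690551a';
```

Only do this if the volume really does reside on the destination domain; otherwise the engine will point at a non-existent LV.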
Created attachment 1016851 [details] engine log after updating image_storage_map table on DB
I updated image_storage_domain_map using the value of the image_guid column from the images_storage_domain_view table. After doing that for all three disks, the GUI was refreshed with the correct data domain information. However, I'm still unable to power on the VM; I've attached the engine log from when I tried to power it on.
Do I need to update the disk_profile column on image_storage_domain_map as well? Or maybe the value of the storage_path column on images_storage_domain_view?
As far as I can see, storage_path was updated automatically.
Hi Sokratis,

* Can you please attach the vdsm logs as well?
* According to the log [1], it seems that volume 'd0870749-36b0-4ab2-91a5-4743c690551a' wasn't found. Can you please check whether it exists on the relevant storage domain? (Using 'vdsClient -s 0 getVolumesList'.)

[1] 2015-04-21 16:02:28,104 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-29) [52e720a7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM it-unifidev-01 is down with error. Exit message: ('Failed to get size for volume %s', 'd0870749-36b0-4ab2-91a5-4743c690551a').
[root@ovirt-node-01 ~]# vdsClient -s 0 getVolumesList
Error using command: list index out of range

getVolumesList <sdUUID> <spUUID> [imgUUID]
Returns list of volumes of imgUUID or sdUUID if imgUUID absent
Created attachment 1017278 [details] vdsm log during vm power on
I can see the following error in the vdsm log I uploaded: LogicalVolumeDoesNotExistError: Logical volume does not exist: ('8166e147-3d08-41d5-984a-5fd6ff7b65e9/d0870749-36b0-4ab2-91a5-4743c690551a',) The id "8166e147-3d08-41d5-984a-5fd6ff7b65e9" is the one I used to update the storage_domain_id column in the image_storage_domain_map table.
Hi Sokratis,

(In reply to Sokratis from comment #12)
> I can see the following error in the vdsm log I uploaded:
>
> LogicalVolumeDoesNotExistError: Logical volume does not exist:
> ('8166e147-3d08-41d5-984a-5fd6ff7b65e9/d0870749-36b0-4ab2-91a5-4743c690551a',)
>
> The id "8166e147-3d08-41d5-984a-5fd6ff7b65e9" is the one I used to update
> the storage_domain_id column in the image_storage_domain_map table.

Does the volume exist in the output of 'vdsClient -s 0 getVolumesList 8166e147-3d08-41d5-984a-5fd6ff7b65e9 <spUUID>'? (Replace <spUUID> with the relevant storage pool id.)
On which host should I run the command? Does it have to be the SPM host or any ovirt node? What is the value of spUUID? Is it the spm_vds_id column on storage_pool table?
(In reply to Sokratis from comment #14)
> On which host should I run the command? Does it have to be the SPM host or
> any ovirt node?

It shouldn't matter; any host that can see the storage is fine.

> What is the value of spUUID? Is it the spm_vds_id column on the storage_pool
> table?

No, it's the id of the storage pool. You can find it in the storage_pool table -> 'id' column.
(In reply to Daniel Erez from comment #13)
> Does the volume exist in the output of 'vdsClient -s 0 getVolumesList
> 8166e147-3d08-41d5-984a-5fd6ff7b65e9 <spUUID>'?

I ran the command and the output is shown below:

[root@node-01 ~]# vdsClient -s 0 getVolumesList 8166e147-3d08-41d5-984a-5fd6ff7b65e9 00000002-0002-0002-0002-0000000003d7 | grep 8166
b984714e-8a3a-43f0-8e17-53b1b2d2c6f6 : {"Updated":true,"Disk Description":"OVF_STORE","Storage Domains":[{"uuid":"8166e147-3d08-41d5-984a-5fd6ff7b65e9"}],"Last Updated":"Thu Apr 23 13:15:29 EEST 2015","Size":419840}
d849f2cc-b189-43d6-afa2-a34e1f490b53 : {"Updated":true,"Disk Description":"OVF_STORE","Storage Domains":[{"uuid":"8166e147-3d08-41d5-984a-5fd6ff7b65e9"}],"Last Updated":"Thu Apr 23 13:15:29 EEST 2015","Size":419840}
(In reply to Sokratis from comment #16)
> I ran the command and the output is shown below:
> [...]

OK, so it seems that the volume isn't on that domain. Can you try the same thing with the other domain?
The id column on storage_pool table is Data Center wide. I guess I need to use a value from another table? Maybe storage_id column from images_storage_domain_view?
(In reply to Sokratis from comment #18)
> The id column on storage_pool table is Data Center wide.
>
> I guess I need to use a value from another table? Maybe storage_id column
> from images_storage_domain_view?

You can find it in the 'storage_pool_iso_map' table, which contains the mapping 'storage_id' -> 'storage_pool_id'.
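For example, the pool id for a given domain could be looked up like this (a sketch against the engine DB; the domain UUID is the one from this thread, for illustration):

```sql
-- Map a storage domain id to the storage pool (data center) it belongs to.
SELECT storage_pool_id
  FROM storage_pool_iso_map
 WHERE storage_id = '8166e147-3d08-41d5-984a-5fd6ff7b65e9';
```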
I ran the command for all the domains and I got the same output (2 lines regarding OVF_STORE).
(In reply to Sokratis from comment #20)
> I ran the command for all the domains and I got the same output (2 lines
> regarding OVF_STORE).

Are all the storage domains indeed empty? Can you check in the underlying storage whether there are any volumes?
The storage domains are not empty; I ran the command with '| grep 8166' at the end to filter only the related volume. As shown in my first post, the following error was thrown:

Thread-887::ERROR::2015-04-17 10:08:11,953::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Logical volume does not exist: ('d3ec9e08-1e83-449e-b09d-dd50a4f7102f/d0870749-36b0-4ab2-91a5-4743c690551a',)", 'code': 610}}

The "d3ec9e08-1e83-449e-b09d-dd50a4f7102f" data domain was the original domain before the storage migration (it currently has 1 VM running). The new data domain is "8166e147-3d08-41d5-984a-5fd6ff7b65e9", which currently has 25 VMs running.

Comment 12 shows the error after I manually updated the database to reflect the new data domain (8166e147-3d08-41d5-984a-5fd6ff7b65e9):

(In reply to Sokratis from comment #12)
> LogicalVolumeDoesNotExistError: Logical volume does not exist:
> ('8166e147-3d08-41d5-984a-5fd6ff7b65e9/d0870749-36b0-4ab2-91a5-4743c690551a',)

However, the vdsClient commands were run for the data domain and not the volume. Would it be useful if I ran the command with d0870749-36b0-4ab2-91a5-4743c690551a?
(In reply to Sokratis from comment #22)
> However the vdsClient commands were run for the Data Domain and not the
> Volume.
>
> Would it be useful if I run the command with
> d0870749-36b0-4ab2-91a5-4743c690551a?

Yeah, IIUC, 'd0870749-36b0-4ab2-91a5-4743c690551a' is the volume you're looking for, right? So we need to find in which storage domain it currently resides to fix the DB accordingly.
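One way to search every domain at once is a small loop on a host that sees the storage. This is a sketch: it assumes the pool id from this thread, that vdsClient is available on the host, and that 'getStorageDomainsList' returns the domain UUIDs for the pool:

```shell
# Sketch: grep every storage domain in the pool for the missing volume UUID.
SPUUID=00000002-0002-0002-0002-0000000003d7   # storage pool id from this thread
VOLID=d0870749-36b0-4ab2-91a5-4743c690551a    # the missing volume
for SD in $(vdsClient -s 0 getStorageDomainsList "$SPUUID"); do
    echo "== domain $SD =="
    vdsClient -s 0 getVolumesList "$SD" "$SPUUID" | grep "$VOLID"
done
```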
I ran the vdsClient command for all data domains and found that 2 of the 3 VM disks exist on data domain 8166e147-3d08-41d5-984a-5fd6ff7b65e9. The only one that doesn't exist is d0870749-36b0-4ab2-91a5-4743c690551a, for which the error is thrown. The problem is that it doesn't exist on any of the other data domains either. However, in the web admin portal I can see all three disks on the new data domain, but the disk profile still refers to the original data domain.
I deactivated the VM disk with the problematic volume and I was able to power on the VM. However, the problematic volume contains the bootable disk, so the OS is unable to boot.
I also ran an "lvs" command on all ovirt nodes and I couldn't find the volume.
Does the volume change ID after storage migration? Is there a way to check for disks not connected to any VM?
(In reply to Sokratis from comment #27)
> Does the volume change ID after storage migration?

No, the ID should remain the same after moving a disk.

> Is there a way to check for disks not connected to any VM?

Try using 'vdsClient -s 0 getImagesList' with the relevant storage domain to retrieve the list of images. Then, check whether it contains the ID of the disk (you can get the ID from the 'Disks' -> 'General' tab).
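A sketch of that check, using the domain and the disk id that appear later in this thread, for illustration (run on any host that can see the domain):

```shell
# List image (disk group) UUIDs on the new domain and look for the disk id
# shown in the webadmin 'Disks' -> 'General' tab.
vdsClient -s 0 getImagesList 8166e147-3d08-41d5-984a-5fd6ff7b65e9 \
    | grep 3641f4d8-89c7-49b5-8cad-49230c52feb2
```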
I ran the getImagesList command for all the Data Domains and I could only find the other 2 disks (similar to the output of getVolumesList).
(In reply to Sokratis from comment #29)
> I ran the getImagesList command for all the Data Domains and I could only
> find the other 2 disks (similar to the output of getVolumesList).

What's the ID of the missing disk? (I'll try to check in the logs whether it has already been removed from the source.)
(In reply to Sokratis from comment #24) > I ran the vdsClient command for all Data Domains and I found that 2 of the 3 > VM disks exist on Data Domain 8166e147-3d08-41d5-984a-5fd6ff7b65e9. > > The only one that doesn't exist is d0870749-36b0-4ab2-91a5-4743c690551a for > which > the error is thrown. > > The problem is that it doesn't exist in any of the other Data Domains. > > However in the web admin portal I can see all three disks on the new Data > Domain but the disk profile still refers to the original Data Domain. Based on the above comment the id is d0870749-36b0-4ab2-91a5-4743c690551a (image_guid column on images_storage_domain_view). The ID shown in the gui is 3641f4d8-89c7-49b5-8cad-49230c52feb2 (disk_id column on images_storage_domain_view).
(In reply to Sokratis from comment #31)
> Based on the above comment the id is d0870749-36b0-4ab2-91a5-4743c690551a
> (image_guid column on images_storage_domain_view).
>
> The ID shown in the gui is 3641f4d8-89c7-49b5-8cad-49230c52feb2 (disk_id
> column on images_storage_domain_view).

OK, so according to the logs [1], the disk was moved from storage domain 'd3ec9e08-1e83-449e-b09d-dd50a4f7102f' to domain '8166e147-3d08-41d5-984a-5fd6ff7b65e9', and the operation then failed on the SPM ("Image does not exist in domain") [2]. I'm afraid that without the SPM logs from the move operation we can't get further details and determine whether it's an issue in the underlying storage. But if the disk doesn't exist on either domain now, it looks like there's been an issue in the storage and not in the code (according to the flow). Try checking whether the output of 'lvs' on the host contains the missing lv (d0870749-36b0-4ab2-91a5-4743c690551a).

[1] 2015-04-16 17:19:30,821 INFO [org.ovirt.engine.core.bll.MoveOrCopyDiskCommand] (org.ovirt.thread.pool-8-thread-15) [4d4fc1f6] Running command: MoveOrCopyDiskCommand internal: false.
Entities affected : ID: 3641f4d8-89c7-49b5-8cad-49230c52feb2 Type: Disk - Action group CONFIGURE_DISK_STORAGE with role type USER, ID: 8166e147-3d08-41d5-984a-5fd6ff7b65e9 Type: Storage - Action group CREATE_DISK with role type USER
2015-04-16 17:19:30,827 INFO [org.ovirt.engine.core.bll.MoveImageGroupCommand] (org.ovirt.thread.pool-8-thread-15) [4d4fc1f6] Running command: MoveImageGroupCommand internal: true. Entities affected : ID: 8166e147-3d08-41d5-984a-5fd6ff7b65e9 Type: Storage
2015-04-16 17:19:30,881 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.MoveImageGroupVDSCommand] (org.ovirt.thread.pool-8-thread-15) [4d4fc1f6] START, MoveImageGroupVDSCommand( storagePoolId = 00000002-0002-0002-0002-0000000003d7, ignoreFailoverLimit = false, storageDomainId = d3ec9e08-1e83-449e-b09d-dd50a4f7102f, imageGroupId = 3641f4d8-89c7-49b5-8cad-49230c52feb2, dstDomainId = 8166e147-3d08-41d5-984a-5fd6ff7b65e9, vmId = 00000000-0000-0000-0000-000000000000, op = Copy, postZero = false, force = false), log id: 816d2e

[2] 2015-04-16 17:19:41,676 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-15) BaseAsyncTask::logEndTaskFailure: Task 537b7d15-bae8-4e01-bef3-ef10d9804532 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Image does not exist in domain,
-- Exception: VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = Image does not exist in domain
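The 'lvs' check suggested above can be sketched as follows. On block storage each domain is an LVM volume group and each volume an LV, so the missing volume would appear as <domain-vg>/<volume-uuid> (run on any host connected to the storage):

```shell
# Scan all VGs/LVs visible to this host for the missing volume UUID.
lvs --noheadings -o vg_name,lv_name \
    | grep d0870749-36b0-4ab2-91a5-4743c690551a
```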
I ran lvs on all oVirt nodes and I couldn't find the volume. I've decided to restore the VM from a backup, so we don't need to debug this any further. You can close the bug. Thank you very much for your help.