Created attachment 598463 [details]
engine.log

ovirt-engine-backend: when the engine service is restarted during 'DeleteImageGroup', the VM's virtual disk stays in 'Locked' status forever (the VM cannot be removed).

Scenario:
**********
It happened to me twice.
1) Create a desktop pool of 5 VMs.
2) Attempt to remove them (all at once).
3) Restart the ovirt-engine service while the tasks are running (if it does not reproduce, set a breakpoint in the relevant code part or simply restart the engine twice - it is a matter of timing).

Results:
*********
- The VM's virtual disk remains in 'Locked' status forever.
- The VM cannot be removed while its hard disk is locked.

***************************************************
Restarting the ovirt-engine service during this part of the code results in the virtual disk remaining locked forever:
****************************************************

VDSReturnValue returnValue = runVdsCommand(VDSCommandType.DeleteImageGroup,
        new DeleteImageGroupVDSCommandParameters(
                getDiskImage().getstorage_pool_id().getValue(),
                getStorageDomainId().getValue(),
                getDiskImage().getId().getValue(),
                getDiskImage().isWipeAfterDelete(),
                getParameters().getForceDelete(),
                getStoragePool().getcompatibility_version().toString()));
return returnValue;

engine.log:
***********
2012-07-16 16:00:17,654 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-4-thread-48) [7a8f0bf2] START, DeleteImageGroupVDSCommand(storagePoolId = 86efaad2-705a-4a08-8d98-5a24f098044c, ignoreFailoverLimit = false, compatabilityVersion = 3.1, storageDomainId = fb2298e2-6be5-4571-bc33-5fb6b00e46a6, imageGroupId = 90c3fa25-9aa0-4935-9513-94469acfd6b8, postZeros = false, forceDelete = false), log id: 60f2ac65
2012-07-16 16:00:27,314 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-6) AsyncTaskManager::PollAndUpdateAsyncTasks: 2 tasks, 1 tasks to poll now
2012-07-16 16:00:40,569 WARN [org.ovirt.engine.core.dal.dbbroker.DbFacadeLocator] (MSC service thread 1-5) /etc/engine/engine.conf file is not found. Please check your engine installation. Default values will be used
2012-07-16 16:00:41,924 INFO [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) Start time: 7/16/12 4:00 PM
2012-07-16 16:00:42,199 INFO [org.ovirt.engine.core.bll.RemoveVmCommand] (MSC service thread 1-5) Command [id=673057cf-7187-45cc-b329-ca3611604f5a]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.VmDynamic; snapshot: EntityStatusSnapshot [id=a3cf66c3-ef31-4d33-97fd-50b179b338b8, status=Down].
2012-07-16 16:00:42,253 INFO [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) Running compensation on startup for Command : org.ovirt.engine.core.bll.RemoveVmCommand , Command Id : 673057cf-7187-45cc-b329-ca3611604f5a
2012-07-16 16:00:42,322 WARN [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: CbcCheckOnVdsChange
2012-07-16 16:00:42,332 WARN [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: ENGINEEARLib
2012-07-16 16:00:42,333 WARN [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: CAEngineKey
2012-07-16 16:00:42,366 WARN [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: MinimalETLVersion
2012-07-16 16:00:42,372 WARN [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: SQLServerI18NPrefix
2012-07-16 16:00:42,379 WARN [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: ScriptsPath
2012-07-16 16:00:42,388 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (MSC service thread 1-5) Failed to decrypt Data must start with zero
2012-07-16 16:00:42,388 ERROR [org.ovirt.engine.core.dal.dbbroker.generic.DBConfigUtils] (MSC service thread 1-5) Failed to decrypt value for property TruststorePass will be used encrypted value
2012-07-16 16:00:42,397 INFO [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) VDSBrokerFrontend: 7/16/12 4:00 PM
2012-07-16 16:00:42,400 INFO [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) CpuFlagsManager: 7/16/12 4:00 PM
2012-07-16 16:00:42,401 INFO [org.ovirt.engine.core.bll.AuditLogCleanupManager] (MSC service thread 1-5) Setting audit clean up manager to run at: 35 35 3 * * ?
2012-07-16 16:00:42,423 INFO [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) AuditLogCleanupManager: 7/16/12 4:00 PM
2012-07-16 16:00:42,427 INFO [org.ovirt.engine.core.bll.TagsDirector] (MSC service thread 1-5) TagsDirector initialization
2012-07-16 16:00:42,445 INFO [org.ovirt.engine.core.bll.TagsDirector] (MSC service thread 1-5) Tag root added to tree
2012-07-16 16:00:42,459 INFO [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) TagsDirector: 7/16/12 4:00 PM
2012-07-16 16:00:42,467 INFO [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) IsoDomainListSyncronizer: 7/16/12 4:00 PM
2012-07-16 16:00:42,486 INFO [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) InitHandlers: 7/16/12 4:00 PM
2012-07-16 16:00:42,493 WARN [org.ovirt.engine.core.utils.ErrorTranslatorImpl] (MSC service thread 1-5) Code VDS_FENCING_OPERATION_FAILED appears more then once in string table.
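The "Compensating CHANGED_STATUS_ONLY ... status=Down" line above suggests that the startup compensation only rolls back the snapshotted VmDynamic status; nothing in that path touches the disk image status. Below is a minimal, hypothetical sketch of that idea - the class and method names are assumptions for illustration, not the engine's actual compensation code:

// Hypothetical sketch only: startup compensation restores the snapshotted VM
// status, while the disk image is left in whatever state the interrupted
// DeleteImageGroup flow had set, i.e. LOCKED.
import java.util.List;

class StartupCompensationSketch {

    enum VmStatus { UP, DOWN }
    enum ImageStatus { OK, LOCKED }

    record VmStatusSnapshot(String vmId, VmStatus status) {}

    static ImageStatus diskStatus = ImageStatus.LOCKED; // set by the interrupted removal

    // Runs once when the engine service comes back up.
    static void compensateOnStartup(List<VmStatusSnapshot> snapshots) {
        for (VmStatusSnapshot snapshot : snapshots) {
            // CHANGED_STATUS_ONLY: only the VmDynamic status is rolled back.
            System.out.println("VM " + snapshot.vmId() + " restored to " + snapshot.status());
        }
        // No compensation entry exists for the image, so it stays LOCKED.
        System.out.println("disk image status after compensation: " + diskStatus);
    }

    public static void main(String[] args) {
        compensateOnStartup(List.of(
                new VmStatusSnapshot("a3cf66c3-ef31-4d33-97fd-50b179b338b8", VmStatus.DOWN)));
    }
}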
Trying to reproduce this I got to something very similar to what was described here (VM disks locked after an engine restart during a RemoveVm command), but the logs are different. Anyway, I identified the problem as being in the async_tasks table: the action_params_class is org.ovirt.engine.core.common.action.RemoveVmParameters but the parameters (action_parameters) are org.ovirt.engine.core.common.action.RemoveAllVmImagesParameters.

And in fact the engine logs contain:

2012-09-20 14:50:43,549 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (pool-10-thread-50) Cant find constructor for type org.ovirt.engine.core.bll.RemoveVmCommand with parameter types: [class org.ovirt.engine.core.common.action.RemoveAllVmImagesParameters]
2012-09-20 14:50:43,551 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (pool-10-thread-50) CommandsFactory [parameter: VdcActionParametersBase]: Failed to get type information using reflection for Action: RemoveVm: java.lang.RuntimeException: could not find matching constructor for Command class class org.ovirt.engine.core.bll.RemoveVmCommand

I believe the root cause is in RemoveAllVmImagesCommand.java:59:

protected void executeVmCommand() {
    [...]
    RemoveImageParameters tempVar = new RemoveImageParameters(image.getImageId());
    tempVar.setParentCommand(getParameters().getParentCommand());
    tempVar.setParentParameters(getParameters().getParentParameters());
    [...]
    tempVar.setParentParameters(getParameters());
    [...]
}

For some reason the parent parameters (RemoveVmParameters) are overwritten with the RemoveAllVmImagesParameters ones. I removed that line and it looks like it's working now. I'll investigate some more and eventually send a patch.
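To make the failure mode above concrete, here is a small self-contained sketch of the mechanism: the child task's parent parameters get persisted with the wrong type, so the reflective constructor lookup on restart cannot find a match. The class names mirror the ones in the log, but the bodies are stand-ins, not the engine's code:

// Illustrative sketch of the "Cant find constructor" failure described above.
import java.lang.reflect.Constructor;

class ParentParametersSketch {

    static class RemoveVmParameters {}            // what RemoveVmCommand expects
    static class RemoveAllVmImagesParameters {}   // what actually got persisted

    // Stand-in for RemoveVmCommand: only has a RemoveVmParameters constructor.
    static class RemoveVmCommand {
        RemoveVmCommand(RemoveVmParameters parameters) {}
    }

    public static void main(String[] args) throws Exception {
        // After the overriding setParentParameters(getParameters()) call,
        // the persisted parent parameters are of the wrong type:
        Object persistedParams = new RemoveAllVmImagesParameters();

        // On engine restart, CommandsFactory-style reflection looks for a
        // constructor matching the persisted parameters' class and fails,
        // matching the errors seen in the engine log.
        try {
            Constructor<RemoveVmCommand> ctor =
                    RemoveVmCommand.class.getDeclaredConstructor(persistedParams.getClass());
            ctor.newInstance(persistedParams);
        } catch (NoSuchMethodException e) {
            System.out.println("could not find matching constructor: " + e.getMessage());
        }
    }
}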
Proposed fix upstream: http://gerrit.ovirt.org/#/c/8119/
A generic solution (devised for another regression) was merged with Change-Id: I71290e58978b680dbbbf88b024a19482b637a91b
(In reply to comment #3)
> A generic solution (devised for another regression) was merged with
> Change-Id: I71290e58978b680dbbbf88b024a19482b637a91b

It looks unrelated to me. Here the issue was that the parent parameters were overridden (via setParentParameters) with garbage. I don't think I71290e58978b680dbbbf88b024a19482b637a91b would come into play in this case.
Checked on si20. I still have a VM whose disk is stuck in Image Locked. Logs will be attached, but we can see that after the engine starts again, the task is added with no parent command, so even though the task is cleaned the VM's disk will remain in Image Locked.

2012-10-11 15:53:46,187 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-15) Adding task a418b478-345a-4066-b6f0-3092c72aa495 (Parent Command Unknown, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..

VM id: 120570e0-0ef1-4e12-84d2-e3d2a46f0af4 (name is POOL-3).
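A minimal, hypothetical sketch of why a task re-added with "Parent Command Unknown" leaves the disk locked - the class and field names are assumptions that only illustrate the reasoning above, not the engine's actual AsyncTaskManager code:

// Hypothetical sketch: when a finished task has no parent command attached,
// it can be cleared without anyone running the parent's end-action, so the
// image status is never flipped back from LOCKED.
class OrphanTaskSketch {

    enum ImageStatus { OK, LOCKED }

    static class AsyncTask {
        final String taskId;
        final Runnable parentEndAction; // null == "Parent Command Unknown"
        AsyncTask(String taskId, Runnable parentEndAction) {
            this.taskId = taskId;
            this.parentEndAction = parentEndAction;
        }
    }

    static ImageStatus diskStatus = ImageStatus.LOCKED;

    static void onTaskFinished(AsyncTask task) {
        if (task.parentEndAction != null) {
            // Normal flow: the parent command's end-action unlocks the disk.
            task.parentEndAction.run();
        }
        // Either way the task row is removed, so nothing will retry later.
        System.out.println("task " + task.taskId + " cleaned, disk is " + diskStatus);
    }

    public static void main(String[] args) {
        // Task restored after restart with no parent command information:
        onTaskFinished(new AsyncTask("a418b478", null)); // disk stays LOCKED
    }
}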
Created attachment 625551 [details] engine and vds logs
The fix was verified by tracing the exact flow of the data inserted into, modified in, and removed from the async_tasks table (and, looking at the code, the fix was straightforward). I suspect this could be an unexpected interaction with other patches around this area, or that in this case the engine was restarted in a different way/at a different time than what I was testing.
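For reference, a rough sketch of the kind of check one could use while tracing async_tasks rows: the action_params_class column name comes from the earlier comment, while the task_id/action_type columns, connection URL and credentials are assumptions/placeholders:

// Rough sketch, not the actual verification tooling: dump async_tasks rows so
// the stored parameters class can be compared with the owning command type.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class AsyncTasksInspector {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/engine", "engine", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT task_id, action_type, action_params_class FROM async_tasks")) {
            while (rs.next()) {
                System.out.printf("%s %s %s%n",
                        rs.getString("task_id"),
                        rs.getString("action_type"),
                        rs.getString("action_params_class"));
            }
        }
    }
}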
*** Bug 880687 has been marked as a duplicate of this bug. ***
http://gerrit.ovirt.org/#/c/11075/
Verified on sf8: VMs were not left in Image Locked after the restart.
3.2 has been released