Bug 840525

Summary: ovirt-engine-backend: When engine service restart during 'DeleteImageGroup', the VM's Virtual Disk stuck in 'lock' forever (VM cannot be remove).
Product: Red Hat Enterprise Virtualization Manager Reporter: Omri Hochman <ohochman>
Component: ovirt-engineAssignee: Liron Aravot <laravot>
Status: CLOSED CURRENTRELEASE QA Contact: Dafna Ron <dron>
Severity: high Docs Contact:
Priority: high    
Version: 3.1.0CC: abaron, amureini, dron, fsimonce, iheim, italkohe, sgrinber, yeylon, ykaul
Target Milestone: ---Keywords: ZStream
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: SF6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 896608 (view as bug list) Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 896608    
Attachments:
Description Flags
engine.log
none
engine and vds logs none

Description Omri Hochman 2012-07-16 14:26:00 UTC
Created attachment 598463 [details]
engine.log

ovirt-engine-backend: When engine service restart during 'DeleteImageGroup', the VM's Virtual Disk stuck in 'lock' forever (VM cannot be remove).

Scenario:
**********
It happened to me twice, 
1) Create desktop pool of 5 VM's
2) Attempted to remove them (at once)
3) restart the ovirt engine service while the tasks are running 
(if its not reproducing set a break-point in the relevant code part or just  restart the ovirt engine twice - it a meter of timing )

Results:
*********
- VM Virtual Disk will remain in status 'lock' forever.
- VM cannot be remove while its hard-disk locked.

***************************************************
service ovirt-engine restart during this part of the code ---> 
 ---> result -  Virtual Disk remain in lock forever  :
****************************************************
VDSReturnValue returnValue = runVdsCommand(VDSCommandType.DeleteImageGroup, newDeleteImageGroupVDSCommandParameters(getDiskImage().getstorage_pool_id().getValue(),
     getStorageDomainId().getValue(), getDiskImage().getId()
     getValue(), getDiskImage().isWipeAfterDelete(), getParameters()
    .getForceDelete(), getStoragePool().getcompatibility_version().toString()));
  return returnValue;


engine.log:
***********
012-07-16 16:00:17,654 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-4-thread-48) [7a8f0bf2] START, DeleteImageGroupVDSCommand(storagePoolId = 86efaad2-705a-4a08-8d98-5a24f098044c, ignoreFailoverLimit = false, compatabilityVersion = 3.1, storageDomainId = fb2298e2-6be5-4571-bc33-5fb6b00e46a6, imageGroupId = 90c3fa25-9aa0-4935-9513-94469acfd6b8, postZeros = false, forceDelete = false), log id: 60f2ac65
2012-07-16 16:00:27,314 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-6) AsyncTaskManager::PollAndUpdateAsyncTasks: 2 tasks, 1 tasks to poll now
2012-07-16 16:00:40,569 WARN  [org.ovirt.engine.core.dal.dbbroker.DbFacadeLocator] (MSC service thread 1-5) /etc/engine/engine.conf file is not found. Please check your engine installation. Default values will be used
2012-07-16 16:00:41,924 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) Start time: 7/16/12 4:00 PM
2012-07-16 16:00:42,199 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (MSC service thread 1-5) Command [id=673057cf-7187-45cc-b329-ca3611604f5a]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.VmDynamic; snapshot: EntityStatusSnapshot [id=a3cf66c3-ef31-4d33-97fd-50b179b338b8, status=Down].
2012-07-16 16:00:42,253 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) Running compensation on startup for Command : org.ovirt.engine.core.bll.RemoveVmCommand , Command Id : 673057cf-7187-45cc-b329-ca3611604f5a
2012-07-16 16:00:42,322 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: CbcCheckOnVdsChange
2012-07-16 16:00:42,332 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: ENGINEEARLib
2012-07-16 16:00:42,333 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: CAEngineKey
2012-07-16 16:00:42,366 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: MinimalETLVersion
2012-07-16 16:00:42,372 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: SQLServerI18NPrefix
2012-07-16 16:00:42,379 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: ScriptsPath
2012-07-16 16:00:42,388 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (MSC service thread 1-5) Failed to decrypt Data must start with zero
2012-07-16 16:00:42,388 ERROR [org.ovirt.engine.core.dal.dbbroker.generic.DBConfigUtils] (MSC service thread 1-5) Failed to decrypt value for property TruststorePass will be used encrypted value
2012-07-16 16:00:42,397 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) VDSBrokerFrontend: 7/16/12 4:00 PM
2012-07-16 16:00:42,400 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) CpuFlagsManager: 7/16/12 4:00 PM
2012-07-16 16:00:42,401 INFO  [org.ovirt.engine.core.bll.AuditLogCleanupManager] (MSC service thread 1-5) Setting audit clean up manager to run at: 35 35 3 * * ?
2012-07-16 16:00:42,423 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) AuditLogCleanupManager: 7/16/12 4:00 PM
2012-07-16 16:00:42,427 INFO  [org.ovirt.engine.core.bll.TagsDirector] (MSC service thread 1-5) TagsDirector initialization
2012-07-16 16:00:42,445 INFO  [org.ovirt.engine.core.bll.TagsDirector] (MSC service thread 1-5) Tag root added to tree
2012-07-16 16:00:42,459 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) TagsDirector: 7/16/12 4:00 PM
2012-07-16 16:00:42,467 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) IsoDomainListSyncronizer: 7/16/12 4:00 PM
2012-07-16 16:00:42,486 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) InitHandlers: 7/16/12 4:00 PM
2012-07-16 16:00:42,493 WARN  [org.ovirt.engine.core.utils.ErrorTranslatorImpl] (MSC service thread 1-5) Code VDS_FENCING_OPERATION_FAILED appears more then once in string table.

Comment 1 Federico Simoncelli 2012-09-20 20:44:09 UTC
Trying to reproduce this I got to something very similar to what was described here (vm disks locked after an engine restart during a RemoveVm command), but the logs are different.

Anyway I identified the problem being in the async_tasks table: the action_params_class is org.ovirt.engine.core.common.action.RemoveVmParameters but the parameters (action_parameters) are org.ovirt.engine.core.common.action.RemoveAllVmImagesParameters.

And in fact the engine logs contain:

2012-09-20 14:50:43,549 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (pool-10-thread-50) Cant find constructor for type org.ovirt.engine.core.bll.RemoveVmCommand with parameter types: [class org.ovirt.engine.core.common.action.RemoveAllVmImagesParameters]
2012-09-20 14:50:43,551 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (pool-10-thread-50) CommandsFactory [parameter: VdcActionParametersBase]: Failed to get type information using reflection for Action: RemoveVm: java.lang.RuntimeException: could not find matching constructor for Command class class org.ovirt.engine.core.bll.RemoveVmCommand

I believe the root cause is in RemoveAllVmImagesCommand.java:59

 protected void executeVmCommand() {
 [...]
             RemoveImageParameters tempVar = new RemoveImageParameters(image.getImageId());
             tempVar.setParentCommand(getParameters().getParentCommand());
             tempVar.setParentParameters(getParameters().getParentParameters());
 [...]
             tempVar.setParentParameters(getParameters());
 [...]
 }

For some reason the parent parameters (RemoveVmParameters) are overwritten with the RemoveAllVmImagesParameters ones.
I removed such line and it looks like it's working now. I'll investigate some more and I'll eventually send a patch.

Comment 2 Federico Simoncelli 2012-09-21 16:01:08 UTC
Proposed fix upstream: http://gerrit.ovirt.org/#/c/8119/

Comment 3 Allon Mureinik 2012-10-03 16:25:51 UTC
A generic solution (devised for another regression) was merged with Change-Id: I71290e58978b680dbbbf88b024a19482b637a91b

Comment 4 Federico Simoncelli 2012-10-03 22:40:45 UTC
(In reply to comment #3)
> A generic solution (devised for another regression) was merged with
> Change-Id: I71290e58978b680dbbbf88b024a19482b637a91b

It looks unrelated to me. Here the issue was that the parent parameters were overridden (setParentParameters) with garbage. I don't think I71290e58978b680dbbbf88b024a19482b637a91b would come into play in this case.

Comment 7 Dafna Ron 2012-10-11 14:35:30 UTC
checked on si20.

I still have a vm that is stuck in image lock. 

logs will be attached but we can see that after the engine starts again, the task is added with no Parent Command so even though the task is cleaned the vm will remain in image locked. 

2012-10-11 15:53:46,187 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-15) Adding task a418b478-345a-4066-b6f0-3092c72aa495 (Parent Command Unknown, Parameters Type org.ovirt.engine.core.common.asynctasks.Asy
ncTaskParameters), polling hasn't started yet..


VM id: 

120570e0-0ef1-4e12-84d2-e3d2a46f0af4 (name is POOL-3).

Comment 8 Dafna Ron 2012-10-11 14:35:58 UTC
Created attachment 625551 [details]
engine and vds logs

Comment 9 Federico Simoncelli 2012-10-11 15:03:30 UTC
The fix was verified tracing the exact flow of the data inserted/modified and removed from the async task table (also looking at the code the fix was straight forward). I feel like this could be an unexpected interaction with other patches around this area, or in this case the engine was restarted in a different way/time other than what I was testing.

Comment 13 Allon Mureinik 2012-11-28 10:17:53 UTC
*** Bug 880687 has been marked as a duplicate of this bug. ***

Comment 14 Liron Aravot 2013-01-16 13:47:26 UTC
http://gerrit.ovirt.org/#/c/11075/

Comment 16 Dafna Ron 2013-02-27 17:29:45 UTC
verified on sf8
vm's were not in image locked after the restart

Comment 17 Itamar Heim 2013-06-11 09:01:25 UTC
3.2 has been released

Comment 18 Itamar Heim 2013-06-11 09:01:30 UTC
3.2 has been released

Comment 19 Itamar Heim 2013-06-11 09:01:30 UTC
3.2 has been released

Comment 20 Itamar Heim 2013-06-11 09:02:37 UTC
3.2 has been released

Comment 21 Itamar Heim 2013-06-11 09:31:55 UTC
3.2 has been released