Bug 840525 - ovirt-engine-backend: When engine service restart during 'DeleteImageGroup', the VM's Virtual Disk stuck in 'lock' forever (VM cannot be remove).
ovirt-engine-backend: When engine service restart during 'DeleteImageGroup', ...
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.1.0
x86_64 Linux
high Severity high
: ---
: 3.2.0
Assigned To: Liron Aravot
Dafna Ron
storage
: ZStream
Depends On:
Blocks: 896608
  Show dependency treegraph
 
Reported: 2012-07-16 10:26 EDT by Omri Hochman
Modified: 2016-02-10 11:39 EST (History)
9 users (show)

See Also:
Fixed In Version: SF6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 896608 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
engine.log (429.52 KB, application/octet-stream)
2012-07-16 10:26 EDT, Omri Hochman
no flags Details
engine and vds logs (761.71 KB, application/x-gzip)
2012-10-11 10:35 EDT, Dafna Ron
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 11075 None None None Never

  None (edit)
Description Omri Hochman 2012-07-16 10:26:00 EDT
Created attachment 598463 [details]
engine.log

ovirt-engine-backend: When engine service restart during 'DeleteImageGroup', the VM's Virtual Disk stuck in 'lock' forever (VM cannot be remove).

Scenario:
**********
It happened to me twice, 
1) Create desktop pool of 5 VM's
2) Attempted to remove them (at once)
3) restart the ovirt engine service while the tasks are running 
(if its not reproducing set a break-point in the relevant code part or just  restart the ovirt engine twice - it a meter of timing )

Results:
*********
- VM Virtual Disk will remain in status 'lock' forever.
- VM cannot be remove while its hard-disk locked.

***************************************************
service ovirt-engine restart during this part of the code ---> 
 ---> result -  Virtual Disk remain in lock forever  :
****************************************************
VDSReturnValue returnValue = runVdsCommand(VDSCommandType.DeleteImageGroup, newDeleteImageGroupVDSCommandParameters(getDiskImage().getstorage_pool_id().getValue(),
     getStorageDomainId().getValue(), getDiskImage().getId()
     getValue(), getDiskImage().isWipeAfterDelete(), getParameters()
    .getForceDelete(), getStoragePool().getcompatibility_version().toString()));
  return returnValue;


engine.log:
***********
012-07-16 16:00:17,654 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-4-thread-48) [7a8f0bf2] START, DeleteImageGroupVDSCommand(storagePoolId = 86efaad2-705a-4a08-8d98-5a24f098044c, ignoreFailoverLimit = false, compatabilityVersion = 3.1, storageDomainId = fb2298e2-6be5-4571-bc33-5fb6b00e46a6, imageGroupId = 90c3fa25-9aa0-4935-9513-94469acfd6b8, postZeros = false, forceDelete = false), log id: 60f2ac65
2012-07-16 16:00:27,314 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-6) AsyncTaskManager::PollAndUpdateAsyncTasks: 2 tasks, 1 tasks to poll now
2012-07-16 16:00:40,569 WARN  [org.ovirt.engine.core.dal.dbbroker.DbFacadeLocator] (MSC service thread 1-5) /etc/engine/engine.conf file is not found. Please check your engine installation. Default values will be used
2012-07-16 16:00:41,924 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) Start time: 7/16/12 4:00 PM
2012-07-16 16:00:42,199 INFO  [org.ovirt.engine.core.bll.RemoveVmCommand] (MSC service thread 1-5) Command [id=673057cf-7187-45cc-b329-ca3611604f5a]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.VmDynamic; snapshot: EntityStatusSnapshot [id=a3cf66c3-ef31-4d33-97fd-50b179b338b8, status=Down].
2012-07-16 16:00:42,253 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) Running compensation on startup for Command : org.ovirt.engine.core.bll.RemoveVmCommand , Command Id : 673057cf-7187-45cc-b329-ca3611604f5a
2012-07-16 16:00:42,322 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: CbcCheckOnVdsChange
2012-07-16 16:00:42,332 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: ENGINEEARLib
2012-07-16 16:00:42,333 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: CAEngineKey
2012-07-16 16:00:42,366 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: MinimalETLVersion
2012-07-16 16:00:42,372 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: SQLServerI18NPrefix
2012-07-16 16:00:42,379 WARN  [org.ovirt.engine.core.utils.ConfigUtilsBase] (MSC service thread 1-5) Could not find enum value for option: ScriptsPath
2012-07-16 16:00:42,388 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (MSC service thread 1-5) Failed to decrypt Data must start with zero
2012-07-16 16:00:42,388 ERROR [org.ovirt.engine.core.dal.dbbroker.generic.DBConfigUtils] (MSC service thread 1-5) Failed to decrypt value for property TruststorePass will be used encrypted value
2012-07-16 16:00:42,397 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) VDSBrokerFrontend: 7/16/12 4:00 PM
2012-07-16 16:00:42,400 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) CpuFlagsManager: 7/16/12 4:00 PM
2012-07-16 16:00:42,401 INFO  [org.ovirt.engine.core.bll.AuditLogCleanupManager] (MSC service thread 1-5) Setting audit clean up manager to run at: 35 35 3 * * ?
2012-07-16 16:00:42,423 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) AuditLogCleanupManager: 7/16/12 4:00 PM
2012-07-16 16:00:42,427 INFO  [org.ovirt.engine.core.bll.TagsDirector] (MSC service thread 1-5) TagsDirector initialization
2012-07-16 16:00:42,445 INFO  [org.ovirt.engine.core.bll.TagsDirector] (MSC service thread 1-5) Tag root added to tree
2012-07-16 16:00:42,459 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) TagsDirector: 7/16/12 4:00 PM
2012-07-16 16:00:42,467 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) IsoDomainListSyncronizer: 7/16/12 4:00 PM
2012-07-16 16:00:42,486 INFO  [org.ovirt.engine.core.bll.Backend] (MSC service thread 1-5) InitHandlers: 7/16/12 4:00 PM
2012-07-16 16:00:42,493 WARN  [org.ovirt.engine.core.utils.ErrorTranslatorImpl] (MSC service thread 1-5) Code VDS_FENCING_OPERATION_FAILED appears more then once in string table.
Comment 1 Federico Simoncelli 2012-09-20 16:44:09 EDT
Trying to reproduce this I got to something very similar to what was described here (vm disks locked after an engine restart during a RemoveVm command), but the logs are different.

Anyway I identified the problem being in the async_tasks table: the action_params_class is org.ovirt.engine.core.common.action.RemoveVmParameters but the parameters (action_parameters) are org.ovirt.engine.core.common.action.RemoveAllVmImagesParameters.

And in fact the engine logs contain:

2012-09-20 14:50:43,549 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (pool-10-thread-50) Cant find constructor for type org.ovirt.engine.core.bll.RemoveVmCommand with parameter types: [class org.ovirt.engine.core.common.action.RemoveAllVmImagesParameters]
2012-09-20 14:50:43,551 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (pool-10-thread-50) CommandsFactory [parameter: VdcActionParametersBase]: Failed to get type information using reflection for Action: RemoveVm: java.lang.RuntimeException: could not find matching constructor for Command class class org.ovirt.engine.core.bll.RemoveVmCommand

I believe the root cause is in RemoveAllVmImagesCommand.java:59

 protected void executeVmCommand() {
 [...]
             RemoveImageParameters tempVar = new RemoveImageParameters(image.getImageId());
             tempVar.setParentCommand(getParameters().getParentCommand());
             tempVar.setParentParameters(getParameters().getParentParameters());
 [...]
             tempVar.setParentParameters(getParameters());
 [...]
 }

For some reason the parent parameters (RemoveVmParameters) are overwritten with the RemoveAllVmImagesParameters ones.
I removed such line and it looks like it's working now. I'll investigate some more and I'll eventually send a patch.
Comment 2 Federico Simoncelli 2012-09-21 12:01:08 EDT
Proposed fix upstream: http://gerrit.ovirt.org/#/c/8119/
Comment 3 Allon Mureinik 2012-10-03 12:25:51 EDT
A generic solution (devised for another regression) was merged with Change-Id: I71290e58978b680dbbbf88b024a19482b637a91b
Comment 4 Federico Simoncelli 2012-10-03 18:40:45 EDT
(In reply to comment #3)
> A generic solution (devised for another regression) was merged with
> Change-Id: I71290e58978b680dbbbf88b024a19482b637a91b

It looks unrelated to me. Here the issue was that the parent parameters were overridden (setParentParameters) with garbage. I don't think I71290e58978b680dbbbf88b024a19482b637a91b would come into play in this case.
Comment 7 Dafna Ron 2012-10-11 10:35:30 EDT
checked on si20.

I still have a vm that is stuck in image lock. 

logs will be attached but we can see that after the engine starts again, the task is added with no Parent Command so even though the task is cleaned the vm will remain in image locked. 

2012-10-11 15:53:46,187 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (QuartzScheduler_Worker-15) Adding task a418b478-345a-4066-b6f0-3092c72aa495 (Parent Command Unknown, Parameters Type org.ovirt.engine.core.common.asynctasks.Asy
ncTaskParameters), polling hasn't started yet..


VM id: 

120570e0-0ef1-4e12-84d2-e3d2a46f0af4 (name is POOL-3).
Comment 8 Dafna Ron 2012-10-11 10:35:58 EDT
Created attachment 625551 [details]
engine and vds logs
Comment 9 Federico Simoncelli 2012-10-11 11:03:30 EDT
The fix was verified tracing the exact flow of the data inserted/modified and removed from the async task table (also looking at the code the fix was straight forward). I feel like this could be an unexpected interaction with other patches around this area, or in this case the engine was restarted in a different way/time other than what I was testing.
Comment 13 Allon Mureinik 2012-11-28 05:17:53 EST
*** Bug 880687 has been marked as a duplicate of this bug. ***
Comment 14 Liron Aravot 2013-01-16 08:47:26 EST
http://gerrit.ovirt.org/#/c/11075/
Comment 16 Dafna Ron 2013-02-27 12:29:45 EST
verified on sf8
vm's were not in image locked after the restart
Comment 17 Itamar Heim 2013-06-11 05:01:25 EDT
3.2 has been released
Comment 18 Itamar Heim 2013-06-11 05:01:30 EDT
3.2 has been released
Comment 19 Itamar Heim 2013-06-11 05:01:30 EDT
3.2 has been released
Comment 20 Itamar Heim 2013-06-11 05:02:37 EDT
3.2 has been released
Comment 21 Itamar Heim 2013-06-11 05:31:55 EDT
3.2 has been released

Note You need to log in before you can comment on or make changes to this bug.