Bug 991542 - [LSM] engine: disk remains in locked state with repeating error in engine log when vm's pid is suddenly killed
Summary: [LSM] engine: disk remains in locked state with repeating error in engine log...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: 3.2.3
Assignee: Daniel Erez
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On: 970889
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-08-02 17:12 UTC by Chris Pelland
Modified: 2016-02-10 19:10 UTC (History)
12 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When a virtual machine's pid was unexpectedly killed, its disks remained in a locked state, and the engine logged repeated errors. This has been fixed by catching the exception which prevents the disk from being unlocked, and having its status updated. Now, if a virtual machine crashes, its disks do not stay locked, and the error is only logged once.
Clone Of: 970889
Environment:
Last Closed: 2013-09-10 19:05:39 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2013:1210 0 normal SHIPPED_LIVE Moderate: rhevm security and bug fix update 2013-09-10 23:03:43 UTC
oVirt gerrit 16797 0 None None None Never

Description Chris Pelland 2013-08-02 17:12:46 UTC
+++ This bug was initially created as a clone of Bug #970889 +++

Description of problem:

This happened to me in my setup and I found a way to reproduce it. 
a vm crashed during syncImageData and engine started reporting the following in a loop while vm disk remained in image locked: 

2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found

I found a way to reproduce this 100% which is to kill -9 the vm pid during syncImageData. 

Version-Release number of selected component (if applicable):

sf17.4

How reproducible:

100%

Steps to Reproduce:
1. on a two hosts cluster with multiple iscsi storage create a vm with one disk
2. run the vm
3. when the vm changes status to up live storage migrate the disk
4. during syncImageData step, kill -9 the vm's pid


Actual results:

1. engine shows the following error in the logs in a loop: 

2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found

2. vm's disk remains in image locked until we run unlock_entity script. 

Expected results:

we should be able to recover in case a vm crashes duruing LSM. 

Additional info:logs

2013-06-05 10:12:03,563 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] Cannot get vdsManager for vdsid=00000000-0000-0000-0000-000000000000
2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.initializeVdsBroker(VdsBrokerCommand.java:50) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.<init>(VdsBrokerCommand.java:29) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskVDSCommand.<init>(VmReplicateDiskVDSCommand.java:9) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand.<init>(VmReplicateDiskFinishVDSCommand.java:8) [engine-vdsbroker.jar:]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [rt.jar:1.7.0_19]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) [rt.jar:1.7.0_19]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.7.0_19]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525) [rt.jar:1.7.0_19]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.CreateCommand(ResourceManager.java:333) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:370) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler.revertTask(VmReplicateDiskStartTaskHandler.java:79) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler.endWithFailure(VmReplicateDiskStartTaskHandler.java:107) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalEndWithFailure(CommandBase.java:569) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:485) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1515) [engine-bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:166) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:108) [engine-utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:429) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:379) [engine-bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_19]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_19]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:58) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:374) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.ovirt.engine.core.utils.ThreadLocalSessionCleanerInterceptor.injectWebContextToThreadLocal(ThreadLocalSessionCleanerInterceptor.java:11) [engine-utils.jar:]
        at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source) [:1.7.0_19]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
        at org.jboss.as.ee.component.ManagedReferenceLifecycleMethodInterceptorFactory$ManagedReferenceLifecycleMethodInterceptor.processInvocation(ManagedReferenceLifecycleMethodInterceptorFactory.java:123) [jboss-as-ee.jar:7.2.0.Final-redhat-




engine=# SELECT * from images;
              image_guid              |     creation_date      |    size     |               it_guid                |               parentid               | imagestatus |      
lastmodified      |            vm_snapshot_id            | volume_type | volume_format |            image_group_id            |         _create_date          |         _update_
date          |               quota_id               | active 
--------------------------------------+------------------------+-------------+--------------------------------------+--------------------------------------+-------------+------
------------------+--------------------------------------+-------------+---------------+--------------------------------------+-------------------------------+-----------------
--------------+--------------------------------------+--------
 00000000-0000-0000-0000-000000000000 | 2008-04-01 00:00:00+03 | 85899345920 | 00000000-0000-0000-0000-000000000000 |                                      |           0 |      
                  |                                      |           2 |             4 |                                      | 2013-02-26 04:56:09.820466+02 |                 
              |                                      | t
 f82a0d58-0791-4137-b1e6-22a8794acd2a | 2013-05-27 14:07:13+03 | 16106127360 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
05-27 14:07:15+03 | df1e0727-64ac-4cae-a35f-7f31058bbd69 |           2 |             4 | cd5f0e01-8f73-4aa0-b837-b7f9bf4bc333 | 2013-05-27 14:07:05.401915+03 | 2013-05-29 10:55
:22.031875+03 | 00000000-0000-0000-0000-000000000000 | t
 d23f4d33-797b-4325-93d5-4e2e5d8e352b | 2013-05-30 19:00:13+03 |  1073741824 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
05-30 19:00:15+03 | b4ead7f9-62ec-4be8-b292-c78aa61571f4 |           2 |             4 | e76ada86-21cf-45b5-a241-74932cf45af9 | 2013-05-30 19:00:04.370783+03 | 2013-05-30 19:00
:15.472749+03 | 00000000-0000-0000-0000-000000000000 | t
 864baac9-0229-4e0f-9be9-2ab0452c959d | 2013-06-05 10:06:56+03 |  4294967296 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
06-05 10:06:56+03 | 189f21ab-22b8-4e55-a994-079b105c396d |           1 |             5 | dcc04651-062b-4124-84aa-d97d3ae37173 | 2013-06-05 10:06:49.142729+03 | 2013-06-05 10:08
:51.89915+03  | 00000000-0000-0000-0000-000000000000 | f
 e76974e4-18f7-4bee-9a79-7e5a02a7d655 | 2013-06-05 10:09:00+03 |  4294967296 | 00000000-0000-0000-0000-000000000000 | 864baac9-0229-4e0f-9be9-2ab0452c959d |           2 | 2013-
06-05 10:09:00+03 | 44762adb-2711-41a8-9263-32c30130f9af |           2 |             4 | dcc04651-062b-4124-84aa-d97d3ae37173 | 2013-06-05 10:08:51.89915+03  | 2013-06-05 10:09
:14.203064+03 | 00000000-0000-0000-0000-000000000000 | t
(5 rows)


3198f747-07ab-41eb-995b-f1470250c32f :
        verb = syncImageData
        id = 3198f747-07ab-41eb-995b-f1470250c32f

--- Additional comment from Allon Mureinik on 2013-07-09 11:12:20 EDT ---

This reproduces with the latest upstream (commit hash 085420fc98657a4e0281d38abb56f5d0ae0825ee).

Adding recent logs from my setup, and marking as high priority as this is a scenario that can very well happen, and the system does not seem to have a way to automatically recover.

--- Additional comment from Allon Mureinik on 2013-07-09 11:15:20 EDT ---



--- Additional comment from Allon Mureinik on 2013-07-30 10:44:04 EDT ---

Comment 2 Aharon Canan 2013-08-26 12:51:36 UTC
Verified using sf20

Error appears in logs but only once, 
Disk became available and didn't remain locked (after few minutes)

Comment 4 errata-xmlrpc 2013-09-10 19:05:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1210.html


Note You need to log in before you can comment on or make changes to this bug.