Bug 970889 - [LSM] engine: disk remains in locked state with repeating error in engine log when vm's pid is suddenly killed
Summary: [LSM] engine: disk remains in locked state with repeating error in engine log...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: 3.3.0
Assignee: Daniel Erez
QA Contact: Katarzyna Jachim
URL:
Whiteboard: storage
: 989966 (view as bug list)
Depends On: 985735
Blocks: 991542
TreeView+ depends on / blocked
 
Reported: 2013-06-05 07:28 UTC by Dafna Ron
Modified: 2016-02-10 17:32 UTC (History)
11 users (show)

Fixed In Version: is6
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 991542 (view as bug list)
Environment:
Last Closed: 2014-01-21 22:14:51 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs (2.24 MB, application/x-gzip)
2013-06-05 07:28 UTC, Dafna Ron
no flags Details
Recent engine log with described failure. (4.96 MB, text/x-log)
2013-07-09 15:15 UTC, Allon Mureinik
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 16797 0 None None None Never

Description Dafna Ron 2013-06-05 07:28:54 UTC
Created attachment 757059 [details]
logs

Description of problem:

This happened to me in my setup and I found a way to reproduce it. 
a vm crashed during syncImageData and engine started reporting the following in a loop while vm disk remained in image locked: 

2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found

I found a way to reproduce this 100% which is to kill -9 the vm pid during syncImageData. 

Version-Release number of selected component (if applicable):

sf17.4

How reproducible:

100%

Steps to Reproduce:
1. on a two hosts cluster with multiple iscsi storage create a vm with one disk
2. run the vm
3. when the vm changes status to up live storage migrate the disk
4. during syncImageData step, kill -9 the vm's pid


Actual results:

1. engine shows the following error in the logs in a loop: 

2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found

2. vm's disk remains in image locked until we run unlock_entity script. 

Expected results:

we should be able to recover in case a vm crashes duruing LSM. 

Additional info:logs

2013-06-05 10:12:03,563 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] Cannot get vdsManager for vdsid=00000000-0000-0000-0000-000000000000
2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.initializeVdsBroker(VdsBrokerCommand.java:50) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.<init>(VdsBrokerCommand.java:29) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskVDSCommand.<init>(VmReplicateDiskVDSCommand.java:9) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand.<init>(VmReplicateDiskFinishVDSCommand.java:8) [engine-vdsbroker.jar:]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [rt.jar:1.7.0_19]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) [rt.jar:1.7.0_19]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.7.0_19]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525) [rt.jar:1.7.0_19]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.CreateCommand(ResourceManager.java:333) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:370) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler.revertTask(VmReplicateDiskStartTaskHandler.java:79) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler.endWithFailure(VmReplicateDiskStartTaskHandler.java:107) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalEndWithFailure(CommandBase.java:569) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:485) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1515) [engine-bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:166) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:108) [engine-utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:429) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:379) [engine-bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_19]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_19]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:58) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:374) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.ovirt.engine.core.utils.ThreadLocalSessionCleanerInterceptor.injectWebContextToThreadLocal(ThreadLocalSessionCleanerInterceptor.java:11) [engine-utils.jar:]
        at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source) [:1.7.0_19]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
        at org.jboss.as.ee.component.ManagedReferenceLifecycleMethodInterceptorFactory$ManagedReferenceLifecycleMethodInterceptor.processInvocation(ManagedReferenceLifecycleMethodInterceptorFactory.java:123) [jboss-as-ee.jar:7.2.0.Final-redhat-




engine=# SELECT * from images;
              image_guid              |     creation_date      |    size     |               it_guid                |               parentid               | imagestatus |      
lastmodified      |            vm_snapshot_id            | volume_type | volume_format |            image_group_id            |         _create_date          |         _update_
date          |               quota_id               | active 
--------------------------------------+------------------------+-------------+--------------------------------------+--------------------------------------+-------------+------
------------------+--------------------------------------+-------------+---------------+--------------------------------------+-------------------------------+-----------------
--------------+--------------------------------------+--------
 00000000-0000-0000-0000-000000000000 | 2008-04-01 00:00:00+03 | 85899345920 | 00000000-0000-0000-0000-000000000000 |                                      |           0 |      
                  |                                      |           2 |             4 |                                      | 2013-02-26 04:56:09.820466+02 |                 
              |                                      | t
 f82a0d58-0791-4137-b1e6-22a8794acd2a | 2013-05-27 14:07:13+03 | 16106127360 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
05-27 14:07:15+03 | df1e0727-64ac-4cae-a35f-7f31058bbd69 |           2 |             4 | cd5f0e01-8f73-4aa0-b837-b7f9bf4bc333 | 2013-05-27 14:07:05.401915+03 | 2013-05-29 10:55
:22.031875+03 | 00000000-0000-0000-0000-000000000000 | t
 d23f4d33-797b-4325-93d5-4e2e5d8e352b | 2013-05-30 19:00:13+03 |  1073741824 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
05-30 19:00:15+03 | b4ead7f9-62ec-4be8-b292-c78aa61571f4 |           2 |             4 | e76ada86-21cf-45b5-a241-74932cf45af9 | 2013-05-30 19:00:04.370783+03 | 2013-05-30 19:00
:15.472749+03 | 00000000-0000-0000-0000-000000000000 | t
 864baac9-0229-4e0f-9be9-2ab0452c959d | 2013-06-05 10:06:56+03 |  4294967296 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
06-05 10:06:56+03 | 189f21ab-22b8-4e55-a994-079b105c396d |           1 |             5 | dcc04651-062b-4124-84aa-d97d3ae37173 | 2013-06-05 10:06:49.142729+03 | 2013-06-05 10:08
:51.89915+03  | 00000000-0000-0000-0000-000000000000 | f
 e76974e4-18f7-4bee-9a79-7e5a02a7d655 | 2013-06-05 10:09:00+03 |  4294967296 | 00000000-0000-0000-0000-000000000000 | 864baac9-0229-4e0f-9be9-2ab0452c959d |           2 | 2013-
06-05 10:09:00+03 | 44762adb-2711-41a8-9263-32c30130f9af |           2 |             4 | dcc04651-062b-4124-84aa-d97d3ae37173 | 2013-06-05 10:08:51.89915+03  | 2013-06-05 10:09
:14.203064+03 | 00000000-0000-0000-0000-000000000000 | t
(5 rows)


3198f747-07ab-41eb-995b-f1470250c32f :
        verb = syncImageData
        id = 3198f747-07ab-41eb-995b-f1470250c32f

Comment 1 Allon Mureinik 2013-07-09 15:12:20 UTC
This reproduces with the latest upstream (commit hash 085420fc98657a4e0281d38abb56f5d0ae0825ee).

Adding recent logs from my setup, and marking as high priority as this is a scenario that can very well happen, and the system does not seem to have a way to automatically recover.

Comment 2 Allon Mureinik 2013-07-09 15:15:20 UTC
Created attachment 771095 [details]
Recent engine log with described failure.

Comment 3 Allon Mureinik 2013-07-30 14:44:04 UTC
*** Bug 989966 has been marked as a duplicate of this bug. ***

Comment 5 Katarzyna Jachim 2013-08-13 12:19:55 UTC
Verified on is9.1 - after killing vm process live migration fails (after a few minutes) and disk is unlocked.

Comment 6 Itamar Heim 2014-01-21 22:14:51 UTC
Closing - RHEV 3.3 Released

Comment 7 Itamar Heim 2014-01-21 22:22:20 UTC
Closing - RHEV 3.3 Released


Note You need to log in before you can comment on or make changes to this bug.