Bug 970889 - [LSM] engine: disk remains in locked state with repeating error in engine log when vm's pid is suddenly killed
[LSM] engine: disk remains in locked state with repeating error in engine log...
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
3.2.0
x86_64 Linux
high Severity high
: ---
: 3.3.0
Assigned To: Daniel Erez
Katarzyna Jachim
storage
: ZStream
: 989966 (view as bug list)
Depends On: 985735
Blocks: 991542
  Show dependency treegraph
 
Reported: 2013-06-05 03:28 EDT by Dafna Ron
Modified: 2016-02-10 12:32 EST (History)
11 users (show)

See Also:
Fixed In Version: is6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 991542 (view as bug list)
Environment:
Last Closed: 2014-01-21 17:14:51 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (2.24 MB, application/x-gzip)
2013-06-05 03:28 EDT, Dafna Ron
no flags Details
Recent engine log with described failure. (4.96 MB, text/x-log)
2013-07-09 11:15 EDT, Allon Mureinik
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 16797 None None None Never

  None (edit)
Description Dafna Ron 2013-06-05 03:28:54 EDT
Created attachment 757059 [details]
logs

Description of problem:

This happened to me in my setup and I found a way to reproduce it. 
a vm crashed during syncImageData and engine started reporting the following in a loop while vm disk remained in image locked: 

2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found

I found a way to reproduce this 100% which is to kill -9 the vm pid during syncImageData. 

Version-Release number of selected component (if applicable):

sf17.4

How reproducible:

100%

Steps to Reproduce:
1. on a two hosts cluster with multiple iscsi storage create a vm with one disk
2. run the vm
3. when the vm changes status to up live storage migrate the disk
4. during syncImageData step, kill -9 the vm's pid


Actual results:

1. engine shows the following error in the logs in a loop: 

2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found

2. vm's disk remains in image locked until we run unlock_entity script. 

Expected results:

we should be able to recover in case a vm crashes duruing LSM. 

Additional info:logs

2013-06-05 10:12:03,563 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] Cannot get vdsManager for vdsid=00000000-0000-0000-0000-000000000000
2013-06-05 10:12:03,564 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (pool-4-thread-49) [5ab1c721] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.initializeVdsBroker(VdsBrokerCommand.java:50) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.<init>(VdsBrokerCommand.java:29) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskVDSCommand.<init>(VmReplicateDiskVDSCommand.java:9) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand.<init>(VmReplicateDiskFinishVDSCommand.java:8) [engine-vdsbroker.jar:]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [rt.jar:1.7.0_19]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) [rt.jar:1.7.0_19]
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [rt.jar:1.7.0_19]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525) [rt.jar:1.7.0_19]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.CreateCommand(ResourceManager.java:333) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:370) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler.revertTask(VmReplicateDiskStartTaskHandler.java:79) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler.endWithFailure(VmReplicateDiskStartTaskHandler.java:107) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalEndWithFailure(CommandBase.java:569) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endActionInTransactionScope(CommandBase.java:485) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1515) [engine-bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:166) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:108) [engine-utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.endAction(CommandBase.java:429) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.Backend.endAction(Backend.java:379) [engine-bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_19]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_19]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:58) [jboss-as-ee.jar:7.2.0.Final-redhat-8]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:374) [jboss-invocation.jar:1.1.1.Final-redhat-2]
        at org.ovirt.engine.core.utils.ThreadLocalSessionCleanerInterceptor.injectWebContextToThreadLocal(ThreadLocalSessionCleanerInterceptor.java:11) [engine-utils.jar:]
        at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source) [:1.7.0_19]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_19]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_19]
        at org.jboss.as.ee.component.ManagedReferenceLifecycleMethodInterceptorFactory$ManagedReferenceLifecycleMethodInterceptor.processInvocation(ManagedReferenceLifecycleMethodInterceptorFactory.java:123) [jboss-as-ee.jar:7.2.0.Final-redhat-




engine=# SELECT * from images;
              image_guid              |     creation_date      |    size     |               it_guid                |               parentid               | imagestatus |      
lastmodified      |            vm_snapshot_id            | volume_type | volume_format |            image_group_id            |         _create_date          |         _update_
date          |               quota_id               | active 
--------------------------------------+------------------------+-------------+--------------------------------------+--------------------------------------+-------------+------
------------------+--------------------------------------+-------------+---------------+--------------------------------------+-------------------------------+-----------------
--------------+--------------------------------------+--------
 00000000-0000-0000-0000-000000000000 | 2008-04-01 00:00:00+03 | 85899345920 | 00000000-0000-0000-0000-000000000000 |                                      |           0 |      
                  |                                      |           2 |             4 |                                      | 2013-02-26 04:56:09.820466+02 |                 
              |                                      | t
 f82a0d58-0791-4137-b1e6-22a8794acd2a | 2013-05-27 14:07:13+03 | 16106127360 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
05-27 14:07:15+03 | df1e0727-64ac-4cae-a35f-7f31058bbd69 |           2 |             4 | cd5f0e01-8f73-4aa0-b837-b7f9bf4bc333 | 2013-05-27 14:07:05.401915+03 | 2013-05-29 10:55
:22.031875+03 | 00000000-0000-0000-0000-000000000000 | t
 d23f4d33-797b-4325-93d5-4e2e5d8e352b | 2013-05-30 19:00:13+03 |  1073741824 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
05-30 19:00:15+03 | b4ead7f9-62ec-4be8-b292-c78aa61571f4 |           2 |             4 | e76ada86-21cf-45b5-a241-74932cf45af9 | 2013-05-30 19:00:04.370783+03 | 2013-05-30 19:00
:15.472749+03 | 00000000-0000-0000-0000-000000000000 | t
 864baac9-0229-4e0f-9be9-2ab0452c959d | 2013-06-05 10:06:56+03 |  4294967296 | 00000000-0000-0000-0000-000000000000 | 00000000-0000-0000-0000-000000000000 |           1 | 2013-
06-05 10:06:56+03 | 189f21ab-22b8-4e55-a994-079b105c396d |           1 |             5 | dcc04651-062b-4124-84aa-d97d3ae37173 | 2013-06-05 10:06:49.142729+03 | 2013-06-05 10:08
:51.89915+03  | 00000000-0000-0000-0000-000000000000 | f
 e76974e4-18f7-4bee-9a79-7e5a02a7d655 | 2013-06-05 10:09:00+03 |  4294967296 | 00000000-0000-0000-0000-000000000000 | 864baac9-0229-4e0f-9be9-2ab0452c959d |           2 | 2013-
06-05 10:09:00+03 | 44762adb-2711-41a8-9263-32c30130f9af |           2 |             4 | dcc04651-062b-4124-84aa-d97d3ae37173 | 2013-06-05 10:08:51.89915+03  | 2013-06-05 10:09
:14.203064+03 | 00000000-0000-0000-0000-000000000000 | t
(5 rows)


3198f747-07ab-41eb-995b-f1470250c32f :
        verb = syncImageData
        id = 3198f747-07ab-41eb-995b-f1470250c32f
Comment 1 Allon Mureinik 2013-07-09 11:12:20 EDT
This reproduces with the latest upstream (commit hash 085420fc98657a4e0281d38abb56f5d0ae0825ee).

Adding recent logs from my setup, and marking as high priority as this is a scenario that can very well happen, and the system does not seem to have a way to automatically recover.
Comment 2 Allon Mureinik 2013-07-09 11:15:20 EDT
Created attachment 771095 [details]
Recent engine log with described failure.
Comment 3 Allon Mureinik 2013-07-30 10:44:04 EDT
*** Bug 989966 has been marked as a duplicate of this bug. ***
Comment 5 Katarzyna Jachim 2013-08-13 08:19:55 EDT
Verified on is9.1 - after killing vm process live migration fails (after a few minutes) and disk is unlocked.
Comment 6 Itamar Heim 2014-01-21 17:14:51 EST
Closing - RHEV 3.3 Released
Comment 7 Itamar Heim 2014-01-21 17:22:20 EST
Closing - RHEV 3.3 Released

Note You need to log in before you can comment on or make changes to this bug.