Bug 1020909 - Restoring VM from snapshot stuck in "Image Locked" state
Summary: Restoring VM from snapshot stuck in "Image Locked" state
Keywords:
Status: CLOSED DUPLICATE of bug 1015638
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.2.6
Assignee: Ravi Nori
QA Contact:
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2013-10-18 13:49 UTC by Petr Spacek
Modified: 2016-02-10 19:43 UTC
CC: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-26 16:02:14 UTC
oVirt Team: Infra
Target Upstream Version:


Attachments
python reset-vm-to-snapshot pviktori-z-ci-host1 remote-host (2.03 KB, text/plain)
2013-11-20 15:57 UTC, Petr Spacek

Description Petr Spacek 2013-10-18 13:49:01 UTC
Description of problem:
We tried to revert two VMs to the state stored in a snapshot, but these two VMs have been stuck in the "Image Locked" state for more than 1 day.



Background
==========
Our Continuous Integration system uses several VMs and reverts them to the snapshot (i.e. the original state) before each test. These two VMs have been restored from the same snapshot at least a hundred times, and everything went fine until now.

Now two VMs are stuck in the "Image Locked" state.


What I tried to resurrect the VMs
=================================
- restart ovirt-engine
- restart vdsmd on both hypervisors
Result: nothing changed. VMs are still in "Image Locked" state.


Environment
===========
Coincidentally, we upgraded from RHEV-M 3.2.0 to the latest RHEV-M 3.2.3 (bits are from the public RHN) 4 hours before the VMs got stuck, but I don't know whether it is related to this problem or not.

All machines (1x RHEV-M + 2x hypervisors) are running RHEL 6.4 with the latest bits from RHN. Kernels are also the latest released versions; all machines were rebooted 4 hours before the problem appeared.

Both stuck VMs reside on the same iSCSI storage domain, but all other VMs on the same storage domain are up and running.


Affected VMs & last lines from log shown in RHEV-M
==================================================
- VM pviktori-z-ci-host1
2013-Oct-17, 12:27 Restoring VM pviktori-z-ci-host1 from snapshot started by user admin internal RestoreSnapshot
2013-Oct-17, 12:27 Snapshot-Preview remote-host for VM pviktori-z-ci-host1 has been completed. RestoreSnapshot
2013-Oct-17, 12:27 Snapshot-Preview remote-host for VM pviktori-z-ci-host1 was initiated by admin internal RestoreSnapshot
2013-Oct-17, 12:27 VM pviktori-z-ci-host1 powered off by admin internal (Host: laika). 1832c81e


- VM pviktori-z-ci-host2
2013-Oct-17, 12:20 Restoring VM pviktori-z-ci-host2 from snapshot started by user admin internal RestoreSnapshot
2013-Oct-17, 12:20 Snapshot-Preview remote-host for VM pviktori-z-ci-host2 has been completed. RestoreSnapshot
2013-Oct-17, 12:19 Snapshot-Preview remote-host for VM pviktori-z-ci-host2 was initiated by admin internal RestoreSnapshot
2013-Oct-17, 12:19 VM pviktori-z-ci-host2 restoring from Snapshot has been completed. RestoreSnapshot
2013-Oct-17, 12:18 Restoring VM pviktori-z-ci-host2 from snapshot started by user admin internal RestoreSnapshot
2013-Oct-17, 12:18 Snapshot-Preview remote-host for VM pviktori-z-ci-host2 has been completed. RestoreSnapshot
2013-Oct-17, 12:18 Snapshot-Preview remote-host for VM pviktori-z-ci-host2 was initiated by admin internal RestoreSnapshot
2013-Oct-17, 12:18 VM pviktori-z-ci-host2 powered off by admin internal (Host: laika). e8f08bb

Version-Release number of selected component (if applicable):
3.2.3-0.43.el6ev

How reproducible:
Unclear.

Actual results:
VM is stuck in "Image Locked" state.

Expected results:
It should work :-)

Comment 1 Petr Spacek 2013-10-18 13:50:35 UTC
For the record, pstehlik helped me to "fix"/work around the problem.

We did (roughly) the following:
* Check that disks are in state OK (not "locked"), i.e. only the VM state is stuck.

* Run /usr/share/ovirt-engine/scripts/dbutils/taskcleaner.sh -d engine -u engine
  Result was "(0 rows)"

* Open database on ovirt-engine:
  $ psql -d engine -U postgres

* Check which VMs are in state 15:
  SELECT vm_guid, status FROM vm_dynamic WHERE status = 15;

* Change state of the VMs to 0:
  UPDATE vm_dynamic SET status = 0 WHERE status = 15;

* COMMIT;

Now both VMs can start and revert to the snapshot works.
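For illustration only, the check-then-update part of the workaround can be modeled in Python against an in-memory stand-in for the vm_dynamic table (status 15 = "Image Locked", 0 = Down, as in the SQL above). This is a sketch of the logic, not a supported tool:

```python
# In-memory model of the manual workaround: find VMs stuck in engine
# status 15 ("Image Locked") and move them back to status 0 ("Down"),
# mirroring the SELECT + UPDATE run against vm_dynamic. The dict is a
# stand-in for the real database table.

IMAGE_LOCKED = 15
DOWN = 0

def stuck_vms(vm_dynamic):
    """Return the guids of VMs whose status is Image Locked (15)."""
    return [guid for guid, status in vm_dynamic.items()
            if status == IMAGE_LOCKED]

def unlock_stuck_vms(vm_dynamic):
    """Set every Image Locked VM back to Down (0); return how many changed."""
    changed = 0
    for guid in stuck_vms(vm_dynamic):
        vm_dynamic[guid] = DOWN
        changed += 1
    return changed

if __name__ == "__main__":
    table = {"vm-a": IMAGE_LOCKED, "vm-b": DOWN, "vm-c": IMAGE_LOCKED}
    print(stuck_vms(table))         # ['vm-a', 'vm-c']
    print(unlock_stuck_vms(table))  # 2
```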

Comment 6 Michal Skrivanek 2013-10-18 14:27:04 UTC
Yair, if you can, please check the NPE in async tasks... it seems the code is not there anymore in 3.3, so I suppose you did something about it :)

Comment 7 Barak 2013-10-20 12:03:55 UTC
Is this 100% reproducible ?

Comment 8 Petr Spacek 2013-10-21 08:45:35 UTC
No, it is not 100% reproducible. It happens from time to time. One VM got stuck after ~12 reverts and the other after ~50 reverts.
Today it happened again with another set of machines. I let VMs in the stuck state so someone can investigate it.

This never happened with RHEV-M 3.2.0 and happened twice in the first week after the upgrade to RHEV-M 3.2.3, so it seems like a regression. Feel free to ping me; I can provide you access to the hypervisors.

Comment 10 Ravi Nori 2013-10-29 01:46:48 UTC
A lot of refactoring has gone into async task handling in 3.3. The code that is throwing the exception is now located in CommandAsyncTask. I took a look at the code, and the cause of the exception seems to be linked to handling the end of commands by entity id. This has been refactored to be handled by command id in 3.3, so the above should not happen in 3.3.

Comment 11 Itamar Heim 2013-10-31 06:51:00 UTC
(In reply to Petr Spacek from comment #1)
> For the record, pstehlik helped me to "fix"/workaround the problem.
> 
> We did (roughly)  following:
> * Check that disks are in state OK (not "locked"), i.e. only the VM state is
> stuck.
> 
> * Run /usr/share/ovirt-engine/scripts/dbutils/taskcleaner.sh -d engine -u
> engine
>   Result was "(0 rows)"
> 
> * Open database on ovirt-engine:
>   $ psql -d engine -U postgres
> 
> * Check which VMs are in state 15:
>   SELECT vm_guid, status FROM vm_dynamic WHERE status = 15;
> 
> * Change state of the VMs to 0:
>   UPDATE vm_dynamic SET status = 0 WHERE status = 15;
> 
> * COMMIT;
> 
> Now both VMs can start and revert to the snapshot works.

just a btw you should be able to use the unlock_entity utility instead of hacking the db.

Comment 12 Petr Spacek 2013-10-31 12:31:21 UTC
Thank you. I use /usr/share/ovirt-engine/dbscripts/unlock_entity.sh -t vm -i $vmname, but I have to run it at least once a day.

Comment 15 Ravi Nori 2013-11-05 01:43:06 UTC
I followed the steps below but am unable to reproduce the issue in 3.2.z or current master.


1. Created two VMs and installed Fedora 18 on them
2. Created a snapshot for each VM
3. Used the command below to restore the VMs from the snapshot

curl -v -X POST -d @snapshot.xml -u admin@internal:letmein! -H "Accept: application/xml" -H "Content-Type: application/xml" http://127.0.0.1:8700/api/vms/xxxx/snapshots/xxxx/restore

where snapshot.xml is

<action>
</action>

4. Executed the above command simultaneously on the two VMs, 500 times, using a shell script.

It took a few hours to run, but every invocation of the restore command completed successfully without exceptions, and the image is in OK status.
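For illustration, the shell-driven stress run can be sketched in Python. The restore URL mirrors the REST path from the curl command above; the VM/snapshot ids and the injected `post` callable are placeholders so the sketch runs without a live engine:

```python
# Sketch of the reproduction loop: fire the snapshot-restore POST for
# several VMs concurrently, many times over. `post` is injected so a test
# stub (or a real HTTP client) can be plugged in.
from concurrent.futures import ThreadPoolExecutor

BODY = "<action>\n</action>"  # same empty action body as snapshot.xml

def restore_url(base, vm_id, snapshot_id):
    return "%s/api/vms/%s/snapshots/%s/restore" % (base, vm_id, snapshot_id)

def stress(post, base, vm_snapshot_pairs, iterations):
    """Call post(url, body) for every VM concurrently, `iterations` times."""
    with ThreadPoolExecutor(max_workers=len(vm_snapshot_pairs)) as pool:
        for _ in range(iterations):
            futures = [pool.submit(post, restore_url(base, vm, snap), BODY)
                       for vm, snap in vm_snapshot_pairs]
            for future in futures:
                future.result()  # surface any failed restore immediately

if __name__ == "__main__":
    calls = []
    stress(lambda url, body: calls.append(url), "http://127.0.0.1:8700",
           [("vm1", "snap1"), ("vm2", "snap2")], iterations=3)
    print(len(calls))  # 6 restore requests issued
```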

Please let me know if I am missing anything.

Looking at the log files, this issue seems to be similar to the one described in BZ 1015638.

Comment 16 Petr Spacek 2013-11-05 08:25:06 UTC
It seems fine, but it is a race condition, you know...

Our workflow is like this:
1) Create VMs, install OS, create snapshot (you did that)
2) Run VM
3) Do some changes in the VM
4) Shutdown VM
5) Revert VM to the snapshot
6) Go to step (2)

All this (except point 1) is done by a script, so the revert follows the shutdown within a very short time, and so on.
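One hypothetical way to narrow such a race window in a loop like this (names are illustrative, not the attached script's API) is to poll the VM's status after shutdown until it actually reports the expected state, with a timeout, before issuing the revert:

```python
# Hypothetical guard between steps 4 and 5 of the workflow above.
# `fetch_status` would be a REST query in practice; sleep/clock are
# injectable so the helper can be tested without real waiting.
import time

def wait_for_status(fetch_status, wanted, timeout=300.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Return True once fetch_status() == wanted, False if timeout expires."""
    deadline = clock() + timeout
    while True:
        if fetch_status() == wanted:
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

A caller would do something like `wait_for_status(get_vm_status, "down")` right after the shutdown call and only then trigger the revert.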

Our hypervisor hosts 20-30 VMs at the same time.

Thank you for your time.

Comment 17 Barak 2013-11-05 09:52:31 UTC
Petr,
Can you please describe your setup?
Specifically: DCs/clusters/hosts, storage.
Did this also happen on other DCs?
Did this always happen on the same VMs?
 

Ravi, can you please check in the attached logs:
- storage-related issues during the described action
- whether the SPM changed during the revert
- whether there were any missing/disappearing tasks in the engine (related to
  the actions above)

Comment 18 Petr Spacek 2013-11-05 11:10:25 UTC
Sure.

1 DC:
- IDM_LAB
1 cluster:
- IDM_LAB, compatibility version 3.2
2 hosts in the cluster:
- belka (RHEL 6.4.z - with latest released updates)
- laika (RHEL 6.4.z - with latest released updates)
4 storage domains:
- belka-vm1 - iSCSI
- laika-vm1 - iSCSI
- export - NFS
- iso-laika - NFS

We have only one DC, so I can't say whether it would happen on another DC or not.

It seems that it happens randomly on all VMs, but we do snapshot-revert only on a subset of all VMs (on ~12 VMs out of ~120 total VMs).

As far as I remember, it happened only with VMs from the storage domain laika-vm1, but this is not very representative information. Only one VM from the whole subset is hosted on storage domain belka-vm1.

It seems that the host on which a VM was running before the snapshot-revert operation doesn't make any difference.

Let me know if you want any other information.

Comment 19 Petr Spacek 2013-11-06 08:29:02 UTC
Correction: today it happened again with a VM stored on storage domain belka-vm1, so both storage domains are affected.

Comment 20 Ravi Nori 2013-11-06 15:20:12 UTC
From the engine log file, there are no storage-related exceptions, no SPM re-election, and no disappearing tasks.

Below are the exceptions from the attached log files. There is a 'Device or resource busy' exception thrown on vdsm just before an exception is thrown on the engine. I am not sure what is causing the resource-busy exception. Was there any recent change in network topology involving the hosts and the iSCSI storage?

Looking at the engine exception, this looks like a race condition where the engine task is cleared before startPolling is called, leading to an NPE, as described in this patch: http://gerrit.ovirt.org/#/c/20183/
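The diagnosed ordering can be modeled in miniature (control flow only, not engine code): the task entry is cleared before polling starts, so the unguarded poll dereferences a missing entry, while a defensive lookup, in the spirit of the linked gerrit patch, turns the same ordering into a harmless no-op:

```python
# Miniature model of the race: a KeyError on the cleared task entry
# stands in for the engine's NullPointerException.

class FakeTask:
    def poll(self):
        return "polling"

tasks = {}  # stand-in for the engine's async-task map

def start_polling_unguarded(task_id):
    return tasks[task_id].poll()   # blows up if the task was already cleared

def start_polling_guarded(task_id):
    task = tasks.get(task_id)
    if task is None:               # task cleaned up first: skip quietly
        return None
    return task.poll()

if __name__ == "__main__":
    tasks["t1"] = FakeTask()
    del tasks["t1"]                # the racy ordering: cleared before polling
    try:
        start_polling_unguarded("t1")
    except KeyError:
        print("unguarded poll crashed")   # analogue of the engine NPE
    print(start_polling_guarded("t1"))    # None: same ordering, no crash
```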

Exception in engine log

2013-10-17 12:20:04,032 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (ajp-/127.0.0.1:8702-7) [RestoreSnapshot] BaseAsyncTask::StartPollingTask: Starting to poll task b3d523c5-0762-4152-a04b-378d6cae39d3.
2013-10-17 12:20:04,053 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-7) Operation Failed: java.lang.NullPointerException: javax.ejb.EJBException: java.lang.NullPointerException
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInNoTx(CMTTxInterceptor.java:191) [jboss-as-ejb3.jar:7.2.1.Final-redhat-10]
	at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:237) [jboss-as-ejb3.jar:7.2.1.Final-redhat-10]



Exception in vdsm log

Thread-2680::DEBUG::2013-10-17 12:20:00,627::task::579::TaskManager.Task::(_updateState) Task=`2753fc4f-a752-4f0c-b034-8ad05e317d79`::moving from state preparing -> state finished
Thread-2680::DEBUG::2013-10-17 12:20:00,628::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-2680::DEBUG::2013-10-17 12:20:00,628::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-2680::DEBUG::2013-10-17 12:20:00,628::task::974::TaskManager.Task::(_decref) Task=`2753fc4f-a752-4f0c-b034-8ad05e317d79`::ref 0 aborting False
89dae60e-0ca2-4023-b879-87736b81e29f::DEBUG::2013-10-17 12:20:00,913::misc::83::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = '  device-mapper: remove ioctl on  failed: Device or resource busy\n  Unable to deactivate 8f72fe93--70a7--44cc--8566--7eec0c4bffea-925f6d7d--4feb--44d5--b352--ee19ed726a57 (253:25)\n'; <rc> = 5
89dae60e-0ca2-4023-b879-87736b81e29f::DEBUG::2013-10-17 12:20:00,928::lvm::501::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
89dae60e-0ca2-4023-b879-87736b81e29f::DEBUG::2013-10-17 12:20:00,928::lvm::513::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
89dae60e-0ca2-4023-b879-87736b81e29f::WARNING::2013-10-17 12:20:00,928::resourceFactories::61::Storage.ResourcesFactories::(close) Failure deactivate LV 8f72fe93-70a7-44cc-8566-7eec0c4bffea/925f6d7d-4feb-44d5-b352-ee19ed726a57 (Cannot deactivate Logical Volume: ('General Storage Exception: ("5 [] [\'  device-mapper: remove ioctl on  failed: Device or resource busy\', \'  device-mapper: remove ioctl on  failed: Device or resource busy\', Unable to deactivate 8f72fe93--70a7--44cc--8566--7eec0c4bffea-925f6d7d--4feb--44d5--b352--ee19ed726a57 (253:25)\']\\n8f72fe93-70a7-44cc-8566-7eec0c4bffea/[\'925f6d7d-4feb-44d5-b352-ee19ed726a57\']",)',))
89dae60e-0ca2-4023-b879-87736b81e29f::DEBUG::2013-10-17 12:20:00,929::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource '8f72fe93-70a7-44cc-8566-7eec0c4bffea_lvmActivationNS.925f6d7d-4feb-44d5-b352-ee19ed726a57', Clearing records.

Comment 21 Petr Spacek 2013-11-06 15:38:07 UTC
(In reply to Ravi Nori from comment #20)
> Was there any recent change in network topology involving the hosts and the iSCSI storage?

The network topology has been the same for several months. As I said in the bug description, the problem first appeared after the upgrade to 3.2.3-0.43.el6ev. We fully upgraded both hypervisors at the same time, so everything is on the latest RHEL 6.4.z bits.

Comment 22 Petr Spacek 2013-11-06 15:44:26 UTC
Hmm, one of the VMs got stuck again (the second time today) and I noticed that there are some stuck tasks on one of the hypervisors:

# /usr/share/ovirt-engine/scripts/dbutils/taskcleaner.sh 
-[ RECORD 1 ]+-------------------------------------
task_id      | b59205fd-9e44-4425-963b-62bb62cf7b1a
task_type    | 4
status       | 2
started_at   | 2013-11-05 23:22:47.883+01
result       | 0
command_type | 224
command_id   | ab0874e4-9731-4ffa-a4ab-1116f21e13f5
step_id      | ccd812d8-f3ab-4b5a-bbfc-bea3a8cf66c4
dc           | 6163a491-b652-453d-9574-630503175cbe
-[ RECORD 2 ]+-------------------------------------
task_id      | bc52458a-0ecc-48b5-8a54-ca740d1a7f61
task_type    | 4
status       | 2
started_at   | 2013-11-05 22:09:45.909+01
result       | 0
command_type | 224
command_id   | a273d5f0-94eb-4610-9253-f333009ec757
step_id      | 0eda080e-0929-49b2-886f-6d8bbc0d4587
dc           | 6163a491-b652-453d-9574-630503175cbe
-[ RECORD 3 ]+-------------------------------------
task_id      | 4564c75a-22e4-4859-8b70-f4dca87ca6dc
task_type    | 4
status       | 2
started_at   | 2013-11-06 12:05:32.018+01
result       | 0
command_type | 224
command_id   | 0331b710-19e8-4e13-8ace-e164f11f8e07
step_id      | fe114cf1-62bf-4631-9b78-65ed63a44ced
dc           | 6163a491-b652-453d-9574-630503175cbe
-[ RECORD 4 ]+-------------------------------------
task_id      | bdd42c87-694f-4c77-9eb8-489245589d4d
task_type    | 4
status       | 2
started_at   | 2013-11-04 23:22:39.925+01
result       | 0
command_type | 224
command_id   | f6325304-27a5-431b-9737-c9cf3dd7d447
step_id      | 1721c866-fb42-42ac-b3b6-1737f14b46a8
dc           | 6163a491-b652-453d-9574-630503175cbe

[root@laika ~]# vdsClient -s localhost getAllTasksStatuses
{'status': {'message': 'OK', 'code': 0}, 'allTasksStatus': {'4564c75a-22e4-4859-8b70-f4dca87ca6dc': {'message': '1 jobs completed successfully', 'code': 0, 'taskID': '4564c75a-22e4-4859-8b70-f4dca87ca6dc', 'taskResult': 'success', 'taskState': 'finished'}, 'b59205fd-9e44-4425-963b-62bb62cf7b1a': {'message': '1 jobs completed successfully', 'code': 0, 'taskID': 'b59205fd-9e44-4425-963b-62bb62cf7b1a', 'taskResult': 'success', 'taskState': 'finished'}, 'bc52458a-0ecc-48b5-8a54-ca740d1a7f61': {'message': '1 jobs completed successfully', 'code': 0, 'taskID': 'bc52458a-0ecc-48b5-8a54-ca740d1a7f61', 'taskResult': 'success', 'taskState': 'finished'}}}

Comment 23 Barak 2013-11-07 09:53:55 UTC
Eduardo, Saggi,

Can you please take a look at the vdsm error in comment #20?

- What could be the problem in this specific use case?
- What are the implications for the task itself? Could it disappear from the
  tasks reported by vdsm?

Comment 24 Petr Spacek 2013-11-20 11:06:49 UTC
Hi guys,

it has been more than two weeks since the last update. Can I help you in any way?

Today I noticed that VDSM on the hypervisors is spitting out some suspicious messages:

Nov 20 10:42:47 belka vdsm vm.Vm WARNING vmId=`695360f0-36f0-41d5-aae0-7707753985e8`::Unknown type found, device: '{'device': u'spicevmc', 'alias': u'channel2', 'type': u'channel', 'address': {u'bus': u'0', u'controller': u'0', u'type': u'virtio-serial', u'port': u'3'}}' found
Nov 20 10:52:05 belka vdsm vm.Vm WARNING vmId=`7e2b0467-72c9-49ca-9778-969e6d414060`::_readPauseCode unsupported by libvirt vm
Nov 20 10:52:27 belka vdsm vm.Vm ERROR vmId=`7e2b0467-72c9-49ca-9778-969e6d414060`::Unknown message type fqdn

vmId 7e2b0467-72c9-49ca-9778-969e6d414060 == VM name pviktori-z-ci-f19-test4

VDSM and libvirt are on the latest and greatest versions from RHEL 6.4.z.

Comment 25 Eduardo Warszawski 2013-11-20 11:32:37 UTC
(In reply to Barak from comment #23)
> Eduardo, Saggi,
> 
> Can you please take a look into the vdsm error in comment #20.
> 
> - What could be the problem in this specific use case ?
> - What are the implications on the task itself ? could it disappear from
>   reported tasks (by vdsm) ?

12:20:00,913::misc::83::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = '  device-mapper: remove ioctl on  failed: Device or resource busy\n  Unable to deactivate 8f72fe93--70a7--44cc--8566--7eec0c4bffea-925f6d7d--4feb--44d5--b352--ee19ed726a57 (253:25)\n'; <rc> = 5


The volume 8f72fe93-70a7-44cc-8566-7eec0c4bffea/925f6d7d-4feb-44d5-b352-ee19ed726a57 is still open when it is deactivated, which makes the operation fail.
We need to elucidate which process is holding an open fd on it.
The failed deactivation is forgiven, but this is a different error.

Comment 26 Barak 2013-11-20 15:37:11 UTC
(In reply to Petr Spacek from comment #16)
> Is seems fine, but it is a race condition, you know ...
> 
> Our workflow is like this:
> 1) Create VMs, install OS, create snapshot (you did that)
> 2) Run VM
> 3) Do some changes in the VM
> 4) Shutdown VM
> 5) Revert VM to the snapshort

When you say revert VM from snapshot,
- does it involve deletion of the snapshot? Does it include post-zero?
- or preview?


> 6) Go to step (2)
> 
> All this (except point 1) is done by script, so revent follows the shutdown
> in very short time and so on.

Meaning using the REST API?
Can you post the script?


> 
> Our hypervisor hosts 20-30 VMs at the same time.
> 
> Thank you for your time.

Comment 27 Petr Spacek 2013-11-20 15:57:18 UTC
Created attachment 826726 [details]
python reset-vm-to-snapshot pviktori-z-ci-host1 remote-host

We use the attached script for the reverts. It would be best to check the script itself, as I may mangle RHEV-M terminology.

The script author has nick "pviktori", so please direct any questions about the script to him.

We call the script like "python reset-vm-to-snapshot.py pviktori-z-ci-host1 remote-host".

Comment 28 Saggi Mizrahi 2013-11-24 10:24:38 UTC
The deactivation error means that a process has an open FD to the volume so it can't be deactivated.

Comment 29 Ravi Nori 2013-11-26 16:02:14 UTC
The underlying issue is the same as the issue in BZ 1015638.

We are running into a race condition when restoring a VM from a snapshot, and the fix for the above-mentioned bug also fixes the issue reported here.

*** This bug has been marked as a duplicate of bug 1015638 ***

Comment 30 Petr Spacek 2013-12-17 08:18:25 UTC
I confirm that rhevm-3.2.5-0.49.el6ev.src.rpm fixed the problem for us, so yes - it really is a duplicate.

