Bug 1327140 - Failed to MergeVDS, error = Drive image file could not be found, code = 13
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 3.6.4.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-3.6.6
Target Release: 3.6.6.2
Assignee: Ala Hino
QA Contact: Elad
 
Reported: 2016-04-14 10:31 UTC by sha
Modified: 2016-05-30 10:56 UTC
CC: 10 users

Last Closed: 2016-05-30 10:56:02 UTC
oVirt Team: Storage
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
rule-engine: devel_ack+
acanan: testing_ack+


Attachments
First attempt at deleting the snapshot (30.03 KB, text/plain)
2016-05-09 14:22 UTC, sha
Subsequent attempt at deleting the snapshot (6.82 KB, text/plain)
2016-05-09 14:23 UTC, sha
vdsm.log from host running VM (2.30 MB, application/x-xz)
2016-05-10 08:08 UTC, sha


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 56734 0 None MERGED core: Live Merge: Improve Live Merge Recovery Mechanism 2020-10-29 15:09:30 UTC

Description sha 2016-04-14 10:31:38 UTC
Description of problem:

Trying to delete snapshot fails with:
Failed to MergeVDS, error = Drive image file could not be found, code = 13

The snapshot has been marked illegal within oVirt. There are two snapshots with the same issue.

The hosted-engine was shut down by the HA agents during the snapshot deletion, which I think caused this.

Relevant engine log from the attempt to delete the snapshot:
2016-04-14 10:33:25,151 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (default task-30) [3b7df2d5] Lock Acquired to object 'EngineLock:{exclusiveLocks='[0210bd18-bdaa-4803-8250-756ac6aee1ed=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-04-14 10:33:25,539 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (default task-30) [3b7df2d5] Running command: RemoveSnapshotCommand internal: false. Entities affected :  ID: 0210bd18-bdaa-4803-8250-756ac6aee1ed Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2016-04-14 10:33:25,577 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (default task-30) [3b7df2d5] Lock freed to object 'EngineLock:{exclusiveLocks='[0210bd18-bdaa-4803-8250-756ac6aee1ed=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-04-14 10:33:25,748 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (pool-7-thread-10) [49d132f2] Running command: RemoveSnapshotSingleDiskLiveCommand internal: true. Entities affected :  ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2016-04-14 10:33:25,754 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-2) [] Waiting on Live Merge child commands to complete
2016-04-14 10:33:25,782 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-30) [] Correlation ID: 3b7df2d5, Job ID: 1ab716c6-cc14-4d36-8b4d-46265b8efea7, Call Stack: null, Custom Event ID: -1, Message: Snapshot '_backup_oliveri_201604050917' deletion for VM 'oliveri' was initiated by sha.uk.
2016-04-14 10:33:27,766 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-91) [49d132f2] Executing Live Merge command step 'EXTEND'
2016-04-14 10:33:27,836 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-91) [] Waiting on Live Merge child commands to complete
2016-04-14 10:33:27,887 INFO  [org.ovirt.engine.core.bll.MergeExtendCommand] (pool-7-thread-1) [5e5be0a0] Running command: MergeExtendCommand internal: true. Entities affected :  ID: 8d6835a5-7b8f-4cc7-a8bb-3a07926d522d Type: Storage
2016-04-14 10:33:27,889 INFO  [org.ovirt.engine.core.bll.MergeExtendCommand] (pool-7-thread-1) [5e5be0a0] Base and top image sizes are the same; no image size update required
2016-04-14 10:33:29,854 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-6) [49d132f2] Executing Live Merge command step 'MERGE'
2016-04-14 10:33:29,937 INFO  [org.ovirt.engine.core.bll.MergeCommand] (pool-7-thread-2) [7c3bd01b] Running command: MergeCommand internal: true. Entities affected :  ID: 8d6835a5-7b8f-4cc7-a8bb-3a07926d522d Type: Storage
2016-04-14 10:33:29,938 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-2) [7c3bd01b] START, MergeVDSCommand(HostName = balboai, MergeVDSCommandParameters:{runAsync='true', hostId='c2dbe050-8c48-4aad-b255-7d91aeb30281', vmId='0210bd18-bdaa-4803-8250-756ac6aee1ed', storagePoolId='00000002-0002-0002-0002-000000000183', storageDomainId='8d6835a5-7b8f-4cc7-a8bb-3a07926d522d', imageGroupId='06b8d8bb-344e-4222-b57c-e49adf97504c', imageId='139af7be-e99f-4f8e-a66f-6fd08a83a4b7', baseImageId='73d68cc8-65f2-4c5d-b8c2-a09fa2c342ab', topImageId='139af7be-e99f-4f8e-a66f-6fd08a83a4b7', bandwidth='0'}), log id: 3845c383
2016-04-14 10:33:29,966 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-2) [7c3bd01b] Failed in 'MergeVDS' method
2016-04-14 10:33:29,969 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-7-thread-2) [7c3bd01b] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM balboai command failed: Drive image file could not be found
2016-04-14 10:33:29,969 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-2) [7c3bd01b] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=13, message=Drive image file could not be found]]'
2016-04-14 10:33:29,969 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-2) [7c3bd01b] HostName = balboai
2016-04-14 10:33:29,969 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-2) [7c3bd01b] Command 'MergeVDSCommand(HostName = balboai, MergeVDSCommandParameters:{runAsync='true', hostId='c2dbe050-8c48-4aad-b255-7d91aeb30281', vmId='0210bd18-bdaa-4803-8250-756ac6aee1ed', storagePoolId='00000002-0002-0002-0002-000000000183', storageDomainId='8d6835a5-7b8f-4cc7-a8bb-3a07926d522d', imageGroupId='06b8d8bb-344e-4222-b57c-e49adf97504c', imageId='139af7be-e99f-4f8e-a66f-6fd08a83a4b7', baseImageId='73d68cc8-65f2-4c5d-b8c2-a09fa2c342ab', topImageId='139af7be-e99f-4f8e-a66f-6fd08a83a4b7', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13
2016-04-14 10:33:29,969 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-2) [7c3bd01b] FINISH, MergeVDSCommand, log id: 3845c383
2016-04-14 10:33:29,969 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-7-thread-2) [7c3bd01b] Command 'org.ovirt.engine.core.bll.MergeCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13 (Failed with error imageErr and code 13)
2016-04-14 10:33:29,986 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-7-thread-2) [7c3bd01b] Transaction rolled-back for command 'org.ovirt.engine.core.bll.MergeCommand'.
2016-04-14 10:33:30,926 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-44) [49d132f2] Failed child command status for step 'MERGE'
2016-04-14 10:33:31,986 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-100) [49d132f2] Merging of snapshot '75d5606d-9b74-4b26-b174-f0ac3547df71' images '73d68cc8-65f2-4c5d-b8c2-a09fa2c342ab'..'139af7be-e99f-4f8e-a66f-6fd08a83a4b7' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2016-04-14 10:33:32,025 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-100) [3b7df2d5] All Live Merge child commands have completed, status 'FAILED'
2016-04-14 10:33:33,109 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-94) [3b7df2d5] Ending command 'org.ovirt.engine.core.bll.RemoveSnapshotCommand' with failure.
2016-04-14 10:33:33,225 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-94) [3b7df2d5] Correlation ID: 3b7df2d5, Job ID: 1ab716c6-cc14-4d36-8b4d-46265b8efea7, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot '_backup_oliveri_201604050917' for VM 'oliveri'.

Ran the Python check-disk tool:
[root@balboai tmp]# python check-disk-info.py 
VM oliveri
    Disk 06b8d8bb-344e-4222-b57c-e49adf97504c (sd:8d6835a5-7b8f-4cc7-a8bb-3a07926d522d)
Warning: volume 139af7be-e99f-4f8e-a66f-6fd08a83a4b7 is in chain but illegal
    Volumes:
        73d68cc8-65f2-4c5d-b8c2-a09fa2c342ab
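The tool's warning above boils down to comparing the disk's volume chain against each volume's legality flag. A minimal sketch of that check in plain Python, with hypothetical data structures (the real check-disk-info.py reads vdsm metadata instead):

```python
# Sketch: flag volumes that are part of a disk's image chain but are
# marked ILLEGAL in metadata. The chain list and legality map here are
# hypothetical stand-ins for what the real tool reads from vdsm.

def find_illegal_in_chain(chain, legality):
    """chain: volume UUIDs, base first; legality: uuid -> 'LEGAL'/'ILLEGAL'."""
    return [vol for vol in chain if legality.get(vol) == "ILLEGAL"]

# Example mirroring the report: the top volume is still in the chain
# but was marked ILLEGAL by the failed live merge.
chain = [
    "73d68cc8-65f2-4c5d-b8c2-a09fa2c342ab",  # base
    "139af7be-e99f-4f8e-a66f-6fd08a83a4b7",  # top (illegal)
]
legality = {
    "73d68cc8-65f2-4c5d-b8c2-a09fa2c342ab": "LEGAL",
    "139af7be-e99f-4f8e-a66f-6fd08a83a4b7": "ILLEGAL",
}

for vol in find_illegal_in_chain(chain, legality):
    print("Warning: volume %s is in chain but illegal" % vol)
```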




Version-Release number of selected component (if applicable):

ovirt-engine.noarch  3.6.4.1-1.el6 

How reproducible:
Have only encountered it once.

Steps to Reproduce:
1.
2.
3.

Actual results:
Snapshot marked illegal.


Expected results:
Snapshot should have been deleted


Additional info:

Comment 1 Allon Mureinik 2016-04-14 14:10:52 UTC
Ala, seems like something that the patchset you're working on should also solve, no?

Comment 2 gregor 2016-04-19 17:50:23 UTC
Hi, I have the same problem here. However, the snapshot is not marked illegal; its status is "OK".

Environment:

CentOS 7.2.1511
kernel-3.10.0-327.13.1.el7.x86_64
ovirt-engine-3.6.4.1-1.el7.centos.noarch

...
2016-04-19 19:42:02,243 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-7-thread-2) [111550ed] Command 'org.ovirt.engine.core.bll.MergeCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13 (Failed with error imageErr and code 13)
...

Comment 3 gregor 2016-04-19 18:02:49 UTC
UPDATE: Now I see that the disks inside the snapshots are marked as illegal, for all snapshots.

Comment 4 gregor 2016-04-21 13:03:24 UTC
I solved the problem; these were my steps:

- power off all VMs
- reboot the host
- now all VMs should be off
- delete the snapshots; this took very, very long, but it finished and the snapshots were removed, and the VM is able to boot ;-)

Comment 5 Ala Hino 2016-05-02 11:52:38 UTC
(In reply to Allon Mureinik from comment #1)
> Ala, seems like something that the patchset you're working on should also
> solve, no?

Yes, this BZ is supposed to be fixed by https://gerrit.ovirt.org/56734.

Basically, when a volume becomes illegal as a result of a failure during live merge, the patch enables the user to recover from this state by running live merge again on the same volume.
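In rough terms, that recovery flow means a retried live merge should succeed both when it actually completes the merge and when the previous, interrupted merge had already finished on storage. A sketch of that idea, with hypothetical names and exceptions (not the engine's actual code from the patch):

```python
# Sketch of the recovery idea: retrying a live merge on an ILLEGAL
# volume either completes the merge or confirms it already happened.
# The exception and function names here are hypothetical.

class VolumeDoesNotExist(Exception):
    """Raised when the top volume was already merged away."""

def live_merge_with_recovery(do_merge, top_volume):
    try:
        do_merge(top_volume)
        return "merged"
    except VolumeDoesNotExist:
        # The previous, interrupted merge actually completed on storage;
        # treat the retry as successful and let the engine clean up metadata.
        return "already-merged"

# Simulated retry: the first merge completed on storage before the
# engine went down, so the volume is already gone.
def fake_merge(vol):
    raise VolumeDoesNotExist(vol)

print(live_merge_with_recovery(fake_merge, "139af7be-e99f-4f8e-a66f-6fd08a83a4b7"))
# -> already-merged
```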

Comment 6 sha 2016-05-09 14:21:40 UTC
Ala, is there a workaround in the meantime?

I just tried creating another snapshot and deleting it on a different VM and got the exact same issue. Except this time, it's being referenced in the python check_disk tool.

Alpheus currently has three snapshots listed under the "Snapshots" tab:
Current: OK (3273de02-507d-4917-a446-a18b0219f79a)
_backup_alpheus_201605091325: Illegal (98bcd787-8212-49fc-b029-24166eafe9ee)
Base CentOS7 install: OK (d8a02d4b-c92d-4d08-97a5-d155d9cd2347)

Output of check_disk.py:

VM alpheus
    Disk 4e5dac5e-6a41-49b3-99de-60ccf0c3a20a (sd:8d6835a5-7b8f-4cc7-a8bb-3a07926d522d)
    Volumes:
        d8a02d4b-c92d-4d08-97a5-d155d9cd2347
        98bcd787-8212-49fc-b029-24166eafe9ee

I've attached two logs: inital_snapshot-deletion.txt is the first attempt at deleting the snapshot, and snapshot-deletion.txt is a subsequent attempt at deleting the snapshot.

Comment 7 sha 2016-05-09 14:22:22 UTC
Created attachment 1155351 [details]
First attempt at deleting the snapshot

Comment 8 sha 2016-05-09 14:23:09 UTC
Created attachment 1155352 [details]
Subsequent attempt at deleting the snapshot

Comment 9 Ala Hino 2016-05-10 06:47:40 UTC
From the initial log, it seems that the volume was not removed from the chain:

2016-05-09 14:03:18,508 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (pool-7-thread-1) [4b362aae] Failed to live merge, still in volume chain: [3273de02-507d-4917-a446-a18b0219f79a, 98bcd787-8212-49fc-b029-24166eafe9ee]
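The check behind that message is essentially chain membership: after a merge, the top image must no longer appear in the volume chain reported by vdsm. A sketch of that decision with a hypothetical helper (not MergeStatusCommand itself):

```python
# Sketch: a live merge is considered complete only when the merged (top)
# volume has disappeared from the chain reported by vdsm.

def merge_completed(reported_chain, top_volume):
    return top_volume not in reported_chain

# The chain from the log still contains the volume being merged, so the
# engine reports "Failed to live merge, still in volume chain".
chain = ["3273de02-507d-4917-a446-a18b0219f79a",
         "98bcd787-8212-49fc-b029-24166eafe9ee"]
print(merge_completed(chain, "98bcd787-8212-49fc-b029-24166eafe9ee"))  # -> False
```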

Can you please attach the vdsm log? The log will help understand why the second merge failed too.

BTW, are you using a build that includes https://gerrit.ovirt.org/56734 patch? The patch is in 3.6.6. 

Please note that recovery by attempting live merge again is only supported in 3.6.6 and later versions.

Comment 10 sha 2016-05-10 08:08:27 UTC
Created attachment 1155610 [details]
vdsm.log from host running VM

Comment 11 sha 2016-05-10 08:11:43 UTC
I've attached the VDSM log from around the time of the two merges.

It looks like I'm only running 3.6.5.3-1.el6. Does this mean live merge does not work for this version? Is this a regression (I'm sure it used to work)?

Ovirt packages installed on the engine are:
[root@ovirt-hosted-00 ~]# rpm -qa ovirt-*                                                                                                                                                                                                     
ovirt-engine-setup-plugin-vmconsole-proxy-helper-3.6.5.3-1.el6.noarch
ovirt-engine-extensions-api-impl-3.6.5.3-1.el6.noarch
ovirt-engine-webadmin-portal-3.6.5.3-1.el6.noarch
ovirt-engine-dbscripts-3.6.5.3-1.el6.noarch
ovirt-engine-wildfly-8.2.1-1.el6.x86_64
ovirt-vmconsole-proxy-1.0.0-1.el6.noarch
ovirt-guest-agent-1.0.11-1.el6.noarch
ovirt-engine-lib-3.6.5.3-1.el6.noarch
ovirt-setup-lib-1.0.1-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-3.6.5.3-1.el6.noarch
ovirt-engine-setup-plugin-ovirt-engine-3.6.5.3-1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.6-1.el6.noarch
ovirt-engine-setup-plugin-websocket-proxy-3.6.5.3-1.el6.noarch
ovirt-engine-vmconsole-proxy-helper-3.6.5.3-1.el6.noarch
ovirt-engine-tools-backup-3.6.5.3-1.el6.noarch
ovirt-engine-userportal-3.6.5.3-1.el6.noarch
ovirt-engine-restapi-3.6.5.3-1.el6.noarch
ovirt-engine-backend-3.6.5.3-1.el6.noarch
ovirt-engine-jboss-as-7.1.1-1.el6.x86_64
ovirt-engine-cli-3.6.2.0-1.el6.noarch
ovirt-image-uploader-3.6.0-1.el6.noarch
ovirt-vmconsole-1.0.0-1.el6.noarch
ovirt-host-deploy-java-1.4.1-1.el6.noarch
ovirt-engine-wildfly-overlay-8.0.5-1.el6.noarch
ovirt-release36-007-1.noarch
ovirt-engine-setup-base-3.6.5.3-1.el6.noarch
ovirt-engine-setup-3.6.5.3-1.el6.noarch
ovirt-engine-websocket-proxy-3.6.5.3-1.el6.noarch
ovirt-engine-sdk-python-3.6.3.0-1.el6.noarch
ovirt-engine-tools-3.6.5.3-1.el6.noarch
ovirt-iso-uploader-3.6.0-1.el6.noarch
ovirt-engine-3.6.5.3-1.el6.noarch
ovirt-host-deploy-1.4.1-1.el6.noarch

Comment 12 Ala Hino 2016-05-10 08:26:23 UTC
Live merge works in 3.6.5; there is no regression here.
We enhanced the live merge recovery mechanism in 3.6.6 to enable users to recover from cases where volumes become ILLEGAL by attempting live merge again. If there is a "real" storage issue, re-attempting live merge will not help.

I will look at vdsm log now and see what I can find.

Comment 13 Ori Gofen 2016-05-17 12:06:35 UTC
Does the bug verification require a hosted-engine environment?

Comment 14 Ala Hino 2016-05-17 12:16:47 UTC
Were you able to reproduce the issue without the fix? If so, I'd recommend verifying that the issue is resolved with the fix.

Comment 15 Ori Gofen 2016-05-18 16:41:22 UTC
Ala, I have verified this on my environment. I was asking for your best judgment: should this be enough, or should the verification include reproducing the illegal snapshot as a result of a hosted-engine shutdown?

Comment 16 Ala Hino 2016-05-18 18:31:59 UTC
As the bug was originally encountered in a hosted-engine environment, I do recommend verifying the fix on a similar setup.

Comment 17 Elad 2016-05-19 16:35:35 UTC
Tested the following:
- Deployed hosted-engine over 2 hosts
- Created domain, VM, live snapshot
- Started a live merge and, during the operation, took down the hosted-engine. While the hosted-engine was down and the deleteImage task was marked as finished on the SPM, I cleared the task (vdsClient -s 0 clearTask)
- Engine started, the snapshot was marked as ILLEGAL
- Tried to live merge the same snapshot again, got [1] in vdsm and the snapshot got deleted successfully



[1]
jsonrpc.Executor/5::ERROR::2016-05-19 19:21:21,447::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': "Volume does not exist: (u'0eae32f6-0952-43d9-8216-95d23590de12',)", 'code': 201}}


Verified using:
vdsm-4.17.28-0.el7ev.noarch
rhevm-3.6.6.2-0.1.el6.noarch
ovirt-hosted-engine-setup-1.3.6.1-1.el7ev.noarch

Comment 18 Allon Mureinik 2016-05-22 11:46:02 UTC
Ala, is there anything we need to document here, or is it documented elsewhere?
Please either provide the doctext, or comment with the BZ tracking the doc text and set requires-doctext-.

Comment 19 Ala Hino 2016-05-22 11:50:04 UTC
BZ 1323629 documents the behavior

Comment 20 sha 2016-05-24 08:27:00 UTC
Did this fix make the 3.6.6 (2016-05-23) release?

Comment 21 Ala Hino 2016-05-24 08:38:58 UTC
Yes it did

