Bug 1349950 - Snapshot deletion failed: Merge failed
Summary: Snapshot deletion failed: Merge failed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.6
Hardware: x86_64
OS: Linux
Target Milestone: ovirt-3.6.7
Assignee: Ala Hino
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-06-24 15:47 UTC by nicolas
Modified: 2016-07-04 12:30 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-04 12:30:19 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
rule-engine: devel_ack+
rule-engine: testing_ack+


Attachments
Engine + SPM VDSM logs (29.22 KB, application/x-gzip)
2016-06-24 15:50 UTC, nicolas
new logs (1.32 MB, application/x-gzip)
2016-06-27 08:50 UTC, Raz Tamir

Description nicolas 2016-06-24 15:47:18 UTC
We're trying to delete an auto-generated live snapshot that was created after migrating an online VM's storage to a different domain. The oVirt version is 3.6.6 and the VDSM version is 4.17.28. The most relevant log lines are:

   2016-06-24 07:50:36,252 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-10) [799a22e3] Failed in 'MergeVDS' method
   2016-06-24 07:50:36,256 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-7-thread-10) [799a22e3] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host2.domain.com command failed: Merge failed
   2016-06-24 07:50:36,256 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-7-thread-10) [799a22e3] Command 'MergeVDSCommand(HostName = host2.domain.com, MergeVDSCommandParameters:{runAsync='true', hostId='c31dca1a-e5bc-43f6-940f-6397e3ddbee4', vmId='7083832a-a1a2-42b7-961f-2e9c0dcd7e18', storagePoolId='fa155d43-4e68-486f-9f9d-ae3e3916cc4f', storageDomainId='9339780c-3667-4fef-aa13-9bec08957c5f', imageGroupId='65a0b0d4-5c96-4dd9-a31b-4d08e40a46a5', imageId='9eec9e8f-38db-4abf-b1c4-92fa9383f8b1', baseImageId='568b2f77-0ddf-4349-a45c-36fcb0edecb6', topImageId='9eec9e8f-38db-4abf-b1c4-92fa9383f8b1', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52
   2016-06-24 07:50:36,256 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-7-thread-10) [799a22e3] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Merge failed, code = 52 (Failed with error mergeErr and code 52)

This part also seems relevant:

   jsonrpc.Executor/3::ERROR::2016-06-24 07:50:36,216::vm::4955::virt.vm::(merge) vmId=`7083832a-a1a2-42b7-961f-2e9c0dcd7e18`::Live merge failed (job: 3ea68e36-6d99-4af9-a54e-4c5b06df6a0f)
Traceback (most recent call last):
     File "/usr/share/vdsm/virt/vm.py", line 4951, in merge
    flags)
     File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
     File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
     File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
     File "/usr/lib64/python2.7/site-packages/libvirt.py", line 668, in blockCommit
       if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
     libvirtError: block copy still active: disk 'vda' already in active block job

I'm attaching relevant logs (both for ovirt-engine and SPM's vdsm).

As per Nir Soffer in the 'users' list:

---
This means there was a previous attempt to merge, and the block job
did not complete yet.

Engine should detect that there is an active block job and wait until
it completes.

This is probably an engine bug; please file a bug and attach complete
vdsm and engine logs.
---

If you need more info or the full logs, don't hesitate to ask.
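The wait-then-retry behavior Nir describes could be sketched as follows. This is a minimal illustration, not engine code: `start_merge` and `has_active_block_job` are hypothetical callables standing in for the engine's MergeVDS call and a libvirt block-job query (e.g. virDomainGetBlockJobInfo).

```python
import time


def retry_merge_when_job_clears(start_merge, has_active_block_job,
                                poll_interval=1.0, max_wait=300.0):
    """Wait until no block job is active on the disk, then start the merge.

    Both arguments are hypothetical callables used for illustration only;
    they are not oVirt or libvirt APIs.
    """
    waited = 0.0
    while has_active_block_job():
        if waited >= max_wait:
            raise TimeoutError(
                "block job still active after %.0f s" % max_wait)
        time.sleep(poll_interval)
        waited += poll_interval
    # No active block job remains; it is now safe to issue the merge.
    return start_merge()
```

With such a check in place, the retry would no longer race against a still-running blockCommit and hit "disk 'vda' already in active block job".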

Comment 1 nicolas 2016-06-24 15:50:15 UTC
Created attachment 1171987 [details]
Engine + SPM VDSM logs

Comment 2 Oved Ourfali 2016-06-24 17:12:21 UTC
Nir, is that infra or storage?

Comment 3 Moran Goldboim 2016-06-26 08:49:49 UTC
dup of Bug 1344479 - Live Merge completes but the 'job' table entry in the database is not marked as finished ?

Comment 4 Allon Mureinik 2016-06-26 12:11:52 UTC
(In reply to Moran Goldboim from comment #3)
> dup of Bug 1344479 - Live Merge completes but the 'job' table entry in the
> database is not marked as finished ?
No.

Comment 5 Ala Hino 2016-06-26 17:49:12 UTC
This BZ is related to the live merge recovery work done in 3.6.6, which enables the user to retry a live merge after failures. See BZ 1323629 for documentation.

I am moving this BZ to ON_QA so that QA can keep me honest here.
I would like to verify the following:

1. Prepare a long live merge, i.e. copy a large amount of data
2. Create a snapshot
3. Delete that snapshot (while the VM is up)
4. Stop vdsm while the merge is running
5. Wait for the merge to complete (with an error) on the engine
6. Start vdsm
7. Retry the merge
8. Assuming the job is still running, the merge will fail
9. Wait for the job to complete
10. Retry the merge; it should succeed

Comment 6 Raz Tamir 2016-06-27 08:49:48 UTC
Hi Ala,
I followed the steps you provided and got an ERROR in the engine log:

2016-06-27 11:43:31,057 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-2) [5c46b46f] Failed in 'MergeVDS' method
2016-06-27 11:43:31,066 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-5-thread-2) [5c46b46f] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_3 command failed: Drive image file could not be found
2016-06-27 11:43:31,066 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-2) [5c46b46f] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturnForXmlRpc [status=StatusForXmlRpc [code=13, message=Drive image file could not be found]]'
2016-06-27 11:43:31,067 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-2) [5c46b46f] HostName = host_mixed_3
2016-06-27 11:43:31,068 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-2) [5c46b46f] Command 'MergeVDSCommand(HostName = host_mixed_3, MergeVDSCommandParameters:{runAsync='true', hostId='f5b87851-6c15-4552-b349-cc4c297bc30b', vmId='8ee7b94f-b1a3-4f54-9794-37e8259371fe', storagePoolId='7a2d027d-f58b-4645-8339-99c324fcfcd6', storageDomainId='a3f7bd63-2ffd-40e5-be72-d44ea1534cb1', imageGroupId='4b97a780-5529-4187-aaa7-564b14c00cb6', imageId='0cb90c2e-3a2f-4000-9432-d82a3aab97cb', baseImageId='8c63eae4-43c5-4cb9-9242-cd6634571c6c', topImageId='0cb90c2e-3a2f-4000-9432-d82a3aab97cb', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13
2016-06-27 11:43:31,070 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-2) [5c46b46f] FINISH, MergeVDSCommand, log id: 450cd3bb
2016-06-27 11:43:31,071 ERROR [org.ovirt.engine.core.bll.MergeCommand] (pool-5-thread-2) [5c46b46f] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = Drive image file could not be found, code = 13 (Failed with error imageErr and code 13)

And from vdsm.log:

jsonrpc.Executor/5::ERROR::2016-06-27 11:45:34,199::task::866::Storage.TaskManager.Task::(_setError) Task=`f445fe59-5476-48db-87e8-130497810da9`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3162, in getVolumeInfo
    volUUID=volUUID).getInfo()
  File "/usr/share/vdsm/storage/sd.py", line 457, in produceVolume
    volUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 80, in __init__
    volume.Volume.__init__(self, repoPath, sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/volume.py", line 187, in __init__
    self.validate()
  File "/usr/share/vdsm/storage/blockVolume.py", line 88, in validate
    raise se.VolumeDoesNotExist(self.volUUID)
VolumeDoesNotExist: Volume does not exist: ('0cb90c2e-3a2f-4000-9432-d82a3aab97cb',)


new logs attached

Comment 7 Red Hat Bugzilla Rules Engine 2016-06-27 08:49:54 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 8 Raz Tamir 2016-06-27 08:50:11 UTC
Created attachment 1172802 [details]
new logs

Comment 9 Ala Hino 2016-06-27 12:39:46 UTC
Hi Raz,

This exception is expected. Here is what happened:
We started a merge (of snapshot 'snap' deletion for VM 'test_vm') that failed. Then we retried the merge, but meanwhile the merge job *completed* (on the storage/vdsm side) and we got that imageErr (indicating the base image doesn't exist). The engine in 3.6.6 handles that error by continuing the flow, and as can be seen in the log, the snapshot was removed:

2016-06-27 11:45:49,192 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-1) [41b38968] Correlation ID: 41b38968, Job ID: ca381f80-90f9-414f-9701-843c31e5bb1d, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'snap' deletion for VM 'test_vm' has been completed.

Can you please confirm that the snapshot was indeed removed after retrying the live merge?
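The recovery behavior Ala describes could be sketched like this. It is a simplified illustration, not the actual engine code: `ImageDoesNotExistError` stands in for VDSM's imageErr / VolumeDoesNotExist (code 13), and `send_merge` is a hypothetical callable standing in for MergeVDSCommand.

```python
class ImageDoesNotExistError(Exception):
    """Stand-in for VDSM's imageErr / VolumeDoesNotExist (code 13)."""


def retry_live_merge(send_merge, top_volume):
    """Retry a live merge and tolerate an already-completed job.

    If VDSM reports that the top volume no longer exists, the earlier
    merge already finished and deleted it, so the retry is treated as
    success and the snapshot-deletion flow continues instead of failing.
    `send_merge` is a hypothetical callable used for illustration.
    """
    try:
        send_merge(top_volume)
    except ImageDoesNotExistError:
        # The volume was merged away by the completed job:
        # proceed with the flow rather than reporting a failure.
        return "already-merged"
    return "merged"
```

This matches the log above: the retry hit "Drive image file could not be found" (code 13), yet the engine continued and reported "Snapshot 'snap' deletion for VM 'test_vm' has been completed."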

Comment 10 Raz Tamir 2016-06-27 12:45:25 UTC
Yes, the snapshot was eventually removed.

Comment 11 Raz Tamir 2016-06-27 13:28:13 UTC
According to comment #9, moving to VERIFIED.

