1548436 – [downstream clone - 4.1.10] Cold merge still fails if the base qcow2 image reports leaked cluster

Bug 1548436 - [downstream clone - 4.1.10] Cold merge still fails if the base qcow2 image reports leaked cluster

Summary: [downstream clone - 4.1.10] Cold merge still fails if the base qcow2 image re...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	4.1.9
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.1.10
Target Release:	---
Assignee:	Ala Hino
QA Contact:	Kevin Alon Goldblatt
Docs Contact:
URL:
Whiteboard:
Depends On:	1545708
Blocks:
TreeView+	depends on / blocked

Reported:	2018-02-23 14:08 UTC by RHV bug bot
Modified:	2023-09-07 19:04 UTC (History)
CC List:	17 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	This update fixes an issue where cold merges would fail if there was a leaked cluster in a base image. Now leaked clusters are identified and handled to ensure that cold merges will succeed.
Clone Of:	1545708
Environment:
Last Closed:	2018-03-20 16:38:36 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:
Flags:	lsvaty: testing_plan_complete-

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3218271	None	None	None	2018-02-23 14:09:19 UTC
Red Hat Product Errata	RHEA-2018:0563	None	None	None	2018-03-20 16:39:43 UTC
oVirt gerrit	87817	ovirt-4.1	MERGED	qemuimg: Except the correct error in check helper	2020-11-10 07:05:44 UTC

Description RHV bug bot 2018-02-23 14:08:52 UTC

+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1545708 +++
======================================================================

Description of problem:

As per the bug 1420405, we are using qemu-img check to find the actual size of the volume. However, if the base image is having leaked cluster, then the qemu-img check will be having a return value of 3.

# qemu-img check /rhev/data-center/00000001-0001-0001-0001-000000000311/8257cf14-d88d-4e4e-998c-9f8976dac2a2/images/7455de38-1df1-4acd-b07c-9dc2138aafb3/be4a4d85-d7e6-4725-b7f5-90c9d935c336  ; echo $?
200 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
109461/180224 = 60.74% allocated, 15.29% fragmented, 0.00% compressed clusters
Image end offset: 7188578304
3

Anything other than 0 exit code will fail with below error in the vdsm side during cold merge.


The cold merge will make the image illegal and the user will not be able to start the VM because of illegal legality in the metadata. We need manual changes to recover the VM.


Version-Release number of selected component (if applicable):

How reproducible:
somewhat reproducible

Steps to Reproduce:

1. Create a thin provisioned disk in RHV-M and assign to a VM.

2. Kill the qemu-kvm process while we write the data from the VM. While writing the data from the VM, the qcow2 will allocate new cluster and killing the process at this time will cause leaked cluster in the image.

3. Create a snapshot for the VM.

4. Do a cold merge.


Actual results:

Cold merge will fail if there is leaked cluster in the base image

Expected results:

Cold merge should work if the base image is having leaked cluster.

Additional info:


audit_log:

 2018-01-25 19:19:42.971+01    |    10802 |                                      | rhev-2 | VDSM rhev-2 command HSMGetAllTasksStatusesVDS failed: cmd=['/usr/bin/qemu-img', 'check', '--output', 'json', '-f', 'qcow2', '/rhev/data-center/mnt/blockSD/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/images/yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz'], ecode=3, stdout={                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                               |          |                                      |                        |     "image-end-offset": 87419584512,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       +
                               |          |                                      |                        |     "total-clusters": 3276800,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             +
                               |          |                                      |                        |     "check-errors": 0,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     +
                               |          |                                      |                        |     "leaks": 1111,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         +
                               |          |                                      |                        |     "leaks-fixed": 0,      
+
                               |          |                                      |                        |     "allocated-clusters": 1332537,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         +
                               |          |                                      |                        |     "filename": "/rhev/data-center/mnt/blockSD/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/images/yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     +
                               |          |                                      |                        |     "format": "qcow2",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     +
                               |          |                                      |                        |     "fragmented-clusters": 233541                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                               |          |                                      |                        | }                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +
                               |          |                                      |                        | , stderr=Leaked cluster 475133 refcount=1 reference=0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +
                               |          |                                      |                        | Leaked cluster 475134 refcount=1 reference=0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               +
                               |          |                                      |                        | Leaked cluster 475135 refcount=1 reference=0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               +

[...each leaked cluster reported, output manually truncated...]
      
                               |          |                                      |                        | Leaked cluster 1163606 refcount=1 reference=0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +
                               |          |                                      |                        | , message=None
 2018-01-25 19:19:47.67+01     |      357 | a1cde685-520a-449e-bd72-5a01a90a1eeb |                        | Failed to delete snapshot 'snapshot-name' for VM 'vm-name'.


vdsm log:

2018-01-25 19:19:37,674+0100 ERROR (tasks/5) [storage.TaskManager.Task] (Task='a536102b-a17a-4281-a948-214f246322c2') Unexpected error (task:872)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 879, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1892, in finalizeMerge
    merge.finalize(subchainInfo)
  File "/usr/share/vdsm/storage/merge.py", line 271, in finalize
    optimal_size = subchain.base_vol.optimal_size()
  File "/usr/share/vdsm/storage/blockVolume.py", line 440, in optimal_size
    check = qemuimg.check(self.getVolumePath(), qemuimg.FORMAT.QCOW2)
  File "/usr/lib/python2.7/site-packages/vdsm/qemuimg.py", line 157, in check
    out = _run_cmd(cmd)
  File "/usr/lib/python2.7/site-packages/vdsm/qemuimg.py", line 426, in _run_cmd
    raise QImgError(cmd, rc, out, err)
QImgError: cmd=['/usr/bin/qemu-img', 'check', '--output', 'json', '-f', 'qcow2', '/rhev/data-center/mnt/blockSD/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/images/yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz'], ecode=3, stdout={
    "image-end-offset": 87419584512,
    "total-clusters": 3276800,
    "check-errors": 0,
    "leaks": 1111,
    "leaks-fixed": 0,
    "allocated-clusters": 1332537,
    "filename": "/rhev/data-center/mnt/blockSD/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/images/yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz",
    "format": "qcow2",
    "fragmented-clusters": 233541
}
, stderr=Leaked cluster 475133 refcount=1 reference=0
Leaked cluster 475134 refcount=1 reference=0
Leaked cluster 475135 refcount=1 reference=0
[...]
Leaked cluster 1163605 refcount=1 reference=0
Leaked cluster 1163606 refcount=1 reference=0
, message=None
2018-01-25 19:19:37,739+0100 INFO  (tasks/5) [storage.ThreadPool.WorkerThread] FINISH task a536102b-a17a-4281-a948-214f246322c2 (threadPool:210)


engine.log:

2018-01-25 19:19:45,058+01 ERROR [org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand] (DefaultQuartzScheduler5) [a1cde685-520a-449e-bd72-5a01a90a1eeb] Command 'ColdMergeSnapshotSingleDisk' id '342ebd0a-2bb7-4d36-81cd-57e580c2b2c1' failed executing step 'FINALIZE_MERGE'
2018-01-25 19:19:45,058+01 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler5) [a1cde685-520a-449e-bd72-5a01a90a1eeb] Command 'ColdMergeSnapshotSingleDisk' id: '342ebd0a-2bb7-4d36-81cd-57e580c2b2c1' child commands '[99ccc8e0-8890-4971-936a-440cf3355de7, 5ec4be26-7bff-4cab-8da6-254eebb0dcc8, 02d0e273-8ea6-424c-aa31-bf04f0fb009c]' executions were completed, status 'FAILED'
2018-01-25 19:19:46,137+01 ERROR [org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand] (DefaultQuartzScheduler3) [a1cde685-520a-449e-bd72-5a01a90a1eeb] Ending command 'org.ovirt.engine.core.bll.snapshots.ColdMergeSnapshotSingleDiskCommand' with failure.
2018-01-25 19:19:46,279+01 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler3) [a1cde685-520a-449e-bd72-5a01a90a1eeb] Command 'RemoveSnapshot' id: 'e81b54ec-9211-4740-b350-913bbb7af5c7' child commands '[10ba0892-6a91-4ad5-a2f4-840dbd88166d, 342ebd0a-2bb7-4d36-81cd-57e580c2b2c1, 8abe62ee-9cce-4fca-9205-cfacef4f481e, 57ece9d8-2598-4a57-973d-e4fafde95a74, dd1ab67d-98d6-4c87-8979-9ce8846e2ba9]' executions were completed, status 'FAILED'
2018-01-25 19:19:47,576+01 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (DefaultQuartzScheduler2) [a1cde685-520a-449e-bd72-5a01a90a1eeb] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand' with failure.
2018-01-25 19:19:47,693+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [a1cde685-520a-449e-bd72-5a01a90a1eeb] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_FAILURE(357), Correlation ID: a1cde685-520a-449e-bd72-5a01a90a1eeb, Job ID: d571ed23-e270-4d1d-9ad0-54e29d65aa60, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Failed to delete snapshot 'snapshot-name' for VM 'vm-name'.


Looks like regression, adding relevant tag

--
Kind Regards,
Igor Netkachev
Technical Support Engineer, RHCE, RHCVA
Red Hat Global Support Services

(Originally by Igor Netkachev)

Comment 4 RHV bug bot 2018-02-23 14:09:05 UTC

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.1.z': '?'}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.1.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 5 RHV bug bot 2018-02-23 14:09:09 UTC

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.1.z': '?'}', ]

For more info please contact: rhv-devops: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.1.z': '?'}', ]

For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)

Comment 7 Kevin Alon Goldblatt 2018-03-12 09:50:16 UTC

Verified with the following code:
--------------------------------------
ovirt-engine-4.1.10.1-0.1.el7.noarch
vdsm-4.19.48-1.el7ev.x86_64

Verified with the following scenario:
-------------------------------------
Steps to Reproduce:
1. create a vm with think proision disk
2. write data to the disk, any data
3. while writing, kill qemu-kvm process on the host - the VM goes down
4. create a snapshot
5. delete/merge the parent of the created snapshot
Cold merge succeed without errors

Moving to VERIFIED!

Comment 10 errata-xmlrpc 2018-03-20 16:38:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0563

Comment 11 Franta Kust 2019-05-16 13:03:08 UTC

BZ<2>Jira Resync

Comment 12 Daniel Gur 2019-08-28 13:11:34 UTC

sync2jira

Comment 13 Daniel Gur 2019-08-28 13:15:46 UTC

sync2jira

Note You need to log in before you can comment on or make changes to this bug.