Bug 1585950 - [downstream clone - 4.2.8] Live Merge failed on engine with "still in volume chain", but merge on host was successful [NEEDINFO]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.9
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.8
Assignee: Eyal Shenitzky
QA Contact: Avihai
URL:
Whiteboard:
Depends On: 1554369
Blocks:
 
Reported: 2018-06-05 07:38 UTC by RHV bug bot
Modified: 2021-09-09 14:24 UTC
CC List: 25 users

Fixed In Version: ovirt-engine-4.2.4.4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1554369
Environment:
Last Closed: 2019-01-25 12:50:23 UTC
oVirt Team: Storage
Target Upstream Version:
jspanko: needinfo?


Attachments
engine log (508.17 KB, application/x-gzip) - 2018-11-22 13:05 UTC, Avihai


Links
Red Hat Knowledge Base (Solution) 3378361 - 2018-06-05 07:40:22 UTC
Red Hat Product Errata RHSA-2018:2071 - 2018-06-27 10:03:32 UTC
oVirt gerrit 91841 - MERGED - core: Improve MergeStatusCommand - 2021-02-08 15:42:11 UTC
oVirt gerrit 91930 - MERGED - core: Improve MergeStatusCommand - 2021-02-08 15:42:11 UTC
oVirt gerrit 95085 - MERGED - core: fix race in VmJobs monitoring - 2021-02-08 15:42:12 UTC

Description RHV bug bot 2018-06-05 07:38:01 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1554369 +++
======================================================================

Description of problem:

An internal Live Merge was performed on a disk. The merge on the host appeared to have been successful: the top volume was unlinked, so the qemu image chain and the volume metadata chain matched and no longer contained the volume in question. The XML and the open files of the 'qemu-kvm' process also no longer contained this volume. So, on the host and storage side, the top volume had been merged back and was just waiting to be removed.

However, the engine reported that the top volume was still in the volume chain and the merge operation terminated.

All subsequent snapshot deletions for this disk failed due to the failure above.
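
For illustration only (not part of the original report): a minimal sketch of how the on-host qemu chain can be inspected to confirm the volume is gone. The image path and volume ID are placeholders, and on newer qemu versions 'qemu-img info' may need -U/--force-share for an image that is in use by a running VM:

import json
import subprocess

def qemu_chain(active_image_path):
    # 'qemu-img info --backing-chain --output=json' prints one JSON entry per
    # image in the chain, starting from the active layer.
    out = subprocess.check_output(
        ['qemu-img', 'info', '--backing-chain', '--output=json', active_image_path])
    return [entry['filename'] for entry in json.loads(out)]

chain = qemu_chain('<path-to-active-volume>')
print('merged top volume still in qemu chain:',
      any('<top-volume-id>' in path for path in chain))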



Version-Release number of selected component (if applicable):

RHV 4.1.9
RHEL 7.4 host;
  vdsm-4.19.43-3.el7ev.x86_64 

How reproducible:

Not reproducible.


Steps to Reproduce:
1.
2.
3.

Actual results:

The live merge failed and terminated, resulting in subsequent live merge failures.


Expected results:

The merge should have succeeded on the engine.

Additional info:

(Originally by Gordon Watson)

Comment 6 RHV bug bot 2018-06-05 07:38:32 UTC
Ala - please take a look.

I'm tentatively targeting 4.2.2, just because it's quite late to get anything into 4.1.10.
If we *do* find a quick [and safe!] fix, this should definitely be a candidate for 4.1.z.

(Originally by amureini)

Comment 14 RHV bug bot 2018-06-05 07:39:18 UTC
Moving to 4.2.4 for now, as the issue isn't reproducible and needs more analysis.

(Originally by Ala Hino)

Comment 17 RHV bug bot 2018-06-05 07:39:34 UTC
Evelina, please take a look

(Originally by Elad Ben Aharon)

Comment 18 RHV bug bot 2018-06-05 07:39:39 UTC
Waiting for customer's response.

(Originally by Evelina Shames)

Comment 19 RHV bug bot 2018-06-05 07:39:44 UTC
(In reply to Evelina Shames from comment #17)
> Waiting for customer's response.

What response are we waiting for?
The customer provided the script they use, and the request was to attempt to reproduce the bug on a 4.2 env.

What am I missing here?

(Originally by amureini)

Comment 20 RHV bug bot 2018-06-05 07:39:49 UTC
Ala and I had a few questions about the script; Gordon sent them to the customer.

(Originally by Evelina Shames)

Comment 22 RHV bug bot 2018-06-05 07:39:59 UTC
Hi Evelina,

Let's try to come up with a script on our side that simulates what the customer is doing:

1. Create a VM
2. Create a snapshot
3. Delete up to N snapshots. When deleting a snapshot, check that the status of all snapshots is OK before proceeding to the next deletion

Pseudo code:

vm = _create_vm()
_create_snapshot(vm)
# if there are X snapshots and X > N, _list_vm_snapshots returns X - N snapshots
snapshots = _list_vm_snapshots(vm.id)
for s in snapshots:
    _delete_snapshot(s)
    # poll until the deletion completes (a short sleep between checks would
    # avoid hammering the engine)
    _del_completed = _check_snapshot_status(vm)
    while not _del_completed:
        _del_completed = _check_snapshot_status(vm)

As a reference, please see listSnapsToDelete and checkSnapStateOK in the customer's script.

Let me know if you need any help with the script.
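
For what it's worth, a rough, untested sketch of the same flow using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials and VM name below are placeholders:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm1')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

# Delete every non-active snapshot, waiting for each removal to complete
# before starting the next one (mirrors checkSnapStateOK in the customer script)
for snap in snapshots_service.list():
    if snap.snapshot_type == types.SnapshotType.ACTIVE:
        continue
    snapshots_service.snapshot_service(snap.id).remove()
    while any(s.snapshot_status != types.SnapshotStatus.OK
              for s in snapshots_service.list()):
        time.sleep(10)

connection.close()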

(Originally by Ala Hino)

Comment 23 RHV bug bot 2018-06-05 07:40:05 UTC
This is happening very frequently when the customer uses Commvault.

Just reviewed the 3 cases attached above; they happened with:
rhvm-4.2.3.5-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64

The engine sees the volume as still in the chain, but everything was fine on the host.

A retry fixed the issue.

(Originally by Germano Veit Michel)

Comment 29 Elad 2018-06-20 10:04:38 UTC
Examined our latest automation executions for the 4.2.4-5 build. The snapshot removal tests passed.


Used:
rhvm-4.2.4.4-0.1.el7_3.noarch
vdsm-4.20.31-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

Comment 32 errata-xmlrpc 2018-06-27 10:02:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2071

Comment 45 Eyal Shenitzky 2018-10-11 09:38:03 UTC
Managed to reproduce this bug with the following steps:

On a 4.3 engine with a cluster level < 4.2:

1) Create vm1 with a disk
2) Create backup_vm with a disk
3) Create a snapshot ('snap1') of vm1
4) Run vm1 and backup_vm
5) Attach snap1 to backup_vm (see the sketch after the error below)
6) Power off backup_vm
7) Remove backup_vm
8) Remove snap1

Live merge failed with the following error - 
2018-10-11 11:56:48,883+03 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [c1dce6ca-df36-4e6f-a5f1-e3cc063ccf6a] Failed to live merge. Top volume e9ffc75b-caca-4188-aa3f-a983ae2554d1 is still in qemu chain [ca9223cb-960a-4da7-8c6e-6166b269c813, e9ffc75b-caca-4188-aa3f-a983ae2554d1]
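
For reference, a rough sketch of step 5 (attaching the snapshot's disk to backup_vm) using the ovirtsdk4 Python SDK; this is not the exact script used above, and the engine URL, credentials and names are placeholders:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()
vm1 = vms_service.list(search='name=vm1')[0]
backup_vm = vms_service.list(search='name=backup_vm')[0]

# Locate 'snap1' on vm1 and list the disks it contains
snaps_service = vms_service.vm_service(vm1.id).snapshots_service()
snap = next(s for s in snaps_service.list() if s.description == 'snap1')
snap_disks = snaps_service.snapshot_service(snap.id).disks_service().list()

# Attach each snapshot disk to backup_vm
attachments_service = vms_service.vm_service(backup_vm.id).disk_attachments_service()
for disk in snap_disks:
    attachments_service.add(
        types.DiskAttachment(
            disk=types.Disk(id=disk.id, snapshot=types.Snapshot(id=snap.id)),
            interface=types.DiskInterface.VIRTIO,
            active=True,
            bootable=False,
        ),
    )
connection.close()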

Comment 63 Avihai 2018-11-22 13:05:51 UTC
Created attachment 1507931 [details]
engine log

Comment 70 Elad 2018-12-12 13:47:31 UTC
This bug should have been moved to ON_QA.

Comment 71 Eyal Edri 2018-12-12 13:56:40 UTC
Same as the other bug: the BZ bot that acks bugs wasn't working until yesterday, so it was only acked today.
Please check it on the next build, scheduled for today.

Comment 74 Raz Tamir 2018-12-16 17:31:32 UTC
QE verification bot: the bug was verified upstream

Comment 77 Franta Kust 2019-05-16 13:07:24 UTC
BZ<2>Jira Resync

