Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1554369

Summary: Live Merge failed on engine with "still in volume chain", but merge on host was successful
Product: Red Hat Enterprise Virtualization Manager
Reporter: Gordon Watson <gwatson>
Component: ovirt-engine
Assignee: Eyal Shenitzky <eshenitz>
Status: CLOSED ERRATA
QA Contact: Avihai <aefrat>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.1.9
CC: abpatil, bcholler, bscalio, bzlotnik, cshao, ebenahar, eshames, frolland, gveitmic, gwatson, lsurette, mkalinin, nashok, Rhev-m-bugs, sbonazzo, shipatil, sirao, srevivo, tnisan
Target Milestone: ovirt-4.3.0
Keywords: ZStream
Target Release: 4.3.0
Flags: lsvaty: testing_plan_complete-
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: ovirt-engine-4.3.0_alpha
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1585950 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:37:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1547768, 1583424, 1585950    

Description Gordon Watson 2018-03-12 14:09:04 UTC
Description of problem:

An internal Live Merge was performed on a disk. The merge on the host appeared to have been successful: the top volume was unlinked, so the qemu image chain and the volume metadata chain matched and no longer contained the volume in question, and the XML and open files of the 'qemu-kvm' process also no longer contained this volume. So, on the host and storage side, the top volume had been merged back and was just waiting to be removed.

However, the engine reported that the top volume was still in the volume chain and the merge operation terminated.

All subsequent snapshot deletions for this disk failed due to the failure above.



Version-Release number of selected component (if applicable):

RHV 4.1.9
RHEL 7.4 host:
  vdsm-4.19.43-3.el7ev.x86_64

How reproducible:

Not reproducible.


Steps to Reproduce:
1.
2.
3.

Actual results:

The live merge failed and terminated, resulting in subsequent live merge failures.


Expected results:

The merge should have succeeded on the engine.

Additional info:

Comment 5 Allon Mureinik 2018-03-12 16:18:28 UTC
Ala - please take a look.

I'm tentatively targeting this for 4.2.2, just because it's quite late to get anything into 4.1.10.
If we *do* find a quick [and safe!] fix, this should definitely be a candidate for 4.1.z.

Comment 13 Ala Hino 2018-03-15 11:23:20 UTC
Moving to 4.2.4 for now, as the issue isn't reproducible and needs more analysis.

Comment 16 Elad 2018-04-11 09:24:03 UTC
Evelina, please take a look

Comment 17 Evelina Shames 2018-04-29 10:40:19 UTC
Waiting for customer's response.

Comment 18 Allon Mureinik 2018-04-29 12:22:23 UTC
(In reply to Evelina Shames from comment #17)
> Waiting for customer's response.

What response are we waiting for?
The customer provided the script they use, and the request was to attempt to reproduce the bug on a 4.2 env.

What am I missing here?

Comment 19 Evelina Shames 2018-04-29 12:33:11 UTC
Ala and I had a few questions about the script; Gordon sent them to the customer.

Comment 21 Ala Hino 2018-05-28 09:22:16 UTC
Hi Evelina,

Let's try to come up with a script on our side that simulates what the customer is trying to do:

1. Create a VM
2. Create a snapshot
3. Delete up to N snapshots. When deleting a snapshot, we have to check that the status of all snapshots is OK before proceeding to the next deletion

Pseudo code (Python-style, with placeholder helpers):

import time

vm = _create_vm()
_create_snapshot(vm)
# if there are X snapshots and X > N, _list_vm_snapshots returns X - N snapshots
snapshots = _list_vm_snapshots(vm.id)
for s in snapshots:
    _delete_snapshot(s)
    # poll until all of the VM's snapshots are reported as OK again
    while not _check_snapshot_status(vm):
        time.sleep(5)
As a reference, please see listSnapsToDelete and checkSnapStateOK in the customer script.

Let me know if you need any help with the script.
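
For reference, a minimal sketch of that loop using the Python SDK (ovirtsdk4) could look like the following. This is only an illustration of the create/delete/poll flow above, not the customer's script: the connection details, VM name and polling interval are placeholders.

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details - adjust for the environment under test.
connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='password',
                            insecure=True)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm1')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

def all_snapshots_ok():
    # Plays the role of checkSnapStateOK in the customer script.
    return all(s.snapshot_status == types.SnapshotStatus.OK
               for s in snapshots_service.list())

# Create a snapshot, then delete the regular snapshots one by one, waiting
# for all of the VM's snapshots to be OK before the next deletion.
snapshots_service.add(types.Snapshot(description='snap-%d' % int(time.time()),
                                     persist_memorystate=False))
while not all_snapshots_ok():
    time.sleep(5)

for snap in snapshots_service.list():
    if snap.snapshot_type == types.SnapshotType.REGULAR:  # skip the 'Active VM' entry
        snapshots_service.snapshot_service(snap.id).remove()
        while not all_snapshots_ok():
            time.sleep(5)

connection.close()

Here the all_snapshots_ok() poll stands in for checkSnapStateOK, and the deletion loop stands in for listSnapsToDelete plus the deletions.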

Comment 22 Germano Veit Michel 2018-05-28 23:15:38 UTC
This is happening very frequently when the customer uses Commvault.

Just reviewed the 3 cases attached above, happened with:
rhvm-4.2.3.5-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64

The engine sees the volume still in chain, but all was fine on the host.

A retry fixed the issue.

Comment 25 Elad 2018-06-08 21:36:21 UTC
Will test extensive snapshot usage as per Ala's suggestion in comment #21.
Will update with the findings.

Comment 26 Elad 2018-06-10 22:53:15 UTC
Encountered an occurrence of bz 1527861.

See https://bugzilla.redhat.com/show_bug.cgi?id=1527861#c24 for full details.

Comment 36 Germano Veit Michel 2018-08-17 03:52:40 UTC
FYI, I just hit this in our labs by chance.

Again, MERGE_STATUS failed before the merge completed.

2018-08-17 13:30:23,804+10 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-98) [76a4ec71-dce5-493f-a254-b01c727646d9] Command id: '800272e6-3573-403b-8269-1959a05b8844 failed child command status for step 'MERGE_STATUS'

2018-08-17 13:30:29,194+1000 INFO  (merge/ae378e3c) [virt.vm] (vmId='ae378e3c-e158-46b2-a263-18c1b2e6b263') Requesting pivot to complete active layer commit (job cf168cd2-9ea7-4994-bd87-5a159ac3d636) (vm:6270)

Versions:
vdsm-4.20.35-1.el7.x86_64
ovirt-engine-4.2.4.5-1.el7

I've enabled debug to catch it in case it happens again.

Comment 37 Elad 2018-08-23 08:11:53 UTC
Hi Ala, is there a specific flow we should test for verification?

Comment 38 Ala Hino 2018-08-27 14:48:52 UTC
Hi Elad,

It is not easy to provide the steps to verify.
Having said that, I'd like to have another look at the info provided by Germano and others to see if there is another "hidden" issue.

Comment 39 Elad 2018-09-16 15:12:09 UTC
We'll verify based on regression testing.

Comment 42 Eyal Shenitzky 2018-10-08 07:54:49 UTC
Hey Gordon and Germano,

Can you please provide more details on the configuration of the VMs that encounter this issue?

For example: how many disks, whether the VM was HA, whether the VM was running on the SPM or an HSM host when the live merge failed, and when the deleted snapshot was taken (maybe in an older version).

Thanks

Comment 43 Gordon Watson 2018-10-08 14:54:20 UTC
Eyal,

In the specific case for which I opened this bug, the VM in question had a single disk, was not HA and was not running on the SPM at the time of the failure.

I've "worked" on multiple instances of this bug but I don't have access to the specific details of all of them at this time. However, I've seen this happen with VMs with more than one disk too.

Regarding when the snapshot itself had been created, it seems that the majority of these instances have occurred when using API backup utilities, such as Commvault. What they do is create a snapshot, attach it to a backup VM, back it up, etc., then remove the snapshot. So, the snapshot in question would have been created shortly before the failure.
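
For context, that backup flow through the API looks roughly like the following ovirtsdk4 sketch. It is only an illustration of the pattern such utilities use; the connection details and VM names are placeholders, and a real script would also wait for the snapshot to reach OK before attaching it.

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection and VM names for this sketch.
connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='password',
                            insecure=True)
vms_service = connection.system_service().vms_service()
data_vm = vms_service.list(search='name=vm1')[0]
backup_vm = vms_service.list(search='name=backup_vm')[0]

# 1. Take a snapshot of the VM being backed up.
snaps_service = vms_service.vm_service(data_vm.id).snapshots_service()
snap = snaps_service.add(types.Snapshot(description='backup',
                                        persist_memorystate=False))

# 2. Attach the snapshot's disks to the backup VM so it can read the data.
snap_disks = snaps_service.snapshot_service(snap.id).disks_service().list()
attachments_service = vms_service.vm_service(backup_vm.id).disk_attachments_service()
for disk in snap_disks:
    attachments_service.add(
        types.DiskAttachment(
            disk=types.Disk(id=disk.id, snapshot=types.Snapshot(id=snap.id)),
            interface=types.DiskInterface.VIRTIO,
            bootable=False,
            active=True,
        ),
    )

# 3. ... back up the data from inside backup_vm, detach the disks, and finally
#    remove the snapshot; removing it is what triggers the live merge in this bug.
snaps_service.snapshot_service(snap.id).remove()
connection.close()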

Hope this helps.

Regards, GFW.

Comment 45 Germano Veit Michel 2018-10-09 03:52:03 UTC
(In reply to Eyal Shenitzky from comment #42)
> For e.g: how many disks, is the VM was HA, is the VM run on the SPM or HSM
> while the live merge failed, when the deleted snapshot was taken (maybe in
> an older version)

We have seen this several times by now; I don't think there is a unique config that triggers this. Anyway, we just saw another one: single disk, not HA and not on the SPM. Everything on the latest 4.2.6.

It seems to be a timing/race issue that is quite hard to hit. But given enough volume it shows up regularly.

I've hit this once in our labs and then never again. Even when trying again with the same VM, it never reproduced.

Comment 46 Eyal Shenitzky 2018-10-11 11:36:35 UTC
Managed to reproduce this bug with the following steps:

In 4.3 engine and cluster < 4.2:

1) Create vm1 with a disk
2) Create backup_vm with a disk
3) Create snapshot ('snap1') of vm1
4) Run vm1 and backup_vm
5) Attach snap1 to backup_vm
6) Power-off backup_vm
7) Remove backup_vm
8) Remove snap1

Live merge failed with the following error - 
2018-10-11 11:56:48,883+03 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [c1dce6ca-df36-4e6f-a5f1-e3cc063ccf6a] Failed to live merge. Top volume e9ffc75b-caca-4188-aa3f-a983ae2554d1 is still in qemu chain [ca9223cb-960a-4da7-8c6e-6166b269c813, e9ffc75b-caca-4188-aa3f-a983ae2554d1]

Comment 47 Eyal Shenitzky 2018-10-14 08:07:39 UTC
It seems like live merge isn't working with Engine 4.3/4.2 and cluster < 4.2.
(see https://bugzilla.redhat.com/show_bug.cgi?id=1623467)

So it can be reproduced with the following steps:

In 4.3 engine and cluster < 4.2:

1) Create vm1 with a disk
2) Create snapshot ('snap1') of vm1
3) Run vm1
4) Remove snap1

Germano, 
Can we reproduce this issue in engine 4.2 with cluster 4.2 (the difference between the versions is the use of the dumpXML verb instead of the VMList verb)?

If it is reproduced, please attach the logs of the engine and hosts.

Comment 48 Germano Veit Michel 2018-10-15 06:30:02 UTC
(In reply to Eyal Shenitzky from comment #47)
> Germano, 
> Can we reproduce this issue in engine 4.2 with cluster 4.2 (the difference
> between the version is the use of dumpXML verb instead of VMList verb)?
> 
> If it is reproduced, please attach the logs of the engine and hosts.

Hi Eyal,

No, it does not reproduce:

2018-10-15 16:25:48,609+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [34ccb2c9-dfc6-44c2-823a-31029bd30b93] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_SUCCESS(356), Snapshot 'test' deletion for VM 'VM' has been completed.

ovirt-engine-4.2.6.4-1.el7.noarch
vdsm-4.20.39.1-1.el7.x86_64

I don't think this bug reproduces easily with simple steps like this; otherwise we would see it tons of times a day in support cases. It seems to be a timing issue where the engine moves to the STATUS step prematurely. Maybe something with job/task reporting.

And if this constantly reproduces on 4.3 engine with cluster < 4.2, then it could be a different bug with similar symptoms?

Comment 49 Benny Zlotnik 2018-10-17 11:57:43 UTC
After hitting 1623467, Eyal and I initially thought they might be related, but we determined they aren't, since the code causing 1623467 is only available on master at the moment.

After looking at some of the customer logs, I was able to reproduce it consistently, however, it is not simple and requires exact timing, so I currently only have a reproducer with breakpoints. 

As it seems, the issue is a race between VM monitoring and the merge command:
2018-06-17 03:32:09,914+08 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] START, MergeVDSCommand(HostName = ****, MergeVDSCommandParameters:{runAsync='true', hostId='1dac87af-1909-42fb-be8b-c82fbe55d09b', vmId='260b72f7-0bce-46c4-93f0-d341f8ef7f66', storagePoolId='58c66ccd-02d9-00cc-0324-0000000002ea', storageDomainId='2499b162-1380-4976-b199-dd7b5f0ec7ce', imageGroupId='2bb5ecf2-c119-47b5-add5-4a771826ca56', imageId='8274b638-7214-4a50-ba0c-4a36091c1d2c', baseImageId='56960ce0-7257-462a-805a-c619d25373c1', topImageId='8274b638-7214-4a50-ba0c-4a36091c1d2c', bandwidth='0'}), log id: 45a243f
2018-06-17 03:32:10,390+08 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] FINISH, MergeVDSCommand, return: dfcdef93-913f-4fc3-b66d-0827e6fdc8c9, log id: 45a243f
2018-06-17 03:32:10,391+08 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] Stored placeholder for VM Job [dfcdef93-913f-4fc3-b66d-0827e6fdc8c9]
2018-06-17 03:32:10,392+08 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (DefaultQuartzScheduler10) [8ef8cce] VM Job [dfcdef93-913f-4fc3-b66d-0827e6fdc8c9]: Deleting

As we can see in the logs:

Merge command returns with job id dfcdef93-913f-4fc3-b66d-0827e6fdc8c9
2018-06-17 03:32:10,390+08 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] FINISH, MergeVDSCommand, return: dfcdef93-913f-4fc3-b66d-0827e6fdc8c9, log id: 45a243f

This means live merge has just started and the engine will start monitoring this job id.

The job id is persisted in memory and in the database:
2018-06-17 03:32:10,391+08 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] Stored placeholder for VM Job [dfcdef93-913f-4fc3-b66d-0827e6fdc8c9]

And then it is almost immediately removed:
2018-06-17 03:32:10,392+08 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (DefaultQuartzScheduler10) [8ef8cce] VM Job [dfcdef93-913f-4fc3-b66d-0827e6fdc8c9]: Deleting

So the engine thought the live merge was finished within a second, which is highly unlikely.

Jobs are updated by a thread that periodically sends a getAllVmsStats call to vdsm; this call can be seen in the vdsm log but not in the engine log, since it is logged at DEBUG level.

So the steps for this to happen (illustrated in the sketch after this list):
1. A periodic run of the VmsMonitoring starts - Thread A
2. Live merge starts - Thread B
3. Vdsm returns the job id to the engine
4. Thread B adds the job to the VmJobsMonitoring repository
5. Thread A checks whether the jobs in the repository are still running on the VM; since it uses the data fetched in step 1, which predates the block job being added to the VM metadata, that data is stale
6. Thread A removes the job prematurely, based on the stale data
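
To make the race easier to follow, here is a small, purely illustrative Python sketch of the pattern described in these steps. The real code is the engine's Java VmsMonitoring/VmJobsMonitoring; all names and structures below are invented for the illustration only.

import threading
import time

# Illustrative stand-in for the engine's VM-jobs repository (not the real code).
jobs_lock = threading.Lock()
tracked_jobs = {}                 # job_id -> placeholder

def vms_monitoring_cycle(stale_vm_stats):
    # Thread A: a periodic monitoring run. stale_vm_stats was fetched from vdsm
    # before the merge started, so it does not yet contain the new block job.
    time.sleep(0.1)               # processing delay; the merge starts meanwhile
    with jobs_lock:
        for job_id in list(tracked_jobs):
            if job_id not in stale_vm_stats['block_jobs']:
                # The job is missing from data that predates it, yet it is
                # deleted here as if it had already finished.
                print('VM Job [%s]: Deleting (prematurely!)' % job_id)
                del tracked_jobs[job_id]

def live_merge():
    # Thread B: MergeVDSCommand returns a job id, stored as a placeholder.
    job_id = 'dfcdef93-913f-4fc3-b66d-0827e6fdc8c9'
    with jobs_lock:
        tracked_jobs[job_id] = 'placeholder'
        print('Stored placeholder for VM Job [%s]' % job_id)

stale_stats = {'block_jobs': {}}  # snapshot of VM stats taken in step 1
thread_a = threading.Thread(target=vms_monitoring_cycle, args=(stale_stats,))
thread_b = threading.Thread(target=live_merge)
thread_a.start(); thread_b.start()
thread_a.join(); thread_b.join()

Because the job disappears from the repository, the engine concludes the merge has finished, and the MERGE_STATUS step then fails with "still in qemu chain" even though the merge on the host completes successfully.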

Comment 59 Avihai 2018-11-28 09:45:46 UTC
Verified.

Ovirt engine 4.3.0-0.2.master.20181121071050.gita8fcd23.el7
VDSM 4.30.3-10.git284ece2.el7.noarch

Same verification scenario as in 4.2 clone bug 1585950:

1) Create a VM and start it.
2) Once it is up, create 5 disks (file-based, on an NFS storage domain).
3) Create snapshot 's1'.
4) Create an additional 5 disks (file-based, on an NFS storage domain).
5) Create snapshot 's2'.
6) Delete snapshot 's2'.

Live merge was successful.

Just as in bug 1585950, searching the engine log I found the phrase which indicates that the code flow that triggered this bug occurred:

2018-11-28 11:19:58,908+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [] The reported data is not current and should not be used to monitor job dfced4f6-7033-4427-98c0-c8631d487527

Full flow of this job id dfced4f6-7033-4427-98c0-c8631d487527 :
> grep dfced4f6-7033-4427-98c0-c8631d487527 engine.log                                                                                                                         
2018-11-28 11:19:58,837+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-2) [f9d09e47-072d-4ca1-8f35-f29c58869f73] FINISH, MergeVDSCommand, return: dfced4f6-7033-4427-98c0-c8631d487527, log id: 44188091
2018-11-28 11:19:58,839+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (EE-ManagedThreadFactory-commandCoordinator-Thread-2) [f9d09e47-072d-4ca1-8f35-f29c58869f73] Stored placeholder for VM Job [dfced4f6-7033-4427-98c0-c8631d487527]
2018-11-28 11:19:58,908+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [] The reported data is not current and should not be used to monitor job dfced4f6-7033-4427-98c0-c8631d487527, 
    <ovirt-vm:block_jobs>{"dfced4f6-7033-4427-98c0-c8631d487527": {"blockJobType": "commit", "topVolume": "ff0dafb3-11aa-43ce-a39a-04be464b9f4d", "strategy": "commit", "jobID": "dfced4f6-7033-4427-98c0-c8631d487527", "disk": {"domainID": "447bb679-b19d-441a-aca7-50d933c8dcaf", "imageID": "209badb1-44cb-491c-b0a3-d00adaf29f23", "volumeID": "ff0dafb3-11aa-43ce-a39a-04be464b9f4d", "poolID": "e635f8db-2636-461c-b97d-a0d0b1e4978d"}, "baseVolume": "6d58ef89-c383-4b23-995e-1072821c64a1"}, "08e9dfb3-8c81-47e0-bfad-36ccbfebde38": {"blockJobType": "commit", "topVolume": "5afce49b-21f1-4921-9381-da794797e667", "strategy": "commit", "jobID": "08e9dfb3-8c81-47e0-bfad-36ccbfebde38", "disk": {"domainID": "447bb679-b19d-441a-aca7-50d933c8dcaf", "imageID": "db5e50e2-6b97-4214-ab5c-772c393f4a31", "volumeID": "5afce49b-21f1-4921-9381-da794797e667", "poolID": "e635f8db-2636-461c-b97d-a0d0b1e4978d"}, "baseVolume": "6be66ded-eb93-4e5b-afc9-9142d9e10107"}}</ovirt-vm:block_jobs>
2018-11-28 11:20:06,260+02 INFO  [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-44) [f9d09e47-072d-4ca1-8f35-f29c58869f73] Waiting on merge command to complete (jobId = dfced4f6-7033-4427-98c0-c8631d487527)
2018-11-28 11:20:14,232+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] VM Job [dfced4f6-7033-4427-98c0-c8631d487527]: Deleting
2018-11-28 11:20:18,787+02 INFO  [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [f9d09e47-072d-4ca1-8f35-f29c58869f73] Merge command (jobId = dfced4f6-7033-4427-98c0-c8631d487527) has completed for images '6d58ef89-c383-4b23-995e-1072821c64a1'..'ff0dafb3-11aa-43ce-a39a-04be464b9f4d'
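
That "reported data is not current" line reflects the guard added by the fix: before acting on monitoring data, the engine checks whether that data is recent enough to say anything about a given job. Conceptually it amounts to something like the following sketch; this is only an illustration of the idea, not the actual ovirt-engine code (which is Java), and all names are invented.

import time

# Sketch of the freshness-guard idea; names and structure are illustrative only.
tracked_jobs = {}     # job_id -> monotonic timestamp when the placeholder was stored

def store_placeholder(job_id):
    tracked_jobs[job_id] = time.monotonic()

def handle_monitoring_data(vm_block_jobs, fetched_at):
    # vm_block_jobs: job ids reported by the host; fetched_at: when they were fetched.
    for job_id, stored_at in list(tracked_jobs.items()):
        if fetched_at < stored_at:
            # The data was fetched before the job was registered, so it cannot
            # tell us anything about this job - skip it instead of deleting it.
            print('The reported data is not current and should not be used '
                  'to monitor job %s' % job_id)
            continue
        if job_id not in vm_block_jobs:
            del tracked_jobs[job_id]   # the job has really finished

With such a guard, a monitoring cycle that races with the start of a merge (as in the sketch after comment 49) simply skips the new job and handles it in a later cycle, which matches the flow in the log lines above: the placeholder is stored at 11:19:58, the stale data is ignored, and the job is only deleted at 11:20:14 once the merge has actually completed.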

Comment 61 errata-xmlrpc 2019-05-08 12:37:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

Comment 62 Daniel Gur 2019-08-28 13:15:19 UTC
sync2jira

Comment 63 Daniel Gur 2019-08-28 13:20:21 UTC
sync2jira