Bug 1554369
| Summary: | Live Merge failed on engine with "still in volume chain", but merge on host was successful | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Gordon Watson <gwatson> |
| Component: | ovirt-engine | Assignee: | Eyal Shenitzky <eshenitz> |
| Status: | CLOSED ERRATA | QA Contact: | Avihai <aefrat> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.1.9 | CC: | abpatil, bcholler, bscalio, bzlotnik, cshao, ebenahar, eshames, frolland, gveitmic, gwatson, lsurette, mkalinin, nashok, Rhev-m-bugs, sbonazzo, shipatil, sirao, srevivo, tnisan |
| Target Milestone: | ovirt-4.3.0 | Keywords: | ZStream |
| Target Release: | 4.3.0 | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.3.0_alpha | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1585950 (view as bug list) | Environment: | |
| Last Closed: | 2019-05-08 12:37:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1547768, 1583424, 1585950 | | |
Description
Gordon Watson
2018-03-12 14:09:04 UTC
Ala - please take a look.

I'm tentatively targeting this for 4.2.2, just because it's quite late to get anything into 4.1.10. If we *do* find a quick [and safe!] fix, this should definitely be a candidate for 4.1.z.

Moving to 4.2.4 for now, as the issue isn't reproducible and needs more analysis. Evelina, please take a look.

Waiting for customer's response.

(In reply to Evelina Shames from comment #17)
> Waiting for customer's response.

What response are we waiting for? The customer provided the script they use, and the request was to attempt to reproduce the bug on a 4.2 env. What am I missing here?

Ala and I had a few questions about the script; Gordon sent them to the customer.

Hi Evelina,
Let's try to come up with a script from our side that simulates what the customer is trying to do:
1. Create a VM
2. Create a snapshot
3. Delete up to N snapshots. When deleting a snapshot, we have to check that the status of all snapshots is OK before proceeding to the next deletion.
Pseudo-code:
vm = _create_vm()
_create_snapshot(vm)
// if there are X snapshots and X > N, _list_vm_snapshots returns X - N snapshots
snapshots = _list_vm_snapshots(vm.id)
for s in snapshots:
    _delete_snapshot(s)
    _del_completed = _check_snapshot_status(vm)
    while not _del_completed:
        _del_completed = _check_snapshot_status(vm)
As a reference, please see listSnapsToDelete and checkSnapStateOK in the customer script.
Let me know if you need any help with the script.
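Purely as a hedged illustration of the pseudo-code above, here is a minimal sketch assuming the Python ovirt-engine-sdk4 (ovirtsdk4); the engine URL, credentials, VM name, N, and the polling interval are placeholders, and the VM is assumed to exist already:

```python
# Hedged sketch only -- not the customer's script. Assumes ovirt-engine-sdk4;
# URL, credentials, VM name, N and the polling interval are placeholders.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

N = 3  # keep at most the N newest regular snapshots

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
try:
    vms_service = connection.system_service().vms_service()
    # stands in for _create_vm(): assumes 'vm1' already exists
    vm = vms_service.list(search='name=vm1')[0]
    snaps_service = vms_service.vm_service(vm.id).snapshots_service()

    # _create_snapshot(vm)
    snaps_service.add(types.Snapshot(description='backup', persist_memorystate=False))

    def all_snapshots_ok():
        # _check_snapshot_status(vm): every snapshot must be back to OK
        return all(s.snapshot_status == types.SnapshotStatus.OK
                   for s in snaps_service.list())

    while not all_snapshots_ok():
        time.sleep(10)

    # _list_vm_snapshots(vm.id): the oldest regular snapshots beyond the last N
    # (assumes the list is ordered oldest-first)
    regular = [s for s in snaps_service.list()
               if s.snapshot_type == types.SnapshotType.REGULAR]
    to_delete = regular[:-N] if len(regular) > N else []

    for s in to_delete:
        # _delete_snapshot(s), then wait for the live merge to finish
        snaps_service.snapshot_service(s.id).remove()
        while not all_snapshots_ok():
            time.sleep(10)
finally:
    connection.close()
```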
This is happening very frequently when the customer uses Commvault.

Just reviewed the 3 cases attached above; they happened with:
rhvm-4.2.3.5-0.1.el7.noarch
vdsm-4.19.50-1.el7ev.x86_64
The engine saw the volume still in the chain, but all was fine on the host. A retry fixed the issue.

Will test extensive snapshot usage as per Ala's suggestion from comment #21 and will update with the findings.

Encountered an occurrence of bz 1527861. See https://bugzilla.redhat.com/show_bug.cgi?id=1527861#c24 for full details.

FYI, I just hit this in our labs by chance. Again, MERGE_STATUS before the merge completed.

2018-08-17 13:30:23,804+10 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-98) [76a4ec71-dce5-493f-a254-b01c727646d9] Command id: '800272e6-3573-403b-8269-1959a05b8844 failed child command status for step 'MERGE_STATUS'
2018-08-17 13:30:29,194+1000 INFO (merge/ae378e3c) [virt.vm] (vmId='ae378e3c-e158-46b2-a263-18c1b2e6b263') Requesting pivot to complete active layer commit (job cf168cd2-9ea7-4994-bd87-5a159ac3d636) (vm:6270)

Versions:
vdsm-4.20.35-1.el7.x86_64
ovirt-engine-4.2.4.5-1.el7

I've enabled debug to catch it in case it happens again.

Hi Ala, is there a specific flow we should test for verification?

Hi Elad, it is not easy to provide steps to verify. Having said that, I'd like to have another look at the info provided by Germano and others to see if there is another "hidden" issue. We'll verify based on regression testing.

Hey Gordon and Germano, can you please provide more details on the configuration of the VMs that encounter this issue? For example: how many disks, was the VM HA, was the VM running on the SPM or an HSM when the live merge failed, and when was the deleted snapshot taken (maybe in an older version)? Thanks.

Eyal, in the specific case for which I opened this bug, the VM in question had a single disk, was not HA and was not running on the SPM at the time of the failure. I've "worked" on multiple instances of this bug but I don't have access to the specific details of all of them at this time. However, I've seen this happen with VMs with more than one disk too. Regarding when the snapshot itself had been created, it seems that the majority of these instances have occurred when using API backup utilities, such as Commvault. What they do is create a snapshot, attach it to a backup VM, back it up, etc., then remove the snapshot. So the snapshot in question would have been created shortly before the failure. Hope this helps. Regards, GFW.

(In reply to Eyal Shenitzky from comment #42)
> For example: how many disks, was the VM HA, was the VM running on the SPM or an HSM
> when the live merge failed, and when was the deleted snapshot taken (maybe in
> an older version)?

We have seen this several times by now; I don't think there is a unique config that triggers this. Anyway, we just saw another one: single disk, not HA and not on the SPM. Latest 4.2.6 everything. It seems to be a timing/race issue that is quite hard to hit, but given enough volume it shows up regularly. I've hit this once in our labs, then never again. Even with the same VM, I remember trying again with it and it never reproduced again.

Managed to reproduce this bug with the following steps:
On a 4.3 engine with a cluster < 4.2:
1) Create vm1 with a disk
2) Create backup_vm with a disk
3) Create a snapshot ('snap1') of vm1
4) Run vm1 and backup_vm
5) Attach snap1 to backup_vm
6) Power-off backup_vm
7) Remove backup_vm
8) Remove snap1
Live merge failed with the following error:
2018-10-11 11:56:48,883+03 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-10) [c1dce6ca-df36-4e6f-a5f1-e3cc063ccf6a] Failed to live merge. Top volume e9ffc75b-caca-4188-aa3f-a983ae2554d1 is still in qemu chain [ca9223cb-960a-4da7-8c6e-6166b269c813, e9ffc75b-caca-4188-aa3f-a983ae2554d1]
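For reference, a hedged sketch of the backup-style steps above driven through the Python ovirt-engine-sdk4; the engine URL, credentials, and VM names are placeholders, waits are simplified, and the snapshot-disk attachment mirrors the pattern used by backup applications, so treat it as an approximation rather than the exact reproducer:

```python
# Hedged sketch of the reproduction flow above, not the exact reproducer.
# Assumes ovirt-engine-sdk4; URL, credentials and VM names are placeholders.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
vms_service = connection.system_service().vms_service()
vm1 = vms_service.list(search='name=vm1')[0]            # steps 1-2: assumed to exist
backup_vm = vms_service.list(search='name=backup_vm')[0]
vm1_service = vms_service.vm_service(vm1.id)
backup_vm_service = vms_service.vm_service(backup_vm.id)

# 3) Create snapshot 'snap1' on vm1 and wait until it is OK
snaps_service = vm1_service.snapshots_service()
snap = snaps_service.add(types.Snapshot(description='snap1'))
snap_service = snaps_service.snapshot_service(snap.id)
while snap_service.get().snapshot_status != types.SnapshotStatus.OK:
    time.sleep(5)

# 4) Run both VMs
vm1_service.start()
backup_vm_service.start()

# 5) Attach snap1's disks to backup_vm (backup-application style)
attachments_service = backup_vm_service.disk_attachments_service()
for disk in snap_service.disks_service().list():
    attachments_service.add(types.DiskAttachment(
        disk=types.Disk(id=disk.id, snapshot=types.Snapshot(id=snap.id)),
        interface=types.DiskInterface.VIRTIO,
        active=True,
        bootable=False,
    ))

# 6) + 7) Power off and remove backup_vm
backup_vm_service.stop()
while backup_vm_service.get().status != types.VmStatus.DOWN:
    time.sleep(5)
backup_vm_service.remove()

# 8) Remove snap1 -> this is the live merge that failed on the engine side
snap_service.remove()
connection.close()
```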
It seems like live merge isn't working with engine 4.3/4.2 and cluster < 4.2 (see https://bugzilla.redhat.com/show_bug.cgi?id=1623467). So it can be reproduced with the following steps, on a 4.3 engine with a cluster < 4.2:
1) Create vm1 with a disk
2) Create a snapshot ('snap1') of vm1
3) Run vm1
4) Remove snap1

Germano, can we reproduce this issue on engine 4.2 with cluster 4.2 (the difference between the versions is the use of the dumpXML verb instead of the VMList verb)? If it is reproduced, please attach the logs of the engine and hosts.

(In reply to Eyal Shenitzky from comment #47)
> Germano, can we reproduce this issue on engine 4.2 with cluster 4.2 (the difference
> between the versions is the use of the dumpXML verb instead of the VMList verb)?
>
> If it is reproduced, please attach the logs of the engine and hosts.

Hi Eyal, no, it does not reproduce:

2018-10-15 16:25:48,609+10 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [34ccb2c9-dfc6-44c2-823a-31029bd30b93] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_SUCCESS(356), Snapshot 'test' deletion for VM 'VM' has been completed.

ovirt-engine-4.2.6.4-1.el7.noarch
vdsm-4.20.39.1-1.el7.x86_64

I don't think this bug reproduces easily with simple steps like these, otherwise we would see it tons of times a day in support cases. It seems to be a timing issue where the engine moves to the STATUS step prematurely. Maybe something with job/task reporting. And if this constantly reproduces on a 4.3 engine with a cluster < 4.2, then it could be a different bug with similar symptoms?

After hitting 1623467, Eyal and I initially thought they may be related, but determined they aren't, since the code causing 1623467 is only available on master at the moment.
After looking at some of the customer logs, I was able to reproduce it consistently; however, it is not simple and requires exact timing, so I currently only have a reproducer with breakpoints.
It seems the issue is a race between VM monitoring and the merge command:
2018-06-17 03:32:09,914+08 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] START, MergeVDSCommand(HostName = ****, MergeVDSCommandParameters:{runAsync='true', hostId='1dac87af-1909-42fb-be8b-c82fbe55d09b', vmId='260b72f7-0bce-46c4-93f0-d341f8ef7f66', storagePoolId='58c66ccd-02d9-00cc-0324-0000000002ea', storageDomainId='2499b162-1380-4976-b199-dd7b5f0ec7ce', imageGroupId='2bb5ecf2-c119-47b5-add5-4a771826ca56', imageId='8274b638-7214-4a50-ba0c-4a36091c1d2c', baseImageId='56960ce0-7257-462a-805a-c619d25373c1', topImageId='8274b638-7214-4a50-ba0c-4a36091c1d2c', bandwidth='0'}), log id: 45a243f
2018-06-17 03:32:10,390+08 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] FINISH, MergeVDSCommand, return: dfcdef93-913f-4fc3-b66d-0827e6fdc8c9, log id: 45a243f
2018-06-17 03:32:10,391+08 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] Stored placeholder for VM Job [dfcdef93-913f-4fc3-b66d-0827e6fdc8c9]
2018-06-17 03:32:10,392+08 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (DefaultQuartzScheduler10) [8ef8cce] VM Job [dfcdef93-913f-4fc3-b66d-0827e6fdc8c9]: Deleting
As we can see in the logs:
The merge command returns with job id dfcdef93-913f-4fc3-b66d-0827e6fdc8c9:
2018-06-17 03:32:10,390+08 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] FINISH, MergeVDSCommand, return: dfcdef93-913f-4fc3-b66d-0827e6fdc8c9, log id: 45a243f
This means live merge has just started and the engine will start monitoring this job id.
The job id is persisted in memory and in the database:
2018-06-17 03:32:10,391+08 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (pool-5-thread-9) [509c6653-0685-4b5a-a2f8-e4f2a6e0e75f] Stored placeholder for VM Job [dfcdef93-913f-4fc3-b66d-0827e6fdc8c9]
And then it is removed almost immediately:
2018-06-17 03:32:10,392+08 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (DefaultQuartzScheduler10) [8ef8cce] VM Job [dfcdef93-913f-4fc3-b66d-0827e6fdc8c9]: Deleting
So the engine thought the live merge finished within a second, which is highly unlikely.
Jobs are updated by a thread that periodically sends a getAllVmsStats call to vdsm; this call can be seen in the vdsm log but not in the engine log, since it is logged at DEBUG level.
So the steps for this to happen:
1. A periodic run of the VmsMonitoring starts - Thread A
2. Live merge starts - Thread B
3. Vdsm returns the job id to the engine
4. Thread B adds the job to the VmJobsMonitoring repository
5. Thread A checks whether the jobs in the repository are still running on the VM; since it uses the data fetched in step 1, collected before the block job was added to the VM metadata, that data is stale
6. Thread A removes the job prematurely based on the stale data
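To make the interleaving concrete, here is a small, purely illustrative model of this check-then-act race (plain Python threads with invented names; it is not the engine's VmJobsMonitoring code):

```python
# Illustrative model of the race only -- not engine code. The monitor prunes
# jobs based on stats it fetched *before* the merge registered its job, so a
# still-running block job is deleted from the repository.
import threading

jobs_repository = {}        # engine-side placeholders for running VM jobs
host_reported_jobs = set()  # block jobs the host would report via getAllVmsStats
stats_fetched = threading.Event()
job_stored = threading.Event()

def monitor_cycle():
    # Step 1: the periodic monitoring run fetches VM stats; the block job is
    # not in the VM metadata yet, so this snapshot of state is already stale.
    stale_stats = set(host_reported_jobs)
    stats_fetched.set()
    job_stored.wait()
    # Steps 5-6: the check runs against the stale stats, so the freshly added
    # job looks like it already finished and is removed prematurely.
    for job_id in list(jobs_repository):
        if job_id not in stale_stats:
            del jobs_repository[job_id]   # "VM Job [...]: Deleting"

def merge_command(job_id):
    stats_fetched.wait()
    # Steps 2-4: vdsm returns the job id and the engine stores a placeholder;
    # the host only reports the job on the *next* stats fetch.
    jobs_repository[job_id] = 'placeholder'
    host_reported_jobs.add(job_id)
    job_stored.set()

monitor = threading.Thread(target=monitor_cycle)
monitor.start()
merge_command('dfcdef93-913f-4fc3-b66d-0827e6fdc8c9')
monitor.join()
print(jobs_repository)  # {} -> the engine has lost track of a running live merge
```

This model also matches the "The reported data is not current and should not be used to monitor job ..." message in the verification below, where the monitoring pass now refuses to act on data that predates the job's registration.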
Verified.

ovirt-engine 4.3.0-0.2.master.20181121071050.gita8fcd23.el7
VDSM 4.30.3-10.git284ece2.el7.noarch

Same verification scenario as in the 4.2 clone, bug 1585950:
1) Create a VM and start it.
2) Once it is up, create 5 disks (file-based; an NFS SD was used)
3) Create snapshot 's1'
4) Create an additional 5 disks (file-based; an NFS SD was used)
5) Create snapshot 's2'
6) Delete snapshot 's2'

The live merge was successful. Just as in bug 1585950, searching the engine log I saw the phrase which indicates that the code flow that originally triggered this bug occurred:

2018-11-28 11:19:58,908+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [] The reported data is not current and should not be used to monitor job dfced4f6-7033-4427-98c0-c8631d487527

Full flow of this job id dfced4f6-7033-4427-98c0-c8631d487527:

> grep dfced4f6-7033-4427-98c0-c8631d487527 engine.log
2018-11-28 11:19:58,837+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-2) [f9d09e47-072d-4ca1-8f35-f29c58869f73] FINISH, MergeVDSCommand, return: dfced4f6-7033-4427-98c0-c8631d487527, log id: 44188091
2018-11-28 11:19:58,839+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (EE-ManagedThreadFactory-commandCoordinator-Thread-2) [f9d09e47-072d-4ca1-8f35-f29c58869f73] Stored placeholder for VM Job [dfced4f6-7033-4427-98c0-c8631d487527]
2018-11-28 11:19:58,908+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [] The reported data is not current and should not be used to monitor job dfced4f6-7033-4427-98c0-c8631d487527, <ovirt-vm:block_jobs>{"dfced4f6-7033-4427-98c0-c8631d487527": {"blockJobType": "commit", "topVolume": "ff0dafb3-11aa-43ce-a39a-04be464b9f4d", "strategy": "commit", "jobID": "dfced4f6-7033-4427-98c0-c8631d487527", "disk": {"domainID": "447bb679-b19d-441a-aca7-50d933c8dcaf", "imageID": "209badb1-44cb-491c-b0a3-d00adaf29f23", "volumeID": "ff0dafb3-11aa-43ce-a39a-04be464b9f4d", "poolID": "e635f8db-2636-461c-b97d-a0d0b1e4978d"}, "baseVolume": "6d58ef89-c383-4b23-995e-1072821c64a1"}, "08e9dfb3-8c81-47e0-bfad-36ccbfebde38": {"blockJobType": "commit", "topVolume": "5afce49b-21f1-4921-9381-da794797e667", "strategy": "commit", "jobID": "08e9dfb3-8c81-47e0-bfad-36ccbfebde38", "disk": {"domainID": "447bb679-b19d-441a-aca7-50d933c8dcaf", "imageID": "db5e50e2-6b97-4214-ab5c-772c393f4a31", "volumeID": "5afce49b-21f1-4921-9381-da794797e667", "poolID": "e635f8db-2636-461c-b97d-a0d0b1e4978d"}, "baseVolume": "6be66ded-eb93-4e5b-afc9-9142d9e10107"}}</ovirt-vm:block_jobs>
2018-11-28 11:20:06,260+02 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-44) [f9d09e47-072d-4ca1-8f35-f29c58869f73] Waiting on merge command to complete (jobId = dfced4f6-7033-4427-98c0-c8631d487527)
2018-11-28 11:20:14,232+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] VM Job [dfced4f6-7033-4427-98c0-c8631d487527]: Deleting
2018-11-28 11:20:18,787+02 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [f9d09e47-072d-4ca1-8f35-f29c58869f73] Merge command (jobId = dfced4f6-7033-4427-98c0-c8631d487527) has completed for images '6d58ef89-c383-4b23-995e-1072821c64a1'..'ff0dafb3-11aa-43ce-a39a-04be464b9f4d'

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085