Bug 1768167
Summary: | [downstream clone - 4.3.7] Chance of data corruption if SPM VDSM is restarted during live storage migration | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot>
Component: | vdsm | Assignee: | Vojtech Juranek <vjuranek>
Status: | CLOSED ERRATA | QA Contact: | Yosi Ben Shimon <ybenshim>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.3.5 | CC: | acavalla, acorreaa, aefrat, dhuertas, eshames, gwatson, jcoscia, lsurette, nsoffer, pelauter, rdlugyhe, srevivo, tnisan, vjuranek, ycui
Target Milestone: | ovirt-4.3.7 | Keywords: | ZStream
Target Release: | 4.3.7 | Flags: | lsvaty: testing_plan_complete-
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | vdsm-4.30.37 | Doc Type: | Bug Fix
Doc Text: | Previously, stopping, killing, or restarting the VDSM service on the Storage Pool Manager (SPM VDSM) while performing a live storage migration risked corrupting the virtual disk data. The current release removes ExecStopPost from the VDSM service to help prevent data corruption on virtual disks during live storage migration. | |
Story Points: | --- | |
Clone Of: | 1759388 | Environment: |
Last Closed: | 2019-12-12 10:36:52 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1759388 | |
Bug Blocks: | | |
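The fix named in Doc Text and Fixed In Version works by dropping the ExecStopPost= directive from the vdsmd systemd unit. On a host that cannot update to vdsm-4.30.37 yet, a similar effect can be approximated with a systemd drop-in that clears the directive. This is only a sketch of the idea, not a supported procedure; the drop-in file name is arbitrary, and it assumes the stock unit still carries an ExecStopPost= line:

```sh
# Sketch only: clear ExecStopPost= on an affected host via a drop-in.
# In systemd, assigning an empty value resets list-type Exec* settings.
mkdir -p /etc/systemd/system/vdsmd.service.d
cat > /etc/systemd/system/vdsmd.service.d/99-no-execstoppost.conf <<'EOF'
[Service]
ExecStopPost=
EOF
systemctl daemon-reload
```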
Description
RHV bug bot
2019-11-03 07:07:07 UTC
How reproducible:
Always

Steps to Reproduce:
1. Start a live disk storage migration.
2. Wait for the engine to reach Sync Image (runs on the SPM):

```
2019-10-10 16:43:21,344+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [2a54a9c2-670b-4252-a209-c27384a39afe] START, SyncImageGroupDataVDSCommand( SyncImageGroupDataVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='12f0464a-cced-4ce1-a021-04294ef124ec', imageGroupId='b657618f-de0b-4046-9314-26cabe8089db', dstDomainId='884b66d7-bdec-419d-88bf-f92f96f30dec', syncType='INTERNAL'}), log id: 5e8835b
```

3. Restart VDSM on the SPM.
4. The engine switches the SPM to another host, and the sync task fails in the engine with:

```
2019-10-10 16:44:12,335+10 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-61) [] BaseAsyncTask::logEndTaskFailure: Task 'a562339d-a39c-4cc1-a72a-eb31dc8d04b7' (Parent Command 'SyncImageGroupData', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').',
-- Exception: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').'
```

5. The engine rolls back, disables replication, and deletes the LV on the destination storage domain using the new SPM:

```
2019-10-10 16:44:24,275+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [62d2f19a] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='884b66d7-bdec-419d-88bf-f92f96f30dec', imageGroupId='b657618f-de0b-4046-9314-26cabe8089db', postZeros='false', discard='false', forceDelete='false'}), log id: 3f13b8ad
```

But... on the old SPM, the copy is still running, over an LV that no longer exists:

```
# ps -ef | grep qemu-img
vdsm     15737     1 11 16:43 ?        00:00:08 /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/mnt/storage.kvm:_exports_nfs/12f0464a-cced-4ce1-a021-04294ef124ec/images/b657618f-de0b-4046-9314-26cabe8089db/259ccaa0-4890-4b1c-9db3-6806fa068dd8 -O raw -W /rhev/data-center/mnt/blockSD/884b66d7-bdec-419d-88bf-f92f96f30dec/images/b657618f-de0b-4046-9314-26cabe8089db/259ccaa0-4890-4b1c-9db3-6806fa068dd8
root     16318 10481  0 16:44 pts/0    00:00:00 grep --color=auto qemu

# lvs | grep 884b66d7-bdec-419d-88bf-f92f96f30dec
  2bed4bc9-25a7-4969-9013-637e50a5e8b1 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 4.50g
  491ad280-926f-4f15-9afa-d4c7930af9bd 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 50.00g
  b7b916ad-6d6e-4d7d-bd7b-669666eeb935 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 128.00m
  beaaeb71-4b1b-40d1-b641-30c483c5cb10 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 128.00m
  ids                                  884b66d7-bdec-419d-88bf-f92f96f30dec -wi-ao---- 128.00m
  inbox                                884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m
  leases                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 2.00g
  master                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 1.00g
  metadata                             884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m
  outbox                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m
  xleases                              884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 1.00g
```

Now we have open possibilities for a lot of corruption events, which explains qcow2s being completely overwritten with other VMs' data. If a new LV is created or extended on the destination storage domain, it can be allocated extents that the old SPM is still actively writing, corrupting the volumes. This is what happened in SFDC 02486028 and corrupted 2 VMs.

(Originally by Germano Veit Michel)

I think we have a systemd bug here. I reproduced the issue on Fedora 29 and reported bug 1761260. We need to reproduce it on RHEL 7 and 8, and file RHEL bugs.

It looks like the only way to avoid this issue is to remove ExecStopPost from the vdsm service until this issue is fixed. We can run the ExecStopScript from vdsm itself when receiving a termination signal.

(Originally by Nir Soffer)
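Nir's suggested direction, running the stop-time cleanup from vdsm itself when it receives a termination signal rather than via systemd's ExecStopPost, follows a standard signal-trap pattern. The shell sketch below only illustrates the shape of that pattern; vdsm is a Python daemon, so the real change lives in its own code, and the cleanup body here is a stand-in, not a quote from vdsm sources:

```sh
#!/bin/sh
# Illustration of "run the stop script from the service itself":
# install a handler so cleanup runs on SIGTERM/SIGINT instead of
# relying on systemd's ExecStopPost (affected by the systemd bug).
cleanup() {
    # Hypothetical stand-in for vdsm's post-stop work.
    echo "running post-stop cleanup" >&2
}
trap 'cleanup; exit 0' TERM INT

# Stand-in for the daemon's main loop.
while :; do
    sleep 1
done
```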
In order to QA_ACK this bug I need a clear scenario of how to reproduce it. Please add it.

(Originally by Avihai Efrat)

> In order to QA_ACK this bug I need a clear scenario of how to reproduce it.

See comment #5. To simplify it a little bit:

1. Start a live disk storage migration.
2. Wait for the engine to reach Sync Image (runs on the SPM) - a qemu-img process is started on the SPM.
3. Restart, stop, or kill vdsmd on the SPM.

After step 3, the qemu-img process should be terminated as well (if the qemu-img process is still running, the bug is still present); a one-line check is sketched at the end of this page.

(Originally by Vojtech Juranek)

Verified on vdsm-4.30.37-1.el7ev.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4230
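As a concrete form of the verification criterion in Vojtech's comment above, the check reduces to confirming that no qemu-img convert process outlives a vdsmd restart on the SPM. A minimal sketch, assuming a live storage migration is in flight when vdsmd is restarted:

```sh
# On the SPM, while the migration's qemu-img copy is running:
systemctl restart vdsmd

# With the fix, the copy must be terminated together with vdsmd;
# any survivor means the bug is still present.
pgrep -af 'qemu-img convert' \
    && echo "BUG: qemu-img survived the vdsmd restart" \
    || echo "OK: no orphaned qemu-img process"
```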