Description of problem:

Initially we had one report where several VMs got corrupted after their disks were migrated to a new storage domain. We thought it could have been caused by an external factor, but nothing relevant was found in RHV or external logs. It may be a coincidence, but now we have a second case with very similar symptoms.

A few minutes, hours or even days after an LSM (live storage migration), the VMs pause with an IO error, and upon checking, the disks are totally corrupted, to the point that even the qcow2 header is missing. The LSM completes successfully and there are no errors. Following the logs shows the move was fine; the sizes etc. all seem to match. There are no storage-related errors either. It seems to be some sort of silent corruption happening during or after LSM.

Take this one for example. At the end of the LSM, there is a QemuImageInfo call executed against the volume the VM is running on:

2019-10-02 06:46:34,445-0300 INFO (jsonrpc/12) [vdsm.api] START getQemuImageInfo(sdUUID=u'8156d292-c426-4663-b91c-ebf7bb639171', spUUID=u'00000002-0002-0002-0002-000000000213', imgUUID=u'76658a66-8d2d-4f12-886c-54e562194303', volUUID=u'221040a3-7210-4c2e-be89-af7c2e54dae7', options=None) from=::ffff:A.B.C.D,46328, flow_id=dd1eafa7-443c-49f2-b582-da1065c5015f, task_id=eddbe929-f59e-44ea-bd32-0abd89e2bc52 (api:48)

The info returns fine, with the correct virtual size and showing it is a qcow2 file, so I assume the image was mostly OK at that time (at least it still had a valid qcow2 header):

2019-10-02 06:46:34,630-0300 INFO (jsonrpc/12) [vdsm.api] FINISH getQemuImageInfo return={'info': {'compat': u'1.1', 'clustersize': 65536, 'actualsize': 0, 'virtualsize': 536870912000, 'format': u'qcow2'}} from=::ffff:A.B.C.D,46328, flow_id=dd1eafa7-443c-49f2-b582-da1065c5015f, task_id=eddbe929-f59e-44ea-bd32-0abd89e2bc52 (api:54)

But 35 minutes later, the VM pauses:

2019-10-02 07:20:35,394-0300 INFO (libvirt/events) [virt.vm] (vmId='1833e768-4922-4b72-af5f-6efc81b02e51') abnormal vm stop device virtio-disk0 error eio (vm:5064)

And it won't turn on again, as qemu-kvm does not think this is a qcow2 image:

2019-10-02T19:57:06.897382Z qemu-kvm: -drive file=/rhev/data-center/mnt/blockSD/8156d292-c426-4663-b91c-ebf7bb639171/images/76658a66-8d2d-4f12-886c-54e562194303/221040a3-7210-4c2e-be89-af7c2e54dae7,format=qcow2,if=none,id=drive-ua-76658a66-8d2d-4f12-886c-54e562194303,serial=76658a66-8d2d-4f12-886c-54e562194303,werror=stop,rerror=stop,cache=none,aio=native: Image is not in qcow2 format.

Running a hexdump against the file, we cannot even see the qcow2 header (which starts with 'QFI'):

# hexdump -C -n 0xffff /dev/8156d292-c426-4663-b91c-ebf7bb639171/221040a3-7210-4c2e-be89-af7c2e54dae7
00000000  b8 ac e4 d8 2c 65 11 40  c3 4a fe b8 1d 34 e5 d9  |....,e.@.J...4..|
00000010  2d 64 d9 4d 88 c7 cc 65  d9 2c e4 d8 0d 70 9f 98  |-d.M...e.,...p..|
00000020  ca 1c b7 94 e7 30 c2 73  c8 73 1f c3 73 7e 22 c6  |.....0.s.s..s~".|
00000030  73 5c 8c 9b 4d 8c e3 98  30 c7 27 62 dc 68 cf 11  |s\..M...0.'b.h..|
00000040  46 83 87 1d 39 b9 91 2d  c6 75 8c 05 d0 40 b8 0c  |F...9..-.u...@..|

Version-Release number of selected component (if applicable):
Seen, but not limited to:
vdsm-4.30.24-2.el7ev.x86_64 (host running the VM)
vdsm-4.30.13-1.el7ev.x86_64 (SPM)
rhvm-4.3.5.4-0.1.el7.noarch
qemu-kvm-rhev-2.12.0-33.el7.x86_64
kernel-3.10.0-1062.el7.x86_64

How reproducible: Unknown
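For reference, the qcow2 header the hexdump above no longer shows begins with the 4-byte magic 'Q' 'F' 'I' 0xfb at offset 0. A minimal sketch (illustrative only, not vdsm code; the volume path is passed as an argument) to check a suspect volume:

#!/usr/bin/env python
# Minimal sketch, not part of vdsm: check whether a volume still carries a
# qcow2 header by reading the 4-byte magic at offset 0.

import sys

QCOW2_MAGIC = b"QFI\xfb"

def looks_like_qcow2(path):
    # Reading only the first 4 bytes is enough to tell whether the header survived.
    with open(path, "rb") as f:
        return f.read(4) == QCOW2_MAGIC

if __name__ == "__main__":
    path = sys.argv[1]
    if looks_like_qcow2(path):
        print("%s: qcow2 magic present" % path)
    else:
        print("%s: qcow2 magic MISSING, header overwritten?" % path)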
How reproducible: Always

Steps to Reproduce:
1. Start a Live Storage Migration of a disk.
2. Wait for the engine to get to Sync Image (runs on the SPM):

2019-10-10 16:43:21,344+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [2a54a9c2-670b-4252-a209-c27384a39afe] START, SyncImageGroupDataVDSCommand( SyncImageGroupDataVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='12f0464a-cced-4ce1-a021-04294ef124ec', imageGroupId='b657618f-de0b-4046-9314-26cabe8089db', dstDomainId='884b66d7-bdec-419d-88bf-f92f96f30dec', syncType='INTERNAL'}), log id: 5e8835b

3. Restart VDSM on the SPM.
4. The engine switches the SPM to another host, and the task for the sync fails in the engine with:

2019-10-10 16:44:12,335+10 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-61) [] BaseAsyncTask::logEndTaskFailure: Task 'a562339d-a39c-4cc1-a72a-eb31dc8d04b7' (Parent Command 'SyncImageGroupData', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').',
-- Exception: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').'

5. The engine rolls back: it disables replication and deletes the LV on the destination storage domain using the new SPM:

2019-10-10 16:44:24,275+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [62d2f19a] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='884b66d7-bdec-419d-88bf-f92f96f30dec', imageGroupId='b657618f-de0b-4046-9314-26cabe8089db', postZeros='false', discard='false', forceDelete='false'}), log id: 3f13b8ad

But on the old SPM, the copy is still running, over an LV that does not even exist anymore:

# ps -ef | grep qemu-img
vdsm     15737     1 11 16:43 ?        00:00:08 /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/mnt/storage.kvm:_exports_nfs/12f0464a-cced-4ce1-a021-04294ef124ec/images/b657618f-de0b-4046-9314-26cabe8089db/259ccaa0-4890-4b1c-9db3-6806fa068dd8 -O raw -W /rhev/data-center/mnt/blockSD/884b66d7-bdec-419d-88bf-f92f96f30dec/images/b657618f-de0b-4046-9314-26cabe8089db/259ccaa0-4890-4b1c-9db3-6806fa068dd8
root     16318 10481  0 16:44 pts/0    00:00:00 grep --color=auto qemu

# lvs | grep 884b66d7-bdec-419d-88bf-f92f96f30dec
  2bed4bc9-25a7-4969-9013-637e50a5e8b1 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 4.50g
  491ad280-926f-4f15-9afa-d4c7930af9bd 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 50.00g
  b7b916ad-6d6e-4d7d-bd7b-669666eeb935 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 128.00m
  beaaeb71-4b1b-40d1-b641-30c483c5cb10 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 128.00m
  ids                                  884b66d7-bdec-419d-88bf-f92f96f30dec -wi-ao---- 128.00m
  inbox                                884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m
  leases                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 2.00g
  master                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 1.00g
  metadata                             884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m
  outbox                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m
  xleases                              884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 1.00g

Now we have open possibilities for a lot of corruption events, and this explains the qcow2s being completely overwritten with other VMs' data: if a new LV is created or extended on the destination storage domain, it can use extents that are still being actively written by the old SPM, corrupting the volumes. This is what happened in SFDC 02486028 and corrupted 2 VMs.
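To illustrate the hazard, here is a rough sketch (not a supported tool; argument handling is simplified, and the assumption that the destination is the last argument matches the command line above) that looks for qemu-img convert processes whose destination path no longer exists:

#!/usr/bin/env python
# Rough sketch, not a supported tool: find qemu-img convert processes whose
# destination path no longer exists, i.e. copies still writing into an LV
# that was already removed by the new SPM during rollback.

import os

def iter_qemu_img_cmdlines():
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                raw = f.read()
        except (IOError, OSError):
            continue  # the process exited meanwhile
        argv = [a.decode("utf-8", "replace") for a in raw.split(b"\0") if a]
        if argv and "qemu-img" in argv[0]:
            yield pid, argv

for pid, argv in iter_qemu_img_cmdlines():
    if "convert" not in argv:
        continue
    # For "qemu-img convert ... SRC -O raw -W DST" the destination is the last argument.
    dst = argv[-1]
    if not os.path.exists(dst):
        print("PID %s is writing to a path that no longer exists: %s" % (pid, dst))

Note that qemu-img keeps writing through its already-open file descriptor regardless, so such a process is a corruption risk even though the path is gone.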
I think we have a systemd bug here. I reproduced the issue on Fedora 29 and reported bug 1761260. We need to reproduce it on RHEL 7 and 8, and file RHEL bugs.

It looks like the only way to avoid this issue is to remove ExecStopPost from the vdsm service until the systemd issue is fixed. We can run the ExecStopPost script from vdsm itself when it receives a termination signal.
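A minimal sketch of that second idea, assuming a plain SIGTERM handler and a hypothetical spawn_copy() helper standing in for vdsm's copy operation (this is not the actual vdsm patch):

#!/usr/bin/env python
# Minimal sketch, not the actual vdsm implementation: run cleanup from the
# daemon itself on SIGTERM instead of relying on a systemd ExecStopPost script.

import signal
import subprocess
import sys

# Child operations (e.g. qemu-img convert) the daemon started and must reap.
children = []

def spawn_copy(src, dst):
    # Hypothetical helper; in vdsm this would be the image copy operation.
    proc = subprocess.Popen(["qemu-img", "convert", "-p", "-f", "raw",
                             src, "-O", "raw", dst])
    children.append(proc)
    return proc

def on_terminate(signum, frame):
    # Do what the ExecStopPost script would have done: stop in-flight copies
    # so they cannot keep writing after the service and its leases are gone.
    for proc in children:
        if proc.poll() is None:
            proc.terminate()
    for proc in children:
        proc.wait()
    sys.exit(0)

signal.signal(signal.SIGTERM, on_terminate)

if __name__ == "__main__":
    # Stand-in for the daemon's main loop.
    while True:
        signal.pause()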
In order to QA_ACK this bug I need a clear scenario of how to reproduce it. Please add it.
> In order to QA_ACK this bug I need a clear scenario of how to reproduce it.

See comment #5. To simplify it a little bit:

1. Start a Live Storage Migration of a disk.
2. Wait for the engine to get to Sync Image (runs on the SPM) - a qemu-img process is started on the SPM.
3. Restart, stop or kill vdsmd on the SPM.

After step 3, the qemu-img process should be terminated as well (if the qemu-img process is still running, the bug is still present).
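A small sketch of that check (illustrative only; assumes python3 with pgrep and systemctl available on the SPM, and an arbitrary 10-second grace period):

#!/usr/bin/env python3
# Illustrative verification helper, not an official test: once qemu-img is
# running on the SPM, restart vdsmd and check whether qemu-img survived.

import subprocess
import time

def qemu_img_pids():
    # pgrep exits non-zero when nothing matches; stdout is then empty.
    res = subprocess.run(["pgrep", "-x", "qemu-img"],
                         stdout=subprocess.PIPE, universal_newlines=True)
    return res.stdout.split()

assert qemu_img_pids(), "start the LSM first and wait for qemu-img on the SPM"

subprocess.run(["systemctl", "restart", "vdsmd"], check=True)
time.sleep(10)  # grace period for systemd to clean up the service cgroup

leftover = qemu_img_pids()
if leftover:
    print("BUG STILL PRESENT: qemu-img survived vdsmd restart, pids: %s" % leftover)
else:
    print("OK: no qemu-img process left after vdsmd restart")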
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed: [Found non-acked flags: '{}', ] For more info please contact: rhv-devops
I verified this BZ according to the steps from comment #14:

1. Start a live migration of a disk between two NFS SDs.
2. Monitor the engine log for the string "CopyImageGroupVolumesData" (SyncImageGroupDataVDSCommand is not relevant any more due to this patch: https://gerrit.ovirt.org/#/c/104763/); a small log-watching sketch is at the end of this comment.
3. As soon as "CopyImageGroupVolumesData" begins, restart vdsmd on the SPM.
4. Monitor as the SPM moves to another host, and verify that no qemu-img process is running any more.
5. The disk migrated successfully and its status is OK.

Tested on rhv 4.4.0-0.24.master.el8ev.

Adding engine and vdsm logs. NOTE that this flow was executed 3 or 4 times in a row, so expect duplicates in the logs.
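The sketch for step 2 (purely illustrative; the log path is the usual engine default and may differ on your setup):

#!/usr/bin/env python3
# Purely illustrative: follow engine.log until CopyImageGroupVolumesData shows
# up, which is the moment to restart vdsmd on the SPM (step 3 above).

import time

LOG = "/var/log/ovirt-engine/engine.log"  # assumed default path
MARKER = "CopyImageGroupVolumesData"

with open(LOG) as f:
    f.seek(0, 2)  # start at the end of the file, like `tail -f`
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.5)
            continue
        if MARKER in line:
            print("copy started: %s" % line.strip())
            break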
Created attachment 1667675 [details]
verification logs: engine + vdsms
Can verification step 2 from comment #25 be considered valid? Taking into consideration that I might be seeing the result of this change: https://gerrit.ovirt.org/#/c/104763/
> Can verification step 2 from comment #25 be considered valid?

I'm not an engine expert, but I guess it's still valid. If you want to be sure, monitor the host for the qemu-img process to be started and then restart vdsmd on that host. It's the qemu-img process which actually does the corruption, so if it runs, you can be sure you are validating it correctly.
Thanks Vojtech. Indeed, I monitored for the qemu-img process and saw it starting. Moving to 'verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:3246