Bug 1768167

Summary: [downstream clone - 4.3.7] Chance of data corruption if SPM VDSM is restarted during live storage migration
Product: Red Hat Enterprise Virtualization Manager
Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: vdsm
Assignee: Vojtech Juranek <vjuranek>
Status: CLOSED ERRATA
QA Contact: Yosi Ben Shimon <ybenshim>
Severity: high
Docs Contact:
Priority: high
Version: 4.3.5
CC: acavalla, acorreaa, aefrat, dhuertas, eshames, gwatson, jcoscia, lsurette, nsoffer, pelauter, rdlugyhe, srevivo, tnisan, vjuranek, ycui
Target Milestone: ovirt-4.3.7
Keywords: ZStream
Target Release: 4.3.7
Flags: lsvaty: testing_plan_complete-
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: vdsm-4.30.37
Doc Type: Bug Fix
Doc Text:
Previously, stopping, killing, or restarting the VDSM service on the Storage Pool Manager (SPM VDSM) while performing a live storage migration risked corrupting the virtual disk data. The current release removes ExecStopPost from the VDSM service to help prevent data corruption on virtual disks during live storage migration.
Story Points: ---
Clone Of: 1759388
Environment:
Last Closed: 2019-12-12 10:36:52 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1759388    
Bug Blocks:    

Description RHV bug bot 2019-11-03 07:07:07 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1759388 +++
======================================================================

Description of problem:

Initially we had one report where several VMs got corrupted after their disks were migrated to a new storage domain. We thought it could have been caused by an external factor, but nothing relevant was found in RHV or external logs.
It may be a coincidence, but now we have a second case with very similar symptoms. A few minutes, hours, or even days after an LSM, the VMs pause with an IO error, and upon inspection the disks are totally corrupted, to the point that even the qcow2 header is missing.
The LSM completes successfully with no errors. Following the logs shows the move was fine and the sizes etc. all seem to match. There are no storage-related errors either. It seems to be some sort of silent corruption happening during or after the LSM.

Take this one for example: at the end of the LSM, a getQemuImageInfo is executed against the volume the VM is running on.

2019-10-02 06:46:34,445-0300 INFO  (jsonrpc/12) [vdsm.api] START getQemuImageInfo(sdUUID=u'8156d292-c426-4663-b91c-ebf7bb639171', spUUID=u'00000002-0002-0002-0002-000000000213', imgUUID=u'76658a66-8d2d-4f12-886c-54e562194303', volUUID=u'221040a3-7210-4c2e-be89-af7c2e54dae7', options=None) from=::ffff:A.B.C.D,46328, flow_id=dd1eafa7-443c-49f2-b582-da1065c5015f, task_id=eddbe929-f59e-44ea-bd32-0abd89e2bc52 (api:48)

The info returns fine, with the correct virtual size and showing it is a qcow2 file. So I assume the image was mostly OK at that time (at least it had an intact qcow2 header).

2019-10-02 06:46:34,630-0300 INFO  (jsonrpc/12) [vdsm.api] FINISH getQemuImageInfo return={'info': {'compat': u'1.1', 'clustersize': 65536, 'actualsize': 0, 'virtualsize': 536870912000, 'format': u'qcow2'}} from=::ffff:A.B.C.D,46328, flow_id=dd1eafa7-443c-49f2-b582-da1065c5015f, task_id=eddbe929-f59e-44ea-bd32-0abd89e2bc52 (api:54)
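
For reference, the same header can be spot-checked out of band with qemu-img info against the volume path that appears in the qemu-kvm command line further down (the -U/--force-share flag is needed while the running qemu holds the image lock), for example:

# qemu-img info -U --output=json /rhev/data-center/mnt/blockSD/8156d292-c426-4663-b91c-ebf7bb639171/images/76658a66-8d2d-4f12-886c-54e562194303/221040a3-7210-4c2e-be89-af7c2e54dae7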

But 35 minutes later, the VM pauses:

2019-10-02 07:20:35,394-0300 INFO  (libvirt/events) [virt.vm] (vmId='1833e768-4922-4b72-af5f-6efc81b02e51') abnormal vm stop device virtio-disk0 error eio (vm:5064)

And it won't start again, as qemu-kvm does not recognize it as a qcow2 image:

2019-10-02T19:57:06.897382Z qemu-kvm: -drive file=/rhev/data-center/mnt/blockSD/8156d292-c426-4663-b91c-ebf7bb639171/images/76658a66-8d2d-4f12-886c-54e562194303/221040a3-7210-4c2e-be89-af7c2e54dae7,format=qcow2,if=none,id=drive-ua-76658a66-8d2d-4f12-886c-54e562194303,serial=76658a66-8d2d-4f12-886c-54e562194303,werror=stop,rerror=stop,cache=none,aio=native: Image is not in qcow2 format.

Running a hexdump against the volume, we cannot even see the qcow2 header (which starts with 'QFI'):

# hexdump -C -n 0xffff /dev/8156d292-c426-4663-b91c-ebf7bb639171/221040a3-7210-4c2e-be89-af7c2e54dae7
00000000  b8 ac e4 d8 2c 65 11 40  c3 4a fe b8 1d 34 e5 d9  |....,e.@.J...4..|
00000010  2d 64 d9 4d 88 c7 cc 65  d9 2c e4 d8 0d 70 9f 98  |-d.M...e.,...p..|
00000020  ca 1c b7 94 e7 30 c2 73  c8 73 1f c3 73 7e 22 c6  |.....0.s.s..s~".|
00000030  73 5c 8c 9b 4d 8c e3 98  30 c7 27 62 dc 68 cf 11  |s\..M...0.'b.h..|
00000040  46 83 87 1d 39 b9 91 2d  c6 75 8c 05 d0 40 b8 0c  |F...9..-.u...@..|
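
For comparison, an intact qcow2 header starts with the 4-byte magic 'QFI\xfb' (51 46 49 fb), followed by the version number. On a healthy qcow2 volume (illustrative output with a placeholder device path, assuming a v3 image with no backing file) the first bytes would look like:

# hexdump -C -n 16 /dev/<vg_uuid>/<healthy_volume_uuid>
00000000  51 46 49 fb 00 00 00 03  00 00 00 00 00 00 00 00  |QFI.............|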

Version-Release number of selected component (if applicable):
Seen on, but not limited to:
vdsm-4.30.24-2.el7ev.x86_64 (host running the VM)
vdsm-4.30.13-1.el7ev.x86_64 (SPM)
rhvm-4.3.5.4-0.1.el7.noarch
qemu-kvm-rhev-2.12.0-33.el7.x86_64
kernel-3.10.0-1062.el7.x86_64

How reproducible:
Unknown

(Originally by Germano Veit Michel)

Comment 5 RHV bug bot 2019-11-03 07:07:18 UTC
How reproducible:
Always

Steps to Reproduce:
1. Start Live Disk Storage Migration

2. Wait for the engine to get to Sync Image (runs on the SPM)

2019-10-10 16:43:21,344+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [2a54a9c2-670b-4252-a209-c27384a39afe] START, SyncImageGroupDataVDSCommand( SyncImageGroupDataVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='12f0464a-cced-4ce1-a021-04294ef124ec', imageGroupId='b657618f-de0b-4046-9314-26cabe8089db', dstDomainId='884b66d7-bdec-419d-88bf-f92f96f30dec', syncType='INTERNAL'}), log id: 5e8835b

3. Restart VDSM on the SPM
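
For example (assuming the standard vdsmd unit name on the host):

# systemctl restart vdsmd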

4. The engine switches the SPM to another host, and the task for the Sync fails in the engine with:

2019-10-10 16:44:12,335+10 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-61) [] BaseAsyncTask::logEndTaskFailure: Task 'a562339d-a39c-4cc1-a72a-eb31dc8d04b7' (Parent Command 'SyncImageGroupData', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').',
-- Exception: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').'

5. The engine rolls back: it disables replication and deletes the LV on the destination Storage Domain using the new SPM:

2019-10-10 16:44:24,275+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [62d2f19a] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='884b66d7-bdec-419d-88bf-f92f96f30dec', imageGroupId='b657618f-de0b-4046-9314-26cabe8089db', postZeros='false', discard='false', forceDelete='false'}), log id: 3f13b8ad


But...

On the old SPM, the copy is still running, writing to an LV that does not even exist anymore:

# ps -ef | grep qemu-img
vdsm     15737     1 11 16:43 ?        00:00:08 /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/mnt/storage.kvm:_exports_nfs/12f0464a-cced-4ce1-a021-04294ef124ec/images/b657618f-de0b-4046-9314-26cabe8089db/259ccaa0-4890-4b1c-9db3-6806fa068dd8 -O raw -W /rhev/data-center/mnt/blockSD/884b66d7-bdec-419d-88bf-f92f96f30dec/images/b657618f-de0b-4046-9314-26cabe8089db/259ccaa0-4890-4b1c-9db3-6806fa068dd8
root     16318 10481  0 16:44 pts/0    00:00:00 grep --color=auto qemu

# lvs | grep 884b66d7-bdec-419d-88bf-f92f96f30dec
  2bed4bc9-25a7-4969-9013-637e50a5e8b1 884b66d7-bdec-419d-88bf-f92f96f30dec -wi-------   4.50g                                                    
  491ad280-926f-4f15-9afa-d4c7930af9bd 884b66d7-bdec-419d-88bf-f92f96f30dec -wi-------  50.00g                                                    
  b7b916ad-6d6e-4d7d-bd7b-669666eeb935 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 128.00m                                                    
  beaaeb71-4b1b-40d1-b641-30c483c5cb10 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 128.00m                                                    
  ids                                  884b66d7-bdec-419d-88bf-f92f96f30dec -wi-ao---- 128.00m                                                    
  inbox                                884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m                                                    
  leases                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a-----   2.00g                                                    
  master                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a-----   1.00g                                                    
  metadata                             884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m                                                    
  outbox                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m                                                    
  xleases                              884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a-----   1.00g   

This opens the door to a lot of corruption events, and explains the qcow2s being completely overwritten with other VMs' data. If a new LV is created or extended on the destination storage domain, it can use extents that are still being actively written by the old SPM, corrupting the volumes.
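
One hedged way to inspect the overlap on block storage is to list which physical extents each LV in the destination VG currently occupies; a new or extended LV that lands on extents still being written through the old SPM's stale mapping is silently overwritten. Illustrative command only, using the VG name from the output above:

# lvs -o lv_name,devices,seg_pe_ranges 884b66d7-bdec-419d-88bf-f92f96f30dec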

This is what happened in SFDC 02486028 and corrupted 2 VMs.

(Originally by Germano Veit Michel)

Comment 11 RHV bug bot 2019-11-03 07:07:29 UTC
I think we have a systemd bug here. I reproduced the issue on Fedora 29, and
reported bug 1761260.

We need to reproduce on RHEL 7 and 8, and file RHEL bugs.

It looks like the only way to avoid this issue is to remove ExecStopPost from
the vdsm service until this issue is fixed.

We can run the ExecStopPost script from vdsm itself when it receives a
termination signal.
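
As an illustration only (not necessarily the shipped fix, and the drop-in file name is made up), ExecStopPost can be cleared locally with a systemd drop-in, since an empty ExecStopPost= assignment resets any previously configured commands:

# cat /etc/systemd/system/vdsmd.service.d/99-no-execstoppost.conf
[Service]
ExecStopPost=
# systemctl daemon-reload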

(Originally by Nir Soffer)

Comment 13 RHV bug bot 2019-11-03 07:07:32 UTC
In order to QA_ACK this bug I need a clear scenario of how to reproduce it.
Please add it.

(Originally by Avihai Efrat)

Comment 14 RHV bug bot 2019-11-03 07:07:34 UTC
> In order to QA_ACK this bug I need a clear scenario of how to reproduce it.

see comment #5
To simplify it a little bit:

1. Start Live Disk Storage Migration
2. Wait for the engine to get to Sync Image (runs on the SPM) - a qemu-img process is started on the SPM
3. Restart, stop, or kill vdsmd on the SPM

After step 3, the qemu-img process should be terminated as well (if the qemu-img process is still running, the bug is still present).
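
For example, the leftover copy process can be checked for on the SPM host with the same command used earlier in this bug:

# ps -ef | grep qemu-img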

(Originally by Vojtech Juranek)

Comment 17 Evelina Shames 2019-11-14 15:17:20 UTC
Verified on vdsm-4.30.37-1.el7ev.x86_64

Comment 20 errata-xmlrpc 2019-12-12 10:36:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4230