Bug 1759388 - Chance of data corruption if SPM VDSM is restarted during LSM
Summary: Chance of data corruption if SPM VDSM is restarted during LSM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.0
: ---
Assignee: Vojtech Juranek
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks: 1768167
 
Reported: 2019-10-08 02:02 UTC by Germano Veit Michel
Modified: 2020-08-04 13:27 UTC (History)
14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, ExecStopPost was present in the VDSM service file. This meant that, after stopping VDSM, some of its child processes could continue and, in some cases, lead to data corruption. The current release fixes this issue. It removes ExecStopPost from the VDSM service. As a result, terminating VDSM also stops its child processes.
Clone Of:
: 1768167 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:27:17 UTC
oVirt Team: Storage
Target Upstream Version:
lsvaty: testing_plan_complete-


Attachments (Terms of Use)
verification logs. engine + vdsms (3.10 MB, application/zip)
2020-03-05 08:32 UTC, Ilan Zuckerman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4496851 0 None None None 2019-10-14 02:10:01 UTC
Red Hat Product Errata RHEA-2020:3246 0 None None None 2020-08-04 13:27:55 UTC
oVirt gerrit 104067 0 'None' MERGED vdsmd.service: Ensure that child processes are killed 2021-01-14 06:56:32 UTC
oVirt gerrit 104280 0 'None' MERGED vdsmd.service: Ensure that child processes are killed 2021-01-14 06:55:53 UTC
oVirt gerrit 104283 0 'None' MERGED vdsmd: change KillMode back to control-group 2021-01-14 06:55:53 UTC

Description Germano Veit Michel 2019-10-08 02:02:06 UTC
Description of problem:

Initially we had one report where several VMs got corrupted after their disks were migrated to a new storage domain. We thought it could have been caused by an external factor, as nothing relevant was found in RHV or external logs.
It may be a coincidence, but now we have a second case with very similar symptoms. A few minutes, hours or even days after an LSM, the VMs pause with an IO error, and upon checking, the disks are totally corrupted, to the point that even the qcow2 header is missing.
The LSM completes successfully, with no errors. Following the logs shows the move was fine; the sizes etc. all seem to match. There are no storage-related errors either. It seems to be some sort of silent corruption happening during or after LSM.

Take this one for example: at the end of the LSM, a QemuImageInfo call is executed against the volume the VM is running with.

2019-10-02 06:46:34,445-0300 INFO  (jsonrpc/12) [vdsm.api] START getQemuImageInfo(sdUUID=u'8156d292-c426-4663-b91c-ebf7bb639171', spUUID=u'00000002-0002-0002-0002-000000000213', imgUUID=u'76658a66-8d2d-4f12-886c-54e562194303', volUUID=u'221040a3-7210-4c2e-be89-af7c2e54dae7', options=None) from=::ffff:A.B.C.D,46328, flow_id=dd1eafa7-443c-49f2-b582-da1065c5015f, task_id=eddbe929-f59e-44ea-bd32-0abd89e2bc52 (api:48)

The info returns fine, with the correct virtual size and showing it is a qcow2 file, so I assume the image was mostly OK at that time (at least it had a valid qcow2 header):

2019-10-02 06:46:34,630-0300 INFO  (jsonrpc/12) [vdsm.api] FINISH getQemuImageInfo return={'info': {'compat': u'1.1', 'clustersize': 65536, 'actualsize': 0, 'virtualsize': 536870912000, 'format': u'qcow2'}} from=::ffff:A.B.C.D,46328, flow_id=dd1eafa7-443c-49f2-b582-da1065c5015f, task_id=eddbe929-f59e-44ea-bd32-0abd89e2bc52 (api:54)

But 35 minutes later, the VM pauses:

2019-10-02 07:20:35,394-0300 INFO  (libvirt/events) [virt.vm] (vmId='1833e768-4922-4b72-af5f-6efc81b02e51') abnormal vm stop device virtio-disk0 error eio (vm:5064)

And it won't turn on again, as qemu-kvm does not recognize this as a qcow2 image:

2019-10-02T19:57:06.897382Z qemu-kvm: -drive file=/rhev/data-center/mnt/blockSD/8156d292-c426-4663-b91c-ebf7bb639171/images/76658a66-8d2d-4f12-886c-54e562194303/221040a3-7210-4c2e-be89-af7c2e54dae7,format=qcow2,if=none,id=drive-ua-76658a66-8d2d-4f12-886c-54e562194303,serial=76658a66-8d2d-4f12-886c-54e562194303,werror=stop,rerror=stop,cache=none,aio=native: Image is not in qcow2 format.

Upon running a hexdump against the file, we cannot even see the qcow2 header (which starts with 'QFI'):

# hexdump -C -n 0xffff /dev/8156d292-c426-4663-b91c-ebf7bb639171/221040a3-7210-4c2e-be89-af7c2e54dae7
00000000  b8 ac e4 d8 2c 65 11 40  c3 4a fe b8 1d 34 e5 d9  |....,e.@.J...4..|
00000010  2d 64 d9 4d 88 c7 cc 65  d9 2c e4 d8 0d 70 9f 98  |-d.M...e.,...p..|
00000020  ca 1c b7 94 e7 30 c2 73  c8 73 1f c3 73 7e 22 c6  |.....0.s.s..s~".|
00000030  73 5c 8c 9b 4d 8c e3 98  30 c7 27 62 dc 68 cf 11  |s\..M...0.'b.h..|
00000040  46 83 87 1d 39 b9 91 2d  c6 75 8c 05 d0 40 b8 0c  |F...9..-.u...@..|
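As a quick sanity check, the presence of the qcow2 magic can be tested directly. A minimal sketch, assuming a readable LV or file path; `check_qcow2_magic` is a name made up here, not a vdsm helper:

```shell
# Hedged sketch: report whether a file/LV still begins with the qcow2
# magic bytes "QFI\xfb". A missing magic, as in the hexdump above,
# means the header was overwritten.
check_qcow2_magic() {
    # Read only the first 3 bytes and compare against "QFI".
    if [ "$(head -c 3 "$1" 2>/dev/null)" = "QFI" ]; then
        echo present
    else
        echo missing
    fi
}
```

For the corrupted volume above, `check_qcow2_magic /dev/8156d292-c426-4663-b91c-ebf7bb639171/221040a3-7210-4c2e-be89-af7c2e54dae7` would print `missing`.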

Version-Release number of selected component (if applicable):
Seen with, but not limited to:
vdsm-4.30.24-2.el7ev.x86_64 (host running the VM)
vdsm-4.30.13-1.el7ev.x86_64 (SPM)
rhvm-4.3.5.4-0.1.el7.noarch
qemu-kvm-rhev-2.12.0-33.el7.x86_64
kernel-3.10.0-1062.el7.x86_64

How reproducible:
Unknown

Comment 5 Germano Veit Michel 2019-10-10 06:56:21 UTC
How reproducible:
Always

Steps to Reproduce:
1. Start Live Disk Storage Migration

2. Wait for the engine to get to Sync Image (runs on the SPM)

2019-10-10 16:43:21,344+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [2a54a9c2-670b-4252-a209-c27384a39afe] START, SyncImageGroupDataVDSCommand( SyncImageGroupDataVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='12f0464a-cced-4ce1-a021-04294ef124ec', imageGroupId='b657618f-de0b-4046-9314-26cabe8089db', dstDomainId='884b66d7-bdec-419d-88bf-f92f96f30dec', syncType='INTERNAL'}), log id: 5e8835b

3. Restart VDSM on the SPM

4. The engine switches the SPM to another host, and the task for the Sync fails in the engine with:

2019-10-10 16:44:12,335+10 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-61) [] BaseAsyncTask::logEndTaskFailure: Task 'a562339d-a39c-4cc1-a72a-eb31dc8d04b7' (Parent Command 'SyncImageGroupData', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').',
-- Exception: 'VDSGenericException: VDSTaskResultNotSuccessException: TaskState contained successful return code, but a non-success result ('cleanSuccess').'

5. The engine rolls back: it disables replication and deletes the LV on the destination storage domain using the new SPM:

2019-10-10 16:44:24,275+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-26) [62d2f19a] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='9a84d2c8-e3fd-11e9-a89b-52540019c104', ignoreFailoverLimit='false', storageDomainId='884b66d7-bdec-419d-88bf-f92f96f30dec', imageGroupId='b657618f-de0b-4046-9314-26cabe8089db', postZeros='false', discard='false', forceDelete='false'}), log id: 3f13b8ad


But...

On the old SPM, the copy is still running, over an LV that does not even exist anymore:

# ps -ef | grep qemu-img
vdsm     15737     1 11 16:43 ?        00:00:08 /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/mnt/storage.kvm:_exports_nfs/12f0464a-cced-4ce1-a021-04294ef124ec/images/b657618f-de0b-4046-9314-26cabe8089db/259ccaa0-4890-4b1c-9db3-6806fa068dd8 -O raw -W /rhev/data-center/mnt/blockSD/884b66d7-bdec-419d-88bf-f92f96f30dec/images/b657618f-de0b-4046-9314-26cabe8089db/259ccaa0-4890-4b1c-9db3-6806fa068dd8
root     16318 10481  0 16:44 pts/0    00:00:00 grep --color=auto qemu

# lvs | grep 884b66d7-bdec-419d-88bf-f92f96f30dec
  2bed4bc9-25a7-4969-9013-637e50a5e8b1 884b66d7-bdec-419d-88bf-f92f96f30dec -wi-------   4.50g                                                    
  491ad280-926f-4f15-9afa-d4c7930af9bd 884b66d7-bdec-419d-88bf-f92f96f30dec -wi-------  50.00g                                                    
  b7b916ad-6d6e-4d7d-bd7b-669666eeb935 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 128.00m                                                    
  beaaeb71-4b1b-40d1-b641-30c483c5cb10 884b66d7-bdec-419d-88bf-f92f96f30dec -wi------- 128.00m                                                    
  ids                                  884b66d7-bdec-419d-88bf-f92f96f30dec -wi-ao---- 128.00m                                                    
  inbox                                884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m                                                    
  leases                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a-----   2.00g                                                    
  master                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a-----   1.00g                                                    
  metadata                             884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m                                                    
  outbox                               884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a----- 128.00m                                                    
  xleases                              884b66d7-bdec-419d-88bf-f92f96f30dec -wi-a-----   1.00g   

Now we have opened the possibility for a lot of corruption events, which explains the qcow2s being completely overwritten with other VMs' data. If a new LV is created or extended on the destination storage, it can use extents that are still being actively written by the old SPM, corrupting the volumes.

This is what happened in SFDC 02486028 and corrupted 2 VMs.

Comment 11 Nir Soffer 2019-10-13 20:25:39 UTC
I think we have a systemd bug here. I reproduced the issue on Fedora 29, and
reported bug 1761260.

We need to reproduce on RHEL 7 and 8, and file RHEL bugs.

It looks like the only way to avoid this issue is to remove ExecStopPost from
the vdsm service until this issue is fixed.

We can run the ExecStopPost script from vdsm itself when receiving a termination
signal.
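The merged fix (per the linked gerrit patches) goes in this direction: drop ExecStopPost from vdsmd.service and keep KillMode=control-group, so that stopping vdsmd makes systemd signal every process in the service's cgroup, including qemu-img children. An illustrative unit-file fragment; this is a sketch of the directives involved, not the actual shipped vdsmd.service:

```ini
# Illustrative fragment of a vdsmd.service-style unit (not the real file).
[Service]
Type=notify
ExecStart=/usr/bin/vdsmd
# No ExecStopPost= here: with a post-stop hook present, systemd could leave
# child processes (e.g. qemu-img convert) running after the main service
# stopped, which is the corruption window described in this bug.
# control-group (the systemd default) sends SIGTERM to every process in the
# unit's cgroup on stop, so qemu-img children are terminated too.
KillMode=control-group
```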

Comment 13 Avihai 2019-10-30 09:34:48 UTC
In order to QA_ACK this bug, I need a clear scenario of how to reproduce it.
Please add it.

Comment 14 Vojtech Juranek 2019-10-30 10:50:47 UTC
> In order to QA_ACK this bug I need a clear scenario of how to reproduce it.

see comment #5
To simplify it a little bit:

1. Start Live Disk Storage Migration
2. Wait for the engine to get to Sync Image (runs on the SPM) - a qemu-img process is started on the SPM
3. Restart, stop or kill vdsmd on SPM

After step 3, the qemu-img process should be terminated as well (if the qemu-img process is still running, the bug is still present).
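The check after step 3 can be scripted. A minimal sketch, assuming pgrep is available; `spm_qemu_img_leftover` is a made-up name, not a vdsm tool:

```shell
# Hedged sketch: after restarting vdsmd on the SPM, verify that no
# "qemu-img convert" child survived. Prints "clean" when none is found.
spm_qemu_img_leftover() {
    # The [c] trick keeps this command line from matching itself in pgrep -f.
    if pgrep -f 'qemu-img [c]onvert' >/dev/null 2>&1; then
        echo "leftover (bug present)"
    else
        echo "clean"
    fi
}
```

Running this on the old SPM right after `systemctl restart vdsmd` should print `clean` on a fixed build.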

Comment 17 RHV bug bot 2019-11-19 11:53:00 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops@redhat.com

Comment 20 RHV bug bot 2019-12-13 13:17:33 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops@redhat.com

Comment 25 Ilan Zuckerman 2020-03-05 08:26:13 UTC
I verified this BZ according to the steps from comment #14:

1. Start disk live migration between two nfs SDs
2. Monitor the engine log for this string: "CopyImageGroupVolumesData"
 (SyncImageGroupDataVDSCommand is not relevant any more due to this patch https://gerrit.ovirt.org/#/c/104763/)
3. As the "CopyImageGroupVolumesData" begins, restart vdsmd on the SPM
4. Monitor as the SPM changes to another host and verify that the qemu-img process is no longer running.
5. The disk migrated successfully and its status is OK.

Tested on rhv 4.4.0-0.24.master.el8ev

Adding engine and vdsm logs

NOTE that this flow was executed 3 or 4 times in a row, so expect duplicates in the logs.

Comment 26 Ilan Zuckerman 2020-03-05 08:32:07 UTC
Created attachment 1667675 [details]
verification logs. engine + vdsms

Comment 27 Ilan Zuckerman 2020-03-05 11:56:54 UTC
Can verification step 2 from comment #25 be considered valid?
Taking into consideration that I might be seeing the result of this change: https://gerrit.ovirt.org/#/c/104763/

Comment 28 Vojtech Juranek 2020-03-09 09:30:07 UTC
> Can verification step 2 from comment #25 be considered valid?

I'm not an engine expert, but I guess it's still valid. If you want to be sure, monitor the host for the qemu-img process to be started and then restart vdsmd on that host. It's the qemu-img process which actually does the corruption, so if it runs, you can be sure you are validating it correctly.

Comment 29 Ilan Zuckerman 2020-03-10 07:53:03 UTC
Thanks, Vojtech.
Indeed, I monitored for the qemu-img process and saw it starting.
Moving to 'verified'.

Comment 38 errata-xmlrpc 2020-08-04 13:27:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246

