Bug 1983882

| Summary: | Guest data corruption after migration | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Vojtech Juranek <vjuranek> |
| Component: | Core | Assignee: | Vojtech Juranek <vjuranek> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | sshmulev |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.40.80 | CC: | ahadas, bugs, eshenitz, lsvaty, nsoffer, pelauter |
| Target Milestone: | ovirt-4.4.9 | Keywords: | ZStream |
| Target Release: | 4.40.90.1 | Flags: | pm-rhel: ovirt-4.4+, pelauter: blocker+, pelauter: planning_ack+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | vdsm-4.40.90.1 | Doc Type: | Bug Fix |
Doc Text:

Cause: Extending a disk during VM migration, when the source host is the SPM and the destination host does not support disk refresh during migration, causes the disk extension done on the source host not to be visible on the destination host.

Consequence: If the VM on the source writes to a part of the disk that is not visible on the destination host before it is migrated, silent data corruption occurs.

Fix: Do not make the disk extension visible to the VM on the source host, even if the source host is the SPM.

Result: In the scenario described above, the disk extension is not visible to the VM, so the VM may be suspended due to insufficient disk space, but no data corruption occurs.
| Story Points: | --- | Clone Of: | |
|---|---|---|---|
| Environment: | | Last Closed: | 2021-10-21 07:27:14 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | Storage |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Attachments: | | | |
Description (Vojtech Juranek, 2021-07-20 05:15:38 UTC)
Created attachment 1803485 [details]
reproducer-logs
Here is the possible flow of events:
1. Disk extended on the source host
2. QEMU writing data past the end of the old device
3. Migration completes, with the destination seeing the old disk size
Data written past the end of the device will read as zeroes on the destination host,
corrupting the disk.
We have 2 cases:
1. Destination does not support disk refresh (vdsm < 4.4.7)
In this case, on the first extend attempt we disable automatic disk extension, but
the source disk was already extended on the SPM. Here we have a pretty large window
during which the guest can write data past the end of the old device.
The reproducer script (attachment 1803484 [details]) reproduces this corruption reliably.
2. Destination supports disk refresh (vdsm >= 4.4.7)
In this case we refresh the disk after every extend, avoiding the data corruption.
But there is still a tiny window during which data corruption can occur: after the
disk was extended, but before it was refreshed. If the migration finishes during this
window, the disk will be corrupted.
Upgrading to vdsm >= 4.4.7 fixes the first case.
We don't have a fix for the second case.
Updating severity; corrupting user data is the worst kind of bug.
Created attachment 1803651 [details]
reproducer script v2
This version is more reliable, computing a checksum only for block 1024 instead of the entire disk. This is 2048 times faster than checksumming the entire disk, preventing migration failures on slow networks.
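For reference, this is the idea in a minimal Python sketch (the 1 MiB block size and the /dev/sdb device path are assumptions; the attached script is authoritative):

import hashlib

BLOCK_SIZE = 1024 * 1024  # assumed 1 MiB blocks

def block_checksum(dev, block):
    # Hash a single block instead of the whole disk: for a 2 GiB disk
    # this reads 1 MiB instead of 2048 MiB, hence the 2048x speedup.
    with open(dev, "rb") as f:
        f.seek(block * BLOCK_SIZE)
        return hashlib.sha1(f.read(BLOCK_SIZE)).hexdigest()

print(block_checksum("/dev/sdb", 1024))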
Here is an example run:
# ./reproduce.sh
13:53:58.942847860 Starting stress
Press Enter when migration was started:
13:54:33.498894131 Writing to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.31383 s, 232 MB/s
13:54:35.830913298 Waiting until disk is extended
13:54:45.847324723 Writing more data to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.5595 s, 210 MB/s
0+1 records in
0+1 records out
17 bytes copied, 0.0225126 s, 0.8 kB/s
13:54:48.571706100 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:48.576898723 Stopping stress
13:54:49.638947169 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:50.648707636 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:51.662211024 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:52.677077952 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:53.696843911 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:54.715865137 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:55.725831834 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:56.753593973 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:57.774374789 Block 1024 checksum 3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3
13:54:57.776767720 Disk was corrupted
Note: this cannot reproduce the issue with new hosts (vdsm >= 4.4.7), since it
waits after writing the first 512 MiB. To reproduce the corruption with a new host,
we must write to block 1024 immediately after the disk is extended, and the migration
must finish quickly, before the source host refreshes the disk on the destination.
I don't think we have a chance to reproduce this case.
Created attachment 1803850 [details]
Reproducer script for testing new host
This script runs two threads:
- One writing data, triggering disk extends
- Another comparing disk checksums, detecting disk corruption
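Schematically, the writer/checker pattern looks like this (a simplified sketch, not the attached script; the device path, extent size, and block size are assumptions):

import hashlib
import threading
import time

DEV = "/dev/sdb"      # assumed test disk
MiB = 1024 * 1024
EXTENT = 128 * MiB    # assumed to match the configured chunk size

def write_extent(n):
    # Filling one extent at a time eventually triggers an LV extend.
    print("[write] Writing extent %d..." % n)
    with open(DEV, "rb+") as f:
        f.seek(n * EXTENT)
        for _ in range(EXTENT // MiB):
            f.write(b"\xaa" * MiB)
        f.flush()

def check(done):
    # Re-hash the first MiB, written once before checking started;
    # a changed digest means previously written data was lost.
    last = None
    while not done.is_set():
        with open(DEV, "rb") as f:
            digest = hashlib.sha1(f.read(MiB)).hexdigest()
        if last is not None and digest != last:
            print("[check] Disk was corrupted")
        last = digest
        time.sleep(1)

write_extent(0)
done = threading.Event()
checker = threading.Thread(target=check, args=(done,))
checker.start()
for n in range(1, 24):
    write_extent(n)
done.set()
checker.join()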
How to use:
1. Set the chunk size to 128 MiB; see the script for an example
configuration (a sample snippet follows this list).
2. Add a 3 GiB thin disk on block storage.
3. In the guest, run the script:
# ./reproduce-new-host.py
4. Migrate the VM from the SPM and back.
The bug may be reproduced when migrating from the SPM.
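For step 1, the chunk size is controlled in /etc/vdsm/vdsm.conf; a hedged example (option name as in recent vdsm versions; verify against the configuration suggested in the script):

[irs]
# Extend thin disks in 128 MiB chunks so extends happen often
volume_utilization_chunk_mb = 128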
Example log:
2021-07-20 21:24:50,265 INFO [write] Writing extent 0...
2021-07-20 21:24:58,912 INFO [write] Writing extent 1...
2021-07-20 21:25:07,615 INFO [write] Writing extent 2...
2021-07-20 21:25:16,424 INFO [write] Writing extent 3...
2021-07-20 21:25:25,256 INFO [write] Writing extent 4...
2021-07-20 21:25:34,198 INFO [write] Writing extent 5...
2021-07-20 21:25:43,102 INFO [write] Writing extent 6...
2021-07-20 21:25:51,926 INFO [write] Writing extent 7...
2021-07-20 21:26:00,747 INFO [write] Writing extent 8...
2021-07-20 21:26:09,552 INFO [write] Writing extent 9...
2021-07-20 21:26:18,596 INFO [write] Writing extent 10...
2021-07-20 21:26:27,418 INFO [write] Writing extent 11...
2021-07-20 21:26:36,370 INFO [write] Writing extent 12...
2021-07-20 21:26:45,221 INFO [write] Writing extent 13...
2021-07-20 21:26:54,107 INFO [write] Writing extent 14...
2021-07-20 21:27:02,872 INFO [write] Writing extent 15...
2021-07-20 21:27:11,664 INFO [write] Writing extent 16...
2021-07-20 21:27:20,561 INFO [write] Writing extent 17...
2021-07-20 21:27:29,264 INFO [write] Writing extent 18...
2021-07-20 21:27:37,923 INFO [write] Writing extent 19...
2021-07-20 21:27:46,845 INFO [write] Writing extent 20...
2021-07-20 21:27:55,638 INFO [write] Writing extent 21...
2021-07-20 21:28:04,370 INFO [write] Writing extent 22...
2021-07-20 21:28:13,210 INFO [write] Writing extent 23...
I could not reproduce the issue. Maybe with a stress test
running hundreds of migrations we would have a chance to reproduce
this.
But the script is useful for testing migration flows.
Created attachment 1803851 [details]
Reproducer script for testing new host
This version fixes a bad logging call when corruption was detected.
There are several directions to solve this issue:
- Cancel the migration on the SPM if a disk is extended. The user will have to retry
the migration. If the VM is very active and the migration takes a long time, it
may never succeed. Bad for the user experience.
- Change libvirt/qemu to give vdsm control at the end of migration, so vdsm can
refresh LVs extended on the source before libvirt starts the guest on the
destination host. I think this will be very complicated, and adding a refresh
will increase the downtime.
- Change libvirt/qemu so vdsm can specify the device size when using a block device.
qemu will treat a write past the specified size as ENOSPC without submitting
any I/O, and will treat reads the same way, returning zeroes without doing
actual I/O. libvirt will add a new API to control the device's actual size. Vdsm
will update libvirt about the LV size after refreshing the LV. This does not
affect migration downtime, but may be very complicated.
- Change LVM to allow extending an LV *without* updating the device size, so an extend
running on the SPM behaves exactly like an extend running on another
host. I think this is the most promising direction, since it is very close to
what we already do and to how LVM works.
The root problem is that on the SPM we run:
lvextend ... vg/lv
This command changes the LVM metadata in the PV and updates the kernel state
(resizing the /dev/dm-N device).
When we migrate from the SPM, we want to separate these changes, so
lvextend will only change the LVM metadata without modifying /dev/dm-N.
Then, after we refresh the disk on the destination, the device mapper
device will be updated by:
lvchange --refresh vg/lv
It turns out that lvextend has the option:
--driverloaded y|n
If set to no, the command will not attempt to use device-mapper.
For testing and debugging.
This may do what we need.
We can add a refresh=True flag to extend. When False, call
lvextend --driverloaded n ... vg/lv
to avoid the device update.
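A minimal sketch of what such a flag could look like around vdsm's LVM wrapper (the function name and signature here are hypothetical, not vdsm's actual API):

import subprocess

def extend_lv(vg_name, lv_name, size_mb, refresh=True):
    # Hypothetical wrapper; vdsm's real lvm module differs in detail.
    cmd = ["lvextend"]
    if not refresh:
        # Update only the LVM metadata; the dm device keeps its old
        # size, exactly as a non-SPM host would observe the extend.
        cmd.extend(["--driverloaded", "n"])
    cmd.extend(["--size", "+%dm" % size_mb, "%s/%s" % (vg_name, lv_name)])
    subprocess.run(cmd, check=True)

A later lvchange --refresh vg/lv on the relevant host would then make the new size visible to the kernel.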
This will affect automatic extends running on the SPM, since we don't know whether
a migration will be started right after we start an extend, and waiting
for an in-flight extend when starting a migration is complicated and bad for the
user experience.
Initial testing of the --driverloaded option shows that it works:

# lvcreate -n test --size 1g 313e6d78-80f7-41ab-883b-d1bddf77a5da
  Logical volume "test" created.
# lsblk | grep test
└─313e6d78--80f7--41ab--883b--d1bddf77a5da-test 253:29 0 1G 0 lvm
# lvextend --driverloaded n --size +1g 313e6d78-80f7-41ab-883b-d1bddf77a5da/test
  WARNING: Activation disabled. No device-mapper interaction will be attempted.
  Size of logical volume 313e6d78-80f7-41ab-883b-d1bddf77a5da/test changed from 1.00 GiB (8 extents) to 2.00 GiB (16 extents).
  Logical volume 313e6d78-80f7-41ab-883b-d1bddf77a5da/test successfully resized.
# lsblk | grep test
└─313e6d78--80f7--41ab--883b--d1bddf77a5da-test 253:29 0 1G 0 lvm
# lvchange --refresh 313e6d78-80f7-41ab-883b-d1bddf77a5da/test
# lsblk | grep test
└─313e6d78--80f7--41ab--883b--d1bddf77a5da-test 253:29 0 2G 0 lvm

We need to test this with stress running during migration.

I tried to test it with the latest script Nir attached (Reproducer script for testing new host).
Since the script didn't include the stress tool, I started it manually before the migration and stopped it when I saw the first extension.
Here are the logs from the source host (SPM):
2021-10-03 12:39:27,655+0300 INFO (periodic/4) [virt.vm] (vmId='4838c7f8-13b0-4605-966a-3bccbe30ac22') Requesting extension for volume bc7c81db-abb8-4ae5-b772-95fc1c366fbe on domain
f76ddb17-7c14-4cfc-abed-ea1349ef0ef1 (apparent: 1207959552, capacity: 3221225472, allocated: 1141309440, physical: 1207959552 threshold_state: exceeded) (vm:1350)
2021-10-03 12:39:27,655+0300 INFO (periodic/4) [vdsm.api] START sendExtendMsg(spUUID='84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', volDict={'domainID': 'f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', 'imageID': 'c5ab2a91-b0c3-48a2-a653-61c80f32aacc', 'internal': False, 'name': 'sdb', 'newSize': 1342177280, 'poolID': '84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', 'volumeID': 'bc7c81db-abb8-4ae5-b772-95fc1c366fbe', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': None}, newSize=1342177280, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7fb4785b5278>>) from=internal, task_id=94e85d42-a7db-458a-8ab0-6d651e95c009 (api:48)
2021-10-03 12:39:29,659+0300 INFO (periodic/0) [virt.vm] (vmId='4838c7f8-13b0-4605-966a-3bccbe30ac22') Requesting extension for volume bc7c81db-abb8-4ae5-b772-95fc1c366fbe on domain f76ddb17-7c14-4cfc-abed-ea1349ef0ef1 (apparent: 1207959552, capacity: 3221225472, allocated: 1141309440, physical: 1207959552 threshold_state: exceeded) (vm:1350)
2021-10-03 12:39:29,660+0300 INFO (periodic/0) [vdsm.api] START sendExtendMsg(spUUID='84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', volDict={'domainID': 'f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', 'imageID': 'c5ab2a91-b0c3-48a2-a653-61c80f32aacc', 'internal': False, 'name': 'sdb', 'newSize': 1342177280, 'poolID': '84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', 'volumeID': 'bc7c81db-abb8-4ae5-b772-95fc1c366fbe', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': None}, newSize=1342177280, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7fb4785b5278>>) from=internal, task_id=3bd9358c-2864-48a1-a81a-8b568f3d2a6d (api:48)
2021-10-03 12:39:29,701+0300 INFO (mailbox-hsm/0) [storage.ThreadPool.WorkerThread] START task 6e2ffb70-bb2b-427d-be45-97b345abb1f0 (cmd=<function runTask at 0x7fb4d5e122f0>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7fb4785b5278>>, {'domainID': 'f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', 'imageID': 'c5ab2a91-b0c3-48a2-a653-61c80f32aacc', 'internal': False, 'name': 'sdb', 'newSize': 1342177280, 'poolID': '84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', 'volumeID': 'bc7c81db-abb8-4ae5-b772-95fc1c366fbe', 'clock': <Clock(total=2.05*, extend-volume=2.05, refresh-destination-volume=0.00*)>, 'callback': None})) (threadPool:146)
2021-10-03 12:39:30,125+0300 WARN (mailbox-hsm/0) [virt.vm] (vmId='4838c7f8-13b0-4605-966a-3bccbe30ac22') Migration destination host does not support extending disk during migration, disabling disk extension during migration (vm:1597)
Before the extension (data from the source host):
2021-10-03 12:39:23,410+0300 INFO (periodic/5) [vdsm.api] START getVolumeSize(sdUUID='f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', spUUID='84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', imgUUID='c5ab2a91-b0c3-48a2-a653-61c80f32aacc', volUUID='bc7c81db-abb8-4ae5-b772-95fc1c366fbe') from=internal, task_id=6412041d-14fe-46e6-9a3a-953a22af32f3 (api:48)
2021-10-03 12:39:23,410+0300 INFO (periodic/5) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1207959552', 'truesize': '1207959552'} from=internal, task_id=6412041d-14fe-46e6-9a3a-953a22af32f3 (api:54)
After the extension (data from the destination host):
2021-10-03 12:43:24,974+0300 INFO (periodic/2) [vdsm.api] START getVolumeSize(sdUUID='f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', spUUID='84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', imgUUID='c5ab2a91-b0c3-48a2-a653-61c80f32aacc', volUUID='bc7c81db-abb8-4ae5-b772-95fc1c366fbe', options=None) from=internal, task_id=148b5b3d-5d48-4951-962a-5f0735f441ee (api:48)
2021-10-03 12:43:24,974+0300 INFO (periodic/2) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1342177280', 'truesize': '1342177280'} from=internal, task_id=148b5b3d-5d48-4951-962a-5f0735f441ee (api:54)
Versions:
SPM - vdsm-4.40.90.1-1.el8ev.x86_64
Non-SPM - vdsm-4.40.60.6-1.el8ev.x86_64
1) I didn't get any data corruption in any of my attempts to reproduce the issue; is that enough to say this is OK?
2) As mentioned in the fix for the bug, after the first extension in this scenario, further extensions are blocked. After the migration is fully done, should the disk still be blocked for extensions? I tried with and without migration, and extension seems to still be turned off.
> 1) I didn't get any data corruption in any of my attempts to reproduce the
> issue; is that enough to say this is OK?

You should be able to reproduce the issue. Nir wrote in comment #4 that he wasn't able to reproduce the issue with "Reproducer script for testing new host". IMHO you should use "reproducer script v2"; I was able to reproduce the issue with it very reliably. Once reproduced, you should apply the patch and verify that the issue doesn't appear any more.

> 2) As mentioned in the fix for the bug, after the first extension in
> this scenario, further extensions are blocked. After the migration is
> fully done, should the disk still be blocked for extensions? I tried with
> and without migration, and extension seems to still be turned off.

No. Once the VM is migrated and there are more writes, the disk should be extended as usual (and the VM resumed if it was suspended due to lack of disk space during the migration).

Verified using the attached "reproducer script v2".

Tried to reproduce the issue with a non-fixed version:
SPM - vdsm-4.40.80.5-1.el8ev.x86_64
Non-SPM - vdsm-4.40.60.7-1.el8ev.x86_64

The problem reproduced:

15:25:48.776940681 Starting stress
Press Enter when migration was started:
15:26:17.571062005 Writing to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.02058 s, 89.2 MB/s
15:26:23.643644764 Waiting until disk is extended
15:26:33.986112410 Writing more data to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.01468 s, 89.3 MB/s
0+1 records in
0+1 records out
17 bytes copied, 0.036696 s, 0.5 kB/s
15:26:40.546513745 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:40.573175317 Stopping stress
./rep.sh: line 39: 5258 Terminated stress --quiet --vm 1 --vm-bytes 256M
15:26:41.785037050 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:42.807564796 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:43.830035774 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:44.856188442 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:45.880841733 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:46.918733875 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:47.943472268 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:48.968195296 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:49.985555331 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:51.009182605 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:52.034402910 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:53.056241145 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:54.088545379 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:55.113889349 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:56.140813197 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:57.222328483 Block 1024 checksum 3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3
15:26:57.227912105 Disk was corrupted
[1]+ Terminated stress --quiet --vm 1 --vm-bytes 256M

On the fixed version it passed:

16:23:38.376413592 Starting stress
Press Enter when migration was started:
16:24:05.723303394 Writing to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.07161 s, 88.4 MB/s
16:24:11.849690679 Waiting until disk is extended
16:24:21.879880452 Writing more data to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.18676 s, 86.8 MB/s
0+1 records in
0+1 records out
17 bytes copied, 0.0224842 s, 0.8 kB/s
16:24:28.203507895 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:28.216593750 Stopping stress
./rep.sh: line 39: 7436 Terminated stress --quiet --vm 1 --vm-bytes 256M
16:24:29.369323173 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:30.393301942 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:31.411367575 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:32.430409806 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:33.449047009 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:34.553433788 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:35.597597104 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:36.617205793 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:37.640333386 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:38.665031176 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:39.689913731 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:40.706734709 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:41.723192333 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
......

The checksum loop continued for a long time after the migration was done, and the checksum stayed identical, which means the disk was not corrupted.

This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.