Created attachment 1803484 [details]
reproducer

Description of problem:

When the disk of the VM is extended during migration, guest data can be corrupted when the source host is the SPM.

The fix for BZ #1883399 should prevent data corruption in most cases by refreshing storage first on the destination and then on the source, and by failing the extension when the destination host does not support disk refresh. However, when the source host is the SPM, which performs the disk extension itself, storage is refreshed on the source right after extending the disk, regardless of whether the destination host supports disk refresh. As a result, the disk may be refreshed only on the source, and guest data can be corrupted after migration if the VM wrote to the extended area on the source before the migration finished.

This was originally reported by Nir in https://bugzilla.redhat.com/show_bug.cgi?id=1883399#c88

Reproducer and logs are also taken from BZ #1883399.

Steps to Reproduce:

1. In the guest, start the script:

[root@f32 ~]# ./reproduce.sh
17:26:27.660876580 Starting stress
Press Enter when migration was started:

2. In engine, start the migration.

3. Wait until the migration shows the first non-zero progress message. Clicking refresh in the engine UI every few seconds helps.

4. Press Enter in the guest script to continue the reproducer:

17:27:03.735988728 Writing to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.21757 s, 242 MB/s
17:27:05.966949840 Waiting until disk is extended
17:27:16.040376014 Writing more data to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.84322 s, 189 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0197118 s, 53.2 MB/s
17:27:19.041140885 Computing checksum...
17:27:29.890113434 Disk checksum 1a5350b04ce65e0ffac9a10d46f282769865e3db
17:27:29.908828362 Stopping stress
17:27:29.944121095 Computing checksum...
17:27:37.706975020 Disk checksum 1a5350b04ce65e0ffac9a10d46f282769865e3db
17:27:37.708425142 Computing checksum...
17:27:43.754118454 Disk checksum 91d50642dd930e9542c39d36f0516d45f4e1af0d
17:27:43.756070975 Disk was corrupted

The script writes data to the disk, triggering an extension. Then it writes more data, including the first MiB after the end of the original disk size. When the writes are done, the script computes the first checksum. Once the checksum is computed, it kills the stress processes so the migration can complete quickly. Then the script loops, computing the checksum of the disk. When the migration is done, the checksum on the destination will differ because the destination volume was never extended.
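For orientation, here is a rough Python sketch of the attached shell reproducer's logic. The device path, the buffered I/O, the fixed 10-second wait for the extend, and hashing only the area past the original 1 GiB are simplifying assumptions based on the output above, not the attachment itself; the "stress" workload that keeps the migration busy is omitted.

#!/usr/bin/env python3
# Sketch only; assumes the thin disk appears as /dev/sdb in the guest
# and starts with 1 GiB allocated.
import hashlib
import os
import time

DEV = "/dev/sdb"   # assumed guest device backed by the thin LV
GIB = 1024**3
MIB = 1024**2

def write(offset, length):
    with open(DEV, "r+b") as f:
        f.seek(offset)
        for _ in range(length // MIB):
            f.write(os.urandom(MIB))  # incompressible data
        f.flush()
        os.fsync(f.fileno())

def checksum():
    # Hash everything past the original 1 GiB end of the volume; this is
    # the area that reads as zeroes on the destination if the LV was
    # never refreshed there.
    h = hashlib.sha1()
    with open(DEV, "rb") as f:
        f.seek(GIB)
        while True:
            buf = f.read(8 * MIB)
            if not buf:
                break
            h.update(buf)
    return h.hexdigest()

input("Press Enter when migration was started: ")

print("Writing to disk...")
write(0, 512 * MIB)            # crosses the 512 MiB block threshold

print("Waiting until disk is extended")
time.sleep(10)                 # give the SPM time to extend the LV

print("Writing more data to disk...")
write(512 * MIB, 512 * MIB)    # fill the rest of the old 1 GiB
write(GIB, MIB)                # first MiB past the old device end

expected = checksum()
print("Disk checksum", expected)
while True:                    # after migration completes, the destination
    time.sleep(1)              # reads zeroes past the old end of the LV
    current = checksum()
    print("Disk checksum", current)
    if current != expected:
        print("Disk was corrupted")
        break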
Attached log analysis:

## src.log

1. Migration started

2021-07-16 00:26:38,357+0300 INFO (jsonrpc/5) [api.virt] START migrate(params={'abortOnError': 'true', 'autoConverge': 'true', 'dst': 'host4:54321', 'method': 'online', 'vmId': '9c4050b1-8070-429f-ae36-9eae0935ccab', 'src': 'host3', 'dstqemu': '192.168.122.23', 'convergenceSchedule': {'init': [{'name': 'setDowntime', 'params': ['100']}], 'stalling': [{'limit': 1, 'action': {'name': 'setDowntime', 'params': ['150']}}, {'limit': 2, 'action': {'name': 'setDowntime', 'params': ['200']}}, {'limit': 3, 'action': {'name': 'setDowntime', 'params': ['300']}}, {'limit': 4, 'action': {'name': 'setDowntime', 'params': ['400']}}, {'limit': 6, 'action': {'name': 'setDowntime', 'params': ['500']}}, {'limit': -1, 'action': {'name': 'abort', 'params': []}}]}, 'outgoingLimit': 2, 'enableGuestEvents': True, 'tunneled': 'false', 'encrypted': False, 'compressed': 'false', 'incomingLimit': 2}) from=::ffff:192.168.122.11,48102, flow_id=199fded5-d0fb-45dd-b53a-e3326ecc1aa9, vmId=9c4050b1-8070-429f-ae36-9eae0935ccab (api:48)
2021-07-16 00:26:38,360+0300 INFO (jsonrpc/5) [api.virt] FINISH migrate return={'status': {'code': 0, 'message': 'Migration in progress'}, 'progress': 0} from=::ffff:192.168.122.11,48102, flow_id=199fded5-d0fb-45dd-b53a-e3326ecc1aa9, vmId=9c4050b1-8070-429f-ae36-9eae0935ccab (api:54)

2. Migration progress started

2021-07-16 00:27:01,532+0300 INFO (migmon/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Migration Progress: 20.012 seconds elapsed, 83% of data processed, total data: 1096MB, processed data: 858MB, remaining data: 188MB, transfer speed 416Mbps, zero pages: 13331MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:886)

3. Block threshold exceeded

2021-07-16 00:27:05,718+0300 INFO (libvirt/events) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Block threshold 536870912 exceeded by 327680 for drive 'sdg' (/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/53594314-c33d-410e-a3c3-910f405b5131/c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa) (drivemonitor:173)

4. Drive extended

2021-07-16 00:27:07,137+0300 INFO (mailbox-spm/0) [storage.LVM] Extending LV feab3738-c158-4d48-8a41-b5a95c057a50/c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa to 2048 megabytes (lvm:1668)

5. Detecting that the destination does not support extension, disabling drive monitoring

2021-07-16 00:27:09,493+0300 WARN (mailbox-hsm/2) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Migration destination host does not support extending disk during migration, disabling disk extension during migration (vm:1593)

6. Migration continues after drive was extended

2021-07-16 00:27:11,535+0300 INFO (migmon/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Migration Progress: 30.014 seconds elapsed, 90% of data processed, total data: 1096MB, processed data: 1373MB, remaining data: 118MB, transfer speed 416Mbps, zero pages: 43933MB, compressed: 0MB, dirty rate: 13953, memory iteration: 3 (migration:886)
...
2021-07-16 00:27:31,545+0300 INFO (migmon/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Migration Progress: 50.022 seconds elapsed, 86% of data processed, total data: 1096MB, processed data: 2413MB, remaining data: 157MB, transfer speed 416Mbps, zero pages: 45896MB, compressed: 0MB, dirty rate: 12900, memory iteration: 7 (migration:886)
7. Migration completes successfully

2021-07-16 00:27:39,048+0300 INFO (libvirt/events) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') CPU stopped: onSuspend (vm:6018)
2021-07-16 00:27:40,778+0300 INFO (libvirt/events) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') new computed progress 86 < than old value 90, discarded (migration:255)
2021-07-16 00:27:40,780+0300 INFO (migsrc/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') migration took 62 seconds to complete (migration:580)
2021-07-16 00:27:40,780+0300 INFO (migsrc/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Changed state to Down: Migration succeeded (code=4) (vm:1902)

## dst.log

1. Migration started

2021-07-16 00:26:38,398+0300 INFO (jsonrpc/2) [api.virt] START migrationCreate...

2. Destination disk volume size is 1G

2021-07-16 00:26:39,435+0300 INFO (vm/9c4050b1) [vdsm.api] START getVolumeSize(sdUUID='feab3738-c158-4d48-8a41-b5a95c057a50', spUUID='3f8e08e6-5485-429e-a686-63754906b27b', imgUUID='53594314-c33d-410e-a3c3-910f405b5131', volUUID='c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa', options=None) from=internal, task_id=31c71d10-c09f-4e7f-ab67-35a8f3343e3d (api:48)
2021-07-16 00:26:39,435+0300 INFO (vm/9c4050b1) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1073741824', 'truesize': '1073741824'} from=internal, task_id=31c71d10-c09f-4e7f-ab67-35a8f3343e3d (api:54)

3. Migration completed

2021-07-16 00:27:39,199+0300 INFO (libvirt/events) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') CPU running: onResume (vm:6018)
2021-07-16 00:27:39,214+0300 INFO (libvirt/events) [vds] Channel state for vm_id=9c4050b1-8070-429f-ae36-9eae0935ccab changed from=disconnected(2) to=connected(1) (qemuguestagent:289)
2021-07-16 00:27:39,221+0300 INFO (vm/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Starting connection (guestagent:252)
2021-07-16 00:27:39,226+0300 INFO (vm/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') CPU running: domain initialization (vm:6018)
2021-07-16 00:27:39,236+0300 INFO (vm/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') End of migration (vm:4081)

4. Setting block threshold on the extended disk

2021-07-16 00:27:40,365+0300 INFO (periodic/3) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Setting block threshold to 536870912 bytes for drive 'sdg' (/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/53594314-c33d-410e-a3c3-910f405b5131/c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa) apparentsize 1073741824 (drivemonitor:116)

The destination host still sees a 1 GiB disk, since the disk was not refreshed on the destination. The disk on the destination is not extended after the migration:

$ egrep -A1 getVolumeSize.+c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa vdsm.log
2021-07-16 00:26:39,435+0300 INFO (vm/9c4050b1) [vdsm.api] START getVolumeSize(sdUUID='feab3738-c158-4d48-8a41-b5a95c057a50', spUUID='3f8e08e6-5485-429e-a686-63754906b27b', imgUUID='53594314-c33d-410e-a3c3-910f405b5131', volUUID='c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa', options=None) from=internal, task_id=31c71d10-c09f-4e7f-ab67-35a8f3343e3d (api:48)
2021-07-16 00:26:39,435+0300 INFO (vm/9c4050b1) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1073741824', 'truesize': '1073741824'} from=internal, task_id=31c71d10-c09f-4e7f-ab67-35a8f3343e3d (api:54)
...
2021-07-16 00:42:33,940+0300 INFO (periodic/0) [vdsm.api] START getVolumeSize(sdUUID='feab3738-c158-4d48-8a41-b5a95c057a50', spUUID='3f8e08e6-5485-429e-a686-63754906b27b', imgUUID='53594314-c33d-410e-a3c3-910f405b5131', volUUID='c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa', options=None) from=internal, task_id=be57bda6-143c-44fb-bf7a-f5bcb5382aa7 (api:48)
2021-07-16 00:42:33,940+0300 INFO (periodic/0) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1073741824', 'truesize': '1073741824'} from=internal, task_id=be57bda6-143c-44fb-bf7a-f5bcb5382aa7 (api:54)

This will cause guest data corruption as soon as the guest tries to read from the missing parts of the disk.
Created attachment 1803485 [details]
reproducer-logs
Here is the possible flow of events:

1. Disk extended on the source host
2. QEMU writing data after the end of the old device
3. Migration completes, seeing the old disk size

Data written after the end of the device will read as zeroes on the destination host, corrupting the disk.

We have 2 cases:

1. Destination does not support disk refresh (vdsm < 4.4.7)

In this case we disable automatic disk extension on the first extend attempt, but the source disk was already extended on the SPM. Here we have a pretty large window in which the guest can write data after the end of the old device. The reproducer script (attachment 1803484 [details]) reproduces this corruption reliably.

2. Destination supports disk refresh (vdsm >= 4.4.7)

In this case we refresh the disk after every extend, avoiding the data corruption. But there is still a tiny window in which data corruption can occur: after the disk was extended, but before the disk was refreshed. If the migration finishes during this tiny window, the disk will be corrupted.

Upgrading to vdsm >= 4.4.7 is the fix for the first case. We don't have a fix for the second case.

Updating severity; corrupting user data is the worst kind of bug.
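To make the two windows concrete, here is an illustrative Python model of the extend flow on the SPM during migration. Every function here is a hypothetical stub, not vdsm's real API; it only shows where the two corruption windows sit.

# Illustrative model only; all names are hypothetical stand-ins.

def lvextend_on_spm(drive):
    # On the SPM, lvextend updates the LVM metadata AND the local
    # device-mapper device in one step, so QEMU on the source may
    # immediately write past the old end of the volume.
    print("extend %s (source device already resized)" % drive)

def destination_supports_refresh():
    return True  # vdsm >= 4.4.7 on the destination

def refresh_on_destination(drive):
    print("refresh %s on the destination host" % drive)

def disable_extension_during_migration():
    print("disable further extends for this migration")

def on_block_threshold_exceeded(drive, migrating=True):
    lvextend_on_spm(drive)
    if migrating:
        if destination_supports_refresh():
            # Case 2 window (tiny): if migration completes between the
            # extend and this refresh, the destination still sees the old
            # size and the extended area reads as zeroes.
            refresh_on_destination(drive)
        else:
            # Case 1 window (large): extension is disabled for the rest
            # of the migration, but the source LV is already bigger.
            disable_extension_during_migration()

on_block_threshold_exceeded("vg/lv")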
Created attachment 1803651 [details]
reproducer script v2

This version is more reliable, creating a checksum only for block 1024 instead of the entire disk. This is 2048 times faster than computing a checksum of the entire disk, preventing migration failures with a slow network.

Here is an example run:

# ./reproduce.sh
13:53:58.942847860 Starting stress
Press Enter when migration was started:
13:54:33.498894131 Writing to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.31383 s, 232 MB/s
13:54:35.830913298 Waiting until disk is extended
13:54:45.847324723 Writing more data to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.5595 s, 210 MB/s
0+1 records in
0+1 records out
17 bytes copied, 0.0225126 s, 0.8 kB/s
13:54:48.571706100 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:48.576898723 Stopping stress
13:54:49.638947169 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:50.648707636 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:51.662211024 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:52.677077952 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:53.696843911 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:54.715865137 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:55.725831834 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:56.753593973 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
13:54:57.774374789 Block 1024 checksum 3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3
13:54:57.776767720 Disk was corrupted

Note: this cannot reproduce the issue with new hosts (vdsm >= 4.4.7), since it waits after writing the first 512 MiB. To reproduce the corruption with a new host we must write to block 1024 immediately after the disk is extended, and the migration must finish quickly, before the source host refreshes the disk on the destination. I don't think we have a chance to reproduce this case.
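A hedged Python sketch of the v2 checksum loop, assuming 1 MiB blocks so that block 1024 is the first block past the old 1 GiB device end; the device path is an assumption and the attachment remains the authoritative version:

import hashlib
import time

DEV = "/dev/sdb"   # assumed guest device
MIB = 1024**2
BLOCK = 1024       # block number in MiB units; offset 1 GiB

def block_checksum(n):
    # Hash only one 1 MiB block; on a 2 GiB disk this reads 2048 times
    # less data than hashing the entire disk.
    with open(DEV, "rb") as f:
        f.seek(n * MIB)
        return hashlib.sha1(f.read(MIB)).hexdigest()

expected = block_checksum(BLOCK)
print("Block 1024 checksum", expected)
while True:
    time.sleep(1)
    current = block_checksum(BLOCK)
    print("Block 1024 checksum", current)
    if current != expected:
        print("Disk was corrupted")
        break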
Created attachment 1803850 [details]
Reproducer script for testing new host

This script runs 2 threads:
- One writing data, triggering disk extends
- The other comparing disk checksums, detecting disk corruption

How to use:

1. Set the chunk size to 128 MiB; see the script for an example configuration.
2. Add a 3g thin disk on block storage.
3. In the guest, run the script:
   # ./reproduce-new-host.py
4. Migrate the VM from the SPM and back.

The bug may reproduce when migrating from the SPM.

Example log:

2021-07-20 21:24:50,265 INFO [write] Writing extent 0...
2021-07-20 21:24:58,912 INFO [write] Writing extent 1...
2021-07-20 21:25:07,615 INFO [write] Writing extent 2...
2021-07-20 21:25:16,424 INFO [write] Writing extent 3...
2021-07-20 21:25:25,256 INFO [write] Writing extent 4...
2021-07-20 21:25:34,198 INFO [write] Writing extent 5...
2021-07-20 21:25:43,102 INFO [write] Writing extent 6...
2021-07-20 21:25:51,926 INFO [write] Writing extent 7...
2021-07-20 21:26:00,747 INFO [write] Writing extent 8...
2021-07-20 21:26:09,552 INFO [write] Writing extent 9...
2021-07-20 21:26:18,596 INFO [write] Writing extent 10...
2021-07-20 21:26:27,418 INFO [write] Writing extent 11...
2021-07-20 21:26:36,370 INFO [write] Writing extent 12...
2021-07-20 21:26:45,221 INFO [write] Writing extent 13...
2021-07-20 21:26:54,107 INFO [write] Writing extent 14...
2021-07-20 21:27:02,872 INFO [write] Writing extent 15...
2021-07-20 21:27:11,664 INFO [write] Writing extent 16...
2021-07-20 21:27:20,561 INFO [write] Writing extent 17...
2021-07-20 21:27:29,264 INFO [write] Writing extent 18...
2021-07-20 21:27:37,923 INFO [write] Writing extent 19...
2021-07-20 21:27:46,845 INFO [write] Writing extent 20...
2021-07-20 21:27:55,638 INFO [write] Writing extent 21...
2021-07-20 21:28:04,370 INFO [write] Writing extent 22...
2021-07-20 21:28:13,210 INFO [write] Writing extent 23...

I could not reproduce the issue. Maybe with a stress test running 100s of migrations we would have a chance to reproduce this. But the script is useful for testing migration flows.
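A minimal threaded sketch of this approach in Python; the device path, extent bookkeeping, and all names are illustrative assumptions, not the attached reproduce-new-host.py:

import hashlib
import logging
import os
import threading
import time

DEV = "/dev/sdc"            # assumed: the 3g thin disk on block storage
MIB = 1024**2
EXTENT = 128 * MIB          # matches the suggested 128 MiB chunk size
EXTENTS = 3 * 1024 * MIB // EXTENT  # 24 extents in a 3 GiB disk

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [%(threadName)s] %(message)s")
log = logging.getLogger()

checksums = {}              # extent number -> sha1 of the data written
lock = threading.Lock()
done = threading.Event()

def write():
    # Fill the disk extent by extent; each extent written crosses an
    # allocation threshold and triggers an extend on the source host.
    with open(DEV, "r+b") as f:
        for n in range(EXTENTS):
            log.info("Writing extent %d...", n)
            data = os.urandom(EXTENT)
            f.seek(n * EXTENT)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
            with lock:
                checksums[n] = hashlib.sha1(data).hexdigest()
    done.set()

def check():
    # Keep re-reading the extents written so far; a mismatch means the
    # migrated disk lost writes made past an old device end.
    while not done.is_set():
        with lock:
            expected = dict(checksums)
        with open(DEV, "rb") as f:
            for n, digest in expected.items():
                f.seek(n * EXTENT)
                if hashlib.sha1(f.read(EXTENT)).hexdigest() != digest:
                    log.error("Corruption detected in extent %d", n)
                    done.set()
                    return
        time.sleep(1)

threading.Thread(target=write, name="write").start()
threading.Thread(target=check, name="check").start()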
Created attachment 1803851 [details]
Reproducer script for testing new host

The new version fixes a bad logging call when corruption is detected.
There are several directions to solve this issue:

- Cancel migration on the SPM if a disk is extended. The user will have to retry the migration. If the VM is very active and the migration takes a lot of time, this may never succeed. Bad for user experience.

- Change libvirt/qemu to give vdsm control at the end of migration, so vdsm can refresh LVs extended on the source before libvirt starts the guest on the destination host. I think this will be very complicated, and adding a refresh will increase the downtime.

- Change libvirt/qemu so vdsm can specify the device size when using a block device. qemu will treat writes after the specified size as ENOSPC without submitting any I/O, and will treat reads in the same way, returning zeroes without doing actual I/O. Libvirt will add a new API to control the actual device size. Vdsm will update libvirt about the LV size after refreshing the LV. This does not affect migration downtime, but may be very complicated.

- Change LVM to allow extending an LV *without* updating the device size, so extend when running on the SPM will behave exactly like extend when running on another host. I think this is the most promising direction, since it is very close to what we already do and to how LVM works.

The root problem is that on the SPM we run:

    lvextend ... vg/lv

This command changes the LVM metadata in the PV and updates the kernel state (resizes the /dev/dm-N device). When we migrate from the SPM, we want to separate these changes, so lvextend will only change the LVM metadata without modifying /dev/dm-N. Then, after we refresh the disk on the destination, the device mapper device will be updated by:

    lvchange --refresh vg/lv

It turns out that lvextend has the option:

    --driverloaded y|n
        If set to no, the command will not attempt to use device-mapper.
        For testing and debugging.

which may do what we need. We can add a refresh flag to extend. When refresh=False, call:

    lvextend --driverloaded n ... vg/lv

to avoid the device update. This will affect automatic extends running on the SPM, since we don't know if a migration will be started right after we start an extend, and waiting for in-flight extends when starting a migration is complicated and bad for the user experience.
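A minimal sketch of how such an extend wrapper could look, assuming a hypothetical refresh parameter around the real lvextend/lvchange commands (the --driverloaded option is real; the wrapper and its names are illustrative, and vdsm's actual lvm module differs):

import subprocess

def extend_lv(vg, lv, new_size_mb, refresh=True):
    """Extend vg/lv to new_size_mb megabytes.

    With refresh=False, only the LVM metadata is updated
    (--driverloaded n); the local /dev/dm-N device keeps its old size
    until the LV is refreshed explicitly, mirroring the behavior of an
    extend performed for a non-SPM host.
    """
    cmd = ["lvextend"]
    if not refresh:
        cmd.extend(["--driverloaded", "n"])
    cmd.extend(["--size", "%dm" % new_size_mb, "%s/%s" % (vg, lv)])
    subprocess.run(cmd, check=True)

def refresh_lv(vg, lv):
    # Update the device-mapper device to the size in the LVM metadata.
    subprocess.run(["lvchange", "--refresh", "%s/%s" % (vg, lv)],
                   check=True)

With this split, the SPM could update only the metadata during migration, refresh the destination first, and refresh its own device last, matching the ordering already used for extends requested by non-SPM hosts.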
Initial testing of the --driverloaded option shows that it works:

# lvcreate -n test --size 1g 313e6d78-80f7-41ab-883b-d1bddf77a5da
  Logical volume "test" created.

# lsblk | grep test
└─313e6d78--80f7--41ab--883b--d1bddf77a5da-test 253:29 0 1G 0 lvm

# lvextend --driverloaded n --size +1g 313e6d78-80f7-41ab-883b-d1bddf77a5da/test
  WARNING: Activation disabled. No device-mapper interaction will be attempted.
  Size of logical volume 313e6d78-80f7-41ab-883b-d1bddf77a5da/test changed from 1.00 GiB (8 extents) to 2.00 GiB (16 extents).
  Logical volume 313e6d78-80f7-41ab-883b-d1bddf77a5da/test successfully resized.

# lsblk | grep test
└─313e6d78--80f7--41ab--883b--d1bddf77a5da-test 253:29 0 1G 0 lvm

# lvchange --refresh 313e6d78-80f7-41ab-883b-d1bddf77a5da/test

# lsblk | grep test
└─313e6d78--80f7--41ab--883b--d1bddf77a5da-test 253:29 0 2G 0 lvm

We need to test this with a running guest during migration.
I tried to test it with the latest script Nir attached ("Reproducer script for testing new host"). Since the script didn't include the stress tool, I turned it on manually before the migration and turned it off when I saw the first extension.

Here are the logs from the source host (SPM):

2021-10-03 12:39:27,655+0300 INFO (periodic/4) [virt.vm] (vmId='4838c7f8-13b0-4605-966a-3bccbe30ac22') Requesting extension for volume bc7c81db-abb8-4ae5-b772-95fc1c366fbe on domain f76ddb17-7c14-4cfc-abed-ea1349ef0ef1 (apparent: 1207959552, capacity: 3221225472, allocated: 1141309440, physical: 1207959552 threshold_state: exceeded) (vm:1350)
2021-10-03 12:39:27,655+0300 INFO (periodic/4) [vdsm.api] START sendExtendMsg(spUUID='84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', volDict={'domainID': 'f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', 'imageID': 'c5ab2a91-b0c3-48a2-a653-61c80f32aacc', 'internal': False, 'name': 'sdb', 'newSize': 1342177280, 'poolID': '84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', 'volumeID': 'bc7c81db-abb8-4ae5-b772-95fc1c366fbe', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': None}, newSize=1342177280, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7fb4785b5278>>) from=internal, task_id=94e85d42-a7db-458a-8ab0-6d651e95c009 (api:48)
2021-10-03 12:39:29,659+0300 INFO (periodic/0) [virt.vm] (vmId='4838c7f8-13b0-4605-966a-3bccbe30ac22') Requesting extension for volume bc7c81db-abb8-4ae5-b772-95fc1c366fbe on domain f76ddb17-7c14-4cfc-abed-ea1349ef0ef1 (apparent: 1207959552, capacity: 3221225472, allocated: 1141309440, physical: 1207959552 threshold_state: exceeded) (vm:1350)
2021-10-03 12:39:29,660+0300 INFO (periodic/0) [vdsm.api] START sendExtendMsg(spUUID='84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', volDict={'domainID': 'f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', 'imageID': 'c5ab2a91-b0c3-48a2-a653-61c80f32aacc', 'internal': False, 'name': 'sdb', 'newSize': 1342177280, 'poolID': '84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', 'volumeID': 'bc7c81db-abb8-4ae5-b772-95fc1c366fbe', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': None}, newSize=1342177280, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7fb4785b5278>>) from=internal, task_id=3bd9358c-2864-48a1-a81a-8b568f3d2a6d (api:48)
2021-10-03 12:39:29,701+0300 INFO (mailbox-hsm/0) [storage.ThreadPool.WorkerThread] START task 6e2ffb70-bb2b-427d-be45-97b345abb1f0 (cmd=<function runTask at 0x7fb4d5e122f0>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7fb4785b5278>>, {'domainID': 'f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', 'imageID': 'c5ab2a91-b0c3-48a2-a653-61c80f32aacc', 'internal': False, 'name': 'sdb', 'newSize': 1342177280, 'poolID': '84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', 'volumeID': 'bc7c81db-abb8-4ae5-b772-95fc1c366fbe', 'clock': <Clock(total=2.05*, extend-volume=2.05, refresh-destination-volume=0.00*)>, 'callback': None})) (threadPool:146)
2021-10-03 12:39:30,125+0300 WARN (mailbox-hsm/0) [virt.vm] (vmId='4838c7f8-13b0-4605-966a-3bccbe30ac22') Migration destination host does not support extending disk during migration, disabling disk extension during migration (vm:1597)

Before the extension (data from source):

2021-10-03 12:39:23,410+0300 INFO (periodic/5) [vdsm.api] START getVolumeSize(sdUUID='f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', spUUID='84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', imgUUID='c5ab2a91-b0c3-48a2-a653-61c80f32aacc', volUUID='bc7c81db-abb8-4ae5-b772-95fc1c366fbe') from=internal, task_id=6412041d-14fe-46e6-9a3a-953a22af32f3 (api:48)
2021-10-03 12:39:23,410+0300 INFO (periodic/5) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1207959552', 'truesize': '1207959552'} from=internal, task_id=6412041d-14fe-46e6-9a3a-953a22af32f3 (api:54)

After the extension (data from dest):

2021-10-03 12:43:24,974+0300 INFO (periodic/2) [vdsm.api] START getVolumeSize(sdUUID='f76ddb17-7c14-4cfc-abed-ea1349ef0ef1', spUUID='84ba839f-92fa-4bb2-8a55-a4b771e7eb1d', imgUUID='c5ab2a91-b0c3-48a2-a653-61c80f32aacc', volUUID='bc7c81db-abb8-4ae5-b772-95fc1c366fbe', options=None) from=internal, task_id=148b5b3d-5d48-4951-962a-5f0735f441ee (api:48)
2021-10-03 12:43:24,974+0300 INFO (periodic/2) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1342177280', 'truesize': '1342177280'} from=internal, task_id=148b5b3d-5d48-4951-962a-5f0735f441ee (api:54)

Versions:
SPM - vdsm-4.40.90.1-1.el8ev.x86_64
NON SPM - vdsm-4.40.60.6-1.el8ev.x86_64

1) I didn't have any data corruption in any of my tries to reproduce the issue. Is that enough to say this is OK?

2) As mentioned in the fix for the bug, after the first extension in this scenario further extensions will be blocked. After the migration is fully done, should the disk still be blocked for extensions as well? I tried with and without migration, and extension seems to still be turned off.
> 1) I didn't have any data corruption in all of my tries to reproduce the
> issue, is that enough to say this is ok?

You should be able to reproduce the issue. Nir wrote in comment #4 that he wasn't able to reproduce the issue with "Reproducer script for testing new host". IMHO you should use "reproducer script v2"; I was able to reproduce the issue with it very reliably. Once reproduced, you should apply the patch and verify that the issue doesn't appear any more.

> 2) As it was mentioned in the fix for the bug, after the first extension in
> this scenario, further extensions will be blocked - After the migration is
> fully done, should the disk be blocked for extensions as well? tried with or
> without migration and it seems to be still turned off.

No. Once the VM has migrated and there are more writes, the disk should be extended as usual (and the VM resumed if it was suspended due to lack of disk space during migration).
Verified using the "reproducer script v2" attached.

Tried to reproduce the issue with a non-fixed version:
SPM - vdsm-4.40.80.5-1.el8ev.x86_64
Non-SPM - vdsm-4.40.60.7-1.el8ev.x86_64

The problem reproduced:

15:25:48.776940681 Starting stress
Press Enter when migration was started:
15:26:17.571062005 Writing to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.02058 s, 89.2 MB/s
15:26:23.643644764 Waiting until disk is extended
15:26:33.986112410 Writing more data to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.01468 s, 89.3 MB/s
0+1 records in
0+1 records out
17 bytes copied, 0.036696 s, 0.5 kB/s
15:26:40.546513745 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:40.573175317 Stopping stress
./rep.sh: line 39:  5258 Terminated              stress --quiet --vm 1 --vm-bytes 256M
15:26:41.785037050 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:42.807564796 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:43.830035774 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:44.856188442 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:45.880841733 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:46.918733875 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:47.943472268 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:48.968195296 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:49.985555331 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:51.009182605 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:52.034402910 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:53.056241145 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:54.088545379 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:55.113889349 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:56.140813197 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
15:26:57.222328483 Block 1024 checksum 3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3
15:26:57.227912105 Disk was corrupted
[1]+ Terminated stress --quiet --vm 1 --vm-bytes 256M

On the fixed version it passed:

16:23:38.376413592 Starting stress
Press Enter when migration was started:
16:24:05.723303394 Writing to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.07161 s, 88.4 MB/s
16:24:11.849690679 Waiting until disk is extended
16:24:21.879880452 Writing more data to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.18676 s, 86.8 MB/s
0+1 records in
0+1 records out
17 bytes copied, 0.0224842 s, 0.8 kB/s
16:24:28.203507895 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:28.216593750 Stopping stress
./rep.sh: line 39:  7436 Terminated              stress --quiet --vm 1 --vm-bytes 256M
16:24:29.369323173 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:30.393301942 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:31.411367575 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:32.430409806 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:33.449047009 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:34.553433788 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:35.597597104 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:36.617205793 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:37.640333386 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:38.665031176 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:39.689913731 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:40.706734709 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
16:24:41.723192333 Block 1024 checksum 7623926e2157f135ec276d28b13943f9d6d2e5ca
......

The script continued to loop for a long time after the migration was done because the checksum stayed identical, which means the disk was not corrupted.
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.