Bug 1883399 - During migration, late volume extension request on src is possibly not refreshed on dst, qcow2 corrupt bit set
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.10
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.7
Target Release: 4.4.7
Assignee: Vojtech Juranek
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks: 1985906
 
Reported: 2020-09-29 04:13 UTC by Germano Veit Michel
Modified: 2021-07-26 08:54 UTC (History)
20 users

Fixed In Version: vdsm-4.40.70.6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-22 15:08:31 UTC
oVirt Team: Storage
Target Upstream Version:


Attachments
Reproducer script (1.42 KB, application/x-shellscript)
2021-07-16 01:07 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5440961 0 None None None 2020-09-29 04:37:17 UTC
Red Hat Product Errata RHBA-2021:2864 0 None None None 2021-07-22 15:09:30 UTC
oVirt gerrit 114589 0 master POST virt: refresh disk on destination during migration 2021-06-21 20:00:45 UTC
oVirt gerrit 114590 0 master ABANDONED virt: disallow disk extension during migration 2021-06-03 14:38:57 UTC
oVirt gerrit 114616 0 master MERGED tests: add migration thread into live merge test VM 2021-06-07 17:00:18 UTC
oVirt gerrit 115065 0 master MERGED virt: add new API for volume refresh 2021-06-22 15:20:30 UTC
oVirt gerrit 115175 0 master POST doc: document chunked drive extenstion flow during migration 2021-06-21 20:00:49 UTC
oVirt gerrit 115376 0 master MERGED caps: add capability for refreshing disk 2021-06-23 15:07:57 UTC
oVirt gerrit 115378 0 master MERGED virt: disable disk extension during migration when host doesn't support it 2021-06-30 12:40:43 UTC
oVirt gerrit 115478 0 master MERGED virt: improve the check when disk should be refreshed 2021-07-08 09:13:34 UTC
oVirt gerrit 115608 0 master MERGED virt: fix error message when disk refresh fails 2021-07-08 13:53:29 UTC
oVirt gerrit 115642 0 master MERGED virt: fix error handling in Vm.refresh_disk() method 2021-07-09 22:44:02 UTC
oVirt gerrit 115645 0 master ABANDONED virt: refresh all VM disks once migration finishes 2021-07-12 14:10:40 UTC

Description Germano Veit Michel 2020-09-29 04:13:04 UTC
Description of problem:

A VM paused with EIO right after resuming on the destination once the migration finished:
qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with refcount block); further corruption events will be suppressed
The corrupt bit was set and the VM won't start again.

From the logs, I found it interesting that there is an extension request completing on the SRC at about the same time the migration finishes, and the DST host appears to have used the LV without refreshing it (old size) when the VM runs.

Migration Source
----------------

1. This is already late in the migration; the block threshold for the drive is exceeded:

2020-09-26 17:22:26,578+0800 INFO  (libvirt/events) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') block threshold 1610612736 exceeded on 'vdb' (/rhev/data-center/mnt/blockSD/04bd57db-c224-497c-a916-29bd2642b419/images/731db18d-6064-4c2d-8de5-4fd50a3f9f7b/0c815200-c526-4310-9057-17bbc2167615) (drivemonitor:165)

2. Request Extension (note the values)

2020-09-26 17:22:29,527+0800 INFO  (periodic/37) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') Requesting extension for volume 0c815200-c526-4310-9057-17bbc2167615 on domain 04bd57db-c224-497c-a916-29bd2642b419 (
  apparent:    6442450944, 
  capacity:  536870912000,
  allocated:   6098780160,
  physical:    6442450944
  threshold_state: exceeded) (vm:1289)

3. Refresh LV and finish extension
  
2020-09-26 17:22:29,617+0800 INFO  (mailbox-hsm/4) [storage.LVM] Refreshing LVs (vg=04bd57db-c224-497c-a916-29bd2642b419, lvs=(u'0c815200-c526-4310-9057-17bbc2167615',)) (lvm:1637)
2020-09-26 17:22:30,094+0800 INFO  (mailbox-hsm/4) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=754ff0c8-c5e0-4a78-83e0-4f8c44c10f54 (api:54)
2020-09-26 17:22:30,095+0800 INFO  (mailbox-hsm/4) [vdsm.api] START getVolumeSize(sdUUID=u'04bd57db-c224-497c-a916-29bd2642b419', spUUID='33f59a82-e273-433f-a59f-4ae661527d04', imgUUID=u'731db18d-6064-4c2d-8de5-4fd50a3f9f7b', volUUID=u'0c815200-c526-4310-9057-17bbc2167615', options=None) from=internal, task_id=d7ec6245-5b30-4978-a649-c1e2fe26436f (api:48)
2020-09-26 17:22:30,095+0800 INFO  (mailbox-hsm/4) [vdsm.api] FINISH getVolumeSize return={'truesize': '7516192768', 'apparentsize': '7516192768'} from=internal, task_id=d7ec6245-5b30-4978-a649-c1e2fe26436f (api:54)
2020-09-26 17:22:30,095+0800 INFO  (mailbox-hsm/4) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') Extend volume 0c815200-c526-4310-9057-17bbc2167615 completed <Clock(total=2.57, extend-volume=2.09, refresh-volume=0.48)> (vm:1412)

2020-09-26 17:22:30,095+0800 INFO  (mailbox-hsm/4) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') setting block threshold to 6979321856 bytes for drive 'vdb' (apparentsize 7516192768) (drivemonitor:110)

4. Just before the block threshold is set, the VM has finished migrating

2020-09-26 17:22:42,044+0800 INFO  (libvirt/events) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') CPU stopped: onSuspend (vm:6094)

5. Set block threshold on source fails, as the VM is not there anymore

2020-09-26 17:22:46,983+0800 ERROR (mailbox-hsm/4) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') Failed to set block threshold on 'vdb' (/rhev/data-center/mnt/blockSD/04bd57db-c224-497c-a916-29bd2642b419/images/731db18d-6064-4c2d-8de5-4fd50a3f9f7b/0c815200-c526-4310-9057-17bbc2167615): Requested operation is not valid: domain is not running (drivemonitor:122)

6. Source ends

2020-09-26 17:22:47,001+0800 INFO  (migsrc/978c5fea) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') Changed state to Down: Migration succeeded (code=4) (vm:1702)


At this point it appears the worst that could happen is the VM running with wrong/old block threshold... But now on the Destination things get more complicated:

Migration Destination
---------------------

7. Migration starts

2020-09-26 17:21:56,288+0800 INFO  (jsonrpc/6) [api.virt] START migrationCreate

8. prepareImage and getVolumeSize for the disk above run before the extension (which completes at 17:22:30, step 3 above), so these are pre-extension values.

2020-09-26 17:21:56,718+0800 INFO  (vm/978c5fea) [vdsm.api] START getVolumeSize(sdUUID='04bd57db-c224-497c-a916-29bd2642b419', spUUID='33f59a82-e273-433f-a59f-4ae661527d04', imgUUID='731db18d-6064-4c2d-8de5-4fd50a3f9f7b', volUUID='0c815200-c526-4310-9057-17bbc2167615', options=None) from=internal, task_id=d572ccc5-b902-406e-8490-05067f0a527a (api:48)
2020-09-26 17:21:56,718+0800 INFO  (vm/978c5fea) [vdsm.api] FINISH getVolumeSize return={'truesize': '6442450944', 'apparentsize': '6442450944'} from=internal, task_id=d572ccc5-b902-406e-8490-05067f0a527a (api:54)

9. LV was extended and refreshed on the source, but no signs of refresh on the DST...

10. VM runs, and immediately stops with EIO.

2020-09-26 17:22:42,212+0800 INFO  (libvirt/events) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') CPU running: onResume (vm:6094)
2020-09-26 17:22:42,341+0800 INFO  (vm/978c5fea) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') Starting connection (guestagent:256)
2020-09-26 17:22:42,458+0800 INFO  (libvirt/events) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') abnormal vm stop device ua-731db18d-6064-4c2d-8de5-4fd50a3f9f7b error eio (vm:5079)

qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with refcount block); further corruption events will be suppressed

11. A set-block-threshold call and an extension request that make no sense:

2020-09-26 17:22:43,007+0800 INFO  (periodic/265) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') setting block threshold to 4831838208 bytes for drive 'vdb' (apparentsize 6442450944) (drivemonitor:110)
2020-09-26 17:22:43,011+0800 ERROR (periodic/265) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') Failed to set block threshold on 'vdb' (/rhev/data-center/mnt/blockSD/04bd57db-c224-497c-a916-29bd2642b419/images/731db18d-6064-4c2d-8de5-4fd50a3f9f7b/0c815200-c526-4310-9057-17bbc2167615): internal error: unable to execute QEMU command 'query-named-block-nodes': Block device #block529 is ejected (drivemonitor:122)
2020-09-26 17:22:43,011+0800 ERROR (periodic/265) [virt.vm] (vmId='978c5fea-5470-49d3-a064-03c28c307136') Improbable extension request for volume 0c815200-c526-4310-9057-17bbc2167615 on domain 04bd57db-c224-497c-a916-29bd2642b419, pausing the VM to avoid corruptions (capacity: 0, allocated: 8884736, physical: 6442450944, next physical size: 0) (drivemonitor:209)

At this point the volume was, on the src:
[SRC] 2020-09-26 17:22:30,095+0800 INFO  (mailbox-hsm/4) [vdsm.api] FINISH getVolumeSize return={'truesize': '7516192768', 'apparentsize': '7516192768'} from=internal, task_id=d7ec6245-5b30-4978-a649-c1e2fe26436f (api:54)
So it seems it wasn't refreshed on the DST...

12. And we have this LVM warning on LV deactivation on shutdown of the VM on DST:

2020-09-26 17:24:10,764+0800 WARN  (periodic/264) [storage.LVM] Command ['/sbin/lvm', 'lvs', '--config', 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/360060e80123c960050403c9600000108$|^/dev/mapper/360060e80123c960050403c9600000109$|", "r|.*|"] } global {  locking_type=4  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }', '--noheadings', '--units', 'b', '--nosuffix', '--separator', '|', '--ignoreskippedcluster', '-o', 'uuid,name,vg_name,attr,size,seg_start_pe,devices,tags', '--select', 'vg_name = 04bd57db-c224-497c-a916-29bd2642b419'] succeeded with warnings: ['  WARNING: Cannot find matching striped segment for 04bd57db-c224-497c-a916-29bd2642b419/0c815200-c526-4310-9057-17bbc2167615.'] (lvm:466)

I believe the above warning can also be triggered by the lack of a refresh.

Version-Release number of selected component (if applicable):
vdsm-4.30.44-1.el7ev.x86_64

How reproducible:
Unknown
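
The failure mode above can be modeled in a few lines (an illustrative Python sketch with made-up sizes, not vdsm code): each host maps the shared LV through its own device-mapper table, so the extension done via the SPM stays invisible on the destination until that host refreshes the LV.

```python
# Minimal model of the race above (illustrative only, not vdsm code; sizes
# are hypothetical). An lvextend done via the SPM is invisible on a host
# until that host refreshes the LV (lvchange --refresh).

GiB = 1024**3

class HostView:
    """One host's cached view of the shared LV's size."""
    def __init__(self, size):
        self.size = size              # size exposed by this host's dm table

    def refresh(self, actual_size):
        self.size = actual_size       # pick up the new size from VG metadata

def write(view, offset):
    """A qemu write at `offset`: past the dm-visible size it fails with EIO."""
    if offset >= view.size:
        raise IOError("EIO: write past device end")

actual = 7 * GiB                      # steps 2-3: SPM extended the LV to 7G
src = HostView(6 * GiB)
dst = HostView(6 * GiB)
src.refresh(actual)                   # step 3: source refreshes, guest keeps writing

# steps 4-10: migration finishes; the destination resumes the VM *without*
# refreshing, so qcow2 metadata writes land past the stale device end.
write(src, 6 * GiB)                   # fine on the source: it sees 7 GiB
failed = False
try:
    write(dst, 6 * GiB)               # destination still sees 6 GiB -> EIO
except IOError:
    failed = True                     # VM pauses; qcow2 corrupt bit gets set
```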

Comment 3 Germano Veit Michel 2020-09-29 04:17:41 UTC
(In reply to Germano Veit Michel from comment #0)
> 12. And we have this LVM warning on LV deactivation on shutdown of the VM on
> DST:
> 
> 2020-09-26 17:24:10,764+0800 WARN  (periodic/264) [storage.LVM] Command
> ['/sbin/lvm', 'lvs', '--config', 'devices { 
> preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1 
> write_cache_state=0  disable_after_error_count=3 
> filter=["a|^/dev/mapper/360060e80123c960050403c9600000108$|^/dev/mapper/
> 360060e80123c960050403c9600000109$|", "r|.*|"] } global {  locking_type=4 
> prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup { 
> retain_min=50  retain_days=0 }', '--noheadings', '--units', 'b',
> '--nosuffix', '--separator', '|', '--ignoreskippedcluster', '-o',
> 'uuid,name,vg_name,attr,size,seg_start_pe,devices,tags', '--select',
> 'vg_name = 04bd57db-c224-497c-a916-29bd2642b419'] succeeded with warnings:
> ['  WARNING: Cannot find matching striped segment for
> 04bd57db-c224-497c-a916-29bd2642b419/0c815200-c526-4310-9057-17bbc2167615.']
> (lvm:466)
> 
> I believe the above warning can also be triggered by the lack of a refresh.

This suggests it too https://access.redhat.com/solutions/3231111

Comment 7 Vojtech Juranek 2021-05-19 08:02:15 UTC
Hi,
could someone from libvirt or qemu give me a hint whether it's possible to do some action once migration finishes, but before the VM is resumed on the destination? If we have to extend a VM image while migration is ongoing, we extend it on the source host, but the change is not visible on the destination host - we have to refresh the storage (LVs) on the destination host to make this change visible. If we don't do this, we end up with a corrupted image, as described in this BZ. We have to refresh the LVs on the destination once migration is complete (the VM is stopped on the source host), but before the VM is resumed on the destination host.

There are two questions:
* vdsm doesn't resume the VM on the destination host, it's done by libvirt, so we probably need some libvirt hook that runs after migration is finished. Is there any such hook?
* we want to minimize the time between the VM being stopped on the source host and resumed on the destination, so we want to do the storage refresh only when it's really needed. Is there a way to run a libvirt hook conditionally (only when the image was extended on the source)?

Also, if you have any other ideas/approaches for tackling this problem, they are more than welcome.
Thanks!

Comment 8 Dr. David Alan Gilbert 2021-05-19 09:50:14 UTC
The block storage side is different from the area I normally look at; let's try pkrempa (libvirt) and eblake (qemu/nbd)

Comment 9 Peter Krempa 2021-05-24 09:06:37 UTC
(In reply to Vojtech Juranek from comment #7)
> Hi,
> could someone from libvirt or qemu give me hint if it's possible to do some
> action once migration finish, but before VM is resumed on the destination?

Currently not. Libvirt's hook is currently called only before the migration starts to allow filtering the destination XML.

> If we have to extend VM image while migration is ongoing, we extend it on
> the source host, but the change is not visible on the destination host - we
> have to do storage (LVs) refresh on the destination host make this change
> visible. 

I suggest that if you are going to extend an image while migration is ongoing, you refresh the LVs on the destination _before_ refreshing them on the source. That should guarantee that the destination will not see the image smaller than the source.

> If we don't do this, we end up with corrupted image, as describe in
> this BZ. We have to refresh LVs on the destination once migration is
> complete (VM is stopped on the source host), but before VM is resumed on the
> destination host.
> 
> There are two questions:
> * vdsm doesn't resume VM on the destination host, it's done by libvirt, so
> we probably need some libvirt hook run after migration is finished. Is there
> any such hook? 

As said above, we currently don't have such hook.

> * we want to minimize time between VM is stopped on the source host and it's
> resumed on destination, therefore we want to do storage refresh only when
> it's really needed. Is there a way how run libvirt hook conditionally (only
> when image was extended on the source)?

That should not be necessary. You just want to refresh the LVs on the destination before refreshing the source, that way it won't have any impact on the migration.
 
> Also, if you have any other ideas/approach how to tackle this problem, you
> ideas are more than welcome.
> Thanks!
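
The ordering Peter proposes can be modeled in a few lines (illustrative Python, not vdsm code): as long as the destination is refreshed before the source, the destination's view of the LV is never smaller than the source's, so it does not matter when the VM is unpaused.

```python
# Model of the refresh ordering suggested in comment 9 (illustrative only).
# Invariant: at every step, dst_size >= src_size, so whenever the VM resumes
# on the destination, it sees at least as much of the LV as the source does.

def extend(src_size, dst_size, new_size, checkpoints):
    actual_size = new_size            # SPM extends the LV in VG metadata
    checkpoints.append((src_size, dst_size))
    dst_size = actual_size            # refresh the destination first ...
    checkpoints.append((src_size, dst_size))
    src_size = actual_size            # ... only then refresh the source
    checkpoints.append((src_size, dst_size))
    return src_size, dst_size

checkpoints = []
src, dst = extend(6, 6, 7, checkpoints)
# the invariant holds at every checkpoint
ok = all(d >= s for s, d in checkpoints)
```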

Comment 10 Vojtech Juranek 2021-05-24 10:24:54 UTC
 
> > * we want to minimize time between VM is stopped on the source host and it's
> > resumed on destination, therefore we want to do storage refresh only when
> > it's really needed. Is there a way how run libvirt hook conditionally (only
> > when image was extended on the source)?
> 
> That should not be necessary. You just want to refresh the LVs on the
> destination before refreshing the source, that way it won't have any impact
> on the migration.

thanks, but the issue is that we have to ensure the refresh happens *before* the VM is resumed on the destination. As resuming the VM on the destination is done by libvirt and is out of our control (and given there's no libvirt hook that runs once migration is finished), how can we ensure the VM won't be resumed before the refresh on the destination has finished?

Comment 11 Peter Krempa 2021-05-24 10:33:37 UTC
(In reply to Vojtech Juranek from comment #10)
>  
> > > * we want to minimize time between VM is stopped on the source host and it's
> > > resumed on destination, therefore we want to do storage refresh only when
> > > it's really needed. Is there a way how run libvirt hook conditionally (only
> > > when image was extended on the source)?
> > 
> > That should not be necessary. You just want to refresh the LVs on the
> > destination before refreshing the source, that way it won't have any impact
> > on the migration.
> 
> thanks, but the issue is we have to ensure that the refresh happens *before*
> VM is resumed on the destination. As the resuming VM on destination is done

I was pointing out that if you refresh the LVs on the destination _before_ you refresh them on the source of the migration, the problem should not exist regardless of whether that happens before or after the VM is unpaused, because either the source won't see them extended yet, or they will already be extended on the destination.

> by libvirt and is out of our control (and given there's no libvirt hook to
> be run once migration is finished), how can be ensure VM won't be resumed
> before refresh on the destination is done/finished?

Unfortunately, it's currently not possible. Please file an RFE if you require such functionality.

Comment 12 Eric Blake 2021-05-25 14:27:23 UTC
I wonder if having block resize blocked by an ongoing live migration is the right approach to avoid problems like this, or if that is too strict.

Comment 13 Nir Soffer 2021-05-25 15:08:42 UTC
(In reply to Eric Blake from comment #12)
> I wonder if having block resize blocked by an ongoing live migration is the
> right approach to avoid problems like this, or if that is too strict.

I think block resize is not related - this is the automatic extension of the underlying
LV, without changing the guest-visible disk size. When we refresh the LV after
extension, LVM suspends the device, updates the device size, and resumes the device. We
don't tell qemu about this change; it should be transparent to qemu.

When the VM pauses with ENOSPC, we resume it after a successful extension.

Refreshing the destination before the source seems the best way to handle this issue.
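
The flow the gerrit patches above converge on can be sketched roughly as follows (hypothetical function and step names; the real logic lives in vdsm's extend-completion callback and migration code). The Clock entries in the comment 21 logs show the same ordering: refresh-destination-volume before refresh-volume.

```python
# Rough sketch of the fixed extend-completion flow (hypothetical names, not
# the actual vdsm implementation). If the VM is migrating and the destination
# host supports the new VM.refresh_disk API, refresh the LV there before
# refreshing it locally; otherwise disk extension is disabled during the
# migration, so this callback never runs in that configuration.

def after_volume_extension(migrating, dst_supports_refresh):
    steps = []
    if migrating:
        if not dst_supports_refresh:
            # old destination: extension was blocked up front
            return ["extension-disabled"]
        steps.append("refresh-destination-volume")   # VM.refresh_disk on dst
    steps.append("refresh-volume")                   # local LV refresh
    steps.append("set-block-threshold")              # re-arm drive monitoring
    return steps
```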

Comment 17 Vojtech Juranek 2021-06-21 14:11:26 UTC
Part of the fix is a new API for refreshing the disk. One possible way to test the API is via vdsm-client:

[root@localhost ~]# cat args.json
{
    "vmID": "2dad9038-3e3a-4b5e-8d20-b0da37d9ef79",
    "drive": {
        "device": "disk",
        "domainID": "cdac2a0c-b110-456d-a988-7d588626c871",
        "imageID": "26dc7573-d7ad-422b-952f-f1ce3517277a",
        "poolID": "f034d5d0-ce1e-11eb-a870-525400ddf532",
        "volumeID": "0cafdf0f-8921-4e7f-8ba9-eeec86990740"
    }
}


[root@localhost ~]# vdsm-client -f args.json VM refresh_disk
{
    "apparentsize": "134217728",
    "truesize": "134217728"
}

Comment 21 Vojtech Juranek 2021-06-29 13:22:28 UTC
For reference, here are the logs from my test run, for the case when the destination host supports disk refresh and for when it doesn't (in the latter case, the drive monitor is temporarily disabled during the migration to prevent further extend attempts):
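
The fallback behaviour can be sketched as a small decision (hypothetical capability key; the real capability is advertised by the destination host, added by the "caps: add capability for refreshing disk" patch):

```python
# Sketch of the per-migration policy described above (illustrative only;
# the "refresh_disk" capability key is a hypothetical name).

def migration_storage_policy(dst_caps):
    """Pick how drive extension is handled while the VM migrates."""
    if dst_caps.get("refresh_disk", False):
        # new destination: extensions stay enabled; each completed extend
        # refreshes the LV on the destination before the source
        return {"drive_monitor": "enabled", "on_extend": "refresh-destination"}
    # old destination: disable the drive monitor for the whole migration so
    # no extend request can race with the switch-over
    return {"drive_monitor": "disabled", "on_extend": None}

new = migration_storage_policy({"refresh_disk": True})
old = migration_storage_policy({})
```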

Disk refresh on destination host supported
==========================

Source
---------

2021-06-29 08:52:57,654-0400 WARN  (Thread-2) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') TEST EXTENSION (migration:555)
2021-06-29 08:52:57,656-0400 INFO  (Thread-2) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=internal, task_id=3a9caeae-2c5d-4e5a-b03b-631b85566779 (api:48)
2021-06-29 08:52:57,658-0400 INFO  (Thread-2) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '5905580032', 'truesize': '5905580032'} from=internal, task_id=3a9caeae-2c5d-4e5a-b03b-631b85566779 (api:54)
2021-06-29 08:52:57,658-0400 INFO  (Thread-2) [vdsm.api] START sendExtendMsg(spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', volDict={'domainID': 'cdac2a0c-b110-456d-a988-7d588626c871', 'imageID': 'bff096ea-3bf3-4d6e-adcc-704d95e9988e', 'internal': False, 'name': 'sdc', 'newSize': 5916065792, 'poolID': 'f034d5d0-ce1e-11eb-a870-525400ddf532', 'volumeID': 'c16033d3-9608-4103-a3d4-ab4490f76b1f', 'clock': <Clock(extend-volume=0.00*, total=0.00*)>, 'callback': None}, newSize=5916065792, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f5aff51d2e8>>) from=internal, task_id=0b110377-3173-4aba-a6e7-b7116ad9bc1c (api:48)
2021-06-29 08:52:57,659-0400 INFO  (Thread-2) [vdsm.api] FINISH sendExtendMsg return=None from=internal, task_id=0b110377-3173-4aba-a6e7-b7116ad9bc1c (api:54)
2021-06-29 08:52:57,679-0400 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f034d5d0-ce1e-11eb-a870-525400ddf532/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=4'] (mailbox:382)
2021-06-29 08:52:58,303-0400 INFO  (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=09c3b083-deab-4578-8e2d-ac0d5b532220 (api:48)
2021-06-29 08:52:58,304-0400 INFO  (periodic/1) [vdsm.api] FINISH repoStats return={'cdac2a0c-b110-456d-a988-7d588626c871': {'code': 0, 'lastCheck': '1.4', 'delay': '0.00066451', 'valid': True, 'version': 5, 'acquired': True, 'actual': True}} from=internal, task_id=09c3b083-deab-4578-8e2d-ac0d5b532220 (api:54)
2021-06-29 08:52:59,575-0400 INFO  (mailbox-spm/0) [storage.ThreadPool.WorkerThread] START task ee1652f1-a1db-4e2d-a995-5516b6f08355 (cmd=<function runTask at 0x7f5b54cef488>, args=(functools.partial(<bound method SPM_Extend_Message.processRequest of <class 'vdsm.storage.mailbox.SPM_Extend_Message'>>, <vdsm.storage.sp.StoragePool object at 0x7f5afec8cc18>), 256, b'1xtndq\xc8&\x86X}\x88\xa9mE\x10\xb1\x0c*\xac\xcd\x1fk\xf7\x90D\xab\xd4\xa3\x03A\x08\x96\xd33`\xc1000000000000160a00000000000')) (threadPool:146)
2021-06-29 08:52:59,576-0400 INFO  (mailbox-spm/0) [storage.SPM.Messages.Extend] processRequest: extending volume c16033d3-9608-4103-a3d4-ab4490f76b1f in domain cdac2a0c-b110-456d-a988-7d588626c871 (pool f034d5d0-ce1e-11eb-a870-525400ddf532) to size 5642 (mailbox:168)
2021-06-29 08:52:59,576-0400 INFO  (mailbox-spm/0) [storage.LVM] Extending LV cdac2a0c-b110-456d-a988-7d588626c871/c16033d3-9608-4103-a3d4-ab4490f76b1f to 5642 megabytes (lvm:1668)
2021-06-29 08:52:59,731-0400 INFO  (mailbox-spm/0) [storage.ThreadPool.WorkerThread] FINISH task ee1652f1-a1db-4e2d-a995-5516b6f08355 (threadPool:148)
2021-06-29 08:52:59,883-0400 INFO  (jsonrpc/1) [vdsm.api] START getSpmStatus(spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', options=None) from=::ffff:192.168.122.148,34282, task_id=d10a26f9-8245-4ae6-9dc0-110d678ccc27 (api:48)
2021-06-29 08:52:59,889-0400 INFO  (jsonrpc/1) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmStatus': 'SPM', 'spmLver': 72, 'spmId': 4}} from=::ffff:192.168.122.148,34282, task_id=d10a26f9-8245-4ae6-9dc0-110d678ccc27 (api:54)
2021-06-29 08:52:59,895-0400 INFO  (jsonrpc/2) [vdsm.api] START getStoragePoolInfo(spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', options=None) from=::ffff:192.168.122.148,34366, task_id=d446d996-3d46-4ec6-8a70-088618ea1444 (api:48)
2021-06-29 08:52:59,971-0400 INFO  (jsonrpc/2) [vdsm.api] FINISH getStoragePoolInfo return={'info': {'domains': 'cdac2a0c-b110-456d-a988-7d588626c871:Active', 'isoprefix': '', 'lver': 72, 'master_uuid': 'cdac2a0c-b110-456d-a988-7d588626c871', 'master_ver': 104, 'name': 'No Description', 'pool_status': 'connected', 'spm_id': 4, 'type': 'ISCSI', 'version': '5'}, 'dominfo': {'cdac2a0c-b110-456d-a988-7d588626c871': {'status': 'Active', 'alerts': [], 'isoprefix': '', 'version': 5, 'disktotal': '31809601536', 'diskfree': '16374562816'}}} from=::ffff:192.168.122.148,34366, task_id=d446d996-3d46-4ec6-8a70-088618ea1444 (api:54)
2021-06-29 08:53:00,731-0400 INFO  (libvirt/events) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') CPU stopped: onSuspend (vm:6018)
2021-06-29 08:53:01,739-0400 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f034d5d0-ce1e-11eb-a870-525400ddf532/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=4'] (mailbox:382)
2021-06-29 08:53:01,744-0400 INFO  (mailbox-hsm/4) [storage.ThreadPool.WorkerThread] START task c47460da-e9a9-42a7-93b5-b2b9f4b23040 (cmd=<function runTask at 0x7f5b54cef488>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f5aff51d2e8>>, {'domainID': 'cdac2a0c-b110-456d-a988-7d588626c871', 'imageID': 'bff096ea-3bf3-4d6e-adcc-704d95e9988e', 'internal': False, 'name': 'sdc', 'newSize': 5916065792, 'poolID': 'f034d5d0-ce1e-11eb-a870-525400ddf532', 'volumeID': 'c16033d3-9608-4103-a3d4-ab4490f76b1f', 'clock': <Clock(extend-volume=4.09, total=4.09*, refresh-destination-volume=0.00*)>, 'callback': None})) (threadPool:146)
2021-06-29 08:53:01,745-0400 INFO  (mailbox-hsm/4) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Volume sdc (domainID: cdac2a0c-b110-456d-a988-7d588626c871, volumeID: c16033d3-9608-4103-a3d4-ab4490f76b1f) was extended during migration, refreshing it on destination host. (vm:1441)
2021-06-29 08:53:02,010-0400 INFO  (migsrc/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') migration took 6 seconds to complete (migration:551)
2021-06-29 08:53:02,010-0400 INFO  (migsrc/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Changed state to Down: Migration succeeded (code=4) (vm:1895)
2021-06-29 08:53:02,015-0400 INFO  (migsrc/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Stopping connection (guestagent:438)
2021-06-29 08:53:02,016-0400 INFO  (migsrc/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Enabling drive monitoring (drivemonitor:59)
2021-06-29 08:53:02,019-0400 INFO  (jsonrpc/3) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:192.168.122.148,34282, vmId=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 (api:48)
2021-06-29 08:53:02,019-0400 INFO  (jsonrpc/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Release VM resources (vm:5313)
2021-06-29 08:53:02,019-0400 INFO  (jsonrpc/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Stopping connection (guestagent:438)
2021-06-29 08:53:02,019-0400 INFO  (jsonrpc/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') _destroyVmGraceful attempt #0 (vm:5349)
2021-06-29 08:53:02,020-0400 WARN  (jsonrpc/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') VM '2dad9038-3e3a-4b5e-8d20-b0da37d9ef79' couldn't be destroyed in libvirt: Requested operation is not valid: domain is not running (vm:5370)
2021-06-29 08:53:02,021-0400 INFO  (jsonrpc/3) [vdsm.api] START teardownImage(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='26dc7573-d7ad-422b-952f-f1ce3517277a', volUUID=None) from=::ffff:192.168.122.148,34282, task_id=9da17f79-7dbf-40f1-a097-c2817e820e10 (api:48)
2021-06-29 08:53:02,021-0400 INFO  (jsonrpc/3) [storage.StorageDomain] Removing image run directory '/run/vdsm/storage/cdac2a0c-b110-456d-a988-7d588626c871/26dc7573-d7ad-422b-952f-f1ce3517277a' (blockSD:1387)
2021-06-29 08:53:02,021-0400 INFO  (jsonrpc/3) [storage.fileUtils] Removing directory: /run/vdsm/storage/cdac2a0c-b110-456d-a988-7d588626c871/26dc7573-d7ad-422b-952f-f1ce3517277a (fileUtils:182)
2021-06-29 08:53:02,087-0400 INFO  (jsonrpc/3) [storage.LVM] Deactivating lvs: vg=cdac2a0c-b110-456d-a988-7d588626c871 lvs=['0cafdf0f-8921-4e7f-8ba9-eeec86990740'] (lvm:1766)
2021-06-29 08:53:02,163-0400 INFO  (jsonrpc/3) [vdsm.api] FINISH teardownImage return=None from=::ffff:192.168.122.148,34282, task_id=9da17f79-7dbf-40f1-a097-c2817e820e10 (api:54)
2021-06-29 08:53:02,164-0400 INFO  (jsonrpc/3) [vdsm.api] START teardownImage(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='472e1fa3-8ed6-4a3b-abb0-fe166bf0349f', volUUID=None) from=::ffff:192.168.122.148,34282, task_id=0c80a593-e9b0-4189-9b38-d09a1f84836d (api:48)
2021-06-29 08:53:02,164-0400 INFO  (jsonrpc/3) [storage.StorageDomain] Removing image run directory '/run/vdsm/storage/cdac2a0c-b110-456d-a988-7d588626c871/472e1fa3-8ed6-4a3b-abb0-fe166bf0349f' (blockSD:1387)
2021-06-29 08:53:02,164-0400 INFO  (jsonrpc/3) [storage.fileUtils] Removing directory: /run/vdsm/storage/cdac2a0c-b110-456d-a988-7d588626c871/472e1fa3-8ed6-4a3b-abb0-fe166bf0349f (fileUtils:182)
2021-06-29 08:53:02,232-0400 INFO  (jsonrpc/3) [storage.LVM] Deactivating lvs: vg=cdac2a0c-b110-456d-a988-7d588626c871 lvs=['16a8a0ff-c612-4855-b15a-668efaa74abc'] (lvm:1766)
2021-06-29 08:53:02,318-0400 INFO  (jsonrpc/3) [vdsm.api] FINISH teardownImage return=None from=::ffff:192.168.122.148,34282, task_id=0c80a593-e9b0-4189-9b38-d09a1f84836d (api:54)
2021-06-29 08:53:02,318-0400 INFO  (jsonrpc/3) [vdsm.api] START teardownImage(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID=None) from=::ffff:192.168.122.148,34282, task_id=1a297208-9d5f-425a-b8e6-4b2741278c52 (api:48)
2021-06-29 08:53:02,319-0400 INFO  (jsonrpc/3) [storage.StorageDomain] Removing image run directory '/run/vdsm/storage/cdac2a0c-b110-456d-a988-7d588626c871/bff096ea-3bf3-4d6e-adcc-704d95e9988e' (blockSD:1387)
2021-06-29 08:53:02,319-0400 INFO  (jsonrpc/3) [storage.fileUtils] Removing directory: /run/vdsm/storage/cdac2a0c-b110-456d-a988-7d588626c871/bff096ea-3bf3-4d6e-adcc-704d95e9988e (fileUtils:182)
2021-06-29 08:53:02,383-0400 INFO  (jsonrpc/3) [storage.LVM] Deactivating lvs: vg=cdac2a0c-b110-456d-a988-7d588626c871 lvs=['c16033d3-9608-4103-a3d4-ab4490f76b1f'] (lvm:1766)
2021-06-29 08:53:02,457-0400 INFO  (jsonrpc/3) [vdsm.api] FINISH teardownImage return=None from=::ffff:192.168.122.148,34282, task_id=1a297208-9d5f-425a-b8e6-4b2741278c52 (api:54)
2021-06-29 08:53:02,457-0400 INFO  (jsonrpc/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Stopping connection (guestagent:438)
2021-06-29 08:53:02,461-0400 WARN  (jsonrpc/3) [root] Attempting to remove a non existing network: ovirtmgmt/2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 (libvirtnetwork:200)
2021-06-29 08:53:02,461-0400 WARN  (jsonrpc/3) [root] Attempting to remove a non existing net user: ovirtmgmt/2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 (libvirtnetwork:207)
2021-06-29 08:53:02,462-0400 INFO  (jsonrpc/3) [vdsm.api] START inappropriateDevices(thiefId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') from=::ffff:192.168.122.148,34282, task_id=06899319-b960-4fd2-aae0-1f52ec484e60 (api:48)
2021-06-29 08:53:02,465-0400 INFO  (jsonrpc/3) [vdsm.api] FINISH inappropriateDevices return=None from=::ffff:192.168.122.148,34282, task_id=06899319-b960-4fd2-aae0-1f52ec484e60 (api:54)
2021-06-29 08:53:02,568-0400 INFO  (jsonrpc/3) [root] /usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd: rc=0 err=b'' (hooks:122)
2021-06-29 08:53:02,651-0400 INFO  (jsonrpc/3) [root] /usr/libexec/vdsm/hooks/after_vm_destroy/delete_vhostuserclient_hook: rc=0 err=b'' (hooks:122)
2021-06-29 08:53:02,652-0400 INFO  (jsonrpc/3) [api.virt] FINISH destroy return={'status': {'code': 0, 'message': 'Machine destroyed'}} from=::ffff:192.168.122.148,34282, vmId=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 (api:54)
2021-06-29 08:53:02,916-0400 INFO  (mailbox-hsm/4) [vdsm.api] START refreshVolume(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f') from=internal, task_id=54fc1c43-657c-4f85-9d58-1a632efbd921 (api:48)
2021-06-29 08:53:02,979-0400 INFO  (mailbox-hsm/4) [storage.LVM] Refreshing LVs (vg=cdac2a0c-b110-456d-a988-7d588626c871, lvs=('c16033d3-9608-4103-a3d4-ab4490f76b1f',)) (lvm:1782)
2021-06-29 08:53:03,043-0400 INFO  (mailbox-hsm/4) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=54fc1c43-657c-4f85-9d58-1a632efbd921 (api:54)
2021-06-29 08:53:03,044-0400 INFO  (mailbox-hsm/4) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=internal, task_id=73e3296e-b53f-4e5c-bf25-b17dc31daa98 (api:48)
2021-06-29 08:53:03,101-0400 INFO  (mailbox-hsm/4) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '6039797760', 'truesize': '6039797760'} from=internal, task_id=73e3296e-b53f-4e5c-bf25-b17dc31daa98 (api:54)
2021-06-29 08:53:03,102-0400 INFO  (mailbox-hsm/4) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Extend volume c16033d3-9608-4103-a3d4-ab4490f76b1f completed <Clock(extend-volume=4.09, total=5.45, refresh-destination-volume=1.17, refresh-volume=0.13)> (vm:1574)
2021-06-29 08:53:03,102-0400 INFO  (mailbox-hsm/4) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') setting block threshold to 5997854720 bytes for drive 'sdc' (apparentsize 6039797760) (drivemonitor:116)
2021-06-29 08:53:03,103-0400 ERROR (mailbox-hsm/4) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Failed to set block threshold on 'sdc' (): Domain not found: no domain with matching uuid '2dad9038-3e3a-4b5e-8d20-b0da37d9ef79' (cirros) (drivemonitor:128)
2021-06-29 08:53:03,103-0400 ERROR (mailbox-hsm/4) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') cannot cont while Down (vm:1709)
2021-06-29 08:53:03,103-0400 INFO  (mailbox-hsm/4) [storage.ThreadPool.WorkerThread] FINISH task c47460da-e9a9-42a7-93b5-b2b9f4b23040 (threadPool:148)


Destination
----------------

2021-06-29 08:53:02,800-0400 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities took more than 1.00 seconds to succeed: 1.06 (__init__:316)
2021-06-29 08:53:02,805-0400 INFO  (jsonrpc/5) [api.virt] START refresh_disk(drive={'device': 'disk', 'poolID': 'f034d5d0-ce1e-11eb-a870-525400ddf532', 'domainID': 'cdac2a0c-b110-456d-a988-7d588626c871', 'imageID': 'bff096ea-3bf3-4d6e-adcc-704d95e9988e', 'volumeID': 'c16033d3-9608-4103-a3d4-ab4490f76b1f'}) from=::ffff:192.168.122.240,33876, vmId=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 (api:48)
2021-06-29 08:53:02,805-0400 INFO  (jsonrpc/5) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Refreshing volume for drive sdd (domainID: cdac2a0c-b110-456d-a988-7d588626c871, volumeID: c16033d3-9608-4103-a3d4-ab4490f76b1f) (vm:1408)
2021-06-29 08:53:02,805-0400 INFO  (jsonrpc/5) [vdsm.api] START refreshVolume(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f') from=::ffff:192.168.122.240,33876, task_id=52ea0e73-6cac-405c-b735-e39143a85bd3 (api:48)
2021-06-29 08:53:02,806-0400 INFO  (jsonrpc/5) [storage.LVM] Refreshing LVs (vg=cdac2a0c-b110-456d-a988-7d588626c871, lvs=('c16033d3-9608-4103-a3d4-ab4490f76b1f',)) (lvm:1782)
2021-06-29 08:53:02,896-0400 INFO  (jsonrpc/5) [vdsm.api] FINISH refreshVolume return=None from=::ffff:192.168.122.240,33876, task_id=52ea0e73-6cac-405c-b735-e39143a85bd3 (api:54)
2021-06-29 08:53:02,897-0400 INFO  (jsonrpc/5) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=::ffff:192.168.122.240,33876, task_id=51197411-de51-462b-86ad-e559d16545a0 (api:48)
2021-06-29 08:53:02,902-0400 INFO  (jsonrpc/5) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '6039797760', 'truesize': '6039797760'} from=::ffff:192.168.122.240,33876, task_id=51197411-de51-462b-86ad-e559d16545a0 (api:54)
2021-06-29 08:53:02,903-0400 INFO  (jsonrpc/5) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') setting block threshold to 5997854720 bytes for drive 'sdd' (apparentsize 6039797760) (drivemonitor:116)
2021-06-29 08:53:02,906-0400 INFO  (jsonrpc/5) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=::ffff:192.168.122.240,33876, task_id=88efb701-6463-417c-814f-ba285efb29d7 (api:48)
2021-06-29 08:53:02,907-0400 INFO  (jsonrpc/5) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '6039797760', 'truesize': '6039797760'} from=::ffff:192.168.122.240,33876, task_id=88efb701-6463-417c-814f-ba285efb29d7 (api:54)
2021-06-29 08:53:02,907-0400 INFO  (jsonrpc/5) [api.virt] FINISH refresh_disk return={'result': {'apparentsize': '6039797760', 'truesize': '6039797760'}, 'status': {'code': 0, 'message': 'Done'}} from=::ffff:192.168.122.240,33876, vmId=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 (api:54)




Disk refresh on destination host NOT supported
=============================

Source
---------

2021-06-29 08:46:02,897-0400 WARN  (Thread-1) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') TEST EXTENSION (migration:555)
2021-06-29 08:46:02,898-0400 INFO  (Thread-1) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=internal, task_id=edb77ef8-2b2a-4407-a5c0-e4bd614d5334 (api:48)
2021-06-29 08:46:02,900-0400 INFO  (Thread-1) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '5771362304', 'truesize': '5771362304'} from=internal, task_id=edb77ef8-2b2a-4407-a5c0-e4bd614d5334 (api:54)
2021-06-29 08:46:02,901-0400 INFO  (Thread-1) [vdsm.api] START sendExtendMsg(spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', volDict={'domainID': 'cdac2a0c-b110-456d-a988-7d588626c871', 'imageID': 'bff096ea-3bf3-4d6e-adcc-704d95e9988e', 'internal': False, 'name': 'sdc', 'newSize': 5781848064, 'poolID': 'f034d5d0-ce1e-11eb-a870-525400ddf532', 'volumeID': 'c16033d3-9608-4103-a3d4-ab4490f76b1f', 'clock': <Clock(extend-volume=0.00*, total=0.00*)>, 'callback': None}, newSize=5781848064, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f5afab441d0>>) from=internal, task_id=825db9bf-7289-493b-b3d1-e97b4c804d0f (api:48)
2021-06-29 08:46:02,901-0400 INFO  (Thread-1) [vdsm.api] FINISH sendExtendMsg return=None from=internal, task_id=825db9bf-7289-493b-b3d1-e97b4c804d0f (api:54)
2021-06-29 08:46:02,919-0400 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f034d5d0-ce1e-11eb-a870-525400ddf532/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=4'] (mailbox:382)
2021-06-29 08:46:02,928-0400 INFO  (jsonrpc/7) [api.host] START getAllVmStats() from=::ffff:192.168.122.148,34282 (api:48)
2021-06-29 08:46:02,930-0400 INFO  (jsonrpc/7) [api.host] FINISH getAllVmStats return={'status': {'code': 0, 'message': 'Done'}, 'statsList': (suppressed)} from=::ffff:192.168.122.148,34282 (api:54)
2021-06-29 08:46:03,954-0400 INFO  (mailbox-spm/0) [storage.ThreadPool.WorkerThread] START task b4bde1ea-80c7-45e1-8d15-0da41d33b87e (cmd=<function runTask at 0x7f5b54cef488>, args=(functools.partial(<bound method SPM_Extend_Message.processRequest of <class 'vdsm.storage.mailbox.SPM_Extend_Message'>>, <vdsm.storage.sp.StoragePool object at 0x7f5afec8cc18>), 256, b'1xtndq\xc8&\x86X}\x88\xa9mE\x10\xb1\x0c*\xac\xcd\x1fk\xf7\x90D\xab\xd4\xa3\x03A\x08\x96\xd33`\xc1000000000000158a00000000000')) (threadPool:146)
2021-06-29 08:46:03,954-0400 INFO  (mailbox-spm/0) [storage.SPM.Messages.Extend] processRequest: extending volume c16033d3-9608-4103-a3d4-ab4490f76b1f in domain cdac2a0c-b110-456d-a988-7d588626c871 (pool f034d5d0-ce1e-11eb-a870-525400ddf532) to size 5514 (mailbox:168)
2021-06-29 08:46:03,955-0400 INFO  (mailbox-spm/0) [storage.LVM] Extending LV cdac2a0c-b110-456d-a988-7d588626c871/c16033d3-9608-4103-a3d4-ab4490f76b1f to 5514 megabytes (lvm:1668)
2021-06-29 08:46:04,093-0400 INFO  (mailbox-spm/0) [storage.ThreadPool.WorkerThread] FINISH task b4bde1ea-80c7-45e1-8d15-0da41d33b87e (threadPool:148)
2021-06-29 08:46:04,950-0400 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f034d5d0-ce1e-11eb-a870-525400ddf532/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=4'] (mailbox:382)
2021-06-29 08:46:04,954-0400 INFO  (mailbox-hsm/1) [storage.ThreadPool.WorkerThread] START task 67143acc-5685-4d6e-98a7-ff2e0dcef81f (cmd=<function runTask at 0x7f5b54cef488>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f5afab441d0>>, {'domainID': 'cdac2a0c-b110-456d-a988-7d588626c871', 'imageID': 'bff096ea-3bf3-4d6e-adcc-704d95e9988e', 'internal': False, 'name': 'sdc', 'newSize': 5781848064, 'poolID': 'f034d5d0-ce1e-11eb-a870-525400ddf532', 'volumeID': 'c16033d3-9608-4103-a3d4-ab4490f76b1f', 'clock': <Clock(extend-volume=2.06, total=2.06*, refresh-destination-volume=0.00*)>, 'callback': None})) (threadPool:146)
2021-06-29 08:46:04,955-0400 INFO  (mailbox-hsm/1) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Volume sdc (domainID: cdac2a0c-b110-456d-a988-7d588626c871, volumeID: c16033d3-9608-4103-a3d4-ab4490f76b1f) was extended during migration, refreshing it on destination host. (vm:1441)
2021-06-29 08:46:06,059-0400 WARN  (mailbox-hsm/1) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Migration destination host does not support extending disk during migration, disabling disk extension during migration (vm:1586)
2021-06-29 08:46:06,060-0400 INFO  (mailbox-hsm/1) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Disabling drive monitoring (drivemonitor:66)
2021-06-29 08:46:06,060-0400 INFO  (mailbox-hsm/1) [storage.ThreadPool.WorkerThread] FINISH task 67143acc-5685-4d6e-98a7-ff2e0dcef81f (threadPool:148)
2021-06-29 08:46:06,180-0400 INFO  (libvirt/events) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') CPU stopped: onSuspend (vm:6018)
2021-06-29 08:46:06,182-0400 INFO  (qgapoller/2) [vds] Failed to get guest info for vm=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79, error: Requested operation is not valid: domain is not running (qemuguestagent:711)
2021-06-29 08:46:06,975-0400 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f034d5d0-ce1e-11eb-a870-525400ddf532/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=4'] (mailbox:382)
2021-06-29 08:46:07,459-0400 INFO  (migsrc/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') migration took 6 seconds to complete (migration:551)
2021-06-29 08:46:07,459-0400 INFO  (migsrc/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Changed state to Down: Migration succeeded (code=4) (vm:1895)
2021-06-29 08:46:07,464-0400 INFO  (migsrc/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Stopping connection (guestagent:438)
2021-06-29 08:46:07,464-0400 INFO  (migsrc/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Enabling drive monitoring (drivemonitor:59)


Destination
----------------

2021-06-29 08:46:06,053-0400 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getCapabilities took more than 1.00 seconds to succeed: 1.09 (__init__:316)
2021-06-29 08:46:06,218-0400 INFO  (libvirt/events) [vds] Channel state for vm_id=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 changed from=disconnected(2) to=disconnected(2) (qemuguestagent:289)
2021-06-29 08:46:06,234-0400 INFO  (libvirt/events) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') CPU running: onResume (vm:6018)
2021-06-29 08:46:06,251-0400 INFO  (libvirt/events) [vds] Channel state for vm_id=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 changed from=disconnected(2) to=disconnected(2) (qemuguestagent:289)
2021-06-29 08:46:06,261-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Starting connection (guestagent:252)
2021-06-29 08:46:06,264-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') CPU running: domain initialization (vm:6018)
2021-06-29 08:46:06,272-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') End of migration (vm:4074)
2021-06-29 08:46:06,424-0400 INFO  (periodic/2) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') setting block threshold to 92274688 bytes for drive 'sda' (apparentsize 134217728) (drivemonitor:116)
2021-06-29 08:46:06,428-0400 INFO  (periodic/2) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') setting block threshold to 4253024256 bytes for drive 'sdb' (apparentsize 4294967296) (drivemonitor:116)
2021-06-29 08:46:06,431-0400 INFO  (periodic/2) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') setting block threshold to 5729419264 bytes for drive 'sdd' (apparentsize 5771362304) (drivemonitor:116)

Comment 34 Nir Soffer 2021-07-07 16:54:03 UTC
Reproducing the corruption after migration may not be possible; it depends on
how qemu reads and writes data from the image. I think it will happen only
when qemu tries to read data that was written on the source through the
extended volume, while the vm is running on the destination before the disk
was extended there.

We don't need to reproduce the corruption; it is enough to reproduce the case
where the vm disk is extended on the source host during migration without
extending the disk on the destination host. In this case the vm on the
destination is running with an invalid disk that is likely to cause corruption.

To reproduce, run vdsm < 4.40.70.5, using the flow described in comment 31.

The expected result is multiple disk extensions on the source host and no
disk extension on the destination.

With vdsm 4.40.70.6, we expect that for every extend on the source host,
we refresh the disk on the destination host *before* the disk is refreshed
on the source host.

Please see the example logs in comment 21 for the expected flow.
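The ordering this describes — destination refreshed *before* the source — can be sketched as follows. This is a minimal illustration of the ordering only, not vdsm's actual code; the function and callback names are hypothetical:

```python
def complete_extension(refresh_on_destination, refresh_on_source,
                       set_block_threshold):
    """Finish a volume extension during live migration.

    The destination host must refresh its LV mapping *before* the source,
    so the destination qemu never reads through a device that is smaller
    than what the source qemu already wrote.
    """
    refresh_on_destination()          # dst now sees the extended size
    new_size = refresh_on_source()    # src refresh returns the new size
    set_block_threshold(new_size)     # re-arm drive monitoring on the source
    return new_size
```

If the destination refresh fails, the source must not resume writing into the extended area, which is why the ordering matters more than the individual calls.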

Comment 38 Nir Soffer 2021-07-07 22:04:34 UTC
Sophie, when writing data in the vm, please add timestamps to make it
easier to understand what's going on inside the guest:

for i in $(seq -w 11); do
    echo "$(date '+%H:%M:%S.%N')  writing..."
    dd if=/dev/zero bs=1M count=128 of=data.$i oflag=direct conv=fsync
    echo "$(date '+%H:%M:%S.%N')  waiting..."
    sleep 6
done

And share the output collected during the test.

Comment 40 Dr. David Alan Gilbert 2021-07-08 08:12:40 UTC
> > 2021-07-07 12:13:48,866+0300 ERROR (migsrc/3b0f6e8e) [virt.vm]
> > (vmId='3b0f6e8e-39c3-4ff3-b0af-c2c007e52c5f') internal error: qemu
> > unexpectedly closed the monitor: 2021-07-07T09:13:47.248962Z qemu-kvm:
> > Unknown combination of migration flags: 0x0

That looks to me like the source cancelled or killed the migration.
We get that error on the destination when it reads from the stream and the stream disappears.
The /var/log/libvirt/qemu/vmname.log  on the source might help, but you probably want to look
at the libvirt/rhv logs to see why it did it.

Comment 46 Dr. David Alan Gilbert 2021-07-08 08:48:05 UTC
2021-07-08 11:28:26,216+0300 INFO  (migmon/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') new computed progress 93 < than old value 97, discarded (migration:223)
2021-07-08 11:28:26,217+0300 WARN  (migmon/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Migration stalling: remaining (114MiB) > lowmark (37MiB). (migration:840)
2021-07-08 11:28:26,217+0300 WARN  (migmon/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Aborting migration (migration:889)
2021-07-08 11:28:27,665+0300 ERROR (migsrc/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') operation aborted: migration out job: canceled by client (migration:299)

Yep, rhv cancelled the migration; I think because it wasn't making much progress.  I suspect you're just stressing the guest too hard during the migration for it to complete.

Comment 47 sshmulev 2021-07-08 09:08:27 UTC
I tried it without stressing the guest; the migration succeeded but had some errors on the dst:

2021-07-08 12:01:08,033+0300 ERROR (jsonrpc/0) [api] FINISH refresh_disk error=local variable 'drive' referenced before assignment (api:134)
2021-07-08 12:01:13,804+0300 ERROR (jsonrpc/4) [api] FINISH refresh_disk error=Failed to refresh drive: {'vm_id': '2eb53807-6c20-4976-9a8b-31531b5fdef1', 'drive': <Drive name=sda, type=block, path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9 threshold=unset at 0x7fd488225da0>, 'reason': "VM '2eb53807-6c20-4976-9a8b-31531b5fdef1' was not defined yet or was undefined"} (api:131)
2021-07-08 12:01:19,893+0300 ERROR (jsonrpc/3) [api] FINISH refresh_disk error=Failed to refresh drive: {'vm_id': '2eb53807-6c20-4976-9a8b-31531b5fdef1', 'drive': <Drive name=sda, type=block, path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9 threshold=unset at 0x7fd488225da0>, 'reason': "VM '2eb53807-6c20-4976-9a8b-31531b5fdef1' was not defined yet or was undefined"} (api:131)



Source:
2021-07-08 12:01:08,035+0300 ERROR (mailbox-hsm/3) [storage.TaskManager.Task] (Task='58c7367b-4bc2-476d-8ca7-2185ca3795ed') Unexpected error (task:877)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1561, in after_volume_extension
    self._refresh_destination_volume(volInfo)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1536, in _refresh_destination_volume
    dest_vol_size = self._refresh_migrating_volume(volInfo)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1451, in _refresh_migrating_volume
    return self._migrationSourceThread.refresh_destination_disk(vol_pdiv)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 724, in refresh_destination_disk
    raise exception.CannotRefreshDisk(reason=result["message"])
KeyError: 'message'
2021-07-08 12:01:08,036+0300 DEBUG (mailbox-hsm/3) [storage.TaskManager.Task] (Task='58c7367b-4bc2-476d-8ca7-2185ca3795ed') Task._run: 58c7367b-4bc2-476d-8ca7-2185ca3795ed ({'domainID': 'aa73b4d0-a22d-4f65-b254-594b34f0e6a8', 'imageID': '
10883a68-06ec-460b-a115-d23260482fa9', 'internal': False, 'name': 'sda', 'newSize': 14763950080, 'poolID': '626ec25b-08fb-4777-b8f8-10fed9a09831', 'volumeID': '6f350411-34ce-4b13-8214-17aa0ea829b9', 'clock': <Clock(total=2.47*, extend-vol
ume=2.05, refresh-destination-volume=0.41)>, 'callback': None},) {} failed - stopping task (task:896)


I will attach these logs as well.

Comment 52 Nir Soffer 2021-07-08 13:13:55 UTC
Looks like the fix for this bug breaks migration if refreshing the disk
fails on the destination. This is a severe regression and should be a blocker
for the release.

Comment 53 Vojtech Juranek 2021-07-08 13:49:20 UTC
(In reply to Nir Soffer from comment #52)
> Looks like the fix for this bug breaks migration if refreshing the disk
> fails on the destination. 

I don't think so; in the logs you can see that the migration succeeded and continued with another disk refresh:

2021-07-08 12:01:24,356+0300 INFO  (libvirt/events) [vds] Channel state for vm_id=2eb53807-6c20-4976-9a8b-31531b5fdef1 changed from=disconnected(2) to=disconnected(2) (qemuguestagent:289)
2021-07-08 12:01:24,386+0300 INFO  (libvirt/events) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') CPU running: onResume (vm:6018)
2021-07-08 12:01:24,432+0300 INFO  (libvirt/events) [vds] Channel state for vm_id=2eb53807-6c20-4976-9a8b-31531b5fdef1 changed from=disconnected(2) to=connected(1) (qemuguestagent:289)
2021-07-08 12:01:24,447+0300 INFO  (vm/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Starting connection (guestagent:252)
2021-07-08 12:01:24,451+0300 INFO  (vm/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') CPU running: domain initialization (vm:6018)
2021-07-08 12:01:24,462+0300 INFO  (vm/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') End of migration (vm:4074)
2021-07-08 12:01:25,041+0300 INFO  (periodic/0) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Setting block threshold to 10804527104 bytes for drive 'vda' (/rhev/data-center/mnt/blockSD/cf6decd7-3277-4c27-af72-f8da515793ab/images/3b0482c4-0479-4cee-8ec3-cea4520f2ef1/2ba44e7c-18d4-4826-b7a7-c0dcfae52fb0) apparentsize 10871635968 (drivemonitor:116)
2021-07-08 12:01:25,054+0300 INFO  (periodic/0) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Setting block threshold to 14965276672 bytes for drive 'sda' (/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9) apparentsize 15032385536 (drivemonitor:116)

Also, in comment #47 I can read "the migration succeeded but had some errors in the dst".

Comment 54 Nir Soffer 2021-07-08 13:55:05 UTC
(In reply to Vojtech Juranek from comment #53)
> (In reply to Nir Soffer from comment #52)
> > Looks like the fix for this bug breaks migration if refreshing the disk
> > fails on the destination. 
> 
> I don't think so, in the logs you can see migration succeeded and continue
> with another disk refresh:

So this was a temporary refresh failure, maybe because the refresh was called
too early, before the destination vm was ready?

In that case this is safe; it should perhaps be fixed in master by doing the
refresh only once the destination vm has started.

Comment 55 Nir Soffer 2021-07-08 13:59:33 UTC
Vojta, what about this error?

> 2021-07-08 12:01:08,033+0300 ERROR (jsonrpc/0) [api] FINISH refresh_disk error=local variable 'drive' referenced before assignment (api:134)

This is not handled by https://gerrit.ovirt.org/c/vdsm/+/115608.
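The `KeyError: 'message'` in the earlier traceback comes from reading `result["message"]` when the message actually lives under `result["status"]` (the later traceback shows the corrected `result["status"]["message"]` path). A defensive accessor avoids raising a second, misleading exception; this is a hypothetical helper for illustration, not the actual patch:

```python
def status_message(result, default="unknown error"):
    """Extract the error message from a jsonrpc-style result dict.

    Accepts both the flat {'message': ...} shape and the nested
    {'status': {'code': ..., 'message': ...}} shape, falling back to
    a default instead of raising KeyError.
    """
    if "message" in result:
        return result["message"]
    return result.get("status", {}).get("message", default)
```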

Comment 57 sshmulev 2021-07-08 14:29:18 UTC
I tried to apply the changes (https://gerrit.ovirt.org/c/vdsm/+/115608) on my host and still get an error:

Source:
2021-07-08 17:19:38,859+0300 ERROR (mailbox-hsm/3) [storage.TaskManager.Task] (Task='ee4d847c-1720-4604-9071-84007c0cbf9b') Unexpected error (task:877)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1561, in after_volume_extension
    self._refresh_destination_volume(volInfo)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1536, in _refresh_destination_volume
    dest_vol_size = self._refresh_migrating_volume(volInfo)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1451, in _refresh_migrating_volume
    return self._migrationSourceThread.refresh_destination_disk(vol_pdiv)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 725, in refresh_destination_disk
    reason=result["status"]["message"])
vdsm.common.exception.CannotRefreshDisk: Failed to refresh disk on the destination: {'reason': 'Failed to refresh drive: {\'vm_id\': \'2eb53807-6c20-4976-9a8b-31531b5fdef1\', \'drive\': <Drive name=sda, type=block, path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9 threshold=unset at 0x7f0604788da0>, \'reason\': "VM \'2eb53807-6c20-4976-9a8b-31531b5fdef1\' was not defined yet or was undefined"}'}
2021-07-08 17:19:38,859+0300 DEBUG (mailbox-hsm/3) [storage.TaskManager.Task] (Task='ee4d847c-1720-4604-9071-84007c0cbf9b') Task._run: ee4d847c-1720-4604-9071-84007c0cbf9b ({'domainID': 'aa73b4d0-a22d-4f65-b254-594b34f0e6a8', 'imageID': '10883a68-06ec-460b-a115-d23260482fa9', 'internal': False, 'name': 'sda', 'newSize': 18253611008, 'poolID': '626ec25b-08fb-4777-b8f8-10fed9a09831', 'volumeID': '6f350411-34ce-4b13-8214-17aa0ea829b9', 'clock': <Clock(total=2.67*, extend-volume=2.06, refresh-destination-volume=0.61)>, 'callback': None},) {} failed - stopping task (task:896)


2021-07-08 17:19:38,860+0300 ERROR (mailbox-hsm/3) [storage.ThreadPool.WorkerThread] FINISH task 37e9bbd0-23ec-4c38-8021-c015ff639d35 failed (cmd=<function runTask at 0x7f5f6c9b0950>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f5f1c488a58>>, {'domainID': 'aa73b4d0-a22d-4f65-b254-594b34f0e6a8', 'imageID': '10883a68-06ec-460b-a115-d23260482fa9', 'internal': False, 'name': 'sda', 'newSize': 18253611008, 'poolID': '626ec25b-08fb-4777-b8f8-10fed9a09831', 'volumeID': '6f350411-34ce-4b13-8214-17aa0ea829b9', 'clock': <Clock(total=2.67*, extend-volume=2.06, refresh-destination-volume=0.61)>, 'callback': None})) (threadPool:151)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/threadPool.py", line 147, in _processNextTask
    cmd(args)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mailbox.py", line 88, in runTask
    ctask.prepare(cmd, *args)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1190, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1561, in after_volume_extension
    self._refresh_destination_volume(volInfo)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1536, in _refresh_destination_volume
    dest_vol_size = self._refresh_migrating_volume(volInfo)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1451, in _refresh_migrating_volume
    return self._migrationSourceThread.refresh_destination_disk(vol_pdiv)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 725, in refresh_destination_disk
    reason=result["status"]["message"])
vdsm.common.exception.CannotRefreshDisk: Failed to refresh disk on the destination: {'reason': 'Failed to refresh drive: {\'vm_id\': \'2eb53807-6c20-4976-9a8b-31531b5fdef1\', \'drive\': <Drive name=sda, type=block, path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9 threshold=unset at 0x7f0604788da0>, \'reason\': "VM \'2eb53807-6c20-4976-9a8b-31531b5fdef1\' was not defined yet or was undefined"}'}

2021-07-08 17:19:52,589+0300 ERROR (mailbox-hsm/1) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Failed to set block threshold for drive 'sda' (/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9): Domain not found: no domain with matching uuid '2eb53807-6c20-4976-9a8b-31531b5fdef1' (New_VM1) (drivemonitor:128)
2021-07-08 17:19:52,590+0300 ERROR (mailbox-hsm/1) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') cannot cont while Down (vm:1709)



dst:
2021-07-08 17:19:38,852+0300 ERROR (jsonrpc/1) [api] FINISH refresh_disk error=Failed to refresh drive: {'vm_id': '2eb53807-6c20-4976-9a8b-31531b5fdef1', 'drive': <Drive name=sda, type=block, path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9 threshold=unset at 0x7f0604788da0>, 'reason': "VM '2eb53807-6c20-4976-9a8b-31531b5fdef1' was not defined yet or was undefined"} (api:131)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1415, in refresh_disk
    self._update_drive_volume_size(drive, vol_size)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1607, in _update_drive_volume_size
    self.drive_monitor.set_threshold(drive, volsize.apparentsize)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/drivemonitor.py", line 121, in set_threshold
    self._vm._dom.setBlockThreshold(drive.name, threshold)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50, in __getattr__
    % self.vmid)
vdsm.virt.virdomain.NotConnectedError: VM '2eb53807-6c20-4976-9a8b-31531b5fdef1' was not defined yet or was undefined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 795, in refresh_disk
    return self.vm.refresh_disk(drive)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1425, in refresh_disk
    drive=drive)
vdsm.common.exception.DriveRefreshError: Failed to refresh drive: {'vm_id': '2eb53807-6c20-4976-9a8b-31531b5fdef1', 'drive': <Drive name=sda, type=block, path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9 threshold=unset at 0x7f0604788da0>, 'reason': "VM '2eb53807-6c20-4976-9a8b-31531b5fdef1' was not defined yet or was undefined"}
2021-07-08 17:19:38,853+0300 INFO  (jsonrpc/1) [api.virt] FINISH refresh_disk return={'status': {'code': 1640, 'message': 'Failed to refresh drive: {\'vm_id\': \'2eb53807-6c20-4976-9a8b-31531b5fdef1\', \'drive\': <Drive name=sda, type=block, path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9 threshold=unset at 0x7f0604788da0>, \'reason\': "VM \'2eb53807-6c20-4976-9a8b-31531b5fdef1\' was not defined yet or was undefined"}'}} from=::ffff:10.46.12.145,37660, vmId=2eb53807-6c20-4976-9a8b-31531b5fdef1 (api:54)
2021-07-08 17:19:38,854+0300 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.refresh_disk failed (error 1640) in 0.25 seconds (__init__:312)
2021-07-08 17:19:39,312+0300 DEBUG (periodic/0) [storage.TaskManager.Task] (Task='939e294f-dffb-4d3e-ab07-6efdb70264d2') moving from state finished -> state preparing (task:624)
2021-07-08 17:19:39,312+0300 INFO  (periodic/0) [vdsm.api] START repoStats(domains=()) from=internal, task_id=939e294f-dffb-4d3e-ab07-6efdb70264d2 (api:48)
2021-07-08 17:19:39,312+0300 INFO  (periodic/0) [vdsm.api] FINISH repoStats return={'ad69c1c1-dc48-44f9-912c-2498b80a49b9': {'code': 0, 'lastCheck': '1.9', 'delay': '0.000208611', 'valid': True, 'version': 5, 'acquired': True, 'actual': True}, '594e0a4b-7934-4de9-a532-46c2013cd187': {'code': 0, 'lastCheck': '2.6', 'delay': '0.000555631', 'valid': True, 'version': 0, 'acquired': True, 'actual': True}, 'aa73b4d0-a22d-4f65-b254-594b34f0e6a8': {'code': 0, 'lastCheck': '1.7', 'delay': '0.000730953', 'valid': True, 'version': 5, 'acquired': True, 'actual': True}, '4447651e-a623-4c17-a242-88a5ed8e4991': {'code': 0, 'lastCheck': '2.6', 'delay': '0.000520551', 'valid': True, 'version': 5, 'acquired': True, 'act

Comment 58 Milan Zamazal 2021-07-08 15:06:57 UTC
As for the "VM ... was not defined yet or was undefined" error: if I'm reading the code correctly, we assign self._dom to a libvirt domain only after the libvirt migration is completed. This means the attempt to perform the refresh on it will fail with the given error. There are several ways to fix it:

- Retrieve the domain directly from libvirt in an exception handler: self._connection.lookupByUUIDString(self.id)
- Poll for the domain while waiting for incoming migration completion, assign self._dom to virdomain.Defined once it is available, and add a method for disk refresh to virdomain.Defined.
- Instead of polling, handle a corresponding libvirt event if one is available and do the same.

In any case, the domain may not exist in libvirt yet, and then the source must retry the action after a while.
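The first option above can be sketched roughly as follows. All classes here are fake stand-ins for the vdsm/libvirt objects, used only to illustrate the control flow; this is not the real vdsm API.

```python
# Sketch: fall back to looking the domain up directly in libvirt when
# self._dom is still disconnected. Everything below is an illustrative
# stand-in, not actual vdsm code.

class NotConnectedError(Exception):
    """Mimics vdsm.virt.virdomain.NotConnectedError."""

class DisconnectedDom:
    # Mimics virdomain.Disconnected: any call fails until migration ends.
    def setBlockThreshold(self, name, threshold):
        raise NotConnectedError("VM was not defined yet or was undefined")

class LiveDom:
    def __init__(self):
        self.thresholds = {}
    def setBlockThreshold(self, name, threshold):
        self.thresholds[name] = threshold

class FakeConnection:
    # The domain may already exist in libvirt even though vdsm has not
    # assigned self._dom yet.
    def __init__(self, dom):
        self._dom = dom
    def lookupByUUIDString(self, uuid):
        return self._dom

def set_threshold(vm_dom, connection, vm_id, drive_name, threshold):
    try:
        vm_dom.setBlockThreshold(drive_name, threshold)
        return vm_dom
    except NotConnectedError:
        # Option 1: retrieve the domain directly from libvirt and retry.
        dom = connection.lookupByUUIDString(vm_id)
        dom.setBlockThreshold(drive_name, threshold)
        return dom
```

Note that even this fallback can fail if the domain does not exist in libvirt yet, which is why the source-side retry is still needed.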

Comment 59 Nir Soffer 2021-07-08 16:22:33 UTC
(In reply to sshmulev from comment #57)

Let's not use private messages without a reason; this creates a mess for the community
when trying to understand bugs.

> Tried to update the changes (https://gerrit.ovirt.org/c/vdsm/+/115608) in my
> host and still get error:
> 
> Source:
> 2021-07-08 17:19:38,859+0300 ERROR (mailbox-hsm/3)
> [storage.TaskManager.Task] (Task='ee4d847c-1720-4604-9071-84007c0cbf9b')
> Unexpected error (task:877)
...
> vdsm.common.exception.CannotRefreshDisk: Failed to refresh disk on the
> destination: {'reason': 'Failed to refresh drive: {\'vm_id\':
> \'2eb53807-6c20-4976-9a8b-31531b5fdef1\', \'drive\': <Drive name=sda,
> type=block,
> path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/
> images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-
> 17aa0ea829b9 threshold=unset at 0x7f0604788da0>, \'reason\': "VM
> \'2eb53807-6c20-4976-9a8b-31531b5fdef1\' was not defined yet or was
> undefined"}'}

This means the disk refresh failed on the destination...


> 2021-07-08 17:19:52,589+0300 ERROR (mailbox-hsm/1) [virt.vm]
> (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Failed to set block threshold
> for drive 'sda'
> (/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/
> 10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9):
> Domain not found: no domain with matching uuid
> '2eb53807-6c20-4976-9a8b-31531b5fdef1' (New_VM1) (drivemonitor:128)
> 2021-07-08 17:19:52,590+0300 ERROR (mailbox-hsm/1) [virt.vm]
> (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') cannot cont while Down
> (vm:1709)

This means the extend finished after the migration completed. This is an
expected condition and should not be logged as an error. It is not a new
issue and is not related to this change. Please file a new bug to silence
this bogus error.

> dst:
> 2021-07-08 17:19:38,852+0300 ERROR (jsonrpc/1) [api] FINISH refresh_disk
> error=Failed to refresh drive: {'vm_id':
> '2eb53807-6c20-4976-9a8b-31531b5fdef1', 'drive': <Drive name=sda,
> type=block,
> path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/
> images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-
> 17aa0ea829b9 threshold=unset at 0x7f0604788da0>, 'reason': "VM
> '2eb53807-6c20-4976-9a8b-31531b5fdef1' was not defined yet or was
> undefined"} (api:131)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1415, in
> refresh_disk
>     self._update_drive_volume_size(drive, vol_size)
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1607, in
> _update_drive_volume_size
>     self.drive_monitor.set_threshold(drive, volsize.apparentsize)
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/drivemonitor.py", line
> 121, in set_threshold
>     self._vm._dom.setBlockThreshold(drive.name, threshold)
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50,
> in __getattr__
>     % self.vmid)
> vdsm.virt.virdomain.NotConnectedError: VM
> '2eb53807-6c20-4976-9a8b-31531b5fdef1' was not defined yet or was undefined
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in
> method
>     ret = func(*args, **kwargs)
>   File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 795, in
> refresh_disk
>     return self.vm.refresh_disk(drive)
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1425, in
> refresh_disk
>     drive=drive)
> vdsm.common.exception.DriveRefreshError: Failed to refresh drive: {'vm_id':
> '2eb53807-6c20-4976-9a8b-31531b5fdef1', 'drive': <Drive name=sda,
> type=block,
> path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/
> images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-
> 17aa0ea829b9 threshold=unset at 0x7f0604788da0>, 'reason': "VM
> '2eb53807-6c20-4976-9a8b-31531b5fdef1' was not defined yet or was undefined"}
> 2021-07-08 17:19:38,853+0300 INFO  (jsonrpc/1) [api.virt] FINISH
> refresh_disk return={'status': {'code': 1640, 'message': 'Failed to refresh
> drive: {\'vm_id\': \'2eb53807-6c20-4976-9a8b-31531b5fdef1\', \'drive\':
> <Drive name=sda, type=block,
> path=/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/
> images/10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-
> 17aa0ea829b9 threshold=unset at 0x7f0604788da0>, \'reason\': "VM
> \'2eb53807-6c20-4976-9a8b-31531b5fdef1\' was not defined yet or was
> undefined"}'}} from=::ffff:10.46.12.145,37660,
> vmId=2eb53807-6c20-4976-9a8b-31531b5fdef1 (api:54)

Based on comment 58, we cannot set the block threshold until the migration completes.

I think the best way to handle it now is to not set the block threshold on the
destination host, since we will not get a block threshold event until the migration
is completed anyway.

The missing part is to set the block threshold on all disks once the migration
completes on the destination.
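The flow described above can be sketched as follows, again with fake stand-ins (FakeStorage, FakeDriveMonitor and the function names are illustrative, not the real vdsm API): during an incoming migration, refresh the volume but skip setting the block threshold; once the migration completes, set thresholds for all drives.

```python
# Sketch of: refresh without threshold during migration, then set
# thresholds for all disks at end of migration. Illustrative only.

class FakeStorage:
    def __init__(self, sizes):
        self.sizes = sizes        # drive name -> apparent size
        self.refreshed = []       # records refresh calls
    def refresh_volume(self, drive):
        self.refreshed.append(drive)
    def volume_size(self, drive):
        return self.sizes[drive]

class FakeDriveMonitor:
    def __init__(self):
        self.thresholds = {}
    def set_threshold(self, drive, apparentsize):
        # Real code subtracts a free-chunk margin; simplified here.
        self.thresholds[drive] = apparentsize

def refresh_disk(storage, monitor, drive, set_threshold=True):
    storage.refresh_volume(drive)
    size = storage.volume_size(drive)
    if set_threshold:
        monitor.set_threshold(drive, size)
    return size

def on_migration_end(storage, monitor, drives):
    # The missing part: set thresholds on every disk once migration is done.
    for drive in drives:
        refresh_disk(storage, monitor, drive, set_threshold=True)
```

During the migration the destination would call refresh_disk(..., set_threshold=False); on_migration_end then covers all drives, matching the "End of migration" refresh sequence in the logs below.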

Comment 60 Vojtech Juranek 2021-07-09 11:58:43 UTC
> I think the best way to handle it now is to not set block threshold on the 
> destination host since we will not get block threshold event until migration
> is completed anyway.
> 
> The missing part is to set block threshold on all disks once migration
> completes
> on the destination.

Ack. With this approach there are no errors in the log (at least it never happened to me), except the ones on the source where the VM is already shut down and thus setting the threshold fails (as you mentioned, this can be tracked as a separate bug, since the issue was already there).

Example log:

SRC:
===
2021-07-09 07:30:45,503-0400 INFO  (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Volume sdc (domainID: cdac2a0c-b110-456d-a988-7d588626c871, volumeID: c16033d3-9608-4103-a3d4-ab4490f76b1f) was extended during migration, refreshing it on destination host. (vm:1459)
2021-07-09 07:30:46,562-0400 INFO  (mailbox-hsm/3) [vdsm.api] START refreshVolume(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f') from=internal, task_id=7b021415-3feb-4720-9e47-c343246ab318 (api:48)
2021-07-09 07:30:46,620-0400 INFO  (mailbox-hsm/3) [storage.LVM] Refreshing LVs (vg=cdac2a0c-b110-456d-a988-7d588626c871, lvs=('c16033d3-9608-4103-a3d4-ab4490f76b1f',)) (lvm:1782)
2021-07-09 07:30:46,693-0400 INFO  (mailbox-hsm/3) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=7b021415-3feb-4720-9e47-c343246ab318 (api:54)
2021-07-09 07:30:46,693-0400 INFO  (mailbox-hsm/3) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=internal, task_id=ad0cae56-9b69-43ef-abe4-62c88987c01f (api:48)
2021-07-09 07:30:46,695-0400 INFO  (mailbox-hsm/3) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '8053063680', 'truesize': '8053063680'} from=internal, task_id=ad0cae56-9b69-43ef-abe4-62c88987c01f (api:54)
2021-07-09 07:30:46,695-0400 INFO  (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Extend volume c16033d3-9608-4103-a3d4-ab4490f76b1f completed <Clock(extend-volume=2.07, total=3.26, refresh-destination-volume=1.06, refresh-volume=0.13)> (vm:1599)
2021-07-09 07:30:46,695-0400 INFO  (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Setting block threshold to 8011120640 bytes for drive 'sdc' () apparentsize 8053063680 (drivemonitor:116)
2021-07-09 07:30:46,955-0400 INFO  (jsonrpc/0) [vdsm.api] START getSpmStatus(spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', options=None) from=::ffff:192.168.122.148,57736, task_id=db5240b8-e46b-419c-b4a3-648fc248272c (api:48)
2021-07-09 07:30:46,970-0400 INFO  (jsonrpc/0) [vdsm.api] FINISH getSpmStatus return={'spm_st': {'spmStatus': 'SPM', 'spmLver': 86, 'spmId': 4}} from=::ffff:192.168.122.148,57736, task_id=db5240b8-e46b-419c-b4a3-648fc248272c (api:54)
2021-07-09 07:30:46,985-0400 INFO  (jsonrpc/1) [vdsm.api] START getStoragePoolInfo(spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', options=None) from=::ffff:192.168.122.148,57742, task_id=55a78d0c-cd59-405a-a0d0-5d1962e8ebed (api:48)
2021-07-09 07:30:46,992-0400 INFO  (jsonrpc/1) [vdsm.api] FINISH getStoragePoolInfo return={'info': {'domains': 'cdac2a0c-b110-456d-a988-7d588626c871:Active', 'isoprefix': '', 'lver': 86, 'master_uuid': 'cdac2a0c-b110-456d-a988-7d588626c871', 'master_ver': 169, 'name': 'No Description', 'pool_status': 'connected', 'spm_id': 4, 'type': 'ISCSI', 'version': '5'}, 'dominfo': {'cdac2a0c-b110-456d-a988-7d588626c871': {'status': 'Active', 'alerts': [], 'isoprefix': '', 'version': 5, 'disktotal': '31809601536', 'diskfree': '14227079168'}}} from=::ffff:192.168.122.148,57742, task_id=55a78d0c-cd59-405a-a0d0-5d1962e8ebed (api:54)
2021-07-09 07:30:47,524-0400 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/f034d5d0-ce1e-11eb-a870-525400ddf532/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=4'] (mailbox:382)
2021-07-09 07:30:47,580-0400 INFO  (libvirt/events) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') CPU stopped: onSuspend (vm:6039)
2021-07-09 07:30:48,931-0400 ERROR (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Failed to set block threshold for drive 'sdc' (): Requested operation is not valid: domain is not running (drivemonitor:128)
2021-07-09 07:30:48,931-0400 ERROR (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') cannot cont while Migration Source (vm:1734)


DEST:
====
disk refresh happens, but no threshold is set (and no error):

2021-07-09 07:30:46,471-0400 INFO  (jsonrpc/3) [api.virt] START refresh_disk(drive={'device': 'disk', 'poolID': 'f034d5d0-ce1e-11eb-a870-525400ddf532', 'domainID': 'cdac2a0c-b110-456d-a988-7d588626c871', 'imageID': 'bff096ea-3bf3-4d6e-adcc-704d95e9988e', 'volumeID': 'c16033d3-9608-4103-a3d4-ab4490f76b1f'}, set_threshold=False) from=::ffff:192.168.122.240,40286, vmId=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 (api:48)
2021-07-09 07:30:46,471-0400 INFO  (jsonrpc/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Refreshing volume for drive sdd (domainID: cdac2a0c-b110-456d-a988-7d588626c871, volumeID: c16033d3-9608-4103-a3d4-ab4490f76b1f) (vm:1408)
2021-07-09 07:30:46,471-0400 INFO  (jsonrpc/3) [vdsm.api] START refreshVolume(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f') from=::ffff:192.168.122.240,40286, task_id=100beb3e-80e7-4cd7-91db-2eda0c62083b (api:48)
2021-07-09 07:30:46,471-0400 INFO  (jsonrpc/3) [storage.LVM] Refreshing LVs (vg=cdac2a0c-b110-456d-a988-7d588626c871, lvs=('c16033d3-9608-4103-a3d4-ab4490f76b1f',)) (lvm:1782)
2021-07-09 07:30:46,557-0400 INFO  (jsonrpc/3) [vdsm.api] FINISH refreshVolume return=None from=::ffff:192.168.122.240,40286, task_id=100beb3e-80e7-4cd7-91db-2eda0c62083b (api:54)
2021-07-09 07:30:46,557-0400 INFO  (jsonrpc/3) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=::ffff:192.168.122.240,40286, task_id=a24e4f13-d490-4329-afd7-534ff39ccde3 (api:48)
2021-07-09 07:30:46,558-0400 INFO  (jsonrpc/3) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '8053063680', 'truesize': '8053063680'} from=::ffff:192.168.122.240,40286, task_id=a24e4f13-d490-4329-afd7-534ff39ccde3 (api:54)
2021-07-09 07:30:46,558-0400 INFO  (jsonrpc/3) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=::ffff:192.168.122.240,40286, task_id=2aef42ee-ef75-4c06-aaab-1e10f59613b0 (api:48)
2021-07-09 07:30:46,558-0400 INFO  (jsonrpc/3) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '8053063680', 'truesize': '8053063680'} from=::ffff:192.168.122.240,40286, task_id=2aef42ee-ef75-4c06-aaab-1e10f59613b0 (api:54)
2021-07-09 07:30:46,558-0400 INFO  (jsonrpc/3) [api.virt] FINISH refresh_disk return={'result': {'apparentsize': '8053063680', 'truesize': '8053063680'}, 'status': {'code': 0, 'message': 'Done'}} from=::ffff:192.168.122.240,40286, vmId=2dad9038-3e3a-4b5e-8d20-b0da37d9ef79 (api:54)

just before completing the migration, all disks are refreshed and thresholds are set:

2021-07-09 07:30:47,654-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Starting connection (guestagent:252)
2021-07-09 07:30:47,658-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') CPU running: domain initialization (vm:6039)
2021-07-09 07:30:47,665-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Refreshing volume for drive sda (domainID: cdac2a0c-b110-456d-a988-7d588626c871, volumeID: 0cafdf0f-8921-4e7f-8ba9-eeec86990740) (vm:1408)
2021-07-09 07:30:47,665-0400 INFO  (vm/2dad9038) [vdsm.api] START refreshVolume(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='26dc7573-d7ad-422b-952f-f1ce3517277a', volUUID='0cafdf0f-8921-4e7f-8ba9-eeec86990740') from=internal, task_id=dbfdbaab-4f88-4ca9-950d-b73539067a20 (api:48)
2021-07-09 07:30:47,666-0400 INFO  (vm/2dad9038) [storage.LVM] Refreshing LVs (vg=cdac2a0c-b110-456d-a988-7d588626c871, lvs=('0cafdf0f-8921-4e7f-8ba9-eeec86990740',)) (lvm:1782)
2021-07-09 07:30:47,760-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=dbfdbaab-4f88-4ca9-950d-b73539067a20 (api:54)
2021-07-09 07:30:47,761-0400 INFO  (vm/2dad9038) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='26dc7573-d7ad-422b-952f-f1ce3517277a', volUUID='0cafdf0f-8921-4e7f-8ba9-eeec86990740', options=None) from=internal, task_id=bb679748-8748-4f0d-912b-c88748eb2d38 (api:48)
2021-07-09 07:30:47,766-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '134217728', 'truesize': '134217728'} from=internal, task_id=bb679748-8748-4f0d-912b-c88748eb2d38 (api:54)
2021-07-09 07:30:47,766-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Setting block threshold to 92274688 bytes for drive 'sda' (/rhev/data-center/mnt/blockSD/cdac2a0c-b110-456d-a988-7d588626c871/images/26dc7573-d7ad-422b-952f-f1ce3517277a/0cafdf0f-8921-4e7f-8ba9-eeec86990740) apparentsize 134217728 (drivemonitor:116)
2021-07-09 07:30:47,771-0400 INFO  (vm/2dad9038) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='26dc7573-d7ad-422b-952f-f1ce3517277a', volUUID='0cafdf0f-8921-4e7f-8ba9-eeec86990740', options=None) from=internal, task_id=8a73b730-1b19-4676-af94-5cc6d19c551b (api:48)
2021-07-09 07:30:47,773-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '134217728', 'truesize': '134217728'} from=internal, task_id=8a73b730-1b19-4676-af94-5cc6d19c551b (api:54)
2021-07-09 07:30:47,773-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Refreshing volume for drive sdb (domainID: cdac2a0c-b110-456d-a988-7d588626c871, volumeID: 16a8a0ff-c612-4855-b15a-668efaa74abc) (vm:1408)
2021-07-09 07:30:47,773-0400 INFO  (vm/2dad9038) [vdsm.api] START refreshVolume(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='472e1fa3-8ed6-4a3b-abb0-fe166bf0349f', volUUID='16a8a0ff-c612-4855-b15a-668efaa74abc') from=internal, task_id=d37341a4-15fb-4cf7-9d87-1c7c338ad4a6 (api:48)
2021-07-09 07:30:47,774-0400 INFO  (vm/2dad9038) [storage.LVM] Refreshing LVs (vg=cdac2a0c-b110-456d-a988-7d588626c871, lvs=('16a8a0ff-c612-4855-b15a-668efaa74abc',)) (lvm:1782)
2021-07-09 07:30:47,883-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=d37341a4-15fb-4cf7-9d87-1c7c338ad4a6 (api:54)
2021-07-09 07:30:47,884-0400 INFO  (vm/2dad9038) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='472e1fa3-8ed6-4a3b-abb0-fe166bf0349f', volUUID='16a8a0ff-c612-4855-b15a-668efaa74abc', options=None) from=internal, task_id=794c3def-de3b-4b2c-a747-dadd8d33520d (api:48)
2021-07-09 07:30:47,887-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '4294967296', 'truesize': '4294967296'} from=internal, task_id=794c3def-de3b-4b2c-a747-dadd8d33520d (api:54)
2021-07-09 07:30:47,887-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Setting block threshold to 4253024256 bytes for drive 'sdb' (/rhev/data-center/mnt/blockSD/cdac2a0c-b110-456d-a988-7d588626c871/images/472e1fa3-8ed6-4a3b-abb0-fe166bf0349f/16a8a0ff-c612-4855-b15a-668efaa74abc) apparentsize 4294967296 (drivemonitor:116)
2021-07-09 07:30:47,894-0400 INFO  (vm/2dad9038) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='472e1fa3-8ed6-4a3b-abb0-fe166bf0349f', volUUID='16a8a0ff-c612-4855-b15a-668efaa74abc', options=None) from=internal, task_id=4d6848c2-dffd-46c0-9e12-42f7eef85f30 (api:48)
2021-07-09 07:30:47,895-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '4294967296', 'truesize': '4294967296'} from=internal, task_id=4d6848c2-dffd-46c0-9e12-42f7eef85f30 (api:54)
2021-07-09 07:30:47,895-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Refreshing volume for drive sdd (domainID: cdac2a0c-b110-456d-a988-7d588626c871, volumeID: c16033d3-9608-4103-a3d4-ab4490f76b1f) (vm:1408)
2021-07-09 07:30:47,896-0400 INFO  (vm/2dad9038) [vdsm.api] START refreshVolume(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f') from=internal, task_id=33a50061-ffb6-4cc9-bf73-9960ee146104 (api:48)
2021-07-09 07:30:47,978-0400 INFO  (vm/2dad9038) [storage.LVM] Refreshing LVs (vg=cdac2a0c-b110-456d-a988-7d588626c871, lvs=('c16033d3-9608-4103-a3d4-ab4490f76b1f',)) (lvm:1782)
2021-07-09 07:30:48,052-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=33a50061-ffb6-4cc9-bf73-9960ee146104 (api:54)
2021-07-09 07:30:48,052-0400 INFO  (vm/2dad9038) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=internal, task_id=21fb7cc1-1ac1-4273-81d8-2ff75915f21f (api:48)
2021-07-09 07:30:48,054-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '8053063680', 'truesize': '8053063680'} from=internal, task_id=21fb7cc1-1ac1-4273-81d8-2ff75915f21f (api:54)
2021-07-09 07:30:48,054-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Setting block threshold to 8011120640 bytes for drive 'sdd' (/rhev/data-center/mnt/blockSD/cdac2a0c-b110-456d-a988-7d588626c871/images/bff096ea-3bf3-4d6e-adcc-704d95e9988e/c16033d3-9608-4103-a3d4-ab4490f76b1f) apparentsize 8053063680 (drivemonitor:116)
2021-07-09 07:30:48,057-0400 INFO  (vm/2dad9038) [vdsm.api] START getVolumeSize(sdUUID='cdac2a0c-b110-456d-a988-7d588626c871', spUUID='f034d5d0-ce1e-11eb-a870-525400ddf532', imgUUID='bff096ea-3bf3-4d6e-adcc-704d95e9988e', volUUID='c16033d3-9608-4103-a3d4-ab4490f76b1f', options=None) from=internal, task_id=5650f968-9783-4cb3-9251-57cd5bd94e59 (api:48)
2021-07-09 07:30:48,059-0400 INFO  (vm/2dad9038) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '8053063680', 'truesize': '8053063680'} from=internal, task_id=5650f968-9783-4cb3-9251-57cd5bd94e59 (api:54)
2021-07-09 07:30:48,059-0400 INFO  (vm/2dad9038) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') End of migration (vm:4102)

Comment 61 sshmulev 2021-07-11 08:52:12 UTC
(In reply to Nir Soffer from comment #59)
> 2021-07-08 17:19:52,589+0300 ERROR (mailbox-hsm/1) [virt.vm]
> (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Failed to set block threshold
> for drive 'sda'
> (/rhev/data-center/mnt/blockSD/aa73b4d0-a22d-4f65-b254-594b34f0e6a8/images/
> 10883a68-06ec-460b-a115-d23260482fa9/6f350411-34ce-4b13-8214-17aa0ea829b9):
> Domain not found: no domain with matching uuid
> '2eb53807-6c20-4976-9a8b-31531b5fdef1' (New_VM1) (drivemonitor:128)
> 2021-07-08 17:19:52,590+0300 ERROR (mailbox-hsm/1) [virt.vm]
> (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') cannot cont while Down
> (vm:1709)

Thanks Nir, I filed a new issue regarding this (https://bugzilla.redhat.com/show_bug.cgi?id=1981079).

Comment 62 Avihai 2021-07-11 10:19:35 UTC
(In reply to Vojtech Juranek from comment #21)
> For the reference, here are the log from my test run for case when
> destination host supports disk refresh and when it doesn't support disk
> refresh (in such case drive monitor is disabled temporarily during the
> migration to prevent other extend attempts):


Sophie, can you please check if this new behavior is seen in our reproduction efforts?
See Vojtech comment #21 for full details.

Comment 63 sshmulev 2021-07-11 13:29:39 UTC
The bug has been reproduced on an older version with the following steps:

version:
vdsm-4.40.70.4-1.el8ev.x86_64
QEMU emulator version 5.2.0 (qemu-kvm-5.2.0-16.module+el8.4.0+11358+3b8f35f7.1)
ovirt-engine-4.4.7.4-0.9.el8ev

How reproducible:
Need to make multiple extensions during the migration to hit the issue. 

Steps to Reproduce:
1. Create a VM with a thin disk - better to make the disk big enough (to allow many tries until we hit an extension during migration).
2. On all hosts the VM is migrated to/from, and on the SPM host, create /etc/vdsm/vdsm.conf.d/volume-utilization.conf with these values, then restart vdsm:
    [irs]
    volume_utilization_percent = 50
    volume_utilization_chunk_mb = 128
    
    
3. Make the VM migration slow, so we can do multiple extends during
   the migration.
   This can be done using the stress tool, for example:
   dnf install stress
   stress --vm 1 --vm-bytes 128M

   This step is optional if the migration is already slow enough (I could reproduce the issue without it).

4.  Write data in the VM to trigger extension.
	for i in $(seq -w 11); do
	    echo "writing..."
	    dd if=/dev/zero bs=1M count=64 of=data.$i oflag=direct conv=fsync
	    echo "Waiting..."
	    sleep 6
	done
5. Start the script in the VM and, right after, start the migration.

Actual results:
The VM pauses and can't be started after that.

from  /var/log/libvirt/qemu/:
qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with refcount block); further corruption events will be suppressed
2021-07-08T05:35:34.828163Z qemu-kvm: terminating on signal 15 from pid 25442 (/usr/sbin/libvirtd)
2021-07-08 05:35:35.029+0000: shutting down, reason=destroyed


The same steps to reproduce have been performed on the latest version:
Versions:
vdsm-4.40.70.6-1.el8ev.x86_64
QEMU emulator version 5.2.0 (qemu-kvm-5.2.0-16.module+el8.4.0+11536+725e25d9.2)
ovirt-engine-4.4.7.6-0.11.el8ev


The corruption does not reproduce (tried 5 times), and the migration eventually succeeds.

Actual results:

2021-07-08 12:01:08,159+0300 INFO  (migsrc/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') starting migration

during migration the extension starts:
2021-07-08 12:01:11,574+0300 INFO  (periodic/3) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Requesting extension for volume 6f350411-34ce-4b13-8214-17aa0ea829b9 on domain aa73b4d0-a22d-4f65-b254-594b34f0e6a8 (apparent: 14629732352, capacity: 32212254720, allocated: 14760804352, physical: 14763950080 threshold_state: exceeded) (vm:1348)
2021-07-08 12:01:11,574+0300 DEBUG (periodic/3) [storage.TaskManager.Task] (Task='431e72e6-3061-4e99-9dbc-5582960b43d0') moving from state finished -> state preparing (task:624)
2021-07-08 12:01:11,575+0300 INFO  (periodic/3) [vdsm.api] START sendExtendMsg(spUUID='626ec25b-08fb-4777-b8f8-10fed9a09831', volDict={'domainID': 'aa73b4d0-a22d-4f65-b254-594b34f0e6a8', 'imageID': '10883a68-06ec-460b-a115-d23260482fa9', 'internal': False, 'name': 'sda', 'newSize': 14898167808, 'poolID': '626ec25b-08fb-4777-b8f8-10fed9a09831', 'volumeID': '6f350411-34ce-4b13-8214-17aa0ea829b9', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': None}, newSize=14898167808, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f05003b59e8>>) from=internal, task_id=431e72e6-3061-4e99-9dbc-5582960b43d0 (api:48)
2021-07-08 12:01:11,575+0300 DEBUG (periodic/3) [storage.SPM.Messages.Extend] new extend msg created: domain: aa73b4d0-a22d-4f65-b254-594b34f0e6a8, volume: 6f350411-34ce-4b13-8214-17aa0ea829b9 (mailbox:130)
2021-07-08 12:01:11,575+0300 INFO  (periodic/3) [vdsm.api] FINISH sendExtendMsg return=None from=internal, task_id=431e72e6-3061-4e99-9dbc-5582960b43d0 (api:54)
2021-07-08 12:01:11,575+0300 DEBUG (periodic/3) [storage.TaskManager.Task] (Task='431e72e6-3061-4e99-9dbc-5582960b43d0') finished: None (task:1202)
2021-07-08 12:01:11,575+0300 DEBUG (periodic/3) [storage.TaskManager.Task] (Task='431e72e6-3061-4e99-9dbc-5582960b43d0') moving from state finished -> state finished (task:624)

2021-07-08 12:01:25,631+0300 INFO  (mailbox-hsm/3) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Volume sda (domainID: aa73b4d0-a22d-4f65-b254-594b34f0e6a8, volumeID: 6f350411-34ce-4b13-8214-17aa0ea829b9) was extended during migration, refreshing it on destination host. (vm:1441)


2021-07-08 12:01:26,115+0300 INFO  (migsrc/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') migration took 19 seconds to complete (migration:548)
2021-07-08 12:01:26,115+0300 INFO  (migsrc/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Changed state to Down: Migration succeeded (code=4) (vm:1895)
2021-07-08 12:01:26,120+0300 INFO  (migsrc/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Stopping connection (guestagent:438)
2021-07-08 12:01:26,121+0300 INFO  (migsrc/2eb53807) [virt.vm] (vmId='2eb53807-6c20-4976-9a8b-31531b5fdef1') Enabling drive monitoring (drivemonitor:59)


The last two attached logs show that the migration succeeded during extension (source-migration-succeeded, dst-migration-succeeded).

Comment 64 Nir Soffer 2021-07-11 16:36:04 UTC
So far we have tested only the positive flow, migrating from a new host to a new host.
We need to test the flow where the destination host is old (e.g. 4.4.6).

In this flow vdsm should detect that the host does not support the
VM.refresh_disk API and disable drive monitoring during the migration.

We expect to see this warning:

    Migration destination host does not support
    extending disk during migration, disabling disk
    extension during migration

If the VM writes too much data during the migration, the VM will pause with an ENOSPC
error. It should be resumed on the destination when the migration finishes.
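The capability check described above can be sketched like this; the capability name, FakeMonitor and the function are illustrative stand-ins, not the real vdsm code.

```python
# Sketch: if the destination does not report the refresh-disk capability,
# disable drive monitoring for the duration of the migration instead of
# attempting extensions that cannot be refreshed on the other side.

class FakeMonitor:
    def __init__(self):
        self.enabled = True
    def disable(self):
        self.enabled = False
    def enable(self):
        self.enabled = True

def start_migration(dst_caps, monitor, log):
    if "refresh_disk" not in dst_caps:
        log.append(
            "Migration destination host does not support extending disk "
            "during migration, disabling disk extension during migration")
        monitor.disable()
```

With monitoring disabled, a VM that outgrows its disk pauses with ENOSPC rather than risking qcow2 corruption, and is resumed once the migration completes.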

Comment 65 Nir Soffer 2021-07-11 16:52:49 UTC
(In reply to sshmulev from comment #63)
Awesome, thanks!

Comment 66 sshmulev 2021-07-11 19:41:30 UTC
So I tried to test the same flow as in comment 63, but this time with the destination host on the old version (vdsm-4.40.60.7).
I tried several variations (around 9 times), changing the writing rate and timing to make the migration work, but the migration failed with the same errors.


Source:
2021-07-11 19:54:10,785+0300 WARN  (mailbox-hsm/1) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Migration destination host does not support extending disk during migration, disabling disk extension during migration (vm:1586)
2021-07-11 19:54:10,785+0300 INFO  (mailbox-hsm/1) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Disabling drive monitoring (drivemonitor:66)
2021-07-11 19:54:10,785+0300 DEBUG (mailbox-hsm/1) [storage.TaskManager.Task] (Task='ff55e231-2d53-4a49-9abc-e26d98f4f42a') finished: None (task:1202)
2021-07-11 19:54:10,785+0300 DEBUG (mailbox-hsm/1) [storage.TaskManager.Task] (Task='ff55e231-2d53-4a49-9abc-e26d98f4f42a') moving from state finished -> state finished (task:624)


But then, during the writing, this happened (probably because of the fast writing during migration):
2021-07-11 19:54:15,633+0300 INFO  (libvirt/events) [vdsm.api] FINISH sendExtendMsg return=None from=internal, task_id=be588fb1-4dc1-4256-8cb9-d66860941e21 (api:54)
2021-07-11 19:54:15,633+0300 DEBUG (libvirt/events) [storage.TaskManager.Task] (Task='be588fb1-4dc1-4256-8cb9-d66860941e21') finished: None (task:1202)
2021-07-11 19:54:15,634+0300 DEBUG (libvirt/events) [storage.TaskManager.Task] (Task='be588fb1-4dc1-4256-8cb9-d66860941e21') moving from state finished -> state finished (task:624)
2021-07-11 19:54:15,634+0300 DEBUG (libvirt/events) [storage.ResourceManager.Owner] Owner.releaseAll resources {} (resourceManager:738)
2021-07-11 19:54:15,634+0300 DEBUG (libvirt/events) [storage.TaskManager.Task] (Task='be588fb1-4dc1-4256-8cb9-d66860941e21') ref 0 aborting False (task:1000)
2021-07-11 19:54:15,636+0300 INFO  (libvirt/events) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') CPU stopped: onSuspend (vm:6018)
2021-07-11 19:54:15,638+0300 WARN  (libvirt/events) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') device sda out of space (vm:3990)
2021-07-11 19:54:16,136+0300 ERROR (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Failed to migrate (migration:472)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 446, in _regular_run
    time.time(), machineParams
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 546, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 635, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 564, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
    return getattr(self._vm._dom, name)(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2119, in migrateToURI3
    raise libvirtError('virDomainMigrateToURI3() failed')
libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2021-07-11T16:54:14.182752Z qemu-kvm: Unknown combination of migration flags: 0x0
2021-07-11T16:54:14.182905Z qemu-kvm: error while loading state section id 1(ram)
2021-07-11T16:54:14.183258Z qemu-kvm: load of migration failed: Invalid argument
2021-07-11 19:54:16,136+0300 INFO  (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Enabling drive monitoring (drivemonitor:59)



destination:
2021-07-11 19:34:48,289+0300 INFO  (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') starting migration to qemu+tls:<fqdn>/system with miguri tcp://<IP> (migration:534)
2021-07-11 19:34:48,427+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 dd if=/rhev/data-center/45e49155-e88d-45c6-a468-b85d552606d6/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:211)
2021-07-11 19:34:48,449+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = b'1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB, 1000 KiB) copied, 0.00481891 s, 212 MB/s\n'; <rc> = 0 (commands:224)
2021-07-11 19:42:17,686+0300 INFO  (vm/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') End of migration (vm:3989)
2021-07-11 19:42:17,822+0300 INFO  (periodic/1) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') setting block threshold to 10804527104 bytes for drive 'vda' (apparentsize 10871635968) (drivemonitor:116)
2021-07-11 19:42:17,832+0300 INFO  (periodic/1) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') setting block threshold to 1006632960 bytes for drive 'sda' (apparentsize 1073741824) (drivemonitor:116)
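The two threshold values above are consistent with the threshold being set 64 MiB below the drive's apparent size. A minimal sketch of that arithmetic; note the 64 MiB watermark is an inference from these two log lines, not a documented vdsm constant:

```python
# Sketch: block threshold as it appears in the destination log lines above.
# Assumption: threshold = apparentsize - watermark, with a 64 MiB watermark
# inferred from these two samples (not a documented vdsm constant).
WATERMARK = 64 * 1024**2  # 64 MiB

def block_threshold(apparentsize, watermark=WATERMARK):
    """Return the write threshold for a drive of the given apparent size."""
    return apparentsize - watermark

# Values taken from the log above:
print(block_threshold(10871635968))  # vda -> 10804527104
print(block_threshold(1073741824))   # sda -> 1006632960
```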

Comment 67 Vojtech Juranek 2021-07-12 07:37:13 UTC
> 2021-07-11 19:54:15,636+0300 INFO  (libvirt/events) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') CPU stopped: onSuspend (vm:6018)
> 2021-07-11 19:54:15,638+0300 WARN  (libvirt/events) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') device sda out of space (vm:3990)
> 2021-07-11 19:54:16,136+0300 ERROR (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Failed to migrate (migration:472)

In this flow disk extension is not supported and we cannot extend the disk, otherwise disk corruption can occur, which we want to avoid in any case. The expected flow is that either there is enough space so the migration can finish without any issue, or, if the disk becomes full, the VM is suspended, the migration finishes, and the VM is resumed afterwards. In the log above the VM is suspended, but the migration fails immediately after that. Milan, is this expected (i.e. our expectation that the VM would be migrated even while suspended was wrong), or is it a bug somewhere in the virt part?

Comment 68 Milan Zamazal 2021-07-12 08:27:26 UTC
The migration fails in the virDomainMigrateToURI3 libvirt call, i.e. the migration was interrupted. The reported "migration flags" error indicates, as explained by David in Comment 40, that the migration was canceled. I don't think Vdsm cancels migration in such a situation (I think a running migration is only canceled when it cannot converge or when requested from Engine). I'm not sure what libvirt/QEMU is supposed to do when a VM gets paused due to a full disk while migrating, but I'd say it should continue with the migration; paused VMs can be migrated, and a full disk doesn't look like something that should prevent it.

Sophie, is there any explanation in libvirt or QEMU logs?

Comment 72 Dr. David Alan Gilbert 2021-07-12 09:37:16 UTC
Not much to see:
2021-07-11 17:59:00.361+0000: 94615: info : qemuMonitorIOWrite:437 : QEMU_MONITOR_IO_WRITE: mon=0x7fbb88059730 buf={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-1910"}
 len=114 ret=114 errno=0
....
2021-07-11 17:59:01.276+0000: 94615: info : virObjectNew:259 : OBJECT_NEW: obj=0x7fbb2c0abe00 classname=virDomainEventIOError
2021-07-11 17:59:01.276+0000: 94615: info : virObjectNew:259 : OBJECT_NEW: obj=0x7fbb2c0a1280 classname=virDomainEventIOError
2021-07-11 17:59:01.276+0000: 94615: info : virObjectNew:259 : OBJECT_NEW: obj=0x55d481c9c8a0 classname=virDomainEventLifecycle
2021-07-11 17:59:01.278+0000: 94615: info : virObjectUnref:381 : OBJECT_UNREF: obj=0x7fbb4011a210
2021-07-11 17:59:01.278+0000: 94615: info : virObjectUnref:381 : OBJECT_UNREF: obj=0x7fbb88059730
2021-07-11 17:59:01.278+0000: 27019: error : qemuMigrationAnyCompleted:1790 : operation failed: migration out job: failed due to I/O error
...
2021-07-11 17:59:01.278+0000: 27019: info : qemuMonitorSend:950 : QEMU_MONITOR_SEND_MSG: mon=0x7fbb88059730 msg={"execute":"migrate_cancel","id":"libvirt-1911"}

Comment 73 Peter Krempa 2021-07-12 09:41:40 UTC
That's the source log, if there's a connection I/O error, maybe the destination log will be more telling.

Additionally note that the libvirtd logs are not in debug mode, thus there's quite a lot of stuff missing.

Comment 77 Peter Krempa 2021-07-12 12:42:06 UTC
Okay it now makes sense to me:

The 'run' phase of migration is invoked as:

2021-07-12 11:25:11.399+0000: 274299: debug : qemuMigrationSrcRun:4028 : driver=0x7fb454114bc0, vm=0x55b596ab96a0, cookiein=<qemu-migration>
<name>New_VM</name>
<uuid>90798b5b-8f3b-4ae7-82be-0b401eee2d8b</uuid>
<hostname>storage-ge12-vdsm3.lab.eng.tlv2.redhat.com</hostname>
<hostuuid>14f530ec-59ea-482b-a9c3-5f4f45c72159</hostuuid>
<graphics type='spice' port='5900' listen='10.46.12.183' tlsPort='5901'>
<cert info='subject' value='O=lab.eng.tlv2.redhat.com,CN=storage-ge12-vdsm3.lab.eng.tlv2.redhat.com'/>
</graphics>
<capabilities>
<cap name='xbzrle' auto='no'/>
<cap name='auto-converge' auto='no'/>
<cap name='rdma-pin-all' auto='no'/>
<cap name='postcopy-ram' auto='no'/>
<cap name='compress' auto='no'/>
<cap name='pause-before-switchover' auto='no'/>
<cap name='late-block-activate' auto='yes'/>
<cap name='multifd' auto='no'/>
<cap name='dirty-bitmaps' auto='no'/>
</capabilities>
</qemu-migration>
, cookieinlen=842, cookieout=0x7fb4b0886558, cookieoutlen=0x7fb4b0886528, flags=0x100b, resource=52, spec=0x7fb4b0886380 (dest=1, fwd=0), dconn=0x7fb454016da0, graphicsuri=<null>, nmigrate_disks=0, migrate_disks=(nil)

You can see here that flags is 0x100b, which translates to the bitwise OR of the following:
VIR_MIGRATE_ABORT_ON_ERROR 	= 	4096 (0x1000; 1 << 12) 	
Cancel migration if a soft error (such as I/O error) happens during migration.

VIR_MIGRATE_PERSIST_DEST 	= 	8 (0x8; 1 << 3) 	
Define the domain as persistent on the destination host after successful migration. If the domain was persistent on the source host and VIR_MIGRATE_UNDEFINE_SOURCE is not used, it will end up persistent on both hosts.

VIR_MIGRATE_PEER2PEER 	= 	2 (0x2; 1 << 1) 	
Tell the source libvirtd to connect directly to the destination host. Without this flag the client (e.g., virsh) connects to both hosts and controls the migration process. In peer-to-peer mode, the source libvirtd controls the migration by calling the destination daemon directly.

VIR_MIGRATE_LIVE 	= 	1 (0x1; 1 << 0) 	
Do not pause the domain during migration. The domain's memory will be transferred to the destination host while the domain is running. The migration may never converge if the domain is changing its memory faster then it can be transferred. The domain can be manually paused anytime during migration using virDomainSuspend.

Libvirt then gets a BLOCK_IO_ERROR event from qemu due to ENOSPACE:

2021-07-12 11:25:19.202+0000: 276287: info : qemuMonitorJSONIOProcessLine:237 : QEMU_MONITOR_RECV_EVENT: mon=0x7fb4b4086080 event={"timestamp": {"seconds": 1626089119, "microseconds": 202393}, "event": "BLOCK_IO_ERROR", "data": {"device": "", "nospace": true, "node-name": "libvirt-1-format", "reason": "No space left on device", "operation": "write", "action": "stop"}}

This triggers cancellation of the migration, as VIR_MIGRATE_ABORT_ON_ERROR was requested.

2021-07-12 11:25:19.203+0000: 274299: error : qemuMigrationAnyCompleted:1790 : operation failed: migration out job: failed due to I/O error

Which then makes the migration fail.

Now, whether ENOSPACE should abort the migration under VIR_MIGRATE_ABORT_ON_ERROR is a policy decision, so I'm afraid we'd have to add another flag to avoid cancelling the migration when the I/O error that happens is ENOSPACE. In that case please open an RFE on libvirt.

Comment 78 Avihai 2021-07-12 12:48:01 UTC
(In reply to Peter Krempa from comment #77)
> [full quote of comment 77 trimmed]
> Now the question whether ENOSPACE should abort with
> VIR_MIGRATE_ABORT_ON_ERROR is a policy decision so I'm afraid we'll have to
> add another flag to avoid cancelling the migration if ENOSPACE is the IO
> error which happens. In such case please open a RFE on libvirt.

Michal, can you please confirm the following, as we discussed:

A VM migration that fails due to an I/O error (out of space, because the fix disables extension on the source host) is expected to fail.

If I understand correctly, this is expected behavior.

A VM migration in the same flow (new host -> old host) without extension should work.

Comment 79 Michal Skrivanek 2021-07-12 12:58:49 UTC
Yes, we always block live migration on EIO for safety. In this case it would be safe to migrate, since it's actually ENOSPACE, but because libvirt can't distinguish the two (and because this failure is temporary and will improve in the next release anyway) it's OK to fail the migration in this scenario.

Comment 80 sshmulev 2021-07-12 13:45:24 UTC
Checked the third scenario, migration from a new host -> an old one without disk extension - it passed successfully.

The steps to reproduce are mentioned in comment 63.

logs:
2021-07-12 15:57:56,288+0300 INFO  (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Creation of destination VM took: 2 seconds (migration:494)
2021-07-12 15:57:56,289+0300 INFO  (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') starting migration to qemu+tls://storage-ge12-vdsm3.lab.eng.tlv2.redhat.com/system with miguri tcp://10.46.12.183 (migration:543)


2021-07-12 15:58:14,622+0300 INFO  (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') migration took 20 seconds to complete (migration:548)
2021-07-12 15:58:14,622+0300 INFO  (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Changed state to Down: Migration succeeded (code=4) (vm:1895)
2021-07-12 15:58:14,629+0300 INFO  (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Stopping connection (guestagent:438)
2021-07-12 15:58:14,629+0300 INFO  (migsrc/90798b5b) [virt.vm] (vmId='90798b5b-8f3b-4ae7-82be-0b401eee2d8b') Enabling drive monitoring (drivemonitor:59)

UI admin log:
Migration completed (VM: New_VM, Source: host_mixed_1, Destination: host_mixed_3, Duration: 7 seconds, Total: 20 seconds, Actual downtime: (N/A))

Versions:

Old host:
vdsm-4.40.60.7
QEMU emulator version 5.2.0 (qemu-kvm-5.2.0-16.module+el8.4.0+11536+725e25d9.2)
ovirt-engine-4.4.7.6-0.11.el8ev

New host: 
vdsm-4.40.70.6-1.el8ev.x86_64
QEMU emulator version 5.2.0 (qemu-kvm-5.2.0-16.module+el8.4.0+11536+725e25d9.2)
ovirt-engine-4.4.7.6-0.11.el8ev

To summarize, the two other flows that were tested as well:
1) New host -> new host with disk extension:
The migration succeeded, but with the disk refresh error in comment 57 (filed a new bug on that: https://bugzilla.redhat.com/show_bug.cgi?id=1981079)
2) New host -> old host with disk extension:
The migration failed, see comment 77 & comment 79

Comment 82 Nir Soffer 2021-07-14 12:35:40 UTC
We missed an important aspect in these tests: where the SPM host is.

We have these flows:

1. destination host supports disk refresh, other host is SPM

In this flow we extend the volume on the SPM host, refresh the volume on the
destination and finally on the source. In 4.4.7 the volume is refreshed on
the destination, but the VM.disk_refresh call always fails (bug 1981307), so
we never refresh the volume on the source. The VM is likely to pause during
migration, which may cause the migration to fail.

2. destination host supports disk refresh, destination is SPM

Since the destination is the SPM, the disk is always refreshed when extending
it and the disk refresh call is not needed, but vdsm does not know this, so
the flow behaves exactly like flow 1.

3. destination host supports disk refresh, source is SPM

In this flow, when we extend the volume on the source it is also refreshed,
so we cannot maintain the rule of "refresh the destination before the source".
In 4.4 refreshing the disk on the destination always fails (bug 1981307), but
the source volume is already refreshed. Because of the failure to refresh the
destination, we don't set a new threshold, so the VM is likely to pause with
ENOSPC. This will trigger a new extend, but since the disk refresh flow
always fails, the VM is unlikely to be resumed. This may cause the migration
to fail, or the VM to be terminated (there is code trying to kill VMs if
they are paused for a long time).

Because the volume is refreshed on both sides, and the VM is paused, I hope
this cannot lead to data corruption when the migration finishes, but we need
to test this.

4. destination host does not support disk refresh, other host is SPM

In this flow, after we extend the disk on the source, we disable drive
monitoring, so the volume is never refreshed on the source or destination.
This will pause the VM, which may cause the migration to fail.

5. destination host does not support disk refresh, destination host is SPM

In this flow the volume is always refreshed on the destination, but vdsm does
not know this, so it behaves exactly like flow 4.

6. destination host does not support disk refresh, source host is SPM

Like flow 3, the disk is always refreshed on the source, but unlike flow 3,
it is never refreshed on the destination. Since we disable disk monitoring,
the VM is likely to pause, which may fail the migration. However, if we had
only one extend and the VM was not paused, at the end of the migration the
volume size on the destination will be smaller than the volume size on the
source, and we expect to see the same data corruption.


## Note for testing

Previously we tested writing in a loop during migration, which may cause the
VM to pause after disk extension was disabled. This hides the case in flow 6
when we had one extend on the source. When testing flow 6, we need to do one
write that causes one extend and writes data *after* the end of the volume
on the destination.

To test this properly, we must know how much free space we have on the disk, so
we want to test this with a second empty disk instead of the OS disk.

The tests should do:

1. Use standard chunk size in vdsm (1024 MiB)

2. Add a second 2G thin disk to the VM (e.g. /dev/sdb);
   the disk will be created with an initial size of 1G

3. Open shell in the guest and run one stress process:

    stress --vm 2 --vm-bytes 64M

We want to keep this running during migration, to make sure the migration does
not complete too quickly, before we extend the disk on the source.

On my setup this causes the migration to take about 60 seconds. You may need to
tune the values on your setup, depending on the networking conditions between the hosts.

4. Start migration in engine

Wait until the migration starts - the VM should start on the destination and migration
progress should be reported. On my setup waiting 10-20 seconds seems to work.

On my setup, migration progress starts at 0% for 10-20 seconds, and then jumps
to 83%. The progress is much slower from this point, and the migration takes
about 1 minute.

5. Open another shell to the guest and run

  dd if=/dev/zero bs=1M count=1280 of=/dev/sdb oflag=direct conv=fsync
  while true; do dd if=/dev/sdb bs=1M iflag=direct of=/dev/null; done

This will write 1280 MiB to the disk, which will cause one extend. Then
it will loop reading data from the disk. This may help to reproduce data
corruption when qemu tries to read data after the end of the device on the
destination.
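The expected single extend follows from simple chunk arithmetic. A sketch, assuming the 1 GiB initial allocation and chunk size and the 2 GiB virtual size from the steps above, and a simplified "extend when the write passes the current allocation" model rather than vdsm's actual block-threshold trigger:

```python
# Sketch of why the 1280 MiB write above triggers exactly one extend.
# Assumptions: 1 GiB initial allocation and chunk size, 2 GiB virtual size,
# and a simplified model that extends when a linear write passes the current
# allocation (vdsm's real trigger is a threshold slightly below it).
MiB = 1024**2
GiB = 1024**3

def extends_for_write(write_bytes, allocation=1 * GiB,
                      chunk=1 * GiB, virtual_size=2 * GiB):
    """Return (number of extends, final allocation) for one linear write."""
    extends = 0
    while write_bytes > allocation and allocation < virtual_size:
        allocation = min(allocation + chunk, virtual_size)
        extends += 1
    return extends, allocation

print(extends_for_write(1280 * MiB))  # -> (1, 2147483648): one extend to 2 GiB
```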

Example run from my setup:

[root@f32 ~]# dd if=/dev/zero bs=1M count=1280 of=/dev/sdb oflag=direct conv=fsync; while true; do dd if=/dev/sdb bs=1M iflag=direct of=/dev/null; done
1280+0 records in
1280+0 records out
1342177280 bytes (1.3 GB, 1.2 GiB) copied, 10.3691 s, 129 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 5.43809 s, 395 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 6.86508 s, 313 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 13.1546 s, 163 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 2.90183 s, 740 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 2.49112 s, 862 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 2.94468 s, 729 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 2.63086 s, 816 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 2.46283 s, 872 MB/s


When this flow is done, you can stop the guest processes, remove the second disk
and repeat the test.

Please report what flows were tested before (which host was the SPM).

Please include run logs from the guest and vdsm logs for every flow tested.

Comment 83 sshmulev 2021-07-15 09:05:01 UTC
(In reply to Nir Soffer from comment #82)
> We missed an important aspect in these tests: where the SPM host is.

> 5. destination host does not support disk refresh, source host is SPM
> 
> Like 3, disk is always refreshed on the source, but unlike 3, it is never
> refreshed on the destination. Since we disable disk monitoring the VM is
> likey to pause which may fail the migration. However if we had only one
> extend, and the vm was not paused, at the end of the migration the volume
> size on the destination will be smaller than the volume size on the host,
> and we expect the see the same data corruption.
> 
> 

Documenting the most critical test flow:
"Destination host does not support disk refresh, source host is SPM"
Tested this according to the suggested flow above (comment 82):


Expected results:
The disk is always refreshed on the source but, unlike flow 3, it is never
refreshed on the destination. Since we disable disk monitoring, the VM is
likely to pause, which may fail the migration. However, if we had only one
extend and the VM was not paused, at the end of the migration the volume
size on the destination will be smaller than the volume size on the source,
and we expect to see the same data corruption.


Actual results:
1) There was one extension during the migration:
2021-07-15 11:37:32,095+0300 INFO  (mailbox-hsm/4) [storage.ThreadPool.WorkerThread] START task 4cd3e80b-ba8d-4500-9bd9-de9a5ef83613 (cmd=<function runTask at 0x7f8b9d63c950>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f8b4fc980b8>>, {'domainID': 'aaa224bc-d6e7-454b-b715-ca0b5ea338c0', 'imageID': 'fa8d652f-7b7f-4694-b5fe-8674eb4ac139', 'internal': False, 'name': 'sda', 'newSize': 11811160064, 'poolID': 'bb020b99-8e13-4dea-b4c2-bd581a41d6ff', 'volumeID': '75ebabfc-4d1b-4d1c-9a8f-94caf9eb5344', 'clock': <Clock(total=2.06*, extend-volume=2.05, refresh-destination-volume=0.00*)>, 'callback': None})) (threadPool:146)
2021-07-15 11:37:32,096+0300 INFO  (mailbox-hsm/4) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Volume sda (domainID: aaa224bc-d6e7-454b-b715-ca0b5ea338c0, volumeID: 75ebabfc-4d1b-4d1c-9a8f-94caf9eb5344) was extended during migration, refreshing it on destination host. (vm:1441)

2) Migration Failed at the end:
2021-07-15 11:39:33,001+0300 ERROR (migsrc/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Failed to migrate (migration:472)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 446, in _regular_run
    time.time(), machineParams
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 546, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 635, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 564, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
    return getattr(self._vm._dom, name)(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2119, in migrateToURI3
    raise libvirtError('virDomainMigrateToURI3() failed')
libvirt.libvirtError: operation aborted: migration out job: canceled by client
2021-07-15 11:39:33,001+0300 INFO  (migsrc/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Enabling drive monitoring (drivemonitor:59)

3) No disk refresh on the destination, and no error logs.

[root@vm-13-28 mnt]# ./writing
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 14.1161 s, 76.1 MB/s

This flow was tested 3 times with the same results.

Comment 86 sshmulev 2021-07-15 10:07:40 UTC
(In reply to Nir Soffer from comment #82)
> We missed an important aspect in these tests: where the SPM host is.
> 
> 3. destination host support disk refresh, source is SPM


Reply to (3):
comment41 - migration fails
Source - is the SPM
attachment 1799568 [details], attachment 1799569 [details] 

comment41 - migration succeeded
Source - is the SPM
attachment 1799584 [details], attachment 1799585 [details]

Comment 87 Nir Soffer 2021-07-15 18:23:32 UTC
(In reply to sshmulev from comment #83)
I looked at attachment 1801788 [details], but there are 7 migrations in the log
and the host was not the SPM host during the entire time.

Please specify which migration from this log is relevant to the test for 
case 6 - destination host does not support disk refresh, source host is SPM.


Migration 1:

2021-07-15 10:24:18,732+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.011 seconds elapsed, 29% of data processed, total data: 1096MB, processed data: 301MB, remaining data: 782MB, transfer speed 416Mbps, zero pages: 3405MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
...
2021-07-15 10:28:22,709+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.111 seconds elapsed, 89% of data processed, total data: 1096MB, processed data: 6060MB, remaining data: 120MB, transfer speed 416Mbps, zero pages: 123274MB, compressed: 0MB, dirty rate: 13639, memory iteration: 47 (migration:853)


Migration 2:

2021-07-15 10:30:55,568+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.008 seconds elapsed, 34% of data processed, total data: 1096MB, processed data: 353MB, remaining data: 729MB, transfer speed 412Mbps, zero pages: 3779MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
...
2021-07-15 10:32:45,671+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.108 seconds elapsed, 90% of data processed, total data: 1096MB, processed data: 6065MB, remaining data: 110MB, transfer speed 412Mbps, zero pages: 125601MB, compressed: 0MB, dirty rate: 13507, memory iteration: 47 (migration:853)


Migration 3:

2021-07-15 10:34:25,980+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.006 seconds elapsed, 29% of data processed, total data: 1096MB, processed data: 306MB, remaining data: 782MB, transfer speed 412Mbps, zero pages: 2019MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
...
2021-07-15 10:36:16,077+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.097 seconds elapsed, 94% of data processed, total data: 1096MB, processed data: 6013MB, remaining data: 68MB, transfer speed 412Mbps, zero pages: 125788MB, compressed: 0MB, dirty rate: 14072, memory iteration: 46 (migration:853)


Migration 4:

2021-07-15 10:42:13,642+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.002 seconds elapsed, 27% of data processed, total data: 1096MB, processed data: 286MB, remaining data: 804MB, transfer speed 416Mbps, zero pages: 1811MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
...
2021-07-15 10:46:56,262+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.108 seconds elapsed, 89% of data processed, total data: 1096MB, processed data: 6049MB, remaining data: 120MB, transfer speed 416Mbps, zero pages: 130579MB, compressed: 0MB, dirty rate: 13611, memory iteration: 47 (migration:853)


Migration 5:

2021-07-15 11:10:26,866+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.003 seconds elapsed, 31% of data processed, total data: 1096MB, processed data: 327MB, remaining data: 766MB, transfer speed 416Mbps, zero pages: 755MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)


Migration 6:

2021-07-15 11:29:17,498+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 9.999 seconds elapsed, 34% of data processed, total data: 1096MB, processed data: 358MB, remaining data: 730MB, transfer speed 412Mbps, zero pages: 2185MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
...
2021-07-15 11:31:07,596+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.093 seconds elapsed, 92% of data processed, total data: 1096MB, processed data: 6065MB, remaining data: 91MB, transfer speed 412Mbps, zero pages: 112432MB, compressed: 0MB, dirty rate: 13475, memory iteration: 45 (migration:853)


Migration 7:

2021-07-15 11:37:30,960+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.003 seconds elapsed, 27% of data processed, total data: 1096MB, processed data: 291MB, remaining data: 803MB, transfer speed 416Mbps, zero pages: 675MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
...
2021-07-15 11:39:21,069+0300 INFO  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.105 seconds elapsed, 96% of data processed, total data: 1096MB, processed data: 5997MB, remaining data: 45MB, transfer speed 416Mbps, zero pages: 103115MB, compressed: 0MB, dirty rate: 13352, memory iteration: 44 (migration:853)

Comment 88 Nir Soffer 2021-07-15 21:41:59 UTC
(In reply to sshmulev from comment #83)

Based on the logs you mention:

> 1) There was one extension during the migration:
> 2021-07-15 11:37:32,095+0300 INFO  (mailbox-hsm/4)
...
> 2) Migration Failed at the end:
> 2021-07-15 11:39:33,001+0300 ERROR (migsrc/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Failed to migrate
> (migration:472)
...

Looking at the last migration in this log.

VM UUID: c8abb557-2bf4-4a31-b8b0-34ff433ae3c6

VM XML extracted from the log:

2021-07-15 10:01:54,326+0300 INFO  (vm/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') <?xml version='1.0' encoding='utf-8'?>
<domain xmlns:ns0="http://libvirt.org/schemas/domain/qemu/1.0" xmlns:ns1="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0" type="kvm">
    <name>New_VM</name>
    <uuid>c8abb557-2bf4-4a31-b8b0-34ff433ae3c6</uuid>
    <memory>1048576</memory>
    ...
        <disk device="disk" snapshot="no" type="block">
            <target bus="virtio" dev="vda" />
            <source dev="/rhev/data-center/mnt/blockSD/94c9c176-e9f8-4a0c-8904-6a1aa055ec35/images/8678cc28-3a1b-4d87-9861-851060c76dd4/ca58b7de-c4fe-42a4-a114-e7ea3688a822">
                <seclabel model="dac" relabel="no" type="none" />
            </source>
            <driver cache="none" error_policy="stop" io="native" iothread="1" name="qemu" type="qcow2" />
            <alias name="ua-8678cc28-3a1b-4d87-9861-851060c76dd4" />
            <boot order="1" />
            <serial>8678cc28-3a1b-4d87-9861-851060c76dd4</serial>
        </disk>
        <disk device="disk" snapshot="no" type="block">
            <target bus="scsi" dev="sda" />
            <source dev="/rhev/data-center/mnt/blockSD/aaa224bc-d6e7-454b-b715-ca0b5ea338c0/images/fa8d652f-7b7f-4694-b5fe-8674eb4ac139/75ebabfc-4d1b-4d1c-9a8f-94caf9eb5344">
                <seclabel model="dac" relabel="no" type="none" />
            </source>
            <driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2" />
            <alias name="ua-fa8d652f-7b7f-4694-b5fe-8674eb4ac139" />
            <address bus="0" controller="0" target="0" type="drive" unit="0" />
            <serial>fa8d652f-7b7f-4694-b5fe-8674eb4ac139</serial>
        </disk>

Based on the logs, we write to the second disk (sda), so I guess
this is the second, empty disk.


Migration events:

1. Migration request received

2021-07-15 11:37:19,108+0300 INFO  (jsonrpc/2) [api.virt] START migrate(params={'abortOnError': 'true', 'autoConverge': 'false', 'dst': 'storage-ge3-vdsm3.lab.eng.tlv2.redhat.com:54321', 'method': 'online', 'vmId': 'c8abb557-2bf4-4a31-b8b0-34ff433ae3c6',
 'src': 'storage-ge3-vdsm2.lab.eng.tlv2.redhat.com', 'dstqemu': '10.46.12.147', 'outgoingLimit': 2, 'enableGuestEvents': False, 'tunneled': 'false', 'encrypted': False, 'compressed': 'false', 'incomingLimit': 2}) from=::ffff:10.46.12.144,37672, flow_id=4
7822ad9-f74f-41d2-9bf2-65a326d3f494, vmId=c8abb557-2bf4-4a31-b8b0-34ff433ae3c6 (api:48)

2021-07-15 11:37:19,108+0300 INFO  (jsonrpc/2) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') using a computed convergence schedule for a legacy migration: {'init': [{'params': ['101'], 'name': 'setDowntime'}], 'stalling': [{'action': {'params':
 ['104'], 'name': 'setDowntime'}, 'limit': 1}, {'action': {'params': ['120'], 'name': 'setDowntime'}, 'limit': 2}, {'action': {'params': ['189'], 'name': 'setDowntime'}, 'limit': 3}, {'action': {'params': ['500'], 'name': 'setDowntime'}, 'limit': 4}, {'a
ction': {'params': ['500'], 'name': 'setDowntime'}, 'limit': 42}, {'action': {'params': [], 'name': 'abort'}, 'limit': -1}]} (migration:161)

I'm not sure what the convergence schedule parameters mean; Milan can
explain that.

2. Destination VM was created

2021-07-15 11:37:20,950+0300 INFO  (migsrc/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Creation of destination VM took: 1 seconds (migration:494)

3. Migrating to destination host started

2021-07-15 11:37:20,950+0300 INFO  (migsrc/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') starting migration to qemu+tls://storage-ge3-vdsm3.lab.eng.tlv2.redhat.com/system with miguri tcp://10.46.12.147 (migration:543)


4. Requesting extension

2021-07-15 11:37:30,041+0300 INFO  (periodic/4) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Requesting extension for volume 75ebabfc-4d1b-4d1c-9a8f-94caf9eb5344 on domain aaa224bc-d6e7-454b-b715-ca0b5ea338c0 (apparent: 10737418240, capacity: 32212254720, allocated: 10259070976, physical: 10737418240 threshold_state: exceeded) (vm:1348)

It is not clear why we request an extension, since we did not get a block
threshold event from libvirt.

The current volume size (1 GiB) makes sense.

5. Sending extend message

2021-07-15 11:37:30,042+0300 INFO  (periodic/4) [vdsm.api] START sendExtendMsg(spUUID='bb020b99-8e13-4dea-b4c2-bd581a41d6ff', volDict={'domainID': 'aaa224bc-d6e7-454b-b715-ca0b5ea338c0', 'imageID': 'fa8d652f-7b7f-4694-b5fe-8674eb4ac139', 'internal': False, 'name': 'sda', 'newSize': 11811160064, 'poolID': 'bb020b99-8e13-4dea-b4c2-bd581a41d6ff', 'volumeID': '75ebabfc-4d1b-4d1c-9a8f-94caf9eb5344', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': None}, newSize=11811160064, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f8b4fc980b8>>) from=internal, task_id=553a7399-b7e4-403c-b8ea-1b7185dd5bcf (api:48)

The new size (11811160064 bytes, 11 GiB) does not make sense!
With the default chunk size (1024 MiB) we would expect to see
2048 MiB here.
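That said, the requested value is internally consistent with the reported physical size: chunked extension requests roughly one chunk above the current physical size, capped at the virtual capacity. A minimal sketch (simplified, not vdsm's actual code; the function name is hypothetical):

```python
# Simplified sketch of chunked extension sizing: request one chunk
# (default 1024 MiB) above the current physical size, capped at the
# volume's virtual capacity.
CHUNK_MB = 1024

def next_extend_size(physical, capacity, chunk_mb=CHUNK_MB):
    return min(physical + chunk_mb * 1024**2, capacity)

# Values from the extension request above:
print(next_extend_size(physical=10737418240, capacity=32212254720))
# -> 11811160064 (the logged newSize, 11 GiB)
```

So the 11 GiB request follows mechanically from the surprising 10 GiB physical size reported in the monitoring log.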

6. Extending volume on the SPM

2021-07-15 11:37:30,403+0300 INFO  (mailbox-spm/0) [storage.LVM] Extending LV aaa224bc-d6e7-454b-b715-ca0b5ea338c0/75ebabfc-4d1b-4d1c-9a8f-94caf9eb5344 to 11264 megabytes (lvm:1668)

2021-07-15 11:37:30,644+0300 DEBUG (mailbox-spm/0) [storage.SPM.Messages.Extend] new extend msg created: domain: aaa224bc-d6e7-454b-b715-ca0b5ea338c0, volume: 75ebabfc-4d1b-4d1c-9a8f-94caf9eb5344 (mailbox:130)

At this point the volume on the source host was extended to 11 GiB.

7. Trying to extend volume on the destination

2021-07-15 11:37:32,096+0300 INFO  (mailbox-hsm/4) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Volume sda (domainID: aaa224bc-d6e7-454b-b715-ca0b5ea338c0, volumeID: 75ebabfc-4d1b-4d1c-9a8f-94caf9eb5344) was extended during migration, refreshing it on destination host. (vm:1441)

8. Detected that the destination host does not support extending disks; disabling disk extension

2021-07-15 11:37:32,542+0300 WARN  (mailbox-hsm/4) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration destination host does not support extending disk during migration, disabling disk extension during migration (vm:1586)

9. Warnings about migration stalling

2021-07-15 11:38:00,996+0300 WARN  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration stalling: remaining (41MiB) > lowmark (14MiB). (migration:840)

2021-07-15 11:38:11,007+0300 WARN  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration stalling: remaining (64MiB) > lowmark (14MiB). (migration:840)

...

10. Migration was canceled

2021-07-15 11:39:31,072+0300 WARN  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration stalling: remaining (97MiB) > lowmark (14MiB). (migration:840)
2021-07-15 11:39:31,072+0300 WARN  (migmon/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Aborting migration (migration:889)


It looks like we stressed the VM too much, so the migration could not
converge and was aborted by vdsm.


I think we need to run the stress tool only at the start of the
migration, until we finish writing data to the disk. Once the writes are
done, we know the disk was extended on the source host, and there is no
reason to delay the migration.

There are 2 kinds of corruption that may occur when the disk on the destination
is smaller than the disk on the source:

- qcow2 corruption - reported in this bug

- guest data corruption - the guest wrote data on the source which is not
  available on the destination. This is likely to corrupt the guest file system.

Reproducing qcow2 corruption looks hard; I could not reproduce it on my setup.
But reproducing guest data corruption is easy: compute a checksum
of the disk from the guest while running on the source, then compute the
checksum again after the migration is done. If the disk on the destination
was not extended, data written past the end of the original disk will read
as zeroes and the checksums will not match.
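The mismatch mechanism can be illustrated with a toy model (ordinary byte strings standing in for block devices, sizes scaled down): on a destination whose volume was never extended, everything past the original size reads as zeroes, so the checksum diverges from the source.

```python
import hashlib

# Toy model: `data` is what the guest wrote on the source; the
# destination volume still ends at `orig_size`, so its tail reads as zeroes.
orig_size = 1024
data = b"x" * 2048  # guest data spanning the extended region

src_view = data
dst_view = data[:orig_size] + b"\0" * (len(data) - orig_size)

print(hashlib.sha1(src_view).hexdigest() == hashlib.sha1(dst_view).hexdigest())
# -> False: the checksums no longer match
```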

I'm attaching a reproducer script that you can run in the guest to reproduce
this corruption.

To reproduce:
1. In the guest, start the script:

[root@f32 ~]# ./reproduce.sh
17:26:27.660876580 Starting stress
Press Enter when migration was started:


2. In engine, start the migration

3. Wait until the migration shows the first non-zero progress message
   (clicking refresh in the engine UI every few seconds helps)

4. Press Enter in the guest script to continue the reproducer

17:27:03.735988728 Writing to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.21757 s, 242 MB/s
17:27:05.966949840 Waiting until disk is extended
17:27:16.040376014 Writing more data to disk...
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.84322 s, 189 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0197118 s, 53.2 MB/s
17:27:19.041140885 Computing checksum...
17:27:29.890113434 Disk checksum 1a5350b04ce65e0ffac9a10d46f282769865e3db
17:27:29.908828362 Stopping stress
17:27:29.944121095 Computing checksum...
17:27:37.706975020 Disk checksum 1a5350b04ce65e0ffac9a10d46f282769865e3db
17:27:37.708425142 Computing checksum...
17:27:43.754118454 Disk checksum 91d50642dd930e9542c39d36f0516d45f4e1af0d
17:27:43.756070975 Disk was corrupted


The script writes data to the disk, triggering extension. Then it writes
more data, including the first MiB past the end of the original disk size.

When writing is done, the script computes the first checksum.

It then kills the stress processes so the migration can complete
quickly.

Then the script loops, computing the checksum of the disk. When the migration
is done, the checksum on the destination will be different because the
destination volume was never extended.
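The checksum step can be sketched like this (a self-contained sketch using a regular file in place of the guest's block device; the real reproducer is the attached shell script):

```python
import hashlib
import tempfile

def checksum(path, bufsize=1 << 20):
    """Stream the device/file and return its SHA-1 hex digest."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()

# Demo on a temporary file standing in for the guest disk:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"guest data" * 100)
    path = f.name

before = checksum(path)

# Simulate landing on a destination where the tail was never extended,
# so the data there reads back as zeroes:
with open(path, "r+b") as f:
    f.seek(500)
    f.write(b"\0" * 500)

print(checksum(path) != before)  # -> True: "Disk was corrupted"
```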

Attaching the script and logs from my setup during the corruption.

Comment 91 Nir Soffer 2021-07-15 22:19:03 UTC
In attachment 1802104 [details] we can see:

## src.log

1. Migration started

2021-07-16 00:26:38,357+0300 INFO  (jsonrpc/5) [api.virt] START migrate(params={'abortOnError': 'true', 'autoConverge': 'true', 'dst': 'host4:54321', 'method': 'online', 'vmId': '9c4050b1-8070-429f-ae36-9eae0935ccab', 'src': 'host3', 'dstqemu': '192.168.122.23', 'convergenceSchedule': {'init': [{'name': 'setDowntime', 'params': ['100']}], 'stalling': [{'limit': 1, 'action': {'name': 'setDowntime', 'params': ['150']}}, {'limit': 2, 'action': {'name': 'setDowntime', 'params': ['200']}}, {'limit': 3, 'action': {'name': 'setDowntime', 'params': ['300']}}, {'limit': 4, 'action': {'name': 'setDowntime', 'params': ['400']}}, {'limit': 6, 'action': {'name': 'setDowntime', 'params': ['500']}}, {'limit': -1, 'action': {'name': 'abort', 'params': []}}]}, 'outgoingLimit': 2, 'enableGuestEvents': True, 'tunneled': 'false', 'encrypted': False, 'compressed': 'false', 'incomingLimit': 2}) from=::ffff:192.168.122.11,48102, flow_id=199fded5-d0fb-45dd-b53a-e3326ecc1aa9, vmId=9c4050b1-8070-429f-ae36-9eae0935ccab (api:48)
2021-07-16 00:26:38,360+0300 INFO  (jsonrpc/5) [api.virt] FINISH migrate return={'status': {'code': 0, 'message': 'Migration in progress'}, 'progress': 0} from=::ffff:192.168.122.11,48102, flow_id=199fded5-d0fb-45dd-b53a-e3326ecc1aa9, vmId=9c4050b1-8070-429f-ae36-9eae0935ccab (api:54)

2. Migration progress started

2021-07-16 00:27:01,532+0300 INFO  (migmon/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Migration Progress: 20.012 seconds elapsed, 83% of data processed, total data: 1096MB, processed data: 858MB, remaining data: 188MB, transfer spe
ed 416Mbps, zero pages: 13331MB, compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:886)


3. Block threshold exceeded

2021-07-16 00:27:05,718+0300 INFO  (libvirt/events) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Block threshold 536870912 exceeded by 327680 for drive 'sdg' (/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/53594314-c33d-410e-a3c3-910f405b5131/c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa) (drivemonitor:173)

4. Drive extended

2021-07-16 00:27:07,137+0300 INFO  (mailbox-spm/0) [storage.LVM] Extending LV feab3738-c158-4d48-8a41-b5a95c057a50/c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa to 2048 megabytes (lvm:1668)

5. Detecting that the destination does not support extension, disabling disk extension

2021-07-16 00:27:09,493+0300 WARN  (mailbox-hsm/2) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Migration destination host does not support extending disk during migration, disabling disk extension during migration (vm:1593)

6. Migration continues after drive was extended

2021-07-16 00:27:11,535+0300 INFO  (migmon/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Migration Progress: 30.014 seconds elapsed, 90% of data processed, total data: 1096MB, processed data: 1373MB, remaining data: 118MB, transfer speed 416Mbps, zero pages: 43933MB, compressed: 0MB, dirty rate: 13953, memory iteration: 3 (migration:886)
...
2021-07-16 00:27:31,545+0300 INFO  (migmon/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Migration Progress: 50.022 seconds elapsed, 86% of data processed, total data: 1096MB, processed data: 2413MB, remaining data: 157MB, transfer speed 416Mbps, zero pages: 45896MB, compressed: 0MB, dirty rate: 12900, memory iteration: 7 (migration:886)

7. Migration completes successfully

2021-07-16 00:27:39,048+0300 INFO  (libvirt/events) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') CPU stopped: onSuspend (vm:6018)
2021-07-16 00:27:40,778+0300 INFO  (libvirt/events) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') new computed progress 86 < than old value 90, discarded (migration:255)
2021-07-16 00:27:40,780+0300 INFO  (migsrc/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') migration took 62 seconds to complete (migration:580)
2021-07-16 00:27:40,780+0300 INFO  (migsrc/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Changed state to Down: Migration succeeded (code=4) (vm:1902)


## dst.log

1. Migration started

2021-07-16 00:26:38,398+0300 INFO  (jsonrpc/2) [api.virt] START migrationCreate...

2. Destination disk volume size is 1G

2021-07-16 00:26:39,435+0300 INFO  (vm/9c4050b1) [vdsm.api] START getVolumeSize(sdUUID='feab3738-c158-4d48-8a41-b5a95c057a50', spUUID='3f8e08e6-5485-429e-a686-63754906b27b', imgUUID='53594314-c33d-410e-a3c3-910f405b5131', volUUID='c7be3c53-ce6d-4cd5-92cd
-fe810b6aa2aa', options=None) from=internal, task_id=31c71d10-c09f-4e7f-ab67-35a8f3343e3d (api:48)

2021-07-16 00:26:39,435+0300 INFO  (vm/9c4050b1) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1073741824', 'truesize': '1073741824'} from=internal, task_id=31c71d10-c09f-4e7f-ab67-35a8f3343e3d (api:54)

3. Migration completed

2021-07-16 00:27:39,199+0300 INFO  (libvirt/events) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') CPU running: onResume (vm:6018)
2021-07-16 00:27:39,214+0300 INFO  (libvirt/events) [vds] Channel state for vm_id=9c4050b1-8070-429f-ae36-9eae0935ccab changed from=disconnected(2) to=connected(1) (qemuguestagent:289)
2021-07-16 00:27:39,221+0300 INFO  (vm/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Starting connection (guestagent:252)
2021-07-16 00:27:39,226+0300 INFO  (vm/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') CPU running: domain initialization (vm:6018)
2021-07-16 00:27:39,236+0300 INFO  (vm/9c4050b1) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') End of migration (vm:4081)

4. Setting block threshold on the extended disk

2021-07-16 00:27:40,365+0300 INFO  (periodic/3) [virt.vm] (vmId='9c4050b1-8070-429f-ae36-9eae0935ccab') Setting block threshold to 536870912 bytes for drive 'sdg' (/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/53594314-c33d-410e-a3c3-910f405b5131/c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa) apparentsize 1073741824 (drivemonitor:116)

The destination host still sees a 1 GiB disk, since the disk was not
refreshed on the destination.

The disk on the destination is not extended after the migration:

$ egrep -A1 getVolumeSize.+c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa vdsm.log 
2021-07-16 00:26:39,435+0300 INFO  (vm/9c4050b1) [vdsm.api] START getVolumeSize(sdUUID='feab3738-c158-4d48-8a41-b5a95c057a50', spUUID='3f8e08e6-5485-429e-a686-63754906b27b', imgUUID='53594314-c33d-410e-a3c3-910f405b5131', volUUID='c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa', options=None) from=internal, task_id=31c71d10-c09f-4e7f-ab67-35a8f3343e3d (api:48)
2021-07-16 00:26:39,435+0300 INFO  (vm/9c4050b1) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1073741824', 'truesize': '1073741824'} from=internal, task_id=31c71d10-c09f-4e7f-ab67-35a8f3343e3d (api:54)
...
2021-07-16 00:42:33,940+0300 INFO  (periodic/0) [vdsm.api] START getVolumeSize(sdUUID='feab3738-c158-4d48-8a41-b5a95c057a50', spUUID='3f8e08e6-5485-429e-a686-63754906b27b', imgUUID='53594314-c33d-410e-a3c3-910f405b5131', volUUID='c7be3c53-ce6d-4cd5-92cd-fe810b6aa2aa', options=None) from=internal, task_id=be57bda6-143c-44fb-bf7a-f5bcb5382aa7 (api:48)
2021-07-16 00:42:33,940+0300 INFO  (periodic/0) [vdsm.api] FINISH getVolumeSize return={'apparentsize': '1073741824', 'truesize': '1073741824'} from=internal, task_id=be57bda6-143c-44fb-bf7a-f5bcb5382aa7 (api:54)

This will cause guest data corruption as soon as the guest tries to read
from the missing parts of the disk.

Comment 92 Nir Soffer 2021-07-16 01:07:08 UTC
Created attachment 1802156 [details]
Reproducer script

Comment 93 sshmulev 2021-07-18 07:06:54 UTC
(In reply to Nir Soffer from comment #87)
> (In reply to sshmulev from comment #83)
> I looked at attachment 1801788 [details], but there are 7 migrations in the
> log
> and the host was not the SPM host during the entire time.
> 
The host in this log is the SPM throughout the entire log; for example: "2021-07-15 11:45:46,645+0300 INFO  (jsonrpc/0) [vdsm.api] START getSpmStatus"


> Please specify which migration from this log is relevant to the test for 
> case 6 - destination host does not support disk refresh, source host is SPM.
> 

Migration 7 is relevant to the test for case 6. (I mentioned it in comment 83; the rest are repeated attempts of the same test, and I chose the last one for the bug documentation.)

2021-07-15 11:37:20,950+0300 INFO  (migsrc/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') starting migration

As I mentioned in comment 83, this is when the migration on the SPM host failed:
2021-07-15 11:39:33,001+0300 ERROR (migsrc/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Failed to migrate (migration:472)
libvirt.libvirtError: operation aborted: migration out job: canceled by client
2021-07-15 11:39:33,001+0300 INFO  (migsrc/c8abb557) [virt.vm] (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Enabling drive monitoring (drivemonitor:59)

> 
> Migration 1:
> 
> 2021-07-15 10:24:18,732+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.011
> seconds elapsed, 29% of data processed, total data: 1096MB, processed data:
> 301MB, remaining data: 782MB, transfer speed 416Mbps, zero pages: 3405MB,
> compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
> ...
> 2021-07-15 10:28:22,709+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.111
> seconds elapsed, 89% of data processed, total data: 1096MB, processed data:
> 6060MB, remaining data: 120MB, transfer speed 416Mbps, zero pages: 123274MB,
> compressed: 0MB, dirty rate: 13639, memory iteration: 47 (migration:853)
> 
> 
> Migration 2:
> 
> 2021-07-15 10:30:55,568+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.008
> seconds elapsed, 34% of data processed, total data: 1096MB, processed data:
> 353MB, remaining data: 729MB, transfer speed 412Mbps, zero pages: 3779MB,
> compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
> ...
> 2021-07-15 10:32:45,671+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.108
> seconds elapsed, 90% of data processed, total data: 1096MB, processed data:
> 6065MB, remaining data: 110MB, transfer speed 412Mbps, zero pages: 125601MB,
> compressed: 0MB, dirty rate: 13507, memory iteration: 47 (migration:853)
> 
> 
> Migration 3:
> 
> 2021-07-15 10:34:25,980+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.006
> seconds elapsed, 29% of data processed, total data: 1096MB, processed data:
> 306MB, remaining data: 782MB, transfer speed 412Mbps, zero pages: 2019MB,
> compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
> ...
> 2021-07-15 10:36:16,077+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.097
> seconds elapsed, 94% of data processed, total data: 1096MB, processed data:
> 6013MB, remaining data: 68MB, transfer speed 412Mbps, zero pages: 125788MB,
> compressed: 0MB, dirty rate: 14072, memory iteration: 46 (migration:853)
> 
> 
> Migration 4:
> 
> 2021-07-15 10:42:13,642+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.002
> seconds elapsed, 27% of data processed, total data: 1096MB, processed data:
> 286MB, remaining data: 804MB, transfer speed 416Mbps, zero pages: 1811MB,
> compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
> ...
> 2021-07-15 10:46:56,262+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.108
> seconds elapsed, 89% of data processed, total data: 1096MB, processed data:
> 6049MB, remaining data: 120MB, transfer speed 416Mbps, zero pages: 130579MB,
> compressed: 0MB, dirty rate: 13611, memory iteration: 47 (migration:853)
> 
> 
> Migration 5:
> 
> 2021-07-15 11:10:26,866+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.003
> seconds elapsed, 31% of data processed, total data: 1096MB, processed data:
> 327MB, remaining data: 766MB, transfer speed 416Mbps, zero pages: 755MB,
> compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
> 
> 
> Migration 6:
> 
> 2021-07-15 11:29:17,498+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 9.999
> seconds elapsed, 34% of data processed, total data: 1096MB, processed data:
> 358MB, remaining data: 730MB, transfer speed 412Mbps, zero pages: 2185MB,
> compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
> ...
> 2021-07-15 11:31:07,596+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.093
> seconds elapsed, 92% of data processed, total data: 1096MB, processed data:
> 6065MB, remaining data: 91MB, transfer speed 412Mbps, zero pages: 112432MB,
> compressed: 0MB, dirty rate: 13475, memory iteration: 45 (migration:853)
> 
> 
> Migration 7:
> 
> 2021-07-15 11:37:30,960+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 10.003
> seconds elapsed, 27% of data processed, total data: 1096MB, processed data:
> 291MB, remaining data: 803MB, transfer speed 416Mbps, zero pages: 675MB,
> compressed: 0MB, dirty rate: 0, memory iteration: 1 (migration:853)
> ...
> 2021-07-15 11:39:21,069+0300 INFO  (migmon/c8abb557) [virt.vm]
> (vmId='c8abb557-2bf4-4a31-b8b0-34ff433ae3c6') Migration Progress: 120.105
> seconds elapsed, 96% of data processed, total data: 1096MB, processed data:
> 5997MB, remaining data: 45MB, transfer speed 416Mbps, zero pages: 103115MB,
> compressed: 0MB, dirty rate: 13352, memory iteration: 44 (migration:853)

Comment 97 errata-xmlrpc 2021-07-22 15:08:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.7]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2864
