Bug 2057267

Summary: Migration with postcopy fail when vm set with shared memory
Product: Red Hat Enterprise Linux 9
Component: qemu-kvm
qemu-kvm sub component: Live Migration
Version: 9.0
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: yalzhang <yalzhang>
Assignee: Peter Xu <peterx>
QA Contact: Li Xiaohui <xiaohli>
CC: coli, dhildenb, djdumas, fjin, jinzhao, juzhang, lcong, leobras, lmen, mrezanin, nilal, peterx, quintela, smitterl, virt-maint, xiaohli, xuzhang, yanghliu
Keywords: Triaged
Target Milestone: rc
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: qemu-kvm-8.0.0-4.el9
Last Closed: 2023-11-07 08:26:38 UTC
Type: Bug

Description yalzhang@redhat.com 2022-02-23 06:02:58 UTC
Description of problem:
Postcopy migration fails when the VM is configured with shared memory.

Version-Release number of selected component (if applicable):
source and target host:
# rpm -q libvirt qemu-kvm
libvirt-8.0.0-4.el9.x86_64
qemu-kvm-6.2.0-9.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Set up the migration environment and start the VM with shared memory as below:
# virsh dumpxml test
...
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <memoryBacking>
    <access mode='shared'/>
  </memoryBacking>
  <vcpu placement='static'>8</vcpu>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel8.6.0'>hvm</type>
    <boot dev='hd'/>
  </os>
...
  <cpu>
...
    <numa>
      <cell id='0' cpus='0-7' memory='2097152' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
...

2. Migrate the vm with postcopy:
# virsh migrate test --live --verbose qemu+ssh://10.66.xx.xx/system --p2p   --persistent --undefinesource --postcopy --timeout 5 --timeout-postcopy
Migration: [ 95 %]error: internal error: qemu unexpectedly closed the monitor: 2022-02-20T09:53:52.819361Z qemu-kvm: ram_block_enable_notify userfault register: Invalid argument
2022-02-20T09:53:52.819424Z qemu-kvm: ram_block_enable_notify failed
2022-02-20T09:53:52.819537Z qemu-kvm: cleanup_range: userfault unregister Invalid argument
2022-02-20T09:53:59.337425Z qemu-kvm: load of migration failed: Operation not permitted

# virsh list --all
 Id   Name             State
---------------------------------
 8    test             paused

# cat /var/log/libvirt/qemu/test.log
......
2022-02-20 09:53:47.302+0000: initiating migration
2022-02-20T09:53:59.338270Z qemu-kvm: failed to save SaveStateEntry with id(name): 2(ram)
2022-02-20T09:53:59.338484Z qemu-kvm: Detected IO failure for postcopy. Migration paused.

# virsh event --all --loop
event 'migration-iteration' for domain 'test': iteration: '1'
event 'lifecycle' for domain 'test': Suspended Migrated
event 'lifecycle' for domain 'test': Suspended Post-copy
event 'migration-iteration' for domain 'test': iteration: '2'

Check the libvirtd log:
2022-02-20 10:11:06.154+0000: 151429: debug : virThreadJobSet:93 : Thread 151429 (rpc-virtqemud) is now running job remoteDispatchDomainGetJobInfo
2022-02-20 10:11:06.154+0000: 151426: debug : qemuMigrationSrcRestoreDomainState:136 : driver=0x7f39fc024410, vm=0x7f39fc09a6b0, pre-mig-state=running, state=paused, reason=post-copy failed

Actual results:
When the VM is configured with shared memory, postcopy migration fails.

Expected results:
Migration should succeed

Additional info:
With hugepages configured, postcopy migration succeeds:
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB'/>
    </hugepages>
    <access mode='shared'/>
  </memoryBacking>
If "shared memory without hugepages" is an invalid configuration, there should be an error or warning.

Comment 1 Li Xiaohui 2022-02-24 02:49:11 UTC
Hi Yalan,
I'm curious how you could configure shared memory without hugepages.
Can you provide the full qemu command lines for the two scenarios below so that I can try from the qemu side:
1) with shared memory and hugepage;
2) with shared memory but no hugepage setting.

Comment 2 yalzhang@redhat.com 2022-02-25 07:35:42 UTC
(In reply to Li Xiaohui from comment #1)
> Hi Yalan,
> I'm curious how you could configure shared memory without hugepages.
Yes, libvirt and qemu do not report any error when I start the vm.

> Can you provide the full qemu command lines for the two scenarios below so
> that I can try from the qemu side:
> 1) with shared memory and hugepage;

qemu      147772 99.4  0.2 6280344 75612 ?       Sl   02:32   0:12 /usr/libexec/qemu-kvm -name guest=rhel
......
 -machine pc-q35-rhel9.0.0,usb=off,dump-guest-core=off ……
-m 2048 -overcommit mem-lock=off -smp 8,sockets=8,cores=1,threads=1 
-object {"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/dev/hugepages/libvirt/qemu/17-rhel","share":true,"prealloc":true,"size":2147483648} 
-numa node,nodeid=0,cpus=0-7,memdev=ram-node0 
-uuid 7ffd0ecf-504b-43a9-91f4-f339e98d136b -no-user-config -nodefaults 


> 2) with shared memory but no hugepage setting.

qemu      145371 39.8  1.8 7234220 596756 ?      Sl   02:04   0:26 /usr/libexec/qemu-kvm -name guest=rhel,debug-threads=on -S -object 
……
-machine pc-q35-rhel9.0.0,usb=off,dump-guest-core=off -accel kvm 
……
-m 2048 -overcommit mem-lock=off -smp 8,sockets=8,cores=1,threads=1 
-object {"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/var/lib/libvirt/qemu/ram/17-rhel/ram-node0","share":true,"size":2147483648} 
-numa node,nodeid=0,cpus=0-7,memdev=ram-node0 -uuid 7ffd0ecf-504b-43a9-91f4-f339e98d136b -no-user-config -nodefaults

Comment 3 Li Xiaohui 2022-03-01 09:21:43 UTC
Thanks Yalan for providing qemu commands.

Reproduced the bug through both qemu and libvirt on RHEL 9.0.0 (qemu-kvm-6.2.0-10.el9.x86_64 & libvirt-8.0.0-5.el9.x86_64):
1. If hugepages are set on the src and dst hosts, postcopy migration succeeds;
2. If hugepages are not set and the mem-path is not on a hugetlbfs mount, postcopy fails: migration on the src host pauses, and qemu on the dst host quits with an error. Plain migration succeeds in this scenario.

Steps to reproduce:
1. Create a directory on the src and dst hosts:
# mkdir -p /var/lib/libvirt/qemu/ram/17-rhel
2. Boot the VM with these qemu options:
-machine q35 \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
......
-object memory-backend-file,id=ram-node0,mem-path=/var/lib/libvirt/qemu/ram/17-rhel/ram-node0,share=true,size=4096M \
-numa node,nodeid=0,cpus=0-3,memdev=ram-node0 \
3. Migrate the VM from the src host to the dst host using postcopy.

Actual result:
Migration pauses during the postcopy-active stage.

(1)src qemu:
(qemu) 2022-03-01T03:56:48.244543Z qemu-kvm: failed to save SaveStateEntry with id(name): 1(ram)
2022-03-01T03:56:48.244598Z qemu-kvm: Detected IO failure for postcopy. Migration paused.
(2)dst qemu:
(qemu) 2022-03-01T03:56:40.085671Z qemu-kvm: ram_block_enable_notify userfault register: Invalid argument
2022-03-01T03:56:40.085761Z qemu-kvm: ram_block_enable_notify failed
2022-03-01T03:56:40.085943Z qemu-kvm: cleanup_range: userfault unregister Invalid argument
2022-03-01T03:56:48.242290Z qemu-kvm: load of migration failed: Operation not permitted

Comment 4 Li Xiaohui 2022-03-01 09:33:32 UTC
Postcopy with memory-backend-ram also succeeds:
-machine q35 \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
...
-object memory-backend-ram,id=ram-node0,share=true,size=4096M \
-numa node,nodeid=0,cpus=0-3,memdev=ram-node0 \

Comment 5 Li Xiaohui 2022-03-01 09:36:05 UTC
Hi David, can you check Comment 3: do we support booting a VM with memory-backend-file but no hugepages configured? And should we report an error when the VM is started in such a scenario?

Comment 6 Dr. David Alan Gilbert 2022-03-01 13:25:15 UTC
(In reply to Li Xiaohui from comment #3)


That's an invalid configuration for postcopy; using '/var/lib/libvirt/qemu/ram/17-rhel' only works if it's actually
a shared memory filesystem, not if it's a normal on-disk file.

So please retest using a directory on /dev/shm or a separate shmfs mount.

Comment 7 David Hildenbrand 2022-03-01 16:27:45 UTC
(In reply to Li Xiaohui from comment #5)
> Hi David, can you check Comment 3: do we support booting a VM with
> memory-backend-file but no hugepages configured? And should we report an
> error when the VM is started in such a scenario?

We do support postcopy with memory-backend-file in case it references shared memory ("virtual file" on shmem). We don't support postcopy with memory-backend-file in case it references an actual file.

I agree that the error message is somewhat suboptimal: it merely states that userfaultfd (used by postcopy) is not supported on that memory backing.
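
A quick way to tell the two cases apart is to look at the filesystem type behind the mem-path directory. The following is only a sketch: it assumes GNU stat is available, and the helper names fs_type_supports_postcopy and check_mem_path are made up for illustration. The rule it encodes (tmpfs or hugetlbfs works, anything else is a regular file backing) is the one discussed in this bug, not something read out of the QEMU source.

```shell
# Succeed if the given filesystem type name is one that postcopy can use.
# Assumption: only shmem (tmpfs) and hugetlbfs backings qualify.
fs_type_supports_postcopy() {
    case "$1" in
        tmpfs|hugetlbfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the filesystem type behind a directory and whether it qualifies.
# Returns 0 if it qualifies, 1 if not, 2 if stat itself failed.
check_mem_path() {
    dir=$1
    # GNU stat: -f queries the filesystem, %T prints its type name.
    fstype=$(stat -f -c %T "$dir") || return 2
    if fs_type_supports_postcopy "$fstype"; then
        echo "$dir: $fstype (postcopy-capable backing)"
    else
        echo "$dir: $fstype (regular file backing; postcopy expected to fail)"
        return 1
    fi
}

# Example (not run here):
#   check_mem_path /dev/shm
#   check_mem_path /var/lib/libvirt/qemu/ram
```

Run on both hosts before attempting postcopy, this gives an early hint without having to start a migration.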

Comment 8 yalzhang@redhat.com 2022-03-03 01:41:37 UTC
(In reply to Dr. David Alan Gilbert from comment #6) 
> That's an invalid configuration for postcopy; using
> '/var/lib/libvirt/qemu/ram/17-rhel' only works if it's actually
> a shared memory filesystem, not if it's a normal on-disk file.
> 
> So please retest using a directory on /dev/shm or a separate shmfs mount.

VM migration with postcopy and shared memory on /dev/shm works well.

Steps:
1. On source and target host, update the memory backing in qemu.conf:
# grep -v ^# /etc/libvirt/qemu.conf | grep -v '^$'
memory_backing_dir = "/dev/shm"

Restart libvirt related services, and prepare the migration env.

2. Start a vm:
# virsh dumpxml rhel
......
  <memoryBacking>
    <source type='file'/>
    <access mode='shared'/>
  </memoryBacking>
......
# virsh start rhel 
Domain 'rhel' started

Qemu command line:
-machine pc-q35-rhel9.0.0,usb=off,dump-guest-core=off,memory-backend=pc.ram 
-accel kvm 
-cpu Haswell-noTSX-IBRS 
-m 1024 
-object {"qom-type":"memory-backend-file","id":"pc.ram","mem-path":"/dev/shm/libvirt/qemu/4-rhel/pc.ram","share":true,"x-use-canonical-path-for-ramblock-id":false,"size":1073741824} 
-overcommit mem-lock=off

3. Migration succeeds:
# virsh migrate rhel  qemu+ssh://dell-per730-36.lab.eng.pek2.redhat.com/system --live --verbose --postcopy --timeout 5 --timeout-postcopy   --bandwidth 5  --p2p   --persistent --undefinesource 
Migration: [100 %]
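
To see which directory libvirt will use for file-backed guest RAM, the qemu.conf setting from step 1 can be read back with a small helper. This is a minimal sketch: get_memory_backing_dir is a hypothetical name, and the fallback value is the /var/lib/libvirt/qemu/ram location seen in the earlier failing runs, assumed here to be the default. If the key appears more than once, the sketch simply takes the last uncommented assignment.

```shell
# Print the memory_backing_dir configured in a libvirt qemu.conf file.
# $1: path to the config file (defaults to /etc/libvirt/qemu.conf).
get_memory_backing_dir() {
    conf=${1:-/etc/libvirt/qemu.conf}
    # Match only uncommented lines of the form: memory_backing_dir = "..."
    value=$(sed -n 's/^[[:space:]]*memory_backing_dir[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' \
                "$conf" 2>/dev/null | tail -n 1)
    # Assumed default when the key is absent or commented out.
    echo "${value:-/var/lib/libvirt/qemu/ram}"
}

# Example (not run here):
#   stat -f -c %T "$(get_memory_backing_dir)"
```

With the configuration from step 1, the helper prints /dev/shm, and the stat call in the example is then expected to report tmpfs.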

Comment 9 Li Xiaohui 2022-03-03 08:59:53 UTC
Per Comment 6, Comment 7, and Comment 8, shall we close this bug as NOTABUG?
Or should we improve the error messages and block postcopy migration before it starts when there is no shared memory filesystem?

Comment 10 yalzhang@redhat.com 2022-03-07 01:25:31 UTC
And I have another question:
As it is confirmed that **"mem-path":"/var/lib/libvirt/qemu/ram/17-rhel/ram-node0","share":true"** is an invalid setting, yet the VM can start with it successfully, what could happen to VMs with such a setting? The memory cannot be shared, right? If that's true, maybe we should forbid starting the VM; otherwise it may be misused, for example with virtiofs or a vhost-user interface, which need shared memory to work properly.

Comment 11 Peter Xu 2022-03-07 03:42:49 UTC
IIUC sharing is not the problem, however..  Do/Should we support running VM with memory allocated on a generic file system at all?  Do we have any use case of that?

Comment 12 David Hildenbrand 2022-03-07 08:32:36 UTC
(In reply to Peter Xu from comment #11)
> IIUC sharing is not the problem, however..  Do/Should we support running VM
> with memory allocated on a generic file system at all?  Do we have any use
> case of that?

IIRC yes we support that. The use case is essentially having large VMs and using the pagecache as a replacement for swapping. I remember someone (Dave G. ?) mentioning that we do have customer installations making use of that. Of course, they are not using postcopy, because it cannot possibly work right now.

Comment 13 David Hildenbrand 2022-03-07 08:36:34 UTC
(In reply to yalzhang from comment #10)
> And I have another question:
> As it is confirmed that
> **"mem-path":"/var/lib/libvirt/qemu/ram/17-rhel/ram-node0","share":true"**
> is an invalid setting, yet the VM can start with it successfully, what could
> happen to VMs with such a setting? The memory cannot be shared, right? If
> that's true, maybe we should forbid starting the VM; otherwise it may be
> misused, for example with virtiofs or a vhost-user interface, which need
> shared memory to work properly.

IIRC vhost-user should work just fine with shared file mappings. After all, the only requirement they have is communicating with the VM via the guest RAM being mapped into their address space via mmap(MAP_SHARED). And there are valid use cases for that.

The only limitation is that userfaultfd only supports shmem, not *any* shared memory.

Comment 14 David Hildenbrand 2022-03-07 08:39:08 UTC
(In reply to Li Xiaohui from comment #9)
> Per Comment 6, Comment 7, and Comment 8, shall we close this bug as
> NOTABUG?
> Or should we improve the error messages and block postcopy migration before
> it starts when there is no shared memory filesystem?

I think it is NOTABUG, and nothing was actually harmed (the VM is paused just as if migration had failed, IIRC).

However, an error message that at least says something like "userfaultfd could not be initialized, postcopy is not supported." would be preferable.

Comment 15 Dr. David Alan Gilbert 2022-03-15 14:06:57 UTC
(In reply to David Hildenbrand from comment #13)
> IIRC vhost-user should work just fine with shared file mappings. After all,
> the only requirement they have is communicating with the VM via the guest
> RAM being mapped into their address space via mmap(MAP_SHARED). And
> there are valid use cases for that.

What's the valid use case?


Comment 16 David Hildenbrand 2022-03-15 14:41:40 UTC
(In reply to Dr. David Alan Gilbert from comment #15)
> What's the valid use case?

I was told that we have some customers that run large VMs on almost swap-less systems. Instead, you use the file as a replacement for the swap, but more fine-grained, targeting only VM RAM. (not sure if there are also use cases for offline-migration of VMs via network storage)

Comment 17 Juan Quintela 2022-07-14 12:06:22 UTC
What happens when the memory is swapped out for migration?

I.e. we have a page that is swapped out to the file, and then we migrate? To finish migration, we need to read the file back into memory to send it, and we may find that we don't have space in the page cache. I am not sure that we really want to support migration with non-shared filesystems.

Comment 18 Leonardo Bras 2023-02-09 22:46:33 UTC
Setting priority & severity to 'medium' based on:
- According to the comments, this seems like a NOTABUG, but may require a better error message
- There is an ongoing conversation on the topic, but no updates for a while (6mo+)

Dave, Juan, Peter, David Hildenbrand. Is that ok?
Please change it if you see fit.

Comment 22 David Hildenbrand 2023-04-14 17:44:57 UTC
(In reply to Juan Quintela from comment #17)
> What happens when the memory is swapped out for migration?

[managed to ignore this BZ, somehow]

Just the same as what happens with ordinary anonymous memory with swap.

>
> I.e. we have a page that is swapped out to the file, and then we migrate? To
> finish migration, we need to read the file back into memory to send it,
> and we may find that we don't have space in the page cache. I am not sure
> that we really want to support migration with non-shared filesystems.

Again, just the same as what happens when we have !shared guest memory swapped out.

If the kernel is running out of memory, it has to reclaim memory (e.g., swapout other anon pages, evict other pagecache pages) such that user space major faults can be served.

Comment 26 Yanan Fu 2023-05-24 09:20:21 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 29 Li Xiaohui 2023-05-30 09:33:34 UTC
Tested two scenarios from the qemu side (qemu-kvm-8.0.0-4.el9.x86_64):
1. Postcopy migration without a shared memory filesystem
(1) Enable postcopy on the src and dst hosts: postcopy can be enabled on the src host but fails to enable on the dst host, with this error:
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}
{"error": {"class": "GenericError", "desc": "Postcopy is not supported: Host backend files need to be TMPFS or HUGETLBFS only"}}
(2) Start migration with postcopy enabled on the src host but disabled on the dst host: migration fails on the src host, and the qemu process quits automatically:
a. src host:
(qemu) migrate -d tcp:$dst_host_ip:1234
(qemu) 2023-05-30T06:49:37.280422Z qemu-kvm: Unable to write to socket: Bad file descriptor
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: failed (Unable to write to socket: Bad file descriptor)
total time: 0 ms

b. dst host:
(qemu) 2023-05-30T06:49:37.225877Z qemu-kvm: RAM postcopy is disabled but have 16 byte advise
2023-05-30T06:49:37.226528Z qemu-kvm: load of migration failed: Invalid argument

2. Postcopy migration with a shared memory filesystem: postcopy succeeds.


Also tried the above two scenarios through libvirt (libvirt-9.3.0-2.el9.x86_64):
1. Libvirt reports an error when attempting postcopy without a shared memory filesystem:
[root@dell-per7525-25 home]# virsh migrate rhel930 --live --verbose qemu+ssh://10.73.2.82/system --p2p   --persistent --undefinesource --postcopy --timeout 5 --timeout-postcopy
error: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported: Host backend files need to be TMPFS or HUGETLBFS only

2. Libvirt completes postcopy migration with a shared memory filesystem:
[root@dell-per7525-25 home]# virsh migrate rhel930 --live --verbose qemu+ssh://10.73.2.82/system --p2p   --persistent --undefinesource --postcopy --timeout 5 --timeout-postcopy
Migration: [100 %]

Note: 
(1) On source and target host, update the memory backing in qemu.conf:
# grep -v ^# /etc/libvirt/qemu.conf | grep -v '^$'
memory_backing_dir = "/dev/shm"

Restart libvirt related services, and prepare the migration env.

(2) Start a VM with memoryBacking configured (use virsh edit on the domain):
# virsh dumpxml rhel9.3.0
......
  <memoryBacking>
    <source type='file'/>
    <access mode='shared'/>
  </memoryBacking>
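
The capability check from scenario 1 above can be scripted around whatever QMP client is in use. A small sketch under stated assumptions: the helper names postcopy_cap_cmd and reply_rejects_postcopy are hypothetical, the JSON is copied from the command shown in this comment, and the reply matching relies on the error text printed by qemu-kvm-8.0.0-4.el9, which may differ in other versions.

```shell
# Build the QMP command that toggles the postcopy-ram capability.
# $1 must be the literal JSON boolean: true or false.
postcopy_cap_cmd() {
    printf '{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":%s}]}}\n' "$1"
}

# Succeed if a QMP reply string is the "Postcopy is not supported" error.
reply_rejects_postcopy() {
    case "$1" in
        *'"error"'*'Postcopy is not supported'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here), with $reply holding the text returned by the
# QMP client after sending "$(postcopy_cap_cmd true)" to the dst qemu:
#   reply_rejects_postcopy "$reply" && echo "dst cannot do postcopy"
```

Checking the destination's reply first and only then enabling the capability on the source keeps both sides on the same set of capabilities.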

Comment 30 Li Xiaohui 2023-05-30 09:35:05 UTC
(In reply to Li Xiaohui from comment #29)
> (2) try to start migration under postcopy enabled on src host but disable on
> dst host, would get migration failure on src host, and the qemu process
> would quit by automatically:

I mean the qemu process on the dst host would quit automatically.


Comment 31 Li Xiaohui 2023-05-30 10:00:14 UTC
Hi Peter,
I found a new issue when testing scenario 1 from Comment 29. Shall we file a bug for it?
> 1. postcopy migration without a shared memory filesystem
> (1) enable postcopy on src and dst host: the postcopy can be enabled on src host, but fail to enable on dst host, get prompt:
> {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}
> {"error": {"class": "GenericError", "desc": "Postcopy is not supported: Host backend files need to be TMPFS or HUGETLBFS only"}}
> (2) try to start migration under postcopy enabled on src host but disable on dst host, would get migration failure on src host, and the qemu process would quit by automatically:

After step (2), I continued with live migration without postcopy enabled:
(3) Restart a VM with '-incoming defer' on the dst host.
(4) Disable the postcopy capability on the src host.
(5) Start live migration.

Actual result:
After step (5), migration completes, but qemu on the dst host quits automatically with the error below:
(qemu) 2023-05-30T09:49:59.252257Z qemu-kvm: error while loading state section id 1(ram)
2023-05-30T09:50:15.935819Z qemu-kvm: load of migration failed: Input/output error


BTW, libvirt works well when live migration is attempted after a postcopy failure without a shared memory filesystem:
[root@dell-per7525-25 home]# virsh migrate rhel930 --live --verbose qemu+ssh://10.73.2.82/system --p2p   --persistent --undefinesource --postcopy --timeout 5 --timeout-postcopy
error: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported: Host backend files need to be TMPFS or HUGETLBFS only

[root@dell-per7525-25 home]# virsh migrate rhel930 --live --verbose qemu+ssh://10.73.2.82/system --p2p   --persistent --undefinesource
Migration: [100 %]
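
The two virsh invocations above amount to a try-postcopy-then-fall-back sequence. A rough sketch of scripting it, assuming the behaviour shown above where the failure is reported when the capability is set and the guest keeps running on the source. migrate_with_fallback is a hypothetical helper, and --verbose, --persistent and --undefinesource are left out on purpose.

```shell
# Try postcopy first; if the destination rejects it, retry without postcopy.
# $1: domain name, $2: destination URI. Prints which mode succeeded.
migrate_with_fallback() {
    dom=$1
    uri=$2
    # First attempt: postcopy, with the timeout flags used in this comment.
    if out=$(virsh migrate "$dom" "$uri" --live --p2p --postcopy \
                 --timeout 5 --timeout-postcopy 2>&1); then
        echo postcopy
        return 0
    fi
    case "$out" in
        *"Postcopy is not supported"*)
            # Destination refused postcopy-ram: retry as plain live migration.
            virsh migrate "$dom" "$uri" --live --p2p >/dev/null 2>&1 || return 1
            echo precopy
            ;;
        *)
            # Any other failure is passed through untouched.
            printf '%s\n' "$out" >&2
            return 1
            ;;
    esac
}

# Example (not run here):
#   migrate_with_fallback rhel930 qemu+ssh://10.73.2.82/system
```

Matching on the error text is fragile, so this is only suitable as a test-harness convenience; checking the backing filesystem up front is the more robust option.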

Comment 32 Peter Xu 2023-05-30 15:23:38 UTC
(In reply to Li Xiaohui from comment #29)
> (2) try to start migration under postcopy enabled on src host but disable on
> dst host, would get migration failure on src host, and the qemu process
> would quit by automatically:

Xiaohui,

Let's avoid testing use cases where we apply different migration capabilities on src & dst.

The thing is we always assume both qemus will apply the same capabilities or it's already wrong usage (maybe in the future we can teach migration framework to do handshakes, so they can communicate on the features, but not for now).  Just like when we specify different devices/cmdlines for src/dst and migration can easily go wrong.  We just so far rely on this to make everything work.

Thanks,
Peter

Comment 33 Li Xiaohui 2023-06-01 06:54:55 UTC
(In reply to Peter Xu from comment #32)

I see, thanks for the explanation.

I also tried the useful part of scenario 1 again -> postcopy migration without a shared memory filesystem:
(1) Try to enable postcopy on the src and dst hosts; after finding that postcopy can't be enabled on the dst host, disable postcopy on the src host;
(2) Migrate the VM from src to dst: migration succeeds, and the VM works well on the dst host.



Per the above test results, marking this bug verified.

Comment 35 errata-xmlrpc 2023-11-07 08:26:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6368