Planned for DPDK v18.11
Hello. You can detect whether a vma is locked by searching for "lo" in the VmFlags field in /proc/pid/smaps. I'm not aware of a way to detect the per-process mlockall flags after the fact. It would be trivial to add such a feature to the kernel, but there would have to be a good reason why this information can't be saved and transferred by the userland caller. With kernel >= RHEL7.6 and RHEL8, one possible solution would be to always add MCL_ONFAULT to the MCL flags of any mlockall called on the destination node; that should fix the postcopy issue.
Andrea: The reason it's tricky for the userland caller to store is that the userland is a big application (dpdk) in which the userfault part is done in a small plugin (the vhost-user plugin). That plugin can't really know the whims of the application or other plugins for mlocking.
(In reply to Dr. David Alan Gilbert from comment #8)
> Andrea:
> The reason it's tricky for the userland caller to store is that the
> userland is a big application (dpdk) in which the userfault part is done
> in a small plugin (the vhost-user plugin). That plugin can't really know
> the whims of the application or other plugins for mlocking.

I see, so the plugin would need to provide an API that is invoked by the dpdk app to communicate the current mlockall status.

So to make this work by extracting the kernel information, we would need either a new syscall mlockall_get or something like /proc/self/coredump_filter; coredump_filter dumps just a couple of those flags:

    len = snprintf(buffer, sizeof(buffer), "%08lx\n",
                   ((mm->flags & MMF_DUMP_FILTER_MASK) >>
                    MMF_DUMP_FILTER_SHIFT));

This isn't a 1:1 match with the mm->flags, to avoid making the mm->flags representation part of the kAPI.

We would need the same conversion from the in-kernel representation of VM_LOCKED and VM_LOCKONFAULT in mm->def_flags to a /proc/ representation in /proc/self/mlockall or similar.

Note that MCL_CURRENT can already be fully transferred to the destination following what I suggested in comment #6 with the smaps "lo" field. That's all MCL_CURRENT does: lock all vmas, and after it completes there's nothing sticky left in the mm of the process. MCL_CURRENT is already solvable without new kernel features.

So this is really only about MCL_FUTURE and MCL_ONFAULT, and requires dumping mm->def_flags VM_LOCKED and VM_LOCKONFAULT respectively in /proc.

The alternative to detect MCL_FUTURE is to use a heuristic, like VmLck > VmRSS in /proc/2565/status, but that won't be able to reliably tell the difference between MCL_CURRENT and MCL_FUTURE|MCL_CURRENT and _ONFAULT either. It can only infer all ram is locked and that the destination likely wants MCL_FUTURE|MCL_CURRENT, and that's all.
And for the duration of the postcopy live migration, again, the ideal is to use MCL_FUTURE|MCL_CURRENT|MCL_ONFAULT and to upgrade it to MCL_FUTURE|MCL_CURRENT immediately after postcopy completes. That works even if the source doesn't have MCL_ONFAULT set; if it does have it set, there's simply nothing to do after postcopy completes.
(In reply to Andrea Arcangeli from comment #9)
> (In reply to Dr. David Alan Gilbert from comment #8)
> > Andrea:
> > The reason it's tricky for the userland caller to store is that the
> > userland is a big application (dpdk) in which the userfault part is done
> > in a small plugin (the vhost-user plugin). That plugin can't really know
> > the whims of the application or other plugins for mlocking.
>
> I see, so the plugin would need to provide an API that is invoked by the
> dpdk app to communicate the current mlockall status.

Right, and make sure all existing callers use it. (Actually it's a bit more complex as I understand it; there are really 3 layers: the plugins, dpdk, and then applications built on top of dpdk.)

> So to make this work by extracting the kernel information, we would need
> either a new syscall mlockall_get or something like
> /proc/self/coredump_filter; coredump_filter dumps just a couple of those
> flags:
>
>     len = snprintf(buffer, sizeof(buffer), "%08lx\n",
>                    ((mm->flags & MMF_DUMP_FILTER_MASK) >>
>                     MMF_DUMP_FILTER_SHIFT));
>
> This isn't a 1:1 match with the mm->flags, to avoid making the mm->flags
> representation part of the kAPI.
>
> We would need the same conversion from the in-kernel representation of
> VM_LOCKED and VM_LOCKONFAULT in mm->def_flags to a /proc/ representation
> in /proc/self/mlockall or similar.

Could adding a new flag to get_mempolicy make sense? Although it seems to be more about numa.

Dave

> Note that MCL_CURRENT can already be fully transferred to the destination
> following what I suggested in comment #6 with the smaps "lo" field. That's
> all MCL_CURRENT does: lock all vmas, and after it completes there's nothing
> sticky left in the mm of the process. MCL_CURRENT is already solvable
> without new kernel features.
>
> So this is really only about MCL_FUTURE and MCL_ONFAULT, and requires
> dumping mm->def_flags VM_LOCKED and VM_LOCKONFAULT respectively in /proc.
>
> The alternative to detect MCL_FUTURE is to use a heuristic, like
> VmLck > VmRSS in /proc/2565/status, but that won't be able to reliably
> tell the difference between MCL_CURRENT and MCL_FUTURE|MCL_CURRENT and
> _ONFAULT either. It can only infer all ram is locked and that the
> destination likely wants MCL_FUTURE|MCL_CURRENT, and that's all. And for
> the duration of the postcopy live migration, again, the ideal is to use
> MCL_FUTURE|MCL_CURRENT|MCL_ONFAULT and to upgrade it immediately after
> postcopy completes to MCL_FUTURE|MCL_CURRENT. That works even if the
> source doesn't have MCL_ONFAULT set; if it does have it set, there's
> simply nothing to do after postcopy completes.
(In reply to Dr. David Alan Gilbert from comment #10)
> Could adding a new flag to get_mempolicy make sense? Although it seems to
> be more about numa.

Yes, it's more about NUMA and nodes, and it won't even be built into the kernel on CONFIG_NUMA=n builds, while mlockall is always available. In addition, get_mempolicy is "address" centric; we have no address, it's an address-agnostic flag in the "mm" that we have to dump from the kernel.

I'm not aware of any POSIX API that could extract this info from the kernel; furthermore, MCL_ONFAULT seems to be a recent 4.4 addition. This is why a new proc file in the /proc/pid directory looks like the simplest option here.
Upstream series to enable postcopy live migration in OVS-DPDK posted by Samsung: https://mail.openvswitch.org/pipermail/ovs-dev/2019-May/358980.html

Next step would be to try it with the QE live-migration setup.
Patch merged upstream:

commit 30e834dcb5164dbfe91c017852629c76dd1711d2
Author: Liliia Butorina <l.butorina.com>
Date:   Tue May 14 16:08:43 2019 +0300

    netdev-dpdk: Post-copy Live Migration support for vhost-user-client.

    Post-copy Live Migration for vHost supported since DPDK 18.11 and
    QEMU 2.12. New global config option 'vhost-postcopy-support' added
    to control this feature. Ex.:

        ovs-vsctl set Open_vSwitch . other_config:vhost-postcopy-support=true

    Changing this value requires restarting the daemon. It's safe to
    enable this knob even if QEMU doesn't support post-copy LM.

    Feature marked as experimental and disabled by default because it
    may cause PMD thread hang on destination host on page fault for the
    time of page downloading from the source.

    Feature is not compatible with 'mlockall' and 'dequeue zero-copy'.

    Support added only for vhost-user-client.

    Signed-off-by: Liliia Butorina <l.butorina.com>
    Co-authored-by: Ilya Maximets <i.maximets>
    Signed-off-by: Ilya Maximets <i.maximets>
    Reviewed-by: Maxime Coquelin <maxime.coquelin>

Next step is to prepare a brew build and run some internal testing before the backport is done.
Testing update:

vhost-user postcopy live migration has been verified with openvswitch2.15-2.15.0-26.el8fdp.x86_64.

RHEL8.5 versions:
4.18.0-322.el8.x86_64
qemu-kvm-6.0.0-23.module+el8.5.0+11740+35571f13.x86_64
libvirt-7.5.0-1.module+el8.5.0+11664+59f87560.x86_64
openvswitch2.15-2.15.0-26.el8fdp.x86_64

Maxime, can we close this bz as CurrentRelease? Thanks a lot.
(In reply to Pei Zhang from comment #23)
> Testing update:
>
> vhost-user postcopy live migration has been verified with
> openvswitch2.15-2.15.0-26.el8fdp.x86_64.
>
> RHEL8.5 versions:
> 4.18.0-322.el8.x86_64
> qemu-kvm-6.0.0-23.module+el8.5.0+11740+35571f13.x86_64
> libvirt-7.5.0-1.module+el8.5.0+11664+59f87560.x86_64
> openvswitch2.15-2.15.0-26.el8fdp.x86_64
>
> Maxime, can we close this bz as CurrentRelease? Thanks a lot.

Thanks for testing the feature, I agree it can be closed as CurrentRelease.