Planned for DPDK v18.11
Hello. You can detect whether a vma is locked by searching for "lo" in the VmFlags field in /proc/pid/smaps. I'm not aware of a way to detect the per-process mlockall flags after the fact. It would be trivial to add such a feature to the kernel, but there would have to be a good reason why this information can't be saved and transferred by the userland caller. With kernel >= RHEL7.6 and RHEL8, one possible solution would be to always add MCL_ONFAULT to the MCL flags of any mlockall called on the destination node; that should fix the postcopy issue.
Andrea: The reason it's tricky for the userland caller to store is that the userland is a big application (dpdk) in which the userfault part is done in a small plugin (the vhost-user plugin). That plugin can't really know the whims of the application or other plugins for mlocking.
(In reply to Dr. David Alan Gilbert from comment #8)
> Andrea:
> The reason it's tricky for the userland caller to store is that the
> userland is a big application (dpdk) in which the userfault part is done
> in a small plugin (the vhost-user plugin). That plugin can't really know
> the whims of the application or other plugins for mlocking.

I see, so the plugin would need to provide an API that is invoked by the dpdk app to communicate the current mlockall status.

So to make this work by extracting the kernel information, we would need either a new syscall mlockall_get or something like /proc/self/coredump_filter; coredump_filter dumps just a couple of those flags:

    len = snprintf(buffer, sizeof(buffer), "%08lx\n",
                   ((mm->flags & MMF_DUMP_FILTER_MASK) >>
                    MMF_DUMP_FILTER_SHIFT));

This isn't a 1:1 match with the mm->flags, to avoid making the mm->flags representation part of the kAPI.

We would need the same conversion from the in-kernel representation of VM_LOCKED and VM_LOCKONFAULT in mm->def_flags to a /proc/ representation in /proc/self/mlockall or similar.

Note that MCL_CURRENT can already be fully transferred to the destination following what I suggested in comment #6 with the smaps "lo" field. That's all MCL_CURRENT does: lock all vmas, and after it completes there's nothing sticky left in the mm of the process. MCL_CURRENT is already solvable without new kernel features.

So this is really only about MCL_FUTURE and MCL_ONFAULT, and requires dumping mm->def_flags VM_LOCKED and VM_LOCKONFAULT respectively in /proc.

The alternative to detect MCL_FUTURE is to use a heuristic, like VmLck > VmRSS in /proc/2565/status, but that won't be able to reliably tell the difference between MCL_CURRENT and MCL_FUTURE|MCL_CURRENT and _ONFAULT either. It can only infer all ram is locked and that the destination likely wants MCL_FUTURE|MCL_CURRENT, and that's all.
And for the duration of the postcopy live migration, again, the ideal is to use MCL_FUTURE|MCL_CURRENT|MCL_ONFAULT and to upgrade it to MCL_FUTURE|MCL_CURRENT immediately after postcopy completes. That works even if the source doesn't have MCL_ONFAULT set; if it does have it set, there's simply nothing to do after postcopy completes.
(In reply to Andrea Arcangeli from comment #9)
> (In reply to Dr. David Alan Gilbert from comment #8)
> > Andrea:
> > The reason it's tricky for the userland caller to store is that the
> > userland is a big application (dpdk) in which the userfault part is done
> > in a small plugin (the vhost-user plugin). That plugin can't really know
> > the whims of the application or other plugins for mlocking.
>
> I see, so the plugin would need to provide an API that is invoked by the
> dpdk app to communicate the current mlockall status.

Right, and make sure all existing callers use it. (Actually it's a bit more complex as I understand it; there are really 3 layers: the plugins, dpdk, and then applications built on top of dpdk.)

> So to make this work by extracting the kernel information, we would need
> either a new syscall mlockall_get or something like
> /proc/self/coredump_filter; coredump_filter dumps just a couple of those
> flags:
>
>     len = snprintf(buffer, sizeof(buffer), "%08lx\n",
>                    ((mm->flags & MMF_DUMP_FILTER_MASK) >>
>                     MMF_DUMP_FILTER_SHIFT));
>
> This isn't a 1:1 match with the mm->flags, to avoid making the mm->flags
> representation part of the kAPI.
>
> We would need the same conversion from the in-kernel representation of
> VM_LOCKED and VM_LOCKONFAULT in mm->def_flags to a /proc/ representation
> in /proc/self/mlockall or similar.

Could adding a new flag to get_mempolicy make sense? Although it seems to be more about numa.

Dave

> Note that MCL_CURRENT can already be fully transferred to the destination
> following what I suggested in comment #6 with the smaps "lo" field. That's
> all MCL_CURRENT does: lock all vmas, and after it completes there's nothing
> sticky left in the mm of the process. MCL_CURRENT is already solvable
> without new kernel features.
>
> So this is really only about MCL_FUTURE and MCL_ONFAULT, and requires
> dumping mm->def_flags VM_LOCKED and VM_LOCKONFAULT respectively in /proc.
>
> The alternative to detect MCL_FUTURE is to use a heuristic, like
> VmLck > VmRSS in /proc/2565/status, but that won't be able to reliably
> tell the difference between MCL_CURRENT and MCL_FUTURE|MCL_CURRENT and
> _ONFAULT either. It can only infer all ram is locked and that the
> destination likely wants MCL_FUTURE|MCL_CURRENT, and that's all. And for
> the duration of the postcopy live migration, again, the ideal is to use
> MCL_FUTURE|MCL_CURRENT|MCL_ONFAULT and to upgrade it immediately after
> postcopy completes to MCL_FUTURE|MCL_CURRENT. That works even if the
> source doesn't have MCL_ONFAULT set; if it does have it set, there's
> simply nothing to do after postcopy completes.
(In reply to Dr. David Alan Gilbert from comment #10)
> Could adding a new flag to get_mempolicy make sense? Although it seems to
> be more about numa.

Yes, it's more about NUMA and nodes, and it won't even be built into the kernel on CONFIG_NUMA=n builds, while mlockall is always available. In addition, get_mempolicy is "address" centric; we have no address, it's an address-agnostic flag in the "mm" that we have to dump from the kernel.

I'm not aware of any POSIX API that could extract this info from the kernel; furthermore, MCL_ONFAULT seems to be a recent 4.4 addition. This is why a new proc file in the /proc/pid directory looks like the simplest option here.
Upstream series to enable postcopy live migration in OVS-DPDK posted by Samsung: https://mail.openvswitch.org/pipermail/ovs-dev/2019-May/358980.html

Next step would be to try it with the QE live-migration setup.
Patch merged upstream:

commit 30e834dcb5164dbfe91c017852629c76dd1711d2
Author: Liliia Butorina <l.butorina.com>
Date:   Tue May 14 16:08:43 2019 +0300

    netdev-dpdk: Post-copy Live Migration support for vhost-user-client.

    Post-copy Live Migration for vHost supported since DPDK 18.11 and
    QEMU 2.12. New global config option 'vhost-postcopy-support' added
    to control this feature. Ex.:

        ovs-vsctl set Open_vSwitch . other_config:vhost-postcopy-support=true

    Changing this value requires restarting the daemon. It's safe to
    enable this knob even if QEMU doesn't support post-copy LM.

    Feature marked as experimental and disabled by default because it
    may cause PMD thread hang on destination host on page fault for the
    time of page downloading from the source.

    Feature is not compatible with 'mlockall' and 'dequeue zero-copy'.

    Support added only for vhost-user-client.

    Signed-off-by: Liliia Butorina <l.butorina.com>
    Co-authored-by: Ilya Maximets <i.maximets>
    Signed-off-by: Ilya Maximets <i.maximets>
    Reviewed-by: Maxime Coquelin <maxime.coquelin>

Next step is to prepare a brew build and run some internal testing before the backport is done.
Testing update:

vhost-user postcopy live migration has been verified with openvswitch2.15-2.15.0-26.el8fdp.x86_64.

RHEL8.5 versions:
4.18.0-322.el8.x86_64
qemu-kvm-6.0.0-23.module+el8.5.0+11740+35571f13.x86_64
libvirt-7.5.0-1.module+el8.5.0+11664+59f87560.x86_64
openvswitch2.15-2.15.0-26.el8fdp.x86_64

Maxime, can we close this bz as CurrentRelease? Thanks a lot.
(In reply to Pei Zhang from comment #23)
> Testing update:
>
> vhost-user postcopy live migration has been verified with
> openvswitch2.15-2.15.0-26.el8fdp.x86_64.
>
> RHEL8.5 versions:
> 4.18.0-322.el8.x86_64
> qemu-kvm-6.0.0-23.module+el8.5.0+11740+35571f13.x86_64
> libvirt-7.5.0-1.module+el8.5.0+11664+59f87560.x86_64
> openvswitch2.15-2.15.0-26.el8fdp.x86_64
>
> Maxime, can we close this bz as CurrentRelease? Thanks a lot.

Thanks for testing the feature, I agree it can be closed as CurrentRelease.