1428436 – migration/postcopy+shared memory

Summary: migration/postcopy+shared memory

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm-rhev
Sub Component:
Version:	7.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Dr. David Alan Gilbert
QA Contact:	Yumei Huang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1565952

Reported:	2017-03-02 15:29 UTC by Dr. David Alan Gilbert
Modified:	2018-11-01 11:01 UTC
CC List:	16 users
Fixed In Version:	qemu-kvm-rhev-2.12.0-1.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1565952
Environment:
Last Closed:	2018-11-01 11:01:10 UTC
Target Upstream Version:
Embargoed:

Attachments
vhost-user-traces-vring-get-base (31.24 KB, text/plain) 2017-04-07 07:55 UTC, Alexey Perevalov
vhost-user-traces-set-mem-table (31.79 KB, text/plain) 2017-04-07 07:57 UTC, Alexey Perevalov

Description Dr. David Alan Gilbert 2017-03-02 15:29:54 UTC

Description of problem:
We know we need to do some work to make sure postcopy works with shared memory, e.g. for vhost-user.

The exact details still need thought, but they include:
  a) Making sure that the qemu side uses the appropriate madvise/fallocate to clear the memory.
  b) The other process that's sharing the RAM also needs to userfault and somehow tell qemu to ask for the pages.
  c) qemu needs to WAKE the other process when the pages arrive.

That's what we've thought of so far.  More thought needed.
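
For point (a), a minimal sketch of what clearing a range of shared, file-backed RAM could look like. The helper names discard_shared_range and demo_discard are made up for illustration and are not QEMU functions, and a memfd stands in for the real hugetlbfs backing file. It only shows that a hole punched in the file is visible through an existing shared mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>        /* fallocate() and FALLOC_FL_* need _GNU_SOURCE */
#include <linux/falloc.h>
#include <string.h>
#include <sys/mman.h>     /* memfd_create(), mmap() */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU's actual function): free the backing pages
 * of [offset, offset + len) in a shared memory file without changing the
 * file size. Returns 0 on success, -1 with errno set on failure.
 */
static int discard_shared_range(int fd, off_t offset, off_t len, long page_size)
{
    assert(page_size > 0);
    if (offset % page_size || len % page_size) {
        errno = EINVAL;   /* only whole pages are actually deallocated */
        return -1;
    }
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/*
 * Self-check on a memfd standing in for the shared RAM file. Returns 0 if
 * the discarded page reads back as zeroes through an existing shared
 * mapping while the neighbouring page keeps its contents.
 */
static int demo_discard(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    int fd = memfd_create("guest-ram", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 2 * ps) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *ram = mmap(NULL, 2 * ps, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (ram == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memset(ram, 0xAA, 2 * ps);

    int rc = discard_shared_range(fd, 0, ps, ps);
    int ok = rc == 0 && ram[0] == 0 && ram[ps] == 0xAA;

    munmap(ram, 2 * ps);
    close(fd);
    return ok ? 0 : -1;
}
```

Because the hole is punched in the file rather than in one process's mapping, every process that maps the same file sees the pages disappear, which is why points (b) and (c) matter.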

Comment 2 Alexey Perevalov 2017-03-30 07:14:14 UTC

I reproduced it with "postcopy: Check for shared memory" reverted,
and found that the UFFDIO_COPY ioctl returns EEXIST. This is due to the
mmap of the same hugetlbfs file in ovs-vswitchd. That remap in
ovs-vswitchd is required for the vhost-user port.

Right now QEMU does not accept an EEXIST error when handling the
UFFDIO_COPY ioctl, which looks correct.

I managed to complete a postcopy migration with a workaround in which I
accepted the EEXIST error and reverted the check for shared memory in
QEMU; the vhost-user based network continued to work.

The EEXIST error could be avoided on the QEMU side if the client, I mean
the process that handles VHOST_USER_SET_MEM_TABLE, also called fallocate
with the
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
arguments and, of course, avoided using the VRINGS/PORT associated with QEMU.

There could be a pitfall here: after a hole is made (fallocate/madvise)
in memory, a thread could accidentally access this memory and be put
into a wait queue (interruptible sleep).
The wait queue is an attribute of the context, and the ctx is per
userfault file descriptor & vmas, so once QEMU populates the pages with
the UFFDIO_COPY ioctl, the other process could wait forever.
It looks like the VRING is disabled during migration for the
corresponding vhost-user port (vhost_virtqueue_stop is called).

I'm going to extend VHOST_USER_SET_MEM_TABLE to pass the client
(ovs-vswitchd) information about whether fallocate is necessary, have it
make that fallocate call, and check whether the pmd thread gets put to
sleep due to a page fault.
If it does not, and the VRING is excluded correctly, I'll send patches to
the qemu-devel mailing list.

Comment 3 Alexey Perevalov 2017-04-07 07:55:54 UTC

Created attachment 1269605
vhost-user-traces-vring-get-base

Comment 4 Alexey Perevalov 2017-04-07 07:57:32 UTC

Created attachment 1269607
vhost-user-traces-set-mem-table

Comment 5 Alexey Perevalov 2017-04-07 07:58:39 UTC

I would like to share my results:

The problem with vhost-user is as follows. The virtual switch, in our case
ovs-vswitchd, remaps the memory region used by QEMU; this happens in the VHOST_USER_SET_MEM_TABLE handler.
After the memory has been mapped in QEMU and fallocate with FALLOC_FL_PUNCH_HOLE has been done,
the UFFDIO_COPY ioctl in QEMU can populate a page, but when another process remaps the same memory region the ioctl returns EEXIST. This does not happen when the other process calls fallocate for the remapped region.

1. fallocate at VHOST_USER_SET_MEM_TABLE (see vhost-user-traces-set-mem-table):
due to the asynchronous nature of page fault processing there is a gap in which UFFDIO_COPY fails. The destination QEMU also hangs when trying to communicate with ovs-vswitchd in this case. Any memory access to a region which was "fallocated" puts the thread into interruptible sleep, but the UFFDIO_COPY ioctl wakes only the QEMU thread, because the wait queue is associated with the userfault_ctx, which is associated with the userfault fd, so the ovs-vswitchd thread is not in that wait queue.

2. fallocate at VHOST_USER_VRING_GET_BASE, i.e. when the vring is stopped in ovs-vswitchd. This is quite a bad idea because it is too late: QEMU has already issued a lot of UFFDIO_COPY ioctls (see vhost-user-traces-vring-get-base).

VHOST_USER_VRING_GET_BASE is another story. I didn't find an explicit (and polymorphic) call of vhost_user_get_vring_base, so I found only one path to it (see vhost-user-get-vring-base-callstack). It is qemu_devices_reset, and it was initiated because KVM raised a triple fault. I don't know yet where QEMU places the IDT, but I suspect it is missing during page fault processing. I don't have any other ideas about it. So if anybody could point me to how to track down IDT/SIDT/LIDT, I would appreciate it very much.

I think it's better to avoid holes for pages with the IDT (and maybe for pages with the vAPIC; there could be a problem here with 1G hugepages, since I saw the vAPIC was in a ram block).
That would help to avoid the triple fault and reset, but it would not solve the vhost-user problem. It would just get rid of the reset.

QEMU still needs to stop the VRING when the relevant page is not available.

The correct scheme is as follows:

Comment 6 Dr. David Alan Gilbert 2017-04-07 12:07:17 UTC

(In reply to Alexey Perevalov from comment #5)
> The correct scheme is as follows:
>
> QEMU                          |  OVS-VSWITCHD
>                               |
> mmap of mem backend           |
>                               |
> vhost_net_start             ->|  mmap
> send stop vring               |
> (vhost_user_get_vring_base)   |
>                               |
> after vring was stopped:      |
> fallocate                     |
> copy pages                    |
> after vring pages copied:     |
> send start vring              |
>
>
> Right now we have:
>
> QEMU                          |  OVS-VSWITCHD
>                               |
> mmap                          |
> fallocate                     |
>                               |  mmap
> copy (EEXIST)                 |

I haven't looked at the insides of vhost yet - my plan is to let our vhost people look at that side.  The current idea is to also call the uffdio register in the vhost process and for that to send page requests back to qemu and for it to do 'wake' ioctls to cause the vhost process to wake up.

I'm hoping that if the userfault is registered in the vhost process then there's no need to do any magic with the rings - because they'll be protected by userfault on both processes.
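
A rough sketch of the pieces this idea needs, assuming the standard userfaultfd ioctls. All three helper names are invented for this example, and how the page request travels back to qemu, and on which descriptor the wake is issued, are left open here. The vhost process registers its mapping, turns a faulting address into a page-aligned offset it can send to qemu, and the range is woken once the page is in place.

```c
#include <assert.h>
#include <linux/userfaultfd.h>   /* UFFDIO_REGISTER, UFFDIO_WAKE */
#include <stdint.h>
#include <sys/ioctl.h>

/* Register [base, base + len) so accesses to missing pages raise userfaults. */
static int uffd_register_missing(int uffd, void *base, uint64_t len)
{
    struct uffdio_register reg = {
        .range = { .start = (uint64_t)(uintptr_t)base, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Wake threads blocked on faults inside [base, base + len) on this uffd. */
static int uffd_wake(int uffd, void *base, uint64_t len)
{
    struct uffdio_range range = {
        .start = (uint64_t)(uintptr_t)base,
        .len = len,
    };
    return ioctl(uffd, UFFDIO_WAKE, &range);
}

/*
 * Hypothetical translation step: turn a faulting address in the vhost
 * process's own mapping into a page-aligned offset within the shared
 * region, which is what a page request to qemu would have to carry
 * since the two processes map the region at different addresses.
 */
static uint64_t fault_to_region_offset(uint64_t fault_addr, uint64_t map_base,
                                       uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(fault_addr >= map_base);
    return (fault_addr - map_base) & ~(page_size - 1);
}
```

The offset, not the raw address, is the useful thing to exchange, because it is the only value that means the same in both processes.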

Comment 7 Alexey Perevalov 2017-04-07 12:31:18 UTC

Yes, it's possible to register userfault in another process. In that case it's better to pass the userfault fd from QEMU, so that QEMU's UFFDIO_COPY ioctl can wake up the other process. So no objections here; it could be an alternative to stopping the vring.
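
For reference, handing a descriptor to another process over a UNIX socket is normally done with SCM_RIGHTS ancillary data. A minimal sketch follows; send_fd, recv_fd and demo_fd_passing are names made up for this example, and the demo passes a pipe end rather than a real userfault fd so that it runs anywhere.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one descriptor over a connected UNIX socket. 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char byte = 0;   /* one byte of ordinary data carries the ancillary data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Pass the write end of a pipe across a socketpair and write through it. */
static int demo_fd_passing(void)
{
    int sv[2], pipefd[2];
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (pipe(pipefd) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    int got = send_fd(sv[0], pipefd[1]) == 0 ? recv_fd(sv[1]) : -1;
    int ok = got >= 0 && write(got, "x", 1) == 1 &&
             read(pipefd[0], &c, 1) == 1 && c == 'x';

    if (got >= 0)
        close(got);
    close(pipefd[0]);
    close(pipefd[1]);
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

The received descriptor refers to the same open file description as the sender's, so ioctls issued on either side act on the same userfault context.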

But the main problem is the mmap of the same region in the ovs-vswitchd process after the hole has been punched in QEMU for the mem backend. UFFDIO_COPY will fail after it. So device start and RAM discard need to be ordered and synchronized somehow.
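
The ordering constraint can be written down as a small checker. This is only a toy model of the two sequences from comment 5, not code from QEMU or OVS; the step names and the function first_ordering_violation are made up for illustration, and the rules encode nothing beyond what those two diagrams say.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum step {
    QEMU_MMAP,      /* QEMU maps the mem backend */
    PEER_MMAP,      /* ovs-vswitchd maps the same region */
    VRING_STOP,     /* QEMU asks the peer to stop the vring */
    QEMU_DISCARD,   /* QEMU punches the hole (fallocate) */
    QEMU_COPY,      /* QEMU places a page with UFFDIO_COPY */
    VRING_START,    /* QEMU asks the peer to start the vring again */
};

/*
 * Returns the index of the first step that breaks the ordering of the
 * "correct scheme", or -1 if the whole sequence is acceptable.
 */
static int first_ordering_violation(const enum step *steps, size_t n)
{
    bool peer_mapped = false, vring_stopped = false;
    bool discarded = false, copied = false;

    assert(steps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        switch (steps[i]) {
        case QEMU_MMAP:
            break;
        case PEER_MMAP:
            if (discarded)
                return (int)i;   /* remap after the hole punch: copy hits EEXIST */
            peer_mapped = true;
            break;
        case VRING_STOP:
            vring_stopped = true;
            break;
        case QEMU_DISCARD:
            if (peer_mapped && !vring_stopped)
                return (int)i;   /* peer could still be using the ring pages */
            discarded = true;
            break;
        case QEMU_COPY:
            if (!discarded)
                return (int)i;   /* nothing was discarded, nothing to place */
            copied = true;
            break;
        case VRING_START:
            if (discarded && !copied)
                return (int)i;   /* ring pages are not back yet */
            vring_stopped = false;
            break;
        }
    }
    return -1;
}
```

Fed the "right now" sequence (mmap, fallocate, peer mmap, copy), the checker flags the peer mmap at index 2, which is the step that later makes the copy fail.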

Comment 8 Dr. David Alan Gilbert 2017-04-28 19:09:36 UTC

I'm starting to play with vhost-user-bridge as a simple way of understanding how this all goes together.

Comment 10 Dr. David Alan Gilbert 2018-03-20 17:22:20 UTC

postcopy+vhost qemu code merged upstream for the 2.12 freeze; upstream merge is 
ed627b2ad37469eeba9e

Comment 19 errata-xmlrpc 2018-11-01 11:01:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443
