Bug 1428436 - migration/postcopy+shared memory
Summary: migration/postcopy+shared memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: Dr. David Alan Gilbert
QA Contact: Yumei Huang
URL:
Whiteboard:
Depends On:
Blocks: 1565952
Reported: 2017-03-02 15:29 UTC by Dr. David Alan Gilbert
Modified: 2018-11-01 11:01 UTC (History)

Fixed In Version: qemu-kvm-rhev-2.12.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1565952
Environment:
Last Closed: 2018-11-01 11:01:10 UTC
Target Upstream Version:


Attachments
vhost-user-traces-vring-get-base (31.24 KB, text/plain)
2017-04-07 07:55 UTC, Alexey Perevalov
vhost-user-traces-set-mem-table (31.79 KB, text/plain)
2017-04-07 07:57 UTC, Alexey Perevalov

Description Dr. David Alan Gilbert 2017-03-02 15:29:54 UTC
Description of problem:
We know we need to do some work to make sure postcopy works with shared memory, e.g. the memory that vhost-user uses.

We still need to think about the exact details, but they include:
  a) Making sure that the qemu side uses the appropriate madvise/fallocate to clear the memory.
  b) The other process that's sharing the RAM also needs to userfault and somehow tell qemu to ask for the pages.
  c) qemu needs to WAKE the other process when pages arrive.


That's what we've thought of so far.  More thought needed.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Alexey Perevalov 2017-03-30 07:14:14 UTC
I reproduced it with the "postcopy: Check for shared memory" commit reverted,
and found that ioctl with UFFDIO_COPY returns an EEXIST error. This is due to
the mmap of the same hugetlbfs file in ovs-vswitchd. That remap in ovs-vswitchd
is required for the vhost-user port.

Right now QEMU does not accept an EEXIST error while handling the UFFDIO_COPY
ioctl, which looks correct.

I managed to complete a postcopy migration with a workaround in which I
accepted the EEXIST error and reverted the check for shared memory in
QEMU; the vhost-user based network continued to work.
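
For illustration only, the workaround described above might be sketched as follows. The names `place_page`, `classify_place` and `enum place_result` are hypothetical and this is not QEMU's code; only `struct uffdio_copy` and `UFFDIO_COPY` come from the kernel's userfaultfd header. It reports EEXIST separately so the caller can decide whether to tolerate it.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical result codes for this sketch. */
enum place_result {
    PLACE_OK = 0,              /* page was copied in */
    PLACE_ALREADY_PRESENT = 1, /* UFFDIO_COPY reported EEXIST */
    PLACE_FAILED = -1          /* any other error */
};

/* Map an ioctl return value and errno to a result code. */
static enum place_result classify_place(int rc, int err)
{
    if (rc == 0) {
        return PLACE_OK;
    }
    return err == EEXIST ? PLACE_ALREADY_PRESENT : PLACE_FAILED;
}

/* Copy len bytes from src into the userfault-registered address dst. */
static enum place_result place_page(int uffd, void *dst, const void *src,
                                    size_t len)
{
    struct uffdio_copy copy = {
        .dst = (uint64_t)(uintptr_t)dst,
        .src = (uint64_t)(uintptr_t)src,
        .len = len,
        .mode = 0,
    };
    int rc = ioctl(uffd, UFFDIO_COPY, &copy);

    return classify_place(rc, rc ? errno : 0);
}
```

Treating PLACE_ALREADY_PRESENT as success only mirrors the workaround; it hides the fact that the page content came from somewhere other than the migration stream.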

The EEXIST error could be avoided on the QEMU side if the client, I mean the
process that handles VHOST_USER_SET_MEM_TABLE, also called fallocate with
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
and, of course, avoided using the VRINGS/PORT associated with QEMU.

There could be a pitfall here: after a hole is made in memory
(fallocate/madvise), a thread could accidentally access that memory and be put
onto a wait queue (interruptible sleep).
The wait queue is an attribute of the userfault context, and the context is per
userfault file descriptor and its VMAs, so once QEMU populates the pages with
the UFFDIO_COPY ioctl, another process could wait forever.
It looks like the VRING is disabled during migration for the corresponding
vhost-user port (vhost_virtqueue_stop is getting called).

I'm going to extend VHOST_USER_SET_MEM_TABLE to tell the client (ovs-vswitchd)
that fallocate is needed, make it do that fallocate, and check whether the
pmd thread gets put to sleep by a page fault.
If it does not, and the VRING is excluded correctly, I'll send patches to the
qemu-devel mailing list.

Comment 3 Alexey Perevalov 2017-04-07 07:55:54 UTC
Created attachment 1269605
vhost-user-traces-vring-get-base

Comment 4 Alexey Perevalov 2017-04-07 07:57:32 UTC
Created attachment 1269607
vhost-user-traces-set-mem-table

Comment 5 Alexey Perevalov 2017-04-07 07:58:39 UTC
I would like to share my results:

The problem with vhost-user is the following. The virtual switch, in our case
ovs-vswitchd, remaps the memory region used by QEMU; this happens in the VHOST_USER_SET_MEM_TABLE handler.
After the memory has been mapped in QEMU and fallocate with FALLOC_FL_PUNCH_HOLE has been done,
ioctl with UFFDIO_COPY in QEMU can populate a page, but when another process remaps the same memory region the ioctl returns an EEXIST error. This does not happen when the other process calls fallocate for the remapped region.

1. fallocate at VHOST_USER_SET_MEM_TABLE (see vhost-user-traces-set-mem-table).
Because page fault processing is asynchronous, there is a window in which UFFDIO_COPY fails. The destination QEMU also hangs when it tries to communicate with ovs-vswitchd in this case: any access to the "fallocated" region puts the thread into interruptible sleep, but the UFFDIO_COPY ioctl only wakes QEMU threads, because the wait queue is associated with the userfault_ctx, which is associated with the userfault fd, so the ovs-vswitchd thread is not on that wait queue.

2. fallocate at VHOST_USER_VRING_GET_BASE, i.e. when the vring is stopped in ovs-vswitchd. This is a rather bad idea because it is too late: QEMU has already issued a lot of UFFDIO_COPY ioctls (see vhost-user-traces-vring-get-base).

VHOST_USER_VRING_GET_BASE is another story. I didn't find an explicit (or polymorphic) call of vhost_user_get_vring_base, and found only one path to it (see vhost-user-get-vring-base-callstack): qemu_devices_reset, which was triggered because KVM raised a triple fault. I don't know yet where QEMU places the IDT, but I suspect it is missing during page fault processing. I have no other ideas about it. If anybody could point me to how to track down IDT/SIDT/LIDT, I would appreciate it very much.

I think it is better to avoid holes for the pages holding the IDT (and maybe for the pages holding the vAPIC; there could be a problem with 1G hugepages here, as I saw the vAPIC was in a RAM block).
That would help to avoid the triple fault and reset, but it would not solve the vhost-user problem; it would only get rid of the reset.

QEMU still needs to stop the VRING when the corresponding page is not available.

The correct scheme is the following:

QEMU                         |  OVS-VSWITCHD
                             |
mmap of mem backend          |
                             |
vhost_net_start            ->|  mmap
send stop vring              |
(vhost_user_get_vring_base)  |
                             |
after vring was stopped:     |
fallocate                    |
copy pages                   |
after vring pages copied:    |
send start vring             |



Right now we have:

QEMU                         |  OVS-VSWITCHD
                             |
mmap                         |
fallocate                    |
                             |  mmap
copy (EEXIST)                |
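
The two orderings above can be compared with a small model. This is only a sketch of the proposed scheme as a sequence check; the `enum step` names and `follows_proposed_order` are hypothetical and do not correspond to QEMU or OVS code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Steps from the two diagrams above (hypothetical names). */
enum step {
    QEMU_MMAP,   /* QEMU maps the memory backend */
    PEER_MMAP,   /* ovs-vswitchd maps the same region */
    STOP_VRING,  /* QEMU asks the peer to stop the vring */
    DISCARD,     /* QEMU punches the hole (fallocate) */
    COPY_PAGES,  /* QEMU places pages with UFFDIO_COPY */
    START_VRING  /* QEMU asks the peer to start the vring again */
};

/* Return true if seq respects the ordering of the proposed scheme. */
static bool follows_proposed_order(const enum step *seq, size_t n)
{
    bool peer_mapped = false, vring_stopped = false;
    bool discarded = false, copied = false;

    for (size_t i = 0; i < n; i++) {
        switch (seq[i]) {
        case QEMU_MMAP:
            break;
        case PEER_MMAP:
            if (discarded) {
                return false; /* peer maps after the discard: the EEXIST case */
            }
            peer_mapped = true;
            break;
        case STOP_VRING:
            if (!peer_mapped) {
                return false;
            }
            vring_stopped = true;
            break;
        case DISCARD:
            if (!vring_stopped) {
                return false; /* discard only once the vring is stopped */
            }
            discarded = true;
            break;
        case COPY_PAGES:
            if (!discarded) {
                return false;
            }
            copied = true;
            break;
        case START_VRING:
            if (!copied) {
                return false; /* restart only after the pages are copied */
            }
            vring_stopped = false;
            break;
        }
    }
    return true;
}
```

Fed with the steps of the first diagram the check passes; with the "right now" sequence it fails at the discard, because the vring has not been stopped and the peer has not mapped the region yet.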

Comment 6 Dr. David Alan Gilbert 2017-04-07 12:07:17 UTC
(In reply to Alexey Perevalov from comment #5)
> I would like to share my results:
> 
> The problem with vhost-user is following. Virtual switch, in our case it's
> ovs-vswitchd remap memory region used by QEMU, it happen in
> VHOST_USER_SET_MEM_TABLE handler.
> After memory was mapped in QEMU and fallocate with FALLOC_FL_PUNCH_HOLE was
> done,
> ioctl with UFFDIO_COPY in QEMU could populate page, but in case when another
> process remap the same memory region ioctl returns EEXIST error. It's not
> happen when another process call fallocate for remmaped region.
> 
> 1. fallocate at VHOST_USER_SET_MEM_TABLE see vhost-user-traces-set-mem-table,
> due to asynchronous nature of the page fault processing there is a gap where
> UFFDIO_COPY is failed. And dst QEMU is hangs up when trying to communicate
> with ovs-vswitchd in this case (any memory access to region which was
> "fallocated" puth thread into interruptible sleep, but ioctl UFFDIO_COPY
> runs only QEMU thread  due to wait queue is associated with userfault_ctx
> which is associated with userfault fd, so thread of ovs-vswitchd is not in
> that wait queue.
> 
> 2. fallocate at VHOST_USER_VRING_GET_BASE - stop vring at ovs-vswitchd. It's
> quite bad idea, due to it's too late, QEMU already makes a lot of ioctl
> UFFDIO_COPY (see vhost-user-traces-vring-get-base)
> 
> VHOST_USER_VRING_GET_BASE it's another story, I didn't find explicit (and
> polymorphic) call of vhost_user_get_vring_base, so I found only one way (see
> vhost-user-get-vring-base-callstack). It's qemu_devices_reset, and it was
> initiated due to KVM's raise triple fault, I don't know yet where QEMU
> places IDT, but I suspect it's missing during page fault processing. I don't
> have another ideas of it. So if anybody could point me how to track down
> IDT/SIDT/LIDT I would appreciate so much.
> 
> I think, it's better to avoid holes for pages with IDT (maybe for pages with
> vAPIC, here could be a problem with 1G hugepages, I saw vAPIC was in ram
> block)
> It will help to avoid triple fault and reset. But it will not solve
> vhost-user problem. Just will get out reset.
> 
> QEMU still need to stop VRING when appropriate page is not available.
> 
> The correct scheme is following:
> 
> QEMU                |        OVS-VSWITCHD
>                     |
> mmap of mem backend |
>                     |
> vhost_net_start   ->|       mmap
> send stop vring
> (vhost_user_get_vring_base)
>                     |
> after vring was stopped
> fallocate           |
> copy pages          |
> after vring pages copied
> send start vring    |
> 
> 
> 
> Right now we have
> QEMU                |      OVS-VSWITCHD
>                     |
> mmap                |
> fallocate           |
>                     |      mmap
> copy (EEXIST)       |

I haven't looked at the insides of vhost yet - my plan is to let our vhost people look at that side.  The current idea is to also do the UFFDIO_REGISTER in the vhost process, have that process send page requests back to qemu, and have qemu do 'wake' ioctls to wake the vhost process up.

I'm hoping that if the userfault is registered in the vhost process then there's no need to do any magic with the rings - because they'll be protected by userfault on both processes.

Comment 7 Alexey Perevalov 2017-04-07 12:31:18 UTC
Yes, it's possible to register userfault in another process. In that case it's better to pass the userfault fd from QEMU, so that the side process can be woken up by QEMU's UFFDIO_COPY ioctl. So no objections here; it could be an alternative to stopping the vring.

But the main problem is the mmap in the ovs-vswitchd process of the same region after the hole has been punched in QEMU for the mem backend. UFFDIO_COPY will fail after that. So device start and RAM discard need to be ordered and synchronized somehow.
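
Passing a file descriptor such as a userfault fd between two processes is normally done over a Unix domain socket with SCM_RIGHTS ancillary data. The following is a generic sketch of that mechanism; `send_fd` and `recv_fd` are hypothetical helpers, not part of the vhost-user protocol code.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send fd over the connected Unix socket sock. Returns 0 on success. */
static int send_fd(int sock, int fd)
{
    char byte = 0; /* at least one byte of real data must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock. Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;
    int fd;

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor refers to the same open file description as the sender's, so both sides operate on the same userfault context.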

Comment 8 Dr. David Alan Gilbert 2017-04-28 19:09:36 UTC
I'm starting to play with vhost-user-bridge as a simple way of understanding how this all goes together.

Comment 10 Dr. David Alan Gilbert 2018-03-20 17:22:20 UTC
postcopy+vhost qemu code merged upstream for the 2.12 freeze; the upstream merge is ed627b2ad37469eeba9e

Comment 19 errata-xmlrpc 2018-11-01 11:01:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443
