Bug 1428436
| Summary: | migration/postcopy+shared memory | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Dr. David Alan Gilbert <dgilbert> | ||||||
| Component: | qemu-kvm-rhev | Assignee: | Dr. David Alan Gilbert <dgilbert> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Yumei Huang <yuhuang> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 7.4 | CC: | a.perevalov, chayang, dgilbert, fjin, hhuang, jinzhao, juzhang, knoel, maxime.coquelin, michen, mrezanin, pagupta, peterx, qzhang, virt-maint, yuhuang | ||||||
| Target Milestone: | rc | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | qemu-kvm-rhev-2.12.0-1.el7 | Doc Type: | If docs needed, set a value | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | |||||||||
| : | 1565952 (view as bug list) | Environment: | |||||||
| Last Closed: | 2018-11-01 11:01:10 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1565952 | ||||||||
| Attachments: |
Description
Dr. David Alan Gilbert
2017-03-02 15:29:54 UTC
I reproduced it with "postcopy: Check for shared memory" reverted and found that the UFFDIO_COPY ioctl returns EEXIST. This is caused by ovs-vswitchd mmapping the same hugetlbfs file; that remap in ovs-vswitchd is required for the vhost-user port. Right now QEMU does not accept EEXIST when handling the UFFDIO_COPY ioctl, which looks correct. I managed to complete a postcopy migration with a workaround in which I accepted the EEXIST error and reverted the shared-memory check in QEMU; the vhost-user based network continued to work.
The EEXIST error could be avoided on the QEMU side if the client, that is, the process that handles VHOST_USER_SET_MEM_TABLE, also called fallocate with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE and, of course, avoided using the VRINGS/PORT associated with QEMU.
There could be a pitfall here: after a hole is made in memory (fallocate/madvise), a thread could accidentally access that memory and be put on a wait queue (interruptible sleep). The wait queue is a context attribute, and the ctx is per userfault file descriptor and VMAs, so once QEMU populates the pages with the UFFDIO_COPY ioctl, another process could wait forever.
It looks like the VRING is disabled during migration for the corresponding vhost-user port (vhost_virtqueue_stop is called). I'm going to extend VHOST_USER_SET_MEM_TABLE to tell the client (ovs-vswitchd) whether fallocate is necessary, make it do that fallocate, and check whether the pmd thread is put to sleep by a page fault. If it is not, and the VRING is excluded correctly, I'll send patches to the qemu-devel mailing list.
Created attachment 1269605 [details]
vhost-user-traces-vring-get-base
Created attachment 1269607 [details]
vhost-user-traces-set-mem-table
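The EEXIST behaviour and the workaround described above can be illustrated with a small toy model. This is only a sketch, not kernel or QEMU code: `SharedBackingFile`, `touch` and `place_page` are hypothetical names, and the model assumes that the second process's mapping ends up populating the page that QEMU had punched out.

```python
import errno


class SharedBackingFile:
    """Toy stand-in for a hugetlbfs file mapped by several processes.

    Assumption: a page is either present in the shared file or it is a hole.
    """

    def __init__(self, num_pages):
        self.pages = {n: b"\0" for n in range(num_pages)}

    def punch_hole(self, page):
        # Models fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE).
        self.pages.pop(page, None)

    def touch(self, page):
        # Models a second process (e.g. the vhost-user backend) populating
        # the page through its own mapping of the same file.
        self.pages.setdefault(page, b"\0")

    def uffdio_copy(self, page, data):
        # Models UFFDIO_COPY: placing a page that already exists fails.
        if page in self.pages:
            raise OSError(errno.EEXIST, "page already present")
        self.pages[page] = data


def place_page(backing, page, data, tolerate_eexist=False):
    """Place one migrated page; optionally treat EEXIST as 'already there'."""
    try:
        backing.uffdio_copy(page, data)
        return "placed"
    except OSError as exc:
        if tolerate_eexist and exc.errno == errno.EEXIST:
            return "already-present"
        raise
```

In this model, calling `punch_hole` again on behalf of the second process before the copy makes `place_page` succeed without the `tolerate_eexist` workaround, which mirrors the suggestion that the client should also call fallocate.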
I would like to share my results:
The problem with vhost-user is as follows. The virtual switch, in our case ovs-vswitchd, remaps the memory region used by QEMU; this happens in the VHOST_USER_SET_MEM_TABLE handler.
After the memory has been mapped in QEMU and fallocate with FALLOC_FL_PUNCH_HOLE has been done, the UFFDIO_COPY ioctl in QEMU can populate a page, but if another process remaps the same memory region the ioctl returns EEXIST. This does not happen when the other process calls fallocate for the remapped region.
1. fallocate at VHOST_USER_SET_MEM_TABLE (see vhost-user-traces-set-mem-table): due to the asynchronous nature of page fault processing there is a window in which UFFDIO_COPY fails. The destination QEMU also hangs when trying to communicate with ovs-vswitchd in this case: any memory access to a region that was "fallocated" puts the thread into interruptible sleep, but the UFFDIO_COPY ioctl resumes only the QEMU thread, because the wait queue is associated with the userfault_ctx, which is associated with the userfault fd, so the ovs-vswitchd thread is not in that wait queue.
2. fallocate at VHOST_USER_VRING_GET_BASE (stopping the vring in ovs-vswitchd): this is a rather bad idea because it is too late; QEMU has already issued many UFFDIO_COPY ioctls (see vhost-user-traces-vring-get-base).
VHOST_USER_VRING_GET_BASE is another story. I didn't find an explicit (and polymorphic) call of vhost_user_get_vring_base, so I found only one path to it (see vhost-user-get-vring-base-callstack): qemu_devices_reset, which was triggered because KVM raised a triple fault. I don't know yet where QEMU places the IDT, but I suspect it is missing during page fault processing. I have no other ideas about this, so if anybody could point me to how to track down IDT/SIDT/LIDT I would appreciate it very much.
I think it is better to avoid holes for pages containing the IDT (and maybe for pages containing the vAPIC; there could be a problem with 1G hugepages here, as I saw the vAPIC was in a RAM block).
That would help avoid the triple fault and the reset, but it would not solve the vhost-user problem; it would only get rid of the reset.
QEMU still needs to stop the VRING when the corresponding page is not available.
The correct scheme is as follows:
QEMU                          | OVS-VSWITCHD
                              |
mmap of mem backend           |
                              |
vhost_net_start             ->| mmap
send stop vring
(vhost_user_get_vring_base)
                              |
after vring was stopped
fallocate                     |
copy pages                    |
after vring pages copied
send start vring              |
Right now we have:
QEMU                          | OVS-VSWITCHD
                              |
mmap                          |
fallocate                     |
                              | mmap
copy (EEXIST)                 |
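The ordering constraints implied by the two diagrams above can be written down as a small checker. This is a sketch under assumptions: the event names (`"stop_vring"`, `"fallocate"`, and so on) and the function `check_order` are made up for illustration, and the three rules are my reading of the "correct scheme", not something taken from QEMU.

```python
def check_order(events):
    """Return the ordering violations found in a list of (actor, op) steps.

    Rules read off the 'correct scheme' diagram (an interpretation):
      - QEMU's fallocate must come after the vring has been stopped
      - the backend's mmap must not come after QEMU's fallocate
      - the vring may be restarted only after pages have been copied
    """
    problems = []
    vring_stopped = False
    hole_punched = False
    copied = False
    for actor, op in events:
        if (actor, op) == ("ovs", "mmap"):
            if hole_punched:
                problems.append("backend mmap after fallocate")
        elif (actor, op) == ("qemu", "stop_vring"):
            vring_stopped = True
        elif (actor, op) == ("qemu", "fallocate"):
            if not vring_stopped:
                problems.append("fallocate before vring stop")
            hole_punched = True
        elif (actor, op) == ("qemu", "copy"):
            copied = True
        elif (actor, op) == ("qemu", "start_vring"):
            if not copied:
                problems.append("vring restarted before pages copied")
            vring_stopped = False
    return problems


# The two sequences from the diagrams, top to bottom.
CORRECT = [("qemu", "mmap"), ("ovs", "mmap"), ("qemu", "stop_vring"),
           ("qemu", "fallocate"), ("qemu", "copy"), ("qemu", "start_vring")]
RIGHT_NOW = [("qemu", "mmap"), ("qemu", "fallocate"),
             ("ovs", "mmap"), ("qemu", "copy")]
```

Running `check_order(RIGHT_NOW)` reports both the early fallocate and the late backend mmap, while `check_order(CORRECT)` returns an empty list.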
(In reply to Alexey Perevalov from comment #5)
> I would like to share my results:
>
> The problem with vhost-user is following. Virtual switch, in our case it's
> ovs-vswitchd remap memory region used by QEMU, it happen in
> VHOST_USER_SET_MEM_TABLE handler.
> After memory was mapped in QEMU and fallocate with FALLOC_FL_PUNCH_HOLE was
> done,
> ioctl with UFFDIO_COPY in QEMU could populate page, but in case when another
> process remap the same memory region ioctl returns EEXIST error. It's not
> happen when another process call fallocate for remmaped region.
>
> 1. fallocate at VHOST_USER_SET_MEM_TABLE see vhost-user-traces-set-mem-table,
> due to asynchronous nature of the page fault processing there is a gap where
> UFFDIO_COPY is failed. And dst QEMU is hangs up when trying to communicate
> with ovs-vswitchd in this case (any memory access to region which was
> "fallocated" puth thread into interruptible sleep, but ioctl UFFDIO_COPY
> runs only QEMU thread due to wait queue is associated with userfault_ctx
> which is associated with userfault fd, so thread of ovs-vswitchd is not in
> that wait queue.
>
> 2. fallocate at VHOST_USER_VRING_GET_BASE - stop vring at ovs-vswitchd. It's
> quite bad idea, due to it's too late, QEMU already makes a lot of ioctl
> UFFDIO_COPY (see vhost-user-traces-vring-get-base)
>
> VHOST_USER_VRING_GET_BASE it's another story, I didn't find explicit (and
> polymorphic) call of vhost_user_get_vring_base, so I found only one way (see
> vhost-user-get-vring-base-callstack). It's qemu_devices_reset, and it was
> initiated due to KVM's raise triple fault, I don't know yet where QEMU
> places IDT, but I suspect it's missing during page fault processing. I don't
> have another ideas of it. So if anybody could point me how to track down
> IDT/SIDT/LIDT I would appreciate so much.
>
> I think, it's better to avoid holes for pages with IDT (maybe for pages with
> vAPIC, here could be a problem with 1G hugepages, I saw vAPIC was in ram
> block)
> It will help to avoid triple fault and reset. But it will not solve
> vhost-user problem. Just will get out reset.
>
> QEMU still need to stop VRING when appropriate page is not available.
>
> The correct scheme is following:
>
> QEMU | OVS-VSWITCHD
> |
> mmap of mem backend |
> |
> vhost_net_start ->| mmap
> send stop vring
> (vhost_user_get_vring_base)
> |
> after vring was stopped
> fallocate |
> copy pages |
> after vring pages copied
> send start vring |
>
> Right now we have
> QEMU | OVS-VSWITCHD
> |
> mmap |
> fallocate |
> | mmap
> copy (EEXIST) |

I haven't looked at the insides of vhost yet - my plan is to let our vhost people look at that side.
The current idea is to also call the uffdio register in the vhost process, have it send page requests back to qemu, and have qemu do 'wake' ioctls to cause the vhost process to wake up.
I'm hoping that if the userfault is registered in the vhost process then there's no need to do any magic with the rings - because they'll be protected by userfault on both processes.

Yes, it's possible to register userfault in another process. In that case it's better to pass the userfault fd from QEMU, so that the side process can be woken by QEMU's UFFDIO_COPY ioctl. So no objections here; it could be an alternative to stopping the vring.
But the main problem is the mmap in the ovs-vswitchd process for the same region after the hole has been punched in QEMU for the mem backend: UFFDIO_COPY will fail after that. So device start and RAM discard should somehow be ordered and synchronized.

I'm starting to play with vhost-user-bridge as a simple way of understanding how this all goes together.
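The wake-up question discussed above (per-context wait queues, an explicit 'wake' from QEMU, or a userfault fd shared with the vhost process) can be pictured with a toy model. This is a sketch only: `UserfaultContext`, `Region`, `access` and `copy_and_wake` are hypothetical names, and the model assumes that a copy wakes only the waiters queued on the context it is issued against.

```python
class UserfaultContext:
    """Toy model of one userfault context with its own wait queue."""

    def __init__(self, name):
        self.name = name
        self.waiters = {}  # page number -> names of threads blocked on it

    def fault(self, page, thread):
        self.waiters.setdefault(page, []).append(thread)

    def wake(self, page):
        # Remove and return every thread that was waiting on this page.
        return self.waiters.pop(page, [])


class Region:
    """Shared memory region; pages not in 'present' are holes."""

    def __init__(self):
        self.present = set()


def access(region, ctx, page, thread):
    """A thread touches a page: it proceeds, or blocks on its own context."""
    if page in region.present:
        return "ok"
    ctx.fault(page, thread)
    return "blocked"


def copy_and_wake(region, ctx, page, also_wake=()):
    """Model a page copy on 'ctx', plus optional explicit wakes elsewhere."""
    region.present.add(page)
    woken = ctx.wake(page)
    for other in also_wake:
        woken += other.wake(page)
    return woken
```

With separate contexts, a copy issued on QEMU's context leaves the backend's thread queued until an explicit wake is sent to the other context; with a single shared context, one copy wakes both waiters.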
postcopy+vhost qemu code merged upstream for the 2.12 freeze; upstream merge is ed627b2ad37469eeba9e

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:3443