Bug 1533408
| Summary: | Migration doesn't work well in PVP testing with vIOMMU | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Pei Zhang <pezhang> |
| Component: | openvswitch | Assignee: | Matteo Croce <mcroce> |
| Status: | CLOSED DUPLICATE | QA Contact: | ovs-qe |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.5 | CC: | atragler, chayang, fleitner, juzhang, knoel, maxime.coquelin, michen, peterx, pvauter, tredaelli, virt-maint |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1538953 (view as bug list) | Environment: | |
| Last Closed: | 2018-05-03 15:02:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1538953 | | |
| Attachments: | | | |
Description
Pei Zhang
2018-01-11 09:51:12 UTC
...

Peter Xu (comment #2):
> (3) Sometimes the guest loses response. Sometimes it recovers after several
> minutes, sometimes not. And sometimes, on the destination host, testpmd
> prints lots of the info below:
> ..
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
These messages are really suspicious to me. They come from vhost_user_iotlb_pending_insert():
    ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
    if (ret) {
        RTE_LOG(INFO, VHOST_CONFIG,
            "IOTLB pool empty, clear pending misses\n");
        vhost_user_iotlb_pending_remove_all(vq);
        ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
        if (ret) {
            RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
            return;
        }
    }
It seems that here we cannot allocate an IOTLB entry even after flushing the pending requests.
Maxime, do you have any quick idea of what might have happened when both error messages are triggered?
PS. I have two questions:
1. Do we need to report the error and stop when vhost_user_iotlb_pending_insert() fails? For now, IIUC, we only get a failure message but still continue as if it had succeeded.
2. Can there be a race condition in __vhost_iova_to_vva()? Say we have this:
    if (!vhost_user_iotlb_pending_miss(vq, iova + tmp_size, perm)) {
        /*
         * iotlb_lock is read-locked for a full burst,
         * but it only protects the iotlb cache.
         * In case of IOTLB miss, we might block on the socket,
         * which could cause a deadlock with QEMU if an IOTLB update
         * is being handled. We can safely unlock here to avoid it.
         */
        vhost_user_iotlb_rd_unlock(vq);

        vhost_user_iotlb_pending_insert(vq, iova + tmp_size, perm);
        vhost_user_iotlb_miss(dev, iova + tmp_size, perm);

        vhost_user_iotlb_rd_lock(vq);
    }
Is it possible for two or more threads to see vhost_user_iotlb_pending_miss() return zero in parallel, and then insert the same iova twice?
Thanks,
Peter
Maxime Coquelin (comment #3):

(In reply to Peter Xu from comment #2)

> These messages are really suspicious to me. They come from
> vhost_user_iotlb_pending_insert(): [...]
> It seems that here we cannot allocate an IOTLB entry even after flushing
> the pending requests.
>
> Maxime, do you have any quick idea of what might have happened when both
> error messages are triggered?

I think there is a problem when rte_mempool_get() fails. We try to just release the pending list, but IOTLB cache entries are allocated from the same mempool, so we should also release some IOTLB entries from the cache.

As the memory size isn't specified in the testpmd command line, I don't know how large it is. But the IOTLB mempool size is fixed to 2048 entries, so 4GB with 2MB entries if the IOTLB entries are well formed (i.e. matching page boundaries, as we fixed last year in QEMU).

That should be large enough, but we could enlarge it further if needed. However, in this case, I would be surprised if the memory used by testpmd in the guest were larger than 4GB. We may have another bug, and dumping the IOTLB cache entries should help to understand it.

For this issue at least, I think we can move the bug to the openvswitch component and assign it to me.

> 1. Do we need to report the error and stop when
> vhost_user_iotlb_pending_insert() fails? For now, IIUC, we only get a
> failure message but still continue as if it had succeeded.

Maybe having a dedicated pool for pending misses would fix the issue.

> 2. Can there be a race condition in __vhost_iova_to_vva()? [...]
> Is it possible for two or more threads to see
> vhost_user_iotlb_pending_miss() return zero in parallel, and then insert
> the same iova twice?

Actually, we have an IOTLB cache list and an IOTLB pending list per virtqueue, so that performance scales linearly with multi-queue. So I think we are safe here.
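To make the sizing argument above concrete, here is a small, self-contained illustration (not part of the original thread) of how much guest memory a fixed pool of 2048 IOTLB entries can cover at different entry granularities. Only the 2048-entry pool size comes from the comment above; the rest is plain arithmetic.

    /* Illustration only: coverage of a fixed-size IOTLB entry pool. */
    #include <stdio.h>
    #include <stdint.h>

    #define IOTLB_POOL_ENTRIES 2048ULL   /* pool size mentioned in comment #3 */

    int main(void)
    {
        const uint64_t entry_sizes[] = {
            4ULL << 10,  /* 4KB page     */
            2ULL << 20,  /* 2MB hugepage */
            1ULL << 30,  /* 1GB hugepage */
        };
        const char *labels[] = { "4KB", "2MB", "1GB" };

        for (int i = 0; i < 3; i++) {
            /* total bytes covered = number of entries * size of one entry */
            double gb = (double)(IOTLB_POOL_ENTRIES * entry_sizes[i]) / (1ULL << 30);
            printf("%4s entries: pool of %llu covers %10.3f GB\n",
                   labels[i], (unsigned long long)IOTLB_POOL_ENTRIES, gb);
        }
        return 0;
    }

At 2MB granularity the pool covers the 4GB mentioned above, and at 1G granularity far more; at 4KB granularity the same pool covers only 8MB, which is consistent with the pool exhaustion investigated later in this bug.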
Peter Xu (comment #4):

(In reply to Maxime Coquelin from comment #3)

> For this issue at least, I think we can move the bug to the openvswitch
> component and assign it to me.

Thanks for the analysis. I'm moving the component and assigned-to accordingly. Let's open another bug if the problem still exists afterwards.

Peter

Maxime Coquelin (comment #5):

Hi Pei,

I tried to reproduce locally with qemu-kvm-rhev-2.10.0-16.el7.x86_64 and DPDK v17.11, using a single host and creating an I/O loop in testpmd instead of using an external packet generator.

With this setup, the 10 migrations I tried all succeeded.

One other difference from your setup is the guest kernel: I'm using the Fedora 27 one for now, but will try with 3.10.0-827.el7.x86_64 this afternoon.

Looking at the traces you provided, we identified a problem with IOTLB pending misses that can already be fixed. However, in my opinion, this crash happens because something goes wrong earlier: too many IOTLB entries end up in the cache.

If I don't manage to reproduce, I can provide you with a debug patch for DPDK that would dump the IOTLB cache entries when the pending-miss issue happens, in order to see whether the IOTLB cache is consistent.

Do you prefer that I prepare a dpdk brew build, or can you use the DPDK upstream git repo directly?

Thanks,
Maxime
Pei Zhang (comment #6):

(In reply to Maxime Coquelin from comment #5)

> Do you prefer that I prepare a dpdk brew build, or can you use the DPDK
> upstream git repo directly?

Maxime, using a dpdk brew build makes the steps easier, especially when testing multiple runs with automation, so I prefer a dpdk brew build. Thanks a lot.

Best Regards,
Pei

Maxime Coquelin (comment #7):

Hi Pei,

I generated the dpdk brew build:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15053693

This package is only to be installed on the host. When an OOM is triggered in the IOTLB mempool, it will dump the IOTLB cache entries as in the example below:

    VHOST_CONFIG: vhost_user_iotlb_cache_dump: vq@0x7ff3e9c95640 START:
    VHOST_CONFIG: [0]: iova ffffc000 - uaddr 7ff3d5ea8000 - size 1000 - perm 3
    VHOST_CONFIG: [1]: iova ffffd000 - uaddr 7ff3d5ea9000 - size 1000 - perm 3
    VHOST_CONFIG: [2]: iova ffffe000 - uaddr 7ff3d5d72000 - size 1000 - perm 3
    VHOST_CONFIG: [3]: iova fffff000 - uaddr 7ff3d5d73000 - size 1000 - perm 3
    VHOST_CONFIG: vhost_user_iotlb_cache_dump: vq@0x7ff3e9c95640 END

Thanks,
Maxime
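The debug patch itself is not attached to this bug, so here is only a hedged sketch of what a dump helper producing the log above might look like. It assumes the DPDK per-virtqueue IOTLB cache is a TAILQ named iotlb_list whose entries (struct vhost_iotlb_entry) carry iova, uaddr, size and perm fields; these names are assumptions about the vhost library internals, not code taken from the actual brew build.

    /*
     * Sketch only, not the actual debug patch. Assumes it sits inside
     * lib/librte_vhost/iotlb.c, where RTE_LOG, VHOST_CONFIG, struct
     * vhost_virtqueue and the per-virtqueue iotlb_list are visible.
     */
    static void
    vhost_user_iotlb_cache_dump(struct vhost_virtqueue *vq)
    {
        struct vhost_iotlb_entry *node;
        int i = 0;

        RTE_LOG(INFO, VHOST_CONFIG, "%s: vq@%p START:\n", __func__, vq);

        /* Walk the per-virtqueue IOTLB cache and print each entry. */
        TAILQ_FOREACH(node, &vq->iotlb_list, next) {
            RTE_LOG(INFO, VHOST_CONFIG,
                "[%d]: iova %" PRIx64 " - uaddr %" PRIx64
                " - size %" PRIx64 " - perm %d\n",
                i++, node->iova, node->uaddr, node->size, node->perm);
        }

        RTE_LOG(INFO, VHOST_CONFIG, "%s: vq@%p END\n", __func__, vq);
    }

Called from the OOM path, a dump like this is enough to tell whether the cache holds the expected handful of hugepage-sized entries or thousands of 4KB ones.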
Maxime Coquelin (comment #8):

> I generated the dpdk brew build:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15053693

Sorry, I forgot to call the patch command in the spec file, so the debug patch is not applied in the above brew build.

I fixed it; the build is available here:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15054827

Maxime

Pei Zhang (comment #9):

> I fixed it; the build is available here:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15054827

Hi Maxime, I don't know why this build expired so quickly. I cannot download it. Could you please check? Thanks.

Best Regards,
Pei

Maxime Coquelin (comment #10):

> Hi Maxime, I don't know why this build expired so quickly. I cannot
> download it. Could you please check? Thanks.

Hi Pei,

I think I did not follow the process, as I built it by sending the SRPM directly.

I have generated a new one using git:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15067666

Hope it will still be valid when you have a try.

Thanks,
Maxime
Pei Zhang (comment #11):

> I have generated a new one using git:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15067666

Hi Maxime, I can download this build. Thanks for building it again.

However, migration still fails with this build: 4 runs out of 4 fail, and the issue still looks like the one in the Description.

Versions:
3.10.0-832.el7.x86_64
dpdk-17.11-5.el7.bz1533408.1.x86_64
qemu-kvm-rhev-2.10.0-17.el7.x86_64
tuned-2.9.0-1.el7.noarch

I didn't see the info below on the destination host in any of my 4 runs (possibly because the IOTLB issue is not 100% reproducible):

    VHOST_CONFIG: vhost_user_iotlb_cache_dump: vq@0x7ff3e9c95640 START:
    VHOST_CONFIG: [0]: iova ffffc000 - uaddr 7ff3d5ea8000 - size 1000 - perm 3
    VHOST_CONFIG: [1]: iova ffffd000 - uaddr 7ff3d5ea9000 - size 1000 - perm 3
    VHOST_CONFIG: [2]: iova ffffe000 - uaddr 7ff3d5d72000 - size 1000 - perm 3
    VHOST_CONFIG: [3]: iova fffff000 - uaddr 7ff3d5d73000 - size 1000 - perm 3
    VHOST_CONFIG: vhost_user_iotlb_cache_dump: vq@0x7ff3e9c95640 END

Should I continue testing in order to get the above info (since it's not 100% reproducible)? Also, if you need QE environments, please let me know in advance and I can prepare them for you.

Best Regards,
Pei

Maxime Coquelin (comment #12):

Thanks Pei for sharing your environment, I managed to reproduce the issue.

On both source and destination hosts, 1G hugepages are used in both host and guest. In the source testpmd, I can see that the IOTLB entry sizes are 1G, which is expected. Then migration is triggered and succeeds. But in the destination testpmd, I can see that the IOTLB entry sizes are 4KB, which is unexpected and causes the OOM.

Peter, any idea what could go wrong on the destination so that IOTLB entries are 4KB whereas host and guest use 1G hugepages?

I need to try again on my local setup using the same QEMU version to see whether IOTLB entries are at the expected size on the destination. Please note that in my local setup I used 2MB hugepages in host and guest, and the guest is running Fedora 27, not RHEL 7.5.

I think we need to open another bug in the qemu-kvm-rhev component to tackle this issue.

In DPDK, I propose the following changes to make it more robust and improve performance for pages smaller than 1G:
1. Have a dedicated pool for IOTLB pending misses.
2. When inserting IOTLB entries, merge the ones that are contiguous in both guest IOVA and host virtual address spaces.

Thanks,
Maxime

Maxime Coquelin (comment #13):

(In reply to Maxime Coquelin from comment #12)

> 1. Have a dedicated pool for IOTLB pending misses.

I actually fixed the OOM differently, by doing a random evict in the IOTLB cache if the pending list is empty. The patch also does the opposite when the OOM happens while inserting an IOTLB entry into the cache. The patch was tested on Pei's setup: with it there is no more OOM, but it is still slow because of the 4K granularity of the IOTLB entries (instead of the expected 1G).

I generated a brew build [0] and posted the fix upstream for review [1].

> 2. When inserting IOTLB entries, merge the ones that are contiguous in
> both guest IOVA and host virtual address spaces.

Thinking again, this is more an optimization than a fix, and the schedule might be too tight to have it in RHEL 7.5. I propose this change for RHEL 7.6.

Maxime

[0]: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15100469
[1]: http://dpdk.org/dev/patchwork/patch/34476/
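For reference, the out-of-memory handling change described in comment #13 can be sketched as a modification of the vhost_user_iotlb_pending_insert() snippet quoted in comment #2. This is only an illustration of the approach ("evict from the cache when the pending list is already empty"), not the upstream patch itself; the helper name vhost_user_iotlb_cache_random_evict() and the iotlb_pending_list field are assumptions.

    /*
     * Sketch of the approach from comment #13, not the actual fix: when the
     * shared IOTLB mempool is exhausted, first drop pending misses as before;
     * if the pending list is already empty, evict a (pseudo-)random entry
     * from the IOTLB cache instead, then retry the allocation.
     */
    ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
    if (ret) {
        RTE_LOG(DEBUG, VHOST_CONFIG, "IOTLB pool empty, clearing entries\n");

        if (!TAILQ_EMPTY(&vq->iotlb_pending_list))
            vhost_user_iotlb_pending_remove_all(vq);
        else
            vhost_user_iotlb_cache_random_evict(vq); /* assumed helper name */

        ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
        if (ret) {
            RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
            return;
        }
    }

Comment #13 notes that the symmetric case is handled as well: when the OOM happens while inserting into the cache, the pending list is the one that gets trimmed.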
Peter Xu (comment #14):

(In reply to Maxime Coquelin from comment #12)

> Peter, any idea what could go wrong on the destination so that IOTLB
> entries are 4KB whereas host and guest use 1G hugepages?

This is strange. Logically speaking, this should only depend on how the guest IOMMU driver sets up the IOMMU page tables for the device, and QEMU will read from there. Meanwhile, the page tables inside the guest should not change after the migration IMHO, including the information on whether hugepages are used.

Is there more information? E.g., are the mappings still consistent at least? Say, if IOVA1 maps to GPA1 on the source, will that still be correct on the destination (even though the hugepage information is lost)?

> I think we need to open another bug in the qemu-kvm-rhev component to
> tackle this issue.

I agree. Let's open another bug to track this. Please feel free to assign it to me.

Peter

Maxime Coquelin (comment #15):

Hi,

The DPDK vhost-user fixes have been accepted upstream and will be part of v17.11-rc3:
http://dpdk.org/dev/patchwork/patch/34663/
http://dpdk.org/dev/patchwork/patch/34664/

Cheers,
Maxime

Flavio Leitner (fbl):

Maxime, can you confirm that they are included in v17.11? If so, we can close this as a duplicate of bz#1522700.

Thanks,
fbl

Maxime Coquelin:

Hi Flavio,

Sorry, my mistake: I meant the patches are part of *v18.02-rc3*, not v17.11-rc3.

The patches are also queued in the v17.11 LTS branch, so they will be part of the next v17.11 stable release. I asked Matteo whether we should pick them for the next release. As the patches aren't very urgent and the next release deadline is close, we agreed to postpone them to the following release.

Regards,
Maxime

Thanks, reassigning to move this further.

Matteo Croce:

Hi Maxime, aren't these two patches the same as in bug 1541881? If so, I'll change the state to MODIFIED.

Maxime Coquelin:

Hi Matteo,

Yes, they are the same patches.

Cheers,
Maxime

*** This bug has been marked as a duplicate of bug 1541881 ***