Bug 1738751
Summary: | NFV live migration fails with dpdk "--iova-mode va": Failed to load virtio-net:virtio | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Pei Zhang <pezhang> | |
Component: | dpdk | Assignee: | Adrián Moreno <amorenoz> | |
Status: | CLOSED WONTFIX | QA Contact: | Pei Zhang <pezhang> | |
Severity: | high | Docs Contact: | ||
Priority: | high | |||
Version: | 8.1 | CC: | aadam, amorenoz, chayang, dmarchan, jinzhao, juzhang, jwboyer, kanderso, maxime.coquelin, ovs-qe, tredaelli | |
Target Milestone: | rc | Keywords: | Regression | |
Target Release: | 8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1763815 1764000 | Environment: | |
Last Closed: | 2019-11-21 08:18:15 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1763815, 1764000 |
Description
Pei Zhang
2019-08-08 05:08:23 UTC
Do you know which last revision of upstream version was working fine? This would help bisect the issue.

(In reply to David Marchand from comment #1)
> Do you know which last revision of upstream version was working fine?
> This would help bisect the issue.

Hi David,

v19.08-rc2 is the first version that hits this problem.

commit 07efd6ddc0499688eb11ae4866d3532295d6db2b (tag: v19.05, origin/releases) works well
commit cc091931dc05212db32ddbd7da3031104ca4963f (tag: v19.08-rc1) works well
commit 83a124fb73c50b051ee20ef6b1998c81be7e65df (tag: v19.08-rc2) fails
commit 0710d87b7f5d0a2cd01861d44c4689efd4714b5f (tag: v19.08-rc4) fails

Best regards,
Pei

We also filed a bug in upstream dpdk: https://bugs.dpdk.org/show_bug.cgi?id=337

I have been able to reproduce this issue and bisected it to the following commit:

commit bbe29a9bd7ab6feab9a52051c32092a94ee886eb
Author: Jerin Jacob <jerinj>
Date: Mon Jul 22 14:56:53 2019 +0200

    eal/linux: select IOVA as VA mode for default case

    When bus layer reports the preferred mode as RTE_IOVA_DC then
    select the RTE_IOVA_VA mode:
    - All drivers work in RTE_IOVA_VA mode, irrespective of physical
      address availability.
    - By default, a mempool asks for IOVA-contiguous memory using
      RTE_MEMZONE_IOVA_CONTIG. This is slow in RTE_IOVA_PA mode and it
      may affect the application boot time.

    Signed-off-by: Jerin Jacob <jerinj>
    Acked-by: Anatoly Burakov <anatoly.burakov>
    Signed-off-by: David Marchand <david.marchand>

This commit only changes the default IOVA mode from IOVA_PA to IOVA_VA, so it is just revealing an underlying problem. I confirmed this by verifying that upstream dpdk with "--iova-mode pa" works fine, and that stable downstream dpdk fails in the same manner if "--iova-mode va" is used.
Going to qemu, the code that detects the error is:

    vdev->vq[i].inuse = (uint16_t)(vdev->vq[i].last_avail_idx -
                                   vdev->vq[i].used_idx);
    if (vdev->vq[i].inuse > vdev->vq[i].vring.num) {
        error_report("VQ %d size 0x%x < last_avail_idx 0x%x - "
                     "used_idx 0x%x",
                     i, vdev->vq[i].vring.num,
                     vdev->vq[i].last_avail_idx,
                     vdev->vq[i].used_idx);
        return -1;
    }

One of the times I reproduced it, I looked at the index values on the sending qemu just before sending the vmstates:

    size 0x100 | last_avail_idx 0x3aa0 | used_idx 0x3aa0

And just after loading the vmstates at the receiving qemu:

    VQ 0 size 0x100 < last_avail_idx 0x3aa0 - used_idx 0xbda0

At first I suspected an endianness issue, but then confirmed that virtio_lduw_phys_cached handles it properly. So it might be that the memory caches don't get properly synchronized before the migration takes place.

Although the problem was detected as a regression in guest dpdk (and therefore filed against the RHEL product), the problem was actually on the host side. I have sent a patch upstream that fixes it [1], so I suggest moving this bug to the FD stream.

With regard to the possibility of clients being affected by this problem when upgrading to rhel8.1.1, I suggest adding a note in the documentation explaining the workaround, which is:
a) upgrade the host to the FD version that contains the fix
b) add "--iova-mode pa" to the EAL's parameters

Closing this bug, as the issue has to be fixed in the host and BZ 1763815 will take care of that.
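
For reference, the arithmetic behind the qemu check quoted above can be reproduced with the index values from this report. This is only a minimal sketch: `vq_inuse` and `vq_state_is_valid` are hypothetical helper names, not qemu functions, and all three values are assumed to fit in 16 bits for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of qemu): mirrors the quoted expression
 * (uint16_t)(last_avail_idx - used_idx). The cast makes the subtraction
 * wrap modulo 0x10000, as the ring indexes are free-running counters. */
static uint16_t vq_inuse(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}

/* Hypothetical helper: true when the number of in-flight descriptors
 * does not exceed the queue size, i.e. the quoted check would pass. */
static bool vq_state_is_valid(uint16_t last_avail_idx, uint16_t used_idx,
                              uint16_t vring_num)
{
    return vq_inuse(last_avail_idx, used_idx) <= vring_num;
}
```

With the sender's values (last_avail_idx 0x3aa0, used_idx 0x3aa0) the difference is 0, which is within the queue size 0x100. With the receiver's values (last_avail_idx 0x3aa0, used_idx 0xbda0) the difference wraps to 0x7d00, which exceeds 0x100, so the load is rejected with the error shown in the report.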