Bug 1538953

Summary: IOTLB entry size mismatch before/after migration during DPDK PVP testing
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.5
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Maxime Coquelin <maxime.coquelin>
Assignee: Peter Xu <peterx>
QA Contact: Pei Zhang <pezhang>
CC: atragler, chayang, juzhang, knoel, lmiksik, maxime.coquelin, michen, mrezanin, mtessun, ovs-qe, peterx, pezhang, virt-maint
Target Milestone: rc
Fixed In Version: qemu-kvm-rhev-2.10.0-21.el7
Clone Of: 1533408
Bug Depends On: 1533408, 1541881
Type: Bug
Last Closed: 2018-04-11 00:58:52 UTC

Comment 2 Peter Xu 2018-02-01 11:16:32 UTC
I traced vtd_switch_address_space() and found that after migration the two vhost-user devices had DMAR disabled; that's where the 4K pages come from (QEMU never actually walked the IOMMU page tables, but returned translations as if there were no vIOMMU).
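
For context, here is a minimal sketch of the decision involved. The type and field names are simplified stand-ins for the real structures in hw/i386/intel_iommu.c, not the verbatim QEMU code:

#include <stdbool.h>

/* Simplified stand-in types; the real QEMU structures carry far more. */
typedef struct IntelIOMMUState {
    bool dmar_enabled;              /* mirrors the guest's DMAR enable bit */
} IntelIOMMUState;

typedef struct VTDAddressSpace {
    IntelIOMMUState *iommu_state;
    bool iommu_enabled;             /* does DMA route through the vIOMMU? */
} VTDAddressSpace;

/* Sketch of the choice vtd_switch_address_space() makes: if DMAR looks
 * disabled, the device's DMA address space bypasses the vIOMMU, and
 * translation requests come back as identity-mapped 4K pages. */
static void vtd_switch_address_space(VTDAddressSpace *as)
{
    as->iommu_enabled = as->iommu_state->dmar_enabled;
}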

After some more debugging, I found the real culprit: we are migrating the pcie-root-ports after the IOMMU, which is problematic because the PCI bus number information is stored in the configuration space of the root port. As a result, the IOMMU fetches the wrong PCI bus number during vmstate load and things get messed up (e.g., context entries are no longer correct, since the lookup depends on a correct bus number).
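
To illustrate why the ordering matters, a toy model of the lookup follows; the names echo vtd_dev_to_context_entry(), but the layout is purely illustrative (the real code walks root/context tables in guest memory):

#include <stdint.h>

/* Toy model: context entries are selected by (bus, devfn).  If the
 * IOMMU vmstate is loaded before the root port has restored its
 * secondary bus number, a lookup like this runs with a stale bus_num
 * and picks the wrong context entry. */
typedef struct VTDContextEntry {
    uint64_t lo;
    uint64_t hi;
} VTDContextEntry;

static VTDContextEntry context_table[256][256];    /* [bus][devfn] */

static VTDContextEntry *vtd_dev_to_context_entry(uint8_t bus_num,
                                                 uint8_t devfn)
{
    return &context_table[bus_num][devfn];
}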

I tried boosting the pcie-root-port devices' migration priority, and a smoke test shows that this solves the problem. With the fix, Pei and I can migrate the VM back and forth without seeing any errors.
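
For reference, a sketch of the fix's shape: QEMU orders vmstate save/load by a MigrationPriority attached to each VMStateDescription (the real enum lives in migration/vmstate.h; the members shown here are trimmed to what the idea needs):

/* Hedged sketch: devices with a higher priority value are migrated
 * earlier.  Enum values and struct members are illustrative. */
typedef enum {
    MIG_PRI_DEFAULT = 0,
    MIG_PRI_IOMMU,      /* vIOMMU loads before ordinary devices */
    MIG_PRI_PCI_BUS,    /* root ports load before the vIOMMU, so the
                         * bus numbers in their config space are valid
                         * when the IOMMU rebuilds context entries */
} MigrationPriority;

typedef struct VMStateDescription {
    const char *name;
    MigrationPriority priority;
} VMStateDescription;

/* Raising the root port's priority is the essence of the fix. */
static const VMStateDescription vmstate_rp_dev = {
    .name     = "pcie-root-port",
    .priority = MIG_PRI_PCI_BUS,
};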

I'll post the fix soon upstream for review.

Peter

Comment 3 Peter Xu 2018-02-01 11:21:46 UTC
Posted patch upstream for review:

[PATCH] pcie-root-port: let it has higher migrate priority

Peter

Comment 5 Miroslav Rezanina 2018-02-20 13:41:15 UTC
Fix included in qemu-kvm-rhev-2.10.0-21.el7

Comment 7 Pei Zhang 2018-02-27 14:18:53 UTC
==Verification==

Versions:
3.10.0-855.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
dpdk-17.11-7.el7.x86_64

Steps:
Same steps as in the Description. All 10 migration runs work well, with no errors in the host or guest. The migration testing results are as follows:

===========Stream Rate: 1Mpps===========
No  Stream_Rate  Downtime(ms)  Totaltime(ms)  Ping_Loss  trex_Loss
 0       1Mpps      249     15489       581   12321816.0
 1       1Mpps      242     14675       281    6190466.0
 2       1Mpps      256     14560       282    3971495.0
 3       1Mpps      250     15533       282   13343215.0
 4       1Mpps      252     15276       282    9351827.0
 5       1Mpps      244     15395       280   11437595.0
 6       1Mpps      245     14553       281    5710854.0
 7       1Mpps      244     15476       281   12295376.0
 8       1Mpps      255     15095       282    4554012.0
 9       1Mpps      249     14658       282    6444174.0

1. We found that ping loss and trex packet loss are high during the whole migration process. However, from the QE perspective, this bug has been fixed; the packet loss should be a separate issue, and we will file new bugs later to track it. Thanks.

Moving this bug to 'VERIFIED'. Please feel free to add a comment or change the status if you disagree. Thanks.

Comment 9 Pei Zhang 2018-02-28 07:21:59 UTC
(In reply to Pei Zhang from comment #7)
 
> 1. We found that ping loss and trex packet loss are high during the whole
> migration process. However, from the QE perspective, this bug has been
> fixed; the packet loss should be a separate issue, and we will file new
> bugs later to track it. Thanks.

Update:

QE filed a new bug:

Bug 1549955 - During PVP live migration, ping packets loss become higher with vIOMMU

Comment 10 errata-xmlrpc 2018-04-11 00:58:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104