I traced vtd_switch_address_space() and found that after migration the two vhost-user devices had DMAR disabled; that's where the 4K pages came from (QEMU didn't actually walk the vIOMMU page tables, but returned mappings as if there were no vIOMMU).
After some more debugging, I found an interesting truth: we are migrating the pcie-root-ports after the IOMMU, and that is problematic, since the PCI bus number is stored in the configuration space of the root port. As a result, the IOMMU fetches the wrong PCI bus number during vmstate load and things get messed up (e.g., the context entries are no longer correct, since resolving them depends on a correct bus number).
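To make the ordering hazard concrete, here is a purely illustrative sketch (not QEMU code; the dictionary and function names are invented): VT-d context entries are looked up by (bus, devfn), so if the IOMMU's vmstate load runs before the root port's secondary bus number has been restored, the lookup happens under a stale bus number and misses.

```python
# Illustrative only: context entries are keyed by (bus, devfn).
context_entries = {(1, 0): "assigned-device-domain"}

def lookup(bus, devfn):
    # A miss here means the device falls back to untranslated access.
    return context_entries.get((bus, devfn), "no-translation")

# With the correct, already-restored bus number the right domain is found:
print(lookup(1, 0))
# With a stale bus number (root port not yet restored), the lookup misses:
print(lookup(0, 0))
```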
I tried boosting the migration priority of the pcie-root-port devices, and a smoke test shows that this solves the problem. With the fix, Pei and I can migrate the VM back and forth without seeing any error.
I'll post the fix soon upstream for review.
Posted patch upstream for review:
[PATCH] pcie-root-port: let it has higher migrate priority
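The idea behind the fix can be sketched as follows. This is a hedged illustration of priority-ordered vmstate restore, not QEMU's actual implementation; the constant names and device list below are invented, loosely mirroring the MigrationPriority mechanism:

```python
# Invented priority values; higher restores earlier.
MIG_PRI_DEFAULT = 0
MIG_PRI_PCI_BUS = 1  # pcie-root-port carries the secondary bus number

devices = [
    {"name": "intel-iommu", "priority": MIG_PRI_DEFAULT},
    {"name": "pcie-root-port", "priority": MIG_PRI_PCI_BUS},
    {"name": "vhost-user-net", "priority": MIG_PRI_DEFAULT},
]

def restore_order(devs):
    # Stable sort: higher priority first, ties keep their original order,
    # so the root port's bus number is in place before the IOMMU loads.
    return [d["name"] for d in sorted(devs, key=lambda d: -d["priority"])]

print(restore_order(devices))
```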
Fix included in qemu-kvm-rhev-2.10.0-21.el7
Same steps as Description. All 10 migration runs work well, with no errors on host or guest. The migration testing results are below:
===========Stream Rate: 1Mpps===========
No  Stream_Rate  Downtime  Totaltime  Ping_Loss  trex_Loss
0   1Mpps        249       15489      581        12321816.0
1   1Mpps        242       14675      281        6190466.0
2   1Mpps        256       14560      282        3971495.0
3   1Mpps        250       15533      282        13343215.0
4   1Mpps        252       15276      282        9351827.0
5   1Mpps        244       15395      280        11437595.0
6   1Mpps        245       14553      281        5710854.0
7   1Mpps        244       15476      281        12295376.0
8   1Mpps        255       15095      282        4554012.0
9   1Mpps        249       14658      282        6444174.0
1. We found that ping loss and trex packet loss are high during the whole migration process. However, from the QE perspective this bug has been fixed; the packet loss should be a new issue, and we will file new bugs later to track it. Thanks.
Move this bug to 'VERIFIED'. Please feel free to add a comment or change the status if you disagree. Thanks.
(In reply to Pei Zhang from comment #7)
> 1. We found ping loss and trex packets loss are high during the whole
> migration process, however from QE perspective, we think this bug has been
> fixed, the packets loss should be a new issue, we will file new bugs later
> to track them. Thanks.
QE filed a new bug:
Bug 1549955 - During PVP live migration, ping packets loss become higher with vIOMMU
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.