I traced vtd_switch_address_space() and found that after migration the two vhost-user devices had DMAR disabled; that is where the 4K pages come from (QEMU did not actually walk the IOMMU page tables, but returned as if there were no vIOMMU).
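For context, below is a minimal sketch of the switching idea (paraphrased; the real code lives in QEMU's hw/i386/intel_iommu.c, and the stand-in types, field names, and helper here are illustrative): each device's DMA view is toggled between an IOMMU-translated region and a passthrough region depending on whether DMAR is enabled, so a vmstate that wrongly restores DMAR as disabled leaves the device on the passthrough path.

#include <stdbool.h>

/* Stand-in types for illustration only; the real definitions live in
 * QEMU's memory API and hw/i386/intel_iommu.c. */
typedef struct MemoryRegion { bool enabled; } MemoryRegion;

static void memory_region_set_enabled(MemoryRegion *mr, bool on)
{
    mr->enabled = on;   /* the real QEMU helper also updates the flat view */
}

typedef struct IntelIOMMUState { bool dmar_enabled; } IntelIOMMUState;

typedef struct VTDAddressSpace {
    IntelIOMMUState *iommu_state;
    bool iommu_enabled;        /* which view the device currently uses */
    MemoryRegion iommu_mr;     /* DMA is translated by the vIOMMU */
    MemoryRegion nodmar_mr;    /* DMA bypasses the vIOMMU */
} VTDAddressSpace;

/* Switch a device's DMA view to match the DMAR enable bit.  If the
 * restored state claims DMAR is off, the device ends up on the
 * passthrough region, i.e. "as if there is no vIOMMU". */
static bool vtd_switch_address_space_sketch(VTDAddressSpace *as)
{
    bool use_iommu = as->iommu_state->dmar_enabled;

    if (use_iommu == as->iommu_enabled) {
        return false;                      /* nothing to switch */
    }
    as->iommu_enabled = use_iommu;
    memory_region_set_enabled(&as->iommu_mr, use_iommu);
    memory_region_set_enabled(&as->nodmar_mr, !use_iommu);
    return true;
}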
After some more debugging, I found something interesting: we migrate the pcie-root-ports after the IOMMU, and that is problematic, since the PCI bus number information is stored in the configuration space of the root port. As a result, the IOMMU fetches the wrong PCI bus number during vmstate load, and things get messed up (e.g., context entries are no longer correct, since they depend on a correct bus number).
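To illustrate why the bus number matters, here is a simplified sketch of the VT-d lookup path (based on the VT-d table layout rather than QEMU's exact code; the toy in-memory tables and names are illustrative): the root table is indexed by the PCI bus number, so a stale bus number selects the wrong root entry and hence the wrong context entry.

#include <stdint.h>
#include <stddef.h>

typedef struct VTDRootEntry {
    uint64_t val;   /* present bit + pointer to a context table */
} VTDRootEntry;

typedef struct VTDContextEntry {
    uint64_t lo;    /* present bit + second-level page-table pointer */
    uint64_t hi;    /* address width, domain id */
} VTDContextEntry;

/* Toy tables standing in for structures the guest keeps in memory. */
static VTDRootEntry    root_table[256];          /* indexed by bus */
static VTDContextEntry context_tables[256][256]; /* [bus][devfn] */

#define VTD_ENTRY_PRESENT 1ULL

/* Resolve (bus, devfn) to a context entry.  If the root port's bus
 * number is restored after the IOMMU state, `bus` is stale here and
 * the lookup lands on the wrong (or an absent) entry. */
static const VTDContextEntry *
vtd_dev_to_context_entry_sketch(uint8_t bus, uint8_t devfn)
{
    if (!(root_table[bus].val & VTD_ENTRY_PRESENT)) {
        return NULL;
    }
    const VTDContextEntry *ce = &context_tables[bus][devfn];
    return (ce->lo & VTD_ENTRY_PRESENT) ? ce : NULL;
}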
I tried boosting the migration priority of the pcie-root-port devices, and a smoke test shows that this solves the problem. With the fix, Pei and I can migrate the VM back and forth without seeing any errors.
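For reference, QEMU orders vmstate loading with a per-VMStateDescription priority (MigrationPriority in include/migration/vmstate.h), where higher-priority vmstates are loaded first. The following is only a hedged sketch of the shape such a fix could take, with stand-in types and an illustrative vmstate name, not the actual patch:

/* Trimmed to the pieces that matter here; exact enumerators and
 * comments vary by QEMU version. */
typedef enum {
    MIG_PRI_DEFAULT = 0,
    MIG_PRI_IOMMU,      /* loaded before ordinary PCI devices */
    MIG_PRI_PCI_BUS,    /* loaded before the IOMMU */
} MigrationPriority;

typedef struct VMStateDescription {
    const char *name;
    MigrationPriority priority;
    /* ... field descriptions elided ... */
} VMStateDescription;

/* Hypothetical excerpt: giving the root port's state a priority above
 * the IOMMU's means its secondary bus number is valid by the time the
 * IOMMU vmstate is parsed. */
static const VMStateDescription vmstate_pcie_root_port_sketch = {
    .name     = "pcie-root-port",
    .priority = MIG_PRI_PCI_BUS,
};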
I'll post the fix soon upstream for review.
Peter
==Verification==
Versions:
3.10.0-855.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
dpdk-17.11-7.el7.x86_64
Steps:
Same steps as in the Description. All 10 migration runs work well, with no errors in the host or guest. The migration testing results are shown below:
===========Stream Rate: 1Mpps===========
No  Stream_Rate  Downtime  Totaltime  Ping_Loss  trex_Loss
0   1Mpps        249       15489      581        12321816.0
1   1Mpps        242       14675      281        6190466.0
2   1Mpps        256       14560      282        3971495.0
3   1Mpps        250       15533      282        13343215.0
4   1Mpps        252       15276      282        9351827.0
5   1Mpps        244       15395      280        11437595.0
6   1Mpps        245       14553      281        5710854.0
7   1Mpps        244       15476      281        12295376.0
8   1Mpps        255       15095      282        4554012.0
9   1Mpps        249       14658      282        6444174.0
1. We found that ping loss and trex packet loss are high during the whole migration process. However, from the QE perspective, we think this bug has been fixed; the packet loss should be a new issue, and we will file new bugs later to track it. Thanks.
Moving this bug to 'VERIFIED'. Please feel free to add a comment or change the status if you disagree. Thanks.
(In reply to Pei Zhang from comment #7)
> 1. We found that ping loss and trex packet loss are high during the whole
> migration process. However, from the QE perspective, we think this bug has
> been fixed; the packet loss should be a new issue, and we will file new
> bugs later to track it. Thanks.
Update:
QE filed a new bug:
Bug 1549955 - During PVP live migration, ping packets loss become higher with vIOMMU
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2018:1104