Bug 1533408
| Summary: | Migration doesn't work well in PVP testing with vIOMMU | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Pei Zhang <pezhang> |
| Component: | openvswitch | Assignee: | Matteo Croce <mcroce> |
| Status: | CLOSED DUPLICATE | QA Contact: | ovs-qe |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.5 | CC: | atragler, chayang, fleitner, juzhang, knoel, maxime.coquelin, michen, peterx, pvauter, tredaelli, virt-maint |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1538953 (view as bug list) | Environment: | |
| Last Closed: | 2018-05-03 15:02:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1538953 | | |
| Attachments: | | | |
Description
Pei Zhang
2018-01-11 09:51:12 UTC
...

Peter Xu (comment #2):
> (3) Sometimes the guest loses response. Sometimes it recovers after several
> minutes, sometimes not. And sometimes, on the destination host, testpmd
> prints lots of the info below:
> ..
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
> VHOST_CONFIG: IOTLB pool empty, clear pending misses
> VHOST_CONFIG: IOTLB pool still empty, failure
These messages are really suspicious to me. They come from vhost_user_iotlb_pending_insert():
    ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
    if (ret) {
        RTE_LOG(INFO, VHOST_CONFIG,
            "IOTLB pool empty, clear pending misses\n");
        vhost_user_iotlb_pending_remove_all(vq);
        ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
        if (ret) {
            RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
            return;
        }
    }
It seems that here we cannot allocate an IOTLB entry even after flushing the pending requests.
Maxime, do you have any quick idea of what might have happened when both error messages are triggered?
PS. I have two questions:
1. Do we need to report the error and stop when vhost_user_iotlb_pending_insert() fails? For now, IIUC, we only get a failure message but still continue as if it had succeeded.
2. Can there be a race condition in __vhost_iova_to_vva()? Say we have this:
    if (!vhost_user_iotlb_pending_miss(vq, iova + tmp_size, perm)) {
        /*
         * iotlb_lock is read-locked for a full burst,
         * but it only protects the iotlb cache.
         * In case of IOTLB miss, we might block on the socket,
         * which could cause a deadlock with QEMU if an IOTLB update
         * is being handled. We can safely unlock here to avoid it.
         */
        vhost_user_iotlb_rd_unlock(vq);

        vhost_user_iotlb_pending_insert(vq, iova + tmp_size, perm);
        vhost_user_iotlb_miss(dev, iova + tmp_size, perm);

        vhost_user_iotlb_rd_lock(vq);
    }
Is it possible for two or more threads to see vhost_user_iotlb_pending_miss() return zero in parallel, and then insert the same iova twice?
Thanks,
Peter
Maxime Coquelin (comment #3):

(In reply to Peter Xu from comment #2)

> These messages are really suspicious to me. They come from
> vhost_user_iotlb_pending_insert(): [...]
> It seems that here we cannot allocate an IOTLB entry even after flushing
> the pending requests.
>
> Maxime, do you have any quick idea of what might have happened when both
> error messages are triggered?

I think there is a problem when rte_mempool_get() fails. We try to just release the pending list, but IOTLB cache entries are allocated from the same mempool, so we should also release some IOTLB entries from the cache.

As the memory size isn't specified in the testpmd command line, I don't know how large it is. But the IOTLB mempool size is fixed to 2048 entries, so 4GB with 2MB entries if the IOTLB entries are well formed (i.e. matching page boundaries, as we fixed last year in QEMU).

That should be large enough, but we could enlarge it further if needed. However, in this case, I would be surprised if the memory used by testpmd in the guest were larger than 4GB. We may have another bug, and dumping the IOTLB cache entries should help to understand it.

For this issue at least, I think we can move the bug to the openvswitch component and assign it to me.

> 1. Do we need to report the error and stop when
> vhost_user_iotlb_pending_insert() fails? For now, IIUC, we only get a
> failure message but still continue as if it had succeeded.

Maybe having a dedicated pool for pending misses would fix the issue.

> 2. Can there be a race condition in __vhost_iova_to_vva()? [...]
> Is it possible for two or more threads to see
> vhost_user_iotlb_pending_miss() return zero in parallel, and then insert
> the same iova twice?

Actually, we have an IOTLB cache list and an IOTLB pending list per virtqueue, so that performance scales linearly with multi-queue. So I think we are safe here.
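To make the sizing argument above concrete, here is a small, self-contained illustration (not part of the original thread) of how much guest memory a fixed pool of 2048 IOTLB entries can cover at different entry granularities. Only the 2048-entry pool size comes from the comment above; the rest is plain arithmetic.

    /* Illustration only: coverage of a fixed-size IOTLB entry pool. */
    #include <stdio.h>
    #include <stdint.h>

    #define IOTLB_POOL_ENTRIES 2048ULL   /* pool size mentioned in comment #3 */

    int main(void)
    {
        const uint64_t entry_sizes[] = {
            4ULL << 10,  /* 4KB page     */
            2ULL << 20,  /* 2MB hugepage */
            1ULL << 30,  /* 1GB hugepage */
        };
        const char *labels[] = { "4KB", "2MB", "1GB" };

        for (int i = 0; i < 3; i++) {
            /* total bytes covered = number of entries * size of one entry */
            double gb = (double)(IOTLB_POOL_ENTRIES * entry_sizes[i]) / (1ULL << 30);
            printf("%4s entries: pool of %llu covers %10.3f GB\n",
                   labels[i], (unsigned long long)IOTLB_POOL_ENTRIES, gb);
        }
        return 0;
    }

At 2MB granularity the pool covers the 4GB mentioned above, and at 1G granularity far more; at 4KB granularity the same pool covers only 8MB, which is consistent with the pool exhaustion investigated later in this bug.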
Peter Xu (comment #4):

(In reply to Maxime Coquelin from comment #3)

> For this issue at least, I think we can move the bug to the openvswitch
> component and assign it to me.

Thanks for the analysis. I'm moving the component and assigned-to accordingly. Let's open another bug if the problem still exists afterwards.

Peter

Maxime Coquelin (comment #5):

Hi Pei,

I tried to reproduce locally with qemu-kvm-rhev-2.10.0-16.el7.x86_64 and DPDK v17.11, using a single host and creating an I/O loop in testpmd instead of using an external packet generator.

With this setup, the 10 migrations I tried all succeeded.

One other difference from your setup is the guest kernel: I'm using the Fedora 27 one for now, but will try with 3.10.0-827.el7.x86_64 this afternoon.

Looking at the traces you provided, we identified a problem with IOTLB pending misses that can already be fixed. However, in my opinion, this crash happens because something goes wrong earlier: too many IOTLB entries end up in the cache.

If I don't manage to reproduce, I can provide you with a debug patch for DPDK that would dump the IOTLB cache entries when the pending-miss issue happens, in order to see whether the IOTLB cache is consistent.

Do you prefer that I prepare a dpdk brew build, or can you use the DPDK upstream git repo directly?

Thanks,
Maxime
Pei Zhang (comment #6):

(In reply to Maxime Coquelin from comment #5)

> Do you prefer that I prepare a dpdk brew build, or can you use the DPDK
> upstream git repo directly?

Maxime, using a dpdk brew build makes the steps easier, especially when testing multiple runs with automation, so I prefer a dpdk brew build. Thanks a lot.

Best Regards,
Pei

Maxime Coquelin (comment #7):

Hi Pei,

I generated the dpdk brew build:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15053693

This package is only to be installed on the host. When an OOM is triggered in the IOTLB mempool, it will dump the IOTLB cache entries as in the example below:

    VHOST_CONFIG: vhost_user_iotlb_cache_dump: vq@0x7ff3e9c95640 START:
    VHOST_CONFIG: [0]: iova ffffc000 - uaddr 7ff3d5ea8000 - size 1000 - perm 3
    VHOST_CONFIG: [1]: iova ffffd000 - uaddr 7ff3d5ea9000 - size 1000 - perm 3
    VHOST_CONFIG: [2]: iova ffffe000 - uaddr 7ff3d5d72000 - size 1000 - perm 3
    VHOST_CONFIG: [3]: iova fffff000 - uaddr 7ff3d5d73000 - size 1000 - perm 3
    VHOST_CONFIG: vhost_user_iotlb_cache_dump: vq@0x7ff3e9c95640 END

Thanks,
Maxime
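The debug patch itself is not attached to this bug, so here is only a hedged sketch of what a dump helper producing the log above might look like. It assumes the DPDK per-virtqueue IOTLB cache is a TAILQ named iotlb_list whose entries (struct vhost_iotlb_entry) carry iova, uaddr, size and perm fields; these names are assumptions about the vhost library internals, not code taken from the actual brew build.

    /*
     * Sketch only, not the actual debug patch. Assumes it sits inside
     * lib/librte_vhost/iotlb.c, where RTE_LOG, VHOST_CONFIG, struct
     * vhost_virtqueue and the per-virtqueue iotlb_list are visible.
     */
    static void
    vhost_user_iotlb_cache_dump(struct vhost_virtqueue *vq)
    {
        struct vhost_iotlb_entry *node;
        int i = 0;

        RTE_LOG(INFO, VHOST_CONFIG, "%s: vq@%p START:\n", __func__, vq);

        /* Walk the per-virtqueue IOTLB cache and print each entry. */
        TAILQ_FOREACH(node, &vq->iotlb_list, next) {
            RTE_LOG(INFO, VHOST_CONFIG,
                "[%d]: iova %" PRIx64 " - uaddr %" PRIx64
                " - size %" PRIx64 " - perm %d\n",
                i++, node->iova, node->uaddr, node->size, node->perm);
        }

        RTE_LOG(INFO, VHOST_CONFIG, "%s: vq@%p END\n", __func__, vq);
    }

Called from the OOM path, a dump like this is enough to tell whether the cache holds the expected handful of hugepage-sized entries or thousands of 4KB ones.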
Maxime Coquelin (comment #8):

> I generated the dpdk brew build:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15053693

Sorry, I forgot to call the patch command in the spec file, so the debug patch is not applied in the above brew build.

I fixed it; the build is available here:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15054827

Maxime

Pei Zhang (comment #9):

> I fixed it; the build is available here:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15054827

Hi Maxime, I don't know why this build expired so quickly. I cannot download it. Could you please check? Thanks.

Best Regards,
Pei

Maxime Coquelin (comment #10):

> Hi Maxime, I don't know why this build expired so quickly. I cannot
> download it. Could you please check? Thanks.

Hi Pei,

I think I did not follow the process, as I built it by sending the SRPM directly.

I have generated a new one using git:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15067666

Hope it will still be valid when you have a try.

Thanks,
Maxime
Pei Zhang (comment #11):

> I have generated a new one using git:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15067666

Hi Maxime, I can download this build. Thanks for building it again.

However, migration still fails with this build: 4 runs out of 4 fail, and the issue still looks like the one in the Description.

Versions:
3.10.0-832.el7.x86_64
dpdk-17.11-5.el7.bz1533408.1.x86_64
qemu-kvm-rhev-2.10.0-17.el7.x86_64
tuned-2.9.0-1.el7.noarch

I didn't see the info below on the destination host in any of my 4 runs (possibly because the IOTLB issue is not 100% reproducible):

    VHOST_CONFIG: vhost_user_iotlb_cache_dump: vq@0x7ff3e9c95640 START:
    VHOST_CONFIG: [0]: iova ffffc000 - uaddr 7ff3d5ea8000 - size 1000 - perm 3
    VHOST_CONFIG: [1]: iova ffffd000 - uaddr 7ff3d5ea9000 - size 1000 - perm 3
    VHOST_CONFIG: [2]: iova ffffe000 - uaddr 7ff3d5d72000 - size 1000 - perm 3
    VHOST_CONFIG: [3]: iova fffff000 - uaddr 7ff3d5d73000 - size 1000 - perm 3
    VHOST_CONFIG: vhost_user_iotlb_cache_dump: vq@0x7ff3e9c95640 END

Should I continue testing in order to get the above info (since it's not 100% reproducible)? Also, if you need QE environments, please let me know in advance and I can prepare them for you.

Best Regards,
Pei

Maxime Coquelin (comment #12):

Thanks Pei for sharing your environment, I managed to reproduce the issue.

On both source and destination hosts, 1G hugepages are used in both host and guest. In the source testpmd, I can see that the IOTLB entry sizes are 1G, which is expected. Then migration is triggered and succeeds. But in the destination testpmd, I can see that the IOTLB entry sizes are 4KB, which is unexpected and causes the OOM.

Peter, any idea what could go wrong on the destination so that IOTLB entries are 4KB whereas host and guest use 1G hugepages?

I need to try again on my local setup using the same QEMU version to see whether IOTLB entries are at the expected size on the destination. Please note that in my local setup I used 2MB hugepages in host and guest, and the guest is running Fedora 27, not RHEL 7.5.

I think we need to open another bug in the qemu-kvm-rhev component to tackle this issue.

In DPDK, I propose the following changes to make it more robust and improve performance for pages smaller than 1G:
1. Have a dedicated pool for IOTLB pending misses.
2. When inserting IOTLB entries, merge the ones that are contiguous in both guest IOVA and host virtual address spaces.

Thanks,
Maxime

Maxime Coquelin (comment #13):

(In reply to Maxime Coquelin from comment #12)

> 1. Have a dedicated pool for IOTLB pending misses.

I actually fixed the OOM differently, by doing a random evict in the IOTLB cache if the pending list is empty. The patch also does the opposite when the OOM happens while inserting an IOTLB entry into the cache. The patch was tested on Pei's setup: with it there is no more OOM, but it is still slow because of the 4K granularity of the IOTLB entries (instead of the expected 1G).

I generated a brew build [0] and posted the fix upstream for review [1].

> 2. When inserting IOTLB entries, merge the ones that are contiguous in
> both guest IOVA and host virtual address spaces.

Thinking again, this is more an optimization than a fix, and the schedule might be too tight to have it in RHEL 7.5. I propose this change for RHEL 7.6.

Maxime

[0]: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=15100469
[1]: http://dpdk.org/dev/patchwork/patch/34476/
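For reference, the out-of-memory handling change described in comment #13 can be sketched as a modification of the vhost_user_iotlb_pending_insert() snippet quoted in comment #2. This is only an illustration of the approach ("evict from the cache when the pending list is already empty"), not the upstream patch itself; the helper name vhost_user_iotlb_cache_random_evict() and the iotlb_pending_list field are assumptions.

    /*
     * Sketch of the approach from comment #13, not the actual fix: when the
     * shared IOTLB mempool is exhausted, first drop pending misses as before;
     * if the pending list is already empty, evict a (pseudo-)random entry
     * from the IOTLB cache instead, then retry the allocation.
     */
    ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
    if (ret) {
        RTE_LOG(DEBUG, VHOST_CONFIG, "IOTLB pool empty, clearing entries\n");

        if (!TAILQ_EMPTY(&vq->iotlb_pending_list))
            vhost_user_iotlb_pending_remove_all(vq);
        else
            vhost_user_iotlb_cache_random_evict(vq); /* assumed helper name */

        ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
        if (ret) {
            RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
            return;
        }
    }

Comment #13 notes that the symmetric case is handled as well: when the OOM happens while inserting into the cache, the pending list is the one that gets trimmed.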
Peter Xu (comment #14):

(In reply to Maxime Coquelin from comment #12)

> Peter, any idea what could go wrong on the destination so that IOTLB
> entries are 4KB whereas host and guest use 1G hugepages?

This is strange. Logically speaking, this should only depend on how the guest IOMMU driver sets up the IOMMU page tables for the device, and QEMU will read from there. Meanwhile, the page tables inside the guest should not change after the migration IMHO, including the information on whether hugepages are used.

Is there more information? E.g., are the mappings still consistent at least? Say, if IOVA1 maps to GPA1 on the source, will that still be correct on the destination (even though the hugepage information is lost)?

> I think we need to open another bug in the qemu-kvm-rhev component to
> tackle this issue.

I agree. Let's open another bug to track this. Please feel free to assign it to me.

Peter

Maxime Coquelin (comment #15):

Hi,

The DPDK vhost-user fixes have been accepted upstream and will be part of v17.11-rc3:
http://dpdk.org/dev/patchwork/patch/34663/
http://dpdk.org/dev/patchwork/patch/34664/

Cheers,
Maxime

Flavio Leitner (fbl):

Maxime, can you confirm that they are included in v17.11? If so, we can close this as a duplicate of bz#1522700.

Thanks,
fbl

Maxime Coquelin:

Hi Flavio,

Sorry, my mistake: I meant the patches are part of *v18.02-rc3*, not v17.11-rc3.

The patches are also queued in the v17.11 LTS branch, so they will be part of the next v17.11 stable release. I asked Matteo whether we should pick them for the next release. As the patches aren't very urgent and the next release deadline is close, we agreed to postpone them to the following release.

Regards,
Maxime

Thanks, reassigning to move this further.

Matteo Croce:

Hi Maxime, aren't these two patches the same as in bug 1541881? If so, I'll change the state to MODIFIED.

Maxime Coquelin:

Hi Matteo,

Yes, they are the same patches.

Cheers,
Maxime

*** This bug has been marked as a duplicate of bug 1541881 ***