Bug 1486758

Summary: testpmd pmd threads getting page faults
Product: Red Hat Enterprise Linux 7
Component: dpdk
Version: 7.5
Reporter: Luiz Capitulino <lcapitulino>
Assignee: Eelco Chaudron <echaudro>
QA Contact: Jean-Tsung Hsiao <jhsiao>
CC: bhu, dgilbert, fleitner, linville, maxime.coquelin, pezhang
Status: CLOSED UPSTREAM
Severity: medium
Priority: medium
Target Milestone: rc
Keywords: Extras
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-10-10 07:47:38 UTC
Attachments: full trace output (attachment 1320079, see comment 2)

Description Luiz Capitulino 2017-08-30 13:45:07 UTC
Description of problem:

We have a PVP setup where we use testpmd as a vswitch on the host (plus testpmd in the guest). We have a test-case using this setup where we generate high I/O load in the guest. This test-case reliably causes packet drops within 30 minutes.

When debugging the packet drops, we can see PMD threads entering the kernel due to page faults on mmapped files:

lcore-slave-2-12459 [002]   460.774942: user_exit:            
lcore-slave-2-12459 [002]   460.774943: page_fault_user:      address=0x7f56d7f84f30f ip=0x7f56d7f84f30f error_code=0x14
lcore-slave-2-12459 [002]   460.774944: xfs_filemap_fault:    dev 253:0 ino 0x2008817
lcore-slave-2-12459 [002]   460.774944: xfs_ilock:            dev 253:0 ino 0x2008817 flags MMAPLOCK_SHARED caller xfs_filemap_fault
lcore-slave-2-12459 [002]   460.774946: xfs_iunlock:          dev 253:0 ino 0x2008817 flags MMAPLOCK_SHARED caller xfs_filemap_fault
lcore-slave-2-12459 [002]   460.774947: user_enter:

The PMD threads get several page faults in a row, which in total can interrupt them for dozens of microseconds. The files being mapped in are drivers such as /usr/lib64/librte_pmd_ixgbe.so.1 and /usr/lib64/librte_pmd_vhost.so.1.

My hypothesis is that, as testpmd is dynamically linked against the DPDK libraries, it's probably not prefaulting those mapped files. So it may be that the PMD threads are the first to execute some not-yet-mapped library code and take the faults. Also, I couldn't find any call to mlockall() in the DPDK source code. This means that testpmd's pages are not locked into memory, so the possibility also exists that pages containing library code get evicted from memory.

While testpmd is not used in production and hence no product should be affected, this issue greatly affects testing. To fix it, we could link testpmd statically and add support for mlockall() in testpmd or in the EAL itself.
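
For illustration, here is a minimal sketch (not the actual testpmd code) of what the proposed mlockall() call at startup would look like:

  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <sys/mman.h>

  int main(void)
  {
      /* Lock all pages mapped now (text, data, shared libraries) and any
       * future mappings into RAM, so PMD threads never fault on them. */
      if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
          fprintf(stderr, "mlockall failed: %s\n", strerror(errno));

      /* ... rest of application startup, e.g. rte_eal_init() ... */
      return 0;
  }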

NOTE: I've checked with Flavio Leitner and OVS-DPDK should not have this issue, since it links statically to DPDK libs and has mlockall() support.

Version-Release number of selected component (if applicable): dpdk-17.05-3.el7fdb.x86_64


How reproducible:

Reproducing with our test-case is complex, but a simpler reproducer is to trace page faults while launching testpmd and observe that page faults keep happening in the main thread even after startup.
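
For example, the fault counters in /proc/<pid>/stat (fields 10 and 12, minflt and majflt) can be sampled from a small helper; this watcher is illustrative and not part of the report:

  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      char path[64];
      unsigned long minflt, majflt;

      if (argc != 2) {
          fprintf(stderr, "usage: %s <pid>\n", argv[0]);
          return 1;
      }
      snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);

      for (;;) {
          FILE *f = fopen(path, "r");
          if (!f)
              return 1;
          /* Skip pid, comm, state, ppid, pgrp, session, tty_nr, tpgid and
           * flags, then read minflt; skip cminflt, then read majflt. */
          if (fscanf(f, "%*d (%*[^)]) %*c %*d %*d %*d %*d %*d %*u %lu %*u %lu",
                     &minflt, &majflt) == 2)
              printf("minflt=%lu majflt=%lu\n", minflt, majflt);
          fclose(f);
          sleep(1);
      }
  }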

Comment 2 Luiz Capitulino 2017-08-30 13:47:07 UTC
Created attachment 1320079 [details]
full trace output

This is the full trace output. Note that there's even page allocation going on when the PMD thread is in kernel-space.

Comment 4 Luiz Capitulino 2017-09-02 02:27:02 UTC
I took a quick look at this today. I was hoping that there would be an option in DPDK config that would allow me to build testpmd statically but still provide the dynamic version of the libraries (which I think is what we want for RHEL, no?). But the only option I found was to have everything dynamic or everything static.

Do we only provide dynamic versions of the libraries in RHEL7? Does this mean that apps developed on RHEL7 will have to support prefaulting by themselves? Is this actually possible, given that the dynamic libs are loaded by the EAL lib itself? I feel like I'm missing something important, because my findings seem to indicate that we may want to do the opposite of what we're doing today.

Anyway, as I can't be blocked by this, I've built a dpdk package for my own testing which is statically linked, and I've also added a call to mlockall() in testpmd initialization. I can't reproduce my issue with that build. Another very interesting test to compare static vs. dynamic testpmd would be to check whether the statically built one can achieve higher throughput in the zero-loss test-case in the PVP scenario. But this will have to wait several days, as I have other tests to run.

Comment 6 Luiz Capitulino 2017-09-05 13:28:30 UTC
Eelco,

The most important thing is having testpmd statically linked. Getting page faults because of the missing mlockall() should be extremely hard: you'd need memory pressure, and the kernel would have to evict pages containing code testpmd is using. Still, having mlockall() is necessary and correct.

I think the first step for this BZ is to acknowledge (or confirm) that having testpmd dynamically linked is the root cause of the page faults I'm seeing. Once this is done, the next step would be to fix the dpdk package to ship a statically linked testpmd. Finally, I think we have to offer static libs for customers developing DPDK apps on RHEL7.

The "feeling of missing something important" I mentioned in comment 4 is because there's a big machinery in DPDK build system to allow shared libs but if they can incurr page faults, why offer this in the first place?

Anyway, I'm CC'ing the package maintainers, as this is mostly about packaging.

Comment 7 Eelco Chaudron 2017-09-12 14:02:51 UTC
Did a lot of experiments, and I see these page faults with both a statically and a dynamically linked version of testpmd.

Linux will load pages on demand, and for the statically linked version we are probably just lucky: the pages get loaded because other "close" code is being executed.

Adding the mlockall() call fixes the page faults for both versions, as the .text pages get loaded and locked at startup. You can see this behaviour by examining /proc/<pid>/smaps.
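
As an illustration (not from this BZ), a small reader that sums the "Locked:" fields in smaps; after a successful mlockall() the total should be close to the process RSS:

  #include <stdio.h>

  int main(int argc, char **argv)
  {
      char path[64], line[256];
      unsigned long kb, total = 0;
      FILE *f;

      /* Default to our own smaps if no pid is given. */
      snprintf(path, sizeof(path), "/proc/%s/smaps",
               argc > 1 ? argv[1] : "self");
      f = fopen(path, "r");
      if (!f)
          return 1;
      while (fgets(line, sizeof(line), f))
          if (sscanf(line, "Locked: %lu kB", &kb) == 1)
              total += kb;  /* one "Locked:" entry per mapping */
      fclose(f);
      printf("total locked: %lu kB\n", total);
      return 0;
  }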

I do think DPDK should not force the mlockall() in its libraries; the NFV application should do it as part of its startup code. Also, for tasks other than performance testing, mlockall() might not make much sense, as over time the correct pages end up in memory and will probably stay there on a well-dimensioned system.

However, for testpmd I have sent out a patch to include mlockall(); let's see the upstream response. The patch can be found here:

  http://dpdk.org/ml/archives/dev/2017-September/075176.html

Comment 8 Luiz Capitulino 2017-09-12 14:16:12 UTC
(In reply to Eelco Chaudron from comment #7)
> Did a lot of experiments, and I see these page faults with both a statically
> and a dynamically linked version of testpmd.
> 
> Linux will load pages on demand, and for the statically linked version we
> are probably just lucky: the pages get loaded because other "close" code is
> being executed.
> 
> Adding the mlockall() call fixes the page faults for both versions, as the
> .text pages get loaded and locked at startup. You can see this behaviour by
> examining /proc/<pid>/smaps.

That's an awesome finding! When doing the static vs. dynamic comparisons, I always had mlockall() for the static version of testpmd. But I assumed I wasn't getting page faults because it was statically linked, and hence the kernel would load the whole text section at once at startup. I did not think mlockall() had anything to do with it. I think I've learned something today :)

> I do think DPDK should not force the mlockall() in its libraries; the NFV
> application should do it as part of its startup code. Also, for tasks other
> than performance testing, mlockall() might not make much sense, as over time
> the correct pages end up in memory and will probably stay there on a
> well-dimensioned system.

I disagree with that part. I think that, at the very least, EAL should have a command-line option to call mlockall() from the DPDK application. But that's an upstream discussion. The most important thing for the immediate term is to get testpmd fixed for those using it for testing.
 
> However, for testpmd I have sent out a patch to include mlockall(); let's
> see the upstream response. The patch can be found here:
> 
>   http://dpdk.org/ml/archives/dev/2017-September/075176.html

Thanks a lot, that patch certainly solves this BZ.

Comment 9 Maxime Coquelin 2017-09-12 14:23:48 UTC
Hi Eelco,

Thanks for the patch.
I have one question about using mlockall() with regard to postcopy live-migration (adding David in cc:).

When mlockall(MCL_CURRENT | MCL_FUTURE) is called, do you know whether later mmap() calls on shared memory without the MAP_POPULATE flag will prefault the mmap'ed pages or not?

Comment 10 Eelco Chaudron 2017-09-12 14:51:59 UTC
(In reply to Maxime Coquelin from comment #9)
> Hi Eelco,
> 
> Thanks for the patch.
> I have one question about using mlockall() with regard to postcopy
> live-migration (adding David in cc:).
> 
> When mlockall(MCL_CURRENT | MCL_FUTURE) is called, do you know whether later
> mmap() calls on shared memory without the MAP_POPULATE flag will prefault
> the mmap'ed pages or not?

I do not know the behaviour, but I found this online (assuming the code has not changed):

http://www.spinics.net/lists/linux-man/msg09821.html

Comment 11 Dr. David Alan Gilbert 2017-09-12 15:04:00 UTC
(In reply to Maxime Coquelin from comment #9)
> Hi Eelco,
> 
> Thanks for the patch.
> I have one question about using mlockall() with regard to postcopy
> live-migration (adding David in cc:).
> 
> When mlockall(MCL_CURRENT | MCL_FUTURE) is called, do you know whether later
> mmap() calls on shared memory without the MAP_POPULATE flag will prefault
> the mmap'ed pages or not?

I'm not sure (I always ask aarcange for this type of thing).
(Note in the postcopy code in qemu I do an munlock before the postcopy and turn it back on later)
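
A generic sketch of that unlock/relock pattern (not QEMU's actual code; the helper names are made up):

  #include <sys/mman.h>

  /* Hypothetical helpers: release the memory locks so postcopy can
   * demand-fault pages in, then pin everything again afterwards. */
  static void begin_postcopy_phase(void)
  {
      munlockall();
  }

  static void end_postcopy_phase(void)
  {
      mlockall(MCL_CURRENT | MCL_FUTURE);
  }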

Comment 12 Maxime Coquelin 2017-09-13 09:14:31 UTC
(In reply to Eelco Chaudron from comment #10)
> (In reply to Maxime Coquelin from comment #9)
> > Hi Eelco,
> > 
> > Thanks for the patch.
> > I have one question about using mlockall() with regard to postcopy
> > live-migration (adding David in cc:).
> > 
> > When mlockall(MCL_CURRENT | MCL_FUTURE) is called, do you know whether later
> > mmap() calls on shared memory without the MAP_POPULATE flag will prefault
> > the mmap'ed pages or not?
> 
> I do not know the behaviour, but I found this online (assuming the code has
> not changed):
> 
> http://www.spinics.net/lists/linux-man/msg09821.html


Thanks for the link, that's interesting.
IIUC, MCL_ONFAULT could do the trick, but we may want to do:

  mlockall(MCL_CURRENT);
  mlockall(MCL_FUTURE | MCL_ONFAULT);
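
As a self-contained sketch of that two-call idea (MCL_ONFAULT needs Linux >= 4.4; error handling kept minimal):

  #include <sys/mman.h>

  static int lock_memory(void)
  {
      /* Lock and fault in everything mapped so far. */
      if (mlockall(MCL_CURRENT) != 0)
          return -1;
      /* Lock future mappings only as their pages are first touched, so a
       * later mmap() of guest memory is not prefaulted and postcopy can
       * still demand-fault it. */
      return mlockall(MCL_FUTURE | MCL_ONFAULT);
  }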

But it may be better to keep it as you proposed and do an munlock()/mlock() at postcopy time, as David suggests:

(In reply to Dr. David Alan Gilbert from comment #11)
> I'm not sure (I always ask aarcange for this type of thing).
> (Note in the postcopy code in qemu I do an munlock before the postcopy and
> turn it back on later)

I'll keep this in mind when implementing postcopy; for now, your patch is fine with me.

Thanks,
Maxime

Comment 13 Eelco Chaudron 2017-10-10 07:47:38 UTC
The patches for testpmd and the documentation got accepted upstream:

http://dpdk.org/ml/archives/dev/2017-October/078316.html
http://dpdk.org/ml/archives/dev/2017-October/078317.html