Bug 1786923 - testpmd stuck on real time kernel
Summary: testpmd stuck on real time kernel
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: dpdk
Version: 8.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.0
Assignee: David Marchand
QA Contact: Jean-Tsung Hsiao
URL:
Whiteboard:
Duplicates: 1689876 (view as bug list)
Depends On:
Blocks: 1755139 1771572 1883636
 
Reported: 2019-12-29 09:55 UTC by Sebastian Scheinkman
Modified: 2020-12-24 08:22 UTC
CC List: 26 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-21 16:57:32 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Sebastian Scheinkman 2019-12-29 09:55:43 UTC
Description of problem:

testpmd stuck on real time kernel

Version-Release number of selected component (if applicable):


How reproducible:
100%


Steps to Reproduce:
1. Install rhel8/rhcos with a real-time kernel
2. Install the dpdk package
3. Run the testpmd binary

Actual results:
sh-4.4# testpmd -l 4,40,42,44 -w 0000:19:00.6 --iova-mode=va -- -i
EAL: Detected 80 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: PCI device 0000:19:00.6 on NUMA socket 0
EAL:   probe driver: 15b3:1016 net_mlx5
net_mlx5: flow rules relying on switch offloads will not be supported: netlink: failed to remove ingress qdisc: Operation not permitted
Interactive-mode selected

< stuck here for about 1 min >

testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc

Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.

Port 0 is now not stopped
Please stop the ports first
Done


Expected results:
This is the output on a regular rhcos machine:

testpmd -l 4,40,42,44 -w 0000:19:00.6 --iova-mode=va -- -i
EAL: Detected 80 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: PCI device 0000:19:00.6 on NUMA socket 0
EAL:   probe driver: 15b3:1016 net_mlx5
net_mlx5: flow rules relying on switch offloads will not be supported: netlink: failed to remove ingress qdisc: Operation not permitted
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc

Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.

Configuring Port 0 (socket 0)
Port 0: 06:3E:FF:3B:B7:7A
Checking link statuses...
Done


Additional info:
Real time kernel version: 4.18.0-147.3.1.rt24.96.el8_1.x86_64
Regular kernel: 4.18.0-147.3.1.el8_1.x86_64
Dpdk package version: 18.11

Stuck just before this message:

testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0

Comment 3 Maxime Coquelin 2020-01-06 13:33:34 UTC
Hi Sebastian,

It looks like bz1689876.
Do you have a tuned profile enabled?

You could check whether this is the same issue using strace:
# strace -T -e trace=mlockall ./install/bin/testpmd -l 4,40,42,44 -w 0000:19:00.6 --iova-mode=va -- -i

Comment 4 Sebastian Scheinkman 2020-01-06 16:21:26 UTC
Hi Maxime,

comment inline

(In reply to Maxime Coquelin from comment #3)
> Hi Sebastian,
> 
> It looks like bz1689876.
> Do you have a tuned profile enabled?

This is the tune we are using:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-realtime-node
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Optimize systems running OpenShift realtime nodes
      include=openshift-node-network-latency

      [selinux]
      avc_cache_threshold=8192

      [net]
      nf_conntrack_hashsize=131072

      [sysctl]
      kernel.hung_task_timeout_secs = 600
      kernel.nmi_watchdog = 0
      kernel.sched_rt_runtime_us = -1
      vm.stat_interval = 10
      kernel.timer_migration = 0

      [sysfs]
      /sys/devices/system/machinecheck/machinecheck*/ignore_ce = 1   

      [scheduler]
      isolated_cores=8,9,10,11,12,13,14,15,30,31,32,33,34,35,36,37,38,39,40,50,51,52,53,54
    name: openshift-realtime-node
  recommend:
  - priority: 10
    profile: openshift-realtime-node
    match:
    - label: node-role.kubernetes.io/worker-rt


> 
> You could check whether this is the same issue using strace:
> # strace -T -e trace=mlockall ./install/bin/testpmd -l 4,40,42,44 -w
> 0000:19:00.6 --iova-mode=va -- -i

Sure, here is the output:

sh-4.4# strace -T -e trace=mlockall testpmd -l 2,4,40,44 -w 0000:19:01.0 --iova-mode=va -- -i
EAL: Detected 80 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: PCI device 0000:19:01.0 on NUMA socket 0
EAL:   probe driver: 15b3:1016 net_mlx5
net_mlx5: flow rules relying on switch offloads will not be supported: netlink: failed to remove ingress qdisc: Operation not permitted
Interactive-mode selected
mlockall(MCL_CURRENT|MCL_FUTURE)        = 0 <160.870205>
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc

Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.

Configuring Port 0 (socket 0)
Port 0: 1E:C1:C2:0F:73:1E
Checking link statuses...
Done
testpmd> quit

Stopping port 0...
Stopping ports...
Done

Shutting down port 0...
Closing ports...
Done

Bye...
+++ exited with 0 +++


btw the "Bye..." also takes almost 2 min


Thanks!

Comment 5 Luiz Capitulino 2020-01-06 21:49:30 UTC
(In reply to Maxime Coquelin from comment #3)
 
> It looks like bz1689876.
> Do you have a tuned profile enabled?

Indeed, I think you're right that this looks like bug 1689876.

Now, the initial findings on bug 1689876 comment 10 are that testpmd
is trying to lock around 200GB on startup, even when instructed to
allocate -m 1024. So, my questions would be:

- Why is testpmd trying to lock 200GB?
- Has it always worked like that or is it a recent change?
- Can we reproduce this issue without the real-time kernel and/or
  without the real-time profile? If we can't reproduce it, then
  finding out what in the RT kernel is causing this may shed some
  light on what is going on here

Comment 6 Neil Horman 2020-01-07 20:15:15 UTC
Some answers to the questions in comment 5:

- Why is testpmd trying to lock 200GB
   It's ostensibly for performance reasons.  Those segments listed in comment 10 of bz1689876 are what dpdk uses for internal memory management.  Instead of using glibc's malloc, they preallocate file-backed shared memory in slab sizes for each cpu core, and manage it internally so that they don't have to allocate pages on demand (see the sketch at the end of this comment).  I think the values of those ranges can be controlled with the testpmd --socket-mem option (which is distinct from the -m option, though I'm not 100% clear on how)

- Has it always worked like this
     More or less.  They've done some rewriting of how the segment areas are tracked, but I think they've always tried to allocate memory like this.  I will note that the dpdk core library isn't responsible for locking the memory, only allocating the address space.  Testpmd specifically does the locking.  There should be some sort of config option that allows for the application to not lock the memory at run time.  Point being that, while testpmd locks all the ram in place, other applications may choose not to.

- Can we reproduce without the RT kernel
     I honestly don't know, but bz1689876 seems to suggest that you are able to do so, at least in part.
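
(A hedged illustration of the preallocation technique described above, not DPDK's actual allocator; the slab size and naming are invented for the example:)

/* Hypothetical sketch of per-core, file-backed, preallocated shared memory.
 * Not DPDK code; the slab size and name are invented for the example. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLAB_SIZE (64UL << 20)   /* 64 MB per core, arbitrary example value */

int main(void)
{
    /* Anonymous shared file standing in for one core's slab. */
    int fd = memfd_create("core0-slab", 0);

    if (fd < 0 || ftruncate(fd, SLAB_SIZE) < 0)
        return 1;

    /* Map the whole slab up front; an allocator would then hand out chunks
     * of this region instead of calling malloc and faulting pages on demand. */
    void *slab = mmap(NULL, SLAB_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (slab == MAP_FAILED)
        return 1;

    memset(slab, 0, SLAB_SIZE);  /* touch every page now, not on the hot path */
    printf("slab at %p\n", slab);
    return 0;
}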

Comment 7 Peter Xu 2020-01-07 20:34:53 UTC
(In reply to Neil Horman from comment #6)
> - Why is testpmd trying to lock 200GB
>    Its ostensibly for performance reasons.  Those segments listed in comment
> 10 of bz1689876 are what dpdk uses for internal memory management.  Instead
> of using glibc's malloc, they preallocate file backed shared memory in slab
> sizes for each cpu core, and manage it internally so that they don't have to
> allocate pages on demand.  I think the values of those ranges can be
> controlled with the testpmd --socket-mem option (which is distinct from the
> -m option, though I'm not 100% clear on how)

Hi, Neil,

Thanks for answering the question.  I can totally understand that DPDK wants to avoid demand paging; however, to me it still does not make much sense to reserve 200G of anonymous memory by default, even if we only specified 1 * 1G huge page (-m 1024).  If the program runs, will it really eat up the 200G of mem?  Would it make sense to make it smaller, or at least linear in how much hugepage memory we will use?

(Side note: I'm still confused about why DPDK would need a lot of 4K pages at all, because IIUC most of the data should be on the huge pages; and if the accesses to those 4K pages are on the IO hot path, such that we don't want demand paging, why not use huge pages for them too, which makes TLB hits easier and are pinned by default?)

Thanks,
Peter

Comment 8 Neil Horman 2020-01-09 01:34:46 UTC
Hey Peter-
     I agree with you, reserving that much memory for the use case being targeted here doesn't make much sense - nominally dpdk was expected to run on a whole bare-metal host, where dpdk effectively becomes the operating system for all intents and purposes.  That really changes when you move it into a container environment.  I agree it seems like you should be able to have a single knob to control how much memory is allocated, but the lack of one seems to be an artifact of how dpdk evolved in its development.  The --socket-mem option I think should help you tune this more finely (at least for the testpmd application)

In regard to your question about 4k pages, DPDK uses those for network I/O.  Since DPDK effectively controls network hardware directly from user space, it does all its tx/rx offload directly to that memory, so coalesced frames (i.e. GRO, etc.) get dma-ed directly into that space, and multiqueue NICs need lots of 4k pages

Comment 14 Marcelo Tosatti 2020-01-09 22:33:23 UTC
From BZ 1689876:

"DPDK testmpd startup time increases a lot when tuned realtime-virtual-host
profile is enabled.

Analysis on testpmd side shows that it's mlockall() syscall duration
that is increased (from 0.04 seconds with tuned disabled to 5+ seconds when
realtime-virtual-host profile is enabled for a single 1GB hugepage provided to testpmd)."

This is probably the slowdown that the perf team's efforts have already produced
improvements for (locking was slower).

So I'd start with:

1) Check whether the recent kernel-rt speedups fix this (Juri knows more details).

2) If still suboptimal, debug mlockall() and see where the problem is.
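
(Illustrative only, not part of the original comment: a minimal way to time mlockall() against a large 4K-backed mapping, roughly mimicking what testpmd does. Run as root, or raise RLIMIT_MEMLOCK, so the locking is permitted.)

/* Hypothetical sketch: time a bare mlockall() call against a large anonymous
 * mapping on the current kernel/profile. Not from the bug report. */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define MAP_SIZE (2UL << 30)   /* 2 GB, arbitrary for the example */

int main(void)
{
    struct timespec t0, t1;

    if (mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int ret = mlockall(MCL_CURRENT | MCL_FUTURE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("mlockall() returned %d after %.6f seconds\n", ret,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

Comparing its runtime with /sys/kernel/mm/transparent_hugepage/enabled set to [always] versus [never] should show the THP effect discussed in the comments below.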

Comment 26 Peter Xu 2020-01-10 20:12:06 UTC
(In reply to Luiz Capitulino from comment #25)
> Guys,
> 
> Just to be sure I'm following, can you confirm the following:
> 
> 1. We have proven that what causes this issue is disabling THP
>    (ie. we have reproduced the issue only by disabling THP and
>     without having any tuned profile applied)

I was testing with RHEL8 (non-RT), and the comparison is between:

  (1) throughput-performance profile, mlockall() took 0.023104 seconds (according to comment 19)
  (2) throughput-performance profile and disabling THP, mlockall() took 6.004354 (according to comment 23)

So I think it proves that on RHEL8 (non-RT) we got such a difference because of disabling THP (which we did in the "network-latency" profile).

> 
> 2. The issue also reproduces with non-RT, ie. it's not RT specific

Yes, I think it should be something different from tuned for RHEL8-RT.  However, it could be the same issue behind it, depending on whether RHEL8-RT disables THP by default.

> 
> Is this correct?
> 
> If this is correct, then while I think it's worthwhile understanding
> why we disable THP in the profiles, we should not incur this issue
> just because THP is disabled. This is our bug, IMHO.

Disabling THP will make mlockall() walk into each PMD (which contains 512 extra PTEs to scan and operate on), so it makes sense that it takes a longer time (e.g. 6 seconds for 200G).  So it's still possible that it's not a RHEL8 kernel bug.
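
(Illustrative arithmetic, not from the original comment: locking 200G means walking 200G / 2M = 102,400 entries with THP, but 200G / 4K = 52,428,800 PTEs without it, 512 times as many, which matches the order of magnitude of 0.023s growing to ~6s.)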

> 
> May I suggest we try with the upstream kernel? Maybe it's a RHEL8 kernel
> regression.

Yes it would be good to try.

As a summary, we still have at least two things to make sure:

  (1) Why RHEL7 does not have this problem
  (2) Why RHEL8-RT has this problem even without tuned

I'll continue with (1) (probably next Monday, though...).  Would be good if someone wants to dig (2) at the same time, or I'll continue after I figure out (1).

Comment 27 Peter Xu 2020-01-10 21:55:07 UTC
(In reply to Peter Xu from comment #26)
>   (1) Why RHEL7 does not have this problem

Well... I got the same testpmd hang on a RHEL7 host...

[root@virtlab422 ~]# uname -r
3.10.0-1062.el7.x86_64
[root@virtlab422 ~]# tuned-adm active
Current active profile: network-latency
[root@virtlab422 ~]# tuned-adm profile throughput-performance 
[root@virtlab422 ~]# tuned-adm active
Current active profile: throughput-performance
[root@virtlab422 ~]# strace -T -e mlockall testpmd -m 1024 2>&1 | grep mlockall
mlockall(MCL_CURRENT|MCL_FUTURE)        = 0 <0.169838>                   <---------------------- no hang
[root@virtlab422 ~]# tuned-adm profile network-latency 
[root@virtlab422 ~]# tuned-adm active
Current active profile: network-latency
[root@virtlab422 ~]# strace -T -e mlockall testpmd -m 1024 2>&1 | grep mlockall
mlockall(MCL_CURRENT|MCL_FUTURE)        = 0 <11.832785>                  <---------------------- hang for 11 seconds
[root@virtlab422 ~]# tuned-adm --version
tuned-adm 2.10.0
[root@virtlab422 ~]# cat /sys/kernel/mm/transparent_hugepage/enabled 
always madvise [never]

Maxime, do you still remember which kernel/tuned version were you using when you tested with RHEL7?

Thanks,

Comment 29 Juri Lelli 2020-01-13 07:49:30 UTC
(In reply to Peter Xu from comment #26)
> (In reply to Luiz Capitulino from comment #25)
> > Guys,
> > 
> > Just to be sure I'm following, can you confirm the following:
> > 
> > 1. We have proven that what causes this issue is disabling THP
> >    (ie. we have reproduced the issue only by disabling THP and
> >     without having any tuned profile applied)
> 
> I was testing with RHEL8 (non-RT), and the comparison is between:
> 
>   (1) throughput-performance profile, mlockall() took 0.023104 seconds
> (according to comment 19)
>   (2) throughput-performance profile and disabling THP, mlockall() took
> 6.004354 (according to comment 23)
> 
> So I think it proves that on RHEL8 (non-RT) we got such a difference because
> of disabling THP (which we did in the "network-latency" profile).
> 
> > 
> > 2. The issue also reproduces with non-RT, ie. it's not RT specific
> 
> Yes, I think it should be something different from tuned for RHEL8-RT.
> However, it could be the same issue behind it, depending on whether RHEL8-RT
> disables THP by default.

Yes, RHEL-RT does disable THP by default (I believe the reason is to avoid
this kind of problem).

> 
> > 
> > Is this correct?
> > 
> > If this is correct, then while I think it's worthwhile understanding
> > why we disable THP in the profiles, we should not incur this issue
> > just because THP is disabled. This is our bug, IMHO.
> 
> Disabling THP will make mlockall() walk into each PMD (which contains 512
> extra PTEs to scan and operate on), so it makes sense that it takes a longer
> time (e.g. 6 seconds for 200G).  So it's still possible that it's not a
> RHEL8 kernel bug.
> 
> > 
> > May I suggest we try with the upstream kernel? Maybe it's a RHEL8 kernel
> > regression.
> 
> Yes it would be good to try.

Upstream RT disables THP as well (we followed along).

> As a summary, we still have at least two things to make sure:
> 
>   (1) Why RHEL7 does not have this problem
>   (2) Why RHEL8-RT has this problem even without tuned
> 
> I'll continue with (1) (probably next Monday, though...).  Would be good if
> someone wants to dig (2) at the same time, or I'll continue after I figure
> out (1).

Comment 30 Peter Xu 2020-01-13 14:30:44 UTC
(In reply to Marcelo Tosatti from comment #28)
> THP is disabled because khugepaged will scan looking for 4k pages to merge
> in 2M pages.
> This activity can interfere with packet processing negatively.

Yeah, it makes total sense to disable THP for determinism.

Also with Juri's comment 29 (thanks for following up!) I think we're pretty sure about why RHEL8-RT suffers from this even without tuned, and we also know why RHEL8 suffers too if the tuned realtime-virtual-host profile is applied (which is actually network-latency behind the scenes).

Then this bug could be even closer to NOTABUG, and the solution could be to always use "--socket-mem" in the container scripts to avoid this boot delay when needed (I didn't try this, but I'm referring to comment 6 that Neil provided).

I think the only tiny missing piece of the puzzle is why RHEL7 didn't hang for Maxime when he tested initially, because logically it should as long as THP is disabled (and in my RHEL7 test it did hang as expected, according to comment 27, so I did not reproduce his result).

Comment 31 David Marchand 2020-01-13 14:38:25 UTC
(In reply to Peter Xu from comment #30)
> (In reply to Marcelo Tosatti from comment #28)
> > THP is disabled because khugepaged will scan looking for 4k pages to merge
> > in 2M pages.
> > This activity can interfere with packet processing negatively.
> 
> Yeah, it makes total sense to disable THP for determinism.
> 
> Also with Juri's comment 29 (thanks for following up!) I think we're pretty
> sure about why RHEL8-RT suffers from this even without tuned, and we also
> know why RHEL8 suffers too if the tuned realtime-virtual-host profile is
> applied (which is actually network-latency behind the scenes).
> 
> Then this bug could be even closer to NOTABUG, and the solution could be to
> always use "--socket-mem" in the container scripts to avoid this boot delay
> when needed (I didn't try this, but I'm referring to comment 6 that Neil
> provided).

--socket-mem won't prevent the dpdk allocator from reserving those huge holes in the process mapping.
It just tells the EAL how much memory should be populated at init.

# testpmd -w 0000:07:00.0 -w 0000:08:00.0 --socket-mem 2048,0 -- -i
...

VmLck:	134467112 kB


I am looking at some changes in testpmd, but as I said, the problem is that we must duplicate this in OVS too, and potentially customers will face the same issue if they have their own DPDK application.
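
(Illustrative helper, not from the bug: the VmLck counter shown above can be read by any process from /proc/self/status, e.g.:)

/* Hypothetical sketch: print this process's VmLck line from
 * /proc/self/status, the counter quoted above. Not from the bug report. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmLck:", 6) == 0)
            fputs(line, stdout);
    fclose(f);
    return 0;
}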

Comment 32 Maxime Coquelin 2020-01-13 15:49:41 UTC
(In reply to Peter Xu from comment #30)
> I think the only tiny missing piece of the puzzle is why RHEL7 didn't hang
> for Maxime when he tested initially, because logically it should as long as
> THP is disabled (and in my RHEL7 test it did hang as expected, according to
> comment 27, so I did not reproduce his result).

I just tried again on a rather old RHEL7 kernel (3.10.0-862.el7.x86_64), and I also reproduce the issue with it.
I think I made a mistake when testing on RHEL7 last year, leading me to think it did not reproduce on RHEL7.
Sorry for the confusion.

Comment 33 Luiz Capitulino 2020-01-13 20:31:15 UTC
Adding Andrea for THP expertise.

Andrea, the very short explanation of this issue is that testpmd (a DPDK test app) apparently
takes a very long time to initialize (several seconds? minutes?) when THP is disabled in the kernel.
This seems to happen because testpmd is trying to mlock() 200G. Others will have more details.

The question is whether this is an issue that should be fixed in the kernel.

Comment 34 Peter Xu 2020-01-16 20:58:50 UTC
(In reply to David Marchand from comment #31)
> I am looking at some change in testpmd, but as I said, the problem is that
> we must duplicate this in OVS too, and potentially customers will face the
> same issue if they have their own DPDK application.

Thanks for working on this!  IMHO it would be good to put a link to the DPDK patchset into the bz so that in the future people can reference it as a solution.

Though I still have one thing unclear about this: the 200G of memory that we observed seems to be read-only anonymous memory.  Do you (or Maxime) know why DPDK needs to allocate that huge chunk of read-only memory?  IIUC it will be all zeros (because it's both anonymous and read-only) regardless of the size.  With that, I failed to figure out how that chunk of buffer could be used in any useful way...  The question could be a bit off-topic for this bz, but both Luiz and I are confused about this, so I'm still posting it just in case there's a quick answer.

Thanks,
Peter

Comment 35 David Marchand 2020-01-16 21:06:14 UTC
DPDK has a feature where multiple processes have the same internal mappings (and remap the hugepages files to share memory).
Reserving those big ranges ensures that a process can remap at known places (well, with nothing but dpdk stuff at those addresses).

Comment 36 David Marchand 2020-01-16 21:07:39 UTC
Additional info: hugepage memory can be added at runtime; those reserved mappings are updated, and a synchronisation happens between the multiple processes.
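
(A hedged sketch of the reserve-then-remap pattern described above, not DPDK's actual code; it uses a memfd instead of a real hugetlbfs file to stay self-contained, and the sizes are invented:)

/* Hypothetical sketch: reserve a big VA range, then later map a shared
 * memory file at a known place inside it. Not DPDK code. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define RESERVE_SZ (1UL << 34)  /* 16 GB of address space, example value */
#define SEG_SZ     (1UL << 21)  /* one 2 MB segment */

int main(void)
{
    /* Reserve address space only. PROT_NONE also means mlockall() will
     * skip the range (see comment 46 below). */
    void *base = mmap(NULL, RESERVE_SZ, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return 1;

    /* Later: map a shared memory file at a known offset inside the
     * reservation, so cooperating processes can agree on the address. */
    int fd = memfd_create("example-seg0", 0);
    if (fd < 0 || ftruncate(fd, SEG_SZ) < 0)
        return 1;

    void *seg = mmap((char *)base + SEG_SZ, SEG_SZ, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_FIXED, fd, 0);
    if (seg == MAP_FAILED)
        return 1;

    printf("reserved %p, segment at %p\n", base, seg);
    return 0;
}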

Comment 37 Peter Xu 2020-01-16 21:23:52 UTC
(In reply to David Marchand from comment #35)
> DPDK has a feature where multiple processes have the same internal mappings
> (and remap the hugepages files to share memory).
> Reserving those big ranges is used to ensure that a process can remap at
> known places (well, with nothing but dpdk stuff at those addresses).

I see the point, thanks for the quick answer!  Then it also makes sense for a fix not to pin these buffers at all.

Comment 38 David Marchand 2020-01-16 21:29:48 UTC
DPDK itself is not asking for locking those ranges.
The problem is in the app using dpdk, here testpmd.


Little history: in v17.11, Eelco added the locking to achieve deterministic/reproducible benchmarks.
But later, in v18.05, the memory hotplug was added, with those mappings, and we now end up locking all those unused mappings.


I wrote a patch to lock only the used mappings and leave the rest with MCL_ONFAULT, but I did not get a chance to rerun this in Eelco's setup.
I still suspect that we will have issues... seeing the original commit from Eelco talking about prefaulting the driver code pages.
I might have to identify the ranges where the code has been mapped...
https://github.com/david-marchand/dpdk/commit/f9e1b9fa101c9f4f16c0717401a55790aecc6484
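
(A rough sketch of that approach, with an invented segment table; this is not the actual patch:)

/* Hypothetical sketch of "lock only the used mappings, leave the rest
 * ONFAULT". The segment table is invented; this is not the real patch. */
#include <stdio.h>
#include <sys/mman.h>

struct seg { void *addr; size_t len; };

int lock_used_memory(const struct seg *segs, unsigned int n)
{
    /* Unused reservations are only locked if they ever fault in. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) != 0)
        return -1;

    /* Populate and lock only the segments actually in use right now. */
    for (unsigned int i = 0; i < n; i++)
        if (mlock(segs[i].addr, segs[i].len) != 0)
            return -1;
    return 0;
}

int main(void)
{
    static char buf[1 << 20];                 /* stand-in "used" segment */
    struct seg segs[] = { { buf, sizeof(buf) } };

    printf("lock_used_memory: %d\n", lock_used_memory(segs, 1));
    return 0;
}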

Comment 39 Luiz Capitulino 2020-02-12 02:29:10 UTC
David,

Do you have an update on your patches?

Comment 40 David Marchand 2020-02-12 09:02:27 UTC
I have been out and busy those last weeks.

I started looking at Eelco's environment yesterday, and I am not able to reproduce his issue for now.
So a workaround on the dpdk side is a work in progress.

Sebastian, do you know of the --no-mlockall option to testpmd?
I did not mention it earlier, but it should be enough for now if you are blocked.

Comment 41 Luiz Capitulino 2020-02-13 22:26:14 UTC
*** Bug 1689876 has been marked as a duplicate of this bug. ***

Comment 42 David Marchand 2020-03-06 15:31:22 UTC
I posted a RFC for testpmd, http://patchwork.dpdk.org/patch/66347/
And for reference, the original bz that got mlockall introduced: bz1486758.

Comment 43 Luiz Capitulino 2020-03-06 16:15:55 UTC
(In reply to David Marchand from comment #42)
> I posted a RFC for testpmd, http://patchwork.dpdk.org/patch/66347/
> And for reference, the original bz that got mlockall introduced: bz1486758.

That's a nice fix and I think it's what Peter suggested.

Is a backport necessary? If yes, and if you plan on doing it, would
you take the BZ? It's assigned to a manager today :)

Comment 44 David Marchand 2020-03-06 16:23:18 UTC
The conclusion on the kernel side is not clear to me.
I still see this bz as a change in behavior on the kernel side that I am avoiding with a workaround on the dpdk side.

Comment 45 Luiz Capitulino 2020-03-06 16:53:57 UTC
So, as far as I can understand, the kernel issue side of the
story is that locking hundreds of GBs when using 4K pages is
a slow operation in the kernel (ie. it can take several seconds).

We could try to optimize this if there was a clear customer need.
However, in this case it shouldn't be necessary since DPDK apps
should be using hugepages anyway, which don't require locking. So,
DPDK apps should not run into this issue.

Also, the behavior change between the non-RT and real-time kernel
is due to THP (Transparent Huge Pages). The non-RT kernel uses THP
by default, which means testpmd is set up to use 2M pages instead
of 4K (which speeds up the locking process). However, the real-time
kernel disables THP, defaulting to 4K (back to our original problem).

So, my conclusion is: since there's no need for DPDK apps to lock
hundreds of GBs in 4K pages, your workaround is good enough to
resolve this issue.

Peter, please do jump in if this summary is wrong :)

Comment 46 Peter Xu 2020-03-06 23:33:47 UTC
Yes it should be fixable from userspace.

Today I got a chance to chat with Andrea, and Andrea pointed out a very important fact: for PROT_NONE memory, mlockall() will automatically skip it (get_user_pages will silently fail on it when doing mlockall; not sure whether that's by design).

So...

David, if DPDK only used mmap(PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS) for reserving VAs, would you mind trying to use:

  mmap(PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS);

instead?  As I mentioned, PROT_NONE memory regions will still be able to reserve VAs, but at the same time mlockall() will skip them automatically.  Maybe this is an even better solution than the ONFAULT one.

It can be verified with the program below:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int ret;
    /* Reserve ~400G of read-only anonymous memory, as DPDK does for VAs. */
    void *addr = mmap(NULL, 4096 * 100000000UL, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (addr == MAP_FAILED)
        return -1;

    /* With PROT_READ this walks and locks every 4K page and hangs for a
     * long time; with PROT_NONE, mlockall() skips the range entirely. */
    ret = mlockall(MCL_CURRENT);
    if (ret)
        return -2;

    printf("PID %d, ADDR %p\n", getpid(), addr);
    return 0;
}

This will try to reserve (and then lock) 400G of memory and hang.  If you change PROT_READ to PROT_NONE, it'll complete immediately.

With this, I'm helping to remove needinfo for Andrea too.

Comment 48 David Marchand 2020-03-07 17:43:53 UTC
If this behavior is not documented, it would be worth making this clear so that we can indeed rely on it for the long term.

This does sound like a good way to handle it, thanks for the idea, I'll have a try next week!

Comment 53 David Marchand 2020-03-19 15:15:09 UTC
Proposed a different fix upstream, following a suggestion by Andrea.
https://git.dpdk.org/dpdk/commit/?id=8a4baf06c17a806696fb10aba36fce7471983028

I will wait for more validation upstream before working downstream.

Comment 61 Yaniv Joseph 2020-12-24 08:22:34 UTC
Clearing needinfo, see comments #59 and #60.

