Bug 1213835 - numa_node_to_cpus() returns cached data
Summary: numa_node_to_cpus() returns cached data
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: numactl
Version: 7.1
Hardware: All
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assignee: Pingfan Liu
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On:
Blocks: 1400961 1711208
 
Reported: 2015-04-21 12:02 UTC by Andrea Bolognani
Modified: 2020-04-04 13:36 UTC
CC List: 8 users

Fixed In Version: numactl-2.0.12-4.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-31 20:08:17 UTC
Target Upstream Version:


Attachments (Terms of Use)
suggested patch to update the cpu to node mapping (3.10 KB, application/mbox)
2019-07-15 09:06 UTC, Pingfan Liu


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:1163 0 None None None 2020-03-31 20:08:27 UTC

Description Andrea Bolognani 2015-04-21 12:02:45 UTC
Description of problem:

  numa_node_to_cpus() caches the number of CPUs present in the system, which
  means that applications using it to query for this information will be given
  stale data if the hardware configuration has changed between calls.

  An example of this behavior can be seen in the libvirtd daemon, which is
  a long-running process linked against libnuma.

Version-Release number of selected component (if applicable):

  numactl-libs-2.0.9-4.el7.x86_64

How reproducible:

  Always.

Steps to Reproduce:

  1. The host needs to have at least two CPUs

  2. Query libvirt for host capabilities, both CPUs will be reported:

     [test@localhost ~]$ sudo virsh capabilities | grep '<cpu id='
         <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
         <cpu id='1' socket_id='1' core_id='0' siblings='1'/>

  3. Now bring one of the CPUs offline:

     [test@localhost ~]$ echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online

  4. Query libvirt again for host capabilities; the second CPU is still
     present in the output, but all of its information is missing because
     it can't be retrieved for offline CPUs:

     [test@localhost ~]$ sudo virsh capabilities | grep '<cpu id='
         <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
         <cpu id='1'/>

  5. It works the other way around, too: if only one CPU is online when
     libvirtd is started, and you bring the other one online later, only one
     CPU will be reported by libvirt.

Actual results:

  The number of CPUs reported by numa_node_to_cpus() doesn't reflect the
  changes that have occurred in the system since the process was started.

Expected results:

  The data reported by libnuma is always current.

Additional info:

  None

Comment 3 Petr Holasek 2015-05-05 11:07:49 UTC
Hi Andrea,

it seems like quite a big problem, since numactl wasn't designed to handle CPUs or nodes going online/offline at runtime.

Do you need to disable all caching for libvirtd? Or just particular calls?

Comment 4 Andrea Bolognani 2015-05-18 12:29:39 UTC
Hi Petr,

right now libvirt is just calling into libnuma, so we're not reacting in any
way to host topology changes, which can lead to weird output as evidenced in
the bug report.

The current implementation is quite puzzling: a short-lived utility like
numactl is probably going to query the data just once or twice, which makes a
cache not very useful; on the other hand, a long-lived process like libvirtd
is going to access the data several times and having a cache could really
speed things up, but the fact that the data in the cache might be outdated
kinda defeats the purpose.

Ideally libnuma should always report the current state of the host. I guess
there are two ways to ensure this is the case: get rid of the cache completely,
or rewrite it so that the data contained in it is kept up-to-date. The latter
would be more difficult to get right, and I'm not sure the benefit would be
noticeable since we're talking about a really small amount of data here.

Please note that there might be other cached data in the library, and some of
it might be accessed directly instead of through a function call, which would
make both approaches impossible without an API change. I haven't investigated
this in depth.

Cheers.

Comment 5 Petr Holasek 2015-07-01 13:06:34 UTC
Hi Andrea,

I've sent two patches upstream, but there has been no acceptance so far (http://comments.gmane.org/gmane.linux.kernel.numa/934). Based on that, I'm deferring the BZ to 7.3.

Comment 11 Pingfan Liu 2018-10-23 04:03:19 UTC
Hi Andrea,

I've just taken the bug and am wondering how to fix it. Would it be possible for libvirt to add a udev rule, so that each time a CPU goes online/offline the daemon is notified and can then call into numactl again to get the updated info? Or do you have any other suggestion?

Thanks,
Pingfan

Comment 12 Andrea Bolognani 2018-11-01 14:04:53 UTC
(In reply to Pingfan Liu from comment #11)
> Hi Andrea,
> 
> I've just taken the bug and am wondering how to fix it. Would it be possible
> for libvirt to add a udev rule, so that each time a CPU goes online/offline
> the daemon is notified and can then call into numactl again to get the
> updated info? Or do you have any other suggestion?

libvirt doesn't spawn numactl, but rather it links against libnuma
directly, which means your suggestion wouldn't really work...

Personally I would just get rid of the cache, but as I understand it
that approach was rejected upstream the last time it was proposed.

Comment 13 Pingfan Liu 2019-07-03 03:15:20 UTC
Hi Andrea,

For the method suggested in comment #5, the maintainer was concerned about the performance impact, so we cannot drop the cache. I am also thinking about another method, along the lines of how udev works. But before that, have you tried the method in comment #5? Does it work?

I just want to confirm that "/sys/devices/system/node/node%d/cpumap" reflects the online/offline CPUs, not the present CPUs.

Thanks,
  Pingfan

Comment 14 Pingfan Liu 2019-07-15 09:03:05 UTC
(In reply to Pingfan Liu from comment #13)
> Hi Andrea,
> 
> For the method suggested in comment #5, the maintainer was concerned about
> the performance impact, so we cannot drop the cache. I am also thinking
> about another method, along the lines of how udev works. But before that,
> have you tried the method in comment #5? Does it work?
> 
> I just want to confirm that "/sys/devices/system/node/node%d/cpumap"
> reflects the online/offline CPUs, not the present CPUs.
> 
Checking the kernel code: it reflects the mapping of possible CPUs.

Comment 15 Pingfan Liu 2019-07-15 09:05:08 UTC
Hi Andrea,

What about exporting an API, numa_node_to_cpu_outdate(), to mark the cached data as outdated after a CPU goes online/offline?

You can find the suggested patch in the attachment.

Thanks,
Pingfan

Comment 16 Pingfan Liu 2019-07-15 09:06:13 UTC
Created attachment 1590658 [details]
suggested patch to update the cpu to node mapping

Comment 17 Andrea Bolognani 2019-07-18 13:57:44 UTC
Hi, sorry for taking a while to get back to you.

The patch looks okay (though I'd prefer if the function was called
numa_node_to_cpu_mark_outdated() or something like that), and
reacting to CPU topology changes in user code using udev rules
doesn't sound too terrible either.

That said, I still stand by what I said four years (!) ago in
Comment 4: the basic idea of caching this data in libnuma seems
misguided, considering that short-lived processes will not stick
around long enough for it to matter, and long-lived processes on the
other hand will be hindered by the fact that they're served stale
data.

I tracked down an alternative URL for the patch posting mentioned in
Comment 5[1] and it doesn't look to me like upstream has really
NACKed the suggested approach of ditching the cache but merely asked
for some mitigation, which Petr promptly implemented; the discussion
simply stopped after that.

How about re-submitting Petr's patches and seeing whether that results
in a better outcome this time around?


[1] https://www.spinics.net/lists/linux-numa/msg01134.html

Comment 18 Pingfan Liu 2019-07-19 01:53:48 UTC
(In reply to Andrea Bolognani from comment #17)
> Hi, sorry for taking a while to get back to you.
> 
> The patch looks okay (though I'd prefer if the function was called
> numa_node_to_cpu_mark_outdated() or something like that), and
> reacting to CPU topology changes in user code using udev rules
> doesn't sound too terrible either.
> 
> That said, I still stand by what I said four years (!) ago in
> Comment 4: the basic idea of caching this data in libnuma seems
> misguided, considering that short-lived processes will not stick
> around long enough for it to matter, and long-lived processes on the
> other hand will be hindered by the fact that they're served stale
> data.
Yes, that would be the best design. But it would require introducing something like udev into libnuma, or into the caller.
So I think a slight change is more likely to be accepted by the maintainer.
> 
> I tracked down an alternative URL for the patch posting mentioned in
> Comment 5[1] and it doesn't look to me like upstream has really
> NACKed the suggested approach of ditching the cache but merely asked
> for some mitigation, which Petr promptly implemented; the discussion
> simply stopped after that.
But I cannot understand how a timeout can solve the performance issue. (https://www.spinics.net/lists/linux-numa/msg01137.html)
On the other hand, exporting an API numa_node_to_cpu_mark_outdated() can avoid the performance drop entirely, considering that bringing CPUs online/offline is not a very frequent event in production environments.

> 
> How about re-submitting Petr's patches and seeing whether that results
> in a better outcome this time around?
> 
> 
> [1] https://www.spinics.net/lists/linux-numa/msg01134.html

Comment 19 Pingfan Liu 2019-07-23 04:11:14 UTC
I have created a pull request: https://github.com/numactl/numactl/pull/73
and added some comments about https://www.spinics.net/lists/linux-numa/msg01134.html

Thanks,
Pingfan

Comment 20 Andrea Bolognani 2019-07-23 10:52:44 UTC
(In reply to Pingfan Liu from comment #19)
> I have created a pull request: https://github.com/numactl/numactl/pull/73

Alright, let's see how this goes :)

Good luck, and thanks for pushing this forward!

Comment 21 Pingfan Liu 2019-09-17 07:59:09 UTC
Hi Andrea,

The suggested patch has been merged and backported.
Could you test https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=23531514 to see whether it works for you?

Thanks,
Pingfan

Comment 24 Andrea Bolognani 2019-09-18 13:34:20 UTC
(In reply to Pingfan Liu from comment #21)
> Hi Andrea,
> 
> The suggested patch has been merged and backported.
> Could you test
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=23531514 to see
> whether it works for you?

Hi,

unfortunately I don't have time to try this right now, but it's good
to know the API has made it upstream (and downstream)!

I'll try to find some time to add support for it in libvirt, but it
might be a while before I actually get around to it...

Comment 25 Pingfan Liu 2019-09-24 10:22:25 UTC
Hi,

Given the 7.8 schedule, I think this fix should be tested as soon as possible.

Thanks,
Pingfan

Comment 28 errata-xmlrpc 2020-03-31 20:08:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1163

