Bug 1213835
| Field | Value |
| --- | --- |
| Summary | numa_node_to_cpus() returns cached data |
| Product | Red Hat Enterprise Linux 7 |
| Component | numactl |
| Version | 7.1 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Reporter | Andrea Bolognani <abologna> |
| Assignee | Pingfan Liu <piliu> |
| QA Contact | qe-baseos-daemons |
| CC | abologna, ovasik, pdancak, perfbz, pportant, qe-baseos-daemons, ruyang, skozina |
| Target Milestone | rc |
| Fixed In Version | numactl-2.0.12-4.el7 |
| Type | Bug |
| Bug Blocks | 1400961, 1711208 |
| Last Closed | 2020-03-31 20:08:17 UTC |
Hi Andrea, it seems like quite a big problem, since numactl wasn't prepared for CPU/node runtime changes between online/offline. Do you need to disable all caching for libvirtd, or just particular calls?

Hi Petr, right now libvirt is just calling into libnuma, so we're not reacting in any way to host topology changes, which can lead to weird output as evidenced in the bug report.

The current implementation is quite puzzling: a short-lived utility like numactl is probably going to query the data just once or twice, which makes a cache not very useful; on the other hand, a long-lived process like libvirtd is going to access the data several times and having a cache could really speed things up, but the fact that the data in the cache might be outdated kinda defeats the purpose.

Ideally libnuma should always report the current state of the host. I guess there are two ways to ensure this is the case: get rid of the cache completely, or rewrite it so that the data contained in it is kept up-to-date. The latter would be more difficult to get right, and I'm not sure the benefit would be noticeable since we're talking about a really small amount of data here.

Please note that there might be other cached data in the library, and some of it might be accessed directly instead of through a function call, which would make both approaches impossible without an API change. I haven't investigated this in depth.

Cheers.

Hi Andrea, I've sent two patches upstream, but no acceptance so far (http://comments.gmane.org/gmane.linux.kernel.numa/934). Based on that, deferring the BZ to 7.3.

Hi Andrea,

I just took the bug and am wondering how to fix it. Is it possible for libvirt to add a udev rule, so that each time a CPU goes online/offline the daemon is notified and can re-call numactl to get the info? Or any other suggestion?

Thanks,
Pingfan

(In reply to Pingfan Liu from comment #11)
> Hi Andrea,
>
> I just took the bug and am wondering how to fix it. Is it possible for
> libvirt to add a udev rule, so that each time a CPU goes online/offline the
> daemon is notified and can re-call numactl to get the info? Or any other
> suggestion?

libvirt doesn't spawn numactl, but rather it links against libnuma directly, which means your suggestion wouldn't really work... Personally I would just get rid of the cache, but as I understand it that approach was rejected upstream the last time it was proposed.

Hi Andrea,

Regarding the method suggested in comment#5, the maintainer was concerned about the performance impact, so we cannot drop the cache. I am thinking about another method, similar to how udev works. But before that, have you tried the method in comment#5? Does it work?

I just want to make sure "/sys/devices/system/node/node%d/cpumap" reflects the online/offline CPUs, not the present CPUs.

Thanks,
Pingfan

(In reply to Pingfan Liu from comment #13)
> I just want to make sure "/sys/devices/system/node/node%d/cpumap" reflects
> the online/offline CPUs, not the present CPUs.

Checking the kernel code, it reflects the mapping of possible CPUs.

Hi Andrea,

What about exporting an API numa_node_to_cpu_outdate() to update the cached data after a CPU goes online/offline? You can find the suggested patch in the attachment.

Thanks,
Pingfan

Created attachment 1590658 [details]
suggested patch to update the cpu to node mapping
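For illustration, below is a minimal sketch of how a long-lived process such as libvirtd might consume an invalidation API along these lines. The function name numa_node_to_cpu_outdate() and its no-argument signature are assumptions taken from the discussion of the attached patch (the name later suggested is numa_node_to_cpu_mark_outdated(), and whatever is eventually merged upstream may differ again); the udev plumbing that would deliver the hotplug notification is not shown, and the helper functions are made up for the example.

```c
/* Sketch only: assumes a numactl patched as described above, which
 * exports an invalidation call named numa_node_to_cpu_outdate().
 * Build (with such a library installed): gcc daemon.c -o daemon -lnuma */
#include <stdio.h>
#include <numa.h>

/* With a patched numactl this prototype would come from <numa.h>;
 * it is declared here only so the sketch is self-contained. The real
 * name/signature may differ. */
void numa_node_to_cpu_outdate(void);

static void print_node_cpus(int node)
{
    struct bitmask *cpus = numa_allocate_cpumask();

    if (numa_node_to_cpus(node, cpus) == 0) {
        for (unsigned int cpu = 0; cpu < cpus->size; cpu++)
            if (numa_bitmask_isbitset(cpus, cpu))
                printf("node %d: cpu %u\n", node, cpu);
    }
    numa_free_cpumask(cpus);
}

/* Called from whatever mechanism the daemon uses to learn about CPU
 * hotplug, e.g. a udev monitor (not shown here). */
static void on_cpu_hotplug_event(void)
{
    numa_node_to_cpu_outdate();   /* drop the stale node->cpu cache */
    print_node_cpus(0);           /* the next query re-reads sysfs */
}

int main(void)
{
    if (numa_available() < 0)
        return 1;

    print_node_cpus(0);       /* first call populates the cache */
    on_cpu_hotplug_event();   /* simulate a topology change notification */
    return 0;
}
```

The idea is that the caller, not libnuma, decides when the cached node-to-CPU mapping is stale, so the fast path of numa_node_to_cpus() stays unchanged and the cache is only refreshed on (rare) hotplug events.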
Hi, sorry for taking a while to get back to you.

The patch looks okay (though I'd prefer if the function were called numa_node_to_cpu_mark_outdated() or something like that), and reacting to CPU topology changes in user code using udev rules doesn't seem too terrible either.

That said, I still stand by what I said four years (!) ago in Comment 4: the basic idea of caching this data in libnuma seems misguided, considering that short-lived processes will not stick around long enough for it to matter, and long-lived processes on the other hand will be hindered by the fact that they're served stale data.

I tracked down an alternative URL for the patch posting mentioned in Comment 5[1] and it doesn't look to me like upstream really NACKed the suggested approach of ditching the cache, but merely asked for some mitigation, which Petr promptly implemented; the discussion simply stopped after that.

How about re-submitting Petr's patches and seeing whether that results in a better outcome this time around?

[1] https://www.spinics.net/lists/linux-numa/msg01134.html

(In reply to Andrea Bolognani from comment #17)
> Hi, sorry for taking a while to get back to you.
>
> The patch looks okay (though I'd prefer if the function were called
> numa_node_to_cpu_mark_outdated() or something like that), and reacting to
> CPU topology changes in user code using udev rules doesn't seem too
> terrible either.
>
> That said, I still stand by what I said four years (!) ago in Comment 4:
> the basic idea of caching this data in libnuma seems misguided, considering
> that short-lived processes will not stick around long enough for it to
> matter, and long-lived processes on the other hand will be hindered by the
> fact that they're served stale data.

Yes, that would be the best design. But it would be unavoidable to introduce something like udev either into libnuma or into the caller, so I think a slight change is more likely to be accepted by the maintainer.

> I tracked down an alternative URL for the patch posting mentioned in
> Comment 5[1] and it doesn't look to me like upstream really NACKed the
> suggested approach of ditching the cache, but merely asked for some
> mitigation, which Petr promptly implemented; the discussion simply stopped
> after that.

But I cannot understand how a timeout can solve the performance issue (https://www.spinics.net/lists/linux-numa/msg01137.html). On the other hand, exporting an API numa_node_to_cpu_mark_outdated() can avoid the performance drop entirely, considering that onlining/offlining a CPU is not a very frequent event in a production environment.

> How about re-submitting Petr's patches and seeing whether that results in a
> better outcome this time around?
>
> [1] https://www.spinics.net/lists/linux-numa/msg01134.html

I have created a pull request: https://github.com/numactl/numactl/pull/73 and added some comments about https://www.spinics.net/lists/linux-numa/msg01134.html.

Thanks,
Pingfan

(In reply to Pingfan Liu from comment #19)
> I have created a pull request: https://github.com/numactl/numactl/pull/73

Alright, let's see how this goes :)

Good luck, and thanks for pushing this forward!

Hi Andrea,

The suggested patch has been merged and backported. Could you test the build at https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=23531514 to see whether it works for you?

Thanks,
Pingfan

(In reply to Pingfan Liu from comment #21)
> Hi Andrea,
>
> The suggested patch has been merged and backported. Could you test the
> build at https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=23531514
> to see whether it works for you?

Hi, unfortunately I don't have time to try this right now, but it's good to know the API has made it upstream (and downstream)! I'll try to find some time to add support for it in libvirt, but it might be a while before I actually get around to it...

Hi,

Due to the schedule of 7.8, I think this fix should be tested as soon as possible.

Thanks,
Pingfan

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1163
Description of problem:

numa_node_to_cpus() caches the number of CPUs present in the system, which means that applications using it to query this information will be given stale data if the hardware configuration has changed between calls. An example of this behavior can be seen in the libvirtd daemon, which is a long-running process linked against libnuma.

Version-Release number of selected component (if applicable):

numactl-libs-2.0.9-4.el7.x86_64

How reproducible:

Always.

Steps to Reproduce:

1. The host needs to have at least two CPUs.

2. Query libvirt for the host capabilities; both CPUs are reported:

       [test@localhost ~]$ sudo virsh capabilities | grep '<cpu id='
           <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
           <cpu id='1' socket_id='1' core_id='0' siblings='1'/>

3. Now bring one of the CPUs offline:

       [test@localhost ~]$ echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online

4. Query libvirt for the host capabilities again; the second CPU is still present in the output, but all of its details are missing because they can't be retrieved for offline CPUs:

       [test@localhost ~]$ sudo virsh capabilities | grep '<cpu id='
           <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
           <cpu id='1'/>

5. It works the other way around, too: if only one CPU is online when libvirtd is started and you bring the other one online later, only one CPU will be reported by libvirt.

Actual results:

The number of CPUs reported by numa_node_to_cpus() doesn't reflect the changes that have occurred in the system since the process was started.

Expected results:

The data reported by libnuma is always current.

Additional info:

None
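For completeness, here is a minimal standalone sketch that reproduces the same symptom directly against libnuma, without going through libvirt. It is only an illustration under stated assumptions: a host whose node 0 contains at least CPUs 0 and 1 (matching the virsh output above) and an unfixed numactl-libs such as the version listed; the helper name print_node0_cpus() is made up for the example.

```c
/* Minimal reproduction sketch for the stale numa_node_to_cpus() cache.
 * Build with: gcc repro.c -o repro -lnuma
 * Assumes node 0 contains at least CPUs 0 and 1. With an unfixed
 * numactl-libs the second query prints the same CPU list as the first,
 * even after a CPU has been taken offline in between, because the
 * node-to-CPU mapping is served from an in-process cache. */
#include <stdio.h>
#include <numa.h>

static void print_node0_cpus(const char *label)
{
    struct bitmask *cpus = numa_allocate_cpumask();

    if (numa_node_to_cpus(0, cpus) == 0) {
        printf("%s: node 0 ->", label);
        for (unsigned int cpu = 0; cpu < cpus->size; cpu++)
            if (numa_bitmask_isbitset(cpus, cpu))
                printf(" %u", cpu);
        printf("\n");
    }
    numa_free_cpumask(cpus);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    print_node0_cpus("before");   /* first call populates the cache */

    printf("Now run, in another terminal:\n"
           "  echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online\n"
           "then press Enter to query again...\n");
    getchar();

    /* With the cached implementation this still reports CPU 1. */
    print_node0_cpus("after");
    return 0;
}
```

With a fixed library (or after invalidating the cache) the second query would instead reflect the new topology.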