Bug 1748051
Summary: Rules in 40-redhat.rules file for SUBSYSTEM==memory are suboptimal and may lead to timing issues

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | systemd |
| Version | 7.7 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Target Milestone | rc |
| Target Release | 7.8 |
| Fixed In Version | systemd-219-72.el7 |
| Reporter | Renaud Métrich \<rmetrich\> |
| Assignee | Jan Synacek \<jsynacek\> |
| QA Contact | Frantisek Sumsal \<fsumsal\> |
| CC | aklimov, bugproxy, hannsj_uhl, ikent, jsynacek, kwalker, prajnoha, systemd-maint-list |
| Type | Bug |
| Last Closed | 2020-03-31 20:02:28 UTC |
| Bug Depends On | 1762679 |
| Bug Blocks | 1689150 |
Description
Renaud Métrich, 2019-09-02 14:43:35 UTC
I'm pretty sure that we have already discussed this in a bug elsewhere (I can't find it right now). The original problem was that systemd-detect-virt was called multiple times per node, which resulted in ~10000 calls, and *that* was a big problem. So we got it to the state where it is now and couldn't find any better solution. The current state is clearly suboptimal, but good enough. The problem is that there is no way to tell udev to cache the result somewhere in memory; adding one would require extending udev, which we might actually do if this turns out to be a huge performance hog in the future. I believe that Kyle was looking into that, but I don't remember the result.

Hi Jan,

This was discussed in BZ #1666612. However, at that time we didn't realize that there could be thousands of Memory Devices in a system. I'm currently working on a case where I suspect this is what happens (waiting for data from the customer to confirm).

Thanks for finding the bug! Ok, I will look into it more and see if we can easily patch udev to turn the detection into a built-in.

Yep, 1666612 was the one. I'm still a fan of pursuing a "runonce" rule set [^1]. Even with a builtin, we would churn through that operation a LOT, when we really just need to check our arch and virtualization state once and have that information available for subsequent operations. What do you think, Jan? If you think it is worth spending time on, I will see if I can mock up a POC today.

[^1]: https://bugzilla.redhat.com/show_bug.cgi?id=1666612#c15

I got some customer data, where there were 1022 Memory Devices in the udev DB and 8 CPUs.
It appears that these 2 rules ("uname -p" and "systemd-detect-virt") consume something like 10 seconds of the boot:

-------- 8< --------
$ egrep '(PROGRAM|RUN)' journal.log | cut -f2- -d"'" | sort | uniq -c | sort -nr | head -2
   2044 /bin/uname -p' /usr/lib/udev/rules.d/40-redhat.rules:9
   2044 /bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:12
-------- 8< --------

-------- 8< --------
$ grep -w PROGRAM journal.log | egrep "(uname|systemd-detect)" | head -1
Sep 06 13:20:21.122356 XXX systemd-udevd[410]: PROGRAM '/bin/uname -p' /usr/lib/udev/rules.d/40-redhat.rules:9
$ grep -w PROGRAM journal.log | egrep "(uname|systemd-detect)" | tail -1
Sep 06 13:20:33.741127 XXX systemd-udevd[2957]: PROGRAM '/bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:12
-------- 8< --------

This is still reasonable, but for a system with 12000 Memory Devices and 8 CPUs this could be a real issue: we can expect 120 seconds spent doing this.

(In reply to Kyle Walker from comment #6)
> Yep, 1666612 was the one,
>
> I'm still a fan of pursuing a "runonce" rule set [^1]. Even with a builtin,
> we would churn through that operation a LOT, when we really just need to
> check our arch and virtualization state once and have that information
> available for subsequent operations. What do you think Jan? If you think it
> is worth spending time on, I will see if I can mock up a POC today.

I'm already working on it. The runonce approach would be the most general, but also the most problematic to implement. I think it might actually be best to introduce something like ARCH and VIRT keys, which would be special keys with their respective constant values. I'll dig into it and see how things go.
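The 120-second figure can be sanity-checked with a quick back-of-the-envelope calculation. This is only a sketch, assuming the per-device cost of the two PROGRAM rules scales linearly and taking the ~10 s / 1022 devices observed in the journal above:

```shell
# Extrapolate the observed ~10 s for 1022 memory devices to 12000 devices,
# assuming the cost of the two forked helpers is constant per device.
awk 'BEGIN {
    per_device = 10.0 / 1022            # seconds per memory device (observed)
    printf "%.0f\n", per_device * 12000 # expected seconds for 12000 devices
}'
# -> 117
```

That comes out to roughly 117 seconds, in line with the 120-second estimate.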
Hi Jan,

Didn't check the code, but is there a way for some component outside systemd to provide its own "builtin" rule? I'm asking because in the description I mentioned that other components from RHEL can trigger this issue:

"This issue affects multiple packages as well, for example trace-cmd ("/lib/udev/rules.d/98-trace-cmd.rules") and kexec-tools ("/lib/udev/rules.d/98-kexec.rules")."

This is not really a design document, but let's call it a design bugzilla entry. I did some research of the current state, and these are the possible solutions with their pluses and minuses pinpointed. We could:

1) Use a builtin

RUN{builtin}="virt" would be nice; however, the RUN commands are executed *after* *all* the rules have been processed, which means one can't write a rule that would look like this:

> RUN{builtin}="virt", RESULT=="kvm", RUN="do-something-kvm-specific"

2) RUNONCE="command"

That would be the most general approach. The problem, however, is that it might bloat udevd's memory usage, as there would have to be a global hashmap (the cache) whose keys would be the commands and whose values would be their output. The commands might potentially be numerous. Something similar is already implemented for builtins, but there are only a handful of them, which makes the memory footprint of their cache negligible.

3) New CONST key

The simplest and cleanest solution in our case. However, udev has no notion of "constants", only keys and values. A parser extension is needed.
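The memoization idea behind option 2 can be sketched in a few lines of shell. This is only an illustration of the caching scheme, not udevd code: udevd would keep such a cache as a C hashmap, and `runonce` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Sketch of the RUNONCE idea: a global cache maps each command string to
# its captured output, so a given command is forked at most once.
declare -A RUNONCE_CACHE

runonce() {
    local cmd="$1"
    # Fork the command only if its output is not cached yet. The command
    # string is word-split on whitespace, which is fine for this sketch.
    if [[ -z ${RUNONCE_CACHE[$cmd]+set} ]]; then
        RUNONCE_CACHE[$cmd]=$($cmd)
    fi
    printf '%s\n' "${RUNONCE_CACHE[$cmd]}"
}

runonce "uname -m"   # first call: actually forks uname
runonce "uname -m"   # second call: answered from the cache, no fork
```

The memory concern from option 2 is visible here: the cache grows with every distinct command string, which is harmless for a handful of builtins but not for arbitrary rule-supplied commands.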
In our 40-redhat.rules, the following lines are the most problematic, as they repeatedly call an external binary for every device:

> PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
> PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"

These lines would be rewritten as follows:

> CONST="arch", RESULT=="s390*", GOTO="memory_hotplug_end"
> CONST="virt", RESULT=="none", ENV{.state}="online_movable"

The semantics are basically the same as PROGRAM's, but instead of running an external command, a constant is supplied. In this case, uname_architecture() and detect_virtualization() would supply the constants. The results of those functions are cached after the first call.

(In reply to Renaud Métrich from comment #9)
> Hi Jan,
>
> Didn't check the code, but is there a way for some component outside systemd
> to provide his own "builtin" rule?
> I'm asking because in the description I mentioned other components from RHEL
> can trigger this issue:
>
> "
> This issue affects multiple packages as well, for example trace-cmd
> ("/lib/udev/rules.d/98-trace-cmd.rules") and kexec-tools
> ("/lib/udev/rules.d/98-kexec.rules").
> "

AFAIK, no, there is not. Would the above proposed solution solve the problem? Perhaps by adding more "constants"?

No, it will not work for these particular rules:

/lib/udev/rules.d/98-trace-cmd.rules:
-------- 8< --------
SUBSYSTEM=="module", ACTION=="add", PROGRAM="/usr/bin/systemctl is-active trace-cmd.service", PROGRAM="/usr/bin/systemctl reload trace-cmd.service"
-------- 8< --------

This reloads trace-cmd when a module is added, so it needs to run not just once at boot, but every time a new module is loaded.
/lib/udev/rules.d/98-kexec.rules:
-------- 8< --------
SUBSYSTEM=="cpu", ACTION=="add", GOTO="kdump_reload"
SUBSYSTEM=="cpu", ACTION=="remove", GOTO="kdump_reload"
SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload"
SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload"
LABEL="kdump_reload"
RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --quiet /usr/bin/kdumpctl reload'"
-------- 8< --------

This rebuilds the kdump initramfs every time CPU or memory is added or removed, so it needs to run not just once at boot, but every time some CPU or memory is hotplugged.

Created attachment 1613603 [details]
udev patch v1
A proof-of-concept patch introducing a new keyword CONST. Currently, it can be used in a rule like so:
CONST{virt}=="kvm", ...
CONST{arch}=="ppc64*", ...
Only "virt" and "arch" keys are supported right now.
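With that syntax, the two problematic lines from 40-redhat.rules could plausibly be rewritten as below. This is a sketch based on the POC syntax; the exact rule text that eventually shipped may differ:

```
# Memory hotplug section of 40-redhat.rules, sketched with the CONST key:
# skip the whole section on s390, and only request online_movable when
# not running virtualized -- no external binary is forked per device.
CONST{arch}=="s390*", GOTO="memory_hotplug_end"
CONST{virt}!="none", GOTO="memory_hotplug_end"
```

Because CONST matches against an in-process cached value, adding more memory devices no longer adds more forks of uname or systemd-detect-virt.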
Meanwhile, I have submitted a PR with the CONST key name to see what upstream thinks about it: https://github.com/systemd/systemd/pull/13740

There are currently 3 issues being tracked (#1722855, #1748051 and #1757458) with likely the same underlying problem negatively impacting performance (very slow boot, timeouts). Part of the performance impact seems to be the constant forking of systemd-detect-virt and uname in 40-redhat.rules, which is solved by https://github.com/systemd/systemd/commit/4801d8afe2ff1c1c075c9f0bc5631612172e0bb7 and improves the performance to a certain degree. However, when thousands of memory devices are present in /sys, the real congestion is likely caused by the kernel handling readlinkat() calls on /sys; see https://bugzilla.redhat.com/show_bug.cgi?id=1722855#c67.

Fix merged to the github master branch -> https://github.com/systemd-rhel/rhel-7/pull/54 -> post

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1117