Bug 1748051

Summary:

Rules in 40-redhat.rules file for SUBSYSTEM==memory are suboptimal and may lead to timing issues

Product:

Red Hat Enterprise Linux 7

Reporter:

Renaud Métrich <rmetrich>

Component:

systemd

Assignee:

Jan Synacek <jsynacek>

Status:

CLOSED ERRATA

QA Contact:

Frantisek Sumsal <fsumsal>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

7.7

CC:

aklimov, bugproxy, hannsj_uhl, ikent, jsynacek, kwalker, prajnoha, systemd-maint-list

Target Milestone:

Target Release:

7.8

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

systemd-219-72.el7

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2020-03-31 20:02:28 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1762679

Bug Blocks:

1689150

Attachments:

Description	Flags
udev patch v1	none

Description Renaud Métrich 2019-09-02 14:43:35 UTC

Description of problem:

The /lib/udev/rules.d/40-redhat.rules file defines the following rules for "memory" subsystem:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# Memory hotadd request
SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
ACTION!="add", GOTO="memory_hotplug_end"
PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"

ENV{.state}="online"
PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
ATTR{state}=="offline", ATTR{state}="$env{.state}"

LABEL="memory_hotplug_end"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

On system that have many Memory devices (in some cases, I can see up to 1000 or 12000 memory devices!!!), these rules will spawn tons of "uname -p" / "systemd-detect-virt" processes which will kill the system.

This issue affects multiple packages as well, for example trace-cmd ("/lib/udev/rules.d/98-trace-cmd.rules") and kexec-tools ("/lib/udev/rules.d/98-kexec.rules").


Version-Release number of selected component (if applicable):

systemd-219-67.el7_7.1


How reproducible:

Always


Steps to Reproduce:
1. Enable debugging in udev: "udev.log-priority=debug" while booting

2. Check the number of times a command executed

  # journalctl -b | egrep '(PROGRAM|RUN)' | cut -f2- -d"'" | sort | uniq -c | sort -nr


Actual results:

    168 /usr/bin/systemctl is-active trace-cmd.service' /usr/lib/udev/rules.d/98-trace-cmd.rules:1
    124 kmod load $env{MODALIAS}' /usr/lib/udev/rules.d/80-drivers.rules:5
     32 /bin/uname -p' /usr/lib/udev/rules.d/40-redhat.rules:9
     32 /bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:12
    ...

Expected results:

  1 call to "systemd-detect-virt"
  1 call to "uname -p"

Additional info:

This was observed on a system with 16 memory devices (2GB VM with 1 CPU).
On large physical systems, it can be just crazy.

Comment 3 Jan Synacek 2019-09-05 07:32:06 UTC

I'm pretty sure that we have already discussed this in a bug elsewhere (I can't find it right now). The original problem was that systemd-detect-virt was called multiple times per node, which resulted in ~10000 calls and *that* was a big problem. So we got it to the state where it is now and couldn't find any better solution. The current state is clearly suboptimal, but good enough. The problem is that there is no way to tell udev to cache the result somewhere in memory and would simply require extending udev, which we might actually do if this shows to be a huge performance hog in the future. I believe that Kyle was looking into that but I don't remember the result.

Comment 4 Renaud Métrich 2019-09-05 07:40:47 UTC

Hi Jan,

This was discussed in BZ #1666612. However at this time, we didn't realize that there could be thousands of Memory Devices in a system.
I'm currently working on a case where I suspect this is what happens (waiting for data from customer to confirm).

Comment 5 Jan Synacek 2019-09-05 07:59:05 UTC

Thanks for finding the bug! Ok, I will look into it more and see if we can easily patch udev to turn the detection into a built-in.

Comment 6 Kyle Walker 2019-09-05 12:25:05 UTC

Yep, 1666612 was the one,

I'm still a fan of pursuing a "runonce" rule set [^1]. Even with a builtin, we would churn through that operation a LOT, when we really just need to check our arch and virtualization state once and have that information available for subsequent operations. What do you think Jan? If you think it is worth spending time on, I will see if I can mock up a POC today.

[^1]: https://bugzilla.redhat.com/show_bug.cgi?id=1666612#c15

Comment 7 Renaud Métrich 2019-09-06 12:28:53 UTC

I got some customer data, where there were 1022 Memory Devices in udev DB and 8 CPUs.
It appears that these 2 rules ("uname -p" and "systemd-detect-virt") consume something like 10 seconds of the boot:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
$ egrep '(PROGRAM|RUN)' journal.log | cut -f2- -d"'" | sort | uniq -c | sort -nr | head -2
   2044 /bin/uname -p' /usr/lib/udev/rules.d/40-redhat.rules:9
   2044 /bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:12
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
$ grep -w PROGRAM journal.log | egrep "(uname|systemd-detect)" | head -1
Sep 06 13:20:21.122356 XXX systemd-udevd[410]: PROGRAM '/bin/uname -p' /usr/lib/udev/rules.d/40-redhat.rules:9
$ grep -w PROGRAM journal.log | egrep "(uname|systemd-detect)" | tail -1
Sep 06 13:20:33.741127 XXX systemd-udevd[2957]: PROGRAM '/bin/systemd-detect-virt' /usr/lib/udev/rules.d/40-redhat.rules:12
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This is still reasonable, but for a system with 12000 Memory Devices and 8 CPUs, this could be a real issue: we can expect 120 seconds doing this ...

Comment 8 Jan Synacek 2019-09-09 14:01:27 UTC

(In reply to Kyle Walker from comment #6)
> Yep, 1666612 was the one,
> 
> I'm still a fan of pursuing a "runonce" rule set [^1]. Even with a builtin,
> we would churn through that operation a LOT, when we really just need to
> check our arch and virtualization state once and have that information
> available for subsequent operations. What do you think Jan? If you think it
> is worth spending time on, I will see if I can mock up a POC today.

I'm already working on it. The runonce would be the most general, but also the most problematic to implement. I think it might actually be best to introduce something like ARCH and VIRT keys, which would be special keys with their respective constant values. I'll dig into it and see how things go.

Comment 9 Renaud Métrich 2019-09-09 14:34:11 UTC

Hi Jan,

Didn't check the code, but is there a way for some component outside systemd to provide his own "builtin" rule?
I'm asking because in the description I mentioned over components from RHEL can trigger this issue:

"
This issue affects multiple packages as well, for example trace-cmd ("/lib/udev/rules.d/98-trace-cmd.rules") and kexec-tools ("/lib/udev/rules.d/98-kexec.rules").
"

Comment 10 Jan Synacek 2019-09-09 15:48:13 UTC

This is not really a design document, but let's call it a design bugzilla entry. I did some research of the current state and these are the possible solutions with their pluses and minuses pinpointed.

We could:

1) Use a builtin
RUN{builtin}="virt" would be nice, however the RUN commands are executed *after* *all* the rules have been processed, which means one can't write a rule that would look like this:
RUN{builtin}="virt", RESULT=="kvm", RUN="do-something-kvm-specific"

2) RUNONCE="command"
That would be the most general approach. The problem is, however, that it might possibly bloat the udevd's memory usage, as there would have to be a global maphash (the cache) where keys would be the commands and values their output. The commands might potentially be numerous. Something similar is already implemented for builtins, but there are only a handful of them, which makes the memory footprint of their cache negligible.

3) New CONST key
The most simple and clean solution in our case. However, udev has no notion of "constants", only keys and values. Parser extension is needed.

In our 40-redhat.rules, the following lines are the most problematic, as they repeatedly call the external binary for every device:
> PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
> PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"

These lines would be rewritten as follows:
> CONST="arch", RESULT=="s390*", GOTO="memory_hotplug_end"
> CONST="virt", RESULT=="none", ENV{.state}="online_movable"

The semantics are basically the same as of PROGRAM, but instead of running an external command, a constant is supplied. In this case, uname_architecture() and detect_virtualization() would be the constants. The results of those functions are cached after the first call.

Comment 11 Jan Synacek 2019-09-09 16:03:11 UTC

(In reply to Renaud Métrich from comment #9)
> Hi Jan,
> 
> Didn't check the code, but is there a way for some component outside systemd
> to provide his own "builtin" rule?
> I'm asking because in the description I mentioned over components from RHEL
> can trigger this issue:
> 
> "
> This issue affects multiple packages as well, for example trace-cmd
> ("/lib/udev/rules.d/98-trace-cmd.rules") and kexec-tools
> ("/lib/udev/rules.d/98-kexec.rules").
> "

AFAIK, no, there is not. Would the above proposed solution solve the problem? Perhaps by adding more "constants"?

Comment 12 Renaud Métrich 2019-09-10 08:53:17 UTC

No it will not work for these particular rules:

/lib/udev/rules.d/98-trace-cmd.rules:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
SUBSYSTEM=="module", ACTION=="add", PROGRAM="/usr/bin/systemctl is-active trace-cmd.service", PROGRAM="/usr/bin/systemctl reload trace-cmd.service"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This reloads trace-cmd when module is added.
So it should run only once at boot, but then every time a new module is loaded.


/lib/udev/rules.d/98-kexec.rules:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
SUBSYSTEM=="cpu", ACTION=="add", GOTO="kdump_reload"
SUBSYSTEM=="cpu", ACTION=="remove", GOTO="kdump_reload"
SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload"
SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload"

LABEL="kdump_reload"

RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; /usr/bin/systemd-run --quiet /usr/bin/kdumpctl reload'"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This rebuilds the kdump initramfs everytime CPU or memory is added or removed.
so it should run only once at boot, but then every time some CPU or memory is hotplugged.

Comment 13 Jan Synacek 2019-09-10 14:03:08 UTC

Created attachment 1613603 [details]
udev patch v1

A proof-of-concept patch introducing a new keyword CONST. Currently, it can be used in a rule like so:

CONST{virt}=="kvm", ...
CONST{arch}=="ppc64*", ...

Only "virt" and "arch" keys are supported right now.

Comment 27 Jan Synacek 2019-10-07 08:16:53 UTC

Meanwhile, I have submitted a PR with the CONST key name to see what upstream thinks about it.

https://github.com/systemd/systemd/pull/13740

Comment 28 Jan Synacek 2019-10-16 07:20:21 UTC

There are currently 3 issues being tracked (#1722855, #1748051 and #1757458) with likely the same underlying problem negatively impacting performance (very slow boot, timeouts). Part of the performance impact seems to be the constant forking of systemd-detect-virt and uname in 40-redhat.rules, which is solved by https://github.com/systemd/systemd/commit/4801d8afe2ff1c1c075c9f0bc5631612172e0bb7 and improves the performance to a certain degree. However, when thousands of memory devices are present in /sys, the real congestion is likely caused by the kernel handling readlinkat() calls on /sys, see https://bugzilla.redhat.com/show_bug.cgi?id=1722855#c67.

Comment 29 Jan Synacek 2019-10-21 10:34:52 UTC

https://github.com/systemd-rhel/rhel-7/pull/54

Comment 30 Lukáš Nykrýn 2019-10-24 10:31:33 UTC

fix merged to github master branch -> https://github.com/systemd-rhel/rhel-7/pull/54 -> post

Comment 34 errata-xmlrpc 2020-03-31 20:02:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1117