Bug 1666612

Summary: Rules "uname -p" and "systemd-detect-virt" kill the system boot time on large systems
Product: Red Hat Enterprise Linux 7
Reporter: Renaud Métrich <rmetrich>
Component: systemd
Assignee: Jan Synacek <jsynacek>
Status: CLOSED ERRATA
QA Contact: Frantisek Sumsal <fsumsal>
Severity: high
Docs Contact:
Priority: high
Version: 7.6
CC: aghadge, bbreard, bubrown, cgoncalves, cww, hannsj_uhl, hmatsumo, iwienand, jniu, jonathon.parrish, jsynacek, kwalker, msekleta, naresh.sukhija_ext, pdancak, pdwyer, pete.perfetti, qe-baseos-daemons, rmadhuso, samukawa-oxa, tatsu-ab1, toneata
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: systemd-219-64.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1701230 (view as bug list)
Environment:
Last Closed: 2019-08-06 12:43:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1655750, 1701230, 1707454

Description Renaud Métrich 2019-01-16 08:19:32 UTC
Description of problem:

A customer with a large system observed boot problems due to timeouts during boot. In particular, systemd-udev-settle was taking more than 90 seconds (the internal udevadm timeout), devices were timing out, etc.

We found out that the culprits were these 2 udev rules:

/lib/udev/rules.d/40-redhat.rules:

# Memory hotadd request
SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"

ENV{.state}="online"
PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
ATTR{state}=="offline", ATTR{state}="$env{.state}"

LABEL="memory_hotplug_end"


On a small system (my VM, with 1 disk and 1 CPU), we can see that the two programs run 537 times each during the initramfs step!

# journalctl -b | grep -w PROGRAM | awk -F "'" '{ print $2 }' | sort | uniq -c
      1 /bin/sh -c 
    537 /bin/systemd-detect-virt
    537 /bin/uname -p

On the large customer system, they ran more than 14,000 times each:

  14234 /bin/systemd-detect-virt
  14227 /bin/uname -p

We found out that removing these rules on the customer system improved boot time from 4 min 24 s down to 1 min 32 s, which is a reasonable boot time.
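
As a side note, a quick way to compare overall boot times across such a change is systemd-analyze, which (run without arguments) prints the time spent in the kernel, initrd and userspace for the current boot:

# systemd-analyze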


Version-Release number of selected component (if applicable):

systemd-219-62.el7.x86_64


How reproducible:

Always


Steps to Reproduce:

1. Boot with rd.udev.debug

2. Count the number of times the programs were executed

Comment 4 Jan Synacek 2019-01-16 09:07:07 UTC
So, in theory, couldn't we write a udev rule that always runs first, runs "systemd-detect-virt" once and caches the result in an environment variable?

Comment 6 Michal Sekletar 2019-01-16 10:53:36 UTC
I guess we need to split the rule into two checks, because in the current rule (where logical AND applies among the conditions) the memory-hotplug section can be triggered by any memory event and by "add" events on other subsystems.

I think we need to change the rules like so:

SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
ACTION!="add", GOTO="memory_hotplug_end"
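
For clarity, match keys on a single udev rule line are ANDed, so the original combined line

SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"

jumps to the end label only when the event is neither a memory event nor an "add" event; every memory event and every "add" event on any other subsystem therefore falls through into the PROGRAM calls. With the split rules above, the GOTO is taken as soon as either check fails, so only "add" events on the memory subsystem reach the PROGRAM matches.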

Comment 7 Renaud Métrich 2019-01-16 11:07:23 UTC
Good catch! Will give it a try this afternoon.

Comment 8 Renaud Métrich 2019-01-16 12:22:04 UTC
Patch:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# diff -u /lib/udev/rules.d/40-redhat.rules.orig /lib/udev/rules.d/40-redhat.rules
--- /lib/udev/rules.d/40-redhat.rules.orig	2019-01-16 10:59:24.813058034 +0100
+++ /lib/udev/rules.d/40-redhat.rules	2019-01-16 13:12:20.082219918 +0100
@@ -4,7 +4,8 @@
 SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"
 
 # Memory hotadd request
-SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
+SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
+ACTION!="add", GOTO="memory_hotplug_end"
 PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
 
 ENV{.state}="online"
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Result:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# journalctl -b | grep -w PROGRAM | awk -F "'" '{ print $2 }' | sort | uniq -c
      1 /bin/sh -c 
      8 /bin/systemd-detect-virt
      8 /bin/uname -p
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This is far better, but not yet fully optimized (we should only have 1 call).
Unfortunately my udev knowledge is limited and I couldn't find any example of triggering the program only once, storing the result in a variable, and then relying on that variable.

Comment 9 Jan Synacek 2019-01-17 09:49:57 UTC
It appears that I misunderstood how ENV{key} worked. It is bound to a device, which means it cannot be used globally (across devices) as a cache. And I'm not aware of anything like ENV that works globally.

Comment 10 Michal Sekletar 2019-01-17 12:01:19 UTC
(In reply to Jan Synacek from comment #9)
> It appears that I misunderstood how ENV{key} worked. It is bound to a
> device, which means it cannot be used globally (across devices) as a cache.
> And I'm not aware of anything like ENV that works globally.

This is correct. Also, ENV is actually per event, and once the event handling is completed, the ENV values update the udev database entry of the event device.
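
To illustrate (the VIRT_CACHED property name is made up for this example), a naive caching rule such as

# Intended as "run once and remember", but ENV is scoped to the event's own
# device, so the property is never already set when the next memory device is
# processed and the program runs again.
ENV{VIRT_CACHED}!="?*", PROGRAM="/bin/systemd-detect-virt", ENV{VIRT_CACHED}="%c"

still ends up executing systemd-detect-virt once per device.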

Comment 11 Renaud Métrich 2019-01-17 12:09:39 UTC
What about creating a udev builtin for "uname -p" and "systemd-detect-virt" ?
I guess we could execute the command once (or do the same as C code) then cache the value.

Comment 12 Jan Synacek 2019-01-17 12:57:49 UTC
(In reply to Renaud Métrich from comment #11)
> What about creating a udev builtin for "uname -p" and "systemd-detect-virt" ?
> I guess we could execute the command once (or do the same as C code) then
> cache the value.

A more general version would be to create a global "cache-and-env-like" variable to be assigned a defined value or an output from an external command just once. There would probably have to be an additional operation that would answer the question "does cache for <foo> exist?".

Something like this:

CACHE{virt}?, GOTO="end_label"
PROGRAM="systemd-detect-virt", DO_SOMETHING..., CACHE{virt}="$RESULT"

Comment 13 Kyle Walker 2019-01-17 13:27:36 UTC
@Jan,

I was wondering about that, but it looks like there is some precedent in the code for a global environment set around the "properties_list". 

static struct udev_ctrl_connection *handle_ctrl_msg(struct udev_ctrl *uctrl) {
<snip>
        str = udev_ctrl_get_set_env(ctrl_msg);
        if (str != NULL) {
                char *key;

                key = strdup(str);
                if (key != NULL) {
                        char *val;

                        val = strchr(key, '=');
                        if (val != NULL) {
                                val[0] = '\0';
                                val = &val[1];
                                if (val[0] == '\0') {
                                        log_debug("udevd message (ENV) received, unset '%s'", key);
                                        udev_list_entry_add(&properties_list, key, NULL);
                                } else {
                                        log_debug("udevd message (ENV) received, set '%s=%s'", key, val);
                                        udev_list_entry_add(&properties_list, key, val);
                                }
                        } else {
                                log_error("wrong key format '%s'", key);
                        }
                        free(key);
                }
                worker_kill(udev);
        }
<snip>


I haven't had time to put a proof-of-concept together, but I was thinking of implementing a global-rule pass. Something like the following paths:

<snip>
static const char* const runonce_rules_dirs[] = {
        "/etc/udev/runonce.d",
        "/run/udev/runonce.d",
        UDEVLIBEXECDIR "/runonce.d",
        NULL};
<snip>


This would be where we could put rules that gather virtualization and uname information, as well as just about any other single-check behaviour, and propagate the resulting environment variables to the rest of udev via properties_list. You get to use the attributes in the individual workers, and it would be in keeping with the global attribute behaviour that is already there. If the team agrees that this is a decent approach, I could work on the POC and post it upstream.

This would all be contingent on upstream buy-in of course...

- Kyle Walker

Comment 14 Jan Synacek 2019-01-17 15:03:56 UTC
(In reply to Kyle Walker from comment #13)
> <snip>
> static const char* const runonce_rules_dirs[] = {
>         "/etc/udev/runonce.d",
>         "/run/udev/runonce.d",
>         UDEVLIBEXECDIR "/runonce.d",
>         NULL};
> <snip>

I think that I understand the idea. The question is: isn't this too complicated for what you're trying to achieve? I'm not really sure how to answer that question. My guess is that if you want to do it in a special file (or files), one file should do it, as it would just state what gets cached and how. That way, you wouldn't introduce an entire directory structure just for the cache. But I'm not sure. I guess we should first gather some data on where in the system this would actually be used and what improvements it would bring.

Comment 15 Kyle Walker 2019-01-18 15:20:51 UTC
(In reply to Jan Synacek from comment #14)
> I think that I understand the idea. The question is: isn't this too
> complicated for what you're trying to achieve? I'm not really sure how to
> answer that question. My guess is that if you want to do it in a special
> file(s), one file should do it, as it would just state what gets cached and
> how. That way, you wouldn't introduce an entire directory structure just for
> the cache.

That is fair, and I think the safer path to go down. We can ship/maintain just a single file (UDEVLIBEXECDIR "runonce.rules").

> But, I'm not sure. I guess we should first gather some data of where in the
> system this would actually be used and what improvements would it bring.

After looking at the udev rules we have today, I think that these two lines would be the largest consumers. I mean, it would be simple enough to just make architecture and virtualization information available in the context of rules. Gather that information once during udevd initialization, and then add "VIRTUALIZATION" and "ARCHITECTURE" key names that can be used later. That just addresses the current situation, and doesn't really give us a path to address these types of issues in future, outside of adding more. 

This way we would give ourselves, and 3rd parties, a path to omit PROGRAM calls in match criteria for every device of that type where it isn't necessary.
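
As a purely hypothetical sketch (ARCHITECTURE and VIRTUALIZATION are just the key names proposed above, not an existing interface), the memory-hotplug block could then drop its PROGRAM matches entirely:

SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
ACTION!="add", GOTO="memory_hotplug_end"
ENV{ARCHITECTURE}=="s390*", GOTO="memory_hotplug_end"

ENV{.state}="online"
ENV{VIRTUALIZATION}=="none", ENV{.state}="online_movable"
ATTR{state}=="offline", ATTR{state}="$env{.state}"

LABEL="memory_hotplug_end"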

- Kyle Walker

Comment 16 Jan Synacek 2019-01-24 07:14:48 UTC
*** Bug 1668982 has been marked as a duplicate of this bug. ***

Comment 17 Ian Wienand 2019-01-24 21:24:43 UTC
Does this bug need to be Red Hat only? It doesn't seem to include any particular customer information. There is discussion in https://bugs.centos.org/view.php?id=15735, but people cannot see the progress on this bug.

Comment 18 Ian Wienand 2019-01-24 22:11:53 UTC
(In reply to Michal Sekletar from comment #6)
> I think we need to change the rules like so,
> 
> SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
> ACTION!="add", GOTO="memory_hotplug_end"

(pasting from the centos bug)

I can confirm splitting the rule works as intended (http://paste.openstack.org/show/743429/)

http://paste.openstack.org/show/743427/ is an example of udev testing that previously triggered this path, but doesn't with the change.

I also regenerated the initramfs with this, and the runtime is about what it was before; i.e. there is no longer a regression

---
[00:00:06.613] Welcome to CentOS Linux 7 (Core) dracut-033-554.el7 (Initramfs)!
[00:00:35.658] Welcome to CentOS Linux 7 (Core)!
---

(In reply to Renaud Métrich from comment #8)
> # journalctl -b | grep -w PROGRAM | awk -F "'" '{ print $2 }' | sort | uniq
> -c
>       1 /bin/sh -c 
>       8 /bin/systemd-detect-virt
>       8 /bin/uname -p

> This is far better, but not yet fully optimized (we should only have 1 call).

This roughly makes sense though, as it triggers once for each of the memory devices (e.g. https://bugzilla.redhat.com/attachment.cgi?id=1522941)?
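
As a quick cross-check of that count, the number of memory block devices the kernel exposes can be listed directly (a generic hint, not taken from the pastes above); each block receives an "add" uevent during coldplug, so with the split rule one pair of PROGRAM calls per block is expected:

# ls -d /sys/devices/system/memory/memory* | wc -l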

Comment 22 Jan Synacek 2019-01-30 09:31:01 UTC
https://github.com/lnykryn/systemd-rhel/pull/280

Comment 29 Lukáš Nykrýn 2019-02-11 09:51:46 UTC
fix merged to staging branch -> https://github.com/lnykryn/systemd-rhel/pull/280 -> post

Comment 34 Jan Synacek 2019-04-10 08:47:12 UTC
*** Bug 1658839 has been marked as a duplicate of this bug. ***

Comment 39 naresh.sukhija_ext 2019-05-16 09:54:26 UTC
Just checking when the QA approvals are expected to be completed; I have been observing the ON_QA status since 27 March 2019.

Comment 42 errata-xmlrpc 2019-08-06 12:43:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2091