Bug 2148170

Summary: Failed services consume units, causing systemd limitations on max number of units to be reached
Product: Red Hat Enterprise Linux 8
Reporter: Renaud Métrich <rmetrich>
Component: systemd
Assignee: systemd maint <systemd-maint>
Status: NEW
QA Contact: Frantisek Sumsal <fsumsal>
Severity: medium
Priority: medium
Version: 8.7
CC: dtardon, sbroz, systemd-maint-list
Target Milestone: rc
Keywords: Triaged
Hardware: All
OS: Linux
Type: Bug

Description Renaud Métrich 2022-11-24 13:42:34 UTC
Description of problem:

When a transient service fails, it keeps consuming a unit until "reset-failed" is issued.
This becomes problematic once the maximum number of units (hardcoded to 128K in the sources) is reached:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 21 #define MANAGER_MAX_NAMES 131072 /* 128K */
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
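To gauge how close a live system is to that limit, the number of currently loaded units can be approximated as follows (a sketch; "list-units --all" counts loaded units, not every registered unit name, so it may undercount):

```shell
# Approximate the number of unit slots currently in use
# (requires a running systemd; compare against the 131072 limit)
systemctl list-units --all --no-legend | wc -l
```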

Indeed, once the limit is reached, many problems appear, including:

1. socket units die when getting triggered by incoming traffic

    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
    systemd[1]: fakeauth.socket: Failed to listen on sockets: Argument list too long
    systemd[1]: fakeauth.socket: Failed with result 'resources'.
    systemd[1]: Failed to listen on fakeauth.socket.
    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

    This prevents service handling completely.

2. mounts are not registered in systemd (but they still work)

    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
    systemd[1]: Failed to set up mount unit: Argument list too long
    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

3. logins are not moved to the expected user's cgroup

    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
    systemd-logind[905]: Failed to start session scope session-9.scope: Argument list too long
    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

4. the admin cannot reboot the system using the "reboot" or "shutdown" commands

    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
    systemd-initctl[368371]: Failed to change runlevel: Argument list too long
    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------


Such a limit can be "easily" reached with a socket-triggered service that regularly dies.
This happens in real life with "authd.socket", whose service fails if the remote end vanishes before "authd@.service" can read the addr/port of the remote end.
Indeed, socket-triggered services of Stream type consume 2 units each:
- one as "<service@instance>"
- one as "<service@instance-localaddr:localport-remoteaddr:remoteport>"

Hence, 64K failures accumulated in the past (which can be over a full year, for example) are sufficient to "take down" the system.
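The back-of-the-envelope arithmetic behind that figure (constants taken from the description above):

```shell
# How many Stream-type socket failures exhaust the unit name table?
MANAGER_MAX_NAMES=131072   # hardcoded limit, 128K
UNITS_PER_FAILURE=2        # each failed instance holds two unit names
echo $((MANAGER_MAX_NAMES / UNITS_PER_FAILURE))   # → 65536
```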

Additionally, failed services significantly hurt systemd's performance: finding a slot in the "manager->units" hashmap takes more and more time, because the failed services keep consuming buckets.
This is easily seen when spawning failing transient services in a loop: initially it's fast, then it slows down, and systemd ends up taking 80% of a CPU (or more).
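Until the root cause is addressed, the accumulated failed units can be released manually (a sketch; these commands assume a running systemd on the affected host):

```shell
# Show how many failed units are currently consuming unit slots
systemctl --failed --no-legend | wc -l

# Clear the "failed" state of all units, allowing them to be garbage-collected
systemctl reset-failed
```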

Version-Release number of selected component (if applicable):

systemd-239-68.el8.x86_64

How reproducible:

Always

Steps to Reproduce:

1. Create a dummy socket service listening on TCP stream that always fails

    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
    # cat /etc/systemd/system/fakeauth.socket 
    [Socket]
    ListenStream=113
    Accept=true
    
    # cat /etc/systemd/system/fakeauth@.service 
    [Service]
    ExecStart=/bin/false
    StandardInput=socket
    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

2. Start the service socket

    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
    # systemctl start fakeauth.socket
    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

3. Trigger the service in a loop until 64K instances are failing

    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
    # i=0; while [ $i -le 65535 ]; do ncat --send-only localhost 113 </dev/null; let i++; sleep 0.1; done
    -------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

    Note: due to the increasing slowness over time, the socket may drop incoming requests, so more iterations than the 65536 in the loop above may actually be required.

Actual results:

After ~65535 units have accumulated on the system, "Argument list too long" is seen and the "fakeauth.socket" unit dies.

Nov 24 13:18:05 vm-fakeauth8 systemd[1]: fakeauth.socket: Failed to listen on sockets: Argument list too long
Nov 24 13:18:05 vm-fakeauth8 systemd[1]: fakeauth.socket: Failed with result 'resources'.
Nov 24 13:18:05 vm-fakeauth8 systemd[1]: Failed to listen on fakeauth.socket.

Expected results:

The socket doesn't die, reboot can be issued, SSH logins are not left in the "sshd.service" cgroup, etc.

Additional info:

"Argument list too long" is not a good errno at all. The errno should be handled by the caller to explain more clearly what is going on.

Comment 1 Renaud Métrich 2022-11-24 13:52:32 UTC
Using the simple reproducer below, we can see systemd taking more and more CPU when spawning transient services:

# i=1; while [ $i -lt 100000 ]; do systemd-run /bin/false; let i++; done

Initially, we see 85 services spawned per second and ~56% CPU.
Later this drops to 30 services per second and ~80% CPU.
Finally it drops to 15 services per second.

I then stopped because it was too slow.

Comment 2 Renaud Métrich 2022-11-24 13:55:59 UTC
Counting the "Main process exited" journal messages per second confirms the slowdown:

# journalctl -b | grep "Main process exited" > errors

# awk '{ print $3 }' errors | uniq -c
    108 14:49:26
    108 14:49:27
    102 14:49:28
    100 14:49:29
     91 14:49:30
    102 14:49:31
    102 14:49:32
    101 14:49:33
     97 14:49:34
     96 14:49:35
     90 14:49:36
     90 14:49:37
     96 14:49:38
     87 14:49:39
     89 14:49:40
     91 14:49:41
     92 14:49:42
     94 14:49:43
     92 14:49:44
     92 14:49:45
     :
     64 14:50:59
     61 14:51:00
     64 14:51:01
     63 14:51:02
     64 14:51:03
     64 14:51:04
     62 14:51:05
     63 14:51:06
     63 14:51:07
     57 14:51:08
     60 14:51:09
     :
     43 14:54:07
     43 14:54:08
     44 14:54:09
     43 14:54:10
     42 14:54:11
     44 14:54:12
     :

Comment 3 Renaud Métrich 2022-11-24 16:02:33 UTC
The "Argument list too long" error (-E2BIG) comes from this code:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 182 int unit_add_name(Unit *u, const char *text) {
 :
 235         if (hashmap_size(u->manager->units) >= MANAGER_MAX_NAMES)
 236                 return -E2BIG;
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Comment 4 Renaud Métrich 2022-11-24 16:13:35 UTC
Hitting the Power Button to stop the system (QEMU/KVM) fails as well:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Nov 24 17:11:15 vm-fakeauth8 qemu-ga[884]: info: guest-shutdown called, mode: powerdown
Nov 24 17:11:15 vm-fakeauth8 systemd-logind[903]: Creating /run/nologin, blocking further logins...
Nov 24 17:11:15 vm-fakeauth8 systemd-logind[903]: Failed to get load state of poweroff.target: Unknown object '/org/freedesktop/systemd1/unit/poweroff_2etarget'.
Nov 24 17:11:15 vm-fakeauth8 systemd-logind[903]: Scheduled shutdown to poweroff.target failed: Invalid request descriptor
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Comment 5 Lukáš Nykrýn 2022-11-30 12:31:21 UTC
We talked about this in an upstream meeting, and there is an existing solution:

https://www.freedesktop.org/software/systemd/man/systemd.unit.html#CollectMode=

CollectMode=inactive-or-failed should fix all their problems.

Comment 6 Frantisek Sumsal 2022-11-30 13:38:31 UTC
(In reply to Lukáš Nykrýn from comment #5)
> We talked about this in upstream meeting, and there is an existing solution
> 
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html#CollectMode=
> 
> CollectMode=inactive-or-failed should fix all their problems.

Also, this option should be available all the way back to RHEL 7.9, since it was backported in https://bugzilla.redhat.com/show_bug.cgi?id=1817576.
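For the reproducer above, the option could be applied via a drop-in along these lines (a sketch; the drop-in file name "collectmode.conf" is arbitrary):

```shell
# Hypothetical drop-in enabling garbage collection of failed instances
mkdir -p /etc/systemd/system/fakeauth@.service.d
cat > /etc/systemd/system/fakeauth@.service.d/collectmode.conf <<'EOF'
[Unit]
CollectMode=inactive-or-failed
EOF
systemctl daemon-reload
```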

Comment 8 Lukáš Nykrýn 2022-11-30 14:19:48 UTC
Adding "insights?". Maybe we should have a check for a large number of failed template units and suggest adding CollectMode=inactive-or-failed to the unit file.