Description of problem:

When a transient service fails, it keeps consuming a unit slot until "reset-failed" is issued. This becomes problematic when the maximum number of units is reached (hardcoded to 128K in the sources):

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 21 #define MANAGER_MAX_NAMES 131072 /* 128K */
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Indeed, once the limit is reached, many problems appear, including:

1. Socket units die when triggered by incoming traffic:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
systemd[1]: fakeauth.socket: Failed to listen on sockets: Argument list too long
systemd[1]: fakeauth.socket: Failed with result 'resources'.
systemd[1]: Failed to listen on fakeauth.socket.
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This prevents service handling completely.

2. Mounts are not registered in systemd (but they keep working):

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
systemd[1]: Failed to set up mount unit: Argument list too long
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

3. Logins are not moved to the expected user's cgroup:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
systemd-logind[905]: Failed to start session scope session-9.scope: Argument list too long
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

4. The admin cannot reboot the system using the "reboot/shutdown" commands:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
systemd-initctl[368371]: Failed to change runlevel: Argument list too long
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Such a limit can be reached "easily" when a socket-triggered service regularly dies.
This happens in real life with "authd.socket", whose service fails if the remote end vanishes before "authd@.service" can read the address/port of the remote end. Indeed, socket-triggered services of Stream type consume 2 units:
- one as "<service@instance>"
- one as "<service@instance-localaddr:localport-remoteaddr:remoteport>"

Hence it's sufficient to have 64K failures in the past (which can be spread over a full year, for example) to "take down" the system.

Additionally, failed services hurt systemd's performance a lot: it appears that finding a slot in the "manager->units" hashmap takes more and more time, because the failed services keep consuming buckets. This is easy to see when spawning transient failing services in a loop: initially it's fast, then it slows down, and systemd ends up taking 80% of a CPU (or more).

Version-Release number of selected component (if applicable):
systemd-239-68.el8.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a dummy socket service listening on a TCP stream that always fails:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# cat /etc/systemd/system/fakeauth.socket
[Socket]
ListenStream=113
Accept=true

# cat /etc/systemd/system/fakeauth@.service
[Service]
ExecStart=/bin/false
StandardInput=socket
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

2. Start the socket unit:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# systemctl start fakeauth.socket
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

3.
Trigger the service in a loop until 64K instances are failing:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
# i=0; while [ $i -le 65535 ]; do ncat --send-only localhost 113 </dev/null; let i++; sleep 0.1; done
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Note: due to the slowdown over time, the socket may drop incoming requests, hence more than 65535 iterations are actually required.

Actual results:

After ~65535 units exist on the system, "Argument list too long" is seen and the "fakeauth.socket" unit dies:

Nov 24 13:18:05 vm-fakeauth8 systemd[1]: fakeauth.socket: Failed to listen on sockets: Argument list too long
Nov 24 13:18:05 vm-fakeauth8 systemd[1]: fakeauth.socket: Failed with result 'resources'.
Nov 24 13:18:05 vm-fakeauth8 systemd[1]: Failed to listen on fakeauth.socket.

Expected results:

The socket doesn't die, reboot can be issued, ssh logins are not left in the "sshd.service" cgroup, etc.

Additional info:

"Argument list too long" is not a good errno at all. The errno should be handled by the caller to explain more clearly what's going on.
Using the simple reproducer below, we can see systemd taking more and more CPU when spawning transient services:

# i=1; while [ $i -lt 100000 ]; do systemd-run /bin/false; let i++; done

Initially, we see 85 services spawned per second at ~56% CPU. Later this drops to 30 services per second at ~80% CPU, then finally to 15 services per second, at which point I stopped because it was too slow.
# journalctl -b | grep "Main process exited" > errors
# awk '{ print $3 }' errors | uniq -c
    108 14:49:26
    108 14:49:27
    102 14:49:28
    100 14:49:29
     91 14:49:30
    102 14:49:31
    102 14:49:32
    101 14:49:33
     97 14:49:34
     96 14:49:35
     90 14:49:36
     90 14:49:37
     96 14:49:38
     87 14:49:39
     89 14:49:40
     91 14:49:41
     92 14:49:42
     94 14:49:43
     92 14:49:44
     92 14:49:45
      :
     64 14:50:59
     61 14:51:00
     64 14:51:01
     63 14:51:02
     64 14:51:03
     64 14:51:04
     62 14:51:05
     63 14:51:06
     63 14:51:07
     57 14:51:08
     60 14:51:09
      :
     43 14:54:07
     43 14:54:08
     44 14:54:09
     43 14:54:10
     42 14:54:11
     44 14:54:12
      :
The "Argument list too long" comes from this code:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
182 int unit_add_name(Unit *u, const char *text) {
  :
235         if (hashmap_size(u->manager->units) >= MANAGER_MAX_NAMES)
236                 return -E2BIG;
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
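As a sketch of the caller-side handling suggested in the additional info above (the helper name and message text are hypothetical, not from the systemd sources), the -E2BIG from unit_add_name() could be translated into a message naming the real cause before it is logged:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: turn the -E2BIG returned by unit_add_name() into
 * a message naming the real cause, instead of letting it surface as the
 * generic strerror(E2BIG) text "Argument list too long". */
static const char *unit_add_name_strerror(int r) {
        if (r == -E2BIG)
                return "Refusing to add unit name: maximum number of units (MANAGER_MAX_NAMES) reached";
        return strerror(-r);
}
```

Any call site that currently passes the raw errno to the logger could use such a helper, so the journal would point at the unit-count limit rather than at an unrelated-sounding exec() error.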
Trying to hit the Power Button to stop the system (QEMU/KVM) fails as well:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
Nov 24 17:11:15 vm-fakeauth8 qemu-ga[884]: info: guest-shutdown called, mode: powerdown
Nov 24 17:11:15 vm-fakeauth8 systemd-logind[903]: Creating /run/nologin, blocking further logins...
Nov 24 17:11:15 vm-fakeauth8 systemd-logind[903]: Failed to get load state of poweroff.target: Unknown object '/org/freedesktop/systemd1/unit/poweroff_2etarget'.
Nov 24 17:11:15 vm-fakeauth8 systemd-logind[903]: Scheduled shutdown to poweroff.target failed: Invalid request descriptor
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
We talked about this in the upstream meeting, and there is an existing solution:

https://www.freedesktop.org/software/systemd/man/systemd.unit.html#CollectMode=

CollectMode=inactive-or-failed should fix all these problems.
(In reply to Lukáš Nykrýn from comment #5)
> We talked about this in upstream meeting, and there is an existing solution
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html#CollectMode=
> CollectMode=inactive-or-failed should fix all their problems.

Also, this option should be available all the way back to RHEL 7.9, since it was backported in https://bugzilla.redhat.com/show_bug.cgi?id=1817576.
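For the reproducer units above, the option could be applied as a drop-in for the template unit; the drop-in path and file name below are illustrative:

```ini
# /etc/systemd/system/fakeauth@.service.d/collect.conf
[Unit]
CollectMode=inactive-or-failed
```

After a `systemctl daemon-reload`, failed fakeauth@.service instances become eligible for garbage collection instead of pinning a unit slot until an explicit "reset-failed".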
Adding "insights?". Maybe we should have a check for a lot of failed template units and suggest adding CollectMode=inactive-or-failed to the unit file.