Bug 2248071 - systemd-oomd doesn't kick in on high memory pressure, leading to system lockup
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 39
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker
Depends On:
Blocks:
Reported: 2023-11-05 23:52 UTC by Jonas Dreßler
Modified: 2024-03-05 04:26 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
log of oomctl output (35.51 KB, text/plain)
2024-02-12 19:00 UTC, Adam Williamson

Description Jonas Dreßler 2023-11-05 23:52:22 UTC
Description of problem: systemd-oomd is supposed to kill processes under high memory pressure, but it doesn't seem to kick in (when I run `oomctl`, the total system memory and swap both show 0). When some process goes haywire, this eventually leads to a lockup of the system.

In theory the kernel OOM killer should also kick in, but that never happens because the system locks up first.


See also: https://github.com/systemd/systemd/issues/25596

Steps to Reproduce:
1. Open lots of tabs in Firefox with browser.tabs.unloadOnLowMemory set to `false`

Actual results: the system locks up


Expected results: firefox processes get killed by systemd-oomd and the system eventually comes back

Comment 1 Anita Zhang 2024-01-05 22:58:55 UTC
Hi, do you have the journal logs from this event? Or if you're able to run `oomctl` during this event can you send the output? `lsblk` and `uname -a` output would also be helpful for understanding your system.

I suspect the pressure is not meeting the thresholds we have set by default for Fedora and that it would make more sense to lower it for your system.
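
For illustration, lowering the threshold could look roughly like the drop-ins below. This is only a sketch: both file names are hypothetical, the 50% and 10s values are placeholders rather than recommendations, and it assumes the limit is applied per unit on user@.service, which may differ between Fedora releases.

```ini
# File 1: /etc/systemd/system/user@.service.d/99-oomd-limit.conf (hypothetical name)
[Service]
ManagedOOMMemoryPressure=kill
# Placeholder value; tune for the machine in question.
ManagedOOMMemoryPressureLimit=50%

# File 2: /etc/systemd/oomd.conf.d/99-limit.conf (hypothetical name)
[OOM]
# Fallback for units that do not set their own limit; placeholder values.
DefaultMemoryPressureLimit=50%
DefaultMemoryPressureDurationSec=10s
```

After a `systemctl daemon-reload` and a restart of systemd-oomd.service, `oomctl` should list the new "Memory Pressure Limit" for the monitored cgroups.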

Comment 2 Jonas Dreßler 2024-01-06 12:21:51 UTC
I don't have logs: the system locks up completely, and I don't think anything is even written to disk anymore.

> I suspect the pressure is not meeting the thresholds we have set by default for Fedora and that it would make more sense to lower it for your system.

Maybe? Although from the output of oomctl (Total memory for system context: 0), I'm not sure whether oomd is even working and monitoring memory usage. I just found this Reddit post, which appears to explain the zero (https://www.reddit.com/r/systemd/comments/175mu49/oomctl_shows_0b_of_ram/), but I'm not sure whether that means oomd isn't working correctly.

There's a very good reproducer on the github issue which should work on any system:

In GNOME Terminal, run this:

dd if=/dev/zero | tr '\0' '*' | less

Wait for the ":" prompt and press Shift+G to cause less to keep looking for the EOF that never comes.

Comment 3 Fedora Blocker Bugs Application 2024-01-30 09:06:26 UTC
Proposed as a Blocker for 40-final by Fedora user verdre using the blocker tracking app because:

systemd-oomd isn't working out of the box. We depend on the kernel OOM killer, and that one might only kick in when it's too late. This means out-of-memory scenarios often lead to a complete system lockup on Fedora. We have had this bug for multiple releases and it seems to affect a lot of people; it's just rarely reported.

There's a reproducer in the bug report to confirm that systemd-oomd doesn't kick in. When running it, on most devices the kernel OOM killer ends up killing the process, but on others the system locks up.

Comment 4 Adam Williamson 2024-02-12 18:56:04 UTC
So I booted a recent Rawhide install in a VM and tried the reproducer from comment #2. I got a Shell notification "Virtual Terminal Stopped   Device memory is nearly full. Virtual terminal processes were using a lot of memory and were forced to stop.", and in the journal, I see:

test@fedora:~$ sudo journalctl -b | grep -i oom
Feb 12 10:50:36 fedora systemd[1]: Listening on systemd-oomd.socket - Userspace Out-Of-Memory (OOM) Killer Socket.
Feb 12 10:50:37 fedora systemd[1]: Starting systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer...
Feb 12 10:50:37 fedora systemd[1]: Started systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer.
Feb 12 10:50:37 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-oomd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Feb 12 10:53:44 fedora kernel: spice-vdagentd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Feb 12 10:53:44 fedora kernel:  oom_kill_process+0xfa/0x200
Feb 12 10:53:44 fedora kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
Feb 12 10:53:44 fedora kernel: [    766]   998   766     4056      640       64      576         0    69632      160          -900 systemd-oomd
Feb 12 10:53:44 fedora kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3c7cc564-5671-42e0-ae98-85ed81e13f11.scope,task=less,pid=3267,uid=1000
Feb 12 10:53:44 fedora kernel: Out of memory: Killed process 3267 (less) total-vm:8619972kB, anon-rss:1890344kB, file-rss:1664kB, shmem-rss:0kB, UID:1000 pgtables:7976kB oom_score_adj:200
Feb 12 10:53:44 fedora systemd[1]: user: A process of this unit has been killed by the OOM killer.
Feb 12 10:53:44 fedora systemd[1797]: vte-spawn-3c7cc564-5671-42e0-ae98-85ed81e13f11.scope: A process of this unit has been killed by the OOM killer.
Feb 12 10:53:44 fedora systemd[1797]: vte-spawn-3c7cc564-5671-42e0-ae98-85ed81e13f11.scope: Failed with result 'oom-kill'.

Is that the systemd-oomd killer or the kernel one?
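
One way to tell the two apart in a journal excerpt like the one above is to look at which syslog identifier reports the kill. The sketch below is hedged: the kernel pattern is taken from the "Out of memory: Killed process" line in this log, while the systemd-oomd pattern (a line from `systemd-oomd[PID]` containing "Killed") is an assumption about how oomd words its message. The function name `classify_oom_kills` is made up for this example.

```shell
# Label OOM kill lines read from stdin (e.g. piped from journalctl).
# Lines that match neither pattern are dropped.
classify_oom_kills() {
    awk '
        # Kernel OOM killer: pattern copied from the journal excerpt above.
        / kernel: Out of memory: Killed process / { print "kernel: " $0; next }
        # systemd-oomd: ASSUMED message shape (identifier plus the word "Killed").
        / systemd-oomd\[[0-9]+\]: .*Killed /      { print "oomd: " $0 }
    '
}
```

Usage would be something like `sudo journalctl -b | classify_oom_kills`. If the only lines that come back are labelled `kernel:`, systemd-oomd did not perform the kill.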

Comment 5 Adam Williamson 2024-02-12 19:00:25 UTC
Created attachment 2016493
log of oomctl output

I had a bash loop dumping the output of oomctl to a file every second during the reproduction attempt (with timestamps); here is the output. I can redo it if we need finer granularity than once a second.
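
For anyone wanting to collect the same kind of data, a loop along these lines should do. It is a sketch and not the exact loop used for the attachment; `snapshot_loop` is a made-up helper name, and the timestamp format is an arbitrary choice.

```shell
# Run a command repeatedly, prefixing each snapshot with a UTC timestamp.
# Usage: snapshot_loop COUNT INTERVAL_SECONDS COMMAND [ARGS...]
# The body runs in a subshell so its variables do not leak into the caller.
snapshot_loop() (
    count="$1"
    interval="$2"
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '=== %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        # Keep looping even if the command fails once under memory pressure.
        "$@" 2>&1 || true
        i=$((i + 1))
        sleep "$interval"
    done
)
```

For example, `snapshot_loop 600 1 oomctl > oomctl.log` would record ten minutes at one-second intervals. A fractional interval such as `0.2` gives finer granularity where `sleep` accepts it (GNU coreutils does).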

Comment 6 Chris Murphy 2024-02-12 19:15:24 UTC
Looks like the kernel oom-killer being invoked, not oomd. Either the process that got killed consumed memory so quickly that it threatened the kernel, or oomd isn't reacting soon enough. I'm not sure which. From the oomctl output attachment:
	Path: /user.slice/user-1000.slice/user/app.slice/app-org.gnome.Terminal.slice
		Memory Pressure Limit: 80.00%
		Pressure: Avg10: 3.75 Avg60: 0.76 Avg300: 0.16 Total: 496ms
		Current Memory Usage: 1.4G
Seems to me oomd should have reacted sooner.
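
To make the comparison above concrete: per the oomd.conf documentation, the pressure limit is compared against the fraction of time in a 10-second window during which all tasks in the cgroup were stalled, i.e. the "full" avg10 value of the cgroup's memory.pressure file. The sketch below only performs that one comparison. It ignores the duration requirement and everything else oomd takes into account, and `pressure_status` is a made-up helper name.

```shell
# Read cgroup v2 memory.pressure text on stdin and report whether the
# "full" avg10 value exceeds the given percentage limit.
# Usage: pressure_status LIMIT_PERCENT < /sys/fs/cgroup/<cgroup>/memory.pressure
pressure_status() {
    awk -v limit="$1" '
        $1 == "full" {
            split($2, kv, "=")    # $2 looks like "avg10=3.75"
            status = (kv[2] + 0 > limit + 0) ? "over" : "under"
            printf "avg10=%s limit=%s %s\n", kv[2], limit, status
        }
    '
}
```

With the values quoted above (Avg10 of 3.75 against an 80.00% limit) this reports "under", which matches oomd not acting: the pressure figure never came close to the configured limit before the kernel stepped in.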

Comment 7 Alessandro Astone 2024-02-12 22:14:16 UTC
Indeed, only the kernel OOM killer seems to kick in, and often only many seconds after the desktop has already frozen.

I was reminded of some of the worries around https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount, as I'm the owner of that change, but I can reproduce this both with and without the change.

Comment 8 Geoffrey Marr 2024-02-13 16:22:44 UTC
Discussed during the 2024-02-12 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made as we agreed this bug could do with further testing for a more complete understanding of whether systemd-oomd really isn't working at all, or just isn't kicking in under certain circumstances.

[0] https://meetbot.fedoraproject.org/blocker-review_matrix_fedoraproject-org/2024-02-12/f40-blocker-review.2024-02-12-17.05.txt

Comment 9 Geoffrey Marr 2024-03-05 04:26:35 UTC
Discussed during the 2024-03-04 blocker review meeting: [0]

The decision to classify this bug as a "RejectedBlocker (Final)" was made as we can't find a justification for calling this a release blocker. Even if systemd-oomd never kicks in at all, we have no release criteria covering what should happen if you run your system out of memory, and it will always be something bad. There's no requirement in Fedora that the bad thing be "systemd kills an app" rather than "the kernel kills an app" or "everything seizes up".

[0] https://meetbot.fedoraproject.org/blocker-review_matrix_fedoraproject-org/2024-03-04/f40-blocker-review.2024-03-04-17.00.log.txt

