Bug 2133829
| Summary: | systemd-oomd killing anaconda, plasma shell, Firefox in KDE Rawhide | | |
| --- | --- | --- | --- |
| Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rawhide | CC: | acaringi, adscvr, agurenko, airlied, alciregi, bskeggs, bugzilla, bugzilla.redhat.com, fedora, grgoffe, hdegoede, hpa, jarodwilson, jglisse, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, mikhail.v.gavrilov, ptalbert, steved, the.anitazha, zbyszek |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | openqa | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-11-21 17:01:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Adam Williamson, 2022-10-11 14:25:54 UTC
I'm even seeing a few cases where it's apparently killing a console session with DNF running in it(!), e.g. here: https://openqa.fedoraproject.org/tests/1505604#

If you watch the video of that test - https://openqa.fedoraproject.org/tests/1505604/video?filename=video.ogv - around the 18 second mark, it's in the middle of a `dnf install` command when suddenly it drops back to the login prompt. In the journal we see:

Oct 12 13:28:28 fedora systemd-oomd[543]: Killed /user.slice/user-0.slice/session-4.scope due to memory pressure for /user.slice/user-0.slice being 94.16% > 50.00% for > 20s with reclaim activity

You can download a tarball of /var/log at https://openqa.fedoraproject.org/tests/1505604/file/_advisory_update-var_log.tar.gz . Note systemd in Rawhide has a new journal format, so you can't read the journal files from F37 or earlier with `journalctl --file`; you have to do it in a Rawhide mock chroot or something.

The test boots a clean KDE install, leaves it sitting at the login prompt, then switches to a text console to do some preparation; the OOM kill and test failure happened during this phase. It was run with 3G of RAM.
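For anyone who needs to read those new-format journal files from an older release, here's a minimal sketch of the mock-chroot approach mentioned above (the journal path inside the tarball is assumed to follow the standard var/log/journal/<machine-id>/ layout):

```
# Spin up a Rawhide chroot and read the journal with Rawhide's journalctl
mock -r fedora-rawhide-x86_64 --init
mock -r fedora-rawhide-x86_64 --copyin _advisory_update-var_log.tar.gz /tmp/
mock -r fedora-rawhide-x86_64 --shell
# inside the chroot:
tar -xzf /tmp/_advisory_update-var_log.tar.gz -C /tmp/
journalctl --file='/tmp/var/log/journal/*/system.journal' | grep -i oomd
```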
Is it still happening?

It was definitely still a big issue up to the 1014 kernel. I think the 1019 kernel may have made things better, but I want to watch it a bit longer.

Possible dup: bug 2138369

Chris, after reading your reporting, I find myself agreeing that Bug 2138369 is a dup of this bug. I'm running the "latest" iso from dl.fedoraproject.org, Fedora-Everything-netinst-x86_64-Rawhide-20221029.n.0.iso, in a rescue environment. So far dnf seems to be working... No OOM yet. :-) Thank you for your VERY FINE work with this. Best regards, George...

*** Bug 2138369 has been marked as a duplicate of this bug. ***

This does seem to still be a problem in current Rawhide. e.g. the KDE live install tests for recent Rawhide composes usually fail, like this one today: https://openqa.fedoraproject.org/tests/1555308# - that pretty clearly failed because first anaconda and then KDE got OOM-killed during the install process. That was with kernel-6.1.0-0.rc2.20221028git23758867219c.24.fc38. Tomorrow's compose with the non-debug rc3 kernel will likely work better, but there's clearly still a problem here. Note, that test ran with 3G of RAM. I can kick the kde-live-iso compose tests up to 4G of RAM too, I guess, and that might help mitigate this. But up till this 6.1 kernel cycle this test used to run fine in 2G of RAM, even on debug kernels.

This smells like bug 2119518

No, it's not the same as that. The numbers aren't wildly insane like they were in that case. We don't see multiple millions of percent of memory pressure; we see, like, 65% or 70%. It feels more like the calculation is being performed as intended, but it's just too aggressive. I don't know what the *change* was - whether the calculation has been made intentionally more aggressive, or there just really *is* more memory "pressure" happening for some reason, or what. But it seems fairly clear that it's killing stuff in situations where it doesn't really need to.

To be honest, I'm worried our current approach seems a bit too aggressive even on F37 with kernel 5.19, on my day-to-day laptop. I switched to a laptop with 8G of RAM (instead of 16G) recently, and only 6.5G of that is usable (1.5G used by the GPU, I think). I got a 6.5G ZRAM swap partition (this is a default install of Silverblue 37). Just the other day, my entire desktop got OOM killed because I was dumb enough to open a new browser tab while Sound Converter was converting some audio. If the system had waited 15 seconds for it to finish, things would probably have been fine. I run Evolution, Firefox, gedit, some terminals, and occasionally an extra app like Sound Converter. Evo uses quite a lot of memory because I have a lot of mail, but still, 8G of RAM isn't a small amount and it's weird that I run into OOM kills semi-often. More than once I've opened a new tab in Firefox, things ground to a halt for a bit, and then I realized Evo had disappeared because it got OOM killed... I can just go get a laptop with 16G of RAM again, and probably will, one way or another. But I'm not convinced this OOM killing is really giving me a better experience than I'd have if I just left the kernel and swap partition to figure things out. I might turn it off for a while and see how things go...
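For anyone wanting to run the same experiment, turning systemd-oomd off is a couple of standard systemctl calls (sketch; re-enabling is the mirror-image unmask/enable):

```
# Stop systemd-oomd now and keep it from starting on future boots
sudo systemctl disable --now systemd-oomd.service
# Optionally mask it so no dependency can pull it back in
sudo systemctl mask systemd-oomd.service
# Verify it is no longer running
systemctl status systemd-oomd.service
```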
I tried this in a VM and couldn't reproduce it with Fedora-KDE-Live-x86_64-Rawhide-20221101.n.0.iso and kernel 6.1.0-0.rc3.27.fc38.x86_64. However, https://openqa.fedoraproject.org/tests/1561371 used 6.1.0-0.rc3.27.fc38.x86_64 and it failed with this problem. The log is in _do_install_and_reboot-var_log.tar.gz:

Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: systemd-oomd killed some process(es) in this unit.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2372 (anaconda) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2384 (dbus-daemon) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2387 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2398 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2404 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2419 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2420 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2422 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2423 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2424 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 2427 (python3) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 3118 (rsync) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 3120 (rsync) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Killing process 3121 (rsync) with signal SIGKILL.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Failed to kill control group /user.slice/user-1000.slice/user/app.slice/app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope, ignoring: Operation not permitted
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Failed with result 'oom-kill'.
Nov 1 19:38:02 fedora systemd[1303]: app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope: Consumed 1min 470ms CPU time.
Nov 1 19:38:02 fedora systemd[1303]: selinux: avc: op=setenforce lsm=selinux enforcing=0 res=1
Nov 1 19:38:02 fedora systemd-oomd[980]: Considered 16 cgroups for killing, top candidates were:
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 93.01 Avg60: 36.96 Avg300: 9.09 Total: 28s
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 1.9G
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 2726119
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 2726119
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/app-org.kde.kalendarac
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 403.1M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 19614
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 19614
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/dbus-:1.2-com.redhat.imsettings
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 43.0M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 1811
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 1811
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/xdg-desktop-portal-gnome.service
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 28.0M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 9887
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 9887
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/app-org.kde.kdeconnect.daemon
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 21.2M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 1077
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 1077
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/app-kaccess
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 17.2M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 187
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 187
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/xdg-desktop-portal-gtk.service
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 14.1M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 382
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 382
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/app-org.freedesktop.problems.applet
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 5.7M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 1360
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 1360
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/ssh-agent.service
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 1.0M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 13
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 13
Nov 1 19:38:02 fedora systemd-oomd[980]:     Path: /user.slice/user-1000.slice/user/app.slice/dbus-:1.2-org.freedesktop.secrets
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Pressure Limit: 0.00%
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.00 Total: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Current Memory Usage: 1.0M
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Min: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Memory Low: 0B
Nov 1 19:38:02 fedora systemd-oomd[980]:         Pgscan: 0
Nov 1 19:38:02 fedora systemd-oomd[980]:         Last Pgscan: 0
Nov 1 19:38:02 fedora systemd-oomd[980]: Killed /user.slice/user-1000.slice/user/app.slice/app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope due to memory pressure for /user.slice/user-1000.slice/user/app.slice being 92.71% > 50.00% for > 20s with reclaim activity

On my slow VM I see plasma-shell using 22% of memory by itself for quite a while, but then it calms down to about 5%. I wonder if my slow VM allows enough time to go by for things to settle, whereas openQA is running on much beefier hardware, possibly permitting it to slam right into this high memory usage. Anyway, maybe we need to let more time go by? Or bump both the % pressure and the time threshold? Or look at the amount of reclaim to see if that's affecting IO in some way that translates into not making adequate forward progress? In my case, it definitely eventually finishes the installation.
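On bumping the thresholds: both knobs already exist in systemd-oomd, so this is tunable without code changes. A sketch, assuming the stock Fedora setup where the per-slice limit is the 50% seen in the log above (the drop-in file name and the exact unit carrying ManagedOOMMemoryPressureLimit are assumptions; `oomctl dump` shows what is actually monitored):

```
# Global defaults: raise the pressure limit and duration via an oomd.conf drop-in
sudo mkdir -p /etc/systemd/oomd.conf.d
sudo tee /etc/systemd/oomd.conf.d/50-relaxed.conf <<'EOF'
[OOM]
DefaultMemoryPressureLimit=80%
DefaultMemoryPressureDurationSec=60s
EOF
sudo systemctl restart systemd-oomd.service

# Per-unit: raise the 50% limit for one user session (runtime only, assumed unit)
sudo systemctl set-property --runtime user@1000.service ManagedOOMMemoryPressureLimit=80%
```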
I wonder if Bug 2134971 is a duplicate of this, or related. To quote from Bug 2134971 Comment 4:

> The latest kernel that I cannot reproduce the problem on is kernel-6.1.0-0.rc0.20221007git4c86114194e6.5.fc38. The next one, kernel-6.1.0-0.rc0.20221009gita6afa4199d3d.7.fc38, inevitably leads to systemd-oomd starting to kill random applications and eventually gnome-session after some time.

The logs only list up to 10 candidates, but I think when one leaf cgroup has avg10 > 90 it's driving up the overall pressure for its parent cgroup. So in Chris's log, /user.slice/user-1000.slice/user/app.slice/app-liveinst-da6d8311ca2146f4acd49e74909e2873.scope with avg10 93.01 is driving /user.slice/user-1000.slice/user/app.slice to 92.71, which is over the threshold of 50. This seems to hold true for https://bugzilla.redhat.com/show_bug.cgi?id=2134971 too, but since we only dump 10 of many candidates, a lot of the kill situations show candidates with avg10 0.00 but a parent cgroup with avg10 > 90. I suspect there is a candidate outside of the 10 we dump that has avg10 > 90. systemd-oomd uses the "full" metric for memory pressure, so I don't think one cgroup should be driving the parent's pressure up that much.

Doing a `git log 4c86114194e6..a6afa4199d3d` on the linux repo, https://lore.kernel.org/all/20220915094200.139713-1-hch@lst.de/ seems the most suspicious, as it's moving/modifying the PSI accounting code. If it's easy to test, maybe try before and after that patchset? I'm going to try and pull in one of our resctl developers to help.
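For readers following the avg10/avg60/avg300 numbers: those are the kernel's PSI averages, and the "some" vs "full" distinction mentioned above is visible directly in the pressure files (the cgroup path below mirrors the slice in Chris's log and may differ on a given system):

```
# System-wide memory pressure; "some" = any task stalled on memory,
# "full" = all non-idle tasks stalled at once (what systemd-oomd uses)
cat /proc/pressure/memory

# Per-cgroup pressure for the monitored app slice (path assumed)
cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/memory.pressure

# systemd-oomd's own view of the cgroups it monitors
oomctl dump
```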
I tried bisecting and got pretty far, but the results make no sense to me, hence I assume something went sideways. 0a78a376ef3c is fine; c6ea70604249 (which basically is the next commit) is not. But reverting that commit there doesn't solve the problem, which IOW means that the merge base (v6.0-rc2) is broken, too. That's odd.
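For completeness, a sketch of that bisect setup using the last-good/first-bad kernel commits named in the quoted comment (the reproducer step is whatever workload triggers the kills):

```
# Bisect the kernel between the snapshot commits from Bug 2134971
git bisect start
git bisect bad a6afa4199d3d     # first kernel showing oomd kills
git bisect good 4c86114194e6    # last kernel without them
# build + boot each candidate, run the reproducer, then mark it:
git bisect good    # or: git bisect bad
# when finished:
git bisect reset
```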
To quote myself from Bug 2134971 Comment 10:

> I can reproduce the problem easily on master as of today (8e5423e991e8), but it looks like (tests are running for ~15 minutes now; normally they triggered the problem in a minute or two) the problem vanishes by reverting 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads") on top of master.

Patch posted; we could test it in Rawhide if desired... https://lore.kernel.org/regressions/Y2Q+y8t9PV5nrjud@cmpxchg.org/

*** Bug 2134971 has been marked as a duplicate of this bug. ***

The proposed patch fixes the issue for me: https://lore.kernel.org/regressions/5f7bac77-c088-6fb7-ccb5-bef9267f7186@leemhuis.info/

*** Bug 2140664 has been marked as a duplicate of this bug. ***

So, Justin did backport the fix for this last week. I had already worked around the issue in openQA (by disabling systemd-oomd), but some manual testing does suggest the fix probably works. I'll try removing the workaround on openQA stg and check we don't get problems again.

Bugs here didn't come back when I removed the workarounds, so I'm pretty sure the fix is good. Thanks!

Adam, I just finished installing a fresh spin, Fedora-Everything-netinst-x86_64-Rawhide-20221129.n.0.iso, into a VBox VM and then upgraded with about half of the groups. I was continually being oom-killed. I stopped systemd-oomd.service and had no further problems. It looks like the oom-killer bug is still with us? Best regards, George...

Not this specific bug, no. If this bug was back, openQA tests would have started blowing up. If you can reproduce it killing things in a situation where it definitely should not, file a new bug with details - the relevant log messages, information on how much RAM is actually being used, and so on.
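For anyone in George's position, a quick way to gather the details Adam asks for before filing a new bug (all standard tools):

```
# oomd's kill decisions and candidate dumps from the current boot
journalctl -b -u systemd-oomd.service --no-pager
journalctl -b --grep systemd-oomd --no-pager
# actual memory/swap state while the problem reproduces
free -h
swapon --show
oomctl dump
```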