Created attachment 1906372 [details]
journal output from one case of this failure

In the last few days, a lot of GNOME tests have failed during Rawhide compose and update testing in openQA because systemd-oomd kills GNOME (so the desktop just suddenly disappears and the test wigs out). This seems to coincide with kernel 6.0 landing in Rawhide. I haven't observed the problem on Fedora 37, where we're still on kernel 5.19, yet. It has happened on composes with both regular and debug kernels, though it seems to happen more often on composes with debug kernels. I'll attach the full journal output from one example case, where a test of GNOME Clocks on Silverblue failed because Clocks got OOM killed before its window could even spawn. The journal also shows two other things being OOM killed shortly before that.
*** Bug 2119432 has been marked as a duplicate of this bug. ***
I'm bumping the openQA x86_64 workers to 3G of RAM to see if it helps with this, but even if it does, the underlying behaviour still looks like a bug...
Welp, still hitting this on the VMs with 3G of RAM. And the latest one sets a new record:

Killed /user.slice/user-1000.slice/user/session.slice/org.gnome.Shell due to memory pressure for /user.slice/user-1000.slice/user being 27925460729.27% > 50.00% for > 20s with reclaim activity

Yeah, 27925460729.27% somehow doesn't seem likely. I'm pretty sure there's a bug here somewhere. Still don't think I've seen this on 37.
Maybe a PSI or a time tracking bug.

>Sep 09 03:01:05 fedora systemd-oomd[604]: Pressure: Avg10: 1255260529528.42 Avg60: 325612.68 Avg300: 757127258245.62 Total: 2month 4w 2d 8h 15min 12s

All the unrealistically high memory pressure messages also have ridiculously high total absolute stall times. 2 months? lol, it hasn't even been up much more than 2 minutes.
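A quick way to sanity-check these numbers, for anyone following along (this is just an ad-hoc snippet of mine, assuming the kernel's PSI "total" field is in microseconds, which is what the PSI docs say; adjust the cgroup path to taste): the "some" stall total is wall-clock time spent with at least one task stalled, so it can never legitimately exceed uptime.

# any cgroup's pressure file works here; this one is the logged-in user's slice
f=/sys/fs/cgroup/user.slice/user-1000.slice/memory.pressure
uptime_us=$(awk '{printf "%d", $1 * 1000000}' /proc/uptime)
total_us=$(awk '/^some/ { sub("total=", "", $5); print $5 }' "$f")
echo "uptime: ${uptime_us}us  memory 'some' stall total: ${total_us}us"
[ "$total_us" -le "$uptime_us" ] && echo "plausible" || echo "bogus: stall total exceeds uptime"

On a sane kernel "bogus" should never print.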
Hah, months? Check this from the new record holder:

Aug 20 00:11:26 localhost-live systemd-oomd[1045]: Path: /user.slice/user-1000.slice/user/app.slice/evolution-addressbook-factory.service
Aug 20 00:11:26 localhost-live systemd-oomd[1045]: Memory Pressure Limit: 0.00%
Aug 20 00:11:26 localhost-live systemd-oomd[1045]: Pressure: Avg10: 7879012599.42 Avg60: 1484450449675.62 Avg300: 910945133120.70 Total: 263y 7month 3w 4d 8h 3min 1s

This is currently happening very often on a test that boots a freshly-installed GNOME live image and (tries to) install to disk. It happens during or right after login. I don't *think* I've seen it on the corresponding KDE live image test at all, which is interesting.
I don't think it's that KDE doesn't have oomd, because both the GNOME and KDE images include systemd-oomd-defaults.
Maybe this: https://www.spinics.net/lists/cgroups/msg34044.html Does KDE enable uresourced out of the box? I wonder if there's some interaction there.
I'm not seeing this behavior when booting Live media. Are all the failures testing the installed OS?
Disregard comment 8, and we have a new winner for time...

>Aug 19 22:48:37 localhost-live systemd-oomd[1110]: Pressure: Avg10: 120291672177.89 Avg60: 2181975382001.46 Avg300: 1743412212095.28 Total: 144y 8month 3d 22h 32min 25s
So I tweaked the 'build a live image, boot and install it' update test scenario that's commonly affected by this to leave systemd-oomd-defaults out of the live image. I re-ran seven cases of that test which were hit by the bug; each built image is boot/install tested twice (once on BIOS, once on UEFI), so that's 14 tests. They all passed first time.

To me that's another strong indication this is bogus behaviour: we're not *really* running out of memory. If we were genuinely running out of memory shortly after GNOME startup, before anaconda even got running properly, we shouldn't be able to make it all the way through a system install successfully...
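For anyone who wants to test the same theory on an already-installed system rather than rebuilding the live image, masking the oomd unit should be roughly equivalent (not what I actually did, but same effect):

# stop systemd-oomd and prevent it from being started again
sudo systemctl mask --now systemd-oomd.service

That just takes the killer out of the picture; the bogus pressure numbers will still show up under /sys/fs/cgroup.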
It's not uresourced related. Even with it masked, on roughly 1 in 4 boots I get something cuckoo for cocoa puffs...

>Aug 20 15:12:26 localhost-live systemd-oomd[1113]: Path: /user.slice/user-1000.slice/user/app.slice/evolution-calendar-factory.service
>Aug 20 15:12:26 localhost-live systemd-oomd[1113]: Memory Pressure Limit: 0.00%
>Aug 20 15:12:26 localhost-live systemd-oomd[1113]: Pressure: Avg10: 1324720599096.71 Avg60: 1076972128526.05 Avg300: 237243120076.18 Total: 582y 3month 3d 4h 4min 43s

I started an upstream cgroups thread since this looks PSI related:
https://www.spinics.net/lists/cgroups/msg34102.html
Well, rc2 doesn't fix this, sadly. The live install test of today's compose got OOM killed while anaconda was launching:

https://openqa.fedoraproject.org/tests/1389696/video?filename=video.ogv&t=5.83,5.88
https://openqa.fedoraproject.org/tests/1389696/file/_boot_to_anaconda-var_log.tar.gz

Aug 23 07:08:30 localhost-live systemd-oomd[1045]: Killed /user.slice/user-1000.slice/user/session.slice/org.gnome.Shell due to memory pressure for /user.slice/user-1000.slice/user being 37934347063.24% > 50.00% for > 20s with reclaim activity
Aug 23 07:08:54 localhost-live systemd-oomd[1045]: Killed /user.slice/user-42.slice/user/session.slice/wireplumber.service due to memory pressure for /user.slice/user-42.slice/user being 249958565827.34% > 50.00% for > 20s with reclaim activity
Aug 23 07:09:09 localhost-live systemd-oomd[1045]: Killed /user.slice/user-42.slice/user/session.slice/wireplumber.service due to memory pressure for /user.slice/user-42.slice/user being 50523439279.39% > 50.00% for > 20s with reclaim activity
# oomctl
Dry Run: no
Swap Used Limit: 90.00%
Default Memory Pressure Limit: 60.00%
Default Memory Pressure Duration: 20s
System Context:
        Memory: Used: 1.7G Total: 3.8G
        Swap: Used: 0B Total: 3.8G
Swap Monitored CGroups:
        Path: /
                Swap Usage: (see System Context)
Memory Pressure Monitored CGroups:
        Path: /user.slice/user-1000.slice/user
                Memory Pressure Limit: 50.00%
                Pressure: Avg10: 0.00 Avg60: 0.00 Avg300: 0.10 Total: 140y 3month 18h 24min 20s
                Current Memory Usage: 1.5G
                Memory Min: 250.0M
                Memory Low: 0B
                Pgscan: 0
                Last Pgscan: 0

# cat /sys/fs/cgroup/user.slice/user-1000.slice/user/memory.pressure
some avg10=0.00 avg60=0.00 avg300=0.18 total=8367757640799118
full avg10=0.00 avg60=0.00 avg300=0.07 total=4426019660641432

Seems pretty clearly kernel misreporting in cgroups.
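As a unit check on that (the kernel's total= is microseconds, so this is just arithmetic), converting the "full" total from memory.pressure:

awk 'BEGIN { t = 4426019660641432; print t / 1e6 / 3600 / 24 / 365.25, "years" }'
# -> ~140.25 years

...which matches the "140y 3month" figure oomctl prints above. So oomd is faithfully re-formatting what the kernel hands it; the garbage originates in the kernel's cgroup PSI accounting, not in systemd-oomd.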
Not that I fully understand it, but https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5f69a6577bc33d8f6d6bbe02bccdeb357b287f56 sure feels like a candidate here. I'm gonna try building a kernel with that reverted and test it.
I'm seeing in sysfs that cgroups is reporting outlandish values for io and cpu pressure too, not just memory. We just happen to only be getting memory-pressure-based kills. e.g.

/sys/fs/cgroup/user.slice/user-1000.slice/user/session.slice/dbus-broker.service/cpu.pressure:some avg10=0.00 avg60=13574509.62 avg300=36764614025.65 total=4231219592767614
/sys/fs/cgroup/user.slice/user-1000.slice/user/session.slice/dbus-broker.service/cpu.pressure:full avg10=0.00 avg60=39417441.16 avg300=353622926862.02 total=1387290370370316
/sys/fs/cgroup/user.slice/user-1000.slice/user/session.slice/org.gnome.SettingsDaemon.Datetime.service/io.pressure:some avg10=0.00 avg60=0.00 avg300=2804040.90 total=8299699133557533
/sys/fs/cgroup/user.slice/user-1000.slice/user/session.slice/org.gnome.SettingsDaemon.Datetime.service/io.pressure:full avg10=0.00 avg60=0.00 avg300=801154.57 total=465505844
/sys/fs/cgroup/user.slice/user-1000.slice/user/session.slice/org.gnome.SettingsDaemon.Datetime.service/memory.pressure:some avg10=0.00 avg60=0.00 avg300=11622293128.69 total=34359738
/sys/fs/cgroup/user.slice/user-1000.slice/user/session.slice/org.gnome.SettingsDaemon.Datetime.service/memory.pressure:full avg10=0.00 avg60=0.00 avg300=1.44 total=12

That commit does touch PSI info for all three of those resources.
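For anyone wanting to do the same sweep over a cgroup tree, something along these lines produces output in that file:line format (the exact invocation here is my reconstruction, not necessarily what was run):

# dump every PSI file under the user session's cgroup tree
grep -r --include='*.pressure' . /sys/fs/cgroup/user.slice/user-1000.slice/user/

Anything with avg values in the millions, or a total longer than the machine's uptime, is garbage.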
Date: Tue, 23 Aug 2022 10:12:08 -1000
From: Tejun Heo <tj>
To: Chris Murphy <lists>
Cc: cgroups.org, Johannes Weiner <hannes>
Subject: Re: oomd with 6.0-rc1 has ridiculously high memory pressure stats wit
Message-ID: <YwU0mLBMuxpZ7Zwq.org>

Can you see whether the following helps? Thanks.

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ec66b40bdd40..00d62681ea6a 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -957,7 +957,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi = kmalloc(sizeof(struct psi_group), GFP_KERNEL);
+	cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL);
 	if (!cgroup->psi)
 		return -ENOMEM;
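For the record, the logic of the patch: kmalloc() hands back uninitialized memory, so every newly created cgroup's psi_group started life with garbage in at least some of its stall counters, and that garbage leaked straight into the avg/total numbers; kzalloc() zero-fills the allocation instead. It also explains why /proc/pressure looks fine while cgroup pressure files go haywire: the system-wide PSI group is static (and therefore zeroed), only the per-cgroup ones are allocated here. That suggests a quick ad-hoc way to spot an affected kernel (my own sketch, assuming cgroup2 mounted at /sys/fs/cgroup and root access): create a brand-new cgroup and read its pressure file before anything has run in it.

mkdir /sys/fs/cgroup/psi-test
cat /sys/fs/cgroup/psi-test/memory.pressure   # totals should start at 0
rmdir /sys/fs/cgroup/psi-test

On a fixed kernel every fresh cgroup starts with zeroed totals; on a broken one I'd expect at least some fresh cgroups to come up with nonsense values.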
A scratch build with that patch is running. openQA tests with just the straight revert are also running (I did that scratch build earlier).
Gah, my test builds were bad. Gotta re-run.
Okay, I have a scratch build with the patch actually applied now:
https://koji.fedoraproject.org/koji/taskinfo?taskID=91192123
and a live image with that kernel in it (and systemd-oomd-defaults included):
https://openqa.stg.fedoraproject.org/tests/1992675/asset/iso/01992675-Fedora-Workstation-Live-x86_64-91192123.iso

Early results in openQA look good. I've run the test six times so far: the first four all passed, and the latest two look good (i.e. nothing got OOM killed before openQA could run the installer). Chris, you wanna grab the ISO and check it locally?
OK, I've run this through ten rounds of tests and it worked every time. Other updates that were tested overnight - without this kernel fix - failed quite a lot. So I'm gonna say that patch definitely looks like a winner. Thanks to Tejun.
Fix is in Rawhide now and still looks good, closing.
The upstream fix is commit 2b97cf76289a (Tejun's kzalloc change above) and will be in rc3.
https://www.spinics.net/lists/cgroups/msg34350.html
Yes, we never actually backported the patch to an official build - it's just that Linus had already merged it from Tejun by the time of the kernel builds from a couple of days ago. kernel-6.0.0-0.rc2.20220824gitc40e8341e3b3.22.fc38 and later include the upstream fix already.