1971327 – Silverblue34: systemd-oomd Killed org.gnome.Shell@wayland.service due to memory pressure

Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	34
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	fedora-kernel-btrfs
QA Contact:	Fedora Extras Quality Assurance

Reported:	2021-06-13 16:40 UTC by Sampson Fung
Modified:	2022-06-07 21:08 UTC
CC List:	35 users
Last Closed:	2022-06-07 21:08:10 UTC
Type:	Bug

Attachments
journal.log (16.13 MB, application/octet-stream) 2021-06-13 17:01 UTC, Sampson Fung	no flags
journal-nobees.log (4.12 MB, text/plain) 2021-06-13 19:33 UTC, Chris Murphy	no flags

Description Sampson Fung 2021-06-13 16:40:59 UTC

Description of problem:
gnome-shell and btrfs are killed. Only a text console login prompt remains, but I cannot log in.

Version-Release number of selected component (if applicable):
kernel-5.12.9-300.fc34.x86_64
btrfs-progs-5.12.1-1.fc34.x86_64
systemd-248.3-1.fc34.x86_64



How reproducible:
Not sure


Steps to Reproduce:
1.  digikam w/ Mariadb is indexing a collection on /dev/sdc4
2.  btrfs send <from sdc4> | btrfs receive <to sdb1>
3.  bees is active on <sdb1>

Actual results:
systemd-oomd is killing processes.


Expected results:
Processes should be able to run normally to finish.

Additional info:

ostree://fedora:fedora/34/x86_64/silverblue
                   Version: 34.20210609.0 (2021-06-09T02:45:46Z)
                BaseCommit: fa9eb27702a64b3df78631bb843fdf47e78213437b89571088f2a0f020849633
              GPGSignature: Valid signature by 8C5BA6990BDB26E19F2A1A801161AE6945719A39
           LayeredPackages: 4Pane bcache-tools bees borgbackup digikam doublecmd-gtk fotoxx
                            geeqie gnome-tweaks gthumb guestfs-tools hdparm
                            ibus-cangjie-engine-cangjie iotop libguestfs-tools
                            libvirt-client libvirt-nss mariadb mariadb-server
                            perl-Image-ExifTool podman-compose qemu-img qemu-kvm rdfind
                            sg3_utils smartmontools sway terminator vim-common virt-install
                            virt-manager virt-top virt-viewer vlc
             LocalPackages: teamviewer-15.18.4-0.x86_64
                            rpmfusion-nonfree-release-34-1.noarch
                            rpmfusion-free-release-34-1.noarch

Comment 1 Sampson Fung 2021-06-13 17:01:18 UTC

Created attachment 1790751
journal.log

This is the full log of the boot session when the error occurred.

Comment 2 Chris Murphy 2021-06-13 19:32:41 UTC

A 393MB journal.log, haha. This is what I get for asking for the whole thing...

>$ grep oomd journal.log
>[   11.019356] audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-oomd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
>[64871.555522] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-gnome-org.kde.digikam-13935.scope due to swap used (7062503424) / total (7810838528) being more than 90.00%
[64873.286418] systemd[1950]: app-gnome-org.kde.digikam-13935.scope: systemd-oomd killed 68 process(es) in this unit.

OK so that doesn't seem bad... although I can't tell why this particular cgroup was chosen for being killed.

>[64900.291024] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/session.slice/org.gnome.Shell due to memory pressure for /user.slice/user-1000.slice/user being 86.56% > 50.00% for > 20s with reclaim activity
[64900.360307] systemd[1950]: org.gnome.Shell: systemd-oomd killed 23 process(es) in this unit.

That seems like a problem to me. Killing the shell is going to clobber all kinds of things, including anything running in Terminal. That's why `btrfs send` was killed, for example:

>[65032.000572] systemd[1]: user: Killing process 58159 (btrfs) with signal SIGKILL.


[64915.236982] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 90.97% > 50.00% for > 20s with reclaim activity
[64930.743521] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.06% > 50.00% for > 20s with reclaim activity
[64946.498345] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.97% > 50.00% for > 20s with reclaim activity
[64962.496917] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 99.27% > 50.00% for > 20s with reclaim activity
[64978.243151] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 99.30% > 50.00% for > 20s with reclaim activity
[64994.242281] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.95% > 50.00% for > 20s with reclaim activity
[65010.246012] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.89% > 50.00% for > 20s with reclaim activity
[65026.246468] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.89% > 50.00% for > 20s with reclaim activity
[65041.487990] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 99.06% > 50.00% for > 20s with reclaim activity
[96603.487351] systemd[1]: systemd-oomd.service: State 'stop-sigterm' timed out. Killing.
[96603.487513] systemd[1]: systemd-oomd.service: Killing process 811 (systemd-oomd) with signal SIGKILL.
[96693.735910] systemd[1]: systemd-oomd.service: Processes still around after SIGKILL. Ignoring.

That's interesting! Am I reading this correctly that systemd-oomd itself is being killed? 

BTW, `grep -v bees journal.log > journal-nobees.log` cuts the log size down to 4 MiB. Attaching that, and adding Anita.
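
For anyone who wants to pull the oomd kill events out of a journal dump without reading them by eye, here is a minimal Python sketch. It assumes the line layout seen in the excerpts above (bracketed timestamp, `systemd-oomd[PID]:`, then `Killed <cgroup> due to <reason>`); the names `OomdKill` and `parse_oomd_kills` are made up for illustration and are not part of systemd.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches lines such as:
#   [64900.291024] systemd-oomd[811]: Killed <cgroup> due to <reason>
# An optional leading ">" (quote marker) is tolerated.
OOMD_KILL = re.compile(
    r"^>?\[\s*(?P<ts>\d+\.\d+)\] systemd-oomd\[\d+\]: "
    r"Killed (?P<cgroup>\S+) due to (?P<reason>.+)$"
)


class OomdKill(NamedTuple):
    ts: float    # seconds since boot, as printed in the journal
    cgroup: str  # cgroup path that oomd targeted
    reason: str  # free-form text after "due to"


def parse_oomd_kills(lines: Iterable[str]) -> List[OomdKill]:
    """Return one OomdKill per systemd-oomd 'Killed ...' line."""
    kills = []
    for line in lines:
        m = OOMD_KILL.match(line.strip())
        if m:
            kills.append(OomdKill(float(m["ts"]), m["cgroup"], m["reason"]))
    return kills


# Three lines copied from the log excerpt above; the middle one is written
# by the user systemd instance, not by oomd, so it is skipped.
SAMPLE = [
    "[64871.555522] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-gnome-org.kde.digikam-13935.scope due to swap used (7062503424) / total (7810838528) being more than 90.00%",
    "[64873.286418] systemd[1950]: app-gnome-org.kde.digikam-13935.scope: systemd-oomd killed 68 process(es) in this unit.",
    "[64900.291024] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/session.slice/org.gnome.Shell due to memory pressure for /user.slice/user-1000.slice/user being 86.56% > 50.00% for > 20s with reclaim activity",
]

if __name__ == "__main__":
    for kill in parse_oomd_kills(SAMPLE):
        kind = "swap" if kill.reason.startswith("swap used") else "pressure"
        print(f"{kill.ts:.6f} {kind:8} {kill.cgroup}")
    # Swap usage at the first kill, computed from the two byte counts logged.
    print(f"swap used: {100 * 7062503424 / 7810838528:.2f}%")
```

The split into "swap" and "pressure" only looks at how the reason text begins, so it would need adjusting if oomd words its messages differently in other versions.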

Comment 3 Chris Murphy 2021-06-13 19:33:23 UTC

Created attachment 1790795
journal-nobees.log

filter out beesd messages

Comment 4 Benjamin Berg 2021-06-14 08:49:09 UTC

A few notes:
 1. Digikam is getting killed due to swap use. I am not entirely sure what this means; i.e. is this the offender, or is it just not doing much and can therefore be swapped out?
 2. gnome-shell gets caught up. This requires reclaim activity, so we must have some memory reclaim, but it is not necessarily much.
 3. Chrome is not dying; I don't have an explanation for that.

About Chrome:

[64915.236982] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 90.97% > 50.00% for > 20s with reclaim activity
[64930.743521] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.06% > 50.00% for > 20s with reclaim activity
[64946.498345] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.97% > 50.00% for > 20s with reclaim activity
[64962.496917] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 99.27% > 50.00% for > 20s with reclaim activity
[64978.243151] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 99.30% > 50.00% for > 20s with reclaim activity
[64994.242281] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.95% > 50.00% for > 20s with reclaim activity
[65001.984379] systemd[1950]: app-flatpak-com.google.Chrome-10339.scope: Stopping timed out. Killing.
[65001.986588] systemd[1950]: app-flatpak-com.google.Chrome-10339.scope: Killing process 10350 (bwrap) with signal SIGKILL.
[65001.987089] systemd[1950]: app-flatpak-com.google.Chrome-10339.scope: Killing process 10349 (gmain) with signal SIGKILL.
[65001.987275] systemd[1950]: app-flatpak-com.google.Chrome-10387.scope: Stopping timed out. Killing.
[65001.987490] systemd[1950]: app-flatpak-com.google.Chrome-10387.scope: Killing process 10399 (bwrap) with signal SIGKILL.
[65001.987644] systemd[1950]: app-flatpak-com.google.Chrome-10387.scope: Killing process 10396 (gmain) with signal SIGKILL.
[65010.246012] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 98.89% > 50.00% for > 20s with reclaim activity

And then `user` is forcefully shut down while Chrome is still running!

[65031.999339] systemd[1]: user: State 'stop-sigterm' timed out. Killing.
[65032.000301] systemd[1]: user: Killing process 1950 (systemd) with signal SIGKILL.
[65032.000427] systemd[1]: user: Killing process 10350 (bwrap) with signal SIGKILL.
[65032.000496] systemd[1]: user: Killing process 10349 (gmain) with signal SIGKILL.
[65032.000572] systemd[1]: user: Killing process 58159 (btrfs) with signal SIGKILL.
[65032.000638] systemd[1]: user: Killing process 9226 (sudo) with signal SIGKILL.
[65032.000721] systemd[1]: user: Killing process 9228 (su) with signal SIGKILL.
[65032.000800] systemd[1]: user: Killing process 9231 (bash) with signal SIGKILL.
[65032.000874] systemd[1]: user: Killing process 60259 (btrfs) with signal SIGKILL.
[65032.000931] systemd[1]: user: Killing process 10399 (bwrap) with signal SIGKILL.
[65032.001002] systemd[1]: user: Killing process 10396 (gmain) with signal SIGKILL.
[65032.026120] systemd[1]: user: Main process exited, code=killed, status=9/KILL
[65032.027384] systemd[1]: user: Killing process 10350 (bwrap) with signal SIGKILL.
[65032.027859] systemd[1]: user: Killing process 10349 (gmain) with signal SIGKILL.
[65032.028019] systemd[1]: user: Killing process 58159 (btrfs) with signal SIGKILL.
[65032.028152] systemd[1]: user: Killing process 60259 (btrfs) with signal SIGKILL.
[65032.028343] systemd[1]: user: Killing process 10399 (bwrap) with signal SIGKILL.
[65032.028500] systemd[1]: user: Killing process 10396 (gmain) with signal SIGKILL.
[65041.487990] systemd-oomd[811]: Killed /user.slice/user-1000.slice/user/app.slice/app-flatpak-com.google.Chrome-10339.scope due to memory pressure for /user.slice/user-1000.slice/user being 99.06% > 50.00% for > 20s with reclaim activity


So, to be honest, I think something is really off on this system. I am thinking that most likely, disk access was completely stalled at that point or something weird like that.
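
To put a number on "Chrome is not dying", here is a small Python sketch that takes the timestamps of the oomd kill lines for the Chrome scope quoted above and computes the gaps between attempts. The timestamps are copied from the log; the helper name `gaps` is just for illustration.

```python
from statistics import mean

# Journal timestamps (seconds since boot) of the systemd-oomd "Killed" lines
# for app-flatpak-com.google.Chrome-10339.scope, as quoted in this comment.
CHROME_KILL_TS = [
    64915.236982, 64930.743521, 64946.498345, 64962.496917,
    64978.243151, 64994.242281, 65010.246012, 65041.487990,
]


def gaps(timestamps):
    """Seconds between consecutive kill attempts, in time order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    g = gaps(CHROME_KILL_TS)
    print(f"{len(CHROME_KILL_TS)} kill attempts on the same scope")
    print(f"gap min={min(g):.1f}s max={max(g):.1f}s mean={mean(g):.1f}s")
```

The same scope being targeted again and again, with no sign of it going away, fits the idea that the processes were stuck (for example in uninterruptible I/O) rather than that oomd picked a new victim each time. That is an interpretation, not something the log states.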

Comment 5 Benjamin Berg 2021-06-14 11:29:49 UTC

I still had the log open … and well, there are multiple backtraces from btrfs

Page fault oops with the following info:

[60015.902414] kernel: Call Trace:
[60015.902419] kernel:  kfence_unprotect+0x13/0x30
[60015.902423] kernel:  page_fault_oops+0x89/0x270
[60015.902427] kernel:  ? search_module_extables+0xf/0x40
[60015.902431] kernel:  ? search_bpf_extables+0x57/0x70
[60015.902435] kernel:  kernelmode_fixup_or_oops+0xd6/0xf0
[60015.902437] kernel:  __bad_area_nosemaphore+0x142/0x180
[60015.902440] kernel:  exc_page_fault+0x67/0x150
[60015.902445] kernel:  asm_exc_page_fault+0x1e/0x30
[60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
[60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 04 00 00 65 48 8b 04 25 c0 7b 01 00 4c 8b a0 70 0c 00 00 b8 01 00 00 00 49 8d 7c 24 38 <f0> 41 0f c1 44 24 38 85 c0 0f 84 41 04 00 00 8d 50 01 09 c2 0f 88
[60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
[60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
[60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
[60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
[60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
[60015.902468] kernel:  btrfs_commit_inode_delayed_inode+0x5d/0x120
[60015.902473] kernel:  btrfs_evict_inode+0x2c5/0x3f0
[60015.902476] kernel:  evict+0xd1/0x180
[60015.902480] kernel:  inode_lru_isolate+0xe7/0x180
[60015.902483] kernel:  __list_lru_walk_one+0x77/0x150
[60015.902487] kernel:  ? iput+0x1a0/0x1a0
[60015.902489] kernel:  ? iput+0x1a0/0x1a0
[60015.902491] kernel:  list_lru_walk_one+0x47/0x70
[60015.902495] kernel:  prune_icache_sb+0x39/0x50
[60015.902497] kernel:  super_cache_scan+0x161/0x1f0
[60015.902501] kernel:  do_shrink_slab+0x142/0x240
[60015.902505] kernel:  shrink_slab+0x164/0x280
[60015.902509] kernel:  shrink_node+0x2c8/0x6e0
[60015.902512] kernel:  do_try_to_free_pages+0xcb/0x4b0
[60015.902514] kernel:  try_to_free_pages+0xda/0x190
[60015.902516] kernel:  __alloc_pages_slowpath.constprop.0+0x373/0xcc0
[60015.902521] kernel:  ? __memcg_kmem_charge_page+0xc2/0x1e0
[60015.902525] kernel:  __alloc_pages_nodemask+0x30a/0x340
[60015.902528] kernel:  pipe_write+0x30b/0x5c0
[60015.902531] kernel:  ? set_next_entity+0xad/0x1e0
[60015.902534] kernel:  ? switch_mm_irqs_off+0x58/0x440
[60015.902538] kernel:  __kernel_write+0x13a/0x2b0
[60015.902541] kernel:  kernel_write+0x73/0x150
[60015.902543] kernel:  send_cmd+0x7b/0xd0
[60015.902545] kernel:  send_extent_data+0x5a3/0x6b0
[60015.902549] kernel:  process_extent+0x19b/0xed0
[60015.902551] kernel:  btrfs_ioctl_send+0x1434/0x17e0
[60015.902554] kernel:  ? _btrfs_ioctl_send+0xe1/0x100
[60015.902557] kernel:  _btrfs_ioctl_send+0xbf/0x100
[60015.902559] kernel:  ? enqueue_entity+0x18c/0x7b0
[60015.902562] kernel:  btrfs_ioctl+0x185f/0x2f80
[60015.902564] kernel:  ? psi_task_change+0x84/0xc0
[60015.902569] kernel:  ? _flat_send_IPI_mask+0x21/0x40
[60015.902572] kernel:  ? check_preempt_curr+0x2f/0x70
[60015.902576] kernel:  ? selinux_file_ioctl+0x137/0x1e0
[60015.902579] kernel:  ? expand_files+0x1cb/0x1d0
[60015.902582] kernel:  ? __x64_sys_ioctl+0x82/0xb0
[60015.902585] kernel:  __x64_sys_ioctl+0x82/0xb0
[60015.902588] kernel:  do_syscall_64+0x33/0x40
[60015.902591] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae

So, I am guessing we really can say that userspace got caught up in a btrfs or other system failure.

Comment 6 Sampson Fung 2021-06-14 12:15:27 UTC

(In reply to Benjamin Berg from comment #4)
> So, to be honest, I think something is really off on this system. I am
> thinking that most likely, disk access was completely stalled at that point
> or something weird like that.

Yes, I think disk access was stalled when I tried to log in at the text-mode login prompt.

In the end I pressed Ctrl+Alt+Del multiple times within 2 s to force a shutdown, and got lots of timeout errors, volume unmount errors, etc. Eventually the video signal was lost, so I pressed the power button.

Comment 7 Chris Murphy 2021-06-14 20:27:19 UTC

https://lore.kernel.org/linux-btrfs/CAL3q7H7iOfFFq_vh80Zwb4jJY8NLq-DFBA4yvj7=EbG0AadOzg@mail.gmail.com/T/#ma8ad6b59a684bce6a3bafa3551b4ffe0385ada3f

Sampson, any chance you could reproduce using a debug kernel? These have CONFIG_BTRFS_ASSERT=y set. You can use either 5.13-rc6 or 5.12.10, both are in koji, e.g.:

https://koji.fedoraproject.org/koji/buildinfo?buildID=1767750
>kernel-debug-5.12.10-300.fc34.x86_64.rpm
>kernel-debug-core-5.12.10-300.fc34.x86_64.rpm
>kernel-debug-devel-5.12.10-300.fc34.x86_64.rpm
>kernel-debug-modules-5.12.10-300.fc34.x86_64.rpm

Just be sure to download the correct arch! Assuming they are in ~/Downloads, you can just 'cd Downloads' and 'sudo dnf install *rpm' and it'll install the debug kernel. I do not think debug kernels get enabled by default in the bootloader menu, so you'll have to choose it manually or, alternatively, run 'sudo grubby --set-default=/boot/vmlinuz...' with the path to the debug kernel. Then try to reproduce the problem.

If you're really adventurous you can try to give Felipe's patch a test run if you're up to building a kernel with it!

I still wonder if there's a systemd-oomd problem. The logs show ample writes happening to /var/log/journal without complaint after the kernel splat and even after oomd kills wayland.service. And I don't see how any circumstance justifies killing off the whole shell before chrome and bees. But there's also no summary of the logic used to determine the kills either.

Sampson, do you recall when the graphical environment became unresponsive?

Comment 8 Chris Murphy 2021-06-14 21:56:06 UTC

Seeing the memory pressure continue to climb despite all the kills, which seem(?) not to succeed or at least not to reduce memory pressure, I wonder if we could use a sysrq+m or a sysrq+t the next time this happens. Could and should oomd just do this for us? Or maybe /proc/pressure is better, along with some stats like 'top 10 swap pressure offenders' and 'top 10 reclaim offenders' and so on... I guess this is planned but just not yet rolled into the systemd version we've got in Fedora 34?

user memory pressure is:

[64900.291024] 86.56%
[64915.236982] 90.97%
[64930.743521] 98.06%
[64946.498345] 98.97%
[64962.496917] 99.27%
[64978.243151] 99.30%
[64994.242281] 98.95%
[65010.246012] 98.89%
[65026.246468] 98.89%
[65041.487990] 99.06%

Something changed quickly around this time that wasn't happening in the previous 80 minutes. I guess a lot of anonymous data has accumulated? Filled up memory and swap? And now we have a ton of reclaim happening? Reclaim itself causes IO pressure, right?
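
As a rough idea of what capturing /proc/pressure could look like, here is a Python sketch that parses the kernel's PSI text format (a `some` line and a `full` line, each with `avg10`, `avg60`, `avg300` and `total` fields). The sample values are invented for illustration and the function names are hypothetical. A cgroup v2 `memory.pressure` file uses the same layout, so the same parser should apply there.

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    stats = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        stats[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return stats


def read_pressure(path="/proc/pressure/memory"):
    """Read and parse a PSI file; needs a kernel with PSI enabled."""
    return parse_psi(Path(path).read_text())


# Hypothetical sample, NOT taken from the attached journal.
SAMPLE = (
    "some avg10=86.56 avg60=40.12 avg300=10.03 total=123456789\n"
    "full avg10=80.00 avg60=35.00 avg300=9.00 total=98765432\n"
)

if __name__ == "__main__":
    sample = parse_psi(SAMPLE)
    print("sample some/avg10:", sample["some"]["avg10"])
    if Path("/proc/pressure/memory").exists():
        live = read_pressure()
        print("live full/avg10:", live["full"]["avg10"])
```

Logging these values periodically alongside the oomd messages would show whether pressure was already building before the first kill, which the current log cannot tell us.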

Comment 9 Sampson Fung 2021-06-15 00:48:16 UTC

(In reply to Chris Murphy from comment #7)
> Sampson, any chance you could reproduce using a debug kernel?
> [...]
> Sampson, do you recall when the graphical environment became unresponsive?

I am using the radeon driver with GNOME 40. When I got back to the computer, the whole DE had already been killed.

Over the last 2 days, btrfs send finished, digikam indexing finished, and bees is still deduplicating (zram disabled, 16GB swap partition used). No errors so far.

Yes, I can try to reproduce this by:
- start with a fresh MariaDB and re-index by digikam
- resend with btrfs
- let bees run concurrently
- switch back to zram with 8GB
- with a debug kernel

Comment 10 Ben Cotton 2022-05-12 15:11:19 UTC

This message is a reminder that Fedora Linux 34 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '34'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 34 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 11 Ben Cotton 2022-06-07 21:08:10 UTC

Fedora Linux 34 entered end-of-life (EOL) status on 2022-06-07.

Fedora Linux 34 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release.

Thank you for reporting this bug and we are sorry it could not be fixed.
