Bug 2193110 - Firefox freezes+crashes on some pages/types of clicks in KDE; correlations to issues with thunderbird, pts, alsa, acpi
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 38
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Adam Jackson
QA Contact: Fedora Extras Quality Assurance
URL: https://bodhi.fedoraproject.org/updat...
Whiteboard:
Depends On:
Blocks:
Reported: 2023-05-04 11:15 UTC by Christopher Klooz
Modified: 2023-06-06 12:41 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-06 12:41:39 UTC
Type: ---
Embargoed:



Description Christopher Klooz 2023-05-04 11:15:17 UTC
I identified the issue in Firefox, which I am filing against because it is the one thing that is persistently affected, but I do not know if it is the origin. See my Bodhi posts: https://bodhi.fedoraproject.org/updates/FEDORA-2023-8127efc0a4

Firefox freezes and then crashes when specific pages are opened (e.g., when I click to start a movie on Netflix, it caches the first seconds of the movie, and once caching reaches 100% and playback should start, Firefox freezes and then crashes a few seconds later). The issue also occurs when clicking the >> button of the toolbar (the button that shows the remaining bookmarks). Sometimes it also happens when clicking the right mouse button within Firefox.

Related:
When my Fedora KDE goes to sleep, it seems that after longer sleeps it closes all applications. If this happens, three additional changes appear:
1) The issue in Firefox occurs on many more pages that otherwise work
2) Thunderbird is affected as well *1
3) Usually, the first terminal on KDE starts with pts/1, but after this "kill all applications" sleep occurrence, the first terminal starts with pts/0.
This persists after a reboot, and the reboot has to kill the alsa-state.service, which in this condition no longer seems to respond.

After reboot, the issue remains again with Firefox as initially described.

I have observed the pts issue before, but until now it did not correlate with any other behavior or issue.

Tested on: 6.2.14-300.fc38.x86_64, KDE spin. Up to date as of now. Only default repos (no rpmfusion), no testing repos enabled. Confined user with sysadm_u *1.

*1 There are SELinux denials logged for Thunderbird (denial to access urandom), which implies the issues of Thunderbird could be linked to the confined user account. However, the fact that the issue occurs only in these circumstances indicates a bug imho (maybe linked to the pts?). But there are no denials logged for the other issues. I never had such issues before with confined user sysadm_u (which I have been using for around 3 months). I cannot test without sysadm_u atm.

`journalctl -r` user logs from two firefox crashes to have an initial overview:
```
May 04 11:36:25 fedo plasmashell[15878]: Exiting due to channel error.
May 04 11:36:25 fedo plasmashell[15715]: Exiting due to channel error.
May 04 11:36:25 fedo plasmashell[15859]: Exiting due to channel error.
May 04 11:36:25 fedo plasmashell[16570]: Exiting due to channel error.
May 04 11:36:25 fedo plasmashell[15780]: Exiting due to channel error.
May 04 11:36:25 fedo plasmashell[15861]: Exiting due to channel error.
May 04 11:36:25 fedo plasmashell[16696]: Exiting due to channel error.
May 04 11:36:25 fedo plasmashell[16156]: Exiting due to channel error.
May 04 11:36:25 fedo plasmashell[16046]: Exiting due to channel error.
May 04 11:36:25 fedo firefox[15588]: Lost connection to Wayland compositor.
May 04 11:36:25 fedo kwin_wayland_wrapper[8284]: error in client communication (pid 15588)
```
-----------
```
May 04 11:33:40 fedo kwrite[16145]: qt.qpa.wayland: Wayland does not support QWindow::requestActivate()
May 04 11:33:39 fedo plasmashell[8307]: QString::arg: 2 argument(s) missing in org.kde.kwrite
May 04 11:33:39 fedo plasmashell[8307]: kf.service.services: KApplicationTrader: mimeType "x-scheme-handler/file" not found
May 04 11:33:32 fedo plasmashell[15823]: console.warn: "Skipping Glean as no metrics id is passed"
May 04 11:33:32 fedo plasmashell[15823]: console.error: "couldn't find cache folder /home/username/.mozilla/firefox/lcm7gdda.default-release/storage/to-be-removed"
May 04 11:33:32 fedo plasmashell[15823]: console.error: "Cache folder attempt no 1"
May 04 11:33:31 fedo kwin_wayland[8284]: kwin_core: Cannot grant a token to KWaylandServer::ClientConnection(0x5652a114d550)
May 04 11:33:31 fedo plasmashell[8307]: org.kde.plasma.libtaskmanager: Got invalid activation app_id: ""
May 04 11:33:31 fedo plasmashell[15823]: console.error: "/home/username/.mozilla/firefox/lcm7gdda.default-release/storage" "to-be-removed" 0 "" ""
May 04 11:33:30 fedo plasmashell[15588]: [ERROR glean_core] Error setting metrics feature config: Json(Error("EOF while parsing a value", line: 1, column: 0))
May 04 11:33:30 fedo firefox[15823]: Locale not supported by C library.
                                                         Using the fallback 'C' locale.
May 04 11:33:30 fedo plasmashell[15823]: *** You are running in headless mode.
May 04 11:33:30 fedo plasmashell[15823]: *** You are running in background task mode. ***
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1652:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1649:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1652:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1649:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1652:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1649:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1652:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1649:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1652:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1649:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1652:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:29 fedo firefox[15588]: Theme parsing error: gtk.css:1649:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:28 fedo firefox[15588]: Theme parsing error: gtk.css:1652:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:28 fedo firefox[15588]: Theme parsing error: gtk.css:1649:16: '-gtk-icon-size' is not a valid property name
May 04 11:33:28 fedo firefox[15588]: Locale not supported by C library.
                                                         Using the fallback 'C' locale.
May 04 11:33:28 fedo plasmashell[8307]: QString::arg: 2 argument(s) missing in firefox
May 04 11:33:28 fedo plasmashell[8307]: kf.service.services: KApplicationTrader: mimeType "x-scheme-handler/file" not found
May 04 11:32:58 fedo plasmashell[14832]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[14200]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[14962]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[14263]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[14378]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[14554]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[15341]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[14615]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[15048]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[15460]: Exiting due to channel error.
May 04 11:32:58 fedo plasmashell[14735]: Exiting due to channel error.
May 04 11:32:58 fedo firefox[14074]: Lost connection to Wayland compositor.
May 04 11:32:58 fedo kwin_wayland_wrapper[8284]: error in client communication (pid 14074)
May 04 11:32:54 fedo kwin_wayland[8284]: This plugin does not support raise()
```
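
The crash signature in both excerpts above is a `Lost connection to Wayland compositor` line from Firefox paired with a kwin_wayland_wrapper `error in client communication (pid N)` line for the same PID. A minimal Python sketch for pulling such pairs out of saved `journalctl` text could look like this. The function name `find_wayland_disconnects` and the regular expressions are illustrative assumptions derived only from the line format shown above; they are not part of any existing tool.

```python
import re

# Assumed line shape, taken from the excerpts above:
#   "May 04 11:36:25 fedo firefox[15588]: Lost connection to Wayland compositor."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<unit>[^\[\s:]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)
CLIENT_ERR_RE = re.compile(r"error in client communication \(pid (\d+)\)")


def find_wayland_disconnects(journal_text):
    """Return sorted (timestamp, pid) pairs for clients that logged a lost
    compositor connection AND were named by kwin_wayland_wrapper at the
    same timestamp."""
    lost = set()      # (ts, pid) from "Lost connection" lines
    reported = set()  # (ts, pid) from kwin_wayland_wrapper client errors
    for line in journal_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # continuation lines such as the locale fallback text
        if "Lost connection to Wayland compositor" in m["msg"]:
            lost.add((m["ts"], int(m["pid"])))
        err = CLIENT_ERR_RE.search(m["msg"])
        if err and m["unit"] == "kwin_wayland_wrapper":
            reported.add((m["ts"], int(err.group(1))))
    return sorted(lost & reported)
```

The sketch works the same on normal and `-r` (reversed) output, because it only matches on timestamp and PID and does not depend on line order.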

Just to have an overview. Let me know if you need something more.

Reproducible: Always

Steps to Reproduce:
See details
Actual Results:  
See details

Expected Results:  
See details

Comment 1 Christopher Klooz 2023-05-04 11:30:08 UTC
Addition: I think this is not related, but it might be worth noting FYI: I had an SELinux denial with a Firefox build in January. After updating, I could still copy-paste as usual by marking text and then clicking the MIDDLE mouse button to get it copied into Firefox. But each time I did this, the next RIGHT mouse click caused Firefox to crash, because the right click led Firefox to try to re-access the pts from which I had copied with the preceding MIDDLE mouse click (I had no time to file the bug before it was solved; it disappeared with the next build/update). The RIGHT mouse click led to an SELinux access denial that caused Firefox to crash.

I wondered back then whether it would make sense to create a SIG that uses sysadm_u for testing, to identify issues that otherwise remain undiscovered and to ensure that code we introduce implements best practices (again, I had no time so far :).

I mention it here because the right mouse button can, in some circumstances, currently cause the same result in Firefox: a crash (see my description above). However, there are no SELinux denials logged at the moment for Firefox, and no denials at all in the minutes before the freeze+crash of Firefox (I checked several of its crashes).

The above issues apply btw to both `Firefox` and `Firefox-wayland`.

Comment 2 Christopher Klooz 2023-05-04 12:55:46 UTC
I downgraded to firefox-112.0.1-1.fc38, which worked properly earlier. Now it has the same issue and does not differ from firefox-112.0.2-1.fc38. However, I just saw in dnf's log that the Firefox I was already working with yesterday evening without issues (including with Netflix movies) was also 112.0.2.

I wanted to check if it works when booting with the recent kernel *13, which it did. However, after several more reboots, I found that it seems arbitrary (also on kernel *14): sometimes the issue does not occur at all and all is fine, and sometimes the buggy behavior noted above appears. At the moment, it seems to have disappeared.

Whatever causes the issue, it seems to be another package that was changed. What is involved in all of the issues is Wayland & KDE. Feel free to reassign.

According to dnf.log, the last update before the first occurrence included:
```
kernel
kernel-core
kernel-modules
kernel-modules-core
kernel-modules-extra
cups-browsed
elfutils
elfutils-debuginfod-client
elfutils-default-yama-scope
elfutils-libelf
elfutils-libs
firefox
firefox-langpacks
firefox-wayland
ibus
ibus-gtk2
ibus-gtk3
ibus-libs
ibus-setup
kio-gdrive
libcupsfilters
libopenmpt
libppd
librados2
librbd1
mozilla-noscript
power-profiles-daemon
python3-rpm
rpm
rpm-build-libs
rpm-libs
rpm-plugin-selinux
rpm-plugin-systemd-inhibit
rpm-sign-libs
```

But I rebooted after these updates (to check if the new kernel works) and had no issues for the rest of the evening, including with Netflix movies. Not sure why the issue was not triggered yesterday and is no longer triggered in the two most recent boots. I did not change anything since the last occurrence; the only thing I did differently was booting once with kernel *13 before switching back to *14...

As I said before, the pts issue is nothing new, but it was never related to any issues or other changed behavior.

The following updates were installed after the above updates but did not cause any before/after-change:
```
dnf
dnf-automatic
dnf-data
hwdata
python3-dnf
yum
```

----

Addition about my hardware:
AMD Ryzen 7 PRO 6850U, working with the amd-pstate driver in passive mode (my hardware doesn't work properly with the old default driver). No discrete graphics.

Comment 3 Christopher Klooz 2023-05-05 12:34:47 UTC
I can no longer reproduce the issue in its original form atm. I did not make any changes (also no dnf update).

However, the "kill all processes" sleeping behavior remains, and after waking up, the first terminal gets pts/0 instead of pts/1.

Because the buggy behavior of Firefox and Thunderbird is no longer reproducible, while their issues were strongly impacted by the sleeping behavior and its subsequent pts issue (which keeps appearing), I suggest *re-assigning* the ticket to see whether the buggy behavior was maybe just a symptom of some wayland/KDE issue.

I just saw that, in addition to the mentioned issues after waking up, any terminal outside the GUI (e.g., tty6) is very dark (nearly impossible to read) on one screen, while the second screen is no longer aligned with the first (it does not show all lines).

I have attached the user logs, which cover a sleep (about 5 minutes) that included a kill of all processes, as well as root's logs of the same time frame. Additionally, I attached the complete setroubleshoot logs of the whole boot, to check whether wayland/KDE are somehow not properly set up to fit SELinux practices. "sysadm_u" might be helpful to identify that.

Important about the logs: the SELinux denials logged in the root logs occur AFTER the "big sleeping crash". The last SELinux denial before the "big crash" was 1 hour earlier (see the setroubleshoot logs).

I will check how it develops with 6.3 once testing days start, but my guess is that it's more related to wayland/KDE (?), maybe in conjunction with p-state and/or my hardware, which is comparatively new (Lenovo T16 AMD with AMD Ryzen 7 PRO 6850U was released not so long ago).

User logs of the time frame: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/user.journalctl.-r.13-17-52-13-22-44.log
Root logs of the time frame: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root.journalctl
Root logs limited to setroubleshoot of the boot: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root.setroubleshoot.journalctl

(all logs are output with `-r`)

Regards,
Chris

Comment 4 Christopher Klooz 2023-05-06 14:21:20 UTC
The occurrence is back: Firefox froze when I clicked the button on getfedora.org to download the media writer for a test.

It stayed frozen but did not crash this time. System up to date as of now. See logs:
https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/user-journalctl-r.FirefoxPermanentFreezeWithoutCrash.log
https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root-journalctl-r.FirefoxPermanentFreezeWithoutCrash.log

There is an SELinux denial roughly around the time when the freeze happened (I cannot say if it was shortly before or shortly after). I see no immediate relation based on the properties of the denial. However, there are some similarities to the denial noted above (the "middle click then right click" Firefox behavior). If so, even if it does not appear with unconfined user accounts, this would indicate a bug/unintended action in Firefox. There was no sleep during the current boot.

In addition, yesterday my desktop crashed immediately after idling led the system to lock the screen through the sddm login (no sleep): there was only a black background, but the mouse cursor was still there and could be moved across the screen (there was just nothing left to click on ;). However, `systemctl restart sddm` was sufficient to fix this. See:

https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root-journalctl-r.DesktopCrashWithOnlyBlackBackgroundPlusCursorStillWorks.About224655.RestartSddmSufficientToFix.log

Also, my system had a full freeze today, where the whole desktop froze completely (CTRL+ALT+F5 did not respond either), so that I needed to hard reset. See:

https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root-journalctl-r.FullSystemFreeze.About143010-143030.log

The latter is a full log of the whole boot (the boot lasted only a few minutes), and the error happened when I clicked the upload button on GitLab in Firefox. However, given the times of the last entries in this log, it looks like the error itself was not logged. Not sure if the two most recent entries in the log are indicative (all previous entries were 3 minutes earlier).

About once a month, the screen goes black for a short moment (like a restart of the screen) before the desktop comes back and then freezes, but I attributed this to my comparatively new hardware and didn't care so far. An occurrence like that of yesterday or today is unprecedented. The current manifestations began with 6.2.14 (maybe it's two bugs that provoke each other?).

I will test 6.3.X once available.

I have not yet reassigned since I still cannot rule out that Firefox is involved, or that it at least has a related unintended behavior.

I guess what I have uploaded so far should be sufficiently indicative for an initial evaluation, so I will not flood you with further posts until you let me know if you need more, or if the problem persists after the official 6.3.X release.

Comment 5 Martin Stransky 2023-05-09 08:01:48 UTC
Please try to get a backtrace of the crash/freeze:
https://fedoraproject.org/wiki/Debugging_guidelines_for_Mozilla_products#Using_local_debugging
Thanks.

Comment 6 Christopher Klooz 2023-05-09 11:07:09 UTC
Since switching to 6.3.1, I still have issues, and each time they occur Firefox is open (but Firefox is almost always open on my system). However, the current issues always affect the whole of KDE/wayland or the entire system. At the moment I cannot reproduce the "Firefox-only" issue. Thus, I cannot capture gdb output, since it is always the whole of KDE/wayland or the entire system (it seems to depend on if and how amd_pstate is used) that is "gone".

But I prepared debuginfo for firefox/thunderbird and I will add the requested backtrace if the "Firefox-only" issue re-occurs.

At the moment, the coredumpctl list of the user account is limited to the 4th and 5th of May and contains only two entries, both for kwin_wayland.

TIME                          PID  UID  GID SIG     COREFILE EXE                   SIZE
Thu 2023-05-04 11:14:51 CEST 2824 1000 1000 SIGSEGV missing  /usr/bin/kwin_wayland    -
Fri 2023-05-05 13:22:18 CEST 2342 1000 1000 SIGSEGV missing  /usr/bin/kwin_wayland    -

See the related `coredumpctl debug <ID>` outputs:
https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/coredumpctl-output-2342.log
https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/coredumpctl-output-2824.log

See also https://lists.fedoraproject.org/archives/list/test@lists.fedoraproject.org/thread/TZAU47FTRF3XOXCHH4VWZNJ2KBWGOO55/ (including attachments) for the current behavior.

Does it make sense to reassign to kwin_wayland or kernel at this time?

Comment 7 Martin Stransky 2023-05-09 18:57:11 UTC
(In reply to Christopher Klooz from comment #6)

kwin looks like the correct component.

Comment 8 Christopher Klooz 2023-05-11 19:41:19 UTC
I was just able to reproduce the freeze of Firefox within gdb (after several seconds the freeze developed into a crash); the freeze and its symptoms were limited to Firefox and did not affect the rest of the system:

Please see the gdb error output / backtrace you asked for here: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/gdb-debug-output.txt

Additionally, the last parts of gdb's output within the terminal, including its final error (which makes up the final 6 lines) can be seen here: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/gdb-terminal-output.txt

I hope the report is sufficient to determine if/how far Firefox is related to the issue.

----------

It should be noted that earlier today Firefox first froze (like in gdb), but a few seconds later Firefox did not crash (as it did within gdb now); instead the whole system froze completely (I needed to hard reset). I am not sure if it was gdb that kept the error within a restricted environment. Given the full system freeze earlier today, I can only provide the root journalctl of this incident (in the end it was a kernel error): https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root-journalctl.-r.-b.-1.firstFirefoxFreezeThenFullSystemFreeze.kernel.6.3.1.amd_pstate-passive.log

I cannot rule out that the kwin_wayland coredump mentioned above is related, but the times of the coredump reports do NOT correlate with the freezes/crashes (however, kwin_wayland often appears in the error logs of the system freezes, except for the freezes that immediately prevented further logging).

However, the behavior seems to differ when different kernel drivers are used, and if not caused by Firefox (I do not know if there are two different phenomena), the freezes seem to have been affected by KDE's power saving settings. If screen dimming and screen lock are disabled within KDE, locking is still initiated after some time (screen lock seems to be a default configured somewhere else), but then without freezes (except sometimes when virtual machines boot with the default driver on the host, or generally sometimes with Firefox).

Beyond the two reports+logs I mentioned in the test mailing list (see the archive link above; the first with amd_pstate=active, the second with amd_pstate=passive), a comparable issue appeared with the default driver (immediate freeze without logging like with amd_pstate=passive, but the behavior was equal to the immediate full freeze of amd_pstate=active: see https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root-journalctl.-r.-b.-1.full.system.freeze.kernel6.3.1.defaultDriverAcpi.log ), so I switched back to amd_pstate=passive (this is most stable so far). The gdb backtrace output happened on amd_pstate=passive.

Errors point to / are logged for firefox, kwin_wayland, kernel, and plasmashell (although kwin_wayland and plasmashell as the final entries could also be explained by freezes that occur before the incident itself can be logged, which would make the final entries unrelated).

For now I will wait for your evaluation of the Firefox gdb backtrace output before doing anything. If you have the information you need for Firefox, I tend towards reassigning to kernel?

Comment 9 Christopher Klooz 2023-05-11 22:40:09 UTC
Now, Firefox has been unusable for several boots (then within one boot it worked normally again), but I could reproduce relevant scenarios and behaviors:

When I opened the Netflix login screen, I entered the email address and switched to the password field, typing something without pressing Enter. Then I just needed to wait 5-10 seconds until Firefox froze.

If I provoke the freeze with Firefox in gdb debugging mode, the problem is limited to Firefox and I can get gdb output -> https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/gdb-debug-output.2.txt (comparable to the last)

If I open Firefox natively without gdb, it takes another 5-10 seconds after Firefox's freeze until the freeze affects the rest of the system and the whole system freezes with massive kernel errors in the logs. -> https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root-journalctl.-r.-b.-1.GUI.freezy.UnUsable.kernel6.3.1.amd_pstate-passive.2.log (logs with amd_pstate=passive; comparable to the last)

If I have the kernel with amd_pstate=active or the default driver, the system then freezes immediately and completely (hard reset necessary). Difference: with the default driver, the system freezes immediately and is not able to create any log entries in root's journalctl. With amd_pstate=active, on the other hand, log entries with massive kernel errors ("amdgpu" errors) are created before the freeze. (See earlier logs for amd_pstate=active or default driver logs)

If I have the kernel with amd_pstate=passive, it is a combination of the two other behaviors: it creates log entries with the massive kernel errors like amd_pstate=active, but unlike the other two, it does not freeze immediately. Instead, the screen freezes for several seconds, starts flickering a little, changes a little, freezes again, and so on. In this state, I can use CTRL+ALT+F6, and with the next flicker it switches to the terminal, which works properly. However, when rebooting from the terminal, an alsa and a pipewire service do not respond.

The "survival" of amd_pstate=passive explains why the system is able to create further entries in root's journalctl: in amd_pstate=passive, the massive kernel errors are followed by kwin_wayland errors and then by plasmashell and wayland protocol errors. In contrast, in amd_pstate=active, which freezes immediately and completely, everything ends after the kernel errors (while the default driver does not log anything relevant at all, as far as I can tell).

Thus, I assume the kwin_wayland and plasmashell errors are the consequence of the kernel errors? And I can provoke the latter with Firefox, and sometimes with acpi-related activities.
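
The ordering argument above (kernel errors first, then kwin_wayland, then plasmashell) can be checked mechanically against a saved log. The following is only a sketch under stated assumptions: it expects the short `journalctl` line format seen in this report, treats any message containing "error", "fault" or "BUG" as an error (a hypothetical heuristic), and needs the year passed in because the log lines do not carry one. The name `first_error_per_source` is made up for illustration.

```python
import re
from datetime import datetime

# Assumed line shape: "May 21 20:07:02 fedora.domain kernel: <message>"
# or with a PID:      "May 04 11:36:25 fedo plasmashell[15878]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"(?P<src>[^\[\s:]+)(?:\[\d+\])?: (?P<msg>.*)$"
)
# Hypothetical heuristic for "this line is an error".
ERROR_RE = re.compile(r"error|fault|\bBUG\b", re.IGNORECASE)


def first_error_per_source(journal_text, year=2023):
    """Return [(source, first_error_time), ...] sorted by time, oldest first.

    Works on normal and reversed (-r) output because every line is
    compared by its parsed timestamp, not by its position in the file.
    """
    first = {}
    for line in journal_text.splitlines():
        m = LINE_RE.match(line)
        if not m or not ERROR_RE.search(m["msg"]):
            continue
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        src = m["src"]
        if src not in first or when < first[src]:
            first[src] = when
    return sorted(first.items(), key=lambda item: item[1])
```

Because the journal's short format has one-second resolution, sources whose first errors share the same second cannot be ordered this way; in that case the position within the log is the only remaining hint.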

Obviously, root's logs differ when Firefox is freezing within gdb's environment, because the issue then remains limited to Firefox: if relevant, there is an extract of root's journalctl of the time when I provoked a freeze of Firefox within gdb -> https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root-journalctl-extract-when-firefox-freezes-in-gdb.log

The only thing I remember doing differently today before the massive freezing series began was that I once put the system to sleep (and then this freezing series persisted even after several reboots).

It should be noted that today I installed the dnf updates of recent days. Generally, the issue is the same as before, but the frequency changed heavily today (but this was also the case a few days ago for some hours, and then it disappeared again).

I hope that elaboration of the interactions/behaviors helps...

Comment 10 Phil Smith 2023-05-11 23:18:55 UTC
I think this is a serious X problem affecting all(?) AMD Ryzen with Radeon graphics including the recent Ryzen Lenovos, running kernels (approx.) 6.0 and later.
See
   https://gitlab.freedesktop.org/drm/amd/-/issues/2220
Search for
   kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout
in the system logs.
X crashes every day or few days depending on graphics activity
or crashes really soon with
    https://testdrive-archive.azurewebsites.net/graphics/webglstresstest/
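
For anyone who wants to check a saved log for the message quoted above, here is a minimal Python sketch. The pattern is built only from that one quoted line; capturing the ring name instead of fixing it to `gfx_0.0.0` is an assumption that other rings are logged in the same shape, and `find_ring_timeouts` is a hypothetical helper name.

```python
import re

# Built from the message quoted above. The ring name is captured rather
# than hard-coded (assumption: other rings use the same message shape).
TIMEOUT_RE = re.compile(
    r"\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* ring (?P<ring>\S+) timeout"
)


def find_ring_timeouts(journal_text):
    """Return (line_number, ring_name) for every amdgpu ring timeout line."""
    hits = []
    for number, line in enumerate(journal_text.splitlines(), start=1):
        m = TIMEOUT_RE.search(line)
        if m:
            hits.append((number, m["ring"]))
    return hits
```

The input can be any journal text saved to a file, for example the kernel messages of the previous boot. An empty result only means this particular message was not logged; as described in the comments above, some freezes happen before anything reaches the journal.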

Comment 11 Phil Smith 2023-05-11 23:32:44 UTC
See also
https://bugzilla.redhat.com/show_bug.cgi?id=2193325

Comment 12 Martin Stransky 2023-05-12 06:48:58 UTC
Should be mesa component then.

Comment 13 Christopher Klooz 2023-05-12 17:18:31 UTC
Since yesterday evening, I have been working with 6.3.2. Until now it was the same.

But now I had a different error (and even more different logs): an immediate full freeze even with amd_pstate=passive. Additionally, my machine was "beeping" at the time of freezing. The logs look more critical to me than the last logs, since massive errors seem to have been logged on multiple cores, and the errors seem to affect file systems. See https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/root-journalctl-beep-immediateFullFreeze-6.3.2-amd_pstate-passive.log

I will now switch back to 6.3.1, which "feels" to be the most stable kernel at the moment.

Comment 14 Christopher Klooz 2023-05-22 15:51:46 UTC
In short, the issue is on all kernels including 6.3.3, but 6.3.3 adds new information (maybe some new self-test?):

(6.3.3)
------------
May 21 00:57:33 fedora.domain kernel: BUG: unable to handle page fault for address: 00001b3f00000008
------------
 -> https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/2nd/root-journalctl.-r.-b.-1.SystemFreeze.kernel6.3.3.amd_pstate-passive.kernelBugEntry.log

(also 6.3.3)
------------
May 20 23:33:11 fedora.domain kernel: kernel BUG at lib/list_debug.c:62!
May 20 23:33:11 fedora.domain kernel: ------------[ cut here ]------------
May 20 23:33:11 fedora.domain kernel: list_del corruption. next->prev should be ffff8bf934ea84a8, but was ffff0030003084a8. (next=ffff8bf9788588a8)
------------
 -> https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/2nd/root-journalctl.-r.-b.-1.SystemFreezeMouseDelayFewSeconds.kernel6.3.3.amd_pstate-passive.kernelBugEntry.log

(6.2.15)
------------
May 21 20:07:02 fedora.domain kernel: RBP: ffff88a2767e8c50 R08: ffff889ff3544620 R09: ffff889feb220040
May 21 20:07:02 fedora.domain kernel: RDX: ffff88a4479e2c58 RSI: ffff889ff3544490 RDI: 0018e70000000082
May 21 20:07:02 fedora.domain kernel: RAX: ffff88a4479e2c00 RBX: 000000000000000e RCX: 0000000000000003
May 21 20:07:02 fedora.domain kernel: RSP: 0018:ffffa550c7877a48 EFLAGS: 00010206
May 21 20:07:02 fedora.domain kernel: Code: 00 00 e8 29 36 26 ec eb a9 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 15 <48> 8b 47 48 48 85 c0 74 0e 48 8b 80 20 09 00 00 c3 cc cc cc cc 31
May 21 20:07:02 fedora.domain kernel: RIP: 0010:amdgpu_ttm_tt_get_usermm+0xa/0x30 [amdgpu]
May 21 20:07:02 fedora.domain kernel: Hardware name: LENOVO 21CHCTO1WW/21CHCTO1WW, BIOS R23ET60W (1.30 ) 09/14/2022
May 21 20:07:02 fedora.domain kernel: CPU: 12 PID: 3378 Comm: kwin_wayla:cs0 Not tainted 6.2.15-300.fc38.x86_64 #1
May 21 20:07:02 fedora.domain kernel: general protection fault, probably for non-canonical address 0x18e700000000ca: 0000 [#1] PREEMPT SMP NOPTI
------------
 -> https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/2nd/root-journalctl.-r.-b.-1.SystemFreeze.kernel6.2.15.amd_pstate-passive.amd_gpuLog.log
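
As a reading aid for the addresses in these excerpts: the kernel calls an x86_64 address "non-canonical" when its upper bits are not a sign extension of the highest implemented address bit. A small Python sketch of that check, assuming 48-bit virtual addresses (4-level paging; this is an assumption about the machine, and with 5-level paging the width would be 57). `is_canonical` is an illustrative helper, not a kernel interface.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal.

    va_bits=48 is an assumption (4-level paging).
    """
    top = addr >> (va_bits - 1)               # bits 63..47 for 48-bit VA
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits for 48-bit VA
    return top == 0 or top == all_ones


# Addresses copied from the log excerpts above.
for addr in (
    0x00001B3F00000008,  # page fault address (6.3.3)
    0xFFFF8BF934EA84A8,  # list_del: expected next->prev
    0xFFFF0030003084A8,  # list_del: value actually found
    0x0018E700000000CA,  # general protection fault address (6.2.15)
):
    print(f"{addr:#018x} canonical={is_canonical(addr)}")
```

Under this assumption the check agrees with the kernel's "non-canonical" label for the 6.2.15 fault address, and it also flags the value found in the `list_del corruption` message while accepting the expected one, which fits the picture of an overwritten pointer rather than a merely stale one.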

(also 6.2.15)
------------
May 19 16:31:56 fedora.domain kernel: RIP: 0010:__kmem_cache_alloc_node+0x197/0x2f0
May 19 16:31:56 fedora.domain kernel: Hardware name: LENOVO 21CHCTO1WW/21CHCTO1WW, BIOS R23ET60W (1.30 ) 09/14/2022
May 19 16:31:56 fedora.domain kernel: CPU: 1 PID: 3895 Comm: kwin_wayla:cs0 Tainted: G      D            6.2.15-300.fc38.x86_64 #1
May 19 16:31:56 fedora.domain kernel: general protection fault, probably for non-canonical address 0x30475c614448660: 0000 [#2] PREEMPT SMP NOPTI
May 19 16:31:33 fedora.domain abrt-dump-journal-oops[2101]: Reported 1 kernel oopses to Abrt
May 19 16:31:32 fedora.domain abrt-server[7820]: Deleting problem directory '/var/spool/abrt/oops-2023-05-19-16:31:32-2101-0'
May 19 16:31:32 fedora.domain abrt-server[7820]: 'post-create' on '/var/spool/abrt/oops-2023-05-19-16:31:32-2101-0' exited with 1
May 19 16:31:32 fedora.domain abrt-server[7820]: Package 'kernel-core' isn't signed with proper key
May 19 16:31:32 fedora.domain abrt-dump-journal-oops[2101]: abrt-dump-journal-oops: Creating problem directories
May 19 16:31:32 fedora.domain abrt-dump-journal-oops[2101]: abrt-dump-journal-oops: Found oopses: 1
May 19 16:31:31 fedora.domain kernel: PKRU: 55555554
May 19 16:31:31 fedora.domain kernel: CR2: 00007f8dde922000 CR3: 00000001ea09a000 CR4: 0000000000750ee0
May 19 16:31:31 fedora.domain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 19 16:31:31 fedora.domain kernel: FS:  00007f8e01fff6c0(0000) GS:ffff8b915ee40000(0000) knlGS:0000000000000000
May 19 16:31:31 fedora.domain kernel: R13: 00000000ffffffff R14: 00000000000003b8 R15: ffff8b8a40042b00
May 19 16:31:31 fedora.domain kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 19 16:31:31 fedora.domain kernel: RBP: 0000000000000cc0 R08: 0000000000037160 R09: ffff8b896afd0000
May 19 16:31:31 fedora.domain kernel: RDX: 0000000032e42001 RSI: 0000000000000cc0 RDI: 030475c614448460
May 19 16:31:31 fedora.domain kernel: RAX: 030475c614448660 RBX: 0000000000000cc0 RCX: 00000000000003b8
May 19 16:31:31 fedora.domain kernel: RSP: 0018:ffff96a448543928 EFLAGS: 00010206
May 19 16:31:31 fedora.domain kernel: Code: 2b 14 25 28 00 00 00 0f 85 6a 01 00 00 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07 48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41 f6 c0
May 19 16:31:31 fedora.domain kernel: RIP: 0010:__kmem_cache_alloc_node+0x197/0x2f0
May 19 16:31:31 fedora.domain kernel: ---[ end trace 0000000000000000 ]---
May 19 16:31:31 fedora.domain kernel:  videobuf2_common btrtl libarc4 snd_hwdep btbcm snd_seq btintel btmtk pktcdvd snd_pci_acp5x videodev snd_seq_device irqbypass thinkpad_acpi snd_rn_pci_acp3x think_lmi ses snd_acp_config mc rapl snd_pcm pcspkr snd_soc_acpi enclosure firmware_attributes_class cfg80211 bluetooth ledtrig_audio wmi_bmof i2c_piix4 k10temp snd_pci_acp3x scsi_transport_sas snd_timer platform_profile snd rfkill mhi soundcore acpi_tad amd_pmc joydev xfs loop zram nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 adiantum dm_crypt amdgpu drm_ttm_helper ttm iommu_v2 drm_buddy nvme gpu_sched nvme_core drm_display_helper crct10dif_pclmul hid_multitouch crc32_pclmul crc32c_intel polyval_clmulni video uas ucsi_acpi polyval_generic ccp cec ghash_clmulni_intel usb_storage sha512_ssse3 typec_ucsi sp5100_tco r8169 typec nvme_common wmi i2c_hid_acpi i2c_hid serio_raw ip6_tables ip_tables fuse
May 19 16:31:31 fedora.domain kernel: Modules linked in: overlay udf crc_itu_t rfcomm uinput snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink bnep qrtr_mhi sunrpc binfmt_misc vfat fat qrtr ath11k_pci ath11k snd_soc_dmic snd_soc_acp6x_mach snd_acp6x_pdm_dma snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp qmi_helpers snd_sof snd_ctl_led snd_sof_utils snd_hda_codec_realtek mac80211 snd_hda_codec_generic intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common snd_soc_core snd_usb_audio edac_mce_amd snd_hda_intel snd_intel_dspcfg kvm_amd uvcvideo snd_usbmidi_lib snd_intel_sdw_acpi videobuf2_vmalloc snd_hda_codec snd_rawmidi snd_compress videobuf2_memops ac97_bus kvm snd_pcm_dmaengine videobuf2_v4l2 snd_pci_ps snd_hda_core snd_rpl_pci_acp6x btusb snd_pci_acp6x
May 19 16:31:31 fedora.domain kernel:  </TASK>
May 19 16:31:31 fedora.domain kernel: R13: 00000000c0186444 R14: 000000000000002d R15: 00007f8e01ffec78
May 19 16:31:31 fedora.domain kernel: R10: 00007f8dfda6cd00 R11: 0000000000000246 R12: 00007f8e01ffeb80
May 19 16:31:31 fedora.domain kernel: RBP: 00007f8e01ffeb10 R08: 00007f8e01ffecd0 R09: 00007f8e01ffeb60
May 19 16:31:31 fedora.domain kernel: RDX: 00007f8e01ffeb80 RSI: 00000000c0186444 RDI: 000000000000002d
May 19 16:31:31 fedora.domain kernel: RAX: ffffffffffffffda RBX: 00007f8e01ffec78 RCX: 00007f8e23128edd
May 19 16:31:31 fedora.domain kernel: RSP: 002b:00007f8e01ffeac0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 19 16:31:31 fedora.domain kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
May 19 16:31:31 fedora.domain kernel: RIP: 0033:0x7f8e23128edd
May 19 16:31:31 fedora.domain kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 19 16:31:31 fedora.domain kernel:  ? do_syscall_64+0x68/0x90
May 19 16:31:31 fedora.domain kernel:  ? do_syscall_64+0x68/0x90
May 19 16:31:31 fedora.domain kernel:  ? do_syscall_64+0x68/0x90
May 19 16:31:31 fedora.domain kernel:  ? syscall_exit_to_user_mode+0x17/0x40
May 19 16:31:31 fedora.domain kernel:  ? do_syscall_64+0x68/0x90
May 19 16:31:31 fedora.domain kernel:  ? syscall_exit_to_user_mode+0x17/0x40
May 19 16:31:31 fedora.domain kernel:  ? do_syscall_64+0x68/0x90
May 19 16:31:31 fedora.domain kernel:  do_syscall_64+0x5c/0x90
May 19 16:31:31 fedora.domain kernel:  __x64_sys_ioctl+0x90/0xd0
May 19 16:31:31 fedora.domain kernel:  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
May 19 16:31:31 fedora.domain kernel:  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
May 19 16:31:31 fedora.domain kernel:  drm_ioctl+0x235/0x410
May 19 16:31:31 fedora.domain kernel:  drm_ioctl_kernel+0xc9/0x170
May 19 16:31:31 fedora.domain kernel:  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
May 19 16:31:31 fedora.domain kernel:  amdgpu_cs_ioctl+0x4ce/0x2140 [amdgpu]
May 19 16:31:31 fedora.domain kernel:  ? __check_object_size+0x22f/0x2b0
May 19 16:31:31 fedora.domain kernel:  amdgpu_bo_list_create+0x61/0x3d0 [amdgpu]
May 19 16:31:31 fedora.domain kernel:  __kmalloc_node+0x4c/0x150
May 19 16:31:31 fedora.domain kernel:  ? amdgpu_bo_list_create+0x61/0x3d0 [amdgpu]
May 19 16:31:31 fedora.domain kernel:  ? amdgpu_bo_list_create+0x61/0x3d0 [amdgpu]
May 19 16:31:31 fedora.domain kernel:  <TASK>
May 19 16:31:31 fedora.domain kernel: Call Trace:
May 19 16:31:31 fedora.domain kernel: PKRU: 55555554
May 19 16:31:31 fedora.domain kernel: CR2: 00007f8dde922000 CR3: 00000001ea09a000 CR4: 0000000000750ee0
May 19 16:31:31 fedora.domain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 19 16:31:31 fedora.domain kernel: FS:  00007f8e01fff6c0(0000) GS:ffff8b915ee40000(0000) knlGS:0000000000000000
May 19 16:31:31 fedora.domain kernel: R13: 00000000ffffffff R14: 00000000000003b8 R15: ffff8b8a40042b00
May 19 16:31:31 fedora.domain kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
May 19 16:31:31 fedora.domain kernel: RBP: 0000000000000cc0 R08: 0000000000037160 R09: ffff8b896afd0000
May 19 16:31:31 fedora.domain kernel: RDX: 0000000032e42001 RSI: 0000000000000cc0 RDI: 030475c614448460
May 19 16:31:31 fedora.domain kernel: RAX: 030475c614448660 RBX: 0000000000000cc0 RCX: 00000000000003b8
May 19 16:31:31 fedora.domain kernel: RSP: 0018:ffff96a448543928 EFLAGS: 00010206
May 19 16:31:31 fedora.domain kernel: Code: 2b 14 25 28 00 00 00 0f 85 6a 01 00 00 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07 48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41 f6 c0
May 19 16:31:31 fedora.domain kernel: RIP: 0010:__kmem_cache_alloc_node+0x197/0x2f0
May 19 16:31:31 fedora.domain kernel: Hardware name: LENOVO 21CHCTO1WW/21CHCTO1WW, BIOS R23ET60W (1.30 ) 09/14/2022
May 19 16:31:31 fedora.domain kernel: CPU: 1 PID: 5218 Comm: firefox:cs0 Not tainted 6.2.15-300.fc38.x86_64 #1
May 19 16:31:31 fedora.domain kernel: general protection fault, probably for non-canonical address 0x30475c614448660: 0000 [#1] PREEMPT SMP NOPTI
------------
 -> https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/2nd/root-journalctl.-r.-b.-1.FirstFirefoxInsideTabFreeze_thenRemainingFirefoxFreeze_thenSystemFreeze.16-31-55screenClockFrozen.kernel6.2.15.amd_pstate-passive.log

However, it can still happen that the error strikes so "immediately" that the kernel cannot log the event, such as in https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/2nd/root-journalctl.-r.-b.-1.SystemFreeze.kernel6.2.15.amd_pstate-passive.log (also 6.2.15).

The ffff8bf934ea84a8 error was interesting because at that time the freeze started in one process, then affected a second process, and then the whole system, each time with a few seconds in between, as if it were spreading from one process or core to the next. Another example of that is https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/2nd/root-journalctl.-r.-b.-1.FirstVLCinToolboxFreeze_thenFirefoxFreeze_thenSystemFreeze.17-44-18screenClockFrozen.kernel6.3.3.amd_pstate-passive.log , but in the latter case the issue did not produce logs (error occurring "too" immediately?). Sometimes the issue produces logs as above, sometimes no related logs at all. I think the earlier logs contain cases where large numbers of errors (including kernel errors) occurred about a minute or so before the freeze.

My perception is that the exact behavior sometimes changes after updates (not only kernel updates), and that most kernels then change their behavior in a comparable way. I have not explicitly logged/identified it that way, though; it's just a perception so far (let me know if you want to compare all these logs against my dnf logs, I can provide them). I think the "from process to process" freeze also started to appear after a larger qt update, and after another update, I am back to the system always freezing completely and immediately. I think that "behavioral change" was not bound to one kernel, but unfortunately I did not document it, so I cannot say for sure. Given the "amd_gpu" relation, I assume that "what" calls the buggy function in the kernel and "when" it calls it may change, but the logs make clear it's a kernel issue. My earlier logs also contain amd_gpu and other kernel errors shortly before the freeze.

Since this is clearly a bug in the kernel, I reassign to kernel.

Also, I have now tested for several days with an unconfined_u user account: it does not make a difference.

All the above logs in: https://gitlab.com/py0xc31/public-tmp-storage/-/tree/main/2nd
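
Since the behavior differs from log to log, it can help to reduce each journal to its fault lines before comparing them. The following is a minimal Python sketch, assuming the `journalctl -r` (newest first) line format quoted in this report. The function name `summarize` and the regular expressions are illustrative and not part of any existing tool.

```python
import re

# Patterns for the kernel lines quoted in this report (assumed format).
FAULT_RE = re.compile(r"kernel: (general protection fault.*|BUG: .*|kernel BUG at .*)")
RIP_RE = re.compile(r"kernel: RIP: 0010:(\S+)")   # 0010 = kernel code segment
COMM_RE = re.compile(r"kernel: CPU: \d+ PID: \d+ Comm: (\S+)")

def summarize(journal_text: str, newest_first: bool = True):
    """Return one dict per fault with the process name and kernel RIP symbol."""
    lines = journal_text.splitlines()
    if newest_first:
        lines.reverse()            # restore chronological order
    events = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            events.append({"fault": m.group(1), "comm": None, "rip": None})
            continue
        if not events:
            continue
        cur = events[-1]
        m = COMM_RE.search(line)
        if m and cur["comm"] is None:
            cur["comm"] = m.group(1)
        m = RIP_RE.search(line)
        if m and cur["rip"] is None:
            cur["rip"] = m.group(1)
    return events

sample = """\
May 19 16:31:56 fedora.domain kernel: RIP: 0010:__kmem_cache_alloc_node+0x197/0x2f0
May 19 16:31:56 fedora.domain kernel: Hardware name: LENOVO 21CHCTO1WW/21CHCTO1WW, BIOS R23ET60W (1.30 ) 09/14/2022
May 19 16:31:56 fedora.domain kernel: CPU: 1 PID: 3895 Comm: kwin_wayla:cs0 Tainted: G      D            6.2.15-300.fc38.x86_64 #1
May 19 16:31:56 fedora.domain kernel: general protection fault, probably for non-canonical address 0x30475c614448660: 0000 [#2] PREEMPT SMP NOPTI
"""
result = summarize(sample)
for event in result:
    print(event["comm"], event["rip"], event["fault"])
```

Only the first `Comm:` and kernel-mode `RIP:` line after each fault line are kept, so the user-space `RIP: 0033:` line of the same trace is ignored.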

Earlier logs can be found in: https://gitlab.com/py0xc31/public-tmp-storage/-/tree/main/

Comment 15 Christopher Klooz 2023-05-22 20:31:23 UTC
It seems the issue (or a comparable one) appears also on other people's systems: see two other examples in https://bodhi.fedoraproject.org/updates/FEDORA-2023-514965dd8a

At least one of them also has an AMD Ryzen 7 PRO.

The descriptions closely resemble what I reported here, up to Firefox being a trigger for the kernel errors that cause a crash (the log extracts of the kernel errors are also comparable, including the amd_gpu relation).

Maybe also related:  
https://gitlab.freedesktop.org/drm/amd/-/issues/2447#note_1918408  
https://gitlab.freedesktop.org/drm/amd/-/issues/2548#note_1918409
(thanks to @benthaase for providing the links)

Comment 16 Christopher Klooz 2023-05-28 10:26:34 UTC
The "symptoms" have changed a little again. My perception remains that this happens after dnf updates (not kernel updates) that change applications, which then possibly trigger the kernel bug differently.

However, the kernel error logs about amd_gpu, malloc, cpu core/cache, etc. are now sometimes more detailed, since more time can sometimes pass before the final freeze occurs:

1) Some detailed kernel errors (amd_gpu; memory allocation; issues on CPU cores+cache; much more) have been logged with the following two issues:

Kernel error when the system tried to lock the screen: the issue occurred once the screen went black (before sddm appeared). I could still use the mouse on the black screen and then switch to the terminal in TTY5; a second or two after switching, the error occurred. Root `journalctl -r --boot=-1` after reboot: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/3rd/fullfreezeBeforeScreenLock_detailedCpuKernelErrorLogs_6.2.15.pstate-passive.log

Immediately after the next reboot, I did not start KDE at all (only SDDM was active), and I did not log in with the user but only on the TTY4 terminal to get the root logs. However, when I shut down after getting the above logs, the issue happened again a few operations after the shutdown command: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/3rd/fullfreezeAtShutdown_detailedCpuKernelErrorLogs_6.2.15.pstate-passive.log

An earlier freeze, which did not differ from other freezes, produced one log line that had not occurred before: `kernel: perf: interrupt took too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 79000` (https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/3rd/fullfreeze_6.2.15_OR_6.3.3.pstate-passive.log). Beyond that, it was largely identical to the freezes that did not produce any logs. Also, I cannot say whether that line is related to the freeze (I didn't track the time of that freeze). I also don't remember whether that was 6.2.15 or 6.3.3.

2) Some new "kernel bug" entries have been produced at a freeze with 6.3.3 some time ago (https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/3rd/fullfreeze_6.3.3.pstate-passive.log):
```
May 26 22:58:42 fedora.domain kernel: #PF: error_code(0x0000) - not-present page
May 26 22:58:42 fedora.domain kernel: #PF: supervisor read access in kernel mode
May 26 22:58:42 fedora.domain kernel: BUG: unable to handle page fault for address: 0000003eea000043
```

I will now test whether 6.3.4 improves the situation, and report if the behavior/issue changes.

It has to be noted that the processes that are involved can change: Firefox is not always running when the issue occurs, and with regards to the above "shutdown" freeze, even KDE is not necessarily running.

Comment 17 Christopher Klooz 2023-05-28 11:48:39 UTC
The issue persists on 6.3.4: I have now experienced it twice on 6.3.4 while working. Usually I have between 0 and 3, seldom 4, freezes a day. With 6.3.4 I just had 2 freezes in about an hour. Let's see whether that frequency was just a coincidence or whether the occurrences have increased on 6.3.4.

Major log entries from the first freeze (the second did not, at least at first glance, produce related logs before the system was fully frozen):
```
May 28 13:20:25 fedora.domain kernel: RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
May 28 13:20:25 fedora.domain kernel: Hardware name: LENOVO 21CHCTO1WW/21CHCTO1WW, BIOS R23ET60W (1.30 ) 09/14/2022
May 28 13:20:25 fedora.domain kernel: CPU: 8 PID: 5056 Comm: kwin_wayla:cs0 Not tainted 6.3.4-201.fc38.x86_64 #1
May 28 13:20:25 fedora.domain kernel: general protection fault, probably for non-canonical address 0x49f8d7efd771dae6: 0000 [#1] PREEMPT SMP NOPTI
```
-> The very time of the desktop clock of this freeze was 13:20:24.


Full log of 1st system freeze with 6.3.4: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/retry6.3.4/fullSystemFreeze.DesktopClockTime13-20-24.kernel6.3.4.pstate-passive.log
Full log of 2nd system freeze with 6.3.4: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/retry6.3.4/fullSystemFreeze.2nd.kernel6.3.4.pstate-passive.log

Comment 18 Christopher Klooz 2023-05-28 13:18:00 UTC
With regards to my previous comment, I had to return to 6.2.15 because 6.3.4 creates too many kernel errors/freezes so that the system is not usable with this kernel.

However, 6.3.4 shows a new phenomenon: Firefox *¹ freezes and crashes, it cannot be restarted, and `pidof firefox` then no longer works (it just hangs without returning until I press CTRL+C). This time the system does not freeze, although the kernel errors are logged in the same way; they just do not seem to "spread" from one core/thread to others *².

Again, massive amounts of kernel errors are logged *². Some of the more indicative ones may be:

```
May 28 14:38:41 fedora.domain kernel: WARNING: CPU: 4 PID: 5523 at drivers/gpu/drm/ttm/ttm_bo.c:326 ttm_bo_release+0x289/0x2e0 [ttm]
```
...
```
May 28 14:38:41 fedora.domain kernel: WARNING: CPU: 4 PID: 5523 at drivers/gpu/drm/ttm/ttm_bo.c:327 ttm_bo_release+0x296/0x2e0 [ttm]
```
...
```
May 28 14:38:41 fedora.domain kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:193!
```

However, once I tried to shut down, it became obvious that the running system was already in a corrupted state, and the shutdown culminated in *³ ...
```
May 28 14:51:09 fedora.domain kernel: #PF: error_code(0x0000) - not-present page
May 28 14:51:09 fedora.domain kernel: #PF: supervisor read access in kernel mode
May 28 14:51:09 fedora.domain kernel: BUG: unable to handle page fault for address: 0000003000300010
```

*¹ I also had freezes without Firefox running, so it is not Firefox-specific; see BZ#2193110
*² https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/retry6.3.4/firefoxCrash-noPidof-kernelerror-6.3.4.pstate-passive.CUT.log
*³ https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/retry6.3.4/firefoxCrash-noPidof-kernelerror-6.3.4.pstate-passive.preHALT.log

Comment 19 Yi Hao 2023-06-01 14:40:40 UTC
Your bodhi vote brought me here. I have the 14" version of the same laptop with pretty much the same problem.
I don't have a fix for all your problems, but I hope I can offer you some workarounds.

1. Do not enable Linux S3 support in BIOS
It is badly broken, no longer supported and has been removed in later version of BIOS.
You must revert the setting before updating your BIOS or it will be stuck enabled with no way to change it.

2. Disable Panel Self Refresh
From your kernel log, it seems you have a supported eDP panel, and hence PSR has been enabled automatically since kernel v6.2.
Set the kernel parameter: amdgpu.dcdebugmask=0x10

3. Update NVMe firmware
Not sure whether yours comes with Kioxia, Samsung or SKHynix, but Lenovo has published NVMe firmware updates that fix crashes on Linux under certain workloads. At least one firmware update stops an Ubuntu installation from bricking the drive.

4. Stick to kernel v6.3
It has a lot of amdgpu fixes and is currently the most stable for me after I did all of the above.

Extra 1:
In kernel v6.1 and below, your computer might or might not totally freeze up. Half the time it is just the display that no longer updates. From kernel v6.2 onwards, it seems the freeze is just the display not updating.

If you are not connected to an external display, close the lid, wait for the ThinkPad LED to blink, then reopen the lid and continue as usual.

If you are connected to an external display, press Fn+4 to trigger sleep. You need to be patient and wait up to a minute for it to sleep, then wake up the laptop.

Extra 2:
Starting kernel v6.3 I use amd_pstate=active. With "passive", I get random micro stutter.

Extra 3:
I noted your TSC is broken as well. The only fix is a new BIOS, and it is nowhere to be seen.

Extra 4:
If your display is currently connected to the dock, try connecting it directly to the HDMI port. Loads of amdgpu bugs with dock.

Comment 20 Christopher Klooz 2023-06-02 14:31:42 UTC
Thanks for your input! There are indeed several possibilities that I have not yet considered or tried. I will test your suggestions when I have some time in the next days. They give some hope ;)

However, my results with the 6.3.X kernel and amd_pstate=active are not always the same. For me, it is a bit arbitrary how it develops: after some dnf updates, the issue has increased from daily to sometimes more than once per hour (if not every few minutes), and with another dnf update, the issue has disappeared for days. In one case, it was just many minor updates of user space applications (which led me to assume that this is just a coincidental development that at best triggers the buggy function more often than others), in one case it was a major qt-related update (KDE), and in one case it was mesa (which could indeed be linked). Sometimes all kernels behave the same, sometimes one is more stable than the others. But this is more of an average that often feels arbitrary, so the different behaviors could of course just be coincidences. On average, I "felt" I had the best stability with 6.2.15, but with 6.3.5 this has changed again, so I am now on 6.3.5. Also, 6.2.X is going end of life, so I will stay with 6.3.X anyway.

At the same time, amd_pstate=passive has proven to be the most reliable (by perception). Again, this can be coincidence. I have freezes both with amd_pstate=passive (schedutil governor; powersave is extremely slow with passive) and amd_pstate=active (powersave governor), but "passive" feels most stable. With each new kernel I test "active" again, and unlike the seemingly arbitrary developments above, I consistently have more freezes with "active" than with "passive", although the difference is not big enough to exclude coincidence. However, I had micro stutters with the default driver, which is why I originally switched to amd_pstate several months ago; nothing like that occurs with either amd_pstate mode on my system.

You are right that when my external screen is attached (I have no dock), the issue occurs much more often (indeed, the difference in this respect is too big to be just a coincidence). However, I also have freezes without an external screen (I work around 10% of my time without an external screen, then on battery). I will try your suggestions about the screen, since just decreasing the frequency of freezes would already be a great improvement.

Concerning NVMe, I have SK hynix Platinum P41 NVMe Solid State Drive 2TB. Let's see if the firmware update makes a difference.

I am also looking forward to testing whether triggering sleep during a freeze helps; this would indeed be a valuable mitigation to avoid always doing a hard reset. I can already confirm that the sound usually keeps working when a full system freeze occurs while I watch a movie. This obviously supports your assumption. However, going to sleep (especially with an external screen attached) has often triggered freezes itself (could be linked to S3, I guess). Until today, at any type of freeze, I was just trying to get to CTRL+ALT+F*. Let's see if sleep can be exploited in some types of freezes.

Disabling S3 sounds reasonable in general, but at first glance, I cannot find it in my BIOS (I have not yet done a BIOS update since I got the Lenovo in January). Can you give me a hint in which sub-menu I should find it? I cannot find any documentation about that from Lenovo.

Nevertheless, I am not convinced that you have the same issue, because in my case, the issue seems to corrupt a CPU core/thread, and then (casually speaking) it spreads from one core/thread to the next. This is how I read the logs of the related occurrences; the actual behavior in the GUI is that it often develops from one application to the next, e.g., first Firefox freezes, then VLC, then the whole GUI/system. Once, everything was frozen including VLC, except the movie itself [within VLC], which I could still watch for several minutes before I did a hard reset. Once a core is corrupted, only a reset repairs the corrupted state. Even if I can get to the terminal, at the latest when I try `systemctl reboot (-i)` or `shutdown -r now`, the system finally crashes during shutdown with many errors given its corrupted state (once any symptom has appeared, it is no longer possible to shut down "normally", even if there is no total freeze). Trying to keep working only "accelerates the spread" of the issue. So even when the system "survives" the freeze of an application / corruption of a CPU core/thread, it is corrupted with related symptoms; one example/elaboration with logs can be found below:

------------

Since some of the logs revealed a potential origin, I opened https://lore.kernel.org/dri-devel/69d51cd5-732f-9dc5-4e12-d68990132c85@my.mail.de/T/#u with mails to the two maintainers of drivers/gpu/drm/ttm/ (-> kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c).

Comment 21 Christopher Klooz 2023-06-02 15:29:36 UTC
I have just had another "run" of freezes/crashes/corruptions every few minutes: this time with 6.3.5 amd_pstate=passive. On the 31st, I had it with 6.2.15. With 6.3.5, I have had such a "run" only with amd_pstate=active so far. In recent days, the permanent freezes have occurred about as often as the "core to core" corruptions. The following is just one way the "core to core"-like spreading corruption can take place: sometimes it happens without a beep, sometimes with multiple beeps when I keep ignoring it, sometimes it freezes soon after the first beep, and sometimes some commands/applications still work. In the one I just had a few minutes ago, my system beeps (error beep of the notebook), so I know something has happened, and before risking a full freeze I go directly to an open TTY. I cannot log in on new TTYs, but I can go to one where I am already logged in. There, nothing works except ENTER (to get a new line), while many errors are output in the TTY. When trying to shut down, the system finally freezes a moment later (in the 6.3.5 gitlab folder below, it is the file [1]; sorry for the names, but I want to keep some documentation with links between behavior and logs ;)

Today, I have only saved the last one (the file I mentioned above; I used both 6.3.5 and 6.2.15 today, and both currently behave the same), but from the 31st, I saved several logs from the "run" of "quickly successive occurrences" (these are all the 6.2.15 logs in the 6.2.15 gitlab folder below except [2] and [3]). It should be noted that the "quickly successive issues" on the 31st occurred without an external screen.

Some current logs of 6.2.15 and 6.3.5 (including those mentioned above) can be found here:
6.2.15) https://gitlab.com/py0xc31/public-tmp-storage/-/tree/main/New6215-635/6.2.15 (6.2.15 -> the four "quickly successive occurrences" plus [2] and [3])
6.3.5) https://gitlab.com/py0xc31/public-tmp-storage/-/tree/main/New6215-635/6.3.5 (6.3.5 -> the one from now noted above, plus another one with full freeze and with differing logs, which is from 30th)

I will try to get the eDP suggestion and the NVMe firmware update implemented and then see if it makes a difference. I think the logs already indicate that these are not the origin, but maybe they achieve some decrease in occurrences.

The following are some of those contained in the respective gitlab folders noted above:
[1] beepingThenIWentTTYButCommandsNoLongerWorkedIncludingShutdownAndNewLoginsAtTTYExceptThoseAlreadyLoggedInAndExceptEnterInExistingTTY_kernel6.3.5_pstate-passive.log
[2] FullFreeze2stagedAfterLongRun-firefoxThenSys.6.2.15.pstate-passive.log
[3] FullFreeze+DesktopDiffusionWhenTurningExternalHdmiScreenOnWhileMovieOnInternalScreenForSomeMinutes.6.2.15.pstate-passive.log

Comment 22 Yi Hao 2023-06-02 19:50:10 UTC
I tried to read all your logs, but seriously, there are a lot, so I just scrolled through most of them.

I will answer the obvious questions first:
1. S3 sleep
From your log:
ACPI: PM: (supports S0 S4 S5)

It seems that S3 is not available. You can manually check by doing:
cat /sys/power/mem_sleep

If it only shows [s2idle], it means S3 is off. Your laptop is newer than mine, so I believe they no longer have the setting in the BIOS. I cannot remember where it was, as I am now on BIOS version 1.35.

2. NVMe firmware
From your log again:
FW:51720A10

It is the same as mine. As far as I am aware this is the latest right now, at least from Lenovo.
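
The `mem_sleep` check from item 1 can also be scripted. The following is a minimal Python sketch, assuming the usual sysfs format in which the active mode is shown in square brackets and `deep` is the label for suspend-to-RAM (S3). The function names are hypothetical.

```python
def parse_mem_sleep(text: str):
    """Return (active_mode, available_modes) from the content of /sys/power/mem_sleep."""
    modes, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]    # brackets mark the currently selected mode
            active = token
        modes.append(token)
    return active, modes

def s3_offered(text: str) -> bool:
    """True if the firmware/kernel offers 'deep' (suspend-to-RAM) at all."""
    return "deep" in parse_mem_sleep(text)[1]

# Example strings as they might appear in /sys/power/mem_sleep
print(parse_mem_sleep("[s2idle]"))        # S3 not offered
print(parse_mem_sleep("s2idle [deep]"))   # S3 offered and selected
print(s3_offered("[s2idle]"))
```

On a real system, the input would be the content of `/sys/power/mem_sleep`, for example read with `open("/sys/power/mem_sleep").read()`.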

The not so obvious are your stack trace and crash log:
1. kwin_wayland
I had a rough look at the code.
The stack trace goes to ColorDevice::update(), which calls DrmOutput::setGammaRamp(), and that's it? I looked at that function and it is as simple as it gets. I also noticed your coredumpctl says the memory dump is missing. Why? The log didn't explicitly specify the reason for the core dump.

2. firefox stack trace
They look like normal operation to me, but I am not a Firefox developer.

3. Kernel stack trace
Most of them are in amdgpu, but there are also stack traces with xfs and netlink. However, your kernel is untainted.

Question that remains unanswered:
1. One process corrupting another process
As far as I understand, given the constraints of a modern OS like F38, it is not possible for one process to corrupt another process. What seems more plausible is that your compositor crashes and those processes die because they get disconnected from the compositor.

2. System beep
Is it the same beep you get when you try to enter the BIOS? I have never faced this problem before. Can you take a picture of the TTY output when it happens?

I notice in your logs there is a high chance the instruction pointer (RIP) will be alloc_something, regardless of which process or kernel component.

In your v6.3.5 log, I also see exc_invalid_op() and page_fault_oops().

You might hate me for saying this:
1. memtest
Have you tried booting Lenovo Diagnostic and run a full memory test?

2. CPU test, etc
While you are in the diagnostic menu, do a CPU test too, or maybe just run all the tests if you have time?

3. Can you try scrubbing your boot drive for errors?
btrfs scrub start -Bfr /

4. Disable PSR
I see in your log PSR is still enabled. Can you try disabling it please just to set a baseline?

The flag is documented in https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
Definition as follows:
enum DC_DEBUG_MASK {
	DC_DISABLE_PIPE_SPLIT = 0x1,
	DC_DISABLE_STUTTER = 0x2,
	DC_DISABLE_DSC = 0x4,
	DC_DISABLE_CLOCK_GATING = 0x8,
	DC_DISABLE_PSR = 0x10,
	DC_FORCE_SUBVP_MCLK_SWITCH = 0x20,
	DC_DISABLE_MPO = 0x40,
};
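
To see which features a given `amdgpu.dcdebugmask` value switches off, the bitmask can be decoded against the enum above. The following is a minimal Python sketch that mirrors the values as quoted here (other kernel versions may define more or different bits). The class and function names are hypothetical.

```python
from enum import IntFlag

class DcDebugMask(IntFlag):
    """Bit values copied from the DC_DEBUG_MASK enum quoted above."""
    DC_DISABLE_PIPE_SPLIT = 0x1
    DC_DISABLE_STUTTER = 0x2
    DC_DISABLE_DSC = 0x4
    DC_DISABLE_CLOCK_GATING = 0x8
    DC_DISABLE_PSR = 0x10
    DC_FORCE_SUBVP_MCLK_SWITCH = 0x20
    DC_DISABLE_MPO = 0x40

def decode(mask: int) -> list[str]:
    """Return the names of all flags set in mask, in definition order."""
    return [flag.name for flag in DcDebugMask if mask & flag]

print(decode(0x10))   # the value suggested in this thread
print(decode(0x12))   # example of a combined mask
```

Several flags can be combined by OR-ing their values, e.g. `0x10 | 0x2` for PSR plus stutter.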

My untested hypothesis:
Assuming all your hardware tests passed, I can see from your logs that the kwin_wayland and firefox crashes happen when amdgpu crashes.

With a dead compositor, I assume all your apps are going to die too.

I am using GNOME and I get those crashes too, but GNOME and firefox recover. However, gnome-shell or Xwayland never crash in my case, so perhaps that's why I get a better user experience than you.

I don't know about KDE, but perhaps they don't recover gracefully when the GPU resets? It is just a guess. From the log, it seems they do reset successfully.

Emphasis:
The thing is, your compositor crashes and mine doesn't. That makes a huge difference in terms of user experience.

Extra:
Did you use a script to scrub your logs? I see IPv4 and some MAC addresses got cleaned up, but not IPv6, the SSID and the MAC address of your AP.
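
For scrubbing logs before publishing them, a small script can cover the address types mentioned above. The following is a minimal Python sketch, assuming addresses appear as whitespace-separated tokens in the journal lines. The function name `scrub` and the placeholder strings are hypothetical, and SSIDs are not handled because they have no fixed pattern.

```python
import ipaddress
import re

MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def _ipv4_repl(match):
    text = match.group(0)
    try:
        ipaddress.IPv4Address(text)      # reject things like 999.1.1.1
    except ValueError:
        return text
    return "<ipv4>"

def _token_repl(match):
    token = match.group(0)
    # Drop surrounding punctuation, zone index (%wlan0) and prefix length (/64)
    core = token.strip("[](),;'\"").split("%")[0].split("/")[0]
    if ":" not in core:
        return token
    try:
        ipaddress.IPv6Address(core)      # timestamps like 16:31:31 fail here
    except ValueError:
        return token
    return token.replace(core, "<ipv6>")

def scrub(line: str) -> str:
    """Replace MAC, IPv4 and IPv6 addresses in one log line with placeholders."""
    line = MAC_RE.sub("<mac>", line)
    line = IPV4_RE.sub(_ipv4_repl, line)
    return re.sub(r"\S+", _token_repl, line)

# Made-up example line, not taken from the real logs
print(scrub("May 21 20:07:02 host NetworkManager: aa:bb:cc:dd:ee:ff 192.168.1.20 fe80::1c2d:3eff:fe4f:5a6b/64"))
```

Validating candidates with the `ipaddress` module keeps timestamps, PCI addresses and kernel versions untouched, at the cost of missing addresses that are glued to other text.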

Comment 23 Christopher Klooz 2023-06-02 21:32:54 UTC
Sorry, I didn't want to urge you to skim through all the logs. I just keep providing them so that whoever takes care of the bug can work with logs of the most recent kernel(s) and can see the different manifestations of the bug. Even with the same kernel, the logs differ (just like the behavior).

About the information you asked for:

Concerning S3, I know about it, but I understood you to mean that there is a bug in the BIOS related to it, which might be triggered. This is why I asked how to deactivate it explicitly there, since I couldn't find anything in the BIOS. I have now checked all options: there is nothing about S3, so I assume it is already removed from my BIOS by default (1.30 atm; upgrade to 1.35 not yet done, but I plan to do it this weekend). mem_sleep in software is indeed s2idle.

Concerning Lenovo diagnostics: I did the CPU Test and Memory Quick Test (if I have time in the next days, I can do the full scale test). All was PASSED/DONE without errors reported.

The same for `btrfs scrub start -Bfr /`: no errors.

The system beep: yes, it is the loud short beep you also get when you want to enter the BIOS. It does not always occur, and sometimes it happens multiple times. In time, the beep correlates with the kernel errors. I just wonder why it happens only sometimes.

The output log within the TTY (that was output shortly after I switched to the terminal after the beep; the beep occurred when I was in KDE): I haven't made a screenshot and so I cannot compare 1:1, but I am quite sure it was the content of the log [1], more precisely parts of what happened 16:34:48 and/or 16:35:36, plus two of the "kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting" which are also mostly logged. However, I am not 100% sure, but I think there were another two "kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting" entries on the screen AFTER the kernel error logs, which seem to be not logged. But I think this is not related/helpful. The open TTY was not root but another user account.

I have disabled PSR by adding the option (amdgpu.dcdebugmask=0x10) to the grub entry (I will make it a default option after testing a few days; before 6.3.6 of course), so it only took effect at the next boot after I wrote my last posts. So far, with PSR disabled, I have had no issues. But an hour doesn't mean much. I will keep it that way and watch.

Thanks for your detailed elaboration, which can indeed explain a lot. However, in the end, we are back to there being a bug, likely related to "amdgpu" and likely in drivers/gpu/drm/ttm/ttm_bo.c, that affects the whole system.

In the kernel thread [2], I collected some of the reports we have so far. So it is not just me.

However, beyond those I already posted in the kernel thread, I just found two further bug reports that could be the same bug:

https://bugzilla.redhat.com/show_bug.cgi?id=2012882 -> it was closed with the assumption that the issue disappeared on 6.0.5. It also relates at some points to drivers/gpu/drm/ttm/ttm_bo.c and looks comparable to some of my earlier logs.

https://bugzilla.redhat.com/show_bug.cgi?id=1985880 -> closed due to end of F34 release cycle. Also relating to drivers/gpu/drm/ttm/ttm_bo.c.

(I know that the scrubbing in my logs was incomplete, but thanks for making me aware. In the end, it is not critical information.)

[1] beepingThenIWentTTYButCommandsNoLongerWorkedIncludingShutdownAndNewLoginsAtTTYExceptThoseAlreadyLoggedInAndExceptEnterInExistingTTY_kernel6.3.5_pstate-passive.log
[2] https://lore.kernel.org/dri-devel/69d51cd5-732f-9dc5-4e12-d68990132c85@my.mail.de/T/#u

Comment 24 Christopher Klooz 2023-06-03 12:02:07 UTC
Disabling PSR (amdgpu.dcdebugmask=0x10) does not mitigate the issue. It worked for some hours but then I just had two freezes within a few minutes, first one where it was again freezing app by app until finally the whole system froze, and the second one a few minutes later was an immediate full freeze:  
https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/appByAppFreezingUntilFullSystemFreeze.kernel.6.3.5.amd_pstate-passive.PSRdisabled.log  
https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/immediateFullFreeze.kernel.6.3.5.amd_pstate-passive.PSRdisabled.log  

Both freezes were with amd_pstate=passive and amdgpu.dcdebugmask=0x10. I will keep PSR disabled but have just switched to amd_pstate=active to see if that works better (so now my system is with amd_pstate=active amdgpu.dcdebugmask=0x10).

Sad that it didn't mitigate, since PSR could have explained also the freezes without external monitor (PSR support of the sink is also documented in logs from boots where no external monitor was attached).

I have not yet done a BIOS update to check possibilities one by one to see what makes a difference and what not.

Comment 25 Yi Hao 2023-06-03 15:39:53 UTC
I am able to quickly reproduce the crash until my compositor no longer recover and also the system beep using the test URL from @

Comment 26 Christopher Klooz 2023-06-03 16:08:39 UTC
amd_pstate=active amdgpu.dcdebugmask=0x10 does not make it better. I have had 15-20 freezes today. So the frequency remains the same, as does the arbitrary pattern: sometimes I can work for some hours, and then I have freezes every few minutes.

Also, even when I do not log into KDE but only work in the terminal in TTY, the occurrence+frequency remains roughly the same (having worked only in mc, mcedit; also, sddm was enabled in parallel). However, the logs during copying with `mc` differ [1]. I don't know if that is indicative, I think it only documents different errors due to different apps with the same origin. But I added it [1] in case I am wrong.

I switched back to 6.2.15 also with trying PSR enabled and disabled, but no change. I guess in some hours the frequency will decrease again for unknown reasons. So PSR disabling/enabling seems to have no impact.

@ Yi Hao: You were asking for screenshots when the errors are flooded into the terminal. Generally, sometimes even in the tty terminal it only freezes (without error floodings or so). But some of the freezes I had today contained this "error flooding" occurrence: see screenshots [2] [3] [4] and their related log [5]. In this case, the errors were flooded into the terminal where I was running `mc` [2]. It did not freeze immediately but I could still switch to another TTY [3][4], which was also flooded with the errors. As usual, no possibility to login anywhere once an occurrence corrupted the system.

Another time I was only in TTY with mc (also not logged in KDE), the logs contained a message that I didn't see (or at least notice) before: `kernel: Fixing recursive fault but reboot is needed!` [6]. However, the system [6] was broken.

An earlier freeze today again had this "from app to app" feeling, but ironically, it did not cause any kernel error logs [7]. It might be interesting to note that even when there is no immediate freeze, in some rare cases the kernel doesn't log kernel errors [7]. This one [7] was with amd_pstate=active. I switched back to pstate=passive after 3 corrupted pstate=active sessions in a short amount of time. However, given the number of freezes today, I guess pstate=active was only a coincidence.

I only added some of the 6.3.5 logs:

[1] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/SESkernelErrorDuringCopyingInTtyWithoutKdeLoggedIn.kernel.6.3.5.amd_pstate-passive.PSRdisabled.log
[2] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/mc-tty.jpg
[3] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/blank-tty-upperPartScreen.jpg
[4] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/blank-tty-lowerPartScreen.jpg
[5] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/errorOutputInMc-TtyChangeWithOutputLog-noLogin-ThenFrozen.kernel.6.3.5.pstate-passive.PSRdisabled.log
[6] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/NotLoggedInKdeOnlySddmEnabled.WorkingInMcInTty.kernel.6.3.5.amd_pstate-passive.PSRdisabled.log
[7] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/someAppsFrozen.SomeAppsCrippled.ShutdownInTtyBroken.kernel.6.3.5.amd_pstate-active.PSRdisabled.log

Comment 27 Yi Hao 2023-06-03 16:25:32 UTC
I am able to quickly reproduce the crash, up to the point where my compositor no longer recovers, and also the system beep, using the test URL from Comment 10.

With a fresh boot, the first amdgpu crash is:
ring gfx_0.0.0 timeout.

I then didn't do anything and just waited. A short moment later it crashed by itself again with:
ring sdma0 timeout.

I then kept repeating the test until the screen went blank and no longer recovered. I switched to a TTY and all I see in dmesg is:
ring gfx_0.0.0 timeout, but soft recovered

The whole sequence of "amdgpu: GPU reset begin!" no longer happens.
I am using GNOME so I switched to F1 and gdm is still alive. When I try to login, all I get is a TTY with a single "@" character (F2).

The system beep happens if I accidentally press some key like PgUp or PgDn. If I press too fast, I get multiple beeps.

There are some clarifications I would like to seek:
1. Full freeze
Did the CapsLock LED still function? If yes, then the kernel is still alive. You can try to ping it from another device if you don't block ping packets. SSH should work too.

2. Cannot switch to TTY
I have tried to reproduce this and found that FnLock must be enabled. The LED on the Esc key must stay lit. Press Fn+Esc to enable it.

You must also release all keys on the keyboard before hitting Ctrl+Alt+F4 or something. If you hold on to Ctrl+Alt, the TTY doesn't switch.

For the WebGL Stress Test:
Even if it says the test is successful, scroll down and I can see the rendering is corrupted. The level of corruption is different for each run but most of the time it is filled with noise and vertical lines.

I quickly tried the same test on my Android phone with Firefox and it won't complete at all.
On Chrome, it ran and said WebGL hit a snag. However, they didn't crash.

To be fair, on my laptop only amdgpu crashed. coredumpctl is empty. dmesg doesn't suggest that any userland process crashed. I have the full dmesg up to the point of me rebooting in TTY. Should I upload it?

Extra 1:
I checked your logs in Comment 24 and they didn't contain any crash or the like. I suspect those didn't get flushed to disk.

Extra 2:
I noted you have LUKS but don't have "rd.luks.options=discard" in kargs. TRIM won't work in your case.

Extra 3:
I noted SELinux denials for some of the stuff. It is just an untested hypothesis, but is it possible that kwin_wayland's GPU recovery got blocked by SELinux? Do you use SELinux with MLS?

Things to try:
1. Did you try closing the lid or Fn+4? Assuming CapsLock still works, this should work, but it sometimes takes up to a minute.

2. Can you try the WebGL Stress Test and see if it kills your compositor for good? This would test my previous hypothesis that an amdgpu crash kills kwin_wayland.

3. Can you check your SMC version:
cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
Mine is:
SMC feature version: 0, program: 4, firmware version: 0x04453700 (69.55.0)
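
For comparing the value between machines: in the line above, the dotted version in parentheses appears to be the lower three bytes of the hexadecimal firmware word (0x45 = 69, 0x37 = 55, 0x00 = 0). A minimal Python sketch under that assumption; `parse_smc_line` is a hypothetical helper name, and the byte layout is inferred only from this one line, not from amdgpu documentation:

```python
import re

def parse_smc_line(line):
    """Extract the SMC fields from one amdgpu_firmware_info line.

    Assumption: the dotted version is bytes 2, 1 and 0 of the hex word,
    which matches the single line quoted in this comment.
    """
    m = re.search(r"SMC feature version: (\d+), program: (\d+), "
                  r"firmware version: (0x[0-9a-fA-F]+)", line)
    if m is None:
        return None  # not an SMC line
    word = int(m.group(3), 16)
    dotted = "%d.%d.%d" % ((word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF)
    return {"feature": int(m.group(1)), "program": int(m.group(2)),
            "word": word, "dotted": dotted}

sample = "SMC feature version: 0, program: 4, firmware version: 0x04453700 (69.55.0)"
info = parse_smc_line(sample)
print(info["dotted"])
```

Feeding it the SMC line from each machine and comparing the `word` values avoids misreading the hex by eye.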

On your Comment 26:
Those are no longer amdgpu crashes and I am not sure how they happen. I have seen them in some of your old logs but I have no idea how to reproduce them. I haven't had them myself.

Comment 28 Christopher Klooz 2023-06-03 18:36:42 UTC
Concerning Extra 1:

Yes, I saw that. Sometimes, it is not written to disk before the system gave up. But some of the logs of my most recent comment (2023-06-03 16:08:39 UTC) contain kernel error logs with PSR disabled.

Concerning Extra 2:

TRIM is disabled intentionally. However, now that you mention the storage: it is only indirectly related, but I just did the full Quick Storage Test in Lenovo Diagnostics: all PASSED. I also did the Extended CPU Test in the Lenovo Diagnostics today, also PASSED.

Concerning Extra 3:

The SELinux denials are because Fedora is currently not aligned well with SELinux, and once "confined users" are used, this leads to the denials. We are working on that somewhere else.

However, I have already tested to disable the confined profile of my user account ("sysadm_u") by setting my user account back to "unconfined_u". This does not impact the occurrences.

Also, you can see in my recent comment (2023-06-03 16:08:39 UTC) some log files that did not involve the confined user account (the logs of boots where I was not logged in KDE). E.g., https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/NotLoggedInKdeOnlySddmEnabled.WorkingInMcInTty.kernel.6.3.5.amd_pstate-passive.PSRdisabled.log

So I think we can exclude this as origin.

Concerning "no longer amdgpu crash":

I already noticed, and also wrote in the kernel thread, that sometimes logs contain explicit references to amdgpu and sometimes not (sometimes many references, sometimes just one or a few), and sometimes no errors are logged at all despite the error occurring. Given that the occurrences behave correspondingly, and many of the error types they produce are shared (including the "symptoms"), I guess they are linked. Also, concerning the time/frequency of occurrences, both happen very often, or both are absent for hours (and both are rare on battery without attached devices, but a regular phenomenon when attached). I don't think this is a coincidence? Of course I can only guess.

Also, in my recent logs of comment #26, there are references to amdgpu as well. E.g., [6] of my #26 comment:
```
Jun 03 17:01:08 fedora.fritz.box kernel:  amdgpu_gem_object_free+0x34/0x60 [amdgpu]
Jun 03 17:01:08 fedora.fritz.box kernel:  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
```

The log that identified the "kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:193!" also referenced amdgpu only once in its logs: "kernel:  ? amdgpu_drm_ioctl+0x71/0x90 [amdgpu]"
(part 1 https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/retry6.3.4/firefoxCrash-noPidof-kernelerror-6.3.4.pstate-passive.CUT.log ; part 2 https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/retry6.3.4/firefoxCrash-noPidof-kernelerror-6.3.4.pstate-passive.preHALT.log)

--------

Since the only permanent correlation of the frequency is if I am on battery or not, I will test and play with that (occurrences are seldom on battery without attached devices; except the "run" with high frequency of occurrences on Wednesday, I do not remember when the last occurrence was on battery).

When on battery (which means usually travelling), I have no screen attached, and two USB hubs primarily with multiple storage devices are not attached as well. I will focus on playing with the monitor settings first.

--------

I haven't really counted but I think today I have reached a record of above 30 occurrences.

I do not review most of them, but since I just prepared my post here, I reviewed the last freeze/log *¹ to see what they currently look like: it again contains many amdgpu references (btw, this one is also with `amdgpu.dcdebugmask=0x10`):

```
Jun 03 19:33:29 fedora-2.fritz.box kernel: #PF: error_code(0x0002) - not-present page
Jun 03 19:33:29 fedora-2.fritz.box kernel: #PF: supervisor write access in kernel mode
Jun 03 19:33:29 fedora-2.fritz.box kernel: BUG: unable to handle page fault for address: 0000007de200001b
Jun 03 19:33:28 fedora-2.fritz.box kernel: PKRU: 55555554
Jun 03 19:33:28 fedora-2.fritz.box kernel: CR2: 00007f2daf014000 CR3: 000000015e5be000 CR4: 0000000000750ef0
Jun 03 19:33:28 fedora-2.fritz.box kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 03 19:33:28 fedora-2.fritz.box kernel: FS:  00007feaf63506c0(0000) GS:ffff94845ee00000(0000) knlGS:0000000000000000
Jun 03 19:33:28 fedora-2.fritz.box kernel: R13: 00000000ffffffff R14: 00000000000003b8 R15: ffff947d40042b00
Jun 03 19:33:28 fedora-2.fritz.box kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Jun 03 19:33:28 fedora-2.fritz.box kernel: RBP: 0000000000000cc0 R08: 0000000000039160 R09: ffff947d55194440
Jun 03 19:33:28 fedora-2.fritz.box kernel: RDX: 00000000053f0000 RSI: 0000000000000cc0 RDI: 748ae2c3be2938b7
Jun 03 19:33:28 fedora-2.fritz.box kernel: RAX: 748ae2c3be293ab7 RBX: 0000000000000cc0 RCX: 00000000000003b8
Jun 03 19:33:28 fedora-2.fritz.box kernel: RSP: 0018:ffffa6ad884979c0 EFLAGS: 00010206
Jun 03 19:33:28 fedora-2.fritz.box kernel: Code: 2b 14 25 28 00 00 00 0f 85 74 01 00 00 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07 48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41 f6 c0
Jun 03 19:33:28 fedora-2.fritz.box kernel: RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
Jun 03 19:33:28 fedora-2.fritz.box kernel: ---[ end trace 0000000000000000 ]---
Jun 03 19:33:28 fedora-2.fritz.box kernel:  snd_rn_pci_acp3x btbcm videobuf2_common snd_seq btintel snd_acp_config libarc4 btmtk irqbypass videodev snd_seq_device think_lmi ses snd_soc_acpi rapl mc pcspkr enclosure k10temp i2c_piix4 firmware_attributes_class wmi_bmof snd_pcm snd_pci_acp3x cfg80211 bluetooth thinkpad_acpi scsi_transport_sas snd_timer ledtrig_audio platform_profile snd rfkill mhi soundcore acpi_tad joydev amd_pmc xfs loop zram dm_crypt nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 adiantum amdgpu i2c_algo_bit drm_ttm_helper ttm iommu_v2 drm_buddy gpu_sched drm_display_helper ccp cec nvme crct10dif_pclmul crc32_pclmul nvme_core crc32c_intel polyval_clmulni polyval_generic video hid_multitouch ucsi_acpi ghash_clmulni_intel sha512_ssse3 typec_ucsi sp5100_tco r8169 typec nvme_common i2c_hid_acpi wmi i2c_hid serio_raw uas usb_storage ip6_tables ip_tables fuse
Jun 03 19:33:28 fedora-2.fritz.box kernel: Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer pktcdvd nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc bnep qrtr_mhi binfmt_misc vfat fat snd_acp6x_pdm_dma snd_soc_dmic snd_soc_acp6x_mach snd_sof_amd_rembrandt snd_sof_amd_renoir snd_ctl_led snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek snd_sof_xtensa_dsp snd_sof snd_hda_codec_generic qrtr snd_hda_codec_hdmi snd_sof_utils ath11k_pci snd_soc_core snd_hda_intel ath11k snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_msr intel_rapl_common snd_usb_audio snd_compress snd_hda_codec edac_mce_amd ac97_bus uvcvideo snd_pcm_dmaengine qmi_helpers snd_pci_ps kvm_amd uvc snd_hda_core snd_usbmidi_lib videobuf2_vmalloc btusb mac80211 snd_rpl_pci_acp6x videobuf2_memops snd_rawmidi snd_hwdep btrtl kvm snd_pci_acp6x videobuf2_v4l2 snd_pci_acp5x
Jun 03 19:33:28 fedora-2.fritz.box kernel:  </TASK>
Jun 03 19:33:28 fedora-2.fritz.box kernel: R13: 00000000c0186444 R14: 000000000000002b R15: 00007feaf634fc78
Jun 03 19:33:28 fedora-2.fritz.box kernel: R10: 00007feaf2173600 R11: 0000000000000246 R12: 00007feaf634fb80
Jun 03 19:33:28 fedora-2.fritz.box kernel: RBP: 00007feaf634fb10 R08: 00007feaf634fcd0 R09: 00007feaf634fb60
Jun 03 19:33:28 fedora-2.fritz.box kernel: RDX: 00007feaf634fb80 RSI: 00000000c0186444 RDI: 000000000000002b
Jun 03 19:33:28 fedora-2.fritz.box kernel: RAX: ffffffffffffffda RBX: 00007feaf634fc78 RCX: 00007feb17128edd
Jun 03 19:33:28 fedora-2.fritz.box kernel: RSP: 002b:00007feaf634fac0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 03 19:33:28 fedora-2.fritz.box kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
Jun 03 19:33:28 fedora-2.fritz.box kernel: RIP: 0033:0x7feb17128edd
Jun 03 19:33:28 fedora-2.fritz.box kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? do_syscall_64+0x6c/0x90
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? do_syscall_64+0x6c/0x90
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Jun 03 19:33:28 fedora-2.fritz.box kernel:  do_syscall_64+0x60/0x90
Jun 03 19:33:28 fedora-2.fritz.box kernel:  __x64_sys_ioctl+0x94/0xd0
Jun 03 19:33:28 fedora-2.fritz.box kernel:  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
Jun 03 19:33:28 fedora-2.fritz.box kernel:  drm_ioctl+0x26d/0x4b0
Jun 03 19:33:28 fedora-2.fritz.box kernel:  drm_ioctl_kernel+0xcd/0x170
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
Jun 03 19:33:28 fedora-2.fritz.box kernel:  amdgpu_cs_ioctl+0x4c1/0x20f0 [amdgpu]
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? __check_object_size+0x233/0x2b0
Jun 03 19:33:28 fedora-2.fritz.box kernel:  amdgpu_bo_list_create+0x65/0x3d0 [amdgpu]
Jun 03 19:33:28 fedora-2.fritz.box kernel:  __kmalloc_node+0x50/0x150
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? amdgpu_bo_list_create+0x65/0x3d0 [amdgpu]
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? amdgpu_bo_list_create+0x65/0x3d0 [amdgpu]
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? __kmem_cache_alloc_node+0x1ba/0x320
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? asm_exc_general_protection+0x26/0x30
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? exc_general_protection+0x1be/0x420
Jun 03 19:33:28 fedora-2.fritz.box kernel:  ? die_addr+0x36/0x90
Jun 03 19:33:28 fedora-2.fritz.box kernel:  <TASK>
Jun 03 19:33:28 fedora-2.fritz.box kernel: Call Trace:
Jun 03 19:33:28 fedora-2.fritz.box kernel: PKRU: 55555554
Jun 03 19:33:28 fedora-2.fritz.box kernel: CR2: 00007f2daf014000 CR3: 000000015e5be000 CR4: 0000000000750ef0
Jun 03 19:33:28 fedora-2.fritz.box kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 03 19:33:28 fedora-2.fritz.box kernel: FS:  00007feaf63506c0(0000) GS:ffff94845ee00000(0000) knlGS:0000000000000000
Jun 03 19:33:28 fedora-2.fritz.box kernel: R13: 00000000ffffffff R14: 00000000000003b8 R15: ffff947d40042b00
Jun 03 19:33:28 fedora-2.fritz.box kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Jun 03 19:33:28 fedora-2.fritz.box kernel: RBP: 0000000000000cc0 R08: 0000000000039160 R09: ffff947d55194440
Jun 03 19:33:28 fedora-2.fritz.box kernel: RDX: 00000000053f0000 RSI: 0000000000000cc0 RDI: 748ae2c3be2938b7
Jun 03 19:33:28 fedora-2.fritz.box kernel: RAX: 748ae2c3be293ab7 RBX: 0000000000000cc0 RCX: 00000000000003b8
Jun 03 19:33:28 fedora-2.fritz.box kernel: RSP: 0018:ffffa6ad884979c0 EFLAGS: 00010206
Jun 03 19:33:28 fedora-2.fritz.box kernel: Code: 2b 14 25 28 00 00 00 0f 85 74 01 00 00 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07 48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41 f6 c0
Jun 03 19:33:28 fedora-2.fritz.box kernel: RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
Jun 03 19:33:28 fedora-2.fritz.box kernel: Hardware name: LENOVO 21CHCTO1WW/21CHCTO1WW, BIOS R23ET60W (1.30 ) 09/14/2022
Jun 03 19:33:28 fedora-2.fritz.box kernel: CPU: 0 PID: 4689 Comm: firefox:cs0 Not tainted 6.3.5-200.fc38.x86_64 #1
Jun 03 19:33:28 fedora-2.fritz.box kernel: general protection fault, probably for non-canonical address 0x748ae2c3be293ab7: 0000 [#1] PREEMPT SMP NOPTI
```
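
The "non-canonical address" in the last line of that excerpt can be checked by hand. A small Python sketch, assuming 48-bit virtual addresses (4-level paging; the CR4 value in the trace has bit 12, LA57, cleared), where an address is canonical only if bits 63..47 are all equal. `is_canonical` is just a name chosen for illustration:

```python
def is_canonical(addr, va_bits=48):
    """Return True if addr is a canonical x86-64 virtual address.

    Assumes va_bits of implemented virtual address space (48 with
    4-level paging): bits 63..(va_bits - 1) must all be identical.
    """
    upper = addr >> (va_bits - 1)              # bits 63..47 for va_bits=48
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits for va_bits=48
    return upper == 0 or upper == all_ones

# RAX from the fault, then R15 (kernel pointer) and a user-space RSP-like value
for value in (0x748ae2c3be293ab7, 0xffff947d40042b00, 0x00007feaf634fc78):
    print(hex(value), is_canonical(value))
```

The RAX value fails the check while the neighbouring register values pass, which fits the kernel's own message; it says nothing yet about where the bad pointer came from.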

After that log, I did the Lenovo diagnostics noted above.

*¹ https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/Post28_2023_06_03

Comment 29 Yi Hao 2023-06-04 00:00:18 UTC
Hi, I might have been biased towards an amdgpu problem when I first came here.
Taking your input into account, I revisited the logs, focusing on your TTY output dump: general protection fault.

Here is the list of processes that have caused a GP in the kernel, on both v6.2 and v6.3:
polkitd
(sd-worker)
check-konsole
rm
rsync
Renderer
kworker
firefox
podman fedora-toolbox-37
kauditd
livesys
plymouthd

What are the odds of cron calling your python script check-konsole causing GP randomly?

How about:
rsnapshot[9944]: /bin/rsnapshot -c /etc/rsnapshot.blz.conf hourly: ERROR: Error! rm_rf("/blz/tmpbu/hourly.23/")
causing a GP just by deleting files?

polkitd or even plymouth?

I initially suspected that one core of your CPU might be broken, but it seems the faults happened on CPUs 0, 2, 3, 8, 9 and 15.
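
A tally like the one above can be reproduced from the journal with a few lines of Python. This is only a sketch: it assumes each fault prints the usual `CPU: N PID: M Comm: name` header followed by a taint marker, and `summarize_faults` is a hypothetical helper name. The two sample lines are headers quoted in this bug report:

```python
import re
from collections import Counter

# Matches the oops header; Comm may contain ':' or '/', so stop at the taint marker.
FAULT_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (.+?) (?:Not tainted|Tainted)")

def summarize_faults(lines):
    """Count oops headers per CPU number and per process name (Comm)."""
    cpus, comms = Counter(), Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            cpus[int(m.group(1))] += 1
            comms[m.group(3)] += 1
    return cpus, comms

sample = [
    "Jun 03 19:33:28 fedora-2.fritz.box kernel: CPU: 0 PID: 4689 Comm: firefox:cs0 Not tainted 6.3.5-200.fc38.x86_64 #1",
    "Jun 04 02:28:56 fedora kernel: CPU: 2 PID: 21584 Comm: kworker/u32:2 Not tainted 6.3.5-200.fc38.x86_64 #1",
]
cpus, comms = summarize_faults(sample)
print(sorted(cpus), sorted(comms))
```

If the faults cluster on one CPU number, a single bad core becomes more plausible; an even spread across cores and unrelated processes points elsewhere.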

If Lenovo Diagnostics didn't give any error, it might be because it is running without power management.
That brings back my previous question: what is your SMU version?

The SMU does power management and some other stuff. Also, there is a newer Embedded Controller firmware that has updated thermal control. You get both by updating your BIOS.

The other question is: did you have this problem when your laptop was new, or only recently?

I am so sorry to be here offering you a supposed workaround with my bias towards amdgpu. It must have been frustrating that nothing works out.

Comment 30 Christopher Klooz 2023-06-04 12:55:26 UTC
No need for sorry! Thank you very much for taking care! My knowledge in the hardware <-> kernel fields is obviously much below yours. I am very happy that you keep helping me with the issue. And we know already more than in the beginning. I still hope it is not a hardware thing.

But before providing the information you asked for (which I did below): it seems that I found a way to strongly mitigate the issue (but it seems to have been temporary). After my last post the first thing I tried was to change the screen refresh rate of the attached monitor from 120 Hz to 60 Hz. While I had over 30 freezes yesterday, roughly every 5 - 30 minutes (I think mostly only 5-20 minutes), this has changed immediately once I changed the refresh rate: after that, I had no issues for more than 6 hours.

Then, having worked over 6 hours without issues, it froze once again:
```
Jun 04 02:28:56 fedora kernel: Workqueue: btrfs-worker btrfs_work_helper
Jun 04 02:28:56 fedora kernel: Hardware name: LENOVO 21CHCTO1WW/21CHCTO1WW, BIOS R23ET60W (1.30 ) 09/14/2022
Jun 04 02:28:56 fedora kernel: CPU: 2 PID: 21584 Comm: kworker/u32:2 Not tainted 6.3.5-200.fc38.x86_64 #1
Jun 04 02:28:56 fedora kernel: general protection fault, probably for non-canonical address 0x701ad840e2042a63: 0000 [#1] PREEMPT SMP NOPTI
Jun 04 02:26:27 fedora wpa_supplicant[2667]: wlp2s0: CTRL-EVENT-REGDOM-CHANGE init=DRIVER type=COUNTRY alpha2=DE
Jun 04 02:26:24 fedora wpa_supplicant[2667]: wlp2s0: CTRL-EVENT-REGDOM-CHANGE init=DRIVER type=COUNTRY alpha2=US
Jun 04 02:26:21 fedora kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting
Jun 04 02:26:21 fedora kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting
Jun 04 02:26:17 fedora NetworkManager[2457]: <info>  [1685838377.9790] device (wlp2s0): supplicant interface state: interface_disabled -> inactive
Jun 04 02:26:17 fedora NetworkManager[2457]: <info>  [1685838377.9779] device (wlp2s0): supplicant interface state: inactive -> interface_disabled
Jun 04 02:26:17 fedora NetworkManager[2457]: <info>  [1685838377.8747] device (wlp2s0): set-hw-addr: set MAC address to 8A:49:5E:09:72:69 (scanning)
```

However, this is already a big difference just after changing the refresh rate of the attached screen.

I changed the refresh rate within KDE, which makes me assume that it did not affect TTY terminals (this is supported by the fact that when on 120 Hz in KDE, it takes long for the screen to adjust and show the TTY terminal when doing CTRL+ALT+F* from KDE and vice versa). But even in the cases where I was not logged in KDE, I assume that SDDM (which was always enabled+running; even if not on screen at the moment of occurrences) also took the 120 Hz of KDE (this assumption is supported by the fact that there was never a switchover time between SDDM and KDE; at the same time, SDDM had the same switchover time to TTY terminals like KDE when 120 Hz was set).

It seems not to get rid of the problem completely, but it strongly mitigates it (at least in this constellation of pstate etc.), which also explains why I have seldom had issues when on battery without external screen (the internal one is always on 60 Hz).

Yesterday, I ran 60 Hz with pstate=passive and PSR disabled. Now I am testing 60 Hz with pstate=active. It worked much longer than during yesterday's massive series of occurrences, but shorter than the initial pstate=passive+60Hz try: the issue reoccurred after about an hour, when I was working in a terminal for the first time since I started the 60 Hz test (maybe this is a trigger, too?) [1]. Despite the beeping and a few journal entries (only single entries that are not related to the issue) in the root terminal, I could still work and switch between KDE and TTY [1]. I could even try to log in to new TTYs [1]. However, once I did the latter, the massive error output came again and then the system was frozen [1]. Soon after, at the next boot, a freeze reoccurred, again at a point where I had already worked in a root terminal earlier during that boot. Now I am back to pstate=passive and I will not log into a terminal as root, to see if it gets as stable again as yesterday evening. If this does not work, I will disable all my scripts and backup jobs. *¹ *²

Concerning the log extract above: I noticed that the log extract above indeed seems to be not related to amdgpu. But given the frequency and development of occurrences, all issues seem to have the same origin, while graphical stuff more often triggers the occurrence in some circumstances.

Given the btrfs entries of above, I did "btrfs scrub start -Bfr" at all three btrfs (/, /home, /var). In all three cases: "Error summary:    no errors found"
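
For repeating this check after future freezes, the three scrubs could be wrapped in a short Python script. This is only a sketch: the helper names are made up, and it assumes the summary line keeps the "Error summary:" wording quoted above. It needs to run as root:

```python
import subprocess

# The three btrfs mount points mentioned above.
MOUNTPOINTS = ["/", "/home", "/var"]

def scrub_command(mountpoint):
    """Build the same scrub invocation as used above for one mount point."""
    return ["btrfs", "scrub", "start", "-Bfr", mountpoint]

def scrub_is_clean(output):
    """True if the scrub output contains an error summary reporting no errors."""
    return any(line.strip().startswith("Error summary:") and "no errors found" in line
               for line in output.splitlines())

def scrub_all(mountpoints=MOUNTPOINTS):
    """Run the scrubs one after another; returns {mountpoint: clean?}."""
    results = {}
    for mp in mountpoints:
        proc = subprocess.run(scrub_command(mp), capture_output=True, text=True)
        results[mp] = proc.returncode == 0 and scrub_is_clean(proc.stdout)
    return results
```

Calling `scrub_all()` returns one boolean per mount point, so a single `False` is easy to spot.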

Maybe relevant: I am not 100% sure but I think the first error that was flooded in the terminal where I was working was "ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting", but the last time this error was logged was before all of the kernel errors -> this happened soon after the rsnapshot jobs [1]!

*¹ SUPPLEMENT/UPDATE TO THE ABOVE: I am back to the freezes every few minutes, even with 60 Hz and pstate=passive. I have now disabled all cronjobs; let's see.
*² SUPPLEMENT/UPDATE No 2: Disabling cronjobs doesn't make a difference. I keep playing with other stuff, detaching, etc. I still wonder why 60 Hz gave me over 6 hours without issues.

-------------------

Concerning check-konsole:

This python script (which only imports and uses the "getoutput" function from "subprocess", complemented by "if" clauses; it calls commands and gets its inputs through bash; quite primitive and ugly) checks regularly if a symlink has been deleted, changed or replaced. If so, it creates a warning file in another folder. Then it ensures the warning file to be owned by root (own+grp) and removes any writing permissions except owner.

It might be noted that the folder in which the warning file is created is a tmpfs within the normal user's home directory, which is mounted with "tmpfs /home/<username>/<dir> tmpfs rw,seclabel,nosuid,nodev,noexec,relatime,size=3145728k,inode64 0 0".

-------------------

Concerning /bin/rsnapshot -c /etc/rsnapshot.blz.conf:

This is a widely default-configured rsnapshot. It might be noted that /blz is a USB storage device attached to the notebook.

If the issue can be triggered somehow by USB, this can also explain the freezes I had on 31st May despite being on battery, because back then, I had my phone attached by USB. Mostly when on battery, I have nothing attached by USB.

Concerning rsync, this is used by rsnapshot but also by some of my own scripts to backup some data independently from rsnapshot, both copy to USB-attached storage devices.

The other services are default from Fedora KDE Spin I think.

-------------------

Concerning BIOS, CPU, etc.:

I tried to update with "fwupdmgr install r23ul65w.cab" in root terminal but it output that no supported hardware was found. I received a page from Lenovo when I bought the device that forwards me to the very product page of my notebook with firmware and so on. So I do not think that I have the wrong one. Also, the related README contains ThinkPad "T16 Gen 1  (Machine Types:21CH,21CJ)". 21CH should be mine.

I will try the ISO but I need to buy a DVD to burn it to, so I cannot do this before Tuesday (hopefully). Unfortunately the ISO doesn't work on USB Sticks.

I did another CPU Quick Test with all algorithms and another Extended Stress Test in Lenovo Diagnostics of the CPU. All passed.

I have Embedded Controller Version 1.23, BIOS 1.30.

I assume with SMU version you mean SMC? -> "SMC feature version: 0, program: 4, firmware version: 0x04453700 (69.55.0)" (sorry, I have overlooked that question earlier)
So I have the same SMC as you. If you mean something else with SMU, please let me know how to identify it.

Concerning the beginning when I bought the notebook: I had an issue in the beginning but it was not the same. It was an immediate full freeze that happened only once every few days, but back then it did not make a difference if I had anything attached or not. Also, I had a flickering that was provoked when the screen was on specific pages at specific positions and that could be mitigated and provoked that way, e.g., by scrolling away and back to the very position (so I thought it was clearly a software/driver issue), or by marking something and unmarking it again once it occurred. However, after switching to amd_pstate=passive, both issues were gone. I think this worked out for some months. Then the new issue arose: the described occurrence, which mostly occurs only if my USB stuff and external screen are attached, except the one series of occurrences within a few minutes that I had on the 31st. So I thought it is not the same thing.

I hope the firmware upgrade works once I have a DVD to burn it to, and hopefully it helps.

-------------------

(All logs of above and below are with PSR disabled)

[1] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/massiveErrorsAndSingleTerminalErrorOutputsButStillWorkingUntilNewTtyLogin.kernel.6.3.5.pstate-active.EXTRACT.log

Comment 31 Yi Hao 2023-06-04 14:33:14 UTC
Can you try updating your BIOS using LVFS?
fwupdmgr refresh --force
fwupdmgr update

Can you list all your userland crashes with (as root):
coredumpctl

If there are any entries in there, can you also run:
coredumpctl info <pid> > <file>

I want to make sure you are only having kernel issues and not userland issues. It is weird to have a hardware problem that only kills the kernel.

I don't think 120Hz has anything to do with your stability issue; at least none of the logs support this hypothesis. SDDM and KDE use the amdgpu driver, while the TTY uses the simplefb driver. That might explain the difference when switching.

Assuming no hardware failure, the only explanation for kernel corruption is a bad kernel component, which means a driver. This might explain why you have a stable machine with nothing connected.

I have noticed that almost all the GPs happen in the SLUB allocator.
Can you add this to your kernel parameters:
slub_debug=F

It is documented here: https://www.kernel.org/doc/Documentation/vm/slub.txt
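
To confirm after a reboot that the parameter actually reached the kernel, the boot command line can be inspected. The following is only a minimal Python sketch and not part of the original instructions: the function name `slub_debug_flags` is hypothetical, and the flag descriptions are paraphrased from the slub.txt document linked above.

```python
# Sketch: extract the slub_debug flags from a kernel command line string.
# Flag meanings are paraphrased from Documentation/vm/slub.txt.
SLUB_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking (alloc/free owner)",
    "T": "trace",
}


def slub_debug_flags(cmdline):
    """Return the flag letters passed to slub_debug, or None if it is absent.

    A bare "slub_debug" (without "=") yields an empty string.
    """
    for token in cmdline.split():
        if token == "slub_debug":
            return ""
        if token.startswith("slub_debug="):
            # The value may carry a ",<slab names>" suffix; keep only the flags.
            return token.split("=", 1)[1].split(",", 1)[0]
    return None


def describe_flags(flags):
    """Map each flag letter to a short description (unknown letters are kept)."""
    return [(letter, SLUB_FLAGS.get(letter, "see slub.txt")) for letter in flags]
```

On a running system, the input string would be the contents of /proc/cmdline.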

Boot your machine without anything connected; it should not generate any errors in the log.
Then attach more devices to your laptop and monitor your log.

My first suspect is your Sound Blaster.

Comment 32 Christopher Klooz 2023-06-04 15:06:26 UTC
`coredumpctl` as root -> No coredumps found.

Additionally, it might be noted that any journalctl at ==boot-6 and earlier (-7 -8, -30, -35, etc.) are not found ("No data available"), but the logs are still existent if I scroll journalctl -r manually. This happens for the first time. Whatever, nothing critical for me.

-----

I will try now slub_debug=F and let you know. After that, I try the fwupdmgr refresh --force.

Thanks again!

Comment 33 Christopher Klooz 2023-06-04 15:57:50 UTC
I booted 6.3.5 with slub_debug=F.

The journalctl is available at [1]; major events/time:

Device        |  last entry before attached  |  time of attachment/entries
USB Hub 1     |  17:11:21                    |  17:11:43-44
5 TB HDD      |  17:12:21                    |  17:12:44-52 *¹
5 TB HDD      |  17:13:21                    |  17:14:10-17 *¹
Keyboard      |  17:14:21                    |  17:14:57
DVD/BD        |  17:16:21                    |  17:16:39
Sandisk Stick |  17:17:21                    |  17:17:37
Mouse         |  17:17:51                    |  17:18:00
Sound         |  17:18:21                    |  17:18:45-46
USB Hub 2     |  17:19:21                    |  17:19:31-33 (including one usb-usb-cable in one port that is not connected to anything)
Samsung Stick |  17:19:51                    |  17:20:08
Sandisk SSD   |  17:20:21                    |  17:20:42 *²
Ext. Screen   |  17:21:51                    |  No entry
Power (USB)   |  17:22:21                    |  No entry
Ethernet RJ45 |  17:22:54                    |  17:23:27


Maybe I have overlooked something in the logs [1], but myself, I can find only some errors related to three storage devices:

*¹ the 5 TB HDD both contain:
```
Jun 04 17:12:52 fedora kernel: scsi 0:0:0:1: Failed to bind enclosure -19
Jun 04 17:12:52 fedora kernel: scsi 0:0:0:1: Failed to get diagnostic page 0x1
Jun 04 17:12:52 fedora kernel: scsi 0:0:0:1: Wrong diagnostic page; asked for 1 got 8
```
```
Jun 04 17:14:10 fedora kernel: ses 1:0:0:1: Failed to bind enclosure -19
Jun 04 17:14:10 fedora kernel: ses 1:0:0:1: Failed to get diagnostic page 0x1
Jun 04 17:14:10 fedora kernel: ses 1:0:0:1: Wrong diagnostic page; asked for 1 got 8
```

*² the Sandisk SSD (2TB) contains:
```
Jun 04 17:20:42 fedora kernel: ses 5:0:0:1: Failed to bind enclosure -19
Jun 04 17:20:42 fedora kernel: ses 5:0:0:1: Failed to get diagnostic page 0x1
```

[1] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/PSRdisabled/slub_debug.log

It might be noted that all these devices are attached when I am at home with the screen attached (and the three devices with errors are backup targets for the rsnapshot jobs). But the one series with 3-4 crashes I had on 31st May (with 6.2.15) did not contain any of these USB devices but only my phone which was attached by USB (which it normally isn't at home, and seldomly even when I am not at home) plus a bluetooth mouse (not the one I use at home). However, if we ignore the one-time series of 31st May, all crashes I explicitly remember or from which I have explicit logs contained these devices. In either case, what all occurrences have in common is that something is attached by USB.

Comment 34 Christopher Klooz 2023-06-04 16:07:37 UTC
When trying to update the BIOS using LVFS, I first get the message that 1 supported device was found, but when I run the update, it only has an update for "UEFI dbx" and states explicitly that it has no updates for: UEFI Device Firmware, UEFI Device Firmware, ELAN... , Integrated camera..., SKHynix ...

Comment 35 Yi Hao 2023-06-04 17:39:49 UTC
I looked at your 31 May crash again. It is different.
It seems to be in the IOMMU and it is not a GP.
It might very well be a Qualcomm bug.
I believe recent kernels contain a lot of Qualcomm fixes for ath11k.

For BIOS update, it definitely is:
Lenovo ThinkPad T16 Gen 1, model 21CHCTO1WW

You can use a tool called "geteltorito" to strip the El Torito header and then dd the image to a USB drive.
You must also disable Secure Boot to boot the firmware updater.

Game plan for slub_debug:
Maybe I should explain more about how I intend to do it.
My suspicion of a bad driver is just that: a suspicion.

Looking at all the logs, I am hoping that slub_debug can positively identify that something is indeed corrupting kernel memory. I think it will catch it. I surely hope so. Fingers crossed.

The catch is that you must wait even longer, as you can use the laptop for hours without anything bad happening.
When something happens and we can positively see slub_debug catch the error, we will then move into an elimination phase to identify which hardware causes the kernel corruption.

Once we identify the hardware, we will then see which kernel module it uses so that you can file a separate bug report to the author / maintainer.

I hope you understand that this won't magically tell you right away what is wrong.

For those errors on your drives, I think they might just be a firmware bug in the device, or maybe just kernel info messages, or maybe quirks.

For now, only you can use your instinct to tell which hardware causes the issue, based on the way you use it. The log will confirm it for you, assuming slub_debug catches it.

Comment 36 Yi Hao 2023-06-04 20:16:06 UTC
Hi,
I hope to answer some previously unanswered questions.

1. KDE handling of GPU reset
Referring to https://bugs.kde.org/show_bug.cgi?id=459872
They currently don't handle it. The referenced bug was also closed, on the reasoning that a driver problem should be solved by the driver itself instead of by KDE.

2. Your USB drive error log
Referring to the codes: https://elixir.bootlin.com/linux/v6.3.5/source/drivers/scsi/ses.c

Your drive uses the uas driver, which in turn uses the scsi driver, which allows things like the TRIM command to be sent to the actual drive.
Since it is not an enclosure but just one drive, both messages got printed during init.

Can you get me the following please:
# lsusb -tv
# lspci -k

I have been looking at your hardware and have no clue which one could be the culprit, since the devices are common and the drivers are widely used.

I was thinking your Sound Blaster might use its own driver, but it turns out it just uses the common USB Audio driver.

Comment 37 Christopher Klooz 2023-06-04 20:48:28 UTC
Hi,

I think I misunderstood your initial slub_debug argument. Feel free to ignore my recent post about it. 

I will keep the slub_debug=F parameter in my grub options and keep everything attached to my system to provoke whatever it is. I also enabled the cron jobs again. I will let you know once the next occurrence takes place and provide logs (I'm already curious what slub_debug will provide). At the moment, the system seems stable, but it's just a matter of time.

Concerning your questions in general: I explicitly only use the vanilla kernel from Fedora and use only its default repositories; I don't add third party stuff (drivers, repos, etc.). Beyond that I have only a toolbox container, but that also only has rpmfusion for vlc and related packages. The latter cannot explain what takes place outside the user account anyway because of toolbox.

Concerning the outputs of `lsusb -tv` & `lspci -k` (with all devices attached): https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/lsx

Let me know if you need more. Otherwise, I will report once I have some slub_debug logs.

Comment 38 Christopher Klooz 2023-06-05 16:07:57 UTC
The system has worked properly despite trying to provoke the occurrences longer than expected.

But now I have a log from journalctl with slub_debug=F: https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/slub_debug-F/immediateFreezeWithinKde.log

Generally, it was an immediate freeze within the KDE GUI. However, after reading the logs, I think there was another "symptom" of the issue a little before the system finally froze: thunderbird reported an error that it could no longer access some data on storage. However, this is obviously not linked to the origin.

Fingers crossed that the log is more indicative than earlier logs. I will keep slub_debug=F enabled for now.

Comment 39 Yi Hao 2023-06-05 16:50:20 UTC
Your selinux blocked thunderbird from accessing /dev/urandom.
Can you try allowing "sysadm_u" access to "urandom_device_t" and see if it fixes your thunderbird stability issue?
Maybe allow access to "random_device_t" as well.

Try and cat /dev/urandom and /dev/random and see if it works after you added the rules.

I strongly believe many of your userland issues are selinux related, but I did not read all of the selinux related logs to confirm or refute my personal belief. Based on your previous coredumpctl output, the applications didn't crash.
Currently I am only focusing on kernel oops.

In my humble opinion, it is better to triage each symptom / bug / behavior separately so they can be targeted individually. Your logs basically have all of them, but based on my limited knowledge, they are not related to each other.

I can, however, confirm that your laptop's failure to reboot is due to a GP. For some reason the instruction pointer is pointing to an invalid address. This single occurrence is still a mystery.
The GP happens while the kernel is Not Tainted. Once a kernel oops happens, GPs become frequent.

I read your log and found no kernel-related errors. Can you confirm with:
cat /proc/sys/kernel/tainted

0 means Not Tainted, meaning there has not even been a kernel oops.
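
Any non-zero value is a bitmask in which each bit stands for one taint flag letter, as shown in "Tainted:" log lines. As an illustration only, here is a minimal Python sketch that decodes such a value; the function name `decode_taint` is hypothetical, and the bit table is an assumption based on the kernel's "Tainted kernels" admin guide, so it should be checked against the documentation of the running kernel version.

```python
# Sketch: decode the integer from /proc/sys/kernel/tainted into flag letters.
# Index in this list == bit position in the taint value.
TAINT_BITS = [
    ("P", "proprietary module loaded"),
    ("F", "module force-loaded"),
    ("S", "kernel running on an out-of-spec system"),
    ("R", "module force-unloaded"),
    ("M", "machine check exception occurred"),
    ("B", "bad page referenced or unexpected page flags"),
    ("U", "taint requested by userspace"),
    ("D", "kernel died recently (OOPS or BUG)"),
    ("A", "ACPI table overridden by user"),
    ("W", "kernel issued a warning"),
    ("C", "staging driver loaded"),
    ("I", "workaround for platform firmware bug applied"),
    ("O", "out-of-tree module loaded"),
    ("E", "unsigned module loaded"),
    ("L", "soft lockup occurred"),
    ("K", "kernel live patched"),
    ("X", "auxiliary taint (distribution-defined)"),
    ("T", "built with the struct randomization plugin"),
]


def decode_taint(value):
    """Return (letter, description) for every bit set in the taint value."""
    return [
        (letter, description)
        for bit, (letter, description) in enumerate(TAINT_BITS)
        if value & (1 << bit)
    ]
```

For example, a value of 32 has only bit 5 set, which this table maps to the `B` flag.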

I have a favor to ask. When you start having GP, please capture for me:
$ lsmod

Did you modify any module parameters other than enabling amd_pstate?

Comment 40 Christopher Klooz 2023-06-05 18:34:36 UTC
Concerning thunderbird:
The SE denial of urandom is expected but not related (this denial was after the error; the denial was caused by re-starting thunderbird). The data that could not be accessed at the very moment was the calendar, not urandom. To explain the logs of that moment: I had the error that the file related to the thunderbird calendar could not be opened, which caused me to close and re-start thunderbird: this re-starting caused the SE denial of urandom. So first there was the error, then I re-started thunderbird, then the SE denial happened as expected consequence of starting the thunderbird process.

However, with regards to the logs: there is a "Permission denied" without SELinux relations that is documented in the log entries before I restarted thunderbird at 16:29:46 and 16:30:13. This was not SELinux. 

Also, the behavior of SELinux is consistent. Thunderbird never had that issue before, and after re-opening thunderbird, it worked again (including access to calendar). SELinux either blocks something permanently or not at all. 

However, this is not a major issue. I just noted the correlation since that is not normal and it occurred soon before the freeze. As far as it concerns me, you can ignore the thunderbird stuff. Sorry for creating confusion here.

Generally:
Also, as noted before, the occurrences do not change if I disable SELinux confined users, which I tested already earlier (I initially also assumed this could be related). But that does not change anything. Also, no active user account is SELinux-confined if I log in only as root in the terminal without logging in KDE in parallel (there have been several freezes/occurrences in that condition, including some of the above logs where no KDE was logged in). Additionally, SELinux should not create GP or wrong pointer addresses, independent of its configuration. That would be indeed a major bug.

However, to produce more comprehensible logs for you, I will disable the confined user account for now and wait for the next freeze. Then, you will get logs without SELinux confined users. I assume that is helpful for you?

Also, do not forget that also in the past some freezes did not create any logs at all. Some just froze without logging anything, some contained massive kernel errors. Let's hope the next freezes are "better" in producing log entries. The next journal log will be without SELinux confinement of course!

I will try to do `lsmod` before freezing next time. When the occurrence does not cause immediate freeze, this should be no problem.

Concerning module parameters: No, the only modifications of Fedora KDE Spin that impact module parameters can be seen in the grub options/boot parameters. At the moment, this is: options root=UUID=... ro rootflags=subvol=root rd.luks.uuid=... rhgb quiet amd_pstate=passive amdgpu.dcdebugmask=0x10 slub_debug=F. Nothing else.

---

A separate question: Since I added slub_debug=F, my system has become quite stable: even the copy of the large amount of data, which I had given up on after trying over 10 times, now worked properly - I did it several times. Can this stability somehow be linked to slub_debug=F? It is finally a distinct correlation, with a big difference before and after adding slub_debug=F. (To avoid confusion: this change in behavior is from before disabling the SELinux confined users; so the only change is slub_debug=F). I have no experience with such parameters or their potential impact.

Comment 41 Christopher Klooz 2023-06-05 18:59:18 UTC
I HAVE SLUB_DEBUG HITS! Normally, I have `cat /proc/sys/kernel/tainted` = 0. But after writing the last post, I checked it and then `cat /proc/sys/kernel/tainted` = 32. After I saw that, I immediately stored `lsmod` from the corrupted system (see [1]) and stored the logs (see [2]), which contained massive kernel errors and also hits of slub_debug! Here** are three extracts from [2]:

```
Jun 05 16:52:14 fedora kernel: SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
Jun 05 16:52:14 fedora kernel: **********************************************************
Jun 05 16:52:14 fedora kernel: **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
Jun 05 16:52:14 fedora kernel: **                                                      **
Jun 05 16:52:14 fedora kernel: ** administrator!                                       **
Jun 05 16:52:14 fedora kernel: ** the kernel, report this immediately to your system   **
Jun 05 16:52:14 fedora kernel: ** If you see this message and you are not debugging    **
Jun 05 16:52:14 fedora kernel: **                                                      **
Jun 05 16:52:14 fedora kernel: ** might reduce the security of your system.            **
Jun 05 16:52:14 fedora kernel: ** via the console, logs, and other interfaces. This    **
Jun 05 16:52:14 fedora kernel: ** This system shows unhashed kernel memory addresses   **
Jun 05 16:52:14 fedora kernel: **                                                      **
Jun 05 16:52:14 fedora kernel: **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
Jun 05 16:52:14 fedora kernel: **********************************************************

```
...
...
...
```
Jun 05 18:56:20 fedora.fritz.box kernel: CPU: 1 PID: 13592 Comm: kworker/u32:6 Tainted: G    B              6.3.5-200.fc38.x86_64 #1
Jun 05 18:56:20 fedora.fritz.box kernel: Slab 0xffffeffd4d324000 objects=32 used=10 fp=0xffff8fd10c901400 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
Jun 05 18:56:20 fedora.fritz.box kernel: -----------------------------------------------------------------------------
Jun 05 18:56:20 fedora.fritz.box kernel: BUG kmalloc-1k (Tainted: G    B             ): Wrong object count. Counter is 10 but counted were 28
Jun 05 18:56:20 fedora.fritz.box kernel: =============================================================================
Jun 05 18:56:20 fedora.fritz.box kernel: Disabling lock debugging due to kernel taint

```
...
...
...
```
Jun 05 18:56:20 fedora.fritz.box kernel: Object 0xffff8fd10c902000 @offset=8192 fp=0xc5d6e3752d901092
Jun 05 18:56:20 fedora.fritz.box kernel: Slab 0xffffeffd4d324000 objects=32 used=10 fp=0xffff8fd10c901400 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
Jun 05 18:56:20 fedora.fritz.box kernel: -----------------------------------------------------------------------------
Jun 05 18:56:20 fedora.fritz.box kernel: BUG kmalloc-1k (Not tainted): Freechain corrupt
Jun 05 18:56:20 fedora.fritz.box kernel: =============================================================================
Jun 05 18:56:17 fedora.fritz.box kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting
Jun 05 18:56:17 fedora.fritz.box kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting
```

[1] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/slub_debug-F/HIT/lsmod.log -> `lsmod` WHILE kernel-tainted=32 and DURING the corrupted boot
[2] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/slub_debug-F/HIT/slub_debug_HIT.log -> full logs with slub_debug and kernel errors

** This happened immediately after my last post, so I could not disable SELinux before that corruption, which (based upon the logs) took place long before I found out and long before I posted. But the next occurrence will be without SELinux confined users, which I disabled. However, I think the slub_debug hits help already.

Comment 42 Christopher Klooz 2023-06-05 19:07:06 UTC
Sorry for the first extract, I just saw that this notice is intended. The BUG reports are in the second and the third extracts (in the logs, there are many more kernel errors and related data above and below these two extracts) [2]:

```
...
...
Jun 05 18:56:20 fedora.fritz.box kernel: CPU: 1 PID: 13592 Comm: kworker/u32:6 Tainted: G    B              6.3.5-200.fc38.x86_64 #1
Jun 05 18:56:20 fedora.fritz.box kernel: Slab 0xffffeffd4d324000 objects=32 used=10 fp=0xffff8fd10c901400 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
Jun 05 18:56:20 fedora.fritz.box kernel: -----------------------------------------------------------------------------
Jun 05 18:56:20 fedora.fritz.box kernel: BUG kmalloc-1k (Tainted: G    B             ): Wrong object count. Counter is 10 but counted were 28
Jun 05 18:56:20 fedora.fritz.box kernel: =============================================================================
Jun 05 18:56:20 fedora.fritz.box kernel: Disabling lock debugging due to kernel taint

```
...
...
...
```
Jun 05 18:56:20 fedora.fritz.box kernel: Object 0xffff8fd10c902000 @offset=8192 fp=0xc5d6e3752d901092
Jun 05 18:56:20 fedora.fritz.box kernel: Slab 0xffffeffd4d324000 objects=32 used=10 fp=0xffff8fd10c901400 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
Jun 05 18:56:20 fedora.fritz.box kernel: -----------------------------------------------------------------------------
Jun 05 18:56:20 fedora.fritz.box kernel: BUG kmalloc-1k (Not tainted): Freechain corrupt
Jun 05 18:56:20 fedora.fritz.box kernel: =============================================================================
Jun 05 18:56:17 fedora.fritz.box kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting
Jun 05 18:56:17 fedora.fritz.box kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting
...
...
```


[2] https://gitlab.com/py0xc31/public-tmp-storage/-/blob/main/slub_debug-F/HIT/slub_debug_HIT.log -> full logs with slub_debug and kernel errors
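
Because the full log contains many unrelated entries, the SLUB reports can be filtered out mechanically. The following is a minimal Python sketch under the assumption that every report has the same `BUG <cache> (<taint>): <message>` shape as the extracts above; the names `SLUB_BUG`, `SAMPLE` and `slub_bug_reports` are hypothetical, and the sample lines are copied from the extracts.

```python
import re

# Matches journal lines such as:
#   "... kernel: BUG kmalloc-1k (Not tainted): Freechain corrupt"
SLUB_BUG = re.compile(
    r"kernel: BUG (?P<cache>\S+) \((?P<taint>[^)]*)\): (?P<message>.*)$"
)

# Sample lines taken from the extracts in this comment.
SAMPLE = [
    "Jun 05 18:56:20 fedora.fritz.box kernel: BUG kmalloc-1k (Tainted: G    B             ): Wrong object count. Counter is 10 but counted were 28",
    "Jun 05 18:56:20 fedora.fritz.box kernel: Disabling lock debugging due to kernel taint",
    "Jun 05 18:56:20 fedora.fritz.box kernel: BUG kmalloc-1k (Not tainted): Freechain corrupt",
]


def slub_bug_reports(lines):
    """Return (cache name, message) for every SLUB BUG line found."""
    reports = []
    for line in lines:
        match = SLUB_BUG.search(line)
        if match:
            reports.append((match.group("cache"), match.group("message").strip()))
    return reports
```

Applied to a saved journal file, this lists which slab cache each report refers to (here always kmalloc-1k) without scrolling through the surrounding entries.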

Comment 43 Yi Hao 2023-06-05 19:26:32 UTC
Well, I am surprised it is caused by Qualcomm ath11k.
It looks like it corrupts kmalloc-1k when trying to update the regulatory domain.
Maybe file an ath11k bug report with this SLUB output?

Let's continue using it and see if any more slub debug logs appear. Also see if you hit any stability issues. At this stage a GP hasn't happened yet, as this is the first corruption.

I did not enable red zoning with just slub_debug=F.
If something breaks and causes a GP, it very likely means a buffer overflow in the driver.
In that case, test again with:
slub_debug=FZ

I am not sure if the KDE freeze is caused by an amdgpu issue. Like I said, your logs have a lot of different errors, amdgpu being one.

For the permission issue, did you try the classic troubleshooting using "ls -lZ"?
Check the permissions (and possibly selinux context) along the path and compare it to the uid of the process.
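
As an illustration of that check, here is a minimal Python sketch that prints mode, owner and (where readable) the SELinux label for every directory along a path, roughly what running "ls -ldZ" on each component would show. The helper names `path_chain` and `describe` are hypothetical, and reading the label from the `security.selinux` extended attribute is an assumption that only holds on Linux systems with SELinux labels.

```python
import os
import stat


def path_chain(path):
    """Return every ancestor of an absolute path, from the root down to the path."""
    path = os.path.normpath(path)
    chain = [path]
    while True:
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        chain.append(parent)
        path = parent
    return list(reversed(chain))


def describe(path):
    """Format mode, uid, gid and SELinux label (or "?") for one path."""
    st = os.stat(path)
    try:
        # The label is stored NUL-terminated in an extended attribute.
        context = os.getxattr(path, "security.selinux").rstrip(b"\0").decode()
    except (OSError, AttributeError):
        context = "?"
    return f"{stat.filemode(st.st_mode)} uid={st.st_uid} gid={st.st_gid} {context} {path}"


def show_chain(path):
    """Print one line per path component, stopping at the first one that cannot be read."""
    for component in path_chain(path):
        try:
            print(describe(component))
        except OSError as error:
            print(f"cannot stat {component}: {error}")
            break
```

Calling `show_chain` on the file that could not be opened, as the same user the process runs as, shows the first component whose permissions or label differ from what is expected.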

I actually have the exact same wireless card as you and haven't had an issue.

Comment 44 Yi Hao 2023-06-05 19:34:20 UTC
I think I might know how this bug happens to you. How do you set your timezone?

I looked at my log and it has the following:
setting regulatory domain to MY based on timezone (Asia/Kuala_Lumpur)

Maybe put that in your ath11k bug report as well.

Comment 45 Christopher Klooz 2023-06-05 19:58:02 UTC
> Maybe an ath11k bug report with this SLUB output?

I will do once I have time. I hope I can manage that tomorrow, but maybe not before the weekend.

> In that case, test again with:
> slub_debug=FZ

You mean replace slub_debug=F with slub_debug=FZ now?

> How do you set your timezone?

It was set by Anaconda (Fedora installer) during installation. I admit, I never thought about how Anaconda managed it. I have time zone Berlin.

> I am not sure if KDE freeze is caused by amdgpu issue. Like I said, your logs has a lot of different error, amdgpu being one.

Well, I also have freezes outside KDE, when KDE is not even active (only SDDM in some occurrences). I have to say that the ath11k issue in the slub_debug log did not cause noteworthy issues - from the user experience perspective, this is something new (I only became aware of the error by `cat /proc/sys/kernel/tainted` = 32). I could even reboot after that error: that is something completely new as well -> Before this occurrence, once anything occurred, rebooting never worked. Can this be linked to the slub_debug=F parameter? Let's see how the next slub_debug reports develop. I hope one of the freezes will reveal more.

> For the permission issue, did you try the classic troubleshooting using "ls -lZ"?

I am not sure what exactly you refer to, but if you refer to the issues of SELinux and KDE, this is a separated issue. The permissions and roles have to be aligned. That's on the "to do" list but will need time. But SELinux cannot cause these issues, except it has bugs itself. Especially since the issues also occur when I disabled SELinux-confined users or when I only log into root (which is not confined at all).

However, in order to create more comprehensible reports, I have disabled the SELinux-confined-users for now. My Fedora is now default, including its SELinux. So let's see in the future slub_debug reports what errors actually belong to real bugs and what are only sidekicks of SELinux-confined user accounts.

I am still surprised that my system has had so few issues since it has been running with slub_debug=F.

So, you mean we should switch to slub_debug=FZ from now on?

Comment 46 Yi Hao 2023-06-05 20:13:19 UTC
Hi,
Like I mentioned previously, I have no idea about your userland issues. All I see is a lot of selinux related errors. That might be my bias again.
Furthermore, I did not experience it in person, so I do not really know what it is like. I can only imagine it based on your description. I am sure you know how this kind of thing goes.
With that, I think I shall leave myself out of your userland issues and focus on the kernel.

Yes, you replace slub_debug=F with slub_debug=FZ.
However, only do it if you have stability issues. If you don't have stability issues, keep using "F".
"FZ" is much slower and uses more memory.

The whole idea is to allow you to continue using your WiFi while this is being fixed.
If you don't need WiFi, just blacklist the driver altogether and remove the slub_debug parameter.

Your system doesn't have "fewer issues"; it is just random. Feelings are inaccurate when dealing with a transient bug like this. That's why I can only go by what is in the log.

I suspect your neighbor triggers it too. If you look at the log closely, some other AP is broadcasting itself with country code US. The wireless driver is basically switching between US and DE.

Just let the ath11k maintainer deal with the problem.

Comment 47 Christopher Klooz 2023-06-05 20:39:53 UTC
Yes, forget the userland stuff. I also disabled SELinux-confined users to clear the logs from unnecessary SELinux-denials. I hope that makes the logs more comprehensible.

-------

If there are no other issues except memory and performance (I assume that is the case?), I would use slub_debug=FZ permanently because I often cannot foresee stability issues. They appear and disappear on a random basis. So if I enable FZ only when stability issues arise, they could already have disappeared once FZ is in place. 

-------

If you allow me to summarize some correlations that relate to your post:

- When I am away from home, then I use in most cases WiFi (in any case, never cable) -> I have never had an occurrence in this setting (at least, I don't remember).
- When I am at home, I never use WiFi (the controller is usually enabled, but not connected to a WiFi network) -> instead I am always connected with RJ45 at home: all occurrences I have in mind and that I have logged have happened while connected with RJ45, except:
- On 31st May (the day of the mystery freezes), I was away from home but did not use WiFi: instead, I used my mobile phone that was simulating Ethernet via USB (which I do seldomly).

I cannot say if that is related. But it might be noted for your consideration.

Comment 48 Yi Hao 2023-06-05 20:54:25 UTC
Hi,
slub_debug is not a silver bullet. In fact, I only recommended it after studying all your logs and concluding with high confidence that the corruption happens in the SLUB allocator.

There are a lot of debugging facilities in the kernel that are disabled by default.

I also mentioned that each issue must be triaged separately, at least from a developer point of view to be useful.
I also understand your view from a user perspective: The computer broke. Fedora is bad. Lenovo is a lemon.

Actually, I noticed this bug had very little traction before I came. It takes effort to process all your logs and it is not something a maintainer has time to do.

I don't know how to word this better. Just don't randomly enable debugging stuff. You get a big warning about reduced security in your log when it is enabled. It is there for a reason. A debugging tool won't fix the problem. It just identifies and possibly mitigates the problem (a very specific problem, not a general problem).

As for WiFi, it is active even when you are not connected to any network. This bug apparently can be triggered in this situation. It might even be triggered only when it is not connected to any network. The ath11k maintainer will find out for you.

Comment 49 Christopher Klooz 2023-06-05 21:25:36 UTC
Thanks for your points! And thanks for your help in general!

Once I have time, I will file a bug against ath11k.

You just gave me the idea to set everything back to normal but blacklist the ath11k and see if any issues occur in that condition.

If that does not work, I will return to slub_debug and post once I have logs that contain more than the recent slub_debug ones.

Comment 51 Christopher Klooz 2023-06-06 12:41:06 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=217528  

Thanks again Yi Hao!!!  

As elaborated in the bug report at kernel.org, all issues I have seem to be solved just by:  
keep permanently connected to a WiFi network OR boot with `module_blacklist=ath11k_pci,ath11k`.  

Additionally, I keep checking `cat /proc/sys/kernel/tainted` from time to time in case something occurs without symptoms. But so far the output is always 0.

I have already returned to my preferred settings (including SELinux-confined users) after my comment #49 . All seems to work. Having had so many hours without issues has not happened since the issues have begun. Let's hope it stays that way.

