Bug 2164667

Summary: KDE crashes on start with mesa-23.0.0~rc3-3.fc38
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: mesa
Assignee: Adam Jackson <ajax>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 38
CC: ajax, bskeggs, igor.raits, jglisse, j, lyude, mail, matt.fagnani, ndegraef, pasik, rhughes, robatino, rstrode, tintou, tstellar
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: openqa
Fixed In Version: mesa-23.0.0~rc3-3.fc38 mesa-23.0.0~rc4-1.fc38 mesa-23.0.0~rc4-3.fc39 mesa-23.0.0~rc4-3.fc38
Doc Type: If docs needed, set a value
Last Closed: 2023-02-15 16:34:02 UTC Type: Bug
Bug Blocks: 2083910    
Attachments:
Description | Flags
backtrace from the gnome-shell Wayland crash | none
backtrace from kde (kwin_wayland) | none
updated kde backtrace with 20955 patch applied | none

Description Adam Williamson 2023-01-26 00:59:46 UTC
With mesa-23.0.0~rc3-2.fc38, neither KDE nor GNOME starts successfully, at least in a VM. This is consistently affecting openQA tests and I can reproduce it in a local VM. I installed today's Workstation live ISO - https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20230125.n.0/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-Rawhide-20230125.n.0.iso - and confirmed it boots fine; then I updated mesa to mesa-23.0.0~rc3-2.fc38 (didn't update anything else), rebooted, and the system gets stuck at a flashing cursor. You can log in at a VT if you switch to another tty, but the graphical desktop never comes up.

The journal shows an initial attempt to run GNOME on Wayland which fails with a double free:

Jan 25 16:55:17 localhost-live gnome-shell[896]: Running GNOME Shell (using mutter 43.1) as a Wayland display server
Jan 25 16:55:17 localhost-live org.gnome.Shell.desktop[896]: free(): double free detected in tcache 2
Jan 25 16:55:17 localhost-live audit[896]: ANOM_ABEND auid=4294967295 uid=42 gid=42 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 pid=896 comm="gnome-shell" exe="/usr/bin/gnome-shell" sig=6 res=1
Jan 25 16:55:17 localhost-live systemd[1]: Created slice system-systemd\x2dcoredump.slice - Slice /system/systemd-coredump.
Jan 25 16:55:17 localhost-live audit: BPF prog-id=75 op=LOAD
Jan 25 16:55:17 localhost-live audit: BPF prog-id=76 op=LOAD
Jan 25 16:55:17 localhost-live audit: BPF prog-id=77 op=LOAD
Jan 25 16:55:17 localhost-live systemd[1]: Started systemd-coredump - Process Core Dump (PID 901/UID 0).
Jan 25 16:55:17 localhost-live audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@0-901-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 25 16:55:17 localhost-live systemd-coredump[902]: Process 896 (gnome-shell) of user 42 dumped core.

then there are several subsequent attempts to run it on X.org instead, which *also* fail with double frees:

Jan 25 16:55:17 localhost-live /usr/libexec/gdm-x-session[919]: X.Org X Server 1.20.14
Jan 25 16:55:17 localhost-live /usr/libexec/gdm-x-session[919]: X Protocol Version 11, Revision 0
...
Jan 25 16:55:17 localhost-live /usr/libexec/gdm-x-session[919]: free(): double free detected in tcache 2
Jan 25 16:55:17 localhost-live /usr/libexec/gdm-x-session[919]: (EE)

I'll try and get the backtrace from the gnome-shell coredump, in case it helps identify the problem.

Proposing as an F38 Beta blocker as a violation of "A system installed with a release-blocking desktop must boot to a log in screen where it is possible to log in to a working desktop using a user account created during installation or a 'first boot' utility." - https://fedoraproject.org/wiki/Basic_Release_Criteria#Expected_installed_system_boot_behavior .

Comment 1 Adam Williamson 2023-01-26 01:05:12 UTC
I've requested the build be untagged for now: https://pagure.io/releng/issue/11243

Comment 2 Adam Williamson 2023-01-26 01:18:42 UTC
Created attachment 1940529 [details]
backtrace from the gnome-shell Wayland crash

Here's the backtrace from the crash of gnome-shell on the Wayland attempt.

Comment 3 Adam Williamson 2023-01-26 02:53:57 UTC
Huh, the backtrace makes it look a lot like reverting one or both of the commits from https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20663 would avoid this, but I tried that and it doesn't seem to help. I haven't yet checked whether the backtrace is different after the reverts.

Comment 4 Corentin Noël 2023-01-26 12:30:45 UTC
Could you try this patch instead? https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20933

Comment 5 Adam Williamson 2023-01-26 16:51:04 UTC
Thanks! Will do.

Comment 6 Adam Williamson 2023-01-26 17:46:55 UTC
That looks like it works indeed. Thanks!

Comment 7 Fedora Update System 2023-01-26 18:15:21 UTC
FEDORA-2023-40b973fa06 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-40b973fa06

Comment 8 Fedora Update System 2023-01-26 18:18:10 UTC
FEDORA-2023-40b973fa06 has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 9 Adam Williamson 2023-01-27 02:34:33 UTC
Ugh, I tested GNOME and trusted KDE would be alright, but it looks like only GNOME is fixed and KDE is still broken :(

New backtrace coming.

Comment 10 Adam Williamson 2023-01-27 02:42:50 UTC
Created attachment 1940691 [details]
backtrace from kde (kwin_wayland)

Best backtrace I can get from kwin_wayland so far.

Comment 11 Corentin Noël 2023-01-27 11:47:56 UTC
I hope that this one will do the trick https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20955

Comment 12 Adam Williamson 2023-01-27 16:47:38 UTC
Thanks, I'll try it! Sorry, my bad for not properly re-testing both desktops yesterday.

Comment 13 Adam Williamson 2023-01-27 18:06:51 UTC
This is a bit delayed by a new version of rust which turns out to have a bug that makes it incapable of building mesa. I'm working on that with the rust packager, but building rust takes two hours, so it'll be a bit.

Comment 14 Adam Williamson 2023-01-27 23:33:59 UTC
Created attachment 1940817 [details]
updated kde backtrace with 20955 patch applied

Welp, we got a new rust, but unless I messed something up somewhere, KDE still doesn't start even with 20955 :( New backtrace attached.

Comment 15 Adam Williamson 2023-01-27 23:40:16 UTC
...yeah, I see it's basically the same backtrace, and I don't immediately see how that can happen either. But what can I say, somehow it is. One of those `createNewDrawable`s is explicitly set to be NULL, or something?

Comment 16 Fedora Update System 2023-02-01 23:56:22 UTC
FEDORA-2023-aa7f4c7892 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-aa7f4c7892

Comment 17 Fedora Update System 2023-02-02 00:03:14 UTC
FEDORA-2023-aa7f4c7892 has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 18 Adam Williamson 2023-02-02 07:25:17 UTC
KDE still does not start with that build. Untag request reopened: https://pagure.io/releng/issue/11247#comment-838991

Comment 19 Corentin Noël 2023-02-02 08:22:04 UTC
Are you sure that you applied https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20955 ? I don't see it applied here: https://src.fedoraproject.org/rpms/mesa/commits/rawhide

Comment 20 Adam Williamson 2023-02-02 16:43:34 UTC
I didn't do the RC3 build. You're right that it isn't in that, but when you first posted the PR, I did a scratch build (actually, two) with the patch attached and tested it, and KDE still failed. See comments 14 and 15.

Comment 21 Adam Williamson 2023-02-02 18:34:24 UTC
I did another scratch build and confirmed KDE still does not work with RC3 with the patch applied: e.g. https://openqa.stg.fedoraproject.org/tests/2555833 . The scratch build is at https://koji.fedoraproject.org/koji/taskinfo?taskID=97011662 if you want to check it out / test it. I'm a bit baffled how we could still be getting what looks like the same backtrace with that patch in place, but, you can see from the scratch build logs it definitely applied, and the openQA test logs show that build was definitely used. You can check the patch file from the .src.rpm if you want to confirm the patch itself is correct, or here it is:

From aec3abec559c1ae52aa43da70194f961007513b2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Corentin=20No=C3=ABl?= <corentin.noel>
Date: Fri, 27 Jan 2023 12:27:14 +0100
Subject: [PATCH] egl/dri2: Make sure to never call a null createNewDrawable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

driCoreExtension in src/gallium/frontends/dri/dri_util.c is actually defining a NULL
createNewDrawable, ignore such cases.

Signed-off-by: Corentin Noël <corentin.noel>
---
 src/egl/drivers/dri2/egl_dri2.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/egl/drivers/dri2/egl_dri2.c b/src/egl/drivers/dri2/egl_dri2.c
index b20f2468fb40..91c479ec6220 100644
--- a/src/egl/drivers/dri2/egl_dri2.c
+++ b/src/egl/drivers/dri2/egl_dri2.c
@@ -1592,11 +1592,11 @@ dri2_create_drawable(struct dri2_egl_display *dri2_dpy,
                                               dri2_surf->base.Type == EGL_PIXMAP_BIT);
    } else {
       __DRIcreateNewDrawableFunc createNewDrawable;
-      if (dri2_dpy->image_driver)
+      if (dri2_dpy->image_driver && dri2_dpy->image_driver->createNewDrawable)
          createNewDrawable = dri2_dpy->image_driver->createNewDrawable;
-      else if (dri2_dpy->dri2)
+      else if (dri2_dpy->dri2 && dri2_dpy->dri2->createNewDrawable)
          createNewDrawable = dri2_dpy->dri2->createNewDrawable;
-      else if (dri2_dpy->swrast)
+      else if (dri2_dpy->swrast && dri2_dpy->swrast->createNewDrawable)
          createNewDrawable = dri2_dpy->swrast->createNewDrawable;
       else
          return _eglError(EGL_BAD_ALLOC, "no createNewDrawable");
-- 
GitLab
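
The selection logic in the patch can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not Mesa code: the types `fake_ext` and `fake_display`, the helper `pick_create_new_drawable`, and the toy callbacks are hypothetical stand-ins. Only the order of preference (image_driver, then dri2, then swrast) and the extra NULL checks on the member mirror the diff above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a callback type; not the real Mesa signature. */
typedef int (*create_fn)(int config);

/* Hypothetical stand-in for a DRI extension table. */
struct fake_ext {
    create_fn createNewDrawable; /* may legitimately be NULL */
};

struct fake_display {
    const struct fake_ext *image_driver;
    const struct fake_ext *dri2;
    const struct fake_ext *swrast;
};

/* Return the first usable callback in order of preference, or NULL if
 * no extension provides one. Checking both the table pointer and the
 * member is the point of the patch: a table can be present and still
 * carry a NULL createNewDrawable. */
static create_fn pick_create_new_drawable(const struct fake_display *dpy)
{
    if (dpy->image_driver && dpy->image_driver->createNewDrawable)
        return dpy->image_driver->createNewDrawable;
    if (dpy->dri2 && dpy->dri2->createNewDrawable)
        return dpy->dri2->createNewDrawable;
    if (dpy->swrast && dpy->swrast->createNewDrawable)
        return dpy->swrast->createNewDrawable;
    return NULL;
}

/* Toy callbacks, used only to tell the selected path apart. */
static int create_dri2(int config)   { return config + 2; }
static int create_swrast(int config) { return config + 3; }
```

A caller would treat a NULL return the way the patched code treats the final `else` branch, reporting an error instead of calling through the pointer.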

Comment 22 Matt Fagnani 2023-02-02 19:09:20 UTC
kwin_wayland crashed with traces like those reported by Adam when starting Plasma with Fedora-KDE-Live-x86_64-Rawhide-20230202.n.0.iso and mesa-23.0.0~rc4-1.fc38 in a GNOME Boxes QEMU/KVM VM with 3D acceleration disabled, using the llvmpipe driver. With the same ISO, Plasma started normally without these crashes in a GNOME Boxes QEMU/KVM VM with 3D acceleration enabled (virgl driver), and on bare metal using the amdgpu driver. On bare metal I had amd_iommu=off on the kernel command line to work around the black screen problem involving amdgpu and AMD IOMMUs with 6.2 kernels, which I reported at https://bugzilla.redhat.com/show_bug.cgi?id=2156691 . The kwin_wayland crash might be specific to the use of the llvmpipe driver.

Comment 23 Adam Williamson 2023-02-02 19:21:02 UTC
Thanks a lot Matt, that's really useful. I meant to test on metal but hadn't got around to it yet, I'll try and do it today.

Comment 24 Matt Fagnani 2023-02-02 20:14:20 UTC
(In reply to Adam Williamson from comment #23)
> Thanks a lot Matt, that's really useful. I meant to test on metal but hadn't
> got around to it yet, I'll try and do it today.

No problem. With Fedora-KDE-Live-x86_64-Rawhide-20230202.n.0.iso in a GNOME Boxes QEMU/KVM VM with 3D acceleration disabled (llvmpipe driver), kwin_wayland also sometimes crashed in KWin::Workspace::geometry, with traces like those in https://bugzilla.redhat.com/show_bug.cgi?id=2133796 and https://bugzilla.redhat.com/show_bug.cgi?id=2094671 . Those crashes alternated with kwin_wayland crashes with traces like those you reported, as kwin_wayland restarted several times before systemd stopped trying to restart it. I've seen those kwin_wayland KWin::Workspace::geometry null pointer dereference crashes for months, only in Rawhide KDE Plasma VMs using llvmpipe. They occasionally prevented Plasma from starting when they repeated many times, but usually Plasma would start with them happening 1-5 times or not at all. Many other Plasma programs also crashed after the kwin_wayland ones, as shown by coredumpctl, I guess because kwin_wayland didn't start. The test of Fedora-KDE-Live-x86_64-Rawhide-20230202.n.0.iso on bare metal I mentioned was with an AMD A10-9620P CPU and integrated AMD Radeon R5 GPU using the radeonsi mesa driver.

Comment 25 Adam Williamson 2023-02-02 20:27:34 UTC
My bare metal testing gives similar results: on a system with an AMD graphics adapter that works fine with the open source drivers, booting today's Rawhide nightly (which has the RC4 build) normally works fine, booting it in "basic graphics mode" (which forces a fallback to llvmpipe) gives a black screen. Can't immediately confirm it's the same cause as I can't even reach a VT after the black screen, and I'm testing the live image without install ATM so I can't check the log messages from the next boot.

Comment 26 Adam Williamson 2023-02-02 22:59:47 UTC
Back testing in a VM, can confirm the backtrace is still the same with the 20955-on-RC4 build. I did another build with some debug messages and that shows we're on the `dri2_dpy->dri2->createNewDrawable` path, the second of the three possibilities.

Comment 27 Matt Fagnani 2023-02-03 03:39:05 UTC
(In reply to Adam Williamson from comment #26)
> Back testing in a VM, can confirm the backtrace is still the same with the
> 20955-on-RC4 build. I did another build with some debug messages and that
> shows we're on the `dri2_dpy->dri2->createNewDrawable` path, the second of
> the three possibilities.

I've seen frequent plasmashell journal warnings that the dri2 screen wasn't created in recent Rawhide KDE Plasma GNOME Boxes QEMU/KVM VMs with 3D acceleration disabled using llvmpipe, like the following from my report on plasmashell crashes at https://bugzilla.redhat.com/show_bug.cgi?id=2160869 :
plasmashell[2722]: libEGL warning: egl: failed to create dri2 screen

The dri2 screen not being created properly might be involved in the new crashes you reported with mesa-23.0.0~rc3-2.fc38 and later given the similarities in the functions and variables like dri2_create_drawable and dri2_dpy->dri2->createNewDrawable.

Comment 28 Adam Williamson 2023-02-03 07:22:24 UTC
Hmm, so I did another round of debugging...right before this line:

dri2_surf->dri_drawable = createNewDrawable(dri2_dpy->dri_screen,
                                                  config, loaderPrivate);

I added checks whether each of the things referred to exists:

      if (!createNewDrawable)
         _eglLog(_EGL_WARNING, "egl: XXX createNewDrawable missing");
      if (!dri2_surf->dri_drawable)
         _eglLog(_EGL_WARNING, "egl: XXX dri_drawable missing");
      if (!dri2_dpy->dri_screen)
         _eglLog(_EGL_WARNING, "egl: XXX dri_screen missing");
      if (!config)
         _eglLog(_EGL_WARNING, "egl: XXX config missing");
      if (!loaderPrivate)
         _eglLog(_EGL_WARNING, "egl: XXX loaderPrivate missing");

The only one that gets triggered is the check for `dri2_surf->dri_drawable` - that does not exist. (createNewDrawable does pass the check.) I'm too dumb about C to know if this is a problem. In a sense it feels like it shouldn't be, because the job of this function is to create it, after all...but does it need to exist in *some* sense for it to be referenced there?
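
On the question of whether `dri2_surf->dri_drawable` has to exist before that line: in C the left-hand side of an assignment is only a storage slot, so a NULL value there beforehand is overwritten without ever being dereferenced. A minimal sketch of that point follows. It assumes hypothetical stand-in types (`fake_drawable`, `fake_surface`) and helpers (`fake_create`, `create_into`) that are not part of Mesa, and it says nothing about where the real crash comes from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real Mesa structures. */
struct fake_drawable { int id; };
struct fake_surface  { struct fake_drawable *dri_drawable; };

static struct fake_drawable the_drawable = { 42 };

/* Toy creation callback: returns a pointer to a static object. */
static struct fake_drawable *fake_create(void)
{
    return &the_drawable;
}

/* Record whether the member was NULL, then assign to it. The old
 * value is only compared, never dereferenced, so a NULL member
 * before the assignment is harmless. Returns 1 if it was NULL. */
static int create_into(struct fake_surface *surf)
{
    int was_null = (surf->dri_drawable == NULL);
    surf->dri_drawable = fake_create();
    return was_null;
}
```

In this sketch the "missing" check fires for a fresh surface, and the member is valid afterwards, which is the behaviour one would expect from a function whose job is to create the drawable.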

Matt, I don't think that warning is directly the cause of this, though I suppose the reason that fails and the reason this fails could be related...

Comment 29 Matt Fagnani 2023-02-03 15:16:16 UTC
Since llvmpipe is also called swrast according to https://docs.mesa3d.org/drivers/llvmpipe.html , I'd guess that `else if (dri2_dpy->swrast && dri2_dpy->swrast->createNewDrawable)` would be a more appropriate condition for llvmpipe than the `else if (dri2_dpy->dri2 && dri2_dpy->dri2->createNewDrawable)` you mentioned was used. Could dri2_dpy->dri2 be non-null but still not have a valid dri2 screen? If the `else if (dri2_dpy->dri2 && dri2_dpy->dri2->createNewDrawable) createNewDrawable = dri2_dpy->dri2->createNewDrawable;` branch were commented out in a debug build, would the crash still happen?

I didn't see the "libEGL warning: egl: failed to create dri2 screen" message in the journal when I checked from another VT after the black screen, with Fedora-KDE-Live-x86_64-Rawhide-20230202.n.0.iso in a GNOME Boxes QEMU/KVM VM with 3D acceleration disabled using the llvmpipe driver. Those warnings happened with many KDE programs, including kwin_wayland_wrapper, in Rawhide KDE Plasma VMs from the last few weeks at least, but I only saw them when the llvmpipe driver was used and not with virgl and virtio_gpu or radeonsi and amdgpu. Those other KDE programs might not have been running long enough for the dri2 screen warnings to appear with Fedora-KDE-Live-x86_64-Rawhide-20230202.n.0.iso before kwin_wayland crashed.

Comment 30 Adam Williamson 2023-02-03 16:49:38 UTC
Well, AIUI at least, the confusing thing here is that if the issue were as you described, the backtrace should be different. The thing about the backtrace is the final two frames:

#0  0x0000000000000000 in ?? ()
No symbol table info available.
#1  0x00007f3eeaab44c3 in dri2_create_drawable (dri2_dpy=dri2_dpy@entry=0x56095bfae740, config=config@entry=0x56095bf0d850, dri2_surf=dri2_surf@entry=0x56095c3b6210, loaderPrivate=loaderPrivate@entry=0x56095c3b38c0) at ../src/egl/drivers/dri2/egl_dri2.c:1573
        createNewDrawable = <optimized out>

especially the way frame 0 is just...nothing, i.e. it looks like we're trying to call a NULL. That's why the obvious fix is what Corentin came up with - in the original code it looks like whatever we're assigning as 'createNewDrawable' could possibly be a null, so the patch adds further checks to make really really sure it isn't. And indeed, my additional debug output above seems to confirm it really really isn't a null - otherwise it would fail the `if (!createNewDrawable)` check and the log warning would be shown. But then, how are we getting this traceback? That's the mystery. At least that's the mystery for me, maybe it looks different to someone with more C skills. I'm much better with python tracebacks :D

If the problem were as you describe, we'd expect something...different. Frame 0 would show the call to 'createNewDrawable' and one of the values it dealt with would be a null, or it would call something *else* and that would be the place where we break, or something like that. Not just something that makes it look like, when we reach the bit of `dri2_create_drawable` where we call the thing we assigned as `createNewDrawable`, it's not there.

Comment 31 Adam Williamson 2023-02-03 16:52:41 UTC
btw, it's entirely possible that just commenting out the dri2->createNewDrawable part of the conditional as you suggest might 'fix' things, as that's the bit where we fail and we might indeed somehow work if we just skip it and go straight to swrast. But that's not super helpful, because we can't just do that; presumably that part is there for a reason, there are some drivers/configurations where we want to use `dri2_dpy->dri2->createNewDrawable`. We can't just chuck those out, or even give them lower priority than `dri2_dpy->swrast->createNewDrawable` - the priority order is clearly intentional and in order of preference, we only want to use swrast if there's no other option.

Comment 32 Matt Fagnani 2023-02-03 17:54:10 UTC
What you wrote makes sense to me. I didn't mean to suggest that commenting out `else if (dri2_dpy->dri2 && dri2_dpy->dri2->createNewDrawable) createNewDrawable = dri2_dpy->dri2->createNewDrawable;` would be a proper fix to the problem. I meant that a debug build with that change might indicate that `else if (dri2_dpy->swrast && dri2_dpy->swrast->createNewDrawable)` would still work and that the fallback to the llvmpipe/swrast path wasn't happening properly. My point was that `else if (dri2_dpy->dri2 && dri2_dpy->dri2->createNewDrawable)` being true might not be sufficient to check that the dri2 screen was created and valid with mesa-23.0.0. Could the dri2 screen not being created properly create an uninitialized null value somewhere that showed up as being used in frame #0? Could the null value in frame #0 be the return value of another function not shown in the trace? Perhaps a null value in the variable storing the dri2 screen or a null pointer to a function might be involved. Running kwin_wayland under valgrind might help to identify the problem, though I'm not sure how to do that during boot. Has this been reported as an issue upstream?

Comment 33 Adam Williamson 2023-02-03 18:05:19 UTC
> Could the dri2 screen not being created properly create an uninitialized null value somewhere that showed up as being used in frame #0? Could the null value in frame #0 be the return value of another function not shown in the trace? Perhaps a null value in the variable storing the dri2 screen or a null pointer to a function might be involved.

Well yeah, but at least in my experience, what you get in that kinda scenario is not a frame like the frame 0 we're getting. You get something that looks more like frame 1 - a record of a function call, but one of the *variables* involved shows as 0x000000000000. But as I said, I only have a limited understanding of this low-level language stuff :P

> Has this been reported as an issue upstream?

Yes, well, it's being discussed via the pull requests - https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20955 .

Comment 34 Matt Fagnani 2023-02-03 20:26:45 UTC
Adam, the trace you attached in comment 10 has errors accessing memory in frame #9, in KWin::EglGbmLayerSurface::createSurface, which contains
        surface = std::optional<KWin::EglGbmLayerSurface::Surface> = {[contained value] = {gbmSurface = <error reading variable: Cannot access memory at address 0x7f0100000008>, importSwapchain = <error reading variable: Cannot access memory at address 0x10>,

Frame #10 has similar errors in KWin::EglGbmLayerSurface::checkSurface 
        newSurface = std::optional<KWin::EglGbmLayerSurface::Surface> = {[contained value] = {gbmSurface = <error reading variable: Cannot access memory at address 0xb03b3e57d90427ff>, importSwapchain = <error reading variable: Cannot access memory at address 0xe6673214bbb7285d>, importMode = (unknown: 0x56697274), currentBuffer = <error reading variable: Cannot access memory at address 0x6c2d312d72747569>, currentFramebuffer = <error reading variable: Cannot access memory at address 0x3a80000069746f7a>, forceLinear = false}}

Those errors might indicate memory corruption possibly in kwin_wayland which led to the crash higher in the trace.

In comment 28 you wrote:

dri2_surf->dri_drawable = createNewDrawable(dri2_dpy->dri_screen,
                                                  config, loaderPrivate);

Line 1604 as I see it with the proposed patch is:

dri2_surf->dri_drawable = createNewDrawable(dri2_dpy->dri_screen_render_gpu,
                                                  config, loaderPrivate);

at https://gitlab.freedesktop.org/mesa/mesa/-/blob/9c6c0d0a1a3128b7f186b8143757c9300823f039/src/egl/drivers/dri2/egl_dri2.c
The two differ in dri2_dpy->dri_screen_render_gpu, so adding a debug output line for it, like you did for the others in that comment, might be something to try.

When I booted Fedora-KDE-Live-x86_64-Rawhide-20230202.n.0.iso in a GNOME Boxes QEMU/KVM VM with 3D acceleration disabled using the llvmpipe driver, I ran startx & from another VT after the black screen. Plasma on X started normally. So the problem might be specific to Plasma on Wayland.

Should this problem be reported to mesa and KDE in the usual way? Thanks.

Comment 35 Matt Fagnani 2023-02-04 18:30:03 UTC
I reported this problem at https://bugs.kde.org/show_bug.cgi?id=465284 and https://gitlab.freedesktop.org/mesa/mesa/-/issues/8232

When I booted Fedora-KDE-Live-x86_64-Rawhide-20230202.n.0.iso in a GNOME Boxes QEMU/KVM VM with 3D acceleration disabled using the llvmpipe driver after the black screen I ran startx & from another VT. Plasma on X ran normally. I disabled automatic login from sddm in System Settings and logged out. kwin_wayland as the sddm Wayland compositor crashed with the same type of trace after logging out. The black screen problem with the flashing text cursor at the top left happened on logout.

Comment 36 Ben Cotton 2023-02-07 15:12:18 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 38 development cycle.
Changing version to 38.

Comment 37 Matt Fagnani 2023-02-14 02:38:53 UTC
I built mesa 23.0.0-rc1 from the upstream archive with fedpkg mockbuild in a Fedora 37 host after changing mesa.spec to use it in a Fedora 38 mesa repo I cloned. I booted Fedora-KDE-Live-x86_64-38-20230209.n.1.iso in a GNOME Boxes QEMU/KVM VM with 3D acceleration disabled using the llvmpipe driver, in which I installed nautilus and used it to copy and paste the mesa rpms into the VM. I disabled automatic login from sddm in System Settings and logged out. I switched to VT2. When a black screen appeared due to the kwin_wayland crashes of the type at https://bugzilla.redhat.com/show_bug.cgi?id=2168034 , I restarted sddm with sudo systemctl restart sddm. I updated to the mesa 23.0.0-rc1 rpms I built in VT2. I switched to sddm and logged in. The kwin_wayland crashes like those reported here happened with mesa 23.0.0-rc1. I checked the kwin_wayland crash traces with coredumpctl.

I cloned the upstream mesa 23.0 branch and made archives of the repo at earlier stages with commands like
git archive --prefix mesa-23.0.0~git7d5b1cd0/ -o /programs/mesa/mesa-23.0.0~git7d5b1cd0.tar.gz 7d5b1cd0
I changed mesa.spec to build from those earlier snapshots. I reverted the commit "Update to 23.0.0-rc4" 4a6a053618905ba63ee45e14fb924eef9003fef3 because it removed the patches for the double-free crashes and valgrind build problem which weren't included upstream before then. I used fedpkg mockbuild to build the mesa rpms at those earlier snapshots. Starting Plasma with mesa rpms from the 22.3-branchpoint tag in the 23.0 branch didn't show this type of crash. I bisected between 22.3-branchpoint and 23.0.0-rc1 in the 23.0 branch, and I built and tested the mesa rpms as above. The first bad commit "frontend/dri: move callbacks from the VTable into dri_screen, dri_drawable" renamed dri2_create_buffer to dri2_create_drawable (the top function in the traces) in src/gallium/frontends/dri/dri2.c and src/gallium/frontends/dri/dri_drawable.c among other related changes.

7d5b1cd02c4d29d0636db66d668607a6692daa75 is the first bad commit
commit 7d5b1cd02c4d29d0636db66d668607a6692daa75
Author: Marek Olšák <marek.olsak>
Date:   Tue Nov 15 16:13:49 2022 -0500

    frontend/dri: move callbacks from the VTable into dri_screen, dri_drawable
    
    This just moves the callbacks and renames the functions.
    Some functions had to be moved up because they are initialized there.
    Remove some obsolete comments.
    
    Reviewed-by: Adam Jackson <ajax>
    Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19741>

 src/gallium/frontends/dri/dri2.c         | 54 +++++++++++++++-----------------
 src/gallium/frontends/dri/dri_drawable.c |  8 ++---
 src/gallium/frontends/dri/dri_drawable.h |  6 ++--
 src/gallium/frontends/dri/dri_screen.h   | 15 +++++++--
 src/gallium/frontends/dri/dri_util.c     | 17 +++++-----
 src/gallium/frontends/dri/dri_util.h     | 42 +++----------------------
 src/gallium/frontends/dri/drisw.c        | 43 ++++++++++++-------------
 src/gallium/frontends/dri/kopper.c       | 26 +++++++++------
 8 files changed, 96 insertions(+), 115 deletions(-)
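
The bisection described above amounts to a binary search for the first commit at which the crash reproduces. A minimal sketch of that idea follows, assuming a linear, ordered history in which every commit after the first bad one is also bad. The names `first_bad`, `is_bad_fn`, and `toy_is_bad` are hypothetical, and real `git bisect` also handles non-linear history, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test callback: returns nonzero if the build at commit
 * index i shows the crash. In the workflow above, each such test was
 * a package build followed by starting Plasma in the VM. */
typedef int (*is_bad_fn)(size_t i);

/* Find the first bad index in (good, bad], assuming index `good` is
 * known good, index `bad` is known bad, and badness never reverts
 * once it has appeared. */
static size_t first_bad(size_t good, size_t bad, is_bad_fn is_bad)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(mid))
            bad = mid;   /* crash reproduces: culprit is at or before mid */
        else
            good = mid;  /* no crash: culprit is after mid */
    }
    return bad;
}

/* Toy history for illustration only: the crash appears at index 37. */
static int toy_is_bad(size_t i) { return i >= 37; }
```

Each step halves the remaining range, so the number of builds to test grows with the logarithm of the number of commits between the known-good and known-bad points.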

I tried to revert that patch with the following, 
git revert 7d5b1cd02c4d29d0636db66d668607a6692daa75
git format-patch -1 ac43910bf65ed46746ef957737953dddd19aa2eb

I reset to the head in the F38 mesa repo clone at 23.0.0-rc4 with git reset --hard e7344b6d973d83a0e11c972da56616f11128e8c9
I added Patch11:        0001-Revert-frontend-dri-move-callbacks-from-the-VTable-i.patch
to mesa.spec

Errors occurred during the fedpkg mockbuild run when applying the patch. 
+ /usr/lib/rpm/rpmuncompress /builddir/build/SOURCES/0001-Revert-frontend-dri-move-callbacks-from-the-VTable-i.patch
+ /usr/bin/patch -p1 -s --fuzz=0 --no-backup-if-mismatch -f
2 out of 4 hunks FAILED -- saving rejects to file src/gallium/frontends/dri/dri2.c.rej
1 out of 3 hunks FAILED -- saving rejects to file src/gallium/frontends/dri/dri_drawable.c.rej
1 out of 2 hunks FAILED -- saving rejects to file src/gallium/frontends/dri/dri_screen.h.rej
3 out of 7 hunks FAILED -- saving rejects to file src/gallium/frontends/dri/dri_util.c.rej
1 out of 2 hunks FAILED -- saving rejects to file src/gallium/frontends/dri/dri_util.h.rej
1 out of 3 hunks FAILED -- saving rejects to file src/gallium/frontends/dri/drisw.c.rej
2 out of 6 hunks FAILED -- saving rejects to file src/gallium/frontends/dri/kopper.c.rej
error: Bad exit status from /var/tmp/rpm-tmp.mkIPJe (%prep)
    Bad exit status from /var/tmp/rpm-tmp.mkIPJe (%prep)

The first bad commit was part of a merge at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19741 so it might depend on the other patches in it.

Comment 38 Matt Fagnani 2023-02-14 21:35:25 UTC
Michel Dänzer wrote a patch at https://gitlab.freedesktop.org/mesa/mesa/-/issues/8232#note_1772691 which I added to mesa.spec and built 23.0.0-rc4 with. Plasma started without this type of kwin_wayland crash in dri2_create_drawable or the black screen happening with the patch. There was one kwin_wayland crash of the type at https://bugzilla.redhat.com/show_bug.cgi?id=2168034 but that's a different problem. Michel's patch appears to fix this kwin_wayland crash in dri2_create_drawable. Thanks.

Comment 39 Fedora Update System 2023-02-15 16:24:07 UTC
FEDORA-2023-302eb35710 has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-302eb35710

Comment 40 Fedora Update System 2023-02-15 16:26:08 UTC
FEDORA-2023-0a5799f541 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-0a5799f541

Comment 41 Fedora Update System 2023-02-15 16:34:02 UTC
FEDORA-2023-302eb35710 has been pushed to the Fedora 39 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 42 Fedora Update System 2023-02-15 16:37:02 UTC
FEDORA-2023-0a5799f541 has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.