Bug 1814015 - Freezes and/or lockups in Wayland sessions on systems with dual graphics adapters including NVIDIA (Lenovo P53, P1, Dell XPS 15 9560 ...)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 32
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker AcceptedFreezeException
Depends On:
Blocks: 1816645 1816768 F32FinalFreezeException
 
Reported: 2020-03-16 18:20 UTC by David Ober
Modified: 2020-04-14 20:25 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-14 20:25:11 UTC
Type: Bug


Attachments
The journal log from P50 running smoothly. (281.56 KB, text/plain)
2020-03-27 09:44 UTC, Lukas Ruzicka
output of "journalctl -aeb" before crash on Lenovo P53 (121.36 KB, text/plain)
2020-03-27 15:26 UTC, Geoffrey Marr
gnome-shell journal entry before crash (158.84 KB, text/plain)
2020-03-27 15:27 UTC, Geoffrey Marr

Description David Ober 2020-03-16 18:20:08 UTC
Description of problem: Boot the system and log in to the Wayland desktop. After a varying amount of time (usually within 15 minutes), the system locks up.


Version-Release number of selected component (if applicable):


How reproducible: Consistently


Steps to Reproduce:
1. Log in to a Wayland session
2. Let the system sit idle

Actual results: The system locks up


Expected results: The system should continue to operate


Additional info:

Comment 1 František Zatloukal 2020-03-26 14:55:45 UTC
David, I assume you mean GNOME Wayland by Wayland session, so reassigning to mutter.
Also, can you try disabling the nVidia GPU? Is it possible to do so in UEFI? If not, you can try adding "nouveau.modeset=0" to the kernel line in GRUB (just after ... rhgb quiet).

Mutter developers, a workaround for the time being might be blacklisting Wayland sessions on Optimus configurations (or blacklisting just this specific GPU). What do you think?
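
For anyone trying the kernel parameter mentioned above, here is a minimal sketch of how one might confirm that it actually reached the running kernel. The helper name `has_karg` is made up for illustration; the `grubby` invocation in the comments is the usual way to make a parameter persistent on Fedora, but treat it as an assumption about your setup rather than a verified procedure for this bug.

```shell
# Hypothetical helper: succeed if a kernel command line contains an exact parameter.
has_karg() {
    # $1 = parameter to look for
    # $2 = command line to search (defaults to the running kernel's /proc/cmdline)
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# One-off test: press "e" at the GRUB menu and append nouveau.modeset=0
# after "rhgb quiet" on the kernel line.
# Persistent change (assumes grubby is available; run as root):
#   grubby --update-kernel=ALL --args="nouveau.modeset=0"
# After rebooting, confirm the parameter took effect:
#   has_karg nouveau.modeset=0 && echo "nouveau modesetting disabled"
```

The match is on whole words, so a similar-looking parameter with a different value does not count as a hit.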

Comment 2 Jonas Ådahl 2020-03-26 15:19:22 UTC
(In reply to František Zatloukal from comment #1)
> David, I assume you mean GNOME Wayland by Wayland session, so reassigning to
> mutter. 
> Also, can you try to disable nVidia GPU? Is it possible to do so in UEFI? If
> not,you can try to add "nouveau.modeset=0" to kernel line in grub (just
> after ... rhgb quiet).
> 
> Mutter developers, workaround for time-being might be blacklisting Wayland
> sessions on Optimus configurations (or blacklisting just this specific GPU),
> what do you think?

Are any external monitors connected?

Is it verified that falling back on Xorg makes the issue go away? FWIW, it's more likely to be a kernel issue than a mutter issue, as I have a hybrid graphics laptop here (P50, intel + nouveau) that doesn't have any stability issues.

Both kernel logs and gnome-shell logs would be helpful to pinpoint where the issue lies anyhow.
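
A rough sketch of how the requested logs could be collected after a freeze, assuming the journal is persistent so the previous boot is still available. The filter name `gpu_lines` and its keyword list are illustrative choices, not something prescribed in this report.

```shell
# Hypothetical filter: keep only journal lines that mention the graphics stack.
# Reads log text on stdin and prints the matching lines.
gpu_lines() {
    grep -i -E 'nouveau|drm|gnome-shell'
}

# Kernel messages from the previous boot (the one that froze):
#   journalctl -k -b -1 | gpu_lines
# gnome-shell messages from the same boot, selected by executable path:
#   journalctl -b -1 /usr/bin/gnome-shell
```

Attaching the unfiltered output is usually more useful to developers; the filter is only meant for a quick first look.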

Comment 3 Lukas Ruzicka 2020-03-27 09:43:14 UTC
I cannot reproduce this on a P50 (just tried it) running Fedora WS 32 Beta. The system is still responsive after approximately 20 minutes of being idle. I am attaching the journal from the machine, in case it helps.

Comment 4 Lukas Ruzicka 2020-03-27 09:44:20 UTC
Created attachment 1674031 [details]
The journal log from P50 running smoothly.

Comment 5 Geoffrey Marr 2020-03-27 15:23:30 UTC
(In reply to Jonas Ådahl from comment #2)
> (In reply to František Zatloukal from comment #1)
> > David, I assume you mean GNOME Wayland by Wayland session, so reassigning to
> > mutter. 
> > Also, can you try to disable nVidia GPU? Is it possible to do so in UEFI? If
> > not,you can try to add "nouveau.modeset=0" to kernel line in grub (just
> > after ... rhgb quiet).

Using the same hardware (Lenovo P53 w/ nVidia Quadro T1000), when I disable the nouveau driver, the system no longer freezes.

> > 
> > Mutter developers, workaround for time-being might be blacklisting Wayland
> > sessions on Optimus configurations (or blacklisting just this specific GPU),
> > what do you think?
> 
> Are any external monitors connected?

No.

> 
> Is it verified that falling back on Xorg makes the issue go away? FWIW, it's
> more likely to be a kernel issue than a mutter issue, as I have a hybrid
> graphics laptop here (P50, intel + nouveau) that doesn't have any stability
> issues.

Haven't tested on Xorg yet, will do that and post results. At this point I think it has to do with nouveau.

> 
> Both kernel logs and gnome-shell logs would be helpful to pinpoint where the
> issue lies anyhow.

Comment 6 Geoffrey Marr 2020-03-27 15:26:32 UTC
Created attachment 1674103 [details]
output of "journalctl -aeb" before crash on Lenovo P53

Comment 7 Geoffrey Marr 2020-03-27 15:27:56 UTC
Created attachment 1674105 [details]
gnome-shell journal entry before crash

Comment 8 Geoffrey Marr 2020-03-27 15:30:02 UTC
The card/driver combo on my Lenovo P53 that crashes:

01:00.0 VGA compatible controller: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] (rev a1)
	Subsystem: Lenovo Device 2297
	Kernel driver in use: nouveau
	Kernel modules: nouveau

Comment 9 Carlos Soriano 2020-04-02 14:02:42 UTC
The log shows some crashes in the GNOME Shell JS code for searching, which might be the cause of the system freeze (gnome shell crashing on Wayland brings the session down).
Is the crash reproducible in Xorg too? Also, is it possible to consistently trigger it by searching for something in GNOME Shell?

Comment 10 Karol Herbst 2020-04-02 14:21:21 UTC
(In reply to Geoffrey Marr from comment #8)
> The card/driver combo on my Lenovo P53 that crashes:
> 
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU117GLM [Quadro T1000
> Mobile] (rev a1)
> 	Subsystem: Lenovo Device 2297
> 	Kernel driver in use: nouveau
> 	Kernel modules: nouveau

as this might be a nouveau issue, mind testing out the kernel from my copr once this build is through? https://copr.fedorainfracloud.org/coprs/karolherbst/Nouveau_Testing/build/1326186/

Thanks.

Comment 11 Karol Herbst 2020-04-02 14:30:46 UTC
seems like I messed up a little, here is the new build: https://copr.fedorainfracloud.org/coprs/karolherbst/Nouveau_Testing/build/1326194/

Comment 12 David Ober 2020-04-04 00:29:20 UTC
I have seen freezes a couple of times on Xorg, but not as consistently as on Wayland. I have not been able to consistently cause the crash on Xorg, whereas on Wayland it will eventually happen even if you are doing nothing, so I am not sure about the searching you are referring to.

Comment 13 Karol Herbst 2020-04-04 00:49:11 UTC
(In reply to David Ober from comment #12)
> I have seen freezes a couple of times on Xorg but not consistent as it is an
> wayland.  I have not been able to consistently cause the crash on Xorg where
> as on wayland it will eventually happen if if you are doing nothing so not
> sure about the searching you are referring to.

mind testing my copr and see if the issue goes away? Without having proper logs we can only guess what the actual problem is, but the chances are high it's a known issue fixed by my patches.

dnf copr enable karolherbst/Nouveau_Testing
dnf install kernel-5.6.1-9000.fc32

Comment 14 Karol Herbst 2020-04-04 00:50:29 UTC
(In reply to Karol Herbst from comment #13)
> (In reply to David Ober from comment #12)
> > I have seen freezes a couple of times on Xorg but not consistent as it is an
> > wayland.  I have not been able to consistently cause the crash on Xorg where
> > as on wayland it will eventually happen if if you are doing nothing so not
> > sure about the searching you are referring to.
> 
> mind testing my copr and see if the issue goes away? Without having proper
> logs we can only guess what the actual problem is, but the chances are high
> it's a known issue fixed by my patches.
> 
> dnf copr enable karolherbst/Nouveau_Testing
> dnf install kernel-5.6.1-9000.fc32

Oh, and make sure to boot that exact version. You might also need to disable Secure Boot; otherwise you won't be able to boot this kernel build.
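
A minimal sketch for checking that the test kernel is really the one that booted. The helper name `booted_kernel_is` is made up for illustration, and the `mokutil` line assumes that tool is installed.

```shell
# Hypothetical helper: succeed if the running kernel release starts with the expected version.
booted_kernel_is() {
    # $1 = expected version-release, e.g. 5.6.1-9000.fc32
    # $2 = release string to check (defaults to the output of `uname -r`)
    expected=$1
    running=${2-$(uname -r)}
    case $running in
        "$expected"*) return 0 ;;
        *) return 1 ;;
    esac
}

# After installing the copr build and rebooting:
#   booted_kernel_is 5.6.1-9000.fc32 || echo "booted $(uname -r) instead"
# Secure Boot state, if mokutil is installed:
#   mokutil --sb-state
```

The comparison is a prefix match because `uname -r` appends the architecture to the package version-release.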

Comment 15 Geoffrey Marr 2020-04-06 21:11:35 UTC
Discussed during the 2020-04-06 blocker review meeting: [0]

The decision to classify this bug as a "RejectedBlocker" and an "AcceptedFreezeException" was made as this seems to affect only one laptop model and there is a known workaround, so it is not considered broad enough in impact to be a release blocker, but is accepted as an FE as a significant and highly visible bug on a relatively popular piece of hardware.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-04-06/f32-blocker-review.2020-04-06-16.00.txt

Comment 16 Lyude 2020-04-06 21:49:48 UTC
(In reply to Geoffrey Marr from comment #15)
> Discussed during the 2020-04-06 blocker review meeting: [0]
> 
> The decision to classify this bug as a "RejectedBlocker" and an
> "AcceptedFreezeException" was made as this seems to affect only one laptop
> model and there is a known workaround, so it is not considered broad enough
> in impact to be a release blocker, but is accepted as an FE as a significant
> and highly visible bug on a relatively popular piece of hardware.
> 
> [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-04-06/f32-blocker-review.2020-04-06-16.00.txt

uhhhhhhhhhh, hey sorry to pop in here but I _REALLY_ think this should be reconsidered. First off, I'm not really sure where you got the impression this isn't a widespread issue. There are very, very few laptops with nvidia Pascal GPUs that do not have this issue, and it is also getting harder and harder to actually find laptops that don't come with these GPUs. Multiple manufacturers include these GPUs by default in almost all of their models nowadays. Additionally, since some Turing GPUs are affected, laptops with those GPUs (which, again, is a _LOT_ of laptops) will also break without this patch.

The other thing is that I really don't agree with saying there's a 'workaround' here. Sure, you can get the GPU working with nouveau.runpm=0, but that doesn't actually matter because it's extremely non-obvious to any user who isn't a nouveau developer that this issue might be occurring on their machine. The main symptoms of this problem are usually random freezes (assuming the system manages to boot at all), and the only error messages that get left are in dmesg. Case in point here: we currently have dozens of cases open at Red Hat claiming all sorts of random freezes and hangs, with almost all of them being the same runtime PM issue because no one filing said bugs realized that was the problem.

Long story short: from most users' perspectives, this will mean the Fedora installer just hangs on their laptop, and they likely just decide to give up and go on to a different distro or go back to Windows. And on top of that, even if they -do- happen to know about the workaround, they are basically demolishing the batteries on their laptops by turning off runtime PM. That's really, really not a good experience.

Please reconsider this as blocker material, there's a lot of really good reasons we're pushing to try to get this fix in so late in the cycle (it has taken a -while- to get upstream to consider these fixes at all, otherwise we would have done this much earlier).
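
For readers wondering whether runtime PM is in play on their machine, a minimal sketch that reads the standard sysfs runtime PM status of a PCI device. The helper name `runtime_pm_status` is hypothetical, and the address 0000:01:00.0 is an assumption based on the `01:00.0` slot reported in comment 8; substitute whatever `lspci` shows for your GPU.

```shell
# Hypothetical helper: print the runtime PM status of a PCI device.
runtime_pm_status() {
    # $1 = full PCI address including domain, e.g. 0000:01:00.0
    # $2 = sysfs mount point (defaults to /sys)
    cat "${2-/sys}/bus/pci/devices/$1/power/runtime_status"
}

# Usage:
#   runtime_pm_status 0000:01:00.0
# "suspended" means the device is powered down while idle; "active" means it is powered up.
# Any errors left behind by the driver end up in the kernel log:
#   dmesg | grep -i nouveau
```

With nouveau.runpm=0 on the kernel command line the discrete GPU should never report "suspended", which is the battery cost described above.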

Comment 17 Geoffrey Marr 2020-04-06 23:41:15 UTC
> uhhhhhhhhhh, hey sorry to pop in here but I _REALLY_ think this should be
> reconsidered. First off-I'm not really sure where you got the impression
> this isn't a widespread issue. There are very, very few laptops with nvidia
> pascal GPUs that do not have this issue and additionally - it is also
> getting harder and harder to actually find laptops that don't come with
> these GPUs. Multiple manufacturers include these GPUs by default in almost
> all of their models nowadays, and additionally since some turing GPUs are
> affected laptops with those GPUs (which again, is a _LOT_ of laptops) they
> will also break without this patch.

In our testing, we haven't seen that. We've tested P50, P52, and P53 hardware (all with nVidia graphics), and the only combo that has proven to freeze has been the P53 with nVidia Quadro T1000 graphics and nouveau driver. If you have more information regarding crashes on other models/graphics hardware, please add it to this bug.
 
> The other thing is that I really don't agree with saying there's a
> 'workaround' here. Sure-you can get the GPU working with nouveau.runpm=0,
> but that doesn't actually matter because it's extremely non-obvious to any
> user who isn't a nouveau developer that this issue might be occuring on
> their machine. The main symptoms of this problem are usually random freezes
> (assuming the system manages to boot at all), and the only error messages
> that get left are in dmesg. Case in point here - we currently have dozens of
> cases open at Red Hat claiming all sorts of random freezes and hangs, with
> almost all of them being the same runtime PM issue because no one filing
> said bugs realized that was the problem.

Since the scope of the reported testing has shown that this bug only affects a single model of hardware, blacklisting or disabling the dedicated graphics could be a viable solution to ship F32 on time, i.e. "nouveau.modeset=0".

> Long story short from most users' perspectives, this will mean the Fedora
> installer just hangs on their laptop and they likely just decide to give up
> and go onto a different distro or go back to windows. And on top of that,
> even if they -do- happen to know about the workaround they are basically
> demolishing the batteries on their laptops by turning off runtime PM. That's
> really, really not a good experience.

There hasn't been an install reported that froze or hung and caused the system not to install. All of the testing has shown that the freeze is occurring after install. If you have conflicting information, please add it/link the bug to this report.

> Please reconsider this as blocker material, there's a lot of really good
> reasons we're pushing to try to get this fix in so late in the cycle (it has
> taken a -while- to get upstream to consider these fixes at all, otherwise we
> would have done this much earlier).

This can be reconsidered for blocker status by reproposing it in the blocker-bugs app [0] and providing more information as to why it should be reconsidered. As of this morning's blocker-review meeting, there were four "-1 blocker" votes [1]; only three are needed to drop the proposed blocker status. Consider attending the blocker review meeting next week (every Monday, 16:00UTC, #fedora-blocker-review) should you decide to repropose this bug as a blocker.

[0] https://qa.fedoraproject.org/blockerbugs/propose_bug
[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-04-06/f32-blocker-review.2020-04-06-16.00.log.html

Comment 18 Mark Pearson 2020-04-07 00:10:48 UTC
Just as a note - we are currently testing the fix (I read the blocker minutes and realised an update before the meeting would have been useful :)). 
Initial testing looks promising - we just aren't quite ready to give the thumbs up yet. We'll get an update ASAP and hopefully the blocker discussion becomes a no-op.
Mark

Comment 19 Karol Herbst 2020-04-07 00:11:49 UTC
(In reply to Geoffrey Marr from comment #17)
> > uhhhhhhhhhh, hey sorry to pop in here but I _REALLY_ think this should be
> > reconsidered. First off-I'm not really sure where you got the impression
> > this isn't a widespread issue. There are very, very few laptops with nvidia
> > pascal GPUs that do not have this issue and additionally - it is also
> > getting harder and harder to actually find laptops that don't come with
> > these GPUs. Multiple manufacturers include these GPUs by default in almost
> > all of their models nowadays, and additionally since some turing GPUs are
> > affected laptops with those GPUs (which again, is a _LOT_ of laptops) they
> > will also break without this patch.
> 
> In our testing, we haven't seen that. We've tested P50, P52, and P53
> hardware (all with nVidia graphics), and the only combo that has proven to
> freeze has been the P53 with nVidia Quadro T1000 graphics and nouveau
> driver. If you have more information regarding crashes on other
> models/graphics hardware, please add it to this bug.
>  

We have seen this issue on laptops from every vendor. I myself have an XPS 9560 that is hit by the same issue. I was also testing on a 2nd-gen P1, which doesn't boot either. And this P53 probably has the same issue. We know that the P50 is not affected, but that's just something random.

I have been investigating this issue for quite some time now, and right now I would assume that if a laptop is dual graphics with an nvidia GPU, there is a high chance of hitting this issue. But again, the P53 could be hitting a different bug: to this day we still don't know why it crashed, because of the absence of useful logs, so we can only guess what the actual issue is.

> > The other thing is that I really don't agree with saying there's a
> > 'workaround' here. Sure-you can get the GPU working with nouveau.runpm=0,
> > but that doesn't actually matter because it's extremely non-obvious to any
> > user who isn't a nouveau developer that this issue might be occuring on
> > their machine. The main symptoms of this problem are usually random freezes
> > (assuming the system manages to boot at all), and the only error messages
> > that get left are in dmesg. Case in point here - we currently have dozens of
> > cases open at Red Hat claiming all sorts of random freezes and hangs, with
> > almost all of them being the same runtime PM issue because no one filing
> > said bugs realized that was the problem.
> 
> Since the scope of the reported testing has shown that this bug only affects
> a single model of hardware, blacklisting or disabling the dedicated graphics
> could be a viable solution to ship F32 on time, i.e. "nouveau.modeset=0".
> 
> > Long story short from most users' perspectives, this will mean the Fedora
> > installer just hangs on their laptop and they likely just decide to give up
> > and go onto a different distro or go back to windows. And on top of that,
> > even if they -do- happen to know about the workaround they are basically
> > demolishing the batteries on their laptops by turning off runtime PM. That's
> > really, really not a good experience.
> 
> There hasn't been an install reported that froze or hung and caused the
> system not to install. All of the testing has shown that the freeze is
> occurring after install. If you have conflicting information, please add
> it/link the bug to this report.
> 

From personal experience and from a couple of users asking me directly: yes, there are dozens of those. Sadly, all of the bugs filed against the Red Hat Bugzilla are private, but we also have several bugs at the kernel bugzilla, including this monster bug report: https://bugzilla.kernel.org/show_bug.cgi?id=156341 (although that one became a bin into which everybody threw whatever issue they had). So in the end, it's not a regression but a bug that has been plaguing users for a long time already.

Anyway, we now have patches we can move upstream to fix this, so we are already working on getting those in everywhere as fast as possible.

> > Please reconsider this as blocker material, there's a lot of really good
> > reasons we're pushing to try to get this fix in so late in the cycle (it has
> > taken a -while- to get upstream to consider these fixes at all, otherwise we
> > would have done this much earlier).
> 
> This can be reconsidered for blocker status by reproposing it in the
> blocker-bugs app [0] and providing more information as to why it should be
> reconsidered. As of this morning's blocker-review meeting, there were four
> "-1 blocker" votes [1]; only three are needed to drop the proposed blocker
> status. Consider attending the blocker review meeting next week (every
> Monday, 16:00UTC, #fedora-blocker-review) should you decide to repropose
> this bug as a blocker.
> 
> [0] https://qa.fedoraproject.org/blockerbugs/propose_bug
> [1] https://meetbot.fedoraproject.org/fedora-blocker-review/2020-04-06/f32-blocker-review.2020-04-06-16.00.log.html

Comment 20 David Ober 2020-04-07 19:35:54 UTC
So I tested yesterday using the 5.6.1-9000 image. Without Wayland I did not see any issues; with Wayland, it locked up sometime during the night. I installed the 5.6.2-9000 kernel this morning and have been running Wayland for 8 hours without seeing any issues. Also of note: my P53 does not have a T1000 GPU; it has an RTX-5000 GPU. I mention this because an earlier post suggested the issue was limited to the T1000.

Comment 21 Karol Herbst 2020-04-07 19:56:35 UTC
(In reply to David Ober from comment #20)
> So I tested yesterday using the 5.6.1-9000 image and without wayland I did
> not see any issues with wayland sometime during the night last night it
> locked up.  I install the 5.6.2-9000 kernel this morning and have been
> running using wayland for 8 hours and have not seen any issues.  Also of
> note my P53 does not have a T1000 GPU it has an RTX-5000 GPU since I saw in
> an earlier post thinking it was limited to the T1000

Hmm, both builds should contain the same fixes, but there could still be some other issues around, or maybe it's the same issue and it just triggers less often.

But in the end it's a significant improvement over the stock kernel, correct?

Comment 22 František Zatloukal 2020-04-07 20:51:30 UTC
(In reply to Lyude from comment #16)
> Please reconsider this as blocker material, there's a lot of really good
> reasons we're pushing to try to get this fix in so late in the cycle (it has
> taken a -while- to get upstream to consider these fixes at all, otherwise we
> would have done this much earlier).

You can repropose it as a blocker, but it's an Accepted Freeze Exception. That means if there is a fix available, it'll get pulled in as if it were a blocker (even after the freeze). It just means that the F32 release wouldn't be delayed because of this. I think that is appropriate, and I am not going to change my vote here.

Comment 23 Adam Williamson 2020-04-09 00:27:27 UTC
Re-assigning to kernel, since it seems apparent this is a kernel issue. Karol, Justin, Jeremy (the kernel maintainers), can you work together to get a kernel build with whatever patches are considered to be the "fixes for this bug"? Thanks!

Comment 24 Karol Herbst 2020-04-09 09:46:46 UTC
(In reply to Adam Williamson from comment #23)
> Re-assigning to kernel, since it seems apparent this is a kernel issue.
> Karol, Justin, Jeremy (the kernel maintainers), can you work together to get
> a kernel build with whatever patches are considered to be the "fixes for
> this bug"? Thanks!

already done and according to Mark the fixes seem to help (I have a copr with those patches applied), but let's wait for Mark to give us a proper response :)

Comment 25 Adam Williamson 2020-04-09 14:29:01 UTC
Karol: I meant an official Fedora kernel build. A COPR is great but we can't ship that. :)

Comment 26 Karol Herbst 2020-04-09 14:42:58 UTC
(In reply to Adam Williamson from comment #25)
> Karol: I meant an official Fedora kernel build. A COPR is great but we can't
> ship that. :)

I know, I just meant we already added the patches to the fedora kernel package, the testing was just done based on the copr. So an update should be available soon at least in Fedora 32.

Comment 27 David Ober 2020-04-09 14:47:43 UTC
I have been running the 5.6.2-9000 kernel for 24 hours now, and it appears to be stable.

Comment 29 František Zatloukal 2020-04-09 15:39:04 UTC
David, can you check whether kernel-5.6.2-301 fixes the issues too?

Thanks!

Comment 30 Adam Williamson 2020-04-09 15:52:18 UTC
That update is actually stable now. Setting this bug to ON_QA for confirmation. Can folks with affected systems please test with that update (5.6.2-301) and confirm whether their issues are fixed? Thanks.

Comment 31 Adam Williamson 2020-04-14 16:08:55 UTC
Ping? Anyone?

Comment 32 Mark Pearson 2020-04-14 16:15:50 UTC
I'll chase folk down for updates on our side. I've not heard of any stability issues recently....hopefully saying that doesn't jinx it and we can close this out

Comment 33 David Ober 2020-04-14 20:15:03 UTC
I have been running the 5.6.3-300 kernel for several days now with no issues to report.

Comment 34 Adam Williamson 2020-04-14 20:25:11 UTC
OK, let's count that as confirmation then, and close this. Thanks.

