Bug 1663440
| Summary: | [Nvidia, EGLStream] Lost GL context after resuming not handled | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jonas Ådahl <jadahl> |
| Component: | mutter | Assignee: | Jonas Ådahl <jadahl> |
| Status: | CLOSED WONTFIX | QA Contact: | Desktop QE <desktop-qa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 | CC: | fmuellner, jkoten, knutjbj, rstrode, tpelka, tpopela, wchadwic |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 8.0 | Flags: | rule-engine: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | gnome-shell-3.28.3-6.el8 mutter-3.28.3-17.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-01 07:31:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1681803 | | |
| Bug Blocks: | 1657660, 1701002, 1739559 | | |
| Attachments: | | | |

Description
Jonas Ådahl
2019-01-04 10:55:34 UTC
Created attachment 1520818 [details]
mutter changes
So this is the mutter half of the changes that seem to fix things up okay. There's going to be a shell half, too.
We do some redraws on VT switch with this patchset that don't seem to be strictly necessary. They're harmless, though. https://bugzilla.gnome.org/show_bug.cgi?id=739178 suggests it might be necessary in some cases to redraw on VT switch, but if it is, then there are some redraws we aren't doing on VT switch that we should be. For now, I'm just leaving it as-is, but we may want to shore it up one way or the other.
Created attachment 1520820 [details]
shell changes
This is the shell half of the changes.
It already has some code in place to try to deal with FBO corruption on X11, but not texture corruption on wayland. This patchset makes the same code run on wayland, and also makes it rebuild the background textures.
There's also a patch in here to clear the texture cache.
There are rendering corruptions in gnome-shell after resume: text in the top bar and in window decoration titlebars. The app windows and their content seem to render fine. Also, after the second resume I only got a blank screen. I don't see anything in dmesg and there is no crash.
A possibly related error in the logs comes from gsd-power:
T60 gsd-power[2382]: Error setting property 'PowerSaveMode' on interface org.gnome.Mutter.DisplayConfig: Timeout was reached (g-io-error-quark, 24)
mutter-3.28.3-14.el8
Created attachment 1528167 [details]
screenshot 1
Created attachment 1528173 [details]
journal.log
it looks like the glyph cache isn't getting purged correctly. will investigate
how are you initiating suspend?
(In reply to Ray Strode [halfline] from comment #19)
> how are you initiating suspend?
I click the Suspend button in the User Menu.
can't reproduce here... i wonder if logind is crashing. can you get a pid listing of logind before and after the failure?
so I still can't reproduce exactly, but I think I may have figured out the problem.
i forced logind to stop using kill and suspend and resumed. This was enough to make the login screen hang (even though i wasn't at the login screen).
Looking into the issue, we're closing file descriptor 0 and then it's getting reused for a timerfd. file descriptor 0 and it seems to be causing confusion.
A rebuild of glibc with debug enabled and a little gdb debugging later showed that the culprit is this commit:
https://gitlab.gnome.org/GNOME/mutter/commit/53d63ea72b79d60a0b7b094de24606d12d44e1e2#92a5eefceb207942e81c84183f57c7049188c8c9_587_625
It's immediately clear the problem. I'm using the gvariant handle as if it's fd, but it's actually supposed to be used as an index into an fd list. This mistake is because the xml definition is missing the annotation necessary to tell the call to return the fd list.
I'll push a fix shortly that should hopefully resolve that problem, though I'm not completely sure this will fix the text corruption you're seeing. We'll see I guess.
(In reply to Ray Strode [halfline] from comment #22)
Rereading comment 22, there are enough typos and unclear parts, that I want to clarify...
> i forced logind to stop using kill and suspend and resumed.
To be clear, I forced logind to stop using kill -STOP on the logind process, then suspended and resumed.
> This was enough to make the login screen hang (even though i wasn't at the login screen).
To be clear, I suspended from the logged in user session, but I noticed the login screen session is what hung.
> Looking into the issue, we're closing file descriptor 0 and then it's
> getting reused for a timerfd. file descriptor 0 and it seems to be causing
> confusion.
I meant to say "file descriptor 0 is special, since it's used as stdin, and I believe that is making the code get confused". But it could also just be confusion caused by the inhibition code closing the fd out from under gnome-wall-clock, since they both think they own fd 0, when really neither should (since it's supposed to be set to /dev/null).
Anyway, the fix is building now, marking MODIFIED for QE.
If this reopens again, we should consider marking it a blocker, since bug 1657660 is a blocker and it relies on this. Alternatively, if the blocker bar is too high for this bug, we should stop making bug 1657660 a blocker as well.
*** Bug 1674584 has been marked as a duplicate of this bug. ***
The gnome-shell hang is fixed. The rendering issues still remain. Switching to a VT and back helps to redraw the screen correctly.
thanks for the quick turn around. can you give the output of ps -ef before and after? i want to see if suspend is somehow making logind crash.
another thing, can you post the output of:
$ loginctl
$ loginctl show-session
both before and when the screen is scrambled ?
Created attachment 1534121 [details]
explicitly pause on suspend
i have a theory for what's going wrong.
On my machine the kernel forces a vt switch to the same vt it's already on during suspend. we rely on the signal this generates for restoring some of the corruption.
It seems that the kernel doesn't universally do this, though. I think the above patch should fix the problem for you.
scratch build here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20179096
(In reply to Ray Strode [halfline] from comment #28)
> another thing, can you post the output of:
>
> $ loginctl
> $ loginctl show-session
>
> both before and when the screen is scrambled ?
The output of the commands is located at http://nest.test.redhat.com/mnt/qa/scratch/jkoten/nvidia/
The first resume went fine, there were no visible corruptions, so I did a second suspend/resume, where I ended up with a scrambled screen.
*B* stands for before suspend
*A* stands for after the first resume
*A2* stands for after the second resume
(In reply to Ray Strode [halfline] from comment #29)
> Created attachment 1534121 [details]
> explicitly pause on suspend
>
> i have a theory for what's going wrong.
>
> On my machine the kernel forces a vt switch to the same vt it's already on
> during suspend. we rely on the signal this generates for restoring some of
> the corruption.
>
> It seems that the kernel doesn't universally do this, though. I think the
> above patch should fix the problem for you.
>
> scratch build here:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20179096
Didn't help. There are still some rendering issues: in window decorations and also on the lock screen, which shows black instead of the proper background.
Created attachment 1534135 [details]
screenshot
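The file descriptor mix-up described earlier in this thread (using the GVariant handle as if it were an fd, when it is really an index into an fd list) can be modelled with a short stdlib-only sketch. This is not the mutter code and does not use the GLib API; the function names and the pipe standing in for the inhibitor descriptor are hypothetical, chosen only to show why a handle of 0 ends up pointing at stdin.

```python
import os


def fd_from_handle_buggy(handle, fd_list):
    """The mistake described in the thread: use the handle value as the fd."""
    return handle


def fd_from_handle(handle, fd_list):
    """Treat the handle as an index into the fd list and return a private copy."""
    return os.dup(fd_list[handle])


def demo():
    # Hypothetical stand-in for a method reply: the pipe's write end plays
    # the role of the descriptor that arrives out of band in the fd list.
    read_end, write_end = os.pipe()
    handle, fd_list = 0, [write_end]

    buggy = fd_from_handle_buggy(handle, fd_list)  # 0, i.e. stdin's number
    good = fd_from_handle(handle, fd_list)

    # Data written through the correctly resolved descriptor reaches the pipe.
    os.write(good, b"ok")
    received = os.read(read_end, 2)

    # Close only descriptors this sketch owns. Closing `buggy` would close
    # fd 0, which is the behaviour the thread identifies as the problem.
    for fd in (good, write_end, read_end):
        os.close(fd)
    return buggy, received
```

With a single descriptor in the list the handle is 0, so the buggy variant hands back fd 0 and any later close() lands on a descriptor the caller never owned.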
Created attachment 1534147 [details]
dmesg
[ 57.729896] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[ 57.729901] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[ 57.729905] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[ 57.729936] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0020, Class 0000a097, Offset 000023a8, Data 00000000
[ 57.762124] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[ 57.762131] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[ 57.762134] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[ 57.762165] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0018, Class 0000a097, Offset 00001614, Data 00000000
[ 57.795322] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[ 57.795327] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[ 57.795331] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[ 57.795363] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0010, Class 0000a097, Offset 000023a8, Data 00000000
[ 57.806737] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[ 57.806742] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[ 57.806746] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[ 57.806775] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0008, Class 0000a097, Offset 00001614, Data 00000000
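For triage, the Xid lines above can be tallied with a small script. The sketch below is only an illustration: the regular expressions are written against the line format shown in this dmesg excerpt, and the function name and the shortened sample are assumptions, not part of any NVIDIA tooling.

```python
import re
from collections import Counter

# Patterns derived from the dmesg excerpt in this report.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<msg>.*)"
)
CHID_RE = re.compile(r"ChID (?P<chid>[0-9a-fA-F]+)")


def summarize_xid(lines):
    """Count Xid codes and the channel IDs named in their messages."""
    xids = Counter()
    channels = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue  # not an NVRM Xid line
        xids[int(match.group("xid"))] += 1
        chid = CHID_RE.search(match.group("msg"))
        if chid is not None:
            channels[chid.group("chid")] += 1
    return xids, channels


# Three lines copied from the log above, used as a shortened sample.
SAMPLE = """\
[ 57.729896] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[ 57.729936] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0020, Class 0000a097, Offset 000023a8, Data 00000000
[ 57.762165] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0018, Class 0000a097, Offset 00001614, Data 00000000
""".splitlines()

xid_counts, channel_counts = summarize_xid(SAMPLE)
```

Feeding the full dmesg output through summarize_xid shows how many distinct channels raised the exception after resume.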
i think i'm going to give you a scratch build with some debug logging added to see what code is getting run when on your system.
scratch build is here: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20190197
can you try it and post full journalctl -b output?
Created attachment 1534451 [details]
journalctl -b output
This time I tested with a different GPU, GP106GL [Quadro P2000], just to see whether the problem is hardware-specific.
It was definitely harder to reproduce.
The first time, gnome-shell actually crashed.
The second time, the rendering corruption appeared only on the unlock screen (a grey background), but the user session was actually fine.
The third time, I finally hit the window decoration corruption that I saw previously, on the second suspend/resume.
There was a reboot between the three test runs, and the attached journal is from the last one.
Created attachment 1534452 [details]
coredump gnome-shell
coredumpctl output of the gnome-shell crash; it seems to be a random one.
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.