Bug 1663440

Summary: [Nvidia, EGLStream] Lost GL context after resuming not handled
Product: Red Hat Enterprise Linux 8
Reporter: Jonas Ådahl <jadahl>
Component: mutter
Assignee: Jonas Ådahl <jadahl>
Status: CLOSED WONTFIX
QA Contact: Desktop QE <desktop-qa-list>
Severity: high
Docs Contact:
Priority: unspecified
Version: 8.0
CC: fmuellner, jkoten, knutjbj, rstrode, tpelka, tpopela, wchadwic
Target Milestone: rc
Keywords: Triaged
Target Release: 8.0
Flags: rule-engine: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: gnome-shell-3.28.3-6.el8 mutter-3.28.3-17.el8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-01 07:31:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1681803
Bug Blocks: 1657660, 1701002, 1739559
Attachments:
Description: Flags
mutter changes: none
shell changes: none
screenshot 1: none
journal.log: none
explicitly pause on suspend: none
screenshot: none
dmesg: none
journalctl -b output: none
coredump gnome-shell: none

Description Jonas Ådahl 2019-01-04 10:55:34 UTC
Description of problem:

When the GL context is lost, it's expected that everything is restored. This covers the shell chrome as well as client buffers.

Under X11 we handle this by restarting the compositor. This is not possible with Wayland.

How reproducible:


Steps to Reproduce:
1. Log in
2. Suspend computer
3. Resume

Actual results:

The screen content consists of black and white squares


Expected results:

The login screen is properly shown


Additional info:

Comment 7 Ray Strode [halfline] 2019-01-15 18:05:43 UTC
Created attachment 1520818 [details]
mutter changes

So this is the mutter half of the changes that seem to fix things up okay.  There's going to be a shell half, too.

With this patchset we do some redraws on VT switch that don't seem to be strictly necessary. They're harmless though.

https://bugzilla.gnome.org/show_bug.cgi?id=739178 suggests that redrawing on VT switch might be necessary in some cases, but if it is, then there are some redraws we aren't doing on VT switch that we should be.

For now, I'm just leaving it as-is, but we may want to shore it up one way or the other.

Comment 8 Ray Strode [halfline] 2019-01-15 18:08:06 UTC
Created attachment 1520820 [details]
shell changes

This is the shell half of the changes.

It already has some code in place to try to deal with FBO corruption on X11, but not texture corruption on Wayland.  This patchset makes the same code run on Wayland, and also makes it rebuild the background textures.

There's also a patch in here to clear the texture cache.

Comment 14 Jiri Koten 2019-02-08 16:34:15 UTC
There is rendering corruption in gnome-shell after resume: text in the top bar and in window decoration title bars. The app windows and their content seem to render fine.

Also, after the second resume I only got a blank screen. I don't see anything in dmesg and there is no crash. A possibly related error in the logs comes from gsd-power:

T60 gsd-power[2382]: Error setting property 'PowerSaveMode' on interface org.gnome.Mutter.DisplayConfig: Timeout was reached (g-io-error-quark, 24)

mutter-3.28.3-14.el8

Comment 15 Jiri Koten 2019-02-08 16:34:52 UTC
Created attachment 1528167 [details]
screenshot 1

Comment 16 Jiri Koten 2019-02-08 16:38:49 UTC
Created attachment 1528173 [details]
journal.log

Comment 17 Ray Strode [halfline] 2019-02-08 16:42:26 UTC
it looks like the glyph cache isn't getting purged correctly.  will investigate

Comment 19 Ray Strode [halfline] 2019-02-11 18:29:25 UTC
how are you initiating suspend?

Comment 20 Jiri Koten 2019-02-11 19:11:40 UTC
(In reply to Ray Strode [halfline] from comment #19)
> how are you initiating suspend?

I click the Suspend button in the User Menu.

Comment 21 Ray Strode [halfline] 2019-02-11 20:00:51 UTC
can't reproduce here...

i wonder if logind is crashing.  can you get a pid listing of logind before and after the failure?

Comment 22 Ray Strode [halfline] 2019-02-11 22:42:57 UTC
so I still can't reproduce exactly, but I think I may have figured out the problem.

i forced logind to stop using kill and suspend and resumed.  This was enough to make the login screen hang (even though i wasn't at the login screen).

Looking into the issue, we're closing file descriptor 0 and then it's getting reused for a timerfd.  file descriptor 0 and it seems to be causing
confusion.

A rebuild of glibc with debug enabled and a little gdb debugging later showed that the culprit is this commit:

https://gitlab.gnome.org/GNOME/mutter/commit/53d63ea72b79d60a0b7b094de24606d12d44e1e2#92a5eefceb207942e81c84183f57c7049188c8c9_587_625

The problem is immediately clear: I'm using the GVariant handle as if it were an fd, but it's actually supposed to be used as an index into an fd list.

The mistake happened because the XML definition is missing the annotation needed to make the call return the fd list.

I'll push a fix shortly that should hopefully resolve that problem, though I'm not completely sure this will fix the text corruption you're seeing.

We'll see I guess.
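
For illustration, a minimal sketch of the handle-versus-descriptor mix-up described above. It is a plain-Python simulation, not the mutter code: the function name `resolve_handle` and the descriptor number 7 are hypothetical. In GLib itself the lookup is done with g_unix_fd_list_get() on the fd list returned alongside the reply.

```python
def resolve_handle(fd_list, handle):
    """Return the descriptor that a D-Bus 'h' handle refers to.

    The handle is an index into the fd list that travels with the
    message, not a descriptor number in the receiving process.
    """
    if not 0 <= handle < len(fd_list):
        raise IndexError(f"handle {handle} is outside the fd list")
    return fd_list[handle]


# Hypothetical reply: one descriptor (number 7) attached, referenced by handle 0.
fd_list = [7]
handle = 0

wrong_fd = handle                           # bug: index used as an fd, i.e. stdin
right_fd = resolve_handle(fd_list, handle)  # fix: look the index up in the list

print(wrong_fd, right_fd)
```

With a single attached descriptor the handle is 0, so the buggy path always ends up operating on file descriptor 0.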

Comment 23 Ray Strode [halfline] 2019-02-11 22:58:58 UTC
(In reply to Ray Strode [halfline] from comment #22)
Rereading comment 22, there are enough typos and unclear parts, that I want
to clarify...

> i forced logind to stop using kill and suspend and resumed.
To be clear, I forced logind to stop using kill -STOP on the logind process, then suspended and resumed.

> This was enough to make the login screen hang (even though i wasn't at the login screen).
To be clear, I suspended from the logged in user session, but I noticed the login screen session is what hung.

> Looking into the issue, we're closing file descriptor 0 and then it's
> getting reused for a timerfd.  file descriptor 0 and it seems to be causing
> confusion.
I meant to say "file descriptor 0 is special, since it's used as stdin, and
I believe that is making the code get confused".
 
But it could also just be confusion caused by the inhibition code closing the fd out from under
gnome-wall-clock, since they both think they own fd 0, when really neither should (since it's supposed to
be set to /dev/null).
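
As a rough sketch of why a closed descriptor number gets handed out again: the kernel gives each newly opened file the lowest free descriptor number, so whoever opens a file next inherits the number that was just closed. The example uses an ordinary descriptor on os.devnull rather than fd 0 or a timerfd, and the function name `reopen_after_close` is made up for illustration.

```python
import os


def reopen_after_close(path=os.devnull):
    """Close a descriptor, open a new one, and return both numbers."""
    first = os.open(path, os.O_RDONLY)   # one component opens "its" fd
    os.close(first)                      # ...and later closes it
    second = os.open(path, os.O_RDONLY)  # another component opens something new
    os.close(second)
    return first, second


first, second = reopen_after_close()
print(first, second)  # both numbers are the same
```

Any code that still holds the old number after the close now silently operates on the new owner's file, which is the kind of shared ownership described above.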

Anyway fix is building now, marking MODIFIED for QE.

If this reopens again, we should consider marking it a blocker, since bug 1657660 is a blocker and it relies on this.  Alternatively, if the blocker bar is too high for this bug, we should stop making bug 1657660 a blocker as well.

Comment 25 Jiri Koten 2019-02-12 13:08:20 UTC
*** Bug 1674584 has been marked as a duplicate of this bug. ***

Comment 26 Jiri Koten 2019-02-12 13:11:06 UTC
The gnome-shell hang is fixed. The rendering issues still remain. Switching to a VT and back helps to redraw the screen correctly.

Comment 27 Ray Strode [halfline] 2019-02-12 13:22:52 UTC
thanks for the quick turn around.  can you give the output of ps -ef before and after?  i want to see if suspend is somehow making logind crash.
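
One possible way to compare the two listings, sketched in Python: pull the systemd-logind PID out of each `ps -ef` capture and check whether it changed, which would mean the daemon was restarted in between. The sample listings and the helper name `logind_pids` are invented for illustration and are not output from the affected machine.

```python
def logind_pids(ps_output):
    """Return the PIDs of systemd-logind found in `ps -ef` output."""
    pids = set()
    for line in ps_output.splitlines()[1:]:    # skip the header row
        fields = line.split(None, 7)           # UID PID PPID C STIME TTY TIME CMD
        if len(fields) == 8 and "systemd-logind" in fields[7]:
            pids.add(int(fields[1]))
    return pids


# Hypothetical captures taken before suspend and after resume.
before = """UID PID PPID C STIME TTY TIME CMD
root 812 1 0 09:00 ? 00:00:00 /usr/lib/systemd/systemd-logind
gdm 1500 1 0 09:00 tty1 00:00:03 /usr/bin/gnome-shell
"""
after = """UID PID PPID C STIME TTY TIME CMD
root 2290 1 0 09:12 ? 00:00:00 /usr/lib/systemd/systemd-logind
gdm 1500 1 0 09:00 tty1 00:00:05 /usr/bin/gnome-shell
"""

restarted = logind_pids(before) != logind_pids(after)
print(restarted)
```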

Comment 28 Ray Strode [halfline] 2019-02-12 14:27:38 UTC
another thing, can you post the output of: 

$ loginctl
$ loginctl show-session

both before and when the screen is scrambled ?

Comment 29 Ray Strode [halfline] 2019-02-12 15:51:51 UTC
Created attachment 1534121 [details]
explicitly pause on suspend

i have a theory for what's going wrong.

On my machine the kernel forces a vt switch to the same vt it's already on during suspend.  we rely on the signal this generates for restoring some of the corruption.

It seems that the kernel doesn't universally do this, though.  I think the above patch should fix the problem for you.

scratch build here:  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20179096

Comment 30 Jiri Koten 2019-02-12 16:19:49 UTC
(In reply to Ray Strode [halfline] from comment #28)
> another thing, can you post the output of: 
> 
> $ loginctl
> $ loginctl show-session
> 
> both before and when the screen is scrambled ?

The output of the commands is located at http://nest.test.redhat.com/mnt/qa/scratch/jkoten/nvidia/

The first resume went fine with no visible corruption, so I did a second suspend/resume, which ended with a scrambled screen.

*B* stands for before suspend
*A* stands for after the first resume
*A2* stands for after the second resume

Comment 31 Jiri Koten 2019-02-12 16:26:25 UTC
(In reply to Ray Strode [halfline] from comment #29)
> Created attachment 1534121 [details]
> explicitly pause on suspend
> 
> i have a theory for what's going wrong.
> 
> On my machine the kernel forces a vt switch to the same vt it's already on
> during suspend.  we rely on the signal this generates for restoring some of
> the corruption.
> 
> It seems that the kernel doesn't universally do this, though.  I think the
> above patch should fix the problem for you.
> 
> scratch build here: 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20179096

Didn't help. There are still some rendering issues: in the window decorations, and also on the lock screen, which is black instead of showing the proper background.

Comment 32 Jiri Koten 2019-02-12 16:26:57 UTC
Created attachment 1534135 [details]
screenshot

Comment 33 Jiri Koten 2019-02-12 16:30:18 UTC
Created attachment 1534147 [details]
dmesg

[   57.729896] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[   57.729901] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[   57.729905] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[   57.729936] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0020, Class 0000a097, Offset 000023a8, Data 00000000
[   57.762124] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[   57.762131] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[   57.762134] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[   57.762165] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0018, Class 0000a097, Offset 00001614, Data 00000000
[   57.795322] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[   57.795327] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[   57.795331] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[   57.795363] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0010, Class 0000a097, Offset 000023a8, Data 00000000
[   57.806737] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 11 Error
[   57.806742] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[   57.806746] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[   57.806775] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0008, Class 0000a097, Offset 00001614, Data 00000000
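
To summarise a log like this, a small parser can count the fault records per channel. This is only a sketch: the regular expression is written against the line format seen in this excerpt, and the names `XID_CHANNEL` and `faults_per_channel` are illustrative, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches only the "ChID ..." record of each fault, as formatted in this dmesg.
XID_CHANNEL = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-f:]+)\): (?P<xid>\d+), "
    r"Graphics Exception: ChID (?P<chid>[0-9a-f]+), "
    r"Class (?P<cls>[0-9a-f]+), Offset (?P<offset>[0-9a-f]+)"
)


def faults_per_channel(dmesg_text):
    """Count 'ChID' fault records per (xid, channel id) pair."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_CHANNEL.search(line)
        if match:
            counts[(int(match["xid"]), match["chid"])] += 1
    return counts


sample = """\
[   57.729905] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa2040800
[   57.729936] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0020, Class 0000a097, Offset 000023a8, Data 00000000
[   57.762165] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0018, Class 0000a097, Offset 00001614, Data 00000000
"""
counts = faults_per_channel(sample)
print(dict(counts))
```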

Comment 34 Ray Strode [halfline] 2019-02-13 14:40:10 UTC
i think i'm going to give you a scratch build with some debug logging added, to see which code is getting run, and when, on your system.

Comment 35 Ray Strode [halfline] 2019-02-13 15:06:38 UTC
scratch build is here:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20190197

can you try it and post full

journalctl -b

output?

Comment 36 Jiri Koten 2019-02-13 16:42:48 UTC
Created attachment 1534451 [details]
journalctl -b output

This time I tested with a different GPU, a GP106GL [Quadro P2000], just to see whether the issue is hardware specific.

It was definitely harder to reproduce.
The first time, gnome-shell actually crashed.
The second time, the rendering corruption was only on the unlock screen (a grey background), but the user session was actually fine.
The third time, I finally hit the window decoration corruption that I saw previously, on the second suspend/resume.

There was a reboot between the three test runs and the attached journal is from the last one.

Comment 37 Jiri Koten 2019-02-13 16:43:48 UTC
Created attachment 1534452 [details]
coredump gnome-shell

coredumpctl output of the gnome-shell crash; it seems to be a random one.

Comment 47 RHEL Program Management 2021-02-01 07:31:41 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.