Bug 1272737

Summary: Black screen after logout
Product: [Fedora] Fedora
Reporter: Giulio 'juliuxpigface' <juliux.pigface>
Component: cogl
Assignee: Peter Robinson <pbrobinson>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 23
CC: awilliam, fmuellner, juliux.pigface, kparal, lbrabec, lgw0619, pbrobinson, robatino, rstrode
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: AcceptedBlocker
Fixed In Version: cogl-1.22.0-2.fc23
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-24 12:24:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1170821
Attachments:
Description Flags
Output of 'journalctl -u gdm -b -2 | tail'
none
gdm traceback from gdb
none
system journal
none
rpm -qa
none
full log of an affected boot (std/VNC)
none
journal with debug gdm info
none
log with GDM debugging enabled
none
hang on bare metal - journal
none
hang journal (thinkpad x201i)
none
rpm -qa (thinkpad x201i)
none

Description Giulio 'juliuxpigface' 2015-10-18 11:06:09 UTC
Created attachment 1084113 [details]
Output of 'journalctl -u gdm -b -2 | tail'

Description of problem:
When closing a Gnome session, gdm doesn't pop up and the screen becomes black.
Switching to other consoles (CTRL+ALT+F*) isn't possible either.

Moreover, this seems to be reproducible only on qemu-kvm guests and not on bare metal (thanks kalev for this information!).

Version-Release number of involved components:
gdm-3.18.0-1.fc23.x86_64
spice-vdagent-0.16.0-2.fc23.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Log in through gdm to a standard Gnome session.
2. Log out from the top-right menu.

Actual results:
The screen becomes black, but the system isn't entirely stuck.

Expected results:
Gdm should be accessible after a logout.

Additional info:
I'm attaching journalctl's log.

Comment 1 Fedora Blocker Bugs Application 2015-10-18 11:12:58 UTC
Proposed as a Blocker and Freeze Exception for 23-final by Fedora user juliuxpigface using the blocker tracking app because:

 Even if this seems to work on bare metal, it is (in my opinion) worth discussing at least for FE.

The release criterion involved is "2.6.6 Shutdown, reboot, logout" from Beta's set.

"Shutting down, logging out and rebooting must work using standard console commands and the mechanisms offered (if any) by all release-blocking desktops. [...] Logging out must return the user to the environment from which they logged in, working as expected."

https://fedoraproject.org/wiki/Fedora_23_Beta_Release_Criteria#Shutdown.2C_reboot.2C_logout

Comment 2 Kamil Páral 2015-10-19 13:32:33 UTC
I can reproduce this, but it's a race condition. If I try to log out several times, I hit this issue quite soon (e.g. 3rd or 5th logout, sometimes 1st). The system still works, but you need to connect remotely; local access is completely frozen.

I tried only VM, not sure if it affects bare metal as well.

Comment 3 Kamil Páral 2015-10-19 13:33:56 UTC
Created attachment 1084409 [details]
gdm traceback from gdb

I'm not sure how to tell which process is stuck. I retrieved a traceback from gdm using gdb, in the hope that it's the right process.

Comment 4 Kamil Páral 2015-10-19 13:37:04 UTC
Created attachment 1084412 [details]
system journal

Comment 5 Kamil Páral 2015-10-19 13:37:22 UTC
Created attachment 1084413 [details]
rpm -qa

Comment 6 Adam Williamson 2015-10-19 15:23:36 UTC
Yeah, like Kamil I found I couldn't hit this every time, but quite easily. I also tested on bare metal and did not hit it one time, so it may be specific to VMs (or slow systems or something). I will try with a non-SPICE VM shortly.

Comment 7 Adam Williamson 2015-10-19 16:16:16 UTC
Hit this with the std/VNC test. First logout attempt: https://bugzilla.redhat.com/show_bug.cgi?id=1273112 . Second logout attempt worked. Third logout attempt (after logging in again) hit the black screen.

Comment 8 Adam Williamson 2015-10-19 16:22:12 UTC
Created attachment 1084455 [details]
full log of an affected boot (std/VNC)

Here's a full journal from my affected test boot - this will include g-i-s user creation, initial login, first (failed back to desktop) logout attempt, second (successful) logout attempt, and third (failed to black screen) logout attempt. I let it sit at the black screen for a while then shut down via the VM's 'power button', which seemed to trigger a clean shutdown.

Comment 9 Adam Williamson 2015-10-19 17:27:51 UTC
This seems to be Wayland-specific - neither kparal nor I can reproduce it with Wayland disabled. OTOH, I can still reproduce after bumping my VM to 4GB RAM.

Comment 10 Kamil Páral 2015-10-19 17:31:42 UTC
For the record, I also tried installing systemd 227.fc24 on an F23 machine (using wayland gdm), and I could still reproduce this. So this doesn't seem to be fixed by a recent systemd version.

Comment 11 Kamil Páral 2015-10-19 17:56:49 UTC
Created attachment 1084468 [details]
journal with debug gdm info

Comment 12 Adam Williamson 2015-10-19 17:57:09 UTC
Created attachment 1084469 [details]
log with GDM debugging enabled

This is a log with GDM debugging enabled. This is a straightforward boot, log in, reproduce on first logout attempt: there should be nothing else in the logs.

Comment 13 Adam Williamson 2015-10-19 19:59:09 UTC
Discussed at 2015-10-19 blocker review meeting: https://meetbot.fedoraproject.org/fedora-blocker-review/2015-10-19/f23-blocker-review.2015-10-19-16.00.html . This is a hard call with the limited data we have available, but for now we rejected it as a blocker but accepted it as a freeze exception. It's clearly a problem for affected systems, and we have the criterion for a reason, but it doesn't happen *every* time, and it doesn't affect many systems at all. Affected systems can be fixed with an update, and can also work around the bug by disabling GDM-on-Wayland (which we can document).

So far we're more or less working on the assumption this is a timing issue of some kind and most likely to be encountered on slower systems; we have reproduced it in several VM configurations and on a bare metal ARM system, but have not reproduced it on any bare metal Intel system yet.

If further testing (or discovery of the actual code cause of the bug) indicates the impact of this is broader than currently thought, it can be re-considered for blocker status.

Comment 14 Kamil Páral 2015-10-20 10:21:36 UTC
Created attachment 1084673 [details]
hang on bare metal - journal

I reproduced this on a bare metal machine. Once I had to log out 10-15 times, once it happened on the 2nd logout. The screen is black with a cursor shown on it (can't be moved). (Cursor is not shown in VMs because of bug 1273247).

Comment 15 Kamil Páral 2015-10-20 10:24:43 UTC
I'm re-proposing this for a blocker discussion due to comment 14.

Comment 16 Adam Williamson 2015-10-20 16:15:32 UTC
what was the system in question?

Comment 17 Adam Williamson 2015-10-20 16:17:38 UTC
hmm, let's see if I can get it from the logs:

M5A97 PRO (that's an AMD AM3+ board)
AMD FX(tm)-4100 Quad-Core Processor (yup!)
Memory: 8097720K/8333760K available

so, that's a fairly mid-spec system. Ray, have you got anywhere with this one?

Comment 18 Adam Williamson 2015-10-20 16:19:31 UTC
Oh, also looks like kparal's test box has an 80GB SSD and a 500GB HDD:

Oct 20 12:14:36 dhcp-28-122.brq.redhat.com kernel: scsi 2:0:0:0: Direct-Access     ATA      INTEL SSDSC2CT08 335u PQ: 0 ANSI: 5
Oct 20 12:14:36 dhcp-28-122.brq.redhat.com kernel: scsi 3:0:0:0: Direct-Access     ATA      ST500DM002-1BD14 KC45 PQ: 0 ANSI: 5

Comment 19 Adam Williamson 2015-10-20 18:09:43 UTC
I tried again to reproduce this on my bare metal test box, could not (with ~20 loops).

Comment 20 Kamil Páral 2015-10-21 06:54:27 UTC
Thanks Adam, that's exactly this system. F23 was installed to the SSD, HDD was unused, in case that information is useful.

Comment 21 Lukas Brabec 2015-10-21 11:40:39 UTC
After several attempts, I finally reproduced this bug (twice out of ~20 attempts) on my thinkpad x201i with:
Core i3 M370
00:02.0 VGA compatible controller [0300]: Intel Corporation Core Processor Integrated Graphics Controller [8086:0046] (rev 02)
4GB RAM and HDD.

Comment 22 Lukas Brabec 2015-10-21 11:44:18 UTC
Created attachment 1085096 [details]
hang journal (thinkpad x201i)

Comment 23 Lukas Brabec 2015-10-21 11:44:46 UTC
Created attachment 1085097 [details]
rpm -qa (thinkpad x201i)

Comment 24 Adam Williamson 2015-10-21 19:08:54 UTC
Funnily, in RC2 testing I haven't managed to hit this yet, instead I'm hitting https://bugzilla.redhat.com/show_bug.cgi?id=1273247 - I keep getting a missing cursor on logout, but I haven't managed to hit the black screen yet.

Comment 25 Ray Strode [halfline] 2015-10-21 20:05:31 UTC
I think I figured this out.  cogl will stop doing page flipping and revert to manual drmModeSetCrtc calls if the drmModePageFlip call fails. Doing that will cause tearing, but it provides a compatibility path for drivers that don't support page flipping (mga, for instance).  drmModePageFlip can fail under normal conditions, though, if it's called when using a non-render-node (as is the case for us at the moment) and the VT is inactive.  In this case, it will return an error code of EACCES.  We try to detect this special case and leave page flipping enabled, so the user doesn't start seeing tearing after one of these "expected" failures.  The code isn't quite right, though.  It not only ignores the error, but it proceeds as if the page flip request succeeded, and sits and waits for the flip event to come back. Since the page flip request didn't succeed, the event never comes back.

Will build a fix.
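
The failure mode described above can be sketched in C. This is a minimal model under stated assumptions, not cogl's actual code: `fake_page_flip`, `FlipState`, `queue_flip_buggy`, `queue_flip_fixed`, and `flip_event_received` are hypothetical names, and the stub only mimics drmModePageFlip failing with EACCES while the VT is inactive.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the real page-flip request: it fails with
 * -EACCES while the VT is inactive, as described in the comment above. */
int fake_page_flip(bool vt_active)
{
    return vt_active ? 0 : -EACCES;
}

typedef struct {
    bool page_flipping_enabled; /* false: fall back to manual mode sets */
    int  pending_flips;         /* flip events we still expect to arrive */
} FlipState;

/* Buggy handling: EACCES is correctly treated as "expected" (page
 * flipping stays enabled), but the flip is still counted as pending. */
void queue_flip_buggy(FlipState *s, bool vt_active)
{
    int ret = fake_page_flip(vt_active);

    if (ret != 0 && ret != -EACCES)
        s->page_flipping_enabled = false; /* genuine failure: fall back */

    if (s->page_flipping_enabled)
        s->pending_flips++; /* BUG: also runs when ret == -EACCES */
}

/* Fixed handling: only wait for an event if the request was accepted. */
void queue_flip_fixed(FlipState *s, bool vt_active)
{
    int ret = fake_page_flip(vt_active);

    if (ret == 0)
        s->pending_flips++;               /* an event will arrive later */
    else if (ret != -EACCES)
        s->page_flipping_enabled = false; /* genuine failure: fall back */
    /* ret == -EACCES: keep page flipping enabled, expect no event */
}

/* Called when a flip event is delivered for a previously queued flip. */
void flip_event_received(FlipState *s)
{
    assert(s->pending_flips > 0);
    s->pending_flips--;
}
```

With the VT inactive, `queue_flip_buggy` leaves `pending_flips` at 1 although no event will ever arrive, which models the wait that never ends. `queue_flip_fixed` leaves it at 0, so nothing waits on a flip that was never queued. Whether the actual cogl fix takes this exact shape is not stated in this report.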

Comment 26 Giulio 'juliuxpigface' 2015-10-21 20:17:05 UTC
For the record...

I encounter the same behavior as Adam (comment 24) on TC11, after having updated selinux-policy related packages to 3.13.1-152.fc23 (http://koji.fedoraproject.org/koji/buildinfo?buildID=693419).

I've also applied a fix suggested by SELinux Alert Browser (grep spice-vdagent /var/log/audit/audit.log | audit2allow -M mypol;semodule -i mypol.pp), but I don't know if this is relevant...

Comment 27 Fedora Update System 2015-10-21 20:36:43 UTC
cogl-1.22.0-2.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-e42c5a0215

Comment 28 Adam Williamson 2015-10-21 21:29:57 UTC
Fix looks good to me, thanks!

Comment 29 Kamil Páral 2015-10-22 08:48:24 UTC
It's hard to reproduce the bug when you actually need it, but I haven't seen any regression with this build and I haven't managed to reproduce the black screen with it. So it looks like it works :)

Comment 30 Kamil Páral 2015-10-22 13:33:14 UTC
For the purpose of the blocker bug discussion, I'm +1 blocker here.

Comment 31 Adam Williamson 2015-10-22 16:31:23 UTC
Discussed at 2015-10-22 Go/No-Go meeting, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-meeting-2/2015-10-22/f23-final-go_no_go-meeting.2015-10-22-16.00.log.txt . Accepted as a blocker per criterion cited in #c1, since it's now been reproduced on bare metal and we have the dev's confirmation that it's not possible to be sure it will only happen rarely.

Comment 32 Adam Williamson 2015-10-22 19:46:30 UTC
kparal: For the record I could reproduce it very reliably with 2 CPUs in a KVM, and it definitely went away when I updated cogl.

Comment 33 Fedora Update System 2015-10-24 12:08:41 UTC
cogl-1.22.0-2.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update cogl'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-e42c5a0215

Comment 34 Fedora Update System 2015-10-24 12:24:02 UTC
cogl-1.22.0-2.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 35 lgwalts 2016-02-12 19:26:02 UTC
I have this issue with a clean install on a Dell T110 server.  cogl-1.22.0-2.fc23 is installed but the problem persists.  
Wayland is disabled in gdm so the system starts and seems to function but always hangs on shutdown, reboot, or log out.

Comment 36 lgwalts 2016-02-12 19:28:16 UTC
ctrl-alt-F2 immediately locks up gnome