Created attachment 1084113 [details]
Output of 'journalctl -u gdm -b -2 | tail'
Description of problem:
When closing a Gnome session, gdm doesn't pop up and the screen becomes black.
Switching to other consoles (CTRL+ALT+F*) isn't possible either.
Moreover, this seems only reproducible on qemu-kvm guests and not on bare metal (thanks kalev for this information!).
Version-Release number of involved components:
Steps to Reproduce:
1. Log in through gdm to a standard Gnome session.
2. Log out from the top-right menu.
Actual results:
The screen becomes black, but the system isn't entirely stuck.
Expected results:
Gdm should be accessible after a logout.
I'm attaching journalctl's log.
Proposed as a Blocker and Freeze Exception for 23-final by Fedora user juliuxpigface using the blocker tracking app because:
Even if this seems to work on bare metal, it is (in my opinion) worth discussing at least for FE.
The release criterion involved is "2.6.6 Shutdown, reboot, logout" from Beta's set.
"Shutting down, logging out and rebooting must work using standard console commands and the mechanisms offered (if any) by all release-blocking desktops. [...] Logging out must return the user to the environment from which they logged in, working as expected."
I can reproduce this, but it's a race condition. If I try to log out several times, I hit this issue quite soon (e.g. 3rd or 5th logout, sometimes 1st). The system still works, but you need to connect remotely because local access is completely frozen.
I only tried a VM, so I'm not sure if it affects bare metal as well.
Created attachment 1084409 [details]
gdm traceback from gdb
I'm not sure how to tell which process is stuck. I retrieved a traceback from gdm using gdb, in the hope that it's the right process.
Created attachment 1084412 [details]
Created attachment 1084413 [details]
Yeah, like Kamil I found I couldn't hit this every time, but quite easily. I also tested on bare metal and did not hit it one time, so it may be specific to VMs (or slow systems or something). I will try with a non-SPICE VM shortly.
Hit this with the std/VNC test. First logout attempt: https://bugzilla.redhat.com/show_bug.cgi?id=1273112 . Second logout attempt worked. Third logout attempt (after logging in again) hit the black screen.
Created attachment 1084455 [details]
full log of an affected boot (std/VNC)
Here's a full journal from my affected test boot - this will include g-i-s user creation, initial login, first (failed back to desktop) logout attempt, second (successful) logout attempt, and third (failed to black screen) logout attempt. I let it sit at the black screen for a while then shut down via the VM's 'power button', which seemed to trigger a clean shutdown.
This seems to be Wayland-specific - neither kparal nor I can reproduce it with Wayland disabled. OTOH, I can still reproduce after bumping my VM to 4GB RAM.
For the record, I also tried installing systemd 227.fc24 on an F23 machine (using wayland gdm), and I could still reproduce this. So this doesn't seem to be fixed by a recent systemd version.
Created attachment 1084468 [details]
journal with debug gdm info
Created attachment 1084469 [details]
log with GDM debugging enabled
This is a log with GDM debugging enabled. This is a straightforward boot, log in, reproduce on first logout attempt: there should be nothing else in the logs.
Discussed at 2015-10-19 blocker review meeting: https://meetbot.fedoraproject.org/fedora-blocker-review/2015-10-19/f23-blocker-review.2015-10-19-16.00.html . This is a hard call with the limited data we have available, but for now we rejected it as a blocker but accepted it as a freeze exception. It's clearly a problem for affected systems, and we have the criterion for a reason, but it doesn't happen *every* time, and it doesn't affect many systems at all. Affected systems can be fixed with an update, and can also work around the bug by disabling GDM-on-Wayland (which we can document).
So far we're more or less working on the assumption this is a timing issue of some kind and most likely to be encountered on slower systems; we have reproduced it in several VM configurations and on a bare metal ARM system, but have not reproduced it on any bare metal Intel system yet.
If further testing (or discovery of the actual code cause of the bug) indicates the impact of this is broader than currently thought, it can be re-considered for blocker status.
Created attachment 1084673 [details]
hang on bare metal - journal
I reproduced this on a bare metal machine. One time I had to log out 10-15 times; another time it happened on the 2nd logout. The screen is black with a cursor shown on it (it can't be moved). (The cursor is not shown in VMs because of bug 1273247.)
I'm re-proposing this for a blocker discussion due to comment 14.
what was the system in question?
hmm, let's see if I can get it from the logs:
M5A97 PRO (that's an AMD AM3+ board)
AMD FX(tm)-4100 Quad-Core Processor (yup!)
Memory: 8097720K/8333760K available
so, that's a fairly mid-spec system. Ray, have you got anywhere with this one?
Oh, also looks like kparal's test box has an 80GB SSD and a 500GB HDD:
Oct 20 12:14:36 dhcp-28-122.brq.redhat.com kernel: scsi 2:0:0:0: Direct-Access ATA INTEL SSDSC2CT08 335u PQ: 0 ANSI: 5
Oct 20 12:14:36 dhcp-28-122.brq.redhat.com kernel: scsi 3:0:0:0: Direct-Access ATA ST500DM002-1BD14 KC45 PQ: 0 ANSI: 5
I tried again to reproduce this on my bare metal test box, could not (with ~20 loops).
Thanks Adam, that's exactly this system. F23 was installed to the SSD, HDD was unused, in case that information is useful.
After several attempts, I finally reproduced this bug (twice out of ~20 attempts) on my thinkpad x201i with:
Core i3 M370
00:02.0 VGA compatible controller : Intel Corporation Core Processor Integrated Graphics Controller [8086:0046] (rev 02)
4GB RAM and HDD.
Created attachment 1085096 [details]
hang journal (thinkpad x201i)
Created attachment 1085097 [details]
rpm -qa (thinkpad x201i)
Funnily, in RC2 testing I haven't managed to hit this yet, instead I'm hitting https://bugzilla.redhat.com/show_bug.cgi?id=1273247 - I keep getting a missing cursor on logout, but I haven't managed to hit the black screen yet.
I think I figured this out. cogl will stop doing page flipping and revert to manual drmModeSetCrtc calls if the drmModePageFlip call fails. Doing that will cause tearing, but provides a compatibility path for drivers that don't support page flipping (mga for instance). drmModePageFlip can fail under normal conditions, though, if it's called when using a non-render-node (as is the case for us at the moment) and the VT is inactive. In this case, it will return an error code of EACCES. We try to detect this special case and leave page flipping enabled, so the user doesn't start seeing tearing after one of these "expected" failures. The code isn't quite right though. It not only ignores the error, but it proceeds as if the page flip request succeeded, and sits and waits for the flip event to come back. Since the page flip request didn't succeed, the event never comes back.
Will build a fix.
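For anyone following along, here is a minimal sketch of the logic described above. It is not cogl's actual code: FlipState and both handler functions are hypothetical names made up for illustration, and it assumes the page flip call reports 0 on success and a negative errno value (such as -EACCES) on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-output state; not cogl's real data structure. */
typedef struct {
    bool page_flips_enabled; /* false: fall back to manual mode sets (may tear) */
    bool flip_pending;       /* true: we are waiting for a flip event */
} FlipState;

/* Buggy variant: -EACCES is correctly kept from disabling page flipping,
 * but it is then treated as if the flip had been queued. */
void handle_flip_result_buggy(FlipState *s, int ret)
{
    assert(s != NULL);
    if (ret != 0 && ret != -EACCES)
        s->page_flips_enabled = false;
    /* Bug: after -EACCES we still wait for an event that will never arrive. */
    s->flip_pending = s->page_flips_enabled;
}

/* Fixed variant: only wait for a flip event when the request succeeded. */
void handle_flip_result_fixed(FlipState *s, int ret)
{
    assert(s != NULL);
    s->flip_pending = (ret == 0);
    /* -EACCES (inactive VT) is expected, so keep page flipping enabled. */
    if (ret != 0 && ret != -EACCES)
        s->page_flips_enabled = false;
}
```

With the buggy variant, a result of -EACCES leaves flip_pending set, which matches the black screen: the compositor waits forever for a flip event. With the fixed variant, the same result leaves page flipping enabled but does not wait.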
For the record...
I encountered the same behavior as Adam (comment 24) on TC11, after updating the selinux-policy related packages to 3.13.1-152.fc23 (http://koji.fedoraproject.org/koji/buildinfo?buildID=693419).
I've also applied a fix suggested by the SELinux Alert Browser (grep spice-vdagent /var/log/audit/audit.log | audit2allow -M mypol;semodule -i mypol.pp), but I don't know if this is relevant...
cogl-1.22.0-2.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-e42c5a0215
Fix looks good to me, thanks!
It's hard to reproduce the bug when you actually need it, but I haven't seen any regression with this build and I haven't managed to reproduce the black screen with it. So it looks like it works :)
For the purpose of the blocker bug discussion, I'm +1 blocker here.
Discussed at 2015-10-22 Go/No-Go meeting, acting as a blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-meeting-2/2015-10-22/f23-final-go_no_go-meeting.2015-10-22-16.00.log.txt . Accepted as a blocker per criterion cited in #c1, since it's now been reproduced on bare metal and we have the dev's confirmation that it's not possible to be sure it will only happen rarely.
kparal: For the record I could reproduce it very reliably with 2 CPUs in a KVM, and it definitely went away when I updated cogl.
cogl-1.22.0-2.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update cogl'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-e42c5a0215
cogl-1.22.0-2.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.
I have this issue with a clean install on a Dell T110 server. cogl-1.22.0-2.fc23 is installed but the problem persists.
Wayland is disabled in gdm so the system starts and seems to function but always hangs on shutdown, reboot, or log out.
ctrl-alt-F2 immediately locks up gnome