Created attachment 742027 [details]
Xorg log file for active system in which GUI disappeared earlier
Description of problem:
I never saw this till I got a 3.8 kernel on fedora 18, and I have no idea
if it is an intel driver problem, but it seems like a good a component
as any to submit this bug against:
Every few weeks, I'll turn on the Samsung TV I use as a monitor, expecting
to use the computer and find, instead of my normal GUI, the leftover
text from the last time the system booted (I've got graphical boot disabled).
Today, I discovered that I can do Ctrl-Alt-F2 to go to a console on VT2,
find the system is running, find that KDM is running, discover, in fact, that
nothing at all is abnormal. If I then switch back to VT1 with a
Ctrl-Alt-F1 sequence, suddenly, the GUI is back with all the X clients still
running, nothing wrong at all.
I just thought it might be weird enough to report, if only for people
to marvel at how weird it is :-).
This is running on an Asus P8H67-V motherboard with a core-i7 chip using the
native intel HD graphics (not additional graphics card in the system at all)
shown in lspci as:
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
Version-Release number of selected component (if applicable):
Wait a few weeks and maybe it will happen.
Steps to Reproduce:
1. see above
GUI disappears somewhere
GUI stays visible
I'll attach the Xorg log. There is nothing that looks strange in it though,
just the message at the end from doing the VT switch, but no messages
prior to that that weren't there at boot time a week ago.
This gets weirder and weirder. I came home today to find the console VT being displayed, but instead of the boot messages, it was the VT 2 screen I had switched to yesterday (having it happen two days in a row is also unusual). I wasn't actually in VT 2 however, I was just looking at the screen - typing had no effect. When I tried Ctrl-Alt-F2 to see what would happen, I suddenly got a blinking cursor and could use the VT 2 console.
When I did Ctrl-Alt-F1, I once again got the GUI X session back.
I have ssh'ed into this system from work two days in a row as well. I have no idea why that might have an effect since the ssh sessions should certainly be independent from the local X display. I'll see if there is a correlation in the future between ssh'ing into the system and the X display going funny in the head
(or maybe it just happens more frequently when I don't reboot the system - I always rebooted before when I didn't realize there was any way to get the X display back).
This continues to happen with latest kernel-3.8.9-200.fc18.x86_64
The more it happens, the less it seems to be correlated with any other
activity on the system (the ssh theory is definitely busted).
The only thing there might be a correlation with is heavy system load.
I've been doing a lot of video transcoding lately to prepare TV shows I've downloaded from my TiVo to play on my tablet. The CPUs (all 8 of 'em) sit at 100% for an hour or two while doing each show, and perhaps the only time the GUI disappears on me is when the monitor is turned off during this transcoding.
Could the X driver be trying to poll something that times out when the system is really busy? (Doesn't make much sense, but is the only straw I have to grasp at :-).
This developed a new symptom today. When I use mplayer to play a movie in full screen mode using opengl video, sometimes it will freeze, then the sound will start again, but the picture remains frozen. I get out of it via the same console switch sequence.
I'm running memtest overnight and if it finds no errors, I'll try installing something like fedora 16 to see if the problem goes away (at least it seems to be frequent enough when playing movies to be able to reproduce it)
No memtest errors with 7 full passes overnight, so I guess I need to try fedora 16 and see if an old kernel and video driver will manifest the problem.
Finally got fedora 16 installed on a spare partition. On the exact same hardware, I can play a movie all the way through with no problems under Fedora 16 with the 3.1.0-7.fc16.x86_64 kernel. On Fedora 18 with 3.9.4-200.fc18.x86_64 playing the movie always winds up in the state described in comment 3 within about 10 to 20 minutes. This latest time it screwed up, I did manage to find some info in dmesg. The relevant bit seems to be this:
[ 1221.110322] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1221.110332] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state
I'll attach the complete dmesg and the i915_error_state file. The mplayer command line I used to play the movie was:
mplayer -vo gl_nosw -monitoraspect 16:9 -fs *.mkv
The .mkv file is 1280x720 aspect being scaled up by mplayer to 1920x1080.
Created attachment 758373 [details]
Complete dmesg output following video flake out
Created attachment 758374 [details]
/sys/kernel/debug/dri/0/i915_error_state file after video flake out
Note: I submitted this upstream as well as bug:
A reply there suggests that a fix in mesa 9.1.3 might solve this problem.
Of course mesa 9.1.3 seems to be a bit of a mystery. There is a 9.1.3 directory in the download site, but there is no 9.1.3 announcement on the mesa web page, nor are there any 9.1.3 release notes in the release notes page :-).
As another data point, I just tried playing a movie on f19 beta running on the same hardware, and the drm:i915_hangcheck_hung errors still happen from time to time, but when it recovers, it is in the correct screen now, so the Ctrl-Alt vterm switching is no longer required.
This is on kernel 3.9.4-301.fc19.x86_64.
Just to add another data point, the new mesa updates with the 20130610 date in their version don't fix the drm:i915_hangcheck_hung bugs, still happening every 10 to 20 minutes when playing a movie fullscreen. (Though at least it recovers to the correct screen still).
New news on this. I noticed this bit in the dmesg output:
[ 2.534340] [drm] Initialized drm 1.1.0 20060810
[ 2.541133] alg: No test for crc32 (crc32-pclmul)
[ 2.547794] [drm] Memory usable by graphics device = 2048M
[ 2.547799] i915 0000:00:02.0: setting latency timer to 64
[ 2.568065] i915 0000:00:02.0: irq 54 for MSI/MSI-X
[ 2.568070] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[ 2.568071] [drm] Driver supports precise vblank timestamp query.
[ 2.568103] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 2.581593] [drm] Wrong MCH_SSKPD value: 0x16040307
[ 2.581596] [drm] This can cause pipe underruns and display issues.
[ 2.581597] [drm] Please upgrade your BIOS to fix this.
So I decided to try upgrading the BIOS on my P8H67-V motherboard. That, naturally, bricked the motherboard (so I'm waiting for a replacement BIOS chip to arrive :-), but I also used this an an excuse to upgrade my system with a new asus sabertooth z87 motherboard and haswell cpu. This new system has no messages about drm problems in dmesg and I was actually able to play an entire movie with no hangcheck hangs (still using intel graphics). The new lspci looks like:
00:00.0 Host bridge: Intel Corporation Haswell DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Haswell PCI Express x16 Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation Haswell Integrated Graphics Controller (rev 06)
00:03.0 Audio device: Intel Corporation Haswell HD Audio Controller (rev 06)
00:14.0 USB controller: Intel Corporation Lynx Point USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation Lynx Point MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-V (rev 04)
00:1a.0 USB controller: Intel Corporation Lynx Point USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation Lynx Point High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation Lynx Point PCI Express Root Port #1 (rev d4)
00:1c.3 PCI bridge: Intel Corporation Lynx Point PCI Express Root Port #4 (rev d4)
00:1c.7 PCI bridge: Intel Corporation Lynx Point PCI Express Root Port #8 (rev d4)
00:1d.0 USB controller: Intel Corporation Lynx Point USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Lynx Point LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation Lynx Point 6-port SATA Controller 1 [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation Lynx Point SMBus Controller (rev 04)
03:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)
04:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)
So it does seem like it might have been fixed by the BIOS update if only it had worked :-). Either that, or the haswell is so much faster that whatever underrun/overrun problem might have caused it doesn't have a chance to happen.
Got the new BIOS chip, got the system rebuilt enough to work, and I see the same MCH_SSKPD message in the dmesg output, so updating the BIOS certainly didn't fix that message, so I'm guessing it won't fix the video problems (but I haven't gotten far enough to test that yet).
OK, the new Haswell system isn't fast enough all the time. I found it with the original symptom, sitting in the console screen again this morning, but no message shows up in dmesg.
This was the first night in a long time that I had some TV programs to transcode overnight, so it definitely appears to be correlated with the system being under heavy CPU load.
Perhaps the video driver is resetting something because it imagines there has been a timeout, but really the system is just real busy?
Just to keep adding clues to this mystery, I now find it showing the console every morning, even when I'm not doing any transcoding at night. I'm officially back to completely mystified again :-).
I'm now running:
No messages seem to show up in any logs that I can see, I'm just pointing at the old console framebuffer every morning now and I get the real display by doing a VT switch away then back.
This bug has been driving me to do crazy things like leave my android camera running a time lapse photo app all night to discover that the screen switches between 1:16:26 am and 1:17:26am and that I run a cron job that does a couple of rsync commands at 1:17am.
I also find /var/log/messages entries saying bluetooth resets seem to happen at 1:17am and pulesaudio resets seem to happen at 1:17 am.
None of this makes any sense, mind you, but it is data :-). I have no idea why doing a few directory read and ethernet operations would reset my screen or my bluetooth dongle or the pluseaudio daemon. (The rsync commands will actually find there is nothing to do after comparing the directories involved).
Maybe there is something wacky about running /etc/bashrc from cron? There is certainly a lot of complicated nonsense I can't understand in the /etc/bashrc script.
This is getting more and more fun. It is indeed my cron job that seems to trigger the video switch, but not if I run it normally from a terminal.
If I run it via "at" or in cron, the screen flips to funny frame buffer as soon as it starts running.
Time to start trying the individual commands to see which one does it I guess.
OK, found it. This command when run with the $DISPLAY environment variable unset will switch the display to the old contents of the frame buffer from before starting X 100% of the times that I run it:
mplayer -identify -frames 0 /huge/vids/DoctorWho/2013-08-04-2595018-TheDoctorsRevisitedTheSeventhDoctor/TheDoctorsRevisitedTheSeventhDoctor.mpg
That mpg file is unfortunately large:
zooty> ls -l /huge/vids/DoctorWho/2013-08-04-2595018-TheDoctorsRevisitedTheSeventhDoctor/TheDoctorsRevisitedTheSeventhDoctor.mpg
-rw-r--r-- 1 tom users 2695557746 Aug 5 01:25 /huge/vids/DoctorWho/2013-08-04-2595018-TheDoctorsRevisitedTheSeventhDoctor/TheDoctorsRevisitedTheSeventhDoctor.mpg
I have a feeling this bit is what triggers the problem, but there are lots of other ioctl calls that fail (I'll attach the full strace):
ioctl(4, FBIOGET_VSCREENINFO, 0x7f6e6402dee0) = 0
ioctl(4, FBIOGET_CONTRAST or FBIOPUT_VSCREENINFO, 0x7f6e6402df80) = 0
ioctl(5, KDSETMODE, 0) = -1 ENOTTY (Inappropriate ioctl for device)
write(2, "Can't restore text mode: Inappro"..., 56) = 56
If I do have $DISPLAY set, then it runs with no problems.
Created attachment 787488 [details]
strace of mplayer command that kills display
Created attachment 787489 [details]
for contrast, here is a working mplayer run with $DISPLAY set correctly
Created attachment 787490 [details]
1st meg of the mpg file
Turns out I can reproduce the problem with a truncated copy of the mpg file that is just the 1st meg, so I'm attaching that as well.
mplayer -identify -frames 0 trunc.mpg
wacks the X display for me.
I'm using the mplayer from the rpmfusion repos:
Created attachment 787568 [details]
screen shot of strace just before video goes
I built a special strace that delayed 1/10 of a second for each syscall and was able to film the strace output as the screen switches. This is the final screen shot just before the GUI disappears. Looks like an ioctl() on a file descriptor for /dev/fb0 is what is causing the screen switch.
My theory at this point is that mplayer is trying to fall back to low level video frame buffer direct rendering when $DISPLAY is unset, and the Intel video driver isn't reacting well to that.
All my problems go away if I add the "-vo null" option to the mplayer command, so I at least know how to make this stop happening, but it does seem like some sort of bug really exists here.
Anyway, I've written up the complete saga of this bug at:
I'll want to try the same sample video file on my desktop at work (which has an ATI card) and see what happens there.
For good measure, I reported this upstream at:
Holy cow! Now that I know how to reproduce it, I see it does the same thing
on my system at work which is using the ATI drivers:
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '18'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 18's end of life.
Thank you for reporting this issue and we are sorry that we may not be
able to fix it before Fedora 18 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged change the 'version' to a later Fedora
version prior to Fedora 18's end of life.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 18 changed to end-of-life (EOL) status on 2014-01-14. Fedora 18 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
Thank you for reporting this bug and we are sorry it could not be fixed.