Bug 958326 - Sometimes find system in text mode console on VT1 instead of in GUI
Summary: Sometimes find system in text mode console on VT1 instead of in GUI
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-intel
Version: 18
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Adam Jackson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-04-30 22:58 UTC by Tom Horsley
Modified: 2014-02-05 21:10 UTC (History)
3 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-02-05 21:10:43 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Xorg log file for active system in which GUI disappeared earlier (36.75 KB, text/x-log)
2013-04-30 22:58 UTC, Tom Horsley
no flags Details
Complete dmesg output following video flake out (67.70 KB, text/plain)
2013-06-08 00:28 UTC, Tom Horsley
no flags Details
/sys/kernel/debug/dri/0/i915_error_state file after video flake out (2.22 MB, text/plain)
2013-06-08 00:29 UTC, Tom Horsley
no flags Details
strace of mplayer command that kills display (214.60 KB, text/plain)
2013-08-17 00:10 UTC, Tom Horsley
no flags Details
for contrast, here is a working mplayer run with $DISPLAY set correctly (198.77 KB, text/plain)
2013-08-17 00:12 UTC, Tom Horsley
no flags Details
1st meg of the mpg file (1.00 MB, application/octet-stream)
2013-08-17 00:20 UTC, Tom Horsley
no flags Details
screen shot of strace just before video goes (387.39 KB, image/png)
2013-08-17 16:06 UTC, Tom Horsley
no flags Details

Description Tom Horsley 2013-04-30 22:58:41 UTC
Created attachment 742027 [details]
Xorg log file for active system in which GUI disappeared earlier

Description of problem:

I never saw this till I got a 3.8 kernel on fedora 18, and I have no idea
if it is an intel driver problem, but it seems like a good a component
as any to submit this bug against:

Every few weeks, I'll turn on the Samsung TV I use as a monitor, expecting
to use the computer and find, instead of my normal GUI, the leftover
text from the last time the system booted (I've got graphical boot disabled).

Today, I discovered that I can do Ctrl-Alt-F2 to go to a console on VT2,
find the system is running, find that KDM is running, discover, in fact, that
nothing at all is abnormal. If I then switch back to VT1 with a
Ctrl-Alt-F1 sequence, suddenly, the GUI is back with all the X clients still
running, nothing wrong at all.

I just thought it might be weird enough to report, if only for people
to marvel at how weird it is :-).

This is running on an Asus P8H67-V motherboard with a core-i7 chip using the
native intel HD graphics (not additional graphics card in the system at all)
shown in lspci as:

00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)

Version-Release number of selected component (if applicable):
xorg-x11-drv-intel-2.21.6-1.fc18.x86_64

How reproducible:
Wait a few weeks and maybe it will happen.

Steps to Reproduce:
1. see above
2.
3.
  
Actual results:
GUI disappears somewhere

Expected results:
GUI stays visible

Additional info:
I'll attach the Xorg log. There is nothing that looks strange in it though,
just the message at the end from doing the VT switch, but no messages
prior to that that weren't there at boot time a week ago.

Comment 1 Tom Horsley 2013-05-01 21:00:29 UTC
This gets weirder and weirder. I came home today to find the console VT being displayed, but instead of the boot messages, it was the VT 2 screen I had switched to yesterday (having it happen two days in a row is also unusual). I wasn't actually in VT 2 however, I was just looking at the screen - typing had no effect. When I tried Ctrl-Alt-F2 to see what would happen, I suddenly got a blinking cursor and could use the VT 2 console.

When I did Ctrl-Alt-F1, I once again got the GUI X session back.

I have ssh'ed into this system from work two days in a row as well. I have no idea why that might have an effect since the ssh sessions should certainly be independent from the local X display. I'll see if there is a correlation in the future between ssh'ing into the system and the X display going funny in the head
(or maybe it just happens more frequently when I don't reboot the system - I always rebooted before when I didn't realize there was any way to get the X display back).

Comment 2 Tom Horsley 2013-05-25 12:03:20 UTC
This continues to happen with latest kernel-3.8.9-200.fc18.x86_64
and xorg-x11-drv-intel-2.21.6-1.fc18.x86_64.

The more it happens, the less it seems to be correlated with any other
activity on the system (the ssh theory is definitely busted).

The only thing there might be a correlation with is heavy system load.
I've been doing a lot of video transcoding lately to prepare TV shows I've downloaded from my TiVo to play on my tablet. The CPUs (all 8 of 'em) sit at 100% for an hour or two while doing each show, and perhaps the only time the GUI disappears on me is when the monitor is turned off during this transcoding.

Could the X driver be trying to poll something that times out when the system is really busy? (Doesn't make much sense, but is the only straw I have to grasp at :-).

Comment 3 Tom Horsley 2013-06-04 00:57:57 UTC
This developed a new symptom today. When I use mplayer to play a movie in full screen mode using opengl video, sometimes it will freeze, then the sound will start again, but the picture remains frozen. I get out of it via the same console switch sequence.

I'm running memtest overnight and if it finds no errors, I'll try installing something like fedora 16 to see if the problem goes away (at least it seems to be frequent enough when playing movies to be able to reproduce it)

Comment 4 Tom Horsley 2013-06-04 11:20:47 UTC
No memtest errors with 7 full passes overnight, so I guess I need to try fedora 16 and see if an old kernel and video driver will manifest the problem.

Comment 5 Tom Horsley 2013-06-08 00:27:47 UTC
Finally got fedora 16 installed on a spare partition. On the exact same hardware, I can play a movie all the way through with no problems under Fedora 16 with the 3.1.0-7.fc16.x86_64 kernel. On Fedora 18 with 3.9.4-200.fc18.x86_64 playing the movie always winds up in the state described in comment 3 within about 10 to 20 minutes. This latest time it screwed up, I did manage to find some info in dmesg. The relevant bit seems to be this:

[ 1221.110322] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1221.110332] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state

I'll attach the complete dmesg and the i915_error_state file. The mplayer command line I used to play the movie was:

mplayer -vo gl_nosw -monitoraspect 16:9 -fs *.mkv

The .mkv file is 1280x720 aspect being scaled up by mplayer to 1920x1080.

Comment 6 Tom Horsley 2013-06-08 00:28:43 UTC
Created attachment 758373 [details]
Complete dmesg output following video flake out

Comment 7 Tom Horsley 2013-06-08 00:29:42 UTC
Created attachment 758374 [details]
/sys/kernel/debug/dri/0/i915_error_state file after video flake out

Comment 8 Tom Horsley 2013-06-08 11:56:31 UTC
Note: I submitted this upstream as well as bug:

https://bugs.freedesktop.org/show_bug.cgi?id=65526

A reply there suggests that a fix in mesa 9.1.3 might solve this problem.

Comment 9 Tom Horsley 2013-06-08 12:10:27 UTC
Of course mesa 9.1.3 seems to be a bit of a mystery. There is a 9.1.3 directory in the download site, but there is no 9.1.3 announcement on the mesa web page, nor are there any 9.1.3 release notes in the release notes page :-).

Comment 10 Tom Horsley 2013-06-12 23:56:00 UTC
As another data point, I just tried playing a movie on f19 beta running on the same hardware, and the drm:i915_hangcheck_hung errors still happen from time to time, but when it recovers, it is in the correct screen now, so the Ctrl-Alt vterm switching is no longer required.

This is on kernel 3.9.4-301.fc19.x86_64.

Comment 11 Tom Horsley 2013-06-14 17:17:39 UTC
Just to add another data point, the new mesa updates with the 20130610 date in their version don't fix the drm:i915_hangcheck_hung bugs, still happening every 10 to 20 minutes when playing a movie fullscreen. (Though at least it recovers to the correct screen still).

Comment 12 Tom Horsley 2013-06-21 13:37:30 UTC
New news on this. I noticed this bit in the dmesg output:

[    2.534340] [drm] Initialized drm 1.1.0 20060810
[    2.541133] alg: No test for crc32 (crc32-pclmul)
[    2.547794] [drm] Memory usable by graphics device = 2048M
[    2.547799] i915 0000:00:02.0: setting latency timer to 64
[    2.568065] i915 0000:00:02.0: irq 54 for MSI/MSI-X
[    2.568070] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[    2.568071] [drm] Driver supports precise vblank timestamp query.
[    2.568103] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    2.581593] [drm] Wrong MCH_SSKPD value: 0x16040307
[    2.581596] [drm] This can cause pipe underruns and display issues.
[    2.581597] [drm] Please upgrade your BIOS to fix this.

So I decided to try upgrading the BIOS on my P8H67-V motherboard. That, naturally, bricked the motherboard (so I'm waiting for a replacement BIOS chip to arrive :-), but I also used this an an excuse to upgrade my system with a new asus sabertooth z87 motherboard and haswell cpu. This new system has no messages about drm problems in dmesg and I was actually able to play an entire movie with no hangcheck hangs (still using intel graphics). The new lspci looks like:

00:00.0 Host bridge: Intel Corporation Haswell DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Haswell PCI Express x16 Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation Haswell Integrated Graphics Controller (rev 06)
00:03.0 Audio device: Intel Corporation Haswell HD Audio Controller (rev 06)
00:14.0 USB controller: Intel Corporation Lynx Point USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation Lynx Point MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-V (rev 04)
00:1a.0 USB controller: Intel Corporation Lynx Point USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation Lynx Point High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation Lynx Point PCI Express Root Port #1 (rev d4)
00:1c.3 PCI bridge: Intel Corporation Lynx Point PCI Express Root Port #4 (rev d4)
00:1c.7 PCI bridge: Intel Corporation Lynx Point PCI Express Root Port #8 (rev d4)
00:1d.0 USB controller: Intel Corporation Lynx Point USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Lynx Point LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation Lynx Point 6-port SATA Controller 1 [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation Lynx Point SMBus Controller (rev 04)
03:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)
04:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)

So it does seem like it might have been fixed by the BIOS update if only it had worked :-). Either that, or the haswell is so much faster that whatever underrun/overrun problem might have caused it doesn't have a chance to happen.

Comment 13 Tom Horsley 2013-06-22 13:03:29 UTC
Got the new BIOS chip, got the system rebuilt enough to work, and I see the same MCH_SSKPD message in the dmesg output, so updating the BIOS certainly didn't fix that message, so I'm guessing it won't fix the video problems (but I haven't gotten far enough to test that yet).

Comment 14 Tom Horsley 2013-06-30 12:51:22 UTC
OK, the new Haswell system isn't fast enough all the time. I found it with the original symptom, sitting in the console screen again this morning, but no message shows up in dmesg.

This was the first night in a long time that I had some TV programs to transcode overnight, so it definitely appears to be correlated with the system being under heavy CPU load.

Perhaps the video driver is resetting something because it imagines there has been a timeout, but really the system is just real busy?

Comment 15 Tom Horsley 2013-08-11 13:28:43 UTC
Just to keep adding clues to this mystery, I now find it showing the console every morning, even when I'm not doing any transcoding at night. I'm officially back to completely mystified again :-).

I'm now running:

kernel-3.10.5-201.fc19.x86_64
xorg-x11-drv-intel-2.21.12-2.fc19.x86_64

No messages seem to show up in any logs that I can see, I'm just pointing at the old console framebuffer every morning now and I get the real display by doing a VT switch away then back.

Comment 16 Tom Horsley 2013-08-16 13:23:26 UTC
This bug has been driving me to do crazy things like leave my android camera running a time lapse photo app all night to discover that the screen switches between 1:16:26 am and 1:17:26am and that I run a cron job that does a couple of rsync commands at 1:17am.

I also find /var/log/messages entries saying bluetooth resets seem to happen at 1:17am and pulesaudio resets seem to happen at 1:17 am.

None of this makes any sense, mind you, but it is data :-). I have no idea why doing a few directory read and ethernet operations would reset my screen or my bluetooth dongle or the pluseaudio daemon. (The rsync commands will actually find there is nothing to do after comparing the directories involved).

Maybe there is something wacky about running /etc/bashrc from cron? There is certainly a lot of complicated nonsense I can't understand in the /etc/bashrc script.

Comment 17 Tom Horsley 2013-08-16 21:41:40 UTC
This is getting more and more fun. It is indeed my cron job that seems to trigger the video switch, but not if I run it normally from a terminal.

If I run it via "at" or in cron, the screen flips to funny frame buffer as soon as it starts running.

Time to start trying the individual commands to see which one does it I guess.

Comment 18 Tom Horsley 2013-08-17 00:10:00 UTC
OK, found it. This command when run with the $DISPLAY environment variable unset will switch the display to the old contents of the frame buffer from before starting X 100% of the times that I run it:

mplayer -identify -frames 0 /huge/vids/DoctorWho/2013-08-04-2595018-TheDoctorsRevisitedTheSeventhDoctor/TheDoctorsRevisitedTheSeventhDoctor.mpg

That mpg file is unfortunately large:
zooty> ls -l /huge/vids/DoctorWho/2013-08-04-2595018-TheDoctorsRevisitedTheSeventhDoctor/TheDoctorsRevisitedTheSeventhDoctor.mpg
-rw-r--r-- 1 tom users 2695557746 Aug  5 01:25 /huge/vids/DoctorWho/2013-08-04-2595018-TheDoctorsRevisitedTheSeventhDoctor/TheDoctorsRevisitedTheSeventhDoctor.mpg

I have a feeling this bit is what triggers the problem, but there are lots of other ioctl calls that fail (I'll attach the full strace):


ioctl(4, FBIOGET_VSCREENINFO, 0x7f6e6402dee0) = 0
ioctl(4, FBIOGET_CONTRAST or FBIOPUT_VSCREENINFO, 0x7f6e6402df80) = 0
ioctl(5, KDSETMODE, 0)                  = -1 ENOTTY (Inappropriate ioctl for device)
write(2, "Can't restore text mode: Inappro"..., 56) = 56

If I do have $DISPLAY set, then it runs with no problems.

Comment 19 Tom Horsley 2013-08-17 00:10:48 UTC
Created attachment 787488 [details]
strace of mplayer command that kills display

Comment 20 Tom Horsley 2013-08-17 00:12:35 UTC
Created attachment 787489 [details]
for contrast, here is a working mplayer run with $DISPLAY set correctly

Comment 21 Tom Horsley 2013-08-17 00:20:13 UTC
Created attachment 787490 [details]
1st meg of the mpg file

Turns out I can reproduce the problem with a truncated copy of the mpg file that is just the 1st meg, so I'm attaching that as well.

mplayer -identify -frames 0 trunc.mpg

wacks the X display for me.

I'm using the mplayer from the rpmfusion repos:

mplayer-1.1-9.20130416svn.fc19.x86_64

Comment 22 Tom Horsley 2013-08-17 16:06:16 UTC
Created attachment 787568 [details]
screen shot of strace just before video goes

I built a special strace that delayed 1/10 of a second for each syscall and was able to film the strace output as the screen switches. This is the final screen shot just before the GUI disappears. Looks like an ioctl() on a file descriptor for /dev/fb0 is what is causing the screen switch.

Comment 23 Tom Horsley 2013-08-17 16:48:04 UTC
My theory at this point is that mplayer is trying to fall back to low level video frame buffer direct rendering when $DISPLAY is unset, and the Intel video driver isn't reacting well to that.

All my problems go away if I add the "-vo null" option to the mplayer command, so I at least know how to make this stop happening, but it does seem like some sort of bug really exists here.

Anyway, I've written up the complete saga of this bug at:

http://home.comcast.net/~tomhorsley/game/heisenbug.html

I'll want to try the same sample video file on my desktop at work (which has an ATI card) and see what happens there.

Comment 24 Tom Horsley 2013-08-17 18:00:04 UTC
For good measure, I reported this upstream at:

https://bugs.freedesktop.org/show_bug.cgi?id=68231

Comment 25 Tom Horsley 2013-08-19 12:10:26 UTC
Holy cow! Now that I know how to reproduce it, I see it does the same thing
on my system at work which is using the ATI drivers:

xorg-x11-drv-ati-7.1.0-5.20130408git6e74aacc5.fc19.x86_64

Comment 26 Fedora End Of Life 2013-12-21 13:19:57 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 27 Fedora End Of Life 2014-02-05 21:10:43 UTC
Fedora 18 changed to end-of-life (EOL) status on 2014-01-14. Fedora 18 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.


Note You need to log in before you can comment on or make changes to this bug.