691084 – [Sandybridge] X crash when unlocking OpenGL screensaver (KDE)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mesa
Sub Component:
Version:	15
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Dave Airlie
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-03-26 16:20 UTC by Ian Pilcher
Modified:	2018-04-11 12:43 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-05-07 18:51:30 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
X.org log (39.65 KB, text/plain) 2011-03-26 16:20 UTC, Ian Pilcher	no flags	Details
dmesg (90.15 KB, text/plain) 2011-03-26 16:23 UTC, Ian Pilcher	no flags	Details
99-display-positions.conf (261 bytes, text/plain) 2011-03-26 16:25 UTC, Ian Pilcher	no flags	Details
dmesg with drm.debug=14 (4.96 MB, text/plain) 2011-03-26 18:57 UTC, Ian Pilcher	no flags	Details
gdb backtrace of X crash (bt full) (2.93 KB, text/plain) 2011-03-27 15:34 UTC, Ian Pilcher	no flags	Details

Links
System	ID	Priority	Status	Summary	Last Updated
FreeDesktop.org	32534	None	None	None	Never
FreeDesktop.org	35452	None	None	None	Never
KDE Software Compilation	252817	None	None	None	Never

Description Ian Pilcher 2011-03-26 16:20:50 UTC

Created attachment 487867 [details]
X.org log

Description of problem:
Fedora 15 Alpha on a brand new Sandy Bridge system (Core i7 2600 with
"Intel HD" graphics).  Running KDE and using GLMatrix screensaver.

When I unlocked the system this morning, the password entry dialog
was not displayed correctly.  When I pressed a key, only the password
entry field became visible.  The "Switch User ...", "Unlock", and
"Cancel" buttons only appeared when I moved the mouse pointer over
them; the dialog background remained black.

When I entered my password, X crashed.  The backtrace in the log is:

[ 49562.907] 0: /usr/bin/X (xorg_backtrace+0x2f) [0x4a117f]
[ 49562.907] 1: /usr/bin/X (0x400000+0x621c6) [0x4621c6]
[ 49562.907] 2: /lib64/libpthread.so.0 (0x37ae800000+0xf4e0) [0x37ae80f4e0]
[ 49562.907] 3: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x84251) [0x7fd70c43d251]
[ 49562.907] 4: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x6e116) [0x7fd70c427116]
[ 49562.907] 5: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x5d5cd) [0x7fd70c4165cd]
[ 49562.907] 6: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x148b93) [0x7fd70c501b93]
[ 49562.907] 7: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x14690c) [0x7fd70c4ff90c]
[ 49562.907] 8: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x146b0a) [0x7fd70c4ffb0a]
[ 49562.907] 9: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x1090cb) [0x7fd70c4c20cb]
[ 49562.907] 10: /usr/lib64/xorg/modules/extensions/libglx.so (0x7fd70d839000+0x310b9) [0x7fd70d86a0b9]
[ 49562.907] 11: /usr/lib64/xorg/modules/extensions/libglx.so (0x7fd70d839000+0x33831) [0x7fd70d86c831]
[ 49562.907] 12: /usr/bin/X (0x400000+0x2ebd1) [0x42ebd1]
[ 49562.907] 13: /usr/bin/X (0x400000+0x22e5a) [0x422e5a]
[ 49562.907] 14: /lib64/libc.so.6 (__libc_start_main+0xed) [0x37adc2131d]
[ 49562.907] 15: /usr/bin/X (0x400000+0x23141) [0x423141]
[ 49562.907] Segmentation fault at address (nil)

The graphics hardware is:

00:02.0 VGA compatible controller: Intel Corporation Sandy Bridge Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
        Subsystem: Giga-byte Technology Device d000
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 42
        Region 0: Memory at fb800000 (64-bit, non-prefetchable) [size=4M]
        Region 2: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 4: I/O ports at ff00 [size=64]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee4400c  Data: 4171
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a4] PCI Advanced Features
                AFCap: TP+ FLR+
                AFCtrl: FLR-
                AFStatus: TP-
        Kernel driver in use: i915
        Kernel modules: i915

Version-Release number of selected component (if applicable):
mesa-dri-drivers-7.10.1-1.fc15.x86_64

How reproducible:
Not sure.  I just built the system, and this is the first time I've tried
running an OpenGL screensaver overnight on it.

Steps to Reproduce:
See above.
  
Actual results:
X crash.

Expected results:
No X crash.

Additional info:
I will install the relevant debuginfo packages and try to reproduce.

Comment 1 Ian Pilcher 2011-03-26 16:23:31 UTC

Created attachment 487869 [details]
dmesg

Comment 2 Ian Pilcher 2011-03-26 16:25:26 UTC

Created attachment 487874 [details]
99-display-positions.conf

Configuration file I use to set my screen positions (just for the
sake of completeness).

Comment 3 Ian Pilcher 2011-03-26 16:32:18 UTC

Note that I've rebuilt xorg-x11-server, because of bug #680684.

Comment 4 Ian Pilcher 2011-03-26 18:57:39 UTC

Created attachment 487943 [details]
dmesg with drm.debug=14

dmesg from X crash and subsequent X restart (by kdm).

It seems that this may be related to DPMS.  I was only able to reproduce the
crash by allowing the screensaver to run until DPMS turned off the displays
before I attempted to unlock it.

BTW, I've installed the relevant debuginfo packages, but the backtrace in
the X log does not appear to be using that information.  Is there a tool that
I can feed the backtrace from the log through to get a more useful version?
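
Each i965_dri.so frame in the X log has the form "(load base+offset) [absolute address]", and the offset is the part a symbol lookup needs. A minimal sketch of pulling it out of one log line follows; parse_frame is a hypothetical helper, not part of X or Mesa, and it assumes the line carries the "[ timestamp]" prefix shown in the description.

```c
#include <stdio.h>

/* Hypothetical helper: extract the module path and the offset from one
 * X log backtrace frame such as
 *   "[ 49562.907] 3: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x84251) [0x7fd70c43d251]"
 * Returns 0 on success, -1 if the frame is not in "(base+offset)" form
 * (for example "(xorg_backtrace+0x2f)", which already names a symbol). */
int parse_frame(const char *line, char *module, size_t module_len,
                unsigned long long *offset)
{
    int frame;
    char path[256];
    unsigned long long base, off, abs_addr;

    if (sscanf(line, "[ %*f] %d: %255s (0x%llx+0x%llx) [0x%llx]",
               &frame, path, &base, &off, &abs_addr) != 5)
        return -1;

    /* Sanity check: the bracketed address should be base + offset. */
    if (base + off != abs_addr)
        return -1;

    snprintf(module, module_len, "%s", path);
    *offset = off;
    return 0;
}
```

With debuginfo installed, the module path and offset could then be handed to a tool such as addr2line (binutils) or to gdb to get a file and line number.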

Comment 5 Ian Pilcher 2011-03-27 15:34:52 UTC

Created attachment 488015 [details]
gdb backtrace of X crash (bt full)

Attaching full backtrace.  Here is the short version:

#0  prepare_wm_surfaces (brw=0x2cd33c0) at brw_wm_surface_state.c:602
#1  0x00007f653c0e6116 in brw_validate_state (brw=0x2cd33c0) at brw_state_upload.c:397
#2  0x00007f653c0d55cd in brw_try_draw_prims (max_index=<optimized out>, min_index=<optimized out>, ib=0x0, nr_prims=1, prim=0x2c63f04, 
    arrays=0x2c657e8, ctx=0x2cd33c0) at brw_draw.c:362
#3  brw_draw_prims (ctx=0x2cd33c0, arrays=0x2c657e8, prim=0x2c63f04, nr_prims=1, ib=0x0, index_bounds_valid=<optimized out>, 
    min_index=0, max_index=3) at brw_draw.c:447
#4  0x00007f653c1c0b93 in vbo_exec_vtx_flush (exec=0x2c63c20, unmap=1 '\001') at vbo/vbo_exec_draw.c:381
#5  0x00007f653c1be90c in vbo_exec_FlushVertices_internal (ctx=<optimized out>, unmap=<optimized out>) at vbo/vbo_exec_api.c:911
#6  0x00007f653c1beb0a in vbo_exec_FlushVertices (ctx=<optimized out>, flags=1) at vbo/vbo_exec_api.c:945
#7  0x00007f653c1810cb in _mesa_set_scissor (ctx=0x2cd33c0, x=3279, y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at main/scissor.c:75
#8  0x00007f653d5290b9 in __glXDisp_Render (cl=<optimized out>, pc=<optimized out>) at glxcmds.c:2000
#9  0x00007f653d52b831 in __glXDispatch (client=0x2c1bae0) at glxext.c:583
#10 0x000000000042ebd1 in Dispatch () at dispatch.c:431
#11 0x0000000000422e5a in main (argc=<optimized out>, argv=0x7fffbe6f4ef8, envp=<optimized out>) at main.c:287

Comment 6 Ian Pilcher 2011-03-27 17:49:08 UTC

I found http://cgit.freedesktop.org/mesa/mesa/commit/?id=13bab58f04c1ec6d0d52760eab490a0997d9abe2 and rebuilt mesa with that patch.  I now get this crash:

#0  0x0000000000000000 in ?? ()
#1  0x00007fb315acf3ad in _swrast_write_rgba_span (ctx=<optimized out>, span=0x7fff36a95750) at swrast/s_span.c:1275
#2  0x00007fb315aef705 in general_triangle (ctx=0x1c543b0, v0=<optimized out>, v1=<optimized out>, v2=0x2c3a8c0)
    at swrast/s_tritemp.h:819
#3  0x00007fb315a9798a in _tnl_render_poly_elts (ctx=0x1c543b0, start=0, count=4, flags=<optimized out>) at tnl/t_vb_rendertmp.h:352
#4  0x00007fb315a97fe1 in _tnl_RenderClippedPolygon (ctx=<optimized out>, elts=<optimized out>, n=<optimized out>)
    at tnl/t_vb_render.c:245
#5  0x00007fb315a920df in clip_quad_4 (ctx=<optimized out>, v0=<optimized out>, v1=<optimized out>, v2=<optimized out>, v3=3, 
    mask=<optimized out>) at tnl/t_vb_cliptmp.h:310
#6  0x00007fb315a94045 in clip_render_quads_verts (ctx=0x1c543b0, start=<optimized out>, count=4, flags=<optimized out>)
    at tnl/t_vb_rendertmp.h:383
#7  0x00007fb315a97f59 in run_render (ctx=0x1c543b0, stage=<optimized out>) at tnl/t_vb_render.c:321
#8  0x00007fb315a8c919 in _tnl_run_pipeline (ctx=0x1c543b0) at tnl/t_pipeline.c:153
#9  0x00007fb315a8d2c9 in _tnl_draw_prims (ctx=<optimized out>, arrays=0x1da2528, prim=0x1da0c44, nr_prims=1, ib=0x0, 
    min_index=<optimized out>, max_index=3) at tnl/t_draw.c:518
#10 0x00007fb315998254 in brw_draw_prims (ctx=0x1c543b0, arrays=0x1da2528, prim=0x1da0c44, nr_prims=1, ib=0x0, 
    index_bounds_valid=<optimized out>, min_index=0, max_index=3) at brw_draw.c:455
#11 0x00007fb315a83bb3 in vbo_exec_vtx_flush (exec=0x1da0960, unmap=1 '\001') at vbo/vbo_exec_draw.c:381
#12 0x00007fb315a8192c in vbo_exec_FlushVertices_internal (ctx=<optimized out>, unmap=<optimized out>) at vbo/vbo_exec_api.c:911
#13 0x00007fb315a81b2a in vbo_exec_FlushVertices (ctx=<optimized out>, flags=1) at vbo/vbo_exec_api.c:945
#14 0x00007fb315a440eb in _mesa_set_scissor (ctx=0x1c543b0, x=3279, y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at main/scissor.c:75
#15 0x00007fb316dec0b9 in __glXDisp_Render (cl=<optimized out>, pc=<optimized out>) at glxcmds.c:2000
#16 0x00007fb316dee831 in __glXDispatch (client=0x1c3f130) at glxext.c:583
#17 0x000000000042ebd1 in Dispatch () at dispatch.c:431
#18 0x0000000000422e5a in main (argc=<optimized out>, argv=0x7fff36a96d08, envp=<optimized out>) at main.c:287

This looks like https://bugs.freedesktop.org/show_bug.cgi?id=32534

Comment 7 Matěj Cepl 2011-04-07 23:19:00 UTC

(In reply to comment #6)
> This looks like https://bugs.freedesktop.org/show_bug.cgi?id=32534

Just that the other bug is Arrandale, this one is Sandybridge.

Also, when analyzing the backtrace in comment 0, I get a backtrace similar to the one in comment 5.  The innermost i965_dri.so frame,

/usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x84251) [0x7fd70c43d251]

is line 602 of src/mesa/drivers/dri/i965/brw_wm_surface_state.c:

static void
prepare_wm_surfaces(struct brw_context *brw)
{
   struct gl_context *ctx = &brw->intel.ctx;
   int i;
   int nr_surfaces = 0;

   if (ctx->DrawBuffer->_NumColorDrawBuffers >= 1) {
      for (i = 0; i < ctx->DrawBuffer->_NumColorDrawBuffers; i++) {
         struct gl_renderbuffer *rb = ctx->DrawBuffer->_ColorDrawBuffers[i];
         struct intel_renderbuffer *irb = intel_renderbuffer(rb);
         struct intel_region *region = irb ? irb->region : NULL;

>>>>     brw_add_validated_bo(brw, region->buffer);
         nr_surfaces = SURF_INDEX_DRAW(i) + 1;
      }
   }
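
The marked line dereferences region even though the line before it allows region to be NULL. A minimal sketch of the missing check follows; the struct names are hypothetical stand-ins for the Mesa types, and this is not the upstream fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the Mesa types quoted above. */
struct buffer { int id; };
struct region { struct buffer *buffer; };
struct renderbuffer { struct region *region; };

/* Returns true only if every color draw buffer has a usable region.
 * The unchecked form, region->buffer, faults when region is NULL. */
bool draw_buffers_usable(struct renderbuffer *const *rbs, int count)
{
    for (int i = 0; i < count; i++) {
        const struct region *region = rbs[i] ? rbs[i]->region : NULL;

        if (region == NULL || region->buffer == NULL)
            return false;
    }
    return true;
}
```

A caller that gets false back has to decide what to do instead of drawing; this sketch only shows the check itself.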

Comment 8 Ian Pilcher 2011-04-07 23:29:05 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > This looks like https://bugs.freedesktop.org/show_bug.cgi?id=32534
> 
> Just that the other bug is Arrandale, this one is Sandybridge.

I was just going by the similarity of the backtraces.  I have no idea
how [dis-]similar the two graphics controllers are.

> Also, when analyzing backtrace in comment 0, I get to the similar backtrace as
> what's in the comment 5:

That makes sense; they're the same crash.  The first one was just before
I figured out how to use gdb to get a better backtrace.

> >>>>     brw_add_validated_bo(brw, region->buffer);

And that's what led me to the commit referenced in comment #6, which gave
me a different crash.

One other thing I've noticed, which may or may not be significant ... If I
see a corrupted version of the KDE screensaver unlock dialog, I am able to
tab to the Cancel button, let the screensaver run for a while, and try again.
Thus far, I have always been able to eventually get a properly rendered
unlock dialog and successfully unlock the screensaver without a crash.

Comment 9 Ian Pilcher 2011-04-18 03:06:40 UTC

Comments I just posted to the upstream bug (along with a backtrace of the
memory allocation failure):

I spent a significant amount of time digging into this today, and I've been
able to figure out the following sequence of events:

* Starting point is GLMatrix screensaver running on KDE 4.6.2 (Fedora 15
  x86_64, Core i7 2600 "HD 2000" GPU).  At this point everything appears
  to be working fine.

* Hit a key, move the mouse, etc. to bring up the screensaver unlock dialog.
  If the dialog is rendered properly at this point, then the crash will not
  occur.  Everything from here on is the incorrectly rendered case.

* The screensaver unlock dialog is not rendered correctly.  Most or all of
  it is invisible (black on black).  Various portions may appear as one
  "mouses over" or tabs to them.

* Type the password and press Enter.

* This is where I am able to catch the first sign of failure in the Mesa
  code (although the rendering problems indicate that something has already
  gone wrong, at least at the KDE level).

  drm_intel_bo_gem_create_from_name returns NULL to
  intel_region_alloc_for_handle.  This NULL gets propagated up to
  intel_update_renderbuffers, which sets the region of the renderbuffer
  to NULL.

* When prepare_wm_surfaces tries to use this renderbuffer, it encounters
  the NULL region.  This used to cause an immediate segfault, but it now
  detects the NULL region, sets brw->intel.Fallback to GL_TRUE, and bails.

* brw_draw_prims detects that brw_try_draw_prims failed, so it falls back
  to the software rasterizer, calling _swsetup_Wakeup and _tnl_draw_prims
  in turn.

* Eventually, it gets to _swrast_write_rgba_span, which tries to call the
  renderbuffer's PutRow function.  Of course, the renderbuffer is an
  intel_renderbuffer, so its PutRow function is NULL, which causes the
  segfault we're seeing now.

Based on the last point, it seems like the software fallback that was
introduced in commit 13bab58f04c1ec6d0d52760eab490a0997d9abe2 is
fundamentally broken.  It clearly isn't possible to simply pass an
intel_renderbuffer to the software rasterizer.
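
The last step in the sequence above comes down to calling through a NULL function pointer. A minimal sketch follows, using hypothetical types rather than the real swrast structures; write_span only shows how the NULL pointer could be detected and reported instead of being called.

```c
#include <stddef.h>

/* Hypothetical stand-in for a renderbuffer with a per-row write hook.
 * A hardware-backed buffer leaves put_row NULL in this sketch. */
struct sw_renderbuffer {
    void (*put_row)(struct sw_renderbuffer *rb, int count, int x, int y,
                    const unsigned char *rgba);
    int rows_written;
};

/* Example hook for a software-backed buffer: just counts the rows. */
void count_row(struct sw_renderbuffer *rb, int count, int x, int y,
               const unsigned char *rgba)
{
    (void)count; (void)x; (void)y; (void)rgba;
    rb->rows_written++;
}

/* Returns 0 if the row was written, -1 if the buffer has no write hook.
 * Calling rb->put_row unconditionally jumps to address 0 when it is NULL. */
int write_span(struct sw_renderbuffer *rb, int count, int x, int y,
               const unsigned char *rgba)
{
    if (rb == NULL || rb->put_row == NULL)
        return -1;

    rb->put_row(rb, count, x, y, rgba);
    return 0;
}
```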

I really feel that I've done as much digging on this as someone unfamiliar
with the codebase can be reasonably expected to do.  My wife agrees, BTW.
;-)  It would be *really* nice if someone familiar with how all of this is
supposed to work could take a look at this.

Comment 10 Ian Pilcher 2011-04-18 17:05:23 UTC

A bit more information.  The failure in drm_intel_bo_gem_create_from_name
occurs when drmIoctl is called with DRM_IOCTL_GEM_OPEN.  It is returning a
"No such file or directory" error.

Comment 11 Ian Pilcher 2011-04-18 20:32:25 UTC

(The referenced spreadsheet is attached to the upstream bug.)

I modified drmIoctl to log GEM object lifecycle-related calls to syslog.  The
attached spreadsheet shows the log from a crash.  (I used a spreadsheet,
because it allowed me to hide 1,300+ calls that aren't related to the
problematic object, without actually deleting those lines; I might have
missed something.)  The interesting lines are:

  DRM_IOCTL_I915_GEM_CREATE(size: 14680064) succeeded -- handle: 8e
  DRM_IOCTL_GEM_FLINK(handle: 8e) succeeded -- name: 3
  DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: f6, size: 14680064
  DRM_IOCTL_GEM_CLOSE(handle: f6) succeeded
  DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 221, size: 14680064
  DRM_IOCTL_GEM_CLOSE(handle: 221) succeeded
  DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 431, size: 14680064
  DRM_IOCTL_GEM_CLOSE(handle: 431) succeeded
  DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 43a, size: 14680064
  DRM_IOCTL_GEM_CLOSE(handle: 43a) succeeded
  DRM_IOCTL_GEM_CLOSE(handle: 8e) succeeded
  DRM_IOCTL_GEM_OPEN(name: 3) failed: No such file or directory

So there appear to be at least two things happening here:

  1.  Based on the fact that unlocking works sometimes, the root cause is
      almost certainly a race condition in KDE.  However ...

  2.  There's very little prospect of that race condition ever being
      fixed (or even acknowledged) as long as Mesa is swallowing these
      errors and creating unusable renderbuffers.

I propose that, at the very least, a failure in intel_region_alloc_for_handle
(and probably intel_region_alloc as well) needs to cause an error to be returned
to the application.  I will attempt to create a patch that does this, but it
would be *really* helpful if someone with more knowledge of the internals of
Mesa, GLX, etc. would step in and help out here.
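
The ioctl sequence logged above can be replayed against a toy model of a single GEM object. This is only a sketch: the gem_* functions are hypothetical, and the model assumes that a flink name stops resolving once the last handle to the object is closed, which is an inference from the log rather than something this report establishes.

```c
#include <errno.h>

/* Toy model of one GEM object: an open-handle count and a global name.
 * A name of 0 means "no name assigned". */
struct gem_object {
    int handle_count;
    int name;
};

/* Models DRM_IOCTL_I915_GEM_CREATE: the creator holds the first handle. */
void gem_create(struct gem_object *obj)
{
    obj->handle_count = 1;
    obj->name = 0;
}

/* Models DRM_IOCTL_GEM_FLINK: publish a global name for a live object. */
int gem_flink(struct gem_object *obj, int name)
{
    if (obj->handle_count <= 0)
        return -ENOENT;
    obj->name = name;
    return 0;
}

/* Models DRM_IOCTL_GEM_OPEN: look the object up by name, add a handle. */
int gem_open(struct gem_object *obj, int name)
{
    if (obj->name == 0 || obj->name != name)
        return -ENOENT;
    obj->handle_count++;
    return 0;
}

/* Models DRM_IOCTL_GEM_CLOSE: drop a handle; the name goes with the last one
 * (the assumption stated in the lead-in). */
void gem_close(struct gem_object *obj)
{
    if (obj->handle_count > 0 && --obj->handle_count == 0)
        obj->name = 0;
}
```

In this model, four open/close pairs on name 3 succeed while the creating handle is still open, and an open attempted after that handle is closed returns -ENOENT, which is the "No such file or directory" error in the log.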

Comment 12 Ian Pilcher 2011-04-24 02:02:18 UTC

Initial testing of the patch at
http://lists.x.org/archives/xorg-devel/2011-March/020716.html is looking good
for solving the KDE/OpenGL screensaver unlock crash.

The issue of "swallowing" GEM errors and creating render buffers with NULL
regions and function pointers still exists.

Comment 13 Ian Pilcher 2011-04-30 04:11:44 UTC

I have been using the patches at the URLs below for almost a week now, and
I have not seen the corrupted screensaver unlock dialog or its associated
X crash.

http://article.gmane.org/gmane.comp.freedesktop.xorg.devel/20287/match=dri2+always+re+generate+front+buffer+information+asked

http://article.gmane.org/gmane.comp.freedesktop.xorg.devel/20762/match=dri2+invalidate+dri2+buffers+all+windows+same+pixmap+swap

I also tested on my old system with an ATI X1650 PRO, and it got rid of
the screensaver unlock dialog corruption there as well.

I was hoping that the X guys would commit these patches, but they seem
completely indifferent.

Dave/Ajax - Maybe you can get a reaction?

Comment 14 Ian Pilcher 2011-05-07 18:51:30 UTC

* Thu Apr 28 2011 Dave Airlie <airlied> 1.10.1-14
- backport upstream DRI2 fixes that are being screwed around with upstream

Thank you!
