Created attachment 487867 [details]
X.org log

Description of problem:
Fedora 15 Alpha on a brand new Sandy Bridge system (Core i7 2600 with "Intel HD" graphics). Running KDE and using the GLMatrix screensaver.

When I unlocked the system this morning, the password entry dialog was not displayed correctly. When I pressed a key, only the password entry field became visible. The "Switch User ...", "Unlock", and "Cancel" buttons only appeared when I moved the mouse pointer over them; the dialog background remained black.

When I entered my password, X crashed. The backtrace in the log is:

[ 49562.907] 0: /usr/bin/X (xorg_backtrace+0x2f) [0x4a117f]
[ 49562.907] 1: /usr/bin/X (0x400000+0x621c6) [0x4621c6]
[ 49562.907] 2: /lib64/libpthread.so.0 (0x37ae800000+0xf4e0) [0x37ae80f4e0]
[ 49562.907] 3: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x84251) [0x7fd70c43d251]
[ 49562.907] 4: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x6e116) [0x7fd70c427116]
[ 49562.907] 5: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x5d5cd) [0x7fd70c4165cd]
[ 49562.907] 6: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x148b93) [0x7fd70c501b93]
[ 49562.907] 7: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x14690c) [0x7fd70c4ff90c]
[ 49562.907] 8: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x146b0a) [0x7fd70c4ffb0a]
[ 49562.907] 9: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x1090cb) [0x7fd70c4c20cb]
[ 49562.907] 10: /usr/lib64/xorg/modules/extensions/libglx.so (0x7fd70d839000+0x310b9) [0x7fd70d86a0b9]
[ 49562.907] 11: /usr/lib64/xorg/modules/extensions/libglx.so (0x7fd70d839000+0x33831) [0x7fd70d86c831]
[ 49562.907] 12: /usr/bin/X (0x400000+0x2ebd1) [0x42ebd1]
[ 49562.907] 13: /usr/bin/X (0x400000+0x22e5a) [0x422e5a]
[ 49562.907] 14: /lib64/libc.so.6 (__libc_start_main+0xed) [0x37adc2131d]
[ 49562.907] 15: /usr/bin/X (0x400000+0x23141) [0x423141]
[ 49562.907] Segmentation fault at address (nil)

The graphics hardware is:

00:02.0 VGA compatible controller: Intel Corporation Sandy Bridge Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Giga-byte Technology Device d000
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 42
	Region 0: Memory at fb800000 (64-bit, non-prefetchable) [size=4M]
	Region 2: Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at ff00 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee4400c Data: 4171
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a4] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: i915
	Kernel modules: i915

Version-Release number of selected component (if applicable):
mesa-dri-drivers-7.10.1-1.fc15.x86_64

How reproducible:
Not sure. I just built the system, and this is the first time I've tried running an OpenGL screensaver overnight on it.

Steps to Reproduce:
See above.

Actual results:
X crash.

Expected results:
No X crash.

Additional info:
I will install the relevant debuginfo packages and try to reproduce.
Created attachment 487869 [details] dmesg
Created attachment 487874 [details] 99-display-positions.conf Configuration file I use to set my screen positions (just for the sake of completeness).
Note that I've rebuilt xorg-x11-server, because of bug #680684.
Created attachment 487943 [details]
dmesg with drm.debug=14

dmesg from X crash and subsequent X restart (by kdm).

It seems that this may be related to DPMS. I was only able to reproduce the crash by allowing the screensaver to run until DPMS turned off the displays before I attempted to unlock it.

BTW, I've installed the relevant debuginfo packages, but the backtrace in the X log does not appear to be using that information. Is there a tool that I can feed the backtrace from the log through to get a more useful version?
Created attachment 488015 [details]
gdb backtrace of X crash (bt full)

Attaching full backtrace. Here is the short version:

#0 prepare_wm_surfaces (brw=0x2cd33c0) at brw_wm_surface_state.c:602
#1 0x00007f653c0e6116 in brw_validate_state (brw=0x2cd33c0) at brw_state_upload.c:397
#2 0x00007f653c0d55cd in brw_try_draw_prims (max_index=<optimized out>, min_index=<optimized out>, ib=0x0, nr_prims=1, prim=0x2c63f04, arrays=0x2c657e8, ctx=0x2cd33c0) at brw_draw.c:362
#3 brw_draw_prims (ctx=0x2cd33c0, arrays=0x2c657e8, prim=0x2c63f04, nr_prims=1, ib=0x0, index_bounds_valid=<optimized out>, min_index=0, max_index=3) at brw_draw.c:447
#4 0x00007f653c1c0b93 in vbo_exec_vtx_flush (exec=0x2c63c20, unmap=1 '\001') at vbo/vbo_exec_draw.c:381
#5 0x00007f653c1be90c in vbo_exec_FlushVertices_internal (ctx=<optimized out>, unmap=<optimized out>) at vbo/vbo_exec_api.c:911
#6 0x00007f653c1beb0a in vbo_exec_FlushVertices (ctx=<optimized out>, flags=1) at vbo/vbo_exec_api.c:945
#7 0x00007f653c1810cb in _mesa_set_scissor (ctx=0x2cd33c0, x=3279, y=<optimized out>, width=<optimized out>, height=<optimized out>) at main/scissor.c:75
#8 0x00007f653d5290b9 in __glXDisp_Render (cl=<optimized out>, pc=<optimized out>) at glxcmds.c:2000
#9 0x00007f653d52b831 in __glXDispatch (client=0x2c1bae0) at glxext.c:583
#10 0x000000000042ebd1 in Dispatch () at dispatch.c:431
#11 0x0000000000422e5a in main (argc=<optimized out>, argv=0x7fffbe6f4ef8, envp=<optimized out>) at main.c:287
I found http://cgit.freedesktop.org/mesa/mesa/commit/?id=13bab58f04c1ec6d0d52760eab490a0997d9abe2 and rebuilt mesa with that patch. I now get this crash:

#0 0x0000000000000000 in ?? ()
#1 0x00007fb315acf3ad in _swrast_write_rgba_span (ctx=<optimized out>, span=0x7fff36a95750) at swrast/s_span.c:1275
#2 0x00007fb315aef705 in general_triangle (ctx=0x1c543b0, v0=<optimized out>, v1=<optimized out>, v2=0x2c3a8c0) at swrast/s_tritemp.h:819
#3 0x00007fb315a9798a in _tnl_render_poly_elts (ctx=0x1c543b0, start=0, count=4, flags=<optimized out>) at tnl/t_vb_rendertmp.h:352
#4 0x00007fb315a97fe1 in _tnl_RenderClippedPolygon (ctx=<optimized out>, elts=<optimized out>, n=<optimized out>) at tnl/t_vb_render.c:245
#5 0x00007fb315a920df in clip_quad_4 (ctx=<optimized out>, v0=<optimized out>, v1=<optimized out>, v2=<optimized out>, v3=3, mask=<optimized out>) at tnl/t_vb_cliptmp.h:310
#6 0x00007fb315a94045 in clip_render_quads_verts (ctx=0x1c543b0, start=<optimized out>, count=4, flags=<optimized out>) at tnl/t_vb_rendertmp.h:383
#7 0x00007fb315a97f59 in run_render (ctx=0x1c543b0, stage=<optimized out>) at tnl/t_vb_render.c:321
#8 0x00007fb315a8c919 in _tnl_run_pipeline (ctx=0x1c543b0) at tnl/t_pipeline.c:153
#9 0x00007fb315a8d2c9 in _tnl_draw_prims (ctx=<optimized out>, arrays=0x1da2528, prim=0x1da0c44, nr_prims=1, ib=0x0, min_index=<optimized out>, max_index=3) at tnl/t_draw.c:518
#10 0x00007fb315998254 in brw_draw_prims (ctx=0x1c543b0, arrays=0x1da2528, prim=0x1da0c44, nr_prims=1, ib=0x0, index_bounds_valid=<optimized out>, min_index=0, max_index=3) at brw_draw.c:455
#11 0x00007fb315a83bb3 in vbo_exec_vtx_flush (exec=0x1da0960, unmap=1 '\001') at vbo/vbo_exec_draw.c:381
#12 0x00007fb315a8192c in vbo_exec_FlushVertices_internal (ctx=<optimized out>, unmap=<optimized out>) at vbo/vbo_exec_api.c:911
#13 0x00007fb315a81b2a in vbo_exec_FlushVertices (ctx=<optimized out>, flags=1) at vbo/vbo_exec_api.c:945
#14 0x00007fb315a440eb in _mesa_set_scissor (ctx=0x1c543b0, x=3279, y=<optimized out>, width=<optimized out>, height=<optimized out>) at main/scissor.c:75
#15 0x00007fb316dec0b9 in __glXDisp_Render (cl=<optimized out>, pc=<optimized out>) at glxcmds.c:2000
#16 0x00007fb316dee831 in __glXDispatch (client=0x1c3f130) at glxext.c:583
#17 0x000000000042ebd1 in Dispatch () at dispatch.c:431
#18 0x0000000000422e5a in main (argc=<optimized out>, argv=0x7fff36a96d08, envp=<optimized out>) at main.c:287

This looks like https://bugs.freedesktop.org/show_bug.cgi?id=32534
(In reply to comment #6)
> This looks like https://bugs.freedesktop.org/show_bug.cgi?id=32534

Just that the other bug is Arrandale, this one is Sandybridge.

Also, when analyzing backtrace in comment 0, I get to the similar backtrace as what's in the comment 5: the crashing frame, /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x84251) [0x7fd70c43d251], is line 602 of src/mesa/drivers/dri/i965/brw_wm_surface_state.c:

static void prepare_wm_surfaces(struct brw_context *brw)
{
   struct gl_context *ctx = &brw->intel.ctx;
   int i;
   int nr_surfaces = 0;

   if (ctx->DrawBuffer->_NumColorDrawBuffers >= 1) {
      for (i = 0; i < ctx->DrawBuffer->_NumColorDrawBuffers; i++) {
         struct gl_renderbuffer *rb = ctx->DrawBuffer->_ColorDrawBuffers[i];
         struct intel_renderbuffer *irb = intel_renderbuffer(rb);
         struct intel_region *region = irb ? irb->region : NULL;

>>>>     brw_add_validated_bo(brw, region->buffer);
         nr_surfaces = SURF_INDEX_DRAW(i) + 1;
      }
   }
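For anyone repeating this analysis, the arithmetic behind the X.org log frames can be sketched in C. Frames printed as "(base+offset) [absolute]" give the module's load address and the offset of the return address within it; the offset is what a symbolizer such as addr2line (with -e pointing at the module and debuginfo installed) would normally be given, although prelinked libraries may need extra care. The helper below is hypothetical (struct frame and parse_frame are made-up names) and assumes the "[ 49562.907]" timestamp prefix has already been stripped from the line.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for one parsed frame of the X.org backtrace. */
struct frame {
    char module[256];   /* path of the binary or shared object */
    uint64_t base;      /* load address of the module */
    uint64_t offset;    /* offset of the frame inside the module */
    uint64_t absolute;  /* absolute address printed in brackets */
};

/*
 * Parse a line such as
 *   "3: /usr/lib64/dri/i965_dri.so (0x7fd70c3b9000+0x84251) [0x7fd70c43d251]"
 * Returns 0 on success, or -1 if the line is not in numeric base+offset form
 * (for example "xorg_backtrace+0x2f") or if base + offset != absolute.
 */
int parse_frame(const char *line, struct frame *f)
{
    assert(line != NULL && f != NULL);

    int n = sscanf(line,
                   " %*d: %255s (%" SCNx64 "+%" SCNx64 ") [%" SCNx64 "]",
                   f->module, &f->base, &f->offset, &f->absolute);
    if (n != 4)
        return -1;

    /* Sanity check: the three numbers in the log must agree. */
    return (f->base + f->offset == f->absolute) ? 0 : -1;
}
```

For frame 3 of comment 0 this yields offset 0x84251 inside i965_dri.so, which is the address that resolved to brw_wm_surface_state.c:602 above.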
(In reply to comment #7)
> (In reply to comment #6)
> > This looks like https://bugs.freedesktop.org/show_bug.cgi?id=32534
>
> Just that the other bug is Arrandale, this one is Sandybridge.

I was just going by the similarity of the backtraces. I have no idea how [dis-]similar the two graphics controllers are.

> Also, when analyzing backtrace in comment 0, I get to the similar backtrace as
> what's in the comment 5:

That makes sense; they're the same crash. The first one was just before I figured out how to use gdb to get a better backtrace.

> >>>> brw_add_validated_bo(brw, region->buffer);

And that's what led me to the commit referenced in comment #6, which gave me a different crash.

One other thing I've noticed, which may or may not be significant ...

If I see a corrupted version of the KDE screensaver unlock dialog, I am able to tab to the Cancel button, let the screensaver run for a while, and try again. Thus far, I have always been able to eventually get a properly rendered unlock dialog and successfully unlock the screensaver without a crash.
Comments I just posted to the upstream bug (along with a backtrace of the memory allocation failure):

I spent a significant amount of time digging into this today, and I've been able to figure out the following sequence of events:

* Starting point is GLMatrix screensaver running on KDE 4.6.2 (Fedora 15 x86_64, Core i7 2600 "HD 2000" GPU). At this point everything appears to be working fine.

* Hit a key, move the mouse, etc. to bring up the screensaver unlock dialog. If the dialog is rendered properly at this point, then the crash will not occur. Everything from here on is the incorrectly rendered case.

* The screensaver unlock dialog is not rendered correctly. Most or all of it is invisible (black on black). Various portions may appear as one "mouses over" or tabs to them.

* Type the password and press Enter.

* This is where I am able to catch the first sign of failure in the Mesa code (although the rendering problems indicate that something has already gone wrong, at least at the KDE level). drm_intel_bo_gem_create_from_name returns NULL to intel_region_alloc_for_handle. This NULL gets propagated up to intel_update_renderbuffers, which sets the region of the renderbuffer to NULL.

* When prepare_wm_surfaces tries to use this renderbuffer, it encounters the NULL region. This used to cause an immediate segfault, but it now detects the NULL region, sets brw->intel.Fallback to GL_TRUE, and bails.

* brw_draw_prims detects that brw_try_draw_prims failed, so it falls back to the software rasterizer, calling _swsetup_Wakeup and _tnl_draw_prims in turn.

* Eventually, it gets to _swrast_write_rgba_span, which tries to call the renderbuffer's PutRow function. Of course, the renderbuffer is an intel_renderbuffer, so its PutRow function is NULL, which causes the segfault we're seeing now.

Based on the last point, it seems like the software fallback that was introduced in commit 13bab58f04c1ec6d0d52760eab490a0997d9abe2 is fundamentally broken. It clearly isn't possible to simply pass an intel_renderbuffer to the software rasterizer.

I really feel that I've done as much digging on this as someone unfamiliar with the codebase can be reasonably expected to do. My wife agrees, BTW. ;-) It would be *really* nice if someone familiar with how all of this is supposed to work could take a look at this.
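The failure chain described above can be illustrated with a small standalone C model. This is only a sketch, not Mesa code: toy_region, toy_renderbuffer, toy_sw_put_row and toy_draw are hypothetical names standing in for intel_region, intel_renderbuffer, a software PutRow implementation and the brw_draw_prims fallback path. It shows a renderbuffer whose region is NULL (hardware path unusable) and whose span function is also NULL (software path unusable), and one way a fallback could refuse to draw rather than call through a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for intel_region: only its presence matters in this model. */
struct toy_region {
    int buffer;
};

/* Stand-in for a renderbuffer that may lack a region, span functions, or both. */
struct toy_renderbuffer {
    struct toy_region *region;  /* NULL when opening the shared buffer failed */
    void (*put_row)(struct toy_renderbuffer *rb, int count); /* NULL if hardware-only */
    int rows_written;           /* bookkeeping for the software path */
};

enum draw_result { DRAW_HARDWARE, DRAW_SOFTWARE, DRAW_ERROR };

/* Hypothetical software span function: just counts the rows it was given. */
void toy_sw_put_row(struct toy_renderbuffer *rb, int count)
{
    rb->rows_written += count;
}

/*
 * Try the hardware path first; fall back to software only when the
 * renderbuffer really has a span function. Otherwise report an error
 * instead of jumping to address 0.
 */
enum draw_result toy_draw(struct toy_renderbuffer *rb)
{
    assert(rb != NULL);

    if (rb->region != NULL)
        return DRAW_HARDWARE;   /* normal case: region is valid */

    if (rb->put_row == NULL)
        return DRAW_ERROR;      /* the case that currently segfaults */

    rb->put_row(rb, 1);
    return DRAW_SOFTWARE;
}
```

In this model the renderbuffer from the crash corresponds to { NULL, NULL, 0 }, for which toy_draw returns DRAW_ERROR; whether Mesa should report the error at this point or earlier, when the region allocation fails, is a separate question.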
A bit more information. The failure in drm_intel_bo_gem_create_from_name occurs when drmIoctl is called with DRM_IOCTL_GEM_OPEN. It is returning a "No such file or directory" error.
(The referenced spreadsheet is attached to the upstream bug.)

I modified drmIoctl to log GEM object lifecycle-related calls to syslog. The attached spreadsheet shows the log from a crash. (I used a spreadsheet, because it allowed me to hide 1,300+ calls that aren't related to the problematic object, without actually deleting those lines; I might have missed something.)

The interesting lines are:

DRM_IOCTL_I915_GEM_CREATE(size: 14680064) succeeded -- handle: 8e
DRM_IOCTL_GEM_FLINK(handle: 8e) succeeded -- name: 3
DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: f6, size: 14680064
DRM_IOCTL_GEM_CLOSE(handle: f6) succeeded
DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 221, size: 14680064
DRM_IOCTL_GEM_CLOSE(handle: 221) succeeded
DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 431, size: 14680064
DRM_IOCTL_GEM_CLOSE(handle: 431) succeeded
DRM_IOCTL_GEM_OPEN(name: 3) succeeded -- handle: 43a, size: 14680064
DRM_IOCTL_GEM_CLOSE(handle: 43a) succeeded
DRM_IOCTL_GEM_CLOSE(handle: 8e) succeeded
DRM_IOCTL_GEM_OPEN(name: 3) failed: No such file or directory

So there appear to be at least two things happening here:

1. Based on the fact that unlocking works sometimes, the root cause is almost certainly a race condition in KDE. However ...

2. There's very little prospect of that race condition ever being fixed (or even acknowledged) as long as Mesa is swallowing these errors and creating unusable renderbuffers.

I propose that, at the very least, a failure in intel_region_alloc_for_handle (and probably intel_region_alloc as well) needs to cause an error to be returned to the application.

I will attempt to create a patch that does this, but it would be *really* helpful if someone with more knowledge of the internals of Mesa, GLX, etc. would step in and help out here.
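The pattern in the log above can be reproduced with a toy model of GEM name lifetimes, written in C. This is a simplified sketch under the assumption that a flink name stays valid only while at least one handle to the object is open, which is consistent with the log but is not kernel code; toy_gem and the toy_* functions are hypothetical, and the model tracks a handle count per object rather than individual handle numbers.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TOY_MAX_OBJECTS 8

/* Toy object table: a slot is free when its handle count is 0. */
struct toy_gem {
    int handle_count[TOY_MAX_OBJECTS];
    int name[TOY_MAX_OBJECTS];   /* 0 means "no global name assigned" */
    int next_name;
};

void toy_init(struct toy_gem *g)
{
    assert(g != NULL);
    memset(g, 0, sizeof(*g));
    g->next_name = 1;
}

/* Like GEM_CREATE: allocate an object with one open handle. */
int toy_create(struct toy_gem *g)
{
    for (int i = 0; i < TOY_MAX_OBJECTS; i++) {
        if (g->handle_count[i] == 0) {
            g->handle_count[i] = 1;
            g->name[i] = 0;
            return i;
        }
    }
    return -ENOMEM;
}

/* Like GEM_FLINK: give the object a global name that others can open. */
int toy_flink(struct toy_gem *g, int obj)
{
    if (obj < 0 || obj >= TOY_MAX_OBJECTS || g->handle_count[obj] == 0)
        return -ENOENT;
    if (g->name[obj] == 0)
        g->name[obj] = g->next_name++;
    return g->name[obj];
}

/* Like GEM_OPEN: succeeds only while the named object still has handles. */
int toy_open(struct toy_gem *g, int name)
{
    if (name <= 0)
        return -ENOENT;
    for (int i = 0; i < TOY_MAX_OBJECTS; i++) {
        if (g->handle_count[i] > 0 && g->name[i] == name) {
            g->handle_count[i]++;
            return i;
        }
    }
    return -ENOENT;
}

/* Like GEM_CLOSE: dropping the last handle also retires the name. */
int toy_close(struct toy_gem *g, int obj)
{
    if (obj < 0 || obj >= TOY_MAX_OBJECTS || g->handle_count[obj] == 0)
        return -EINVAL;
    if (--g->handle_count[obj] == 0)
        g->name[obj] = 0;
    return 0;
}
```

Replaying the sequence create, flink, open, close, close (the creator's handle), open gives -ENOENT on the final open, matching the "No such file or directory" line in the log: a client that still holds a stale name cannot tell from the name alone that the buffer is gone, so the failure has to be handled where the open fails.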
Initial testing of the patch at http://lists.x.org/archives/xorg-devel/2011-March/020716.html is looking good for solving the KDE/OpenGL screensaver unlock crash.

The issue of "swallowing" GEM errors and creating renderbuffers with NULL regions and function pointers still exists.
I have been using the patches at the URLs below for almost a week now, and I have not seen the corrupted screensaver unlock dialog or its associated X crash.

http://article.gmane.org/gmane.comp.freedesktop.xorg.devel/20287/match=dri2+always+re+generate+front+buffer+information+asked
http://article.gmane.org/gmane.comp.freedesktop.xorg.devel/20762/match=dri2+invalidate+dri2+buffers+all+windows+same+pixmap+swap

I also tested on my old system with an ATI X1650 PRO, and the patches got rid of the screensaver unlock dialog corruption there as well.

I was hoping that the X guys would commit these patches, but they seem completely indifferent. Dave/Ajax - Maybe you can get a reaction?
* Thu Apr 28 2011 Dave Airlie <airlied> 1.10.1-14
- backport upstream DRI2 fixes that are being screwed around with upstream

Thank you!