Bug 106566
Summary: | OpenGL programs segfault | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Rahul Karnik <rahul> |
Component: | XFree86 | Assignee: | Mike A. Harris <mharris> |
Status: | CLOSED RAWHIDE | QA Contact: | David Lawrence <dkl> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | rawhide | CC: | anovikov, behdad, davej, jakub, michel.salim |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | athlon | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | 4.3.0-42 | Doc Type: | Bug Fix |
Last Closed: | 2003-10-24 18:39:47 UTC | Type: | --- |
Description
Rahul Karnik
2003-10-08 14:05:22 UTC
gdb backtrace follows:

(gdb) r
Starting program: /usr/X11R6/bin/glxinfo
Error while mapping shared library sections:
linux-gate.so.1: Success.
Error while reading shared library symbols:
linux-gate.so.1: No such file or directory.
(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread 1077917344 (LWP 3291)]
(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...Error while reading shared library symbols:
linux-gate.so.1: No such file or directory.
Error while reading shared library symbols:
linux-gate.so.1: No such file or directory.
name of display: :0.0
libGL: XF86DRIGetClientDriverName: 4.0.1 r200 (screen 0)
libGL: OpenDriver: trying /usr/X11R6/lib/modules/dri/tls/r200_dri.so
Error while reading shared library symbols:
linux-gate.so.1: No such file or directory.
(no debugging symbols found)...Error while reading shared library symbols:
linux-gate.so.1: No such file or directory.
libGL: XF86DRIGetClientDriverName: 4.0.1 r200 (screen 0)
drmOpenByBusid: busid is PCI:2:0:0
drmOpenDevice: minor is 0
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 6, (OK)
drmOpenByBusid: drmOpenMinor returns 6
drmOpenByBusid: drmGetBusid reports PCI:2:0:0
Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 1077917344 (LWP 3291)]
0x4053a2cb in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so
(gdb) bt
#0  0x4053a2cb in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so
#1  0x40539ff1 in check_os_sse_support ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so
#2  0x40539ea0 in sigill_handler ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so
Previous frame inner to this frame (corrupt stack?)

I am using a kernel.org 2.6.0-test6 kernel with NForce2 AGP and Radeon DRM support compiled in. The hardware is a Radeon 9000 on an MSI K7N2L motherboard with an Athlon XP 2100+ processor.

Please continue in gdb from the SIGFPE to find the actual segfault...

The 2.6.0 kernel is not supported, nor are user-compiled kernels. As Bill said, when you are debugging OpenGL apps in gdb, the first SEGV that occurs is expected, and will always occur. Use "cont" to continue on to the real problem. Is this problem reproducible with the 2.4.x kernel supplied with test2, or with the 2.4.x updates released to rawhide since then?

You mean like this? Not much difference that I can see, so please let me know if I am doing something wrong. Will try 2.4.x now.

(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x4053a2ce in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so
(gdb) bt
#0  0x4053a2ce in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so
#1  0x40539ff1 in check_os_sse_support ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so
#2  0x40539ea0 in sigill_handler ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so
Previous frame inner to this frame (corrupt stack?)

2.4.x works fine. Feel free to close, but I am willing to debug the 2.6.x case.

0x4053a2ce in _mesa_test_os_sse_exception_support ()

That is SSE detection code. It tests for SSE by installing a signal handler, then executing SSE code.
If the processor doesn't support SSE, a signal is triggered and handled properly, then things proceed. However, if you're running it in a debugger, the SEGV will get trapped by gdb instead of going to the Mesa handler. So you have to "cont" to let it be handled, and let the code proceed to the real problem.

You should join the dri-devel.net mailing list and involve them in the discussion, and also join the #dri-devel channel on irc.freenode.net. Both forums will be much more helpful for debugging the problem than using bugzilla alone here. I monitor both the IRC channel and the mailing list, so we can discuss things there, and then provide status results back here for tracking purposes. Sound good?

Okay, will follow up on dri-devel.

FYI, I experience the same problem using arjan's 2.6 kernels and self-compiled 2.6.0-test{5,6} kernels on a Radeon Mobility M6.

A user posted this on dri-devel. Funnily, for him Radeon worked but R128 did not; replacing the XFree DRI drivers with the latest CVS snapshots seemed to work for him after he deleted the original tls-enabled version.

http://marc.theaimsgroup.com/?l=dri-devel&m=106461641128696&w=2

Not an Athlon-specific problem: my notebook has a Pentium-M.

That is a useful datapoint, but only in a limited way, as multiple things were changed to arrive at a working setup. There isn't any way to know what solved the problem. Also, it isn't clear if you mean "latest XFree86.org snapshot, ie: 4.3.99.14" or "latest DRI project dri.sf.net snapshot".

It's recommended to never delete things; instead, move them out of the way if you want to test something. That way they can always be moved back for further tests.

What I need to know is: using the current rawhide XFree86 as supplied by us, if you rename the tls directories (both of them) to tmp-tls and then restart XFree86 and run the apps, does the problem go away?
I need to know if this is a problem that only occurs when TLS is used, or if it is a bug in stock X code which goes away when using XFree86 CVS or DRI CVS code, and in particular exactly what from DRI or XFree86 CVS: the 2D driver? 3D DRI driver? libGL?

We obviously can't wholesale update our driver to an entire new driver codebase; however, if a single bug fix can be isolated, it is possible we may be able to investigate including it. Or if TLS is the problem, we may be able to determine what that is and fix it.

Thanks in advance.

I will try and find the minimal changes required to get DRI working on 2.6 then. Do you want testing against the latest Rawhide XFree86 (-37) or the ones in your yum repo (-39)?

Thanks,
Michel

It does not seem to be TLS: renaming /usr/X11R6/lib/tls and /usr/X11R6/lib/modules/dri/tls to tls-bak still results in segfaults. Renaming them and then replacing dri/radeon_dri.so and drivers/{ati,radeon}_drv.o with drivers from the latest dri.sf.net snapshot gets glxinfo working, but without direct rendering.

The segfault thus seems to be caused by something common to both the normal and TLS drivers shipped in Rawhide, but not by TLS per se. What is TLS, actually?

Turns out they still do not support kernel 2.6 either: the installation script provided had a notice to that effect.

I've got the same problem on my laptop.

On RH9:
* With 2.4 kernel, worked without acceleration.
* With 2.6 kernel, worked with acceleration.

Now with FC2, FC3, and Rawhide:
* With shipped 2.4 kernel, works with acceleration.
* With 2.6 kernel, segfault.

So something that came in after RH 9 has broken that...

On RHL 9:
* With 2.4 kernel, 2D and 3D acceleration work fine for me.

On current Fedora Core:
* With 2.4 kernel, 2D and 3D acceleration work fine for me.

I haven't even attempted to boot a 2.6.x kernel, and won't be doing so until rawhide contains a 2.6.x kernel and it is a work priority to even investigate such issues.
We DO NOT support 2.6.x kernels right now. I just want to make that clear so that people understand why I refuse to make ANY 2.6.x kernel related issues a priority. I have only so many hours in a day, and I'm spending them strictly on configurations we plan on shipping in the OS, and that excludes 2.6.x kernels right now.

Above, when I said you must continue past the initial crash, you didn't do it right. Run a GL app, and when the following occurs:

0x4053a2ce in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r200_dri.so

*after* that, type "cont" in gdb to continue on to the real failure.

Other suggestions that might aid you in troubleshooting and debugging this issue:

- Disable exec-shield, as there is a known issue with DRI modules which prevents them from working properly if exec-shield is enabled.
- Rename the TLS library dir to "tls-disabled" and ditto for the TLS DRI module directory.

Since this issue is strictly related to using a 2.6.x test kernel, I am changing the status to DEFERRED for the time being until we actually have a 2.6.x kernel in our rawhide tree. If the problem still exists then, at some point it will be investigated if someone hasn't already attached a bug fix here or determined what the problem is. I, however, don't have time for hacking on 2.6.x kernel problems, and I'll soon enough be hacking on XFree86 4.3.99.x/4.4.0 problems. ;o)

In the meantime, feel free to debug this further if you wish, and continue to update the report, as that will be useful information later on.

Thanks.

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 1073837376 (LWP 7273)]
0x40156cab in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x40156cae in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
(gdb) cont
Continuing.
Couldn't get registers: No such process.
---------------

It's not TLS anyway, as X still does not run in DRI mode with TLS disabled. Anyway, the problem will probably disappear with 4.4.0, no point spending too much time on it.

It could be anything - NPTL upgrade? What we need is for someone to recompile Rawhide XFree on a RH9 machine and run kernel 2.6 on it (upgrade initscripts and modutils too, otherwise it'll be rather painful). I'll probably try it this weekend if nobody else has done it.

Regards,
Michel

[behdad@mces behdad]$ uname -r
2.6.0-0.test8.1.64
[behdad@mces behdad]$ rpm -q XFree86
XFree86-4.3.0-40
[behdad@mces behdad]$ glxinfo
name of display: :0.0
Segmentation fault
[behdad@mces behdad]$ gdb glxinfo
GNU gdb Red Hat Linux (5.3.90-0.20030710.41rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...(no debugging symbols found)...Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) r
Starting program: /usr/X11R6/bin/glxinfo
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread 16384 (LWP 3683)]
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...name of display: :0.0
(no debugging symbols found)...
Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 16384 (LWP 3683)]
0x00dd2cab in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
(gdb) cont
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00dd2cae in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
(gdb) bt
#0  0x00dd2cae in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
#1  0x00dd29d1 in check_os_sse_support ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
#2  0x00dd2880 in sigill_handler ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
Previous frame inner to this frame (corrupt stack?)
(gdb)quit

[root@mces behdad]# cd /usr/X11R6/lib/modules/dri
[root@mces dri]# mv tls.old tls
[root@mces behdad]# cd /usr/X11R6/lib
[root@mces dri]# mv tls.old tls

[...restart X...]

[behdad@mces behdad]$ glxinfo
name of display: :0.0
Segmentation fault
[behdad@mces behdad]$ gdb glxinfo
GNU gdb Red Hat Linux (5.3.90-0.20030710.41rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...(no debugging symbols found)...Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) r
Starting program: /usr/X11R6/bin/glxinfo
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread 16384 (LWP 4264)]
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...name of display: :0.0
(no debugging symbols found)...
Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 16384 (LWP 4264)]
0x00f50dab in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/radeon_dri.so
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00f50dae in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/radeon_dri.so
(gdb) bt
#0  0x00f50dae in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/radeon_dri.so
#1  0x00f50ad1 in check_os_sse_support ()
   from /usr/X11R6/lib/modules/dri/radeon_dri.so
#2  0x00f50980 in sigill_handler ()
   from /usr/X11R6/lib/modules/dri/radeon_dri.so
Previous frame inner to this frame (corrupt stack?)
(gdb)quit

BTW, turning off exec-shield and exec-shield-randomize in /proc and restarting X does not help either.

More: I ran it again and, when it stopped for SIGFPE, set a breakpoint at the point where it was supposed to segfault, then continued. A disassembly at the point of the SIGSEGV:

Breakpoint 1, 0x401abcae in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
(gdb) disassemble
Dump of assembler code for function _mesa_test_os_sse_exception_support:
0x401abc74 <_mesa_test_os_sse_exception_support+0>:   push   %ebp
0x401abc75 <_mesa_test_os_sse_exception_support+1>:   mov    %esp,%ebp
0x401abc77 <_mesa_test_os_sse_exception_support+3>:   sub    $0x8,%esp
0x401abc7a <_mesa_test_os_sse_exception_support+6>:   stmxcsr 0xfffffffc(%ebp)
0x401abc7e <_mesa_test_os_sse_exception_support+10>:  stmxcsr 0xfffffff8(%ebp)
0x401abc82 <_mesa_test_os_sse_exception_support+14>:  andl   $0xfffffdff,0xfffffff8(%ebp)
0x401abc89 <_mesa_test_os_sse_exception_support+21>:  ldmxcsr 0xfffffff8(%ebp)
0x401abc8d <_mesa_test_os_sse_exception_support+25>:  xorps  %xmm0,%xmm0
0x401abc90 <_mesa_test_os_sse_exception_support+28>:  push   $0x3f800000
0x401abc95 <_mesa_test_os_sse_exception_support+33>:  push   $0x3f800000
0x401abc9a <_mesa_test_os_sse_exception_support+38>:  push   $0x3f800000
0x401abc9f <_mesa_test_os_sse_exception_support+43>:  push   $0x3f800000
0x401abca4 <_mesa_test_os_sse_exception_support+48>:  movups (%esp,1),%xmm1
0x401abca8 <_mesa_test_os_sse_exception_support+52>:  add    $0x20,%esp
0x401abcab <_mesa_test_os_sse_exception_support+55>:  divps  %xmm0,%xmm1
0x401abcae <_mesa_test_os_sse_exception_support+58>:  ldmxcsr 0xfffffffc(%ebp)
0x401abcb2 <_mesa_test_os_sse_exception_support+62>:  leave
0x401abcb3 <_mesa_test_os_sse_exception_support+63>:  ret
0x401abcb4 <_mesa_test_os_sse_exception_support+64>:  nop
0x401abcb5 <_mesa_test_os_sse_exception_support+65>:  nop
0x401abcb6 <_mesa_test_os_sse_exception_support+66>:  nop
0x401abcb7 <_mesa_test_os_sse_exception_support+67>:  nop
0x401abcb8 <_mesa_test_os_sse_exception_support+68>:  nop
0x401abcb9 <_mesa_test_os_sse_exception_support+69>:  nop
0x401abcba <_mesa_test_os_sse_exception_support+70>:  nop
0x401abcbb <_mesa_test_os_sse_exception_support+71>:  nop
0x401abcbc <_mesa_test_os_sse_exception_support+72>:  nop
0x401abcbd <_mesa_test_os_sse_exception_support+73>:  nop
0x401abcbe <_mesa_test_os_sse_exception_support+74>:  nop
0x401abcbf <_mesa_test_os_sse_exception_support+75>:  nop
End of assembler dump.
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x401abcae in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/radeon_dri.so
(gdb)

And one of those weird experiences (which is quite common): setting a breakpoint *before* starting the run makes it run smoothly, without even the expected SIGFPE, but of course with no direct rendering. (It does not matter whether the breakpoint is at the place where SIGFPE was expected or at the SIGSEGV place.)

(gdb) b *0x401abcae
Breakpoint 1 at 0x401abcae
(gdb) r
Starting program: /usr/X11R6/bin/glxinfo
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread 16384 (LWP 5353)]
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...name of display: :0.0
display: :0  screen: 0
direct rendering: No
server glx vendor string: SGI
server glx version string: 1.2
server glx extensions:
    GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_EXT_import_context
client glx vendor string: SGI
client glx version string: 1.2
client glx extensions:
    GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_EXT_import_context
GLX extensions:
    GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_EXT_import_context
OpenGL vendor string: Mesa project: www.mesa3d.org
OpenGL renderer string: Mesa GLX Indirect
OpenGL version string: 1.3 Mesa 4.0.4
OpenGL extensions:
    GL_ARB_imaging, GL_ARB_multitexture, GL_ARB_texture_border_clamp,
    GL_ARB_texture_cube_map, GL_ARB_texture_env_add,
    GL_ARB_texture_env_combine, GL_ARB_texture_env_dot3,
    GL_ARB_transpose_matrix, GL_EXT_abgr, GL_EXT_blend_color,
    GL_EXT_blend_minmax, GL_EXT_blend_subtract, GL_EXT_texture_env_add,
    GL_EXT_texture_env_combine, GL_EXT_texture_env_dot3,
    GL_EXT_texture_lod_bias
glu version: 1.3
glu extensions:
    GLU_EXT_nurbs_tessellator, GLU_EXT_object_space_tess

   visual  x  bf lv rg d st colorbuffer ax dp st accumbuffer  ms  cav
 id dep cl sp sz l  ci b ro  r  g  b  a bf th cl  r  g  b  a ns b eat
----------------------------------------------------------------------
0x23 24 tc  0 24  0 r  .  .  8  8  8  8  0 24  0  0  0  0  0  0 0 None
0x24 24 tc  0 24  0 r  .  .  8  8  8  8  0 24  8  0  0  0  0  0 0 None
0x25 24 tc  0 24  0 r  .  .  8  8  8  8  0 24  0 16 16 16 16  0 0 Slow
0x26 24 tc  0 24  0 r  .  .  8  8  8  8  0 24  8 16 16 16 16  0 0 Slow
0x27 24 tc  0 24  0 r  y  .  8  8  8  8  0 24  0  0  0  0  0  0 0 None
0x28 24 tc  0 24  0 r  y  .  8  8  8  8  0 24  8  0  0  0  0  0 0 None
0x29 24 tc  0 24  0 r  y  .  8  8  8  8  0 24  0 16 16 16 16  0 0 Slow
0x2a 24 tc  0 24  0 r  y  .  8  8  8  8  0 24  8 16 16 16 16  0 0 Slow
0x2b 24 dc  0 24  0 r  .  .  8  8  8  8  0 24  0  0  0  0  0  0 0 None
0x2c 24 dc  0 24  0 r  .  .  8  8  8  8  0 24  8  0  0  0  0  0 0 None
0x2d 24 dc  0 24  0 r  .  .  8  8  8  8  0 24  0 16 16 16 16  0 0 Slow
0x2e 24 dc  0 24  0 r  .  .  8  8  8  8  0 24  8 16 16 16 16  0 0 Slow
0x2f 24 dc  0 24  0 r  y  .  8  8  8  8  0 24  0  0  0  0  0  0 0 None
0x30 24 dc  0 24  0 r  y  .  8  8  8  8  0 24  8  0  0  0  0  0 0 None
0x31 24 dc  0 24  0 r  y  .  8  8  8  8  0 24  0 16 16 16 16  0 0 Slow
0x32 24 dc  0 24  0 r  y  .  8  8  8  8  0 24  8 16 16 16 16  0 0 Slow

Program exited normally.
(gdb)

*** Bug 107829 has been marked as a duplicate of this bug. ***

Here's an update on this issue:

The 2.6.x kernel and libGL are incompatible with each other due to the following patch:

XFree86-4.3.0-Mesa-SSE-fixes-from-MesaCVS.patch

That patch, taken from Mesa CVS, fixes a rather important bug in SSE detection. Without it, many people using Intel CPUs will be unable to use 3D acceleration at all with our supplied setup, due to what I believe is a CPU bug. The patch works around that problem; however, something in it causes a problem with 2.6.x kernels.

Fedora Core 1 will ship with a 2.4 kernel, and that's what will be supported, so our XFree86 and Mesa must be compatible with that, and this bug fix is important for users using the OS with the supplied 2.4 kernel.
Since I'm not yet working on 2.6.x compatibility issues, this problem is deferred at least until 2.6.x is officially included in rawhide. Once that occurs I'll reopen this issue and investigate it at that time.

In the meantime, anyone interested in volunteering to troubleshoot this and determine what the problem is, and possibly provide a patch to fix it that doesn't break 2.4.x kernel behaviour at all or disable the functionality of the existing patch, is encouraged and welcome to give it a shot.

Well, that was sure fast... Linus just sent details of this problem along with the fix to use. I've added his proposed fix to 4.3.0-42, and this problem should be fixed in the next rawhide build.

For those interested in the technical explanation behind this problem, please see bug #107932.

Very glad to get a fix for this, as I know I wouldn't have been able to dedicate any time to it between now and our final release of Fedora Core 1.

Be sure to put Linus on your Christmas card lists. ;o)

Everyone on this Cc list should pool and buy him something nice :)

Now that I've said it, maybe Amazon will try and patent it too
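
For readers who want to see the detection technique described earlier in this thread (install a signal handler, execute an SSE instruction, and see whether a signal arrives), here is a minimal sketch in C. It is not Mesa's actual code and it does not reproduce the bug or Linus's fix. It covers only the SIGILL part of the check, not the SIGFPE exception test visible in the disassembly. All function names below are hypothetical, and the inline assembly assumes a GCC-compatible compiler on x86.

```c
#define _POSIX_C_SOURCE 200809L  /* for sigaction and sigsetjmp */

#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Jump target used to escape from the handler back into the probe. */
static sigjmp_buf probe_env;

/* Runs when the probed code raises SIGILL; abandons the probed code. */
static void probe_sigill_handler(int signo)
{
    siglongjmp(probe_env, signo);
}

/* Hypothetical helper: returns 1 if fn() ran to completion,
 * 0 if it raised SIGILL (or the handler could not be installed). */
static int instruction_is_usable(void (*fn)(void))
{
    struct sigaction sa, old;
    volatile int usable = 0;

    assert(fn != NULL);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_sigill_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0)
        return 0;

    /* sigsetjmp(..., 1) saves the signal mask, so SIGILL is unblocked
     * again after the handler jumps back here. */
    if (sigsetjmp(probe_env, 1) == 0) {
        fn();
        usable = 1;
    }

    sigaction(SIGILL, &old, NULL);  /* restore the previous handler */
    return usable;
}

/* Stand-in for a CPU without the instruction: raises SIGILL directly. */
static void simulate_unsupported_instruction(void)
{
    raise(SIGILL);
}

#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE__))
/* Executes one SSE instruction; faults with SIGILL if SSE is unusable. */
static void exec_sse_instruction(void)
{
    __asm__ volatile("xorps %%xmm0, %%xmm0" : : : "xmm0");
}
#endif

/* Returns 1 or 0 on x86 builds, -1 where this sketch has no probe. */
static int sse_is_usable(void)
{
#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE__))
    return instruction_is_usable(exec_sse_instruction);
#else
    return -1;
#endif
}
```

Calling `sse_is_usable()` from a normal process lets the handler absorb the signal silently. Under gdb the debugger sees the signal first, which is why the thread keeps saying to type "cont"; gdb's `handle SIGILL nostop pass` is one way to have it forwarded without stopping.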