From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030703

Description of problem:
DRM is loaded...

eth0: no IPv6 routers present
[drm] Initialized r128 2.5.0 20030725 on minor 0
cdrom: This disc doesn't have any tracks I recognize!
[jonsmirl@smirl fbdev-2.5]$

But all DRI programs Segfault....

[jonsmirl@smirl fbdev-2.5]$ gdb glxinfo
GNU gdb Red Hat Linux (5.3.90-0.20030710.29rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...(no debugging symbols found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run
Starting program: /usr/X11R6/bin/glxinfo
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread 1077937824 (LWP 10135)]
(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...name of display: :0.0
(no debugging symbols found)...
Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 1077937824 (LWP 10135)]
0x405379cb in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r128_dri.so
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x405379ce in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r128_dri.so
(gdb)

I am using current Linus 2.6 tree for DRM drivers. Radeon runs without problem.

2.8Ghz P4
Hyperthreading turned on
RHN account jonsmirl for hardware profile

Version-Release number of selected component (if applicable):
XFree86-4.3.0-32

How reproducible:
Always

Steps to Reproduce:
1. use a R128
2. run a 3D program
3.

Additional info:
>Program received signal SIGSEGV, Segmentation fault.
>0x405379ce in _mesa_test_os_sse_exception_support ()
> from /usr/X11R6/lib/modules/dri/tls/r128_dri.so

The above SEGV will happen when you run any OpenGL application in a debugger, as Mesa is testing for SSE exception support, which causes a SEGV on purpose. When not in a debugger, the SEGV is handled by the library, and it proceeds normally. When in the debugger, the debugger gets the SEGV instead of Mesa, so you need to do "cont" in gdb to let it proceed to the real problem. Please do this and provide an updated backtrace; note, however, that a backtrace without debugging symbols is pretty useless.

I suspect the problem you're having is due to the combination of DRI plus exec-shield. The DRI code allocates memory, drops code into it, and executes it. This worked before because malloc'd memory coincidentally happened to be executable by default, so programmers just assumed memory from malloc() is always executable and didn't actually _request_ executable memory. The breakage is due to DRI not calling mprotect() with PROT_EXEC as a parameter to mark the memory region as executable before trying to execute code in it.

The reason this problem happens now all of a sudden, and didn't in the past, is that Red Hat has written a new security enhancement called exec-shield, which disables the executable stack and also disables executable memory by default. This causes many programs that are not programmed correctly to break, because they don't use PROT_EXEC.

If you disable exec-shield and this problem goes away, then we know it is the same problem you are experiencing.

echo 0 >/proc/sys/kernel/exec-shield

Please let me know if that works around the problem. If so, we are working on fixing the broken DRI code right now; however, it might be a while before an update is available. If the problem occurs whether or not exec-shield is enabled, then it is probably some other bug.
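As an illustration of the PROT_EXEC point in the comment above, here is a minimal sketch of requesting executable memory explicitly instead of relying on malloc() memory happening to be executable. It is not the DRI or Mesa code: make_executable_copy() is a hypothetical helper name, and the sketch assumes a Linux/POSIX system where mmap() with MAP_ANONYMOUS is available.

```c
/* Sketch only: make_executable_copy() is a hypothetical helper,
 * not code taken from DRI or Mesa. */
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copy len bytes of generated machine code into a fresh mapping and
 * mark that mapping executable. Returns NULL on failure. */
void *make_executable_copy(const void *code, size_t len)
{
    void *buf;

    assert(code != NULL);

    /* Start readable and writable but NOT executable, which is how
     * ordinary heap memory behaves once exec-shield is enabled. */
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, code, len);

    /* Explicitly request execute permission before jumping into it. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

A caller would cast the returned pointer to a function pointer of the right type and call it, then release it with munmap(). Architectures other than x86 may also need an instruction-cache flush before the call; that step is omitted here.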
They segfault without the debugger too. Is there some way I can give you more info on the segfault? I am on 2.6-test5 + Linus bk kernel. I don't think I have exec-shield. /proc/sys/kernel/exec-shield doesn't exist. I am able to run DRI on my Radeon card without problem on same hardware and OS install. Also, this is a PCI Rage128 02:02.0 VGA compatible controller: ATI Technologies Inc Rage 128 PD/PRO TMDS
I'm not getting anywhere trying to continue from gdb...

[root@smirl log]# gdb glxinfo
GNU gdb Red Hat Linux (5.3.90-0.20030710.29rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...(no debugging symbols found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) run
Starting program: /usr/X11R6/bin/glxinfo
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread 1077937824 (LWP 3228)]
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...name of display: :0.0
(no debugging symbols found)...
Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 1077937824 (LWP 3228)]
0x405379cb in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r128_dri.so
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x405379ce in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r128_dri.so
(gdb) bt
#0 0x405379ce in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r128_dri.so
#1 0x405376f1 in check_os_sse_support ()
   from /usr/X11R6/lib/modules/dri/tls/r128_dri.so
Previous frame inner to this frame (corrupt stack?)
(gdb) cont
Continuing.
Couldn't get registers: No such process.
(gdb) bt
#0 0x405379ce in _mesa_test_os_sse_exception_support ()
   from /usr/X11R6/lib/modules/dri/tls/r128_dri.so
#1 0x405376f1 in check_os_sse_support ()
   from /usr/X11R6/lib/modules/dri/tls/r128_dri.so
(gdb)
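The SIGFPE and SIGSEGV in the trace above are raised inside a probe function, which matches the earlier explanation that the library triggers a fault on purpose and catches it itself. The sketch below shows the general shape of such a probe under loud assumptions: probe_signal_caught() and probe_handler() are hypothetical names, and raise() stands in for the faulting SSE instruction that the real Mesa code executes. It is not Mesa's implementation.

```c
/* Sketch only: hypothetical names; raise() stands in for the
 * deliberately faulting instruction used by a real probe. */
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf probe_env;

static void probe_handler(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);    /* unwind back into the probe */
}

/* Returns 1 if our handler caught sig, 0 if the signal never arrived,
 * and -1 if the handler could not be installed. */
int probe_signal_caught(int sig)
{
    struct sigaction sa, old;
    int caught = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, &old) != 0)
        return -1;

    if (sigsetjmp(probe_env, 1) == 0)
        raise(sig);              /* the intentional fault */
    else
        caught = 1;              /* reached via siglongjmp */

    sigaction(sig, &old, NULL);  /* restore the previous handler */
    return caught;
}
```

Run normally, the handler absorbs the signal and the program carries on. Under gdb, the debugger stops on the signal first, which is why "cont" is needed to get past the probe.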
I somehow missed the fact that you're using a non-Red Hat kernel. We don't support user-compiled kernels. I'm adding Arjan and Dave to CC in case either is interested in or willing to investigate the 2.6 DRM issue. If not, though, I'll have to close this as NOTABUG, as we don't support your kernel; in that case your bug should be reported on lkml and/or dri-devel. I neither use nor test 2.6.x, as my hands are more than full with 2.4.x and the existing issues I need to support. Any comments, Arjan/Dave?
I have verified that R128 works on 2.4 2051 SMP and UNI. But it is definitely broken on 2.6 SMP.
Sounds to me like a bug in 2.6.x kernel DRM for r128.
r128 worked fine on 2.6 until I upgraded to RawHide. What about the changes that were made to build this? /usr/X11R6/lib/modules/dri/tls/r128_dri.so Can I force it to use non-tls version on SMP box?
Even if the problem is in XFree86, 2.6.x is not supported, so it's a very low priority to even investigate. It could be the TLS changes, or any number of other things, or it could be a bug in the 2.6 kernel you're using. Either way, its unsupported status makes investigating the issue a low priority unless you can reproduce it on the kernel we ship.

Feel free, however, to debug the issue if you like. If you find the cause is a bug in our XFree86, and can isolate it or pinpoint it any further, then there might be something I can do about it. My priority, however, is first and foremost getting XFree86 fixed and working on what we will be shipping and supporting, so any 2.6.x kernel related issues are low priority. I just don't have the engineering time to support 2 kernels, nor user-compiled kernels. HTH
DRI CVS r128_dri.so works with the 2.6 kernel DRM driver so it's definitely the r128_dri.so files in the Rawhide build that are the culprit. I checked both the tls and non-tls versions and both are broken on 2.6. So this doesn't look like a TLS issue. The Radeon dri.so files from Rawhide work on 2.6 without problem.
Feel free to attach patches that fix the problem for you with 2.6.x kernels, and I'll review them for integration if you like.
Deferring for investigation once kernel 2.6.x gets integrated into the distribution officially. If anyone has patches for X or for the kernel which fix this problem in the mean time, feel free to attach them and reopen this for review earlier.
Gentoo has hit a similar problem. The fix may be the same: http://bugs.gentoo.org/show_bug.cgi?id=30541
Could be perhaps... however the proposed Gentoo solution is "break the system for many 2.4.x Pentium III/IV users by disabling the patch that fixes the problem for them" so that 2.6.x works for the relative few people brave enough to experiment with it. I won't even consider any solution that regresses existing behaviour in a supported configuration. We don't support the 2.6.x kernel until it is officially released, and in rawhide at least. If people want to submit patches in the mean time, which fix bugs without creating regressions, feel free to chase the bugs. ;o)
A minor correction to the Gentoo solution. We perform a compile-time check for a 2.4 kernel, and if that is the case, apply the patch. Otherwise, the patch isn't applied. Yes this screws things up when moving from a 2.4 to a 2.6 kernel, but it works for 2.6 people and it works for 2.4 people. After discussing some with Mike, a runtime hack would probably be best.
Just to clarify from our discussion... I believe that a runtime check is superior to a compile time check, as it should then just work no matter what the kernel is on the system. Also, there is never any guarantee whatsoever that the kernel installed and running on the machine *compiling* the software is the kernel which will be used on the machine *running* the software. For example, our buildsystems are all running one Linux 2.4.x kernel or another, and probably will be for the foreseeable future. Our buildsystems don't necessarily reflect the kernel the user is using to run the software. In general, compile time checks like that are very bad, at least for a software distributor, for this reason alone. So there's pretty much no way I'd even remotely consider doing a compile time hack like that.

While I think doing a runtime hack is a bit better, it's only marginally better, as it ignores the *real* problem and just bandaids over it, which usually has the side effect of nobody ever caring to fix the real problem, as the ugly hacks end up being deemed "good enough".

As such, I won't apply a compile time nor a runtime hack for this to our XFree86 package, as I'd like to see the issue fixed properly by someone at some point in time, and not just permanently bandaided over. That someone could even perhaps be me, but it isn't a priority until we're shipping the 2.6.x kernel, and by that time we'll more likely than not also be shipping XFree86 4.4.0, and I've got a feeling the problem will have just magically disappeared by then. ;o)

That said, it would be useful to know either way whether disabling the Mesa patch does in fact work around the specific problem reported in this bug report. Jon, if you can test and confirm that, it might help to quantify the extent of the problem along with other bug reports, and may justify allocation of resources sooner rather than later.

TIA
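To make the compile-time versus runtime distinction concrete, here is a minimal sketch of what a runtime kernel check could look like. The thread never shows the actual check that was discussed, so everything here is an assumption: parse_kernel_release() and running_kernel_at_least() are hypothetical names, and the sketch simply reads the running kernel's release string through uname(). Note that the comment above rejects both kinds of hack in favour of fixing the underlying bug.

```c
/* Sketch only: hypothetical helpers, not the check that was
 * proposed for Gentoo or XFree86. */
#include <stdio.h>
#include <sys/utsname.h>

/* Parse "MAJOR.MINOR" from the start of a release string such as
 * "2.6.0-test5". Returns 1 on success, 0 otherwise. */
int parse_kernel_release(const char *release, int *major, int *minor)
{
    return sscanf(release, "%d.%d", major, minor) == 2;
}

/* Asks the kernel that is RUNNING now, not the one that was present
 * when the software was compiled.
 * Returns 1 if it is at least major.minor, 0 if older, -1 on error. */
int running_kernel_at_least(int major, int minor)
{
    struct utsname u;
    int kmaj, kmin;

    if (uname(&u) != 0 || !parse_kernel_release(u.release, &kmaj, &kmin))
        return -1;
    return kmaj > major || (kmaj == major && kmin >= minor);
}
```

Because the decision is made on the machine that runs the binary, a package built on a 2.4.x build system would still take the 2.6 path when installed on a 2.6 machine, which is the property a compile-time check cannot offer.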
This comment is a bit offtopic, but something to think about. Mike, you tell us that RedHat will *not* support 2.6 as long as you don't include it at least in rawhide. Ok, from a pure business point of view. But remember why Linus states he will call beta versions of the kernel with an even number suffixed with testx. He wants to get testers. Many testers. Testers which won't test the software as long as they expect too much trouble with testing. And then RedHat, which, in some sense, depends on this kernel development work, tells the community that RedHat sees that this DRI problem is a bug in RedHat's XFree rpm but RedHat does not care. Reading something like: "Build your own XFree rpms and backout that patch if you need to. We don't care ..." Don't you think that this will prevent users from using 2.6.0-test? Using 2.6.0-test to do their usual work. And while doing this find some problems with it. Help to eliminate those problems. ... and then run against a wall at RedHat. I don't want that you _support_ 2.6 within a given time to fix or something like that. But you could help development with not simply rejecting such requests. Greetings Mathias Fröhlich
>Mike, you tell us that RedHat will *not* support 2.6 as long as you
>don't include it at least in rawhide. Ok, from a pure business point
>of view.

No. That really has nothing to do with business. We have designed and tested the entire Fedora Core 1 OS around the 2.4 kernel. Arjan has produced unofficial rpms of the 2.6.x kernel, both so that people who want them have something easy and prebuilt to play with, and because he is a kernel developer who wants people to test 2.6.x earlier and help get the upstream 2.6.x kernel stabilized sooner.

John and I are but 2 developers here with finite time, and we don't have the time to support XFree86 on 2 completely different kernels. I could easily spend 2 weeks or more of my time right now trying to fix known problems in the XFree86 4.3.0 code which will happen on 2.6.x kernels. That time would be mostly wasted, since we will more likely than not be shipping XFree86 4.4.0 with the next OS release along with a 2.6.x kernel. Either 4.4.0 will not have these problems, hopefully due to being fixed upstream, or they'll be a priority for us to fix in the development of Fedora Core 2.

There is a *FINITE* amount of time for me to fix bugs before Fedora Core 1 ships. Now, I could waste that time installing a 2.6.x kernel, possibly frying my working system and having to reinstall the OS, and then debugging the various reported issues, as well as working on other known issues that XFree86 has with newer kernels which users haven't run into or noticed yet. *OR* I could fix bugs and problems that are HIGH PRIORITY *MUSTFIX* bugs for this release, and other high priority bugs that affect many thousands more users than any 2.6.x problem.

What is my choice? My choice is to keep my job. And to fix as many problems as possible for the Fedora release that affect the greatest number of users who are actually using the OS as shipped, not with unsupported add-ons.
>But remember why Linus states he will call beta versions of the
>kernel with an even number suffixed with testx. He wants to get
>testers. Many testers. Testers which won't test the software as
>long as they expect too much trouble with testing.

So what? That means I should drop what I'm doing, tell my manager "I'm sorry I can't fix this MUSTFIX bug and do the other high priority tasks I've been assigned, I have to work on 2.6.x kernel issues. Yes, I know we're not shipping a 2.6.x kernel, but there are 3 people using 2.6.x who are pissed off. Yes I know it makes no sense for me to ignore high priority bugs that affect hundreds of thousands or more users, and instead fix bugs that could really wait for months before they get fixed, but I don't want to piss off 2.6.x users."? Um no.

>And then RedHat, which, in some sense, depends on this kernel
>development work, tells the community that RedHat sees that this DRI
>problem is a bug in RedHat's XFree rpm but RedHat does not care.
>Reading something like: "Build your own XFree rpms and backout that
>patch if you need to. We don't care ..."

This is a crock. This "problem" was caused by a patch written by Linus in the *first* place, which fixed a Mesa bug that caused DRI to fail on a large number of Pentium 4 processors. I don't recall the exact details of this problem, but the number of users that would be affected should Linus' patch not be applied would in any case be far greater than the number of people using 2.6.x test kernels. It is FAR more important to fix serious bugs that affect large numbers of users' systems with SUPPORTED SETUPS than to throw away the fix and make an unsupported experimental setup work, REGARDLESS of what the cause of the problem is in the unsupported setup.

>"Build your own XFree rpms and backout that patch if you need to.
>We don't care ..."

I really don't like your ignorant attitude. This has nothing to do with whether I care (or Red Hat for that matter).
I *DO* care, but there are thousands of things I care about, and the amount of time I have to DO SOMETHING about those THOUSAND things is LIMITED. VERY LIMITED in fact. I *MUST* prioritize the most critically important work FIRST _PERIOD_. And that means shipping an OS release which works with the supplied kernel out of the box, and doesn't break 3D acceleration on half of the Pentium 4 processors out there in order to fix a problem with 2.6.x kernels.

This is all about PRIORITIZATION, and nothing to do with not caring.

>Don't you think that this will prevent users from using 2.6.0-test?

No. For 3 reasons.

1) Not everyone needs 3D acceleration in order to test a kernel. I mean really, come on.
2) People who want to experiment with the kernel can also very easily rebuild the XFree86 rpm if they care that much.
3) The current RPM contains an updated patch which Linus sent to me which should fix this problem anyway.

Yes, that's right. The current Fedora XFree86 4.3.0-42 has FIXED this issue with information sent from Linus. So your pointless negative comments are 100% totally in vain anyway.

Why is this FIXED in Fedora Core 1? Specifically because I *DO* care. I just did not have ANY time to work on 2.6.x kernel related problems, as that complicates my workload significantly. Once Linus pointed out the flaw in his original patch which we were shipping, I was able to fix the issue in less than 5 minutes, which was trivial and definitely worth doing. Without Linus's information I could easily have spent a day or more on this, depending on how long it would have taken me to set up a test environment, reproduce the problem, and then debug it right down to the assembly language level, since this problem was directly in hand-coded assembler in Mesa.
In the last few weeks of an OS release, I have zero time; in fact I have to spend my own personal time volunteering to fix things, because I don't have time to accomplish all of the tasks on my plate if I were to just work 40 hours and then stop. While I wish I could work 24 hours a day and not sleep or eat, and while that could perhaps allow me to fix 2.6.x kernel bugs, I'm unfortunately human and mortal, and I must sleep, eat and do other things. That unfortunately calls for prioritization of work to the most important things, and that isn't the 2.6.x kernel where XFree86 is concerned.

>Using 2.6.0-test to do their usual work. And while doing this find
>some problems with it. Help to eliminate those problems.
>... and then run against a wall at RedHat.

The world is full of software problems. Unfortunately neither Red Hat nor any other distribution or developer out there can fix every bug or class of bug for every user within a certain small period of time. It's just not realistic. I fix your 2.6.x kernel bug, I don't get other work done, then I have 200000 angry Pentium 4 users telling me DRI doesn't work all of a sudden in an XFree86 update. What do I tell them? Well, fortunately I don't have to worry about that, because of the fix that is now in my latest 4.3.0 for this very problem, but the same rules apply to other bug reports too.

If users are that interested in a bug getting fixed, and they are clearly told the bug is dwarfed in priority by other high priority things that MUST get done before developer time can be assigned to their issue, there is one thing the user can do to increase the priority of their bug report, and that is to do exactly what Linus was so very kind to do, and which I appreciate very much - supply a patch that fixes the bug. In Linus' case he didn't supply a patch, but instead described exactly what the problem was, and the obvious fix. Then it took a few minutes to generate the patch and add it.
So really, you can vote on bug priorities by supplying patches and doing developmental work yourself too.

>I don't want that you _support_ 2.6 within a given time to fix or
>something like that. But you could help development with not simply
>rejecting such requests.

I DID *NOT* *REJECT* the request. I stated IMHO quite clearly that it was totally unacceptable to fix this problem for 2.6.x kernels by officially disabling the bug fix for P4 processors, because that would be trading one semi-problematic bug with 2.6.x kernels for a bug that would be catastrophic for an extremely large number of P4 users using our official 2.anything kernels. That is an unacceptable solution, and a regression, period. You just don't throw away a working fix for something and regress back to something older like that. You fix BOTH problems. However, that takes TIME, and time doesn't come for free. It is a limited and finite resource, and being such, prioritization is what decides when time will be allocated. That time would have come sometime AFTER the Fedora Core release, since this problem was NOT a showstopper by any stretch of the imagination.

If I sound a bit upset about this, I am, and for multiple reasons:

1) This problem *IS* fixed already and has been for over a week. Someone tracking Fedora development and updating XFree86 would notice this. I also clearly documented it in the changelog as always.
2) I'm tired of people thinking I have infinite time, and that their individual problem is more important than all 400 bugs I have open in bugzilla at any given time, and more important than other development and work that I have to do.
3) People are very thankless sometimes, and if they don't get their way immediately they complain and are very negative. They also rarely ever listen to reason and rationale.

> But you could help development with not simply rejecting such
> requests.
I did not reject ANYTHING other than the poor solution offered above of disabling the patch, which is totally unacceptable. I "DEFERRED" the problem until the high priority MUST_FIX work, and the other priorities that I have to accomplish, were done. Those priorities are much higher than this or any 2.6.x kernel related bug, and I'll have higher priorities for at least a few more weeks, at which time I will reprioritize all the work on my plate again.

I'm offended by your implication that I don't help development either. It is precisely my decision to properly prioritize the work that I do which DOES help development.
* Mon Oct 20 2003 Mike A. Harris <mharris> 4.3.0-42
- This release is the long awaited answer to the meaning of life, the universe and everything.
- Added XFree86-4.3.0-redhat-exec-shield-GNU-stack.patch to make the complete XFree86 build including Mesa et al. exec-shield friendly (arjanv, mharris)
- Updated to new XFree86-4.3.0-Mesa-SSE-fixes-from-MesaCVS-v2.patch which should fix compatibility problems between DRI and 2.6.x kernels which were caused by the previous version of this patch. Linus reported the fix for this with details of the problem, and explanation of the solution, which I extracted out of CVS (#107932,106566,107829)

Because I do care enough to fix things on a prioritized basis in comparison with all other things, and since this issue had its priority raised dramatically when the fix for the problem was provided, this bug has been fixed for 1.5 weeks.

There may be various other duplicate bugs in bugzilla that are still in DEFERRED or open status, as I didn't have time to scan for duplicates and close them all.

Closing bug as a duplicate of bug 107932, which contains the details of the *real* problem and the real fix.

*** This bug has been marked as a duplicate of 107932 ***
Read what you have written before. That sounds like you don't. It was the way you repeated multiple times that "you _will_not_support_ 2.6 <FULLSTOP>" which made me think that you/your manager/RedHat are/is not interested in fixing a 2.6 issue. And this was what I could not believe, for the given reasons. If you care, it's ok. Greetings Mathias
No, you misunderstand. "We won't be supporting 2.6.x in Fedora Core 1" and "we don't care at all whatsoever about 2.6.x related bugs, please go away" are two completely different meanings. You implied the latter meaning, which is VERY incorrect. The former meaning is the correct one.

We definitely do not support 2.6.x in Fedora Core 1. Does this mean we don't care if 2.6.x works or not at all? No, not at all. Does this mean that since it is not supported, 2.6.x issues will not be given the highest possible priority? Yes. We just care _more_ about other issues that aren't 2.6.x kernel related, but are very important to Fedora Core 1. It's kind of a moot point though, since Fedora Core 1 is frozen now for final release in a few days or so.

I just reread every comment I posted above, and I feel that I correctly stated several times why this issue was not of high priority, and that it would become higher priority once Fedora Core 1 was released. I don't think I communicated anywhere above that we don't care about 2.6.x, as that would be very far from the truth in every way. I suggest you re-read each of my comments and try to see that I've said these things all along in a fairly clear manner, even pointing out that the issue could have its priority raised if someone else volunteered to find the exact problem and fix it, and include a patch. Perhaps Linus was the only one who saw the true meaning? ;o)
Ok, to me it sounds more like the second meaning. And yes I see that it's also possible to read the first one. Let's stop this here. So, please excuse if you felt offended. Greetings Mathias
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.