Bug 90615
Summary: | X spontaneously exits | |
---|---|---|---
Product: | [Retired] Red Hat Linux | Reporter: | Need Real Name <dgl>
Component: | XFree86 | Assignee: | X/OpenGL Maintenance List <xgl-maint>
Status: | CLOSED CURRENTRELEASE | QA Contact: | David Lawrence <dkl>
Severity: | high | Priority: | low
Version: | 9 | CC: | ivan.makfinsky, menscher, plazonic
Hardware: | i686 | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2004-10-01 05:02:15 UTC

Description
Need Real Name
2003-05-11 03:42:25 UTC
Created attachment 91603
The beginning and end of XFree86.0.log after the crash
Created attachment 91604
XF86Config file

Your bug report doesn't contain any useful information that could be used to determine what the problem is. Also, you are not using a Red Hat supplied kernel. Please install the latest Red Hat official kernel for Red Hat Linux 9 from Red Hat Network by using "up2date -f kernel". While using this kernel, please reproduce the problem, and provide detailed step by step instructions on how someone else can easily reproduce this. Then please attach the new X server log file from the failure case running under the official Red Hat kernel, and we can try to investigate the problem further. Thanks in advance.

Also, please attach your /var/log/messages from when this problem occurs, as well as the output of "lsmod" from while X is running.

I'm sure the bug has nothing to do with the kernel ... the first time the problem happened, I was using the official redhat kernel. Since then, I found what I believe is a bug in that kernel (see bug 90462), so I reverted to the 2.4.20 kernel. I am changing back to the official redhat kernel using up2date -f kernel (although I got an error when I did that "error: db4 error(-30989) from dbcursor->c_get: DB_PAGE_NOTFOUND: Requested page not found", I'm hoping it still worked).

Sorry the info I sent you wasn't useful... I was hoping the "bufferglEnable(GL_STENCIL_TEST) but no stencil" message, which repeated 12852 times, might be a clue. I didn't realize the debugging content of the XFree86.0.log file was a function of the kernel. Whenever I'm not using pthreads, I'll use the redhat kernel and see if this problem repeats.

Both times I saw the problem, I was not at my computer when it happened. The second time, I know I had enabled the screen saver... I can't remember if I had enabled it before the first crash, but I think I did. Is it possible for a problem with the screensaver to cause X to crash? If so, is there anything I can do to provide debugging info for the screensaver?

The /var/log/messages file has no more useful info from the time of the crash. The syslogd restarted at 4am. :(

The lsmod output (still from the 2.4.20 kernel) follows:

    [root@cartman dgl]# /sbin/lsmod
    Module                  Size  Used by    Not tainted
    ide-cd                 33668   0  (autoclean)
    cdrom                  33696   0  (autoclean) [ide-cd]
    parport_pc             19044   1  (autoclean)
    lp                      8996   0  (autoclean)
    parport                37056   1  (autoclean) [parport_pc lp]
    autofs                 13364   0  (autoclean) (unused)
    via-rhine              15760   1
    mii                     3912   0  [via-rhine]
    ohci1394               20136   0  (unused)
    ieee1394               47020   0  [ohci1394]
    nls_iso8859-1           3516   1  (autoclean)
    nls_cp437               5116   1  (autoclean)
    vfat                   13068   1  (autoclean)
    fat                    38840   0  (autoclean) [vfat]
    keybdev                 2944   0  (unused)
    mousedev                5492   1
    hid                    22148   0  (unused)
    input                   5728   0  [keybdev mousedev hid]
    usb-uhci               26316   0  (unused)
    ehci-hcd               17480   0  (unused)
    usbcore                77600   1  [hid usb-uhci ehci-hcd]
    ext3                   70144   3
    jbd                    51540   3  [ext3]

I will send the output from the 2.4.20-9 kernel after I reboot. If and when I see the problem again, I'll capture all of the requested information. Perhaps you should change the error output of xfree86 to request /var/log/messages in addition to /var/log/XFree86.0.log if you need that info.

>I'm sure the bug has nothing to do with the kernel ...

The problem you are experiencing may very well have nothing to do with the kernel at all, and it may theoretically occur under all kernels. However, Red Hat only provides support for systems which are running the official Red Hat kernel, and there is indeed a potential that a custom built kernel, a kernel obtained elsewhere, or a kernel which has loaded proprietary or 3rd party kernel modules may be causing system problems of all sorts. As such, when problems do occur on a system, depending on the nature of the problem, users who are not using the official Red Hat kernel at the time the problem occurs may be asked and/or required to reproduce the problem with an official Red Hat kernel.
This is very important, to minimize the problem domain down to the officially supported operating system components.

>(although I got an error when I did that "error: db4 error(-30989)
>from dbcursor->c_get: DB_PAGE_NOTFOUND: Requested page not found", I'm hoping
>it still worked).

That sounds like either a corrupt rpm database, or a bug in rpm. You can obtain help/technical support via the Red Hat mailing lists for that problem if you require assistance, or possibly from your support representative if you have a support contract with us.

>Sorry the info I sent you wasn't useful... I was hoping the
>"bufferglEnable(GL_STENCIL_TEST) but no stencil" message, which repeated 12852
>times, might be a clue.

It may turn out to be useful perhaps, but without additional information it is too early to say for sure. I do not have access to any S3 Savage hardware, so unfortunately I can't just fire up X myself on a Savage and hope to be able to reproduce this problem. Assuming there is an XFree86 bug occurring here, effectively troubleshooting the problem will likely require someone with physical access to the hardware and the ability to debug X related problems to investigate directly. It will also require the problem to be narrowed down to an easily reproducible test case first. I might be able to help narrow it down, but I can't debug the hardware or the driver personally. Someone else will have to do that.

>I didn't realize the debugging content of the
>XFree86.0.log file was a function of the kernel.

I'm not sure what led you to believe that, but the XFree86 log file has nothing to do with the kernel. If you're referring to my request for you to attach /var/log/messages, this is because debugging problems on hardware to which you have no physical access is very highly dependent on receiving as much information about the problematic system as possible.
Every piece of data received contributes to a useful pool of information in which some clues may be provided that can solve the problem. A bug report with little or no concrete information, no debugging, no specific details on how to reproduce it or a 100% reproducible test case, and for which the developers investigating the matter don't have access to the hardware the problem occurs on, is unfortunately next to impossible to even investigate. It's a process of gathering data, refining that data, offering suggestions to the user having the problem on how to further narrow it down, and repeating that process until a hypothesis can be made as to what the real problem is, and perhaps attempting a code change somewhere to try to resolve the issue. If there is something missing from that process, then sometimes all we can really do is wait and hope a future release of XFree86, or of a given video driver, just happens to fix the problem someone is experiencing.

>Is it possible for a problem with the screensaver to cause X to crash?

Theoretically, a bug can occur anywhere in a piece of code such as the X server, so it is definitely possible for some codepath to be buggy enough to crash if the right sequence of events happens. A screensaver could thus theoretically trigger some codepath that other software doesn't trigger, and thus cause a crash to happen. In the context of screensavers, in problems that have been reported in the past, this occurs most frequently with OpenGL screensavers on video hardware for which XFree86 has DRI 3D acceleration support. The simple test for such problems is to disable DRI support and see if the 3D crash problems go away. The savage driver does not contain 3D acceleration support however, so that isn't a possibility here, although your error messages do show 3D related errors.

>If so, is there anything I can do to provide debugging info for the
>screensaver?

If the X server is crashing, then it isn't a problem with the screensaver; it could be a number of things, possibly 2D acceleration problems, possibly hardware problems, or a variety of other things. More information is required to really be able to make a solid assessment of the problem. Again, without physical hardware access, as many as possible of the other things which could be causing the problem need to be ruled out first.

Since these types of bug reports can often end up remaining open for very long periods of time due to the various difficulties mentioned above in trying to find a solution, you may wish to try to find a workaround in the interim which may be adequate for now, until the specific nature of the problem you are experiencing is understood and can be explored more deeply.

Here are some suggestions which you can try out and which may or may not be adequate workarounds. They may also provide valuable clues as to what the problem might be:

- Try disabling 2D acceleration by using Option "noaccel" and/or by experimenting with the various XaaNo options described on the XF86Config manpage.
- Disable all OpenGL screen savers completely, or even disable the screensaver itself entirely. Or, pick a single screensaver out of the bunch, instead of "random".
- Try using the "vesa" driver, which is unaccelerated and slow.

Those are some options you can try, at least, which may work around the problem if this is really an XFree86 bug. Some of the options may also work around hardware flaws, bad video memory, and other possible problems.

Please provide any updated info you can over time, and hopefully we can narrow things down and come up with a better assessment of the problem and possible fixes. Thanks.

>The /var/log/messages file has no more useful info from the time of the crash.
You don't know what I am looking for in the output of these files. It is easier
for you to attach them and let me determine if they contain information useful
to me for troubleshooting purposes, than it is for me to explain the many
things that I might be looking for. When in doubt, attach more information
and let developers work it out. ;o)
Also, when I'm requesting such information, I want only information such as
lsmod and /var/log/messages obtained while you are booted into a Red Hat
supplied official kernel. This is very important.
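
The request above (kernel version, lsmod output, /var/log/messages and the X server log, all captured under the same boot) can be bundled into one text file so nothing goes missing after a crash. The following is only a minimal sketch in Python: the helper names `run_command`, `read_file` and `build_report` are hypothetical, and the command list and file paths are simply the ones mentioned in this report. Reading /var/log/messages usually needs root.

```python
import subprocess


def run_command(cmd):
    """Return a command's combined stdout/stderr, or a note if it cannot be run."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        return result.stdout
    except OSError as exc:
        return "could not run {}: {}\n".format(" ".join(cmd), exc)


def read_file(path):
    """Return the whole file as text, or a note if it cannot be read."""
    try:
        with open(path, errors="replace") as handle:
            return handle.read()
    except OSError as exc:
        return "could not read {}: {}\n".format(path, exc)


def build_report(commands, files):
    """Concatenate command output and file contents under labelled headers."""
    sections = []
    for cmd in commands:
        sections.append("== $ {} ==\n{}".format(" ".join(cmd), run_command(cmd)))
    for path in files:
        sections.append("== {} ==\n{}".format(path, read_file(path)))
    return "\n".join(sections)


if __name__ == "__main__":
    # Paths and commands are the ones named in this bug report.
    print(build_report(
        commands=[["uname", "-r"], ["/sbin/lsmod"]],
        files=["/var/log/messages", "/var/log/XFree86.0.log"],
    ))
```

Redirecting the output to a file right after a crash, before syslogd rotates the logs, gives a single attachment for the bug report.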

Here is the output from /sbin/lsmod with the redhat kernel running:

    [dgl@cartman dgl]$ /sbin/lsmod
    Module                  Size  Used by    Not tainted
    ide-cd                 35708   0  (autoclean)
    cdrom                  33728   0  (autoclean) [ide-cd]
    parport_pc             19076   1  (autoclean)
    lp                      8996   0  (autoclean)
    parport                37056   1  (autoclean) [parport_pc lp]
    autofs                 13268   0  (autoclean) (unused)
    via-rhine              15856   1
    mii                     3976   0  [via-rhine]
    microcode               4668   0  (autoclean)
    ohci1394               20136   0  (unused)
    ieee1394               48780   0  [ohci1394]
    nls_iso8859-1           3516   1  (autoclean)
    nls_cp437               5116   1  (autoclean)
    vfat                   13004   1  (autoclean)
    fat                    38808   0  (autoclean) [vfat]
    keybdev                 2944   0  (unused)
    mousedev                5492   1
    hid                    22148   0  (unused)
    input                   5856   0  [keybdev mousedev hid]
    usb-uhci               26348   0  (unused)
    ehci-hcd               19976   0  (unused)
    usbcore                78816   1  [hid usb-uhci ehci-hcd]
    ext3                   70784   2
    jbd                    51892   2  [ext3]

As for the /var/log/messages file, it may very well have had useful information after the crash, but by the time I got your email this morning, the syslogd had restarted overnight and /var/log/messages was empty.

As for your lack of access to S3 Savage hardware, I purchased this computer from walmart.com. I did so because they allow you to buy semi-custom computers without a M$ operating system. Therefore, I suspect there are many others in the Linux community that also have this Via motherboard, which has onboard video in the form of S3 Savage. Perhaps redhat could get a free motherboard from Via or Walmart? Just a thought.

OK... I'll stop bugging you until I've repeated the problem. ;)

Actually, this seems like a persistent problem one of my people is seeing on a radeon VE used with 2 monitors (hence xinerama used). It seems to be triggered by xscreensaver and I strongly suspect it is 3d; in fact, after turning off glx he can't get it to crash anymore. Symptoms are very similar: leave the desk and quite often come back to a logged out state. The only thing logged in the XFree86.0.log file is signal 11.
Mysteriously, it happens often when logging on via gdm and takes a lot longer to happen when using startx.

Now, I know this is pretty much useless for debugging but I'd like some pointers on how to do it effectively. I am currently trying to trigger the bug on another machine after setting ulimit to unlimited for crash dumps, as a common user. Any hope something will get dumped if I can manage to trigger it? Any other suggestions on how to do it properly, like flags to tell xfree to dump core or anything like it? I am far from afraid to dig into X source code or anything to help, just that a beast like X is not as easy as some other software to debug (e.g. can't imagine how to run it under gdb?).... May I s

I should say - tried 4.3.0-10 with no improvement. Would using the -12 (that seems to have debugging symbols) help in debugging? btw, thanks for great work Mike

I can also confirm that this happens on two sets of hardware I have running. Both exit with signal 11 randomly. One is a p4 with a matrox g450 dual headed card running xinerama and the other a p4 with an Nvidia tnt card. Both are running kde with gdm login and both have screensavers enabled. On the Matrox dual headed machine I can successfully get it to crash by running the molecule opengl screensavers, however I cannot repeat this on the Nvidia card. As a matter of fact, I have had the Nvidia machine crash while I was working on it - running xmms, mozilla, terminal, kde, kmix and several applets. Both times nothing is logged in either /var/log/messages or /var/log/XFree86.0.log.old except the following:

    (**) Mouse0: ZAxisMapping: buttons 4 and 5
    (**) Mouse0: Buttons: 5
    (II) Keyboard "Keyboard0" handled by legacy driver
    (II) XINPUT: Adding extended input device "Mouse0" (type: MOUSE)
    (II) Mouse0: ps2EnableDataReporting: succeeded

       *** If unresolved symbols were reported above, they might not
       *** be the reason for the server aborting.

    Fatal server error:
    Caught signal 11.  Server aborting

    When reporting a problem related to a server crash, please send
    the full server output, not just the last messages.
    This can be found in the log file "/var/log/XFree86.0.log".
    Please report problems to xfree86.

I can provide more log files, just let me know what to provide. Also, I am going to disable the Xscreensaver completely on both and see if that helps any, as I suspect it might.

One of my users has been plagued by this bug. About one X crash a week for the past 2-3 months. I finally got lucky and saw the "bufferglEnable(GL_STENCIL_TEST) but no stencil" message which led me here. Like everyone else, I suspect that XScreenSaver is the trigger (since it always fails when nobody is around).

I'm running a fairly standard hardware/software configuration: Intel motherboard, dual P4 (with HT enabled), NVidia GeForce2 graphics, RH stock kernel, default XF86Config file. My lsmod is even pretty generic:

    # lsmod
    Module                  Size  Used by    Not tainted
    es1371                 34952   0  (autoclean)
    gameport                3508   0  (autoclean) [es1371]
    ac97_codec             14696   0  (autoclean) [es1371]
    soundcore               7044   4  (autoclean) [es1371]
    ide-cd                 35808   0  (autoclean)
    cdrom                  34176   0  (autoclean) [ide-cd]
    parport_pc             19204   1  (autoclean)
    lp                      9188   0  (autoclean)
    parport                39072   1  (autoclean) [parport_pc lp]
    nfsd                   81104   8  (autoclean)
    lockd                  59536   1  (autoclean) [nfsd]
    sunrpc                 87516   1  (autoclean) [nfsd lockd]
    e100                   56356   1
    ipt_REJECT              3992   2  (autoclean)
    ipt_LOG                 4280   1  (autoclean)
    ipt_limit               1688   5  (autoclean)
    ipt_mac                 1208  16  (autoclean)
    ipt_state               1080   1  (autoclean)
    ip_conntrack           29896   1  (autoclean) [ipt_state]
    iptable_filter          2412   1  (autoclean)
    ip_tables              15864   6  [ipt_REJECT ipt_LOG ipt_limit ipt_mac ipt_state iptable_filter]
    loop                   12888   0  (autoclean)
    keybdev                 2976   0  (unused)
    mousedev                5688   1
    hid                    22404   0  (unused)
    input                   6208   0  [keybdev mousedev hid]
    usb-uhci               27468   0  (unused)
    ehci-hcd               20584   0  (unused)
    usbcore                82816   1  [hid usb-uhci ehci-hcd]
    ext3                   73376  11
    jbd                    56368  11  [ext3]
    lvm-mod                64544  21
    3w-xxxx                40128   3
    sd_mod                 13452   6
    scsi_mod              110872   2  [3w-xxxx sd_mod]

Here are the potentially-relevant lines from /var/log/messages. To make it easier to follow, the name/IP of the machine are astro/130.126.8.170 and the affected user is shapiro. He reports it was hung when he returned to his office at 9:13. He fixed it via a <Ctrl><Alt>-<Backspace> at 9:15.

    Jan 13 12:28:03 astro ypserv[805]: refused connect from 130.126.8.170:36784 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 13 13:26:12 astro ypserv[805]: refused connect from 130.126.8.170:36784 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 13 13:31:55 astro ypserv[805]: refused connect from 130.126.8.170:36807 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 13 13:54:52 astro ypserv[805]: refused connect from 130.126.8.170:36821 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 13 15:23:03 astro ypserv[805]: refused connect from 130.126.8.170:36838 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 13 17:18:49 astro ypserv[805]: refused connect from 130.126.8.170:36863 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 13 17:59:02 astro ypserv[805]: refused connect from 130.126.8.170:36863 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 13 23:26:14 astro ypserv[805]: refused connect from 130.126.8.170:36894 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 14 04:06:25 ontario kernel: nfs: server astro not responding, still trying
    Jan 14 04:06:27 ontario kernel: nfs: server astro OK
    Jan 14 08:44:36 astro ypserv[805]: refused connect from 130.126.8.170:36906 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 14 09:12:13 astro ypserv[805]: refused connect from 130.126.8.170:36928 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 14 09:15:29 astro gdm(pam_unix)[1145]: session closed for user shapiro
    Jan 14 09:15:30 astro gdm[1145]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
    Jan 14 09:15:30 astro modprobe: modprobe: Can't locate module char-major-10-134
    Jan 14 09:15:42 astro gdm(pam_unix)[15106]: session opened for user shapiro by (uid=0)
    Jan 14 09:15:45 astro ypserv[805]: refused connect from 130.126.8.170:36928 to procedure ypproc_match (astro-theory,shadow.byname;-1)
    Jan 14 09:15:47 astro kernel: cdrom: This disc doesn't have any tracks I recognize!

Interestingly, I have 8 identical machines, and only 1 causes this problem. Maybe other users just enable different screensavers. Please let me know if there's any other information I can provide.

Created attachment 96987
XFree86.0.log
Log file showing the problem. Note the large section of repeated
"bufferglEnable(GL_STENCIL_TEST) but no stencil" errors.

Thought I'd provide a little more insight here: Unchecking the "Power Management Enabled" box in XScreenSaver (or setting "dpmsEnabled: False" in the ~/.xscreensaver) caused the problem to go away for my user. I tried testing with my own account by enabling DPMS support, but was unable to come up with a reliable testcase to reproduce the problem. Still, it would be interesting to find out if others experiencing the problem have power management enabled in their screensavers.

Mike, should I open my findings as a new bug?

Since this bugzilla report was filed, there have been several major updates to the X Window System, which may resolve this issue. Users who have experienced this problem are encouraged to upgrade to the latest version of Fedora Core, which can be obtained from: http://fedora.redhat.com

If this issue turns out to still be reproducible in the latest version of Fedora Core, please file a bug report in the X.Org bugzilla located at http://bugs.freedesktop.org in the "xorg" component.

Once you've filed your bug report to X.Org, if you paste the new bug URL here, Red Hat will continue to track the issue in the centralized X.Org bug tracker, and will review any bug fixes that become available for consideration in future updates.