Bug 67849
Description
jimomura
2002-07-02 22:55:35 UTC
Created attachment 63425 [details]
startx stderr output and notes
Created attachment 64226 [details]
More startx stderr output captures (3 sessions where console screen pointers were corrupted)
I guess it is an obvious comment, but at this point the problem looks more likely to be a bad library where the "Gtk" calls are sitting. "Gnome toolkit" or something like that? Still, if this is correct, I am surprised that whatever is happening is able to affect the console pointers. Would it be bad stack handling?

GNOME/KDE/GTK have absolutely nothing to do with the Linux text console at all. It is not entirely clear to me what you mean by "console screen pointers". If you are referring to the Linux virtual text consoles (ie: not part of XFree86), are you referring to the text mode text cursor? Or are you referring to the gpm mouse cursor? In either case there aren't multiple pointers at the console so your statements are confusing.

Are you using Ximian GNOME? Please attach your full X server log, and X config file as separate file attachments using the link below after starting up X. If possible can you also attach a screenshot, or digital picture of the screen to the bug report.

The best guess I've got as to what the problem you are experiencing might be, is the video mode and console parameters not being restored correctly by the video driver. Again, this would have nothing to do with GNOME or GTK et al. if it is the case. Please provide more details.

+GNOME/KDE/GTK have absolutely nothing to do with the Linux text console at all.
+It is not entirely clear to me what you mean by "console screen pointers".

Gee, and here I thought I had been so clear. :-) Now that I think about it, I guess I have a huge problem with ambiguity in this situation. Moreover, my first guesses about the programs were probably wrong. First, what I meant by "console screen pointers": I was referring to pointer variables in the programming sense, pointers to structures or other indirect data accesses. Now that I think about it, it is unlikely that there were pointers being trashed. More likely it was actual component variables.

+If you are referring to the Linux virtual text consoles (ie: not part of
+XFree86), are you referring to the text mode text cursor?

Yes, being an old person, I mean text cursor by default. I will almost always say "mouse cursor" if I mean the other one. :-)

+ Or are you referring to the gpm mouse cursor?

No. That is "the other one." :-)

+In either case there aren't multiple pointers at the console so your
+statements are confusing.

As I say, my first assumptions about the data variables were probably wrong. It appears that "Y" components are being trashed. Actually, "X" and "Y" both might be trashed at some point and it may be that the "X" components are simply restored by the action of the display process. In any case, what seems to be corrupted are:

1. A variable in the screen buffer which is causing "old data" to be scrolled up entering from the bottom of the screen (instead of "empty screen"). Actually, this could be a pointer to the start of a new line which should be erased before it is used. Oh this is pointless. I am guessing about the programs because I do not have time to unpack them. If I get the time later I will. Hopefully you guys are faster than that. :-)

2. The "Y" offset of the text cursor, making the keyboard echo be displayed on a different line than the text cursor, if it is visible at all.

There may be more data being trashed but that is all I can tell you.

I'll try to illustrate this. In the following simulation the letter "v" is used to indicate screen display data (picture in your mind a cat of the screen data you see when you're booting Linux -- essentially dmesg). This stuff is scrolling up from the bottom.
The "c" is the location of the cursor and "e" the display of keyboard echo of what I'm typing:

before I start typing:

vvvvvvvvvvvvvvvvvvvv
vvvvvvvvvvvvvvvvvvvv
vvvvvvvvvvvvvvvvvvvv
vvvvvvvvvvvvvvvvvvvv
vvvcvvvvvvvvvvvvvvvv

after I finish typing:

vvvvvvvvvvvvvvvvvvvv
vvveeeeeevvvvvvvvvvv
vvvvvvvvvvvvvvvvvvvv
vvvvvvvvvvvvvvvvvvvv
vvvvvvvvcvvvvvvvvvvv
        ^

Notice that the screen echo output is lining up with the cursor, but it's not on the same line. That is, if the screen echo shows up. Sometimes it doesn't show up, but the cursor is still visible. In that case I am typing blind. But Linux is still working. I can type "exit" and then re-login and then "shutdown -h now" and it all works. But I might not actually see what I'm typing.

+Are you using Ximian GNOME?

No. I'm using a pure RHL 7.3 installation. I am currently treating this box as a testbed computer for RHL 7.3 and Windows ME. Both OSes are being kept very pure and up-to-date.

+ Please attach your full X server log, and X config file
+as separate file attachments using the link below after
+starting up X.

I will attach the X config file shortly. I'm not sure about the server log. I have to find that one. I think I know where it is. :-)

+If possible can you also attach a screenshot, or digital
+picture of the screen to the bug report.

I will see what I can do about that.

+The best guess I've got as to what the problem you are
+experiencing might be, is the video mode and console
+parameters not being restored correctly by the video
+driver. Again, this would have nothing to do with GNOME
+or GTK et al. if it is the case.

Well no it should not be possible.

+Please provide more details.

I will as best as I can. Unfortunately, I won't have much time to work on this for the coming week or two. Much is piling up around me. I will do my best.

Created attachment 64804 [details]
XFree86 Log as of 2002/07/11
Created attachment 64805 [details]
XF86Config-4 as of 2002/07/11 (aside from my comments it is the file created by anaconda)
Created attachment 64806 [details]
XF86Config as of 2002/07/11 (again, aside from my comments, it is the file created by anaconda)
Created attachment 64807 [details]
boot.log 2002/07/11 (not requested, but I thought I would provide it)
Created attachment 64808 [details]
dmesg 2002/07/11 (also not requested and probably not helpful, but. . . .)
I have posted what information I could for now. Unfortunately, I expect that I will be too busy for the next week to provide more. Most likely I will be in contact again a week from tomorrow.

Aside from the above material (which has some chance of being useful), I can add a couple of things which are unlikely to be related:

1. Recently (I think starting after I installed the latest Kernel 2.4.18-5) I have been getting the following message shortly after starting the computer:

"spurious 8259A interrupt: IRQ7"

I have not gotten around to looking into this, but I would guess it is a serial port device.

2. Cron has been sending me a couple of error messages. One has to do with "tripwire". That message should stop coming because I have just completed installing and initializing "tripwire". The other message reports missing modules.

From root Mon Jul 8 22:14:17 2002
Return-Path: <root>
Received: (from root@localhost)
 by localhost.localdomain (8.11.6/8.11.6) id g68MEH001705
 for root; Mon, 8 Jul 2002 22:14:17 GMT
Date: Mon, 8 Jul 2002 22:14:17 GMT
From: root <root>
Message-Id: <200207082214.g68MEH001705>
To: root
Subject: LogWatch for localhost.localdomain
Status: RO

################## LogWatch 2.6 Begin #####################

--------------------- ModProbe Begin ------------------------

Can't locate these modules:
char-major-13: 4 Time(s)
char-major-45: 1 Time(s)
char-major-30: 5 Time(s)
char-major-81: 2 Time(s)
char-major-82: 1 Time(s)
block-major-25: 1 Time(s)
fb0: 1 Time(s)

---------------------- ModProbe End -------------------------

--------------------- sendmail Begin ------------------------

241 bytes transferred
1 messages sent

---------------------- sendmail End -------------------------

###################### LogWatch End #########################

From root Mon Jul 8 22:16:30 2002
Return-Path: <root>
Received: (from root@localhost)
 by localhost.localdomain (8.11.6/8.11.6) id g68MGUT02537
 for root; Mon, 8 Jul 2002 22:16:30 GMT
Date: Mon, 8 Jul 2002 22:16:30 GMT
Message-Id: <200207082216.g68MGUT02537>
From: root (Anacron)
To: root
Subject: Anacron job 'cron.daily'
Status: RO

/etc/cron.daily/tripwire-check:

**** Error: Tripwire database for localhost.localdomain not found. ****
**** Run /etc/tripwire/twinstall.sh and/or tripwire --init. ****

Your X config file and log file show no problems. X is configured properly, and is starting up properly. I'm not able to reproduce the problem you are describing on any of my Radeon hardware using RHL 7.3. This is probably a hardware issue, or a kernel issue. I suspect the former.

Your last comment posted contains information which has nothing to do with the problem, or with XFree86. They are tripwire logs.

I'm Cc'ing Arjan for comment in case this is a known kernel console issue in our erratum kernel. I suspect it isn't a kernel issue though, and is probably a hardware issue.

Any comments Arjan?

>--- shadow/67849 Thu Jul 11 11:07:43 2002
>+++ shadow/67849.tmp.14776 Sat Jul 20 08:59:07 2002
. . .
>+Summary: Console corruption when VTswitching from XFree86 to console
> Description of Problem:
>@@ -465,3 +466,18 @@
> **** Error: Tripwire database for localhost.localdomain not found. ****
> **** Run /etc/tripwire/twinstall.sh and/or tripwire --init. ****
>+------- Additional comments from mharris 2002-07-20 09:05:05 -------
>+Your X config file and log file show no problems. X is configured properly,
>+and is starting up properly. I'm not able to reproduce the problem you
>+are describing on any of my Radeon hardware using RHL 7.3. This is probably
>+a hardware issue, or a kernel issue. I suspect the former.

If it is driver related, I expect it is a combination problem. Let me provide a bit of history and put this into better perspective:

This box was first assembled around Oct 12, 2001 with a Maxtor 8.7 GB HD, AMD Duron 950 MHz processor and Pioneer DVD-114 (10X) and S3 Savage4 AGP graphics card with Win98 and RHL 7.1.
We can ignore this bit of history mostly and just note that "it was working" except for sound and LAN, which are not the current issue. And that it will still be a few months before I sing Happy Birthday. :-)

Nov. 22, 2001 upgraded motherboard BIOS to 011016L.ROM.

Nov. 23 upgraded to RHL 7.2 (Kernel 2.4.7-10).

Dec. 28, 2001 I installed the ATI Radeon R6 DDR 64MB ViVo card. Kudzu identified it as "ATI Radeon QD".

Jan 14, 2002, reloaded RHL 7.2 from scratch (changed a few choices). NOTE: XF86Config set by Anaconda included an option "nodri".

Around May 27, 2002, I replaced the AMD Duron 950 with an AMD Duron 1.2 GHz. This uses a new core with XP instructions.

June 1, 2002, I replaced the Pioneer DVD with an LG CD-RW/DVD drive. I did not install the scsi emulator at boot time. I used a script to install it as needed.

NOTE: Up to this point I had used the Radeon 64DDR card without problems both under RHL 7.2 (which might not have used AGP functions) and under Windows 98, which definitely did. The only problem I had was with Windows 98, which had a bug relating to screen blanking during "sleep" modes.

June 13, 2002, I replaced the Maxtor 8GB HD with a Maxtor 20GB HD (still a UDMA/ATA).

June 13-14, 2002, I installed Windows ME, which fixed the Win98 screen blanking bug, and installed RHL 7.3. Actually, I installed it many times, having a few problems, including this particular problem.

To summarize this, I have used the Radeon 64 card successfully with the rest of this hardware mix (except for the 20GB HD) with RHL 7.2, and WinME. Win98 had a bug, but that is not likely related to this problem.

So is it a "hardware problem"? Perhaps, but not entirely, and certainly one that should be circumventable via software, since WinME uses this card fully.

One aspect of "hardware" though is the AMD Duron processor. I have heard of problems relating to an anomaly in the instruction set and AGP block addressing.
I do not recall the details, but apparently it is well known among driver writers. It is possible that this is the problem.

I can also add that I have now applied all updates for RHL 7.3 up to July 19, 2002 so it is current, and the problem is still evident.

>+Your last comment posted contains information which has nothing to do
>+with the problem, or with XFree86. They are tripwire logs.

I did not expect that it would. I was just being thorough.

>+I'm Cc'ing Arjan for comment in case this is a known kernel console issue
>+in our erratum kernel. I suspect it isn't a kernel issue though, and is
>+probably a hardware issue.
>+Any comments Arjan?

I have taken pictures of the latest problem sighting. I will look at them later and see if I have anything useful. However, this might take some time. I am very busy lately and while this issue is important to me, I have to get my main work done first.

Created attachment 68124 [details]
First picture, immediately after closing XWindows
Created attachment 68125 [details]
One "CR" later. Text cursor is at the same point but new text appeared without my input.
This is a general update on what has happened.

First, about the two pictures I uploaded:

The pictures were taken on 2002/07/20, after upgrading RHL 7.3 to the 2002/07/19 level (the last was a "mod_ssl" package). The first picture was taken, as far as I remember, immediately after the "logout" from X with no keyboard input. This might be wrong because I was not thinking about taking the picture. It occurred to me just as I was about to shut it down. But I am fairly sure I did not press any keys up to that point.

The second picture is after a single CR. You can see that the cursor is at the same position but all the lines have moved up and there is a new line at the bottom. I did not enter any of the text that is on that line. It is exactly what showed up on its own.

I took more pictures as I went through the "exit" and "login" as "root", then shutdown, but they were all blurred to the point that I thought they were not worth posting. If I think about it, I will try and take pictures again.

2002/07/22 I upgraded the BIOS to the "02/26/2002 S" w/LAN version. I confirmed the main functions still worked under Windows ME (except for the onboard Modem circuitry, which I have never used). I confirmed DirectDraw was working using "dxdiag.exe". I later confirmed that the RHL 7.3 problems were still there -- no effective change.

2002/07/30 I completed applying upgrades to 2002/07/22 (which includes the GCC compiler and glibc updates).

2002/08/02 I noted a comment on the "comp.os.linux.misc" group (or was it ".hardware"?) that I could bypass the DRI module by commenting out "Load dri". I tried this. As far as I can tell from the error messages, the module did not load. This did not help the screen problem either. So either it is not in the "dri" driver module, or it is in another module as well.

Other Notes:

A few weeks ago in the "comp.os.linux.portable" group I noticed that someone reported having the same "spurious 8259A interrupt: IRQ7" message that I reported earlier.
I did not think that it was likely connected with this screen problem. I do not know how busy your staff is (probably very busy :-), but I would suggest that you give that problem priority over this screen problem. It occurred to me that both problems are "new" to 7.3 and were not in 7.2. It might be that there is a compiler or common code problem between them. If, for example, the problem is a rare compiler problem, it could be that the same bad code was created in 2 different modules. The point of putting the effort into the other bug first is that I am guessing that it might be easier to isolate and correct. In that case, it might indicate where this bug is coming from. If it is completely unrelated to this problem, well, it is still a bug to be fixed. . . .

The interrupt report is showing up at random times after booting. I have seen it late during the boot process, or later during or even after login. I do not think I have seen it before "run level 3".

I have the same problem on my system. I am running a fully updated 7.3 redhat (kernel 2.4.18-5 with custom configuration), on a Trinity KT-A motherboard (KT-133a chipset), with a 64 mb ATI radeon video card and an Athlon XP 1700. I do use DRI (Quake 3), but I'm not sure if it is correlated. The problem happens consistently after a few days (or even sooner?) making the text consoles basically unusable. Please let me know if you need any more information.

- Andre

Some additional information I ran across on http://www.xfree86.org/cvs/changes_4_2.html

There is an entry there which reads:

682. Delay before restoring VGA registers for Radeons to "fix" VT switch problems (Kevin Martin).

which sounds like this problem (it also sounds like the "fix" was just a hack).

Just updating the situation:

On Aug. 24, I applied all updates up to Aug. 19, 2002 and then installed the latest Kernel (2.4.18-10). Upon testing I found the screen trashing still occurs. I also noticed that the "spurious . . . interrupt . . ."
is still occurring. In fact, this was the first time I actually noticed it BEFORE "run level 3". It may have happened before, but without my noticing it.

Spurious interrupt is not an XFree86 problem. It is generally a hardware bug.

Wow, this bug report is considerably long. Rather than read it all over again... It came up in a bugzilla search for "savage", so whoever mentioned that word above, please test:

ftp://people.redhat.com/mharris/test-drivers/savage_drv.o

It is a new S3 Savage driver. Fixes all known bugs, cures all known diseases, causes world peace, etc. ;o)

I have tried ftp://people.redhat.com/mharris/test-drivers/radeon-4.2.0-vtswitch-hang/radeon_drv.o for my system and it seems to have fixed the problem. No corruption in the last week. Thanks!

- Andre

See, told you it would fix everything. ;o)

Almost?

First, I should say that I am sorry for taking so long responding. I have been very busy lately.

On Sept. 27, I finally had some time to look at the new driver. I renamed "/usr/X11R6/lib/modules/drivers/radeon_drv.o" to "/.../radeon_drv.old" and put the new driver in the directory and rebooted.

I tried "XBoard" (Chess) briefly and logged out of the X session and everything was OK. I tried "Mahjongg" a few times inconclusively (I will get back to this later). I tried the "Chromium Setup" program and then ran "Chromium." This caused the screen to trash. Yup. It failed.

I rebooted and ran "Chromium" again without the "Setup" program. After logging out of the X session everything was OK.

Unfortunately, the most likely failure has always been if I finished a game of "Mahjongg" (completely emptying the board). I tried for many hours to complete a game. Finally, late Saturday, I completed one game and logged out of the X session. There was no problem.

I *think* that this driver is almost right. There was still the one failure after running both "Chromium Setup" and "Chromium", but so far, that was the only failure.
But a stats professor would be jumping on my head for saying so -- not enough samples to establish a probability.

I think you're describing 2 completely different problems then.

1) A VT switching bug
2) Some other bug causing a crash?

+------- Additional comments from mharris 2002-09-28 21:27:49 -------
+I think you're describing 2 completely different problems then.
+1) A VT switching bug
+2) Some other bug causing a crash?

This has been my assumption since before I opened this bug -- that there was a good chance that this is an interaction between at least a couple of issues, which is why I mentioned libraries in the beginning.

Sample Cases:

1. In order to update RHL 7.3 I download RPM files to my main "media" computer and then burn them to a CD-RW. The resulting CD-RW disc is not readable under Linux because the ISO driver cannot handle the format created by that software (which is not necessarily a bug). So I boot the K7S5A box (which is the box we are discussing with this problem) and copy the files to "C:\data\temp\xxxx". I then reboot the box under RHL 7.3.

Under Linux I have 2 accounts. The "root" account is set up for KDE and the "user" account is set up for Gnome. To transfer the files to Linux, I login as "root", "mount /mnt/c" and then "startx". I then use the "quick browse" (I think that is what it is called) to open "/mnt/c/data/" in one window and then "quick browse" again to open "/home/Storage/RHL7_3/hold/". I then use the KDE "copy" and "paste" functions to copy the file structure from the Windows partition to the Linux side. Then I "log out" of the X session.

I have done this fairly often and I do not recall ever having the screen trashed, no matter how long I took. In some cases I took "a bit of time" getting it done. Note that I have run into other problems with KDE which I have not mentioned, but I think we can ignore that for now.

2.
As "User" (with the Gnome interface) in a couple of cases I started up the Chess program (XBoard) and looked it over, then logged out of X. Sometimes the screen was trashed and sometimes it was not. I have never tried playing a whole game of Chess. (Mahjongg takes me about an hour a game and Chess would take me even longer. I do not really even like Mahjongg when it comes down to it. I am playing it now as a test program.)

3. As "User" (with the Gnome interface) I have run "Chromium" a few times briefly and when exiting, sometimes the screen has been trashed and sometimes it was OK. As mentioned previously, it seems to be trashed if I run the "Chromium Setup" program.

4. I have run programs briefly under "KDE" and under "Gnome" and had the screen trashed fairly often, but not consistently.

Is the problem "time related?" Not predominantly. Some of the screen trashing occurred after brief sessions and sometimes fairly lengthy sessions resulted in a clean log out.

Is it program related? Apparently yes. In fact, it is probably related to specific subroutines. I would guess that it might be a specific message window or dialog window call. I think I mentioned before that it seems less of a problem if I stay within the KDE programs.

Could simply increasing the delay time "fix" the problem? Possibly. Here is a hypothetical:

Assume that the card has an anomaly returning a "ready" signal under certain conditions. The KDE library "fixes" the problem by avoiding the fault generating condition in the first place, and uses the "ready" signal. The "Gnome" library does not avoid the fault generating condition and thus, programs using the library may have a problem. But if you write the driver to delay long enough for the card to react, you do not need the "ready" signal -- so all programs from either group will work. Another fix would be to change the "Gnome" library to operate like the KDE library.
Another fix would be to write the driver to avoid the fault generating conditions.

This is not necessarily what is going on, but this is roughly the range of possibilities that were on my mind when I first posted the bug.

I think I have isolated a better test:

Open "Time Tracking Tool"

If you exit using the "Quit" icon/button, and then immediately log out of X, it trashes.

If you exit using the window closing "X" button (upper right corner) and log out of X, it exits clean.

I wish I had found this one months ago. It would have saved me hours of Mahjongg. :-)

If you post the delay patch and how the delay is calculated, I might have a suggestion -- probably just making it longer.

One other change that should be made to the driver system is that Anaconda is setting the default screen for 24-bit mode in the Config files. I have only looked at consumer level documentation, but as far as I can tell, there is no such thing as a 24-bit hardware mode on this card. The chips seem to only support 16-bit and 32-bit (and 8-bit for pure VGA). The only way to have a 24-bit mode is via driver translation. According to the latest online documentation for XFree86, 32-bit modes can be specified in the Config files. I would guess that the 24-bit setting was used for compatibility with older software that might not have been expecting 32-bit support.

I do not recall if the driver itself accepts the 32-bit setting. If it does not, then it should be changed to accept 32-bit mode. Either way, Anaconda should be changed. It is never a good idea to have bogus data when true data is allowable. In this case it wasted my time because I had to test these settings to find out if it affected the screen trashing. It did not have any effect. But having such bogus data is just about begging for unnecessary problems.

Latest tests:

2002-10-10
- played Gnu Mahjongg, cleared board, ran Help, exited,
- log out of X, screen OK
- applied update RPMs up to Sept.
1, 2002: "krb5", "mailman", "PHP", "scrollkeeper", "ethereal", "PXE"
- continued tests using "Time Tracking Tool":
- "Quit" button did not trash screen
- "closing button" did trash screen (possibly because I did not reboot between attempts -- not sure)
- re-tested 2x for each termination method, making sure I rebooted between each test
- the screen was not trashed when I exited X for either method of exit

Conclusions:

1. Although I have only completely cleared the "Mahjongg" board twice, I believe that it is unlikely that doing so will cause the screen to trash again.

2. I do not believe that the updates had any effect on the tests. I am applying the updates in the recommended order and I am trying to catch up now because there has been a recent update to "glibc" which actually does have some chance of affecting the result -- though it probably will not.

3. The "Time Tracking Tool" tests are still the best test I have for trashing the screen, but I seem to have been wrong about the degree of predictability. Clearly the screen will not be trashed every time I exit with the "Quit" button. I am guessing that the ratio is close to 50%. I do not know what to think about the fact that it trashed once when I exited with the window closing button ("X" in the upper right corner). It is possible that I did not reboot between tests and that there was an interaction caused by a previous test. If so, then exiting with the window closing button is 100% successful. If not, then the percentage of "bad" exits still seems to be lower. It may require around 100 test samples to be certain. I do not feel like doing that much testing, even with this shorter test case.

4. From the above data, I think I can say that the current patch has improved the reliability of the driver. It is possible that increasing the delay slightly will be a sufficient fix. I will try to look into it a bit further before I make a recommendation about how long a delay should be used.

Don't mean to complain...
but could you please make your lines longer in the bug reports? It is hard to read bug reports that are 30 pages long with each line 3 inches wide on my monitor. Just makes it more difficult to assess the problem each time I look at the bug report as I can't fit as much information as possible on my 19 inch monitor.

2002-10-13

Sorry about the formatting, I will try to remember that. I use a number of computers and some have very narrow screens, and I am accustomed to message systems that have auto formatting in the reading phase.

I updated the RPMs to Oct. 10, 2002 which were for: "tar", "nss_ldap", "glibc", "fetchmail", "gv" and "ggv" for Postscript and PDF, "update2" and "rhn_register"

Test "Time Tracking Tool": I repeated 3 x each alternating "closing button" & "Quit" button w/reboot between each test. The screen was not trashed in any of these tests.

Thinking about previous tests, I wondered if recycling X (Gnome) sessions by itself was stable. So I used the following test:

- boot the system, then "cycling X (Gnome)" -- just "startx" & immediately "log out" without running any programs, until screen trashes.

The results were: 3 cycles, 2 cycles, 3 cycles, 2 cycles

Assumptions: None of the updates should have made any difference to the tests, so they can be ignored.

Conclusions:

1. For now, I would ignore the "Time Tracking Tool" program as any indicator of this problem.

2. In the "cycle through X (Gnome)" test, since the number of cycles before an error varies, it does look like a timing problem rather than a coding error.

3. If the current patch is a single "timing loop," my setup is probably near the borderline. A small increase of the loop should be enough. I would guess that if it is a fairly stable time base (like "10 hardware clock ticks of 1/100 sec.") then maybe a 10% increase would do. If the time base is more variable (like "an empty loop up to a number times the hardware clock speed") then I expect a 33% increase would be sufficient.
I have not seen the patch, so these are crude guesses, but I do have some reasons for these particular numbers.

This looks like something I see on all systems running XFree86. When XFree86 is shut down, a few messages take a couple of extra seconds to report to the tty, and they can paste over, or around, your current prompt. The "workaround" is just to issue another "CR" when X (and its friends) are all done messaging you. I don't really see this as a "bug" per se, just a mere annoyance. Your attached pictures back up my assumptions, so if I'm wrong, please let me know.

I'm not sure where you're getting these statistical numbers from, but it seems to me you're just making random guesses as to what is causing the problems that you are seeing. The random data is not really useful in debugging the problem, however.

The only real way to debug this problem is to reproduce it locally with identical video hardware, and then run it in a debugger and single step the problem to reproduce it, possibly taking register snapshots of the card.

Random applications being executed likely have nothing at all to do with any of this, and so such information is rather useless in debugging the problem.

At this point, I'm thinking that it is likely not going to be possible for me to debug this problem because I can't reproduce it on any hardware I've got here. Inability to reproduce hardware related problems generally translates to inability to do anything about said problems. I strongly suggest reporting this problem on XFree86 mailing lists in hopes that someone else out there shares your problem, and hopefully some kind of useful information can be gathered amongst various people with the problem, that can aid someone in determining what exactly is going wrong. At least more developers are aware of the problem then and can comment on it.

As usual, sorry for the delay. Things have been very busy lately.

. . .
> then I expect a 33% increase would be sufficient.
I have not seen the patch, so
> these are crude guesses, but I do have some reasons for these particular numbers.
>+------- Additional comments from hosting 2002-10-15 13:41:55 -------
>+This looks like something I see on all systems running XFree86. When XFree86 is
>+shut down, a few messages take a couple extra seconds to report to the tty, and
>+it can paste over, or around your current prompt. The "workaround" is just to
>+issue another "CR" when X (and its friends) are all done messaging you.

No, I know what you are talking about and that is not what is going on. There is definitely at minimum "a piece of data" being corrupted and hitting CR does not solve it. I stated that above, actually circumspectly in the original posting and again specifically regarding the screen photographs.

>+ I don't really see this as a "bug" per se, as just a mere annoyance.

Again, no. The amount of corruption is not determinable. As such, it has to be considered an unstable system. This is not acceptable for any business computing. I have not stated this before, but the whole point of this computer was to be used for business purposes. It was going to become what you might call my "main" computer. Unfortunately, that never happened. It never achieved acceptable stability. (Ironically, I think this was the first real Red Hat boxed package I have bought since 6.0, but that is another matter. :-)

Moreover, since the main terminal screen is what I would have used for debugging in the first place, it is a bug that in essence defeats its own debugging. Theoretically, this is not quite that bad a problem since I should be able to wire up a terminal out the serial port for debugging, but physically that is quite difficult, due to the arrangement of the workspace.

. . .
> don't really see this as a "bug" per se, as just a mere annoyance.
> Your attached pictures back up my assumptions, so if I'm wrong, please let me know.

No, look at the pictures again. Oh never mind.
:-)

>+------- Additional comments from mharris 2002-10-15 16:40:28 -------
>+I'm not sure where you're getting these statistical numbers from
>+but it seems to me you're just making random guesses as to what
>+is causing the problems that you are seeing. The random data is
>+not really useful in debugging the problem however.

Which statistical numbers? I have reported tests in what I consider to be brief but sufficient detail. If you do not understand them, quote me a passage and I will expand it further.

If you mean the "10%" and "33%" timing increase recommendations, well, I have not seen the patches, so yes, certainly those are only my guesses about what might help based on what I think has been done in your patch(es), based on the discussions I have read so far. As for why I recommended those numbers, I did not feel like taking the "column inches" to say in detail.

In fact, the 10% has to do with the difference in performance between the SiS chipset in the motherboard and previous chipsets (mainly by VIA) when the SiS chipset was new. It was a particularly fast chipset at that time, but general performance differences never exceeded 10%. Thus, 10% should be enough to cover most timing differences resulting from this particular chipset.

The 33% is a bit "softer" in origin, but again is rooted in known performance figures. In this case it is the difference between 100 MHz and 133 MHz which has to do with the fact that I am using a split speed setting. Memory accesses from the CPU to RAM are at 100 MHz, but memory accesses from the graphics card to main memory (through the 4X AGP port) should be at 133 MHz. The speed difference might not be expected in a timing loop calculation -- if it is relevant at all. Again, it is "not much more than a guess" because I do not know what the patch looks like. But it is not just a number picked out of thin air.
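For what it is worth, the arithmetic behind those two figures can be sketched as follows. This is only an illustration of the reasoning in this comment, not the patch itself, which has not been seen here. The function names `margin_percent` and `padded_delay` are hypothetical, and the base delay of 10 ticks is just the example figure mentioned in the 2002-10-13 comment.

```python
def margin_percent(slow_mhz: int, fast_mhz: int) -> int:
    """Percentage by which fast_mhz exceeds slow_mhz, rounded up."""
    extra = (fast_mhz - slow_mhz) * 100
    return (extra + slow_mhz - 1) // slow_mhz  # integer ceiling division


def padded_delay(base_ticks: int, percent: int) -> int:
    """Base delay lengthened by `percent`, rounded up to whole ticks."""
    return (base_ticks * (100 + percent) + 99) // 100


# Hypothetical base delay: 10 hardware clock ticks (example value only,
# the real delay used by the patch is unknown).
BASE_TICKS = 10

chipset_margin = 10                    # general chipset speed difference, in percent
agp_margin = margin_percent(100, 133)  # 100 MHz CPU-to-RAM vs 133 MHz AGP side

print("chipset margin:", padded_delay(BASE_TICKS, chipset_margin), "ticks")
print("AGP margin:", agp_margin, "% ->", padded_delay(BASE_TICKS, agp_margin), "ticks")
```

Integer arithmetic is used so that rounding up to whole ticks is not thrown off by floating point error. On a coarse time base the practical increase ends up larger than the nominal percentage, since a partial tick has to be rounded up to a full one.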
>+The only real way to debug this problem is to reproduce it locally
>+with identical video hardware, and then run it in a debugger and
>+single step the problem to reproduce it, possibly taking register
>+snapshots of the card.

Well, no, that is not the only way to debug it, but I understand the sentiment. :-) In fact, I expect that the type of debugging you have described would not be better than what we are doing right now -- a bit faster, but no more informative. I expect we really need an accurate real-time hardware emulation of the video card. Now, who would have such a thing? Uh, ATI would, would they not? . . . . ;-)

>+Random applications being executed doesn't likely have anything
>+at all to do with any of this, and so such information is rather
>+useless in debugging the problem.

It was not random. It was painstaking and took a lot of my time. Go back and read my postings. Unfortunately, what is needed (if the above mentioned emulator is not available) is a far more thorough set of tests than I have the time to conduct. That was one of the problems I have had -- deriving a repeatable test that was quick enough to repeat enough times to form a useful statistical base. Ironically, we needed more people with the problem so that we could have gotten the statistical data. Since I was the only one working at it, and *I* do not have the time, well, that is what killed the effort.

>+At this point, I'm thinking that it is likely not going to be possible
>+for me to debug this problem because I can't reproduce it on any
>+hardware I've got here. Inability to reproduce hardware related
>+problems, generally translates to inability to do anything about
>+said problems.
I strongly suggest reporting this problem on
>+XFree86 mailing lists in hopes that someone else out there shares
>+your problem, and hopefully some kind of useful information can be
>+gathered amongst various people with the problem, that can aid
>+in someone being able to determine what exactly is going wrong. At
>+least more developers are aware of the problem then and can comment
>+on it.

Deciding where to go to solve the problem was, in itself, a problem. In order to achieve a stable software base I decided to work completely within the "Red Hat system" if possible. That way I could avoid a mixed system. Going to the XFree86 people, I expect that they will (quite reasonably) insist on my first compiling the current XFree86 sources and then reporting what happens. There are two things wrong with this from my point of view.

First, I still have not gotten around to doing any real compiling work under Linux yet. I was an experienced programmer years ago, but I am not looking forward to getting back in harness by facing a hardware/kernel/driver level debugging problem with new tools and an untrustworthy computer doing the compiling.

Second, it will mean that any further debugging of other problems will have to be qualified by the degree to which my system strays from both canonical sources and Red Hat's development stream.

Anyway, thanks for the efforts. I would suggest that the patchwork that you have done so far at least seems to have helped some systems, so you might as well release what you have done. In fact, although I have not used the K7S5A much, I do believe that it has become more stable using the patched driver.

I will eventually try to consolidate the previous postings and notify the XFree86 people (though I would have expected them to have checked out this problem "here" by now). I assume that you will leave this bug open so others might add comments later. Maybe someone will happen along who will fix it.
I have been upgrading the BIOS and Windows drivers on my ECS K7S5A motherboard system and it occurred to me that there was a possibility which I have considered, but which I do not think I actually mentioned. The Windows system uses a separate AGP driver which is specific to the motherboard (or actually the SiS735 chipset). I do not know exactly what is in the AGP driver, but it would seem to me that the Linux setup is likely similar. As such, an equivalent "AGP driver" is probably part of the Kernel. Since this Radeon DDR VIVO card is an AGP card, clearly a problem with an AGP driver could show up as a problem in the graphics systems.

The AGP driver for the SiS 735 chipset has been updated a few times -- the last being around Dec. 2002. However, your Kernel may be based on information as far back as 2001. It would be a good idea to check that out. In fact, I have not upgraded the Linux Kernel since the Aug. 20, 2002 version (there have been 2 upgrades I know of), so it is possible that the problem has been addressed.

As it is, I might not ever know the answer to this problem. I am currently considering swapping the video cards in 2 of my computers. If so, I would probably end up with an older "Number 9" (S3 Savage4) graphics card in this box and move the ATI card to the other box. Assuming the S3 Savage4 card and drivers work better on this motherboard, that may resolve the issue for me. That would be good enough for me, but unfortunately, it would leave the problem for someone else to trip over in the future -- which is not unlikely because the K7S5A has apparently been a particularly popular motherboard, and this video board was not that rare either.

I've just reread all of the material in this bug report and I'm not sure what to tell you.
It's the only bug report I've received of this nature ever, so I presume if it was a common reproducible problem that I'd have received multiple bug reports by now, or heard of similar problems on mailing lists and IRC channels that I frequent.

It is almost certainly IMHO a localized problem with your system, either some bad motherboard component such as the BIOS or chipset, or perhaps even a bad video card. Perhaps APM or something is messing with the video card while X is also trying to control it. I really can't investigate the matter unless I can reproduce it though, and I've never seen this kind of problem on any ATI Radeon hardware before.

Your best bet, if you are not using Red Hat Linux 9 already, is to upgrade to that release, and if the problem still exists for you then try reporting it on the XFree86 mailing list and/or the XFree86 bug tracking database so that more people can see the problem, and perhaps someone else will have other suggestions or feedback.

There isn't much I can do though, so I'm closing this as WORKSFORME because if it is a bug, it can't be fixed unless it can be reproduced and I can't reproduce it and don't know anyone else who can either. Radeon hardware being the most common hardware out there along with Nvidia, if it were a major common problem, I'd almost certainly have heard more by now.

Another suggestion is to try borrowing a different card and see if it happens on that card.

Hope this helps.

Closing WORKSFORME