Bug 67849

Summary:

Console corruption when VTswitching from XFree86 to console

Product:

[Retired] Red Hat Linux

Reporter:

jimomura

Component:

XFree86

Assignee:

Mike A. Harris <mharris>

Status:

CLOSED WORKSFORME

QA Contact:

David Lawrence <dkl>

Severity:

medium

Priority:

medium

Version:

7.3

Hardware:

athlon

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2003-04-15 08:59:37 UTC

Attachments:

Description	Flags
startx stderr output and notes	none
More startx stderr output captures (3 sessions where console screen pointers were corrupted)	none
XFree86 Log as of 2002/07/11	none
XF86Config-4 as of 2002/07/11 (aside from my comments it is the file created by anaconda)	none
XF86Config as of 2002/07/11 (again, aside from my comments, it is the file created by anaconda)	none
boot.log 2002/07/11 (not requested, but I thought I would provide it)	none
dmesg 2002/07/11 (also not requested and probably not helpful, but. . . .)	none
First picture, immediately after closing XWindows	none
One "CR" later. Text cursor is at the same point but new text appeared without my input.	none

Description jimomura 2002-07-02 22:55:35 UTC

Description of Problem:

     After some X sessions, the local console screen
shows spurious data and the "keyboard echo"
is often on the wrong line.

Version-Release number of selected component (if applicable):

 RH Linux 7.3 updated to June 30, 2002
 using 2.4.18-5 Athlon Kernel
 (not sure which XFree86, whichever RHL 7.3 includes,
it was a fresh installation and all upgrades have been
applied, but XF86 has not been upgraded since 7.3's
release)

How Reproducible:

     Fairly often after Gnome/Nautilus sessions.
Not so much after KDE sessions.  Might be
related to Gnome libraries?

Steps to Reproduce:

1.  boot computer to console (text mode) and log in
2. "startx" to start X session (gnome/Nautilus)
3. run various programs (Gnome Mahjongg is an example)
4. close programs and log out of X session

Actual Results:

     The Console screen display is often not
consistent with expectations and in many cases,
essentially unusable.  In many cases it seems to
be displaying "dmesg" output overall, while the
cursor may not be displayed.  "Keyboard echo" output
may be higher than the cursor position (as an
example, in one case the keyboard echo was being
displayed on the 10th line from the top of the
screen while the cursor was on the bottom line,
and each new line was "dmesg" output, or the
equivalent).

Expected Results:

     I would expect the console screen to act as it
does when I log in.

Additional Information:
	
System:
 ECS K7S5A motherboard
 AMD Duron 1.2 GHz CPU (with XP instructions)
 128MB SD-RAM
 ATI Radeon 64MB DDR VIVO AGP graphics card
 RH Linux 7.3 updated to June 30, 2002
 using 2.4.18-5 Athlon Kernel

Comment 1 jimomura 2002-07-02 23:04:19 UTC

Created attachment 63425 [details]
startx stderr output and notes

Comment 2 jimomura 2002-07-08 14:34:16 UTC

Created attachment 64226 [details]
More startx stderr output captures (3 sessions where console screen pointers were corrupted)

Comment 3 jimomura 2002-07-09 00:41:24 UTC

     I guess it is an obvious comment, but at this point the problem
looks like it is more likely a bad library where the "Gtk" calls are
sitting.  "Gnome toolkit" or something like that?

     Still, if this is correct, I am surprised that whatever is happening
is able to affect the console pointers.  Would it be bad stack
handling?

Comment 4 Mike A. Harris 2002-07-10 04:21:08 UTC

GNOME/KDE/GTK have absolutely nothing to do with the Linux text console at
all.

It is not entirely clear to me what you mean by "console screen pointers".
If you are referring to the Linux virtual text consoles (ie: not part of
XFree86), are you referring to the text mode text cursor?  Or are you referring
to the gpm mouse cursor?  In either case there aren't multiple pointers at
the console so your statements are confusing.

Are you using Ximian GNOME?  Please attach your full X server log, and X
config file as separate file attachments using the link below after starting
up X.

If possible can you also attach a screenshot, or digital picture of the
screen to the bug report.

The best guess I've got as to what the problem you are experiencing might
be, is the video mode and console parameters not being restored correctly
by the video driver.  Again, this would have nothing to do with GNOME or
GTK et al. if it is the case.

Please provide more details.

Comment 5 jimomura 2002-07-11 02:09:52 UTC

+GNOME/KDE/GTK have absolutely nothing to do with the Linux text console at all.

+It is not entirely clear to me what you mean by "console screen pointers".

     Gee, and here I thought I had been so clear. :-)  Now that I think
about it, I guess I have a huge problem with ambiguity in this situation.
Moreover, my first guesses about the programs were probably wrong.

     First, what I meant by "console screen pointers":
I was referring to pointer variables in the programming sense,
pointers to structures or other indirect data accesses.  Now
that I think about it, it is unlikely that there were pointers
being trashed.  More likely it was actual component variables.

+If you are referring to the Linux virtual text consoles (ie: not part of
+XFree86), are you referring to the text mode text cursor?

Yes, being an old person, I mean text cursor by default.  I will
almost always say "mouse cursor" if I mean the other one. :-)

+  Or are you referring to the gpm mouse cursor?

No.  That is "the other one." :-)

+In either case there aren't multiple pointers at the console so your
+statements are confusing.

     As I say, my first assumptions about the data variables
were probably wrong.  It appears that "Y" components are
being trashed.  Actually, "X" and "Y" both might be trashed
at some point and it may be that the "X" components are simply
restored by the action of the display process.

     In any case, what seems to be corrupted are:

1.  A variable in the screen buffer which is causing
"old data" to be scrolled up entering from the bottom of
the screen (instead of "empty screen").  Actually, this
could be a pointer to the start of a new line which
should be erased before it is used.  Oh this is pointless.
I am guessing about the programs because I do not have time
to unpack them.  If I get the time later I will.  Hopefully
you guys are faster than that.
:-)

2.  The "Y" offset of the text cursor, causing the keyboard
echo to be displayed on a different line than the text
cursor, if it is visible at all.

     There may be more data being trashed but that is
all I can tell you.

  I'll try to illustrate this.
In the following simulation the letter "v" is used to indicate
screen display data (picture in your mind a cat of the screen
data you see when you're booting Linux -- essentially dmesg).
This stuff is scrolling up from the bottom.  The "c" is the
location of the cursor and "e" the display of keyboard echo
of what I'm typing:

 before I start typing:

 vvvvvvvvvvvvvvvvvvvv
 vvvvvvvvvvvvvvvvvvvv
 vvvvvvvvvvvvvvvvvvvv
 vvvvvvvvvvvvvvvvvvvv
 vvvcvvvvvvvvvvvvvvvv

 after I finish typing:

 vvvvvvvvvvvvvvvvvvvv
 vvveeeeeevvvvvvvvvvv
 vvvvvvvvvvvvvvvvvvvv
 vvvvvvvvvvvvvvvvvvvv
 vvvvvvvvcvvvvvvvvvvv
         ^
         Notice that the screen echo output is lining up
with the cursor, but it's not on the same line.  That is,
if the screen echo shows up.  Sometimes it doesn't show
up, but the cursor is still visible.  In that case I am
typing blind.  But Linux is still working.  I can type
"exit" and then re-login and then "shutdown -h now" and
it all works.  But I might not actually see what I'm
typing.

+Are you using Ximian GNOME?

No.  I'm using a pure RHL 7.3 installation.  I am
currently treating this box as a testbed computer for
RHL 7.3 and Windows ME.  Both OSes are being kept very
pure and up-to-date.

+  Please attach your full X server log, and X config file
+as separate file attachments using the link below after
+starting up X.

     I will attach the X config file shortly.  I'm not
sure about the server log.  I have to find that one.
I think I know where it is. :-)

+If possible can you also attach a screenshot, or digital
+picture of the screen to the bug report.

     I will see what I can do about that.

+The best guess I've got as to what the problem you are
+experiencing might be, is the video mode and console
+parameters not being restored correctly by the video
+driver.  Again, this would have nothing to do with GNOME
+or GTK et al. if it is the case.

     Well, no, it should not be possible.

+Please provide more details.

     I will as best as I can.  Unfortunately, I won't have
much time to work on this for the coming week or two.
Much is piling up around me.  I will do my best.

Comment 7 jimomura 2002-07-11 14:48:32 UTC

Created attachment 64804 [details]
XFree86 Log as of 2002/07/11

Comment 8 jimomura 2002-07-11 14:51:32 UTC

Created attachment 64805 [details]
XF86Config-4 as of 2002/07/11 (aside from my comments it is the file created by anaconda)

Comment 9 jimomura 2002-07-11 14:54:07 UTC

Created attachment 64806 [details]
XF86Config as of 2002/07/11 (again, aside from my comments, it is the file created by anaconda)

Comment 10 jimomura 2002-07-11 14:57:31 UTC

Created attachment 64807 [details]
boot.log 2002/07/11 (not requested, but I thought I would provide it)

Comment 11 jimomura 2002-07-11 14:59:40 UTC

Created attachment 64808 [details]
dmesg 2002/07/11 (also not requested and probably not helpful, but. . . .)

Comment 12 jimomura 2002-07-11 15:12:56 UTC

     I have posted what information I could for now.  Unfortunately, I expect
that I will be too busy for the next week to provide more.  Most likely I
will be in contact again a week from tomorrow.

     Aside from the above material (which has some chance of being
useful), I can add a couple of things which are unlikely to be related:

1.  Recently (I think starting after I installed the latest Kernel 2.4.18-5)
I have been getting the following message, shortly after starting the
computer:

"spurious 8259A interrupt: IRQ7"

I have not gotten around to looking into this, but I would guess it
is a serial port device.

2.  Cron has been sending me a couple of error messages.  One
has to do with "tripwire".  That message should stop coming
because I have just completed installing and initializing
"tripwire".  The other message reports missing modules.


From root  Mon Jul  8 22:14:17 2002
Return-Path: <root>
Received: (from root@localhost)
        by localhost.localdomain (8.11.6/8.11.6) id g68MEH001705
        for root; Mon, 8 Jul 2002 22:14:17 GMT
Date: Mon, 8 Jul 2002 22:14:17 GMT
From: root <root>
Message-Id: <200207082214.g68MEH001705>
To: root
Subject: LogWatch for localhost.localdomain
Status: RO



 ################## LogWatch 2.6 Begin ##################### 

 --------------------- ModProbe Begin ------------------------ 

Can't locate these modules:
   char-major-13: 4 Time(s)
   char-major-45: 1 Time(s)
   char-major-30: 5 Time(s)
   char-major-81: 2 Time(s)
   char-major-82: 1 Time(s)
   block-major-25: 1 Time(s)
   fb0: 1 Time(s)


 ---------------------- ModProbe End ------------------------- 



 --------------------- sendmail Begin ------------------------ 

241 bytes transferred
1 messages sent
 ---------------------- sendmail End ------------------------- 


 ###################### LogWatch End ######################### 

From root  Mon Jul  8 22:16:30 2002
Return-Path: <root>
Received: (from root@localhost)
        by localhost.localdomain (8.11.6/8.11.6) id g68MGUT02537
        for root; Mon, 8 Jul 2002 22:16:30 GMT
Date: Mon, 8 Jul 2002 22:16:30 GMT
Message-Id: <200207082216.g68MGUT02537>
From: root (Anacron)
To: root
Subject: Anacron job 'cron.daily'
Status: RO

/etc/cron.daily/tripwire-check:

****    Error: Tripwire database for localhost.localdomain not found.    ****
**** Run /etc/tripwire/twinstall.sh and/or tripwire --init. ****

Comment 13 Mike A. Harris 2002-07-20 13:05:05 UTC

Your X config file and log file show no problems.  X is configured properly,
and is starting up properly.  I'm not able to reproduce the problem you
are describing on any of my Radeon hardware using RHL 7.3.  This is probably
a hardware issue, or a kernel issue.  I suspect the former.

Your last comment posted contains information which has nothing to do
with the problem, or with XFree86.  They are tripwire logs.

I'm Cc'ing Arjan for comment in case this is a known kernel console issue
in our erratum kernel.  I suspect it isn't a kernel issue though, and is
probably a hardware issue.

Any comments Arjan?

Comment 14 jimomura 2002-07-21 22:31:33 UTC

>+------- Additional comments from mharris 2002-07-20 09:05:05 -------
>+Your X config file and log file show no problems.  X is configured properly,
>+and is starting up properly.  I'm not able to reproduce the problem you
>+are describing on any of my Radeon hardware using RHL 7.3.  This is probably
>+a hardware issue, or a kernel issue.  I suspect the former.

     If it is driver related, I expect it is a combination
problem.  Let me provide a bit of history and put this into
better perspective:

This box was first assembled on around Oct 12, 2001
with a Maxtor 8.7 GB HD, AMD Duron 950 MHz processor
and Pioneer DVD-114 (10X) and S3 Savage4 AGP graphics
card with Win98 and RHL 7.1.  We can ignore this bit
of history mostly and just note that "it was working"
except for sound and LAN, which are not the current
issue.  And that it will still be a few months
before I sing Happy Birthday. :-)

Nov. 22, 2001 upgraded motherboard BIOS to 011016L.ROM.

Nov. 23 upgraded to RHL 7.2 (Kernel 2.4.7-10).

Dec. 28, 2001 I installed the ATI Radeon R6
DDR 64MB ViVo card.  Kudzu identified it as
"ATI Radeon QD".

Jan 14, 2002, reloaded RHL 7.2 from scratch
(changed a few choices).  NOTE:  XF86Config
set by Anaconda included an option "nodri".

Around May 27, 2002, I replaced the AMD
Duron 950 with an AMD Duron 1.2 GHz.  This
uses a new core with XP instructions.

June 1, 2002, I replaced the Pioneer DVD
with an LG CD-RW/DVD drive.  I did not
install the scsi emulator at boot time.
I used a script to install it as needed.

NOTE:  Up to this point I had used the Radeon
64DDR card without problems both under RHL 7.2
(which might not have used AGP functions)
and under Windows 98 which definitely did.
The only problem I had was with Windows 98
which had a bug relating to screen blanking
during "sleep" modes.

June 13, 2002, I replaced the Maxtor 8GB HD
with a Maxtor 20GB HD (still a UDMA/ATA).

June 13-14, 2002, I installed Windows ME,
which fixed the Win98 screen blanking bug,
and installed RHL 7.3.  Actually, I installed
it many times, having a few problems, including
this particular problem.

To summarize this, I have used the Radeon 64
card successfully with the rest of this
hardware mix (except for the 20GB HD) with
RHL 7.2, and WinME.  Win98 had a bug, but
that is not likely related to this problem.

So is it a "hardware problem"?  Perhaps, but
not entirely, and certainly one that should
be circumventable via software, since WinME
uses this card fully.

One aspect of "hardware" though is the
AMD Duron processor.  I have heard of
problems relating to an anomaly in the
instruction set and AGP block addressing.
I do not recall the details, but apparently
it is well known among driver writers.
It is possible that this is the problem.

I can also add that I have now applied all
updates for RHL 7.3 up to July 19, 2002
so it is current, and the problem is still
evident.

>+Your last comment posted contains information which has nothing to do
>+with the problem, or with XFree86.  They are tripwire logs.

I did not expect that it would.  I was just
being thorough.

>+I'm Cc'ing Arjan for comment in case this is a known kernel console issue
>+in our erratum kernel.  I suspect it isn't a kernel issue though, and is
>+probably a hardware issue.

>+Any comments Arjan?

I have taken pictures of the latest problem
sighting.  I will look at them later and see
if I have anything useful.  However, this
might take some time.  I am very busy lately
and while this issue is important to me,
I have to get my main work done first.

Comment 15 jimomura 2002-08-01 02:58:07 UTC

Created attachment 68124 [details]
First picture, immediately after closing XWindows

Comment 16 jimomura 2002-08-01 03:06:49 UTC

Created attachment 68125 [details]
One "CR" later.  Text cursor is at the same point but new text appeared without my input.

Comment 17 jimomura 2002-08-05 15:07:27 UTC

This is a general update on what has happened.

First, about the two pictures I uploaded:

The pictures were taken on 2002/07/20, after
upgrading RHL 7.3 to the 2002/07/19 level
(the last was a "mod_ssl" package).

The first picture was taken, as far as I remember
immediately after the "logout" from X with no
keyboard input.  This might be wrong because I
was not thinking about taking the picture.  It
occurred to me just as I was about to shut it
down.  But I am fairly sure I did not press any
keys up to that point.

The second picture is after a single CR.  You
can see that the cursor is at the same position
but all the lines have moved up and there is
a new line at the bottom.  I did not enter any
of the text that is on that line.  It is exactly
what showed up on its own.

I took more pictures as I went through the "exit"
and "login" as "root", then shutdown, but they
were all blurred to the point that I thought they
were not worth posting.  If I think about it, I
will try and take pictures again.

2002/07/22
I upgraded the BIOS to the "02/26/2002 S" w/LAN
version.  I confirmed the main functions still
worked under Windows ME (except for the onboard
Modem circuitry, which I have never used).
I confirmed DirectDraw was working using
"dxdiag.exe".  I later confirmed that the
RHL 7.3 problems were still there -- no effective
change.

2002/07/30
I completed applying upgrades to 2002/07/22
(which includes the GCC compiler and glibc
updates).

2002/08/02
I noted a comment on the "comp.os.linux.misc"
group (or was it ".hardware"?) that I could bypass
the DRI module by commenting out "Load dri".
I tried this.  As far as I can tell from the
error messages, the module did not load.  This
did not help the screen problem either.  So either
it is not in the "dri" driver module, or it is
in another module as well.
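
For anyone repeating this test, the edit amounts to putting a "#" in
front of the Load "dri" line in the "Module" section of the X config
file.  A minimal sketch in Python follows; the function name and the
sample config text are made up for illustration and are not taken from
the attached files.

```python
import re

def disable_module(config_text: str, module: str = "dri") -> str:
    """Comment out every active `Load "<module>"` line in XF86Config-style text."""
    # Match an uncommented Load line, remembering its indentation.
    pattern = re.compile(r'^(\s*)(Load\s+"%s")' % re.escape(module), re.IGNORECASE)
    out = []
    for line in config_text.splitlines():
        # Lines that already start with '#' do not match and are left alone.
        out.append(pattern.sub(r'\1# \2', line))
    return "\n".join(out)

# Hypothetical Module section, only for demonstration.
sample = '''Section "Module"
    Load  "dbe"
    Load  "dri"
    Load  "glx"
EndSection'''

print(disable_module(sample))
```

The X server log can then be checked to see whether the module was
still loaded.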

Other Notes:

A few weeks ago in the "comp.os.linux.portable"
I noticed that someone reported having the same
"spurious 8259A interrupt: IRQ7" message that I
reported earlier.  I did not think that it was
likely connected with this screen problem.  I do
not know how busy your staff is (probably very
busy :-), but I would suggest that you give that
problem a priority over this screen problem.  It
occurred to me that both problems are "new" to
7.3 and were not in 7.2.  It might be that there
is a compiler or common code problem between them.
If, for example the problem is a rare compiler
problem it could be that the same bad code was
created in 2 different modules.  The point of
putting the effort into the other bug first is
that I am guessing that it might be easier to
isolate and correct.  In that case, it might
indicate where this bug is coming from.  If it is
completely unrelated to this problem, well, it
is still a bug to be fixed. . . .

The interrupt report is showing up at random
times after booting.  I have seen it late
during the boot process, or later during or
even after login.  I do not think I have seen
it before "run level 3".

Comment 18 Andre 2002-08-06 17:58:52 UTC

I have the same problem on my system.  I am running a fully updated 7.3 redhat
(kernel 2.4.18-5 with custom configuration), on a Trinity KT-A motherboard
(KT-133a chipset), with a 64 mb ATI radeon video card and an Athlon XP 1700.  I
do use DRI (Quake 3), but I'm not sure if it is correlated.  The problem happens
consistently after a few days (or even sooner?) making the text consoles
basically unusable.  Please let me know if you need any more information.

- Andre

Comment 19 Andre 2002-08-08 00:25:33 UTC

Some additional information I ran across on

http://www.xfree86.org/cvs/changes_4_2.html

There is an entry there which reads:

 682. Delay before restoring VGA registers for Radeons to "fix" VT switch
      problems (Kevin Martin).

which sounds like this problem (it also sounds like the "fix" was just a hack).

Comment 20 jimomura 2002-08-26 02:47:45 UTC

     Just updating the situation:

     On Aug. 24, I applied all updates up to Aug. 19, 2002 and
then installed the latest Kernel (2.4.18-10).  Upon testing
I found the screen trashing still occurs.  I also noticed
that the "spurious . . . interrupt . . ." is still occurring.  In
fact, this was the first time I actually noticed it BEFORE
"run level 3".  It may have happened before, but without
my noticing it.

Comment 21 Mike A. Harris 2002-09-15 09:59:39 UTC

spurious interrupt is not an XFree86 problem.  It is generally a
hardware bug.

Comment 22 Mike A. Harris 2002-09-16 01:03:43 UTC

Wow, this bug report is considerably long.

Rather than read it all over again...  It came up in a bugzilla search
for "savage", so whoever mentioned that word above, please test:

ftp://people.redhat.com/mharris/test-drivers/savage_drv.o

It is a new S3 Savage driver.   Fixes all known bugs, cures all known
diseases, causes world peace, etc.   ;o)

Comment 23 Andre 2002-09-21 18:00:44 UTC

I have tried
ftp://people.redhat.com/mharris/test-drivers/radeon-4.2.0-vtswitch-hang/radeon_drv.o
for my system and it seems to have fixed the problem.  No corruption in the last
week.  Thanks! - Andre

Comment 24 Mike A. Harris 2002-09-21 18:50:39 UTC

See, told you it would fix everything.  ;o)

Comment 25 jimomura 2002-09-28 23:02:41 UTC

Almost?

First, I should say that I am sorry for taking so long
responding.  I have been very busy lately.

On Sept. 27, I finally had some time to look at the
new driver.  I renamed
 "/usr/X11R6/lib/modules/drivers/radeon_drv.o"
to "/.../radeon_drv.old" and put the new driver in
the directory and rebooted.
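
The swap described above (keep the shipped driver under a ".old" name,
then drop the test driver in its place) can be sketched as follows.
This is only an illustration: the function name is hypothetical, and
the demo runs in a temporary directory rather than touching
/usr/X11R6/lib/modules/drivers.

```python
import shutil
import tempfile
from pathlib import Path

def swap_driver(module_dir: Path, new_driver: Path, name: str = "radeon_drv.o") -> Path:
    """Keep the current driver as <stem>.old, then copy the test driver in."""
    current = module_dir / name
    backup = current.with_suffix(".old")   # radeon_drv.o -> radeon_drv.old
    if current.exists():
        current.rename(backup)             # preserve the shipped driver
    shutil.copy2(new_driver, current)      # install the test driver under the usual name
    return backup

# Demo against a throwaway directory standing in for the real module directory.
with tempfile.TemporaryDirectory() as tmp:
    drivers = Path(tmp) / "drivers"
    drivers.mkdir()
    (drivers / "radeon_drv.o").write_text("shipped driver")
    downloaded = Path(tmp) / "downloaded_radeon_drv.o"
    downloaded.write_text("test driver")
    backup = swap_driver(drivers, downloaded)
    print(sorted(p.name for p in drivers.iterdir()), "backup:", backup.name)
```

Renaming the backup to radeon_drv.o again restores the original driver.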

I tried "XBoard" (Chess) briefly and logged out of
the X session and everything was OK.  I tried
"Mahjongg" a few times inconclusively (I will get
back to this later).  I tried the "Chromium Setup"
program and then ran "Chromium."  This caused the
screen to trash.  Yup.  It failed.

I rebooted and ran "Chromium" again without the
"Setup" program.  After logging out of the X session
everything was OK.

Unfortunately, the most likely failure has always been
if I finished a game of "Mahjongg" (completely
emptying the board).  I tried for many hours to
complete a game.  Finally, late Saturday, I completed
one game and logged out of the X session.  There was
no problem.

I *think* that this driver is almost right.  There was
still the one failure after running both "Chromium
Setup" and "Chromium", but so far, that was the only
failure.  But a stats professor would be jumping on
my head for saying so -- not enough samples to establish
a probability.

Comment 26 Mike A. Harris 2002-09-29 01:27:49 UTC

I think you're describing 2 completely different problems then.
1) A VT switching bug
2) Some other bug causing a crash?

Comment 27 jimomura 2002-10-01 14:59:02 UTC

+------- Additional comments from mharris 2002-09-28 21:27:49 -------
+I think you're describing 2 completely different problems then.
+1) A VT switching bug
+2) Some other bug causing a crash?

     This has been my assumption since before I opened
this bug -- that there was a good chance that this is
an interaction between at least a couple of issues,
which is why I mentioned libraries in the beginning.

Sample Cases:

1.  In order to update RHL 7.3 I download RPM files to
    my main "media" computer and then burn it to a
CD-RW.  The resulting CD-RW disc is not readable under
Linux because the ISO driver cannot handle the format
created by that software (which is not necessarily a
bug).  So I boot the K7S5A box (which is the box we are
discussing with this problem) and copy the files to
"C:\data\temp\xxxx".  I then reboot the box under RHL 7.3.

Under Linux I have 2 accounts.  The "root" account is
set up for KDE and the "user" account is set up for
Gnome.  To transfer the files to Linux, I login as
"root", "mount /mnt/c" and then "startx".  I then use
the "quick browse" (I think that is what it is called)
to open "/mnt/c/data/" in one window and then "quick
browse" again to open "/home/Storage/RHL7_3/hold/".
I then use the KDE "copy" and "paste" functions to
copy the file structure from the Windows partition
to Linux side.  Then I "log out" of the X session.

I have done this fairly often and I do not recall ever
having the screen trashed, no matter how long I took.
In some cases I took "a bit of time" getting it done.
Note that I have run into other problems with KDE which
I have not mentioned, but I think we can ignore that
for now.

2.  As "User" (with the Gnome interface) in a couple of
    cases I started up the Chess program (XBoard) and
looked it over, then logged out of X.  Sometimes the
screen was trashed and sometimes it was not.  I have
never tried playing a whole game of Chess (Mahjongg takes
me about an hour a game and Chess would take me even
longer.  I do not really even like Mahjongg when it
comes down to it.  I am playing it now as a test
program.)

3.  As "User" (with the Gnome interface) I have run
    "Chromium" a few times briefly and when exiting,
sometimes the screen has been trashed and sometimes
it was ok.  As mentioned previously, it seems to be
trashed if I run the "Chromium Setup" program.

4.  I have run programs briefly under "KDE" and
    under "Gnome" and had the screen trashed fairly
often, but not consistently.

Is the problem "time related?"

     Not predominantly.  Some of the screen trashing
occurred after brief sessions and sometimes fairly
lengthy sessions resulted in a clean log out.

Is it program related?

     Apparently yes.  In fact, it is probably
related to specific subroutines.  I would guess
that it might be a specific message window or
dialog window call.  I think I mentioned before
that it seems less of a problem if I stay within
the KDE programs.

Could simply increasing the delay time "fix" the
problem?

     Possibly.  Here is a hypothetical:

Assume that the card has an anomaly returning
a "ready" signal under certain conditions.
The KDE library "fixes" the problem by avoiding
the fault generating condition in the first
place, and uses the "ready" signal.  The
"Gnome" library does not avoid the fault
generating condition and thus, programs using
the library may have a problem.  But if you
write the driver to delay long enough for the
card to react, you do not need the "ready"
signal -- so all programs from either group
will work.  Another fix would be to change
the "Gnome" library to operate like the KDE
library.  Another fix would be to write the
driver to avoid the fault generating conditions.

This is not necessarily what is going on, but
this is roughly the range of possibilities that
were on my mind when I first posted the bug.
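
A minimal sketch of the hypothetical above, as a toy simulation only.
FakeCard, its tick counter and both switch functions are invented for
illustration; none of them correspond to the real driver or to the
patch, which has not been posted here.

```python
class FakeCard:
    """Hypothetical stand-in for the video card; not a real driver interface."""

    def __init__(self, settle_ticks, ready_glitch=False):
        self.settle_ticks = settle_ticks    # ticks the card needs to settle
        self.ready_glitch = ready_glitch    # True: "ready" is reported too early
        self.elapsed = 0

    def start_switch(self):
        self.elapsed = 0

    def wait(self, ticks):
        self.elapsed += ticks

    def is_settled(self):
        return self.elapsed >= self.settle_ticks

    def ready(self):
        # With the glitch, "ready" is asserted whether or not the card settled.
        return self.ready_glitch or self.is_settled()


def switch_trusting_ready(card):
    """Poll the "ready" signal, then report whether the card had really settled."""
    card.start_switch()
    while not card.ready():
        card.wait(1)
    return card.is_settled()


def switch_with_fixed_delay(card, delay_ticks):
    """Ignore "ready" and simply wait a fixed number of ticks."""
    card.start_switch()
    card.wait(delay_ticks)
    return card.is_settled()
```

In this toy model a caller that trusts a glitchy "ready" proceeds too
early, while a fixed delay longer than the settle time works for every
caller, and a delay that is too short fails for every caller.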

Comment 28 jimomura 2002-10-04 14:41:02 UTC

I think I have isolated a better test:

Open "Time Tracking Tool"

If you exit using the "Quit" icon/button, and
then immediately log out of X, it trashes.

If you exit using the window closing "X"
button (upper right corner) and log out of
X, it exits clean.

I wish I had found this one months ago.
It would have saved me hours of Mahjongg. :-)

If you post the delay patch and how the
delay is calculated, I might have a
suggestion -- probably just making it longer.

One other change that should be made to the
driver system is that Anaconda is setting
the default screen for 24-bit mode in the
Config files.  I have only looked at consumer
level documentation, but as far as I can tell,
there is no such thing as a 24-bit hardware
mode on this card.  The chips seem to only
support 16-bit and 32-bit (and 8-bit for pure
VGA).  The only way to have a 24-bit mode is
via driver translation.  According to the
latest online documentation for XFree86,
32-bit modes can be specified in the Config
files.  I would guess that the 24-bit setting
was used for compatibility with older software
that might not have been expecting 32-bit
support.

     I do not recall if the driver itself
accepts the 32-bit setting.  If it does not,
then it should be changed to accept 32-bit
mode.  Either way, Anaconda should be changed.
It is never a good idea to have bogus data
when true data is allowable.  In this case it
wasted my time because I had to test these
settings to find out if it affected the screen
trashing.  It did not have any effect.  But
having such bogus data is just about begging
for unnecessary problems.

Comment 29 jimomura 2002-10-10 19:36:21 UTC

Latest tests:

2002-10-10
- played Gnu Mahjongg, cleared board, ran Help, exited,
- log out of X, screen OK

- applied update RPMs up to Sept. 1, 2002:
"krb5", "mailman", "PHP", "scrollkeeper", "ethereal", "PXE"

- continued tests using "Time Tracking Tool":
- "Quit" button did not trash screen
- "closing button" did trash screen (possibly because
I did not reboot between attempts -- not sure)
- re-tested 2x for each termination method,
making sure I rebooted between each test
- the screen was not trashed when I exited X
for either method of exit

Conclusions:

1.  Although I have only completely cleared
    the "Mahjongg" board twice, I believe that
it is unlikely that doing so will cause the
screen to trash again.

2.  I do not believe that the updates had any
    effect on the tests.  I am applying the
updates in the recommended order and I am trying
to catch up now because there has been a recent
update to "glibc" which actually does have some
chance of affecting the result -- though it
probably will not.

3.  The "Time Tracking Tool" tests are still
    the best test I have for trashing the screen,
but I seem to have been wrong about the degree
of predictability.  Clearly the screen will not
be trashed every time I exit with the "Quit"
button.  I am guessing that the ratio is close
to 50%.  I do not know what to think about the
fact that it trashed once when I exited with the
window closing button ("X" in the upper right
corner).  It is possible that I did not reboot
between tests and that there was an interaction
caused by a previous test.  If so, then exiting
with the window closing button is 100% successful.
If not, then the percentage of "bad" exits
still seems to be lower.  It may require around
100 test samples to be certain.  I do not feel
like doing that much testing, even with this
shorter test case.

4.  From the above data, I think I can say that
    the current patch has improved the reliability
of the driver.  It is possible that increasing the
delay slightly will be a sufficient fix.  I
will try to look into it a bit further before I
make a recommendation about how long a delay
should be used.
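
As a rough check on the "around 100 test samples" figure in item 3,
here is a sketch using the normal approximation to a binomial ratio.
It assumes each test is independent with the same chance of trashing,
which the tests above do not establish; the function names are
illustrative.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% half-width for an observed failure ratio p over n tests."""
    return z * math.sqrt(p * (1.0 - p) / n)


def trials_needed(p, margin, z=1.96):
    """Smallest n whose approximate margin of error is at most `margin`."""
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))
```

Under those assumptions, a ratio near 50% measured over 100 tests is
known to within roughly 10 percentage points either way.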

Comment 30 Mike A. Harris 2002-10-10 19:47:26 UTC

Don't mean to complain... but could you please make your lines longer in
the bug reports?  It is hard to read bug reports that are 30 pages long
with each line 3 inches wide on my monitor.  Just makes it more difficult
to assess the problem each time I look at the bug report as I can't fit
as much information as possible on my 19 inch monitor.

Comment 31 jimomura 2002-10-15 17:21:38 UTC

2002-10-13

Sorry about the formatting, I will try to remember that. I use a number of computers and some
have very narrow screens, and I am accustomed to message systems that have auto formatting in
the reading phase.

I updated the RPMs to Oct. 10, 2002 which were for:

"tar", "nss_ldap", "glibc", "fetchmail", "gv" and "ggv" for Postscript and PDF, "update2" and "rhn_register"

Test "Time Tracking Tool":
I repeated 3 x each alternating "closing button" & "Quit" button w/reboot between each test. The screen
was not trashed in any of these tests.

Thinking about previous tests, I wondered if recycling X (Gnome) sessions by itself was stable. So I used
the following test:

- boot the system, then "cycling X (Gnome)" -- just "startx" & immediately "log out" without running any
programs, until screen trashes. The results were:

3 cycles, 2 cycles, 3 cycles, 2 cycles
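
A rough way to read these four runs, sketched under two assumptions that are not established above: that "3 cycles" means the screen
trashed on the third cycle, and that each cycle fails independently with the same probability. The function names are illustrative.

```python
def per_cycle_failure_estimate(cycles_until_trash):
    """Maximum-likelihood failure probability per cycle (geometric model)."""
    return len(cycles_until_trash) / sum(cycles_until_trash)


def chance_of_clean_run(p_fail, cycles):
    """Probability of `cycles` consecutive clean cycles at failure rate p_fail."""
    return (1.0 - p_fail) ** cycles


observed = [3, 2, 3, 2]                       # cycles until the screen trashed
p_fail = per_cycle_failure_estimate(observed)  # 4 failures in 10 cycles
```

On that model the estimate is about 0.4 per cycle, and three clean cycles in a row would still be expected a bit more than one time
in five, so a short clean streak says little by itself.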

Assumptions:
None of the updates should have made any difference to the tests, so they can be ignored.

Conclusions:
1. For now, I would ignore the "Time Tracking Tool" program as any indicator of this problem.

2. In the "cycle through X (Gnome)" test, since the number of cycles before an error varies, it does look like
a timing problem rather than a coding error.

3. If the current patch is a single "timing loop," my setup is probably near the borderline. A small increase
of the loop should be enough. I would guess that if it is a fairly stable time base (like "10 hardware
clock ticks of 1/100 sec.") then maybe a 10% increase would do.

If the time base is more variable (like "an empty loop up to a number times the hardware clock speed")
then I expect a 33% increase would be sufficient. I have not seen the patch, so these are crude guesses,
but I do have some reasons for these particular numbers.

Comment 32 Jesse Keating 2002-10-15 17:41:55 UTC

This looks like something I see on all systems running XFree86.  When XFree86 is
shut down, a few messages take a couple extra seconds to report to the tty, and
it can paste over, or around your current prompt.  The "workaround" is just to
issue another "CR" when X (and its friends) are all done messaging you.  I
don't really see this as a "bug" per se, as just a mere annoyance.

Your attached pictures back up my assumptions, so if I'm wrong, please let me know.

Comment 33 Mike A. Harris 2002-10-15 20:40:28 UTC

I'm not sure where you're getting these statistical numbers from
but it seems to me you're just making random guesses as to what
is causing the problems that you are seeing.  The random data is
not really useful in debugging the problem however.

The only real way to debug this problem is to reproduce it locally
with identical video hardware, and then run it in a debugger and
single step the problem to reproduce it, possibly taking register
snapshots of the card.

Random applications being executed doesn't likely have anything
at all to do with any of this, and so such information is rather
useless in debugging the problem.

At this point, I'm thinking that it is likely not going to be possible
for me to debug this problem because I can't reproduce it on any
hardware I've got here.  Inability to reproduce hardware related
problems, generally translates to inability to do anything about
said problems.  I strongly suggest reporting this problem on
XFree86 mailing lists in hopes that someone else out there shares
your problem, and hopefully some kind of useful information can be
gathered amongst various people with the problem, that can aid
in someone being able to determine what exactly is going wrong.  At
least more developers are aware of the problem then and can comment
on it.

Comment 34 jimomura 2002-10-27 02:35:52 UTC

     As usual, sorry for the delay.  Things have been very busy lately.

. . .

> then I expect a 33% increase would be sufficient.  I have not seen the patch, so
> these are crude guesses, but I do have some reasons for these particular numbers.


>+------- Additional comments from hosting 2002-10-15 13:41:55 -------
>+This looks like something I see on all systems running XFree86.  When XFree86 is
>+shut down, a few messages take a couple extra seconds to report to the tty, and
>+it can paste over, or around your current prompt.  The "workaround" is just to
>+issue another "CR" when X (and its friends) are all done messaging you.

     No, I know what you are talking about and that is not what is going on.  There
is definitely at minimum "a piece of data" being corrupted and hitting CR does not
solve it.  I stated that above, actually circumspectly in the original posting and
again specifically regarding the screen photographs.

>+  I don't really see this as a "bug" per se, as just a mere annoyance.

     Again, no.  The amount of corruption is not determinable.  As such, it has to
be considered an unstable system.  This is not acceptable for any business
computing.  I have not stated this before, but the whole point of this computer was
to be used for business purposes.  It was going to become what you might call my
"main" computer.  Unfortunately, that never happened.  It never achieved acceptable
stability.  (Ironically, I think this was the first real Red Hat boxed package I
have bought since 6.0, but that is another matter. :-)  Moreover, since the main
terminal screen is what I would have used for debugging in the first place, it is
a bug that in essence defeats its own debugging.  Theoretically, this is not quite
that bad a problem since I should be able to wire up a terminal out the serial port
for debugging, but physically that is quite difficult, due to the arrangement of the
workspace.

. . .

> don't really see this as a "bug" per se, as just a mere annoyance.

> Your attached pictures back up my assumptions, so if I'm wrong, please let me know.

     No, look at the pictures again.  Oh never mind. :-)

>+------- Additional comments from mharris 2002-10-15 16:40:28 -------
>+I'm not sure where you're getting these statistical numbers from
>+but it seems to me you're just making random guesses as to what
>+is causing the problems that you are seeing.  The random data is
>+not really useful in debugging the problem however.

     Which statistical numbers?  I have reported tests in what
I consider to be brief but sufficient detail.  If you do not
understand them, quote me a passage and I will expand it further.

     If you mean the "10%" and "33%" timing increase recommendations,
well, I have not seen the patches, so yes, certainly those are only
my guesses about what might help based on what I think has been done
in your patch(es), based on the discussions I have read so far.  As for
why I recommended those numbers, I did not feel like taking the "column
inches" to say in detail.  In fact, the 10% has to do with the difference
in performance between the SIS chipset in the motherboard and previous
chipsets (mainly by VIA) when the SIS chipset was new.  It was a
particularly fast chipset at that time, but general performance
differences never exceeded 10%.  Thus, 10% should be enough to cover
most timing differences resulting from this particular chipset.

     The 33% is a bit "softer" in origin, but again is rooted in known
performance figures.  In this case it is the difference between 100 MHz
and 133 MHz which has to do with the fact that I am using a split speed
setting.  Memory accesses from the CPU to RAM are at 100 MHz, but memory
accesses from the graphics card to main memory (through the 4X AGP port)
should be at 133 MHz.  The speed difference might not be expected in a
timing loop calculation -- if it is relevant at all.  Again, it
is "not much more than a guess" because I do not know what the patch
looks like.  But it is not just a number picked out of thin air.
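
     The arithmetic behind the two figures can be written out as a
short sketch.  The base delay value is a pure placeholder, since the
patch has not been posted, and both function names are illustrative
rather than anything in the driver.

```python
def clock_ratio_margin(slow_mhz, fast_mhz):
    """Percentage by which the faster clock exceeds the slower one."""
    return (fast_mhz / slow_mhz - 1.0) * 100.0


def scaled_delay(base_delay, margin_percent):
    """Lengthen a delay by the given percentage margin."""
    return base_delay * (1.0 + margin_percent / 100.0)


BASE_DELAY = 10.0   # placeholder units; the real delay in the patch is unknown

split_speed_margin = clock_ratio_margin(100.0, 133.0)   # the "33%" case
chipset_margin = 10.0                                   # the "10%" case
```

     With a placeholder of 10 units, the two suggestions work out to
roughly 11 and 13.3 units respectively.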

>+The only real way to debug this problem is to reproduce it locally
>+with identical video hardware, and then run it in a debugger and
>+single step the problem to reproduce it, possibly taking register
>+snapshots of the card.

     Well, no that is not the only way to debug it, but I understand
the sentiment. :-)  In fact, I expect that the type of debugging
you have described would not be better than what we are doing right
now -- a bit faster, but no more informative.  I expect we really
need an accurate real-time hardware emulation of the video card.  Now,
who would have such a thing?  Uh, ATI would, would they not? . . . .
;-)

>+Random applications being executed doesn't likely have anything
>+at all to do with any of this, and so such information is rather
>+useless in debugging the problem.

     It was not random.  It was painstaking and took a lot of my
time.  Go back and read my postings.  Unfortunately, what is needed
(if the above mentioned emulator is not available) is an even far
more thorough set of tests than I have the time to conduct.  That
was one of the problems I have had -- deriving a repeatable test
that was quick enough to repeat enough times to form a useful
statistical base.  Ironically, we needed more people with the problem
so that we could have gotten the statistical data.  Since I was the
only one working at it, and *I* do not have the time, well, that is
what killed the effort.

>+At this point, I'm thinking that it is likely not going to be possible
>+for me to debug this problem because I can't reproduce it on any
>+hardware I've got here.  Inability to reproduce hardware related
>+problems, generally translates to inability to do anything about
>+said problems.  I strongly suggest reporting this problem on
>+XFree86 mailing lists in hopes that someone else out there shares
>+your problem, and hopefully some kind of useful information can be
>+gathered amongst various people with the problem, that can aid
>+in someone being able to determine what exactly is going wrong.  At
>+least more developers are aware of the problem then and can comment
>+on it. 

     Deciding where to go to solve the problem was, in itself a
problem.  In order to achieve a stable software base I decided to
work completely within the "Red Hat system" if possible.  That way I
could avoid a mixed system.  Going to the XFree86 people, I expect
that they will (quite reasonably) insist on my first compiling the
current XFree86 sources and then report what happens.  There are two
things wrong with this from my point of view.  First, I still have
not gotten around to doing any real compiling work under Linux yet.
I was an experienced programmer years ago, but I am not looking
forward to getting back in harness by facing a hardware/kernel/driver
level debugging problem with new tools and an untrustworthy computer
doing the compiling.  Second, it will mean that any further debugging
of other problems will have to be qualified by the degree to which
my system strays from both canonical sources and Red Hat's development
stream.

     Anyway, thanks for the efforts.  I would suggest that the
patchwork that you have done so far at least seems to have helped
some systems, so you might as well release what you have done.
In fact, although I have not used the K7S5A much, I do believe that
it has become more stable using the patched driver.

     I will eventually try to consolidate the previous postings and
notify the XFree86 people (though I would have expected them to have
checked out this problem "here" by now).  I assume that you will leave
this bug open so others might add comments later.  Maybe someone will
happen along who will fix it.

Comment 35 jimomura 2003-01-20 23:32:41 UTC

I have been upgrading the BIOS and Windows drivers on my ECS K7S5A
motherboard system and it occurred to me that there was a possibility
which I have considered, but which I do not think I actually mentioned.
The Windows system uses a separate AGP driver which is specific to the
motherboard (or actually the SiS735 chipset).  I do not know exactly
what is in the AGP driver, but it would seem to me that the Linux setup
is likely similar.  As such, an equivalent "AGP driver" is probably
part of the Kernel.

Since this Radeon DDR VIVO card is an AGP card, clearly a problem with
an AGP driver could be a problem showing up in the graphics systems.
The AGP driver for the SiS 735 chipset has been updated a few times --
the last being around Dec. 2002.  However, your Kernel may be based
on information as far back as 2001.  It would be a good idea to check
that out.

In fact, I have not upgraded the Linux Kernel since the Aug. 20, 2002
version (there have been 2 upgrades I know of), so it is possible that
the problem has been addressed.

As it is, I might not ever know the answer to this problem.  I am
currently considering swapping the video cards in 2 of my computers.
If so, I would probably end up with an older "Number 9" (S3 Savage4)
graphics card in this box and move the ATI card to the other box.
Assuming the S3 Savage4 card and drivers work better on this
motherboard, that may resolve the issue for me.  That would be good
enough for me, but unfortunately, it would leave the problem for someone
else to trip over in the future -- which is not unlikely because the
K7S5A has apparently been a particularly popular motherboard, and this
video board was not that rare either.

Comment 36 Mike A. Harris 2003-04-15 08:59:37 UTC

I've just reread all of the material in this bug report and I'm not sure what
to tell you.  It's the only bug report I've received of this nature ever,
so I presume if it was a common reproducible problem that I'd have received
multiple bug reports by now, or heard of similar problems on mailing lists
and IRC channels that I frequent.  It is almost certainly IMHO a localized
problem with your system, either some bad motherboard component such as the
BIOS or chipset, or perhaps even a bad video card.  Perhaps APM or something
is messing with the video card while X is also trying to control it.  I
really can't investigate the matter unless I can reproduce it though, and
I've never seen this kind of problem on any ATI Radeon hardware before.

Your best bet, if you are not using Red Hat Linux 9 already, is to upgrade
to that release, and if the problem still exists for you then try reporting
it on the xfree86 mailing list and/or the XFree86 bug tracking
database so that more people can see the problem, and perhaps someone else
will have other suggestions or feedback.

There isn't much I can do though, so I'm closing this as WORKSFORME
because if it is a bug, it can't be fixed unless it can be reproduced
and I can't reproduce it and don't know anyone else who can either.  Radeon
hardware being the most common hardware out there along with Nvidia, if it
were a major common problem, I'd almost certainly have heard more by
now.

Another suggestion is to try borrowing a different card and see if it happens
on that card.

Hope this helps.

Closing WORKSFORME