Bug 57509 - Frequent Crashes under X, simple lockup to full reboot
Status: CLOSED WORKSFORME
Product: Red Hat Linux
Classification: Retired
Component: XFree86
Version: 7.2
Hardware: athlon Linux
Priority: medium Severity: medium
Assigned To: Mike A. Harris
David Lawrence
Reported: 2001-12-14 12:36 EST by P. Beltrani
Modified: 2007-04-18 12:38 EDT (History)
Last Closed: 2002-02-22 10:41:32 EST


Attachments
ksymoops output after running mesa fire (4.11 KB, text/plain)
2001-12-14 12:40 EST, P. Beltrani
ksymoops output after running mesa gears (4.23 KB, text/plain)
2001-12-14 12:41 EST, P. Beltrani
ksymoops output of unplanned oops while submitting report (36.79 KB, text/plain)
2001-12-14 12:42 EST, P. Beltrani

Description P. Beltrani 2001-12-14 12:36:32 EST
Description of problem:
Hardware: ASUS A7M266 Motherboard,  Athlon processor with AMD 761 Chipset
          BIOS upgraded from v1004a to v1005
          256 MB RAM
          ATI XPERT 2000 (Rage 128 RL) Video card

Software: OS: RH7.2 with current patches, Kernel 2.4.9-13.athlon

Possibly related bugs? : #49586, #56426

This system is very unstable under X.  Problems range from simple X crashes
to unannounced warm reboots.  The full system lockups/reboots make it
difficult to capture any useful debugging info.

I have been able to consistently lock up X by running some of the mesa
demos.  By locking up, I mean the system will not accept keyboard input and
the display is frozen.  I have to ssh in from another system to gracefully
reboot as ctrl-alt-del does not work.  Please note that this is NOT limited
to the mesa demos.  I used them because:
 1) They exercise (what I believe to be) the bug
 2) They reproduce the problem on demand
 3) They don't wipe out the system to the point that I can't capture any
debug info.

In addition to info generated by running the mesa demos, attached is
ksymoops output for an X crash that occurred while
trying to submit this bug report.

Version-Release number of selected component (if applicable):
Red Hat Linux 7.2

How reproducible:
Always

Steps to Reproduce:
1. Create generic test acct using unmodified configs from /etc/skel
2. Reboot system
3. Log in using generic acct and gnome
4. Run /usr/bin/fire or /usr/bin/gears
(The following steps may require logging in from
another system as X may have locked up to the extent
that even switching to a virtual console is not possible.)
5. Run ksymoops, capture output for bug report
6. Reboot
7. Repeat

Actual Results:  At the very least, an "Oops" is generated.


Expected Results:  System should not generate an "Oops". It should
DEFINITELY NOT bring down the kernel, i.e. warm reboot

Additional info:

The output of ksymoops for three different events is attached.

Argh is the result of running aspell while spell checking this bug report.
(X crashed and dumped me back to the login screen.)

CrashC is the result of running "fire"

CrashD is the result of running "gears"

Full logs, symbol and module lists for each "oops" event are available on
request.
Comment 1 P. Beltrani 2001-12-14 12:40:06 EST
Created attachment 40648 [details]
ksymoops output after running mesa fire
Comment 2 P. Beltrani 2001-12-14 12:41:45 EST
Created attachment 40649 [details]
ksymoops output after running mesa gears
Comment 3 P. Beltrani 2001-12-14 12:42:58 EST
Created attachment 40650 [details]
ksymoops output of unplanned oops while submitting report
Comment 4 Mike A. Harris 2002-01-24 20:06:40 EST
Try our latest kernel 2.4.9-21 out.

Also, try using the non-athlon kernel just to see if that fixes it
or not.  Try also booting with the option "nopentium", which bypasses
a bug related to Athlon CPUs, 4MB pages and AGP.  Many 3D lockup
problems are believed to be the result of this bug.
Comment 5 P. Beltrani 2002-02-04 11:14:55 EST
Items tried:

 0) Previously
  a) tried the i686 kernel without success. (Not 2.4.9-21.i686.)
  b) tried the test kernel at
http://people.redhat.com/arjanv/testkernels/athlon/kernel-2.4.9-17.6.athlon.rpm
without success

 1)
   a) Clean install with bad block check
   b) update to kernel 2.4.9-21

  Crashed and trashed the root file system. Had to do a rescue/manual fsck to
get it to a usable state.

 2)
   a) Clean install
   b) update to kernel 2.4.9-21
   c) boot with "nopentium" option

  Crashed with the now familiar "Unable to handle kernel paging request at
virtual address ..."
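
To compare crashes across kernels, it can help to pull the faulting addresses out of the logs. The following is a minimal sketch, assuming the message has the form quoted above followed by a bare 8-digit hex address (as on i386 2.4 kernels). The sample log lines, including their timestamps, are hypothetical and are not taken from the attachments.

```python
import re

# Matches the oops header quoted above; assumes a bare 8-digit hex address.
OOPS_RE = re.compile(
    r"Unable to handle kernel paging request at virtual address ([0-9a-fA-F]{8})"
)


def find_paging_oopses(log_text):
    """Return the faulting virtual addresses, as integers, in log order."""
    return [int(m.group(1), 16) for m in OOPS_RE.finditer(log_text)]


# Hypothetical syslog-style sample, for illustration only.
sample = """\
Feb 11 16:45:01 sam kernel: Unable to handle kernel paging request at virtual address d08b1da3
Feb 11 16:45:01 sam kernel: Oops: 0000
"""

addresses = find_paging_oopses(sample)
for addr in addresses:
    print(f"{addr:#010x}")
```

Run against a saved copy of the system log, this gives a list of addresses that can be checked for a pattern, for example whether they cluster in one region.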

For what it's worth:
 RAM checks out with two different test programs.  The system runs games like
Quake II and u$soft Flight Simulator 2000 just fine under win98 so I don't think
it's broken hardware.  (Buggy chip-sets excluded.)

I've seen threads on the linux-kernel mailing list archived at
http://marc.theaimsgroup.com/ that there are problems with the AMD-761 and AGP.
 Are there any known issues with the AMD-761 and ATI XPERT 2000, Rage 128 based
card?
Comment 6 Mike A. Harris 2002-02-09 14:58:48 EST
Appears to be a R128 DRM problem if I read the oops reports correctly.

Arjan, can you make heads or tails of it?  I'm not well versed when
it comes to debugging kernel oops.  ;o)

Yes, there are problems reported with some AMD chipsets and some ATI video
cards, however that particular combination you mention, I'm not aware of.

This is something the kernel guys can answer better I believe. Stephen?
Comment 7 Stephen Tweedie 2002-02-11 08:28:04 EST
AMD-761 and Radeon have been a known bad combination, but I can't recall seeing
similar reports of problems with the r128.

The first of the r128 oopses is *really* weird: the kernel is trying to execute
code with %EIP in the middle of an assembler instruction.  No wonder it oopses
--- it's essentially trying to execute garbage.  

The second looks semi-sane --- there's an oops accessing memory at %d08b1da3,
which is _just_ above the 256MB boundary and so could actually be physical
memory if the e820 map is doing weird stuff; it could also be AGP memory,
although the CPU should probably not be trying to access that.  What do the
"BIOS-e820" entries in your kernel boot log look like?

The other three oopses are just random memory corruption most likely triggered
by the initial corruption.

Is the system stable under Linux when not using the r128 drm (ie. when not
running accelerated 3d apps)?
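
The "just above the 256MB boundary" remark can be checked with a little arithmetic. This is a sketch of one reading of it, assuming the kernel's direct mapping of physical RAM starts at 0xC0000000 (the default 3G/1G split on i386, not confirmed for this system) and that the machine has exactly 256 MB of RAM as stated in the report.

```python
# Assumption: default i386 3G/1G split, so physical address 0 is mapped
# at kernel virtual address 0xC0000000.
PAGE_OFFSET = 0xC0000000
RAM_BYTES = 256 * 1024 * 1024   # 256 MB, from the hardware description
fault_addr = 0xD08B1DA3         # faulting address cited in the comment above

# Kernel virtual address one byte past the end of directly mapped RAM.
end_of_direct_map = PAGE_OFFSET + RAM_BYTES

# How far past the end of RAM the faulting address lies.
overshoot = fault_addr - end_of_direct_map

print(f"end of direct map: {end_of_direct_map:#x}")
print(f"overshoot: {overshoot:#x} bytes ({overshoot / 2**20:.2f} MiB)")
```

Under those assumptions the address falls a few megabytes past the end of the direct mapping, which is consistent with the description above but does not by itself say what was mapped there.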
Comment 8 P. Beltrani 2002-02-11 17:01:22 EST
1) Re: BIOS-e820

The following is a partial capture of the system boot:

Linux version 2.4.9-21 (bhcompile@stripples.devel.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #1 Thu Jan 17 13:35:37 EST 2002
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000000ffec000 (usable)
 BIOS-e820: 000000000ffec000 - 000000000ffef000 (ACPI data)
 BIOS-e820: 000000000ffef000 - 000000000ffff000 (reserved)
 BIOS-e820: 000000000ffff000 - 0000000010000000 (ACPI NVS)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
On node 0 totalpages: 65516
zone(0): 4096 pages.
zone(1): 61420 pages.
zone(2): 0 pages.
Found and enabled local APIC!
Kernel command line: auto BOOT_IMAGE=sconsole ro root=305 BOOT_FILE=/boot/vmlinuz-2.4.9-21 hdd=ide-scsi mem=nopentium console=ttyS0
...
...
Feb 11 16:30:42 sam kernel: Symbols match kernel version 2.4.9.
Feb 11 16:30:42 sam kernel: Loaded 222 symbols from 7 modules.
Feb 11 16:30:42 sam kernel: Linux version 2.4.9-21 (bhcompile@stripples.devel.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #1 Thu Jan 17 13:35:37 EST 2002
Feb 11 16:30:42 sam kernel: BIOS-provided physical RAM map:
Feb 11 16:30:42 sam kernel:  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 0000000000100000 - 000000000ffec000 (usable)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 000000000ffec000 - 000000000ffef000 (ACPI data)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 000000000ffef000 - 000000000ffff000 (reserved)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 000000000ffff000 - 0000000010000000 (ACPI NVS)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
Feb 11 16:30:42 sam kernel:  BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
Feb 11 16:30:42 sam kernel: On node 0 totalpages: 65516
Feb 11 16:30:42 sam kernel: zone(0): 4096 pages.
Feb 11 16:30:42 sam kernel: zone(1): 61420 pages.
Feb 11 16:30:42 sam kernel: zone(2): 0 pages.
Feb 11 16:30:42 sam kernel: Found and enabled local APIC!
Feb 11 16:30:42 sam kernel: Kernel command line: auto BOOT_IMAGE=sconsole ro root=305 BOOT_FILE=/boot/vmlinuz-2.4.9-21 hdd=ide-scsi mem=nopentium console=ttyS0
Feb 11 16:30:42 sam kernel: ide_setup: hdd=ide-scsi
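
As a sanity check on the map above, the sketch below parses the BIOS-e820 lines and compares them with the page counts the kernel printed. The entries are copied from the boot log; the 4 KB page size and the 16 MB size of the first (DMA) zone are standard for i386, and the interpretation of "totalpages" as the top usable page frame is an assumption that happens to match the numbers.

```python
import re

PAGE_SIZE = 4096                    # i386 page size
DMA_ZONE_BYTES = 16 * 1024 * 1024   # first 16 MB forms the DMA zone on i386

# Start address, end address and type of each BIOS-e820 entry.
E820_RE = re.compile(r"BIOS-e820:\s+([0-9a-f]{16}) - ([0-9a-f]{16}) \((.+?)\)")

# Entries copied from the boot log above.
boot_log = """\
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000000ffec000 (usable)
 BIOS-e820: 000000000ffec000 - 000000000ffef000 (ACPI data)
 BIOS-e820: 000000000ffef000 - 000000000ffff000 (reserved)
 BIOS-e820: 000000000ffff000 - 0000000010000000 (ACPI NVS)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
"""


def parse_e820(text):
    """Return a list of (start, end, kind) tuples, addresses as integers."""
    return [(int(s, 16), int(e, 16), kind) for s, e, kind in E820_RE.findall(text)]


entries = parse_e820(boot_log)
usable = [(s, e) for s, e, kind in entries if kind == "usable"]

usable_bytes = sum(e - s for s, e in usable)
top_usable_pfn = max(e for _, e in usable) // PAGE_SIZE
dma_pages = DMA_ZONE_BYTES // PAGE_SIZE
normal_pages = top_usable_pfn - dma_pages

print("usable bytes:", usable_bytes)
print("top usable page frame:", top_usable_pfn)   # log: totalpages 65516
print("DMA zone pages:", dma_pages)               # log: zone(0) 4096
print("normal zone pages:", normal_pages)         # log: zone(1) 61420
```

The computed values match the kernel's own totals, and usable RAM ends slightly below the 256 MB mark (0x10000000), so on this reading the map looks ordinary rather than "weird".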





2) Re: stability when not using r128 drm


The reason I used gears and fire to illustrate the problem is they were fairly
consistent at causing a crash in a reasonable amount of time. The system has
crashed while doing something as simple as moving a KDE "Konsole" window or
running top.  However, I can't say for sure that something 3D related was not
run between system boot and execution of the command that triggered the crash.

I would be happy to set up any specific tests you would like. The system is
useless for any real work in its current state so it's not a problem for me to
rebuild it, try new kernels etc.
Comment 9 Stephen Tweedie 2002-02-11 17:14:02 EST
It would be useful to know if (a) you can reproduce the problem without running
X at all; or (b) whether simple cpu-intensive tasks (such as rebuilding a
kernel) can provoke the problems.

However, at this point it really looks like hardware. It may be a peculiarity of
the way Linux is driving the hardware, or it may be hardware problems for which
Windows has a workaround --- it's really impossible to tell right now.
Comment 10 P. Beltrani 2002-02-11 19:38:07 EST
The system does NOT appear to have any problems with strictly CPU intensive
tasks.  I can build test kernels etc without problems, provided I do it from one
of the virtual consoles.  I am very open to suggestions as to how to stress test the
hardware without X, especially the AGP and video hardware. (It feels like
something on the video or AGP side is trashing sections of memory.)  Is there a
benchmarking or acceptance suite anyone would care to suggest?

I tend to agree with the last comment suggesting that the root of the problem is
hardware related. i.e. "works as designed but the hardware design is flawed". 
For what it's worth all the major components are on AMD's approved list.

Does anyone else out there have this hardware combo: Asus A7M266 Main board with
AMD 761 Chipset,  ATI XPERT 2000 AGP, Rage 128 based video?  If so, are you
experiencing similar problems?
Comment 11 P. Beltrani 2002-02-22 10:41:27 EST
For what it's worth:

Re: Is the system stable under Linux when not using the r128 drm (ie. when not
running accelerated 3d apps)?

I swapped out the Xpert2000 Rage 128 AGP card for an Xpert98 Rage Pro PCI.  With
the AGP card the system would die almost immediately after starting gears.  With
the PCI card it would run for several hours but eventually failed with the now
familiar "Unable to handle kernel paging request ..."


One thing I do find interesting is lspci reports AMD-760 but this is an AMD-761
system.  This is true even with kernel v2.4.17 which does have AMD-761 AGP support.

i.e:
00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 [Irongate] System
Controller (rev 13)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 [Irongate] AGP Bridge

Finally, I don't plan on spending much more time on this issue.  If anyone has
any specific test they would like run, please let me know. Otherwise, unless
there are others with this problem, you all may want to drop it as well.
Comment 12 Mike A. Harris 2002-03-07 15:38:47 EST
I believe this issue is just bad hardware or a bad hardware combination.
Also, the error "Unable to handle kernel paging request"
is a kernel crash, not an XFree86 one.  However, if you find any more info that
you think might be helpful, please add it to the report.

