Bug 113449

Summary: [PATCH] Tyan 2885 system hang with AGP graphics enabled
Product: Red Hat Enterprise Linux 3
Reporter: Mark Langsdorf <mark.langsdorf>
Component: kernel
Assignee: Jim Paradis <jparadis>
Status: CLOSED ERRATA
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: 3.0
CC: davej, jdennis, jparadis, peterm
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: U2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2004-05-12 19:38:38 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: Patch to fix the AMD64 kernel's handling of 4 GB MTRR buffers; Flags: none
Description: Improved patch to fix MTRR handling; Flags: none

Description Mark Langsdorf 2004-01-14 00:22:40 UTC
Description of problem:
The default BIOS on the Tyan 2885 motherboard does not set up the
MTRRs properly, resulting in 1x AGP performance on high-end graphics
cards (instead of 8x) when the system has 8 GB+ of memory.  Tyan has
rewritten the BIOS with AMD's help (version 2885101k) and nVidia has
rewritten the graphics driver (version 10-5331).  This BIOS uses a 4
GB MTRR for memory from 4GB to 8GB and the mtrr.c in the 2.4.21-6EL
kernel does not handle this correctly.  A corrected mtrr.c, BIOS, and
nVidia driver have been sent to Jim Paradis.
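
To illustrate why a 4 GB range is awkward for code that only inspects the low 32 bits of an MTRR mask, here is a minimal sketch that encodes a range size into a variable-range PhysMask value. It assumes a 40-bit physical address width and the usual layout with the valid flag in bit 11. The names `PHYS_ADDR_BITS`, `mtrr_phys_mask`, `mask_lo_of` and `mask_hi_of` are invented for illustration and are not taken from mtrr.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 40 physical address bits, so the high MSR word has 8 valid bits. */
#define PHYS_ADDR_BITS  40
#define MTRR_MASK_VALID 0x800ULL   /* bit 11 of PhysMask: range enabled */

/*
 * Hypothetical helper: build the 64-bit PhysMask value for a range whose
 * size is a power of two (the only sizes a single MTRR can describe).
 */
static uint64_t mtrr_phys_mask(uint64_t size_bytes)
{
    /* Address bits 12..39 are the only ones the mask field can hold. */
    uint64_t addr_bits = ((1ULL << PHYS_ADDR_BITS) - 1) & ~0xfffULL;

    assert(size_bytes != 0 && (size_bytes & (size_bytes - 1)) == 0);
    return (~(size_bytes - 1) & addr_bits) | MTRR_MASK_VALID;
}

/* Split the value into the two 32-bit halves that an MSR read returns. */
static uint32_t mask_lo_of(uint64_t mask) { return (uint32_t)mask; }
static uint32_t mask_hi_of(uint64_t mask) { return (uint32_t)(mask >> 32); }
```

Under these assumptions a 4 GB range encodes as a high word of 0xff and a low word of 0x800: every address bit in the low word is zero, so the size can only be recovered by looking at the high word as well.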

With this combination of updates, the system will boot and run with 
8x AGP performance.  However, it will hang when loading any window 
manager more complicated than the 'failsafe' graphics manager.

Version-Release number of selected component (if applicable):
Tyan BIOS 2885101k
nVidia driver 10-5331

How reproducible:


Steps to Reproduce:
1.  Start with a Tyan 2885 K8W system with 2x Opteron 24x processors
and 4x1GB memory sticks.
2.  Update the BIOS to 2885101k.  
3.  Install RHEL3.  
4.  Apply AGP detection patch.  Replace
/usr/src/linux-2.4/arch/x86_64/kernel/mtrr.c with revised mtrr.c.
5.  Recompile the kernel. 
6.  Shut down the machine and add 4x1GB of memory, so that all slots
are populated and the machine has 8GB physical RAM.
7.  Reboot with new kernel into runlevel 3.
8.  Configure the nvidia 10-5331 driver.  Configure the X server to
run with AGP support.
9.  Go to runlevel 5.  Observe that X comes up.
10. Set session type to "failsafe" and log in.
11. Run glxgears.  Observe ~4500 fps on default settings.
12. Log out and log back in with "default" (kde/gnome) session.  
  
Actual results:
System hangs while restoring kde/gnome session.

Expected results:
kde/gnome session should restore and run.  Running glxgears should 
provide 4500 fps on default settings.

Additional info:
Many large geosurvey firms want to purchase Opteron workstations with 
8x AGP support, 8 GB of memory, and RHEL3.  Please do not ignore this 
report just because of nVidia's stupid proprietary drivers.  Thank 
you.

All necessary .c files, XF86Config files, nvidia driver libraries, 
and BIOS images have been sent to Jim Paradis (jparadis).

Comment 1 Arjan van de Ven 2004-01-15 09:39:36 UTC
Does this happen with the nv driver too?


Comment 2 Mike A. Harris 2004-01-15 09:44:44 UTC
Actually, does this problem occur with all video hardware using
any drivers?  I've had various reports with that Tyan motherboard,
which I've let sit for a while, as I know they were hardware related
problems.  I don't believe all of the problems reported were Nvidia
related strictly.  If you can confirm if this problem is wider in
scope, that would be helpful.

Also, if you could attach the files (preferably as unified diffs)
to the bug report for review (mtrr.c), that would be appreciated
also.

Thanks in advance.

Comment 3 Mike A. Harris 2004-01-15 09:53:23 UTC
mtrr.c is part of the kernel, not XFree86...  Reassigning to kernel.

Comment 4 Mike A. Harris 2004-01-15 11:16:30 UTC
Whoops, I changed the status to MODIFIED by accident, changing back.

Comment 5 Mark Langsdorf 2004-01-15 19:17:51 UTC
Created attachment 97037
Patch to fix the AMD64 kernel's handling of 4 GB MTRR buffers

Comment 6 Mark Langsdorf 2004-01-15 19:20:45 UTC
I haven't tested this with anything but the nVidia drivers.  ATI
doesn't provide accelerated graphics support for AMD64 yet and
I don't have any experience in getting the Open Source AGP drivers
to work on Red Hat.

I have confirmed that the problem occurs on both KDE and GNOME.

I will try setting up a Radeon 7500 on this system and see if the 
problem still occurs.

Comment 7 Mike A. Harris 2004-01-15 21:42:51 UTC
Thanks Mark.  The fix appears sane to me, and it looks as though it
would affect all video hardware running in X, not just Nvidia;
however, it might have only triggered visible problems on certain
setups to date.


Comment 8 Mike A. Harris 2004-01-15 21:46:19 UTC
Hmm, just noticed this.  Shouldn't the following:
+	newsize = (u64) (mask_hi | ~0xff) << 32 | (mask_lo & ~0x800);

Be changed to:
+	newsize = (u64) ((mask_hi | ~0xff) << 32 | (mask_lo & ~0x800));
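
The two forms differ in more than grouping. In C a cast binds tighter than `<<`, which binds tighter than `|`, so the first form widens `(mask_hi | ~0xff)` to 64 bits before shifting, whereas the second shifts a 32-bit value by 32 (undefined behaviour in C) and only then widens the result. The sketch below spells out the evaluation order of the first form. It assumes `mask_hi` and `mask_lo` are 32-bit unsigned MSR halves and `u64` is a 64-bit unsigned type; the helper names and the final negation that turns the mask into a size are illustrative assumptions, not taken from the attached patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Equivalent to the first form, with the implicit grouping made explicit. */
static u64 mask_to_newsize(u32 mask_hi, u32 mask_lo)
{
    /* Set the high-word bits above bit 7, then widen to 64 bits. */
    u64 hi = (u64)(mask_hi | ~0xffu);
    /* Drop the valid flag (bit 11) from the low word. */
    u64 lo = mask_lo & ~0x800u;

    /* The shift operates on a 64-bit value, so no bits are lost. */
    return (hi << 32) | lo;
}

/* Assumption: the range size is the two's complement of the extended mask. */
static u64 newsize_to_bytes(u64 newsize)
{
    assert(newsize != 0);
    return ~newsize + 1;
}
```

With a mask pair of `mask_hi = 0xff` and `mask_lo = 0x800`, which is what a 4 GB range would look like in a 40-bit address space, the first function returns 0xffffffff00000000 and the second converts that to 0x100000000 bytes.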



Comment 10 Mark Langsdorf 2004-01-15 22:44:15 UTC
Well, the first version doesn't load into KDE, but gets excellent
(4800+ fps on glxgears) AGP performance under the failsafe wm.  The
second version loads into KDE, but can't load the AGP v3 drivers and
only gives adequate AGP performance (2600+ fps).  Under more
compute-intensive tests, the second version has about 1/10th the
performance of the first version.

I did some more regression tests, and the 2885101i BIOS (available on 
the Tyan website) loads KDE and provides excellent performance with 
the first version of the test.  It has an MTRR map like this:
reg00: base=0xf0000000 (3840MB), size= 128MB: write-combining, count=1
reg01: base=0x00000000 (   0MB), size=2048MB: write-back, count=1
reg02: base=0x80000000 (2048MB), size=1024MB: write-back, count=1
reg03: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
reg04: base=0xc0000000 (3072MB), size= 256MB: write-back, count=1
reg05: base=0xd0000000 (3328MB), size= 128MB: write-back, count=1
reg06: base=0xd8000000 (3456MB), size=  32MB: write-back, count=1

The 2885101k has an MTRR map like this:
reg00: base=0x00000000 (   0MB), size=2048MB: write-back, count=1
reg01: base=0x80000000 (2048MB), size=1024MB: write-back, count=1
reg02: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
reg03: base=0xc0000000 (3072MB), size= 256MB: write-back, count=1
reg04: base=0xd0000000 (3328MB), size= 128MB: write-back, count=1
reg05: base=0xd8000000 (3456MB), size=  32MB: write-back, count=1
reg06: base=0xf0000000 (3840MB), size=128MB: write-combining, count=1

I'm not sure what the difference here really is, or why one system
locks up and the other doesn't.
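
Comparing the two listings register by register is error-prone, so a small check can help. The sketch below transcribes both maps from this comment and tests whether they describe the same set of ranges once register order is ignored. The struct, enum and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum mtrr_type { WRITE_BACK, WRITE_COMBINING };

struct mtrr_range {
    uint64_t base_mb;
    uint64_t size_mb;
    enum mtrr_type type;
};

#define NREGS 7

/* Transcribed from the 2885101i listing, in register order. */
static const struct mtrr_range bios_101i[NREGS] = {
    { 3840,  128, WRITE_COMBINING },
    {    0, 2048, WRITE_BACK },
    { 2048, 1024, WRITE_BACK },
    { 4096, 4096, WRITE_BACK },
    { 3072,  256, WRITE_BACK },
    { 3328,  128, WRITE_BACK },
    { 3456,   32, WRITE_BACK },
};

/* Transcribed from the 2885101k listing, in register order. */
static const struct mtrr_range bios_101k[NREGS] = {
    {    0, 2048, WRITE_BACK },
    { 2048, 1024, WRITE_BACK },
    { 4096, 4096, WRITE_BACK },
    { 3072,  256, WRITE_BACK },
    { 3328,  128, WRITE_BACK },
    { 3456,   32, WRITE_BACK },
    { 3840,  128, WRITE_COMBINING },
};

/* qsort comparator: order ranges by base address. */
static int by_base(const void *a, const void *b)
{
    const struct mtrr_range *x = a, *y = b;
    return (x->base_mb > y->base_mb) - (x->base_mb < y->base_mb);
}

/* Returns 1 if both maps hold the same ranges, ignoring register order. */
static int same_ranges(const struct mtrr_range *a,
                       const struct mtrr_range *b, size_t n)
{
    struct mtrr_range sa[NREGS], sb[NREGS];

    assert(n <= NREGS);
    memcpy(sa, a, n * sizeof *a);
    memcpy(sb, b, n * sizeof *b);
    qsort(sa, n, sizeof *sa, by_base);
    qsort(sb, n, sizeof *sb, by_base);

    for (size_t i = 0; i < n; i++) {
        if (sa[i].base_mb != sb[i].base_mb ||
            sa[i].size_mb != sb[i].size_mb ||
            sa[i].type != sb[i].type)
            return 0;
    }
    return 1;
}
```

For the two maps listed here, `same_ranges(bios_101i, bios_101k, NREGS)` succeeds: the only difference is that the write-combining range at 3840 MB sits in reg00 under 2885101i and in reg06 under 2885101k.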

Comment 12 Mark Langsdorf 2004-01-16 16:42:52 UTC
Richard Brunner pointed out that my temporary mtrr patch was wrong, 
and we worked out an improved version, which I have attached.  It 
solves the issue with the 2885101k BIOS.

Comment 13 Mark Langsdorf 2004-01-16 16:43:33 UTC
Created attachment 97059
Improved patch to fix MTRR handling

Comment 14 Jim Paradis 2004-03-03 20:28:27 UTC
MTRR patch submitted for U2


Comment 15 Mark Langsdorf 2004-03-30 22:11:21 UTC
AMD Validation reported a problem when running RHEL3 U2 Beta1 for 
AMD64 on an Asus platform with 8 GB of memory.  Applying the patch 
fixed the problem.  Shouldn't the patch have been part of RHEL3 U2 
Beta1?

Comment 16 Jim Paradis 2004-03-30 23:14:45 UTC
The patch just missed inclusion in Beta1, but I have verified that it
is in the codebase for the GA release.


Comment 17 Keith Lindsay 2004-05-10 13:59:09 UTC
For those of us who don't want to become career linux developers, could 
someone summarize what needs to be done to coax the Tyan S2885 with dual 
processors to actually run RHEL WS3 reliably ??  Is the best solution, for now, 
to just ditch AGP cards altogether and go to an ancient PCI card for reasonable, 
reliable performance?  THANKS MUCH... 

Comment 18 Mark Langsdorf 2004-05-12 19:36:37 UTC
How do I indicate that this has been fixed in U2 and AMD has verified 
the fix?