465144 – Xen + Radeon + RAM > 2GB = Xorg crash

Summary: Xen + Radeon + RAM > 2GB = Xorg crash

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	5.2
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Rik van Riel
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	514491

Reported:	2008-10-01 19:40 UTC by François Cami
Modified:	2011-02-05 16:49 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-02-05 15:59:18 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
dmesg 2.6.18-128.2.1.el5xen (18.60 KB, text/plain) 2009-07-26 12:29 UTC, François Cami
/var/log/messages (271.83 KB, text/plain) 2009-07-26 12:29 UTC, François Cami
/var/log/Xorg.0.log (43.90 KB, text/plain) 2009-07-26 12:30 UTC, François Cami

Links
System	ID	Private	Priority	Status	Summary	Last Updated
CentOS	2665	0	None	None	None	Never

Description François Cami 2008-10-01 19:40:24 UTC

*** Description of problem: ***

When booting a Xen hypervisor + kernel on an x86_64 machine with more than
2GB of RAM and a Radeon video card, X is not able to run reliably and may even
crash the PC.

*** Version-Release number of selected component (if applicable): ***

3.0.3-64.el5_2.1.x86_64

*** How reproducible: ***

Nearly always after a few tries.

*** Steps to Reproduce: ***
1. Get an x86_64 computer (Athlon X2, Intel Core2Duo...) with a PCI-E Radeon R400 (FireGL V3100, Radeon X300, etc.).
2. Install an X11-less RHEL AP 5.2 with Virtualization.
3. Reboot, then yum groupinstall "GNOME Desktop Environment" and "X Window System".
4. Reboot, still in runlevel 3.
5. Run "init 5".

If it runs properly, you can log in, launch a terminal, a glxgears window...
Then log out.
At this point, black rectangles may be drawn over GDM.
Killing X with CTRL+ALT+BACKSPACE usually makes matters worse (no more X11).
Launching glxgears in the first Xorg instance, then logging out and back in, seems to trigger the bug more often than not.
Sometimes glxgears refuses to run after X11 has been restarted (only the first frame is displayed).

*** Actual results: ***

Black rectangles over the screen, sometimes locked-up Xorg, sometimes no video...

*** Expected results: ***

Xorg restarts gracefully :)

*** Additional info: ***

I have reproduced the bug on RHEL 5.2 Beta on a Dell Precision 390 (quad core
Intel CPU, 4GB of RAM, Radeon X300SE), and on a fully patched (as of 20080930)
RHEL 5.2 on my personal workstation (Athlon X2, ULi 1697-based mainboard, 4GB
of RAM, FireGL V7100). Both workstations are 100% stable if kept in runlevel 3;
both can also run a Fedora 8 Dom-0 without crashing.

This does not happen on my PowerEdge servers, probably because their ES1000
video card has DRI/DRM disabled by default, which cannot be done in EL5 for
Radeon cards (Option "NoDRI" does not work).

The crash goes away if I force the Xen hypervisor to use only the first 2GB of
RAM by setting the following Xen hypervisor options in grub.conf:
mem=1984M dom0_mem=1800M
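
For illustration, a grub.conf entry carrying these options might look like the sketch below. Only the two options on the xen.gz line come from this report; the title, kernel version strings, root device and boot partition are assumptions chosen to resemble a stock RHEL 5.2 install and will differ per system.

```text
# /boot/grub/grub.conf (fragment); versions and devices are illustrative
title Red Hat Enterprise Linux Server (2.6.18-92.el5xen)
        root (hd0,0)
        # Hypervisor line: mem= caps the RAM Xen itself uses,
        # dom0_mem= sets the memory given to Dom-0.
        kernel /xen.gz-2.6.18-92.el5 mem=1984M dom0_mem=1800M
        # Dom-0 Linux kernel and initrd are loaded as modules.
        module /vmlinuz-2.6.18-92.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-92.el5xen.img
```

The point of the layout is that the options go on the hypervisor (xen.gz) line, not on the module line that loads the Dom-0 Linux kernel.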

Limiting RAM use by the Linux kernel in Dom-0 without limiting Xen is of no use.

Running with 4GB of RAM is possible:
* in runlevel 3 (of course);
* in runlevel 5 if Xorg is forced to use the Vesa driver instead of the Radeon
driver;
* in runlevel 5 with the Radeon driver but without Xen.
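
As a sketch of the Vesa workaround from the list above, the Device section of xorg.conf could be changed along these lines. The identifier name is an assumption (it has to match whatever the Screen section of the same file refers to); the rest of the file is left as generated by the installer.

```text
# /etc/X11/xorg.conf (fragment)
Section "Device"
    # "Videocard0" is an assumed identifier; it must match the Device
    # line of the Screen section in the same file.
    Identifier "Videocard0"
    # Use the generic VESA driver instead of "radeon".
    Driver     "vesa"
EndSection
```

The Vesa driver provides no DRI/DRM acceleration, so this only works around the crash at the cost of 3D support.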

I haven't managed to disable DRI in xorg.conf due to bug 465142.

Comment 3 Rik van Riel 2009-01-28 19:23:14 UTC

This looks like a bug I fixed in RHEL 5.3, for another video card.  

Fixing that bug involved both a fix to the hypervisor and a fix to the driver for that video card.

Francois, it would not surprise me if just the hypervisor fix alone fixes your issue. Could you please let us know whether the bug still happens with RHEL 5.3?

Comment 4 François Cami 2009-03-02 21:07:22 UTC

Update, I'll check it this week. Thanks for bearing with me.

Comment 8 François Cami 2009-07-26 10:26:16 UTC

I can definitely reproduce this on RHEL 5.3 as released on my Athlon X2; the Precision is no longer available for testing. I'll update this bug as soon as I can update my system through RHN to see if the latest updates fix the problem.

Comment 9 François Cami 2009-07-26 12:29:02 UTC

Created attachment 355192
dmesg 2.6.18-128.2.1.el5xen

Comment 10 François Cami 2009-07-26 12:29:45 UTC

Created attachment 355193
/var/log/messages

Comment 11 François Cami 2009-07-26 12:30:20 UTC

Created attachment 355194
/var/log/Xorg.0.log

Comment 12 François Cami 2009-07-26 12:32:26 UTC

I can still reproduce the problem after updating kernel-xen. I added dmesg, /var/log/messages, and /var/log/Xorg.0.log from the attempt with kernel-xen-2.6.18-128.2.1.el5.

My system uses the following packages:
kernel-xen-2.6.18-128.2.1.el5
xorg-x11-server-Xorg-1.1.1-48.52.el5
xorg-x11-drv-ati-6.6.3-3.22.el5

Please note that I don't have RHN access to updated VT packages (xen, libvirt) yet.

Rik, are there any logs you need to debug this further? This is a test system, so feel free to ask me for more logs or to test updated packages.

Comment 13 Rik van Riel 2009-07-26 13:35:45 UTC

I'll get a system to reproduce this locally, so I can figure out what's going on.

Comment 15 François Cami 2009-11-19 20:55:41 UTC

Hello,

There is a probable fix in Dave Airlie's drm tree, from Jeremy Fitzhardinge.

http://www.gossamer-threads.com/lists/linux/kernel/1155486

commit c7e3bff327d8f5291046ff7ff0f4568dee1f0292
Author: Jeremy Fitzhardinge <jeremy[at]goop.org>
Date: Tue Nov 17 14:08:54 2009 -0800

drm: make sure page protections are updated after changing vm_flags

Some architectures compute ->vm_page_prot depending on ->vm_flags, so we
need to update the protections after adjusting the flags.

AFAIK this only affects running X under Xen; without this patch you get
lots of coloured blobs on the screen, or maybe a complete lockup. Or
anything really.

But that still depends on lots of out-of-tree stuff, so I don't think
there are any consequences for anyone else. But it is wrong in principle.

Reported-by: Jan Beulich <JBeulich[at]novell.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge[at]citrix.com>
Signed-off-by: Andrew Morton <akpm[at]linux-foundation.org>
Signed-off-by: Dave Airlie <airlied[at]redhat.com> 

I can't test right now, sorry.

François

Comment 16 Chris Lalancette 2009-11-20 09:05:49 UTC

(In reply to comment #15)
> Hello,
> 
> There is a probable fix in Dave Airlie's drm tree, from Jeremy Fitzhardinge.
> 
> http://www.gossamer-threads.com/lists/linux/kernel/1155486
> 
> commit c7e3bff327d8f5291046ff7ff0f4568dee1f0292
> Author: Jeremy Fitzhardinge <jeremy[at]goop.org>
> Date: Tue Nov 17 14:08:54 2009 -0800
> 
> drm: make sure page protections are updated after changing vm_flags

While it's certainly possible, I'm not sure if this will have an effect on a RHEL-5 era kernel.  First of all, the DRM code is vastly different (due to all of the GEM rework upstream).  However, even if I look at drivers/char/drm/drm_vm.c in the RHEL-5 tree (which seems to be the precursor to the GEM-based drivers/gpu/drm/drm_gem.c in current kernels), it doesn't seem to use the vma->vm_flags to set the vma->vm_page_prot.  Instead, it seems to hard-code the vm_page_prot (on i386/x86_64, at least) to PAGE_PCD & ~PAGE_PWT.

Of course, I'm not familiar at all with this code, so I definitely could be wrong, but I wouldn't really even know where to start backporting it.  One of the DRM hackers would have to take a look.

Chris Lalancette

Comment 22 Andrew Jones 2011-01-21 13:20:52 UTC

Hello François,

I'm afraid we haven't been able to reproduce this issue with the hardware we have available. Is this issue still present on your system?

Drew

Comment 23 François Cami 2011-02-05 12:14:49 UTC

Hi Drew,

I've switched everything to RHEL6/KVM or Fedora/KVM here, so this is not important to me. I could try to reproduce on one of the hosts _if_ this is important to Red Hat though, your call.

François

Comment 24 Rik van Riel 2011-02-05 15:59:18 UTC

François,

this is a pretty rare bug and the one system inside Red Hat that reproduced the bug has gotten lost.  Let's assume all the users of such systems have moved on to either other software or hardware by now and close this bug.

If another user runs into it, support will open a new bug.

Comment 25 François Cami 2011-02-05 16:49:43 UTC

That works. Thank you Rik.
