Red Hat Bugzilla – Bug 465144
Xen + Radeon + RAM > 2GB = Xorg crash
Last modified: 2011-02-05 11:49:43 EST
*** Description of problem: ***
When booting a Xen hypervisor + kernel on an x86_64 machine with more than
2GB of RAM and a Radeon video card, X is not able to run reliably and may even
crash the PC.
*** Version-Release number of selected component (if applicable): ***
*** How reproducible: ***
Nearly always after a few tries.
*** Steps to Reproduce: ***
1. Get an x86_64 computer (Athlon X2, Intel Core 2 Duo, ...) with a PCI-E Radeon R400 (FireGL V3100, Radeon X300, etc.).
2. Install an X11-less RHEL AP 5.2 with Virtualization.
3. Reboot, groupinstall "GNOME Desktop Environment" and "X Window System".
4. Reboot, still in runlevel 3.
5. Run "init 5".
If it runs properly, you can log in and launch a terminal, a glxgears window...
At this point, black rectangles may be drawn over GDM.
Killing X with CTRL+ALT+BACKSPACE usually makes the matter worse (no more X11).
Launching glxgears in the first Xorg instance then logging out and in seems to trigger the bug more often than not.
Sometimes glxgears refuses to run after X11 was restarted (only the first frame is displayed).
*** Actual results: ***
Black rectangles over the screen, sometimes locked-up Xorg, sometimes no video...
*** Expected results: ***
Xorg restarts gracefully :)
*** Additional info: ***
I have reproduced the bug on RHEL 5.2 Beta on a Dell Precision 390 (quad-core
Intel CPU, 4GB of RAM, Radeon X300SE), and on a fully patched (as of 20080930)
RHEL 5.2 on my personal workstation (Athlon X2, Uli 1697-based mainboard, 4GB
of RAM, FireGL V7100). Both workstations are 100% stable if kept in runlevel 3;
both can also run a Dom-0 Fedora 8 without crashing.
This does not happen on my PowerEdge servers, probably because their ES1000 video
card has DRI/DRM disabled by default, which is not possible
in EL5 for Radeon cards (Option "NoDRI" does not work).
The crash goes away if I force the Xen hypervisor to use only the first 2GB of
RAM by setting the following Xen hypervisor options in grub.conf:
Limiting RAM use by the Linux kernel in Dom-0 without also limiting Xen does not help.
Running with 4GB of RAM is possible:
* in runlevel 3 (of course);
* in runlevel 5 if Xorg is forced to use the Vesa driver instead of the Radeon driver;
* in runlevel 5 with the Radeon driver but without Xen.
I haven't managed to disable DRI in xorg.conf due to bug 465142.
This looks like a bug I fixed in RHEL 5.3, for another video card.
Fixing that bug involved both a fix to the hypervisor and a fix to the driver for that video card.
Francois, it would not surprise me if just the hypervisor fix alone fixes your issue. Could you please let us know whether the bug still happens with RHEL 5.3?
Update, I'll check it this week. Thanks for bearing with me.
I can definitely reproduce this on my Athlon X2 with RHEL 5.3 as released; the Precision is no longer available for testing. I'll update this bug as soon as I can update my system through RHN to see whether the latest updates fix the problem.
Created attachment 355192
Created attachment 355193
Created attachment 355194
After updating kernel-xen, I can still reproduce the problem. I attached dmesg, /var/log/messages, and /var/log/Xorg.0.log from the attempt with kernel-xen-2.6.18-128.2.1.el5.
My system uses the following packages:
Please note that I don't have RHN access to updated VT packages (xen, libvirt) yet.
Rik, are there any logs you need to debug this further? This is a test system, so feel free to ask me for more logs or to test updated packages.
I'll get a system to reproduce this locally, so I can figure out what's going on.
There is a probable fix in Dave Airlie's drm tree, from Jeremy Fitzhardinge.
commit c7e3bff327d8f5291046ff7ff0f4568dee1f0292
Author: Jeremy Fitzhardinge <jeremy[at]goop.org>
Date: Tue Nov 17 14:08:54 2009 -0800
drm: make sure page protections are updated after changing vm_flags
Some architectures compute ->vm_page_prot depending on ->vm_flags, so we
need to update the protections after adjusting the flags.
AFAIK this only affects running X under Xen; without this patch you get
lots of coloured blobs on the screen, or maybe a complete lockup. Or
But that still depends on lots of out-of-tree stuff, so I don't think
there are any consequences for anyone else. But it is wrong in principle.
Reported-by: Jan Beulich <JBeulich[at]novell.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge[at]citrix.com>
Signed-off-by: Andrew Morton <akpm[at]linux-foundation.org>
Signed-off-by: Dave Airlie <airlied[at]redhat.com>
I can't test right now, sorry.
(In reply to comment #15)
> There is a probable fix in Dave Airlie's drm tree, from Jeremy Fitzhardinge.
> commit c7e3bff327d8f5291046ff7ff0f4568dee1f0292
> Author: Jeremy Fitzhardinge <jeremy[at]goop.org>
> Date: Tue Nov 17 14:08:54 2009 -0800
> drm: make sure page protections are updated after changing vm_flags
While it's certainly possible, I'm not sure this will have an effect on a RHEL-5 era kernel. First of all, the DRM code is vastly different (due to all of the GEM rework upstream). However, even if I look at drivers/char/drm/drm_vm.c in the RHEL-5 tree (which seems to be the precursor to the GEM-based drivers/gpu/drm/drm_gem.c in current kernels), it doesn't seem to use vma->vm_flags to set vma->vm_page_prot. Instead, it seems to hard-code vm_page_prot (on i386/x86_64, at least) to PAGE_PCD & ~PAGE_PWT.
Of course, I'm not familiar at all with this code, so I definitely could be wrong, but I wouldn't really even know where to start backporting it. One of the DRM hackers would have to take a look.
I'm afraid we haven't been able to reproduce this issue with the hardware we have available. Is this issue still present on your system?
I've switched everything to RHEL6/KVM or Fedora/KVM here, so this is not important to me. I could try to reproduce on one of the hosts _if_ this is important to Red Hat though, your call.
This is a pretty rare bug, and the one system inside Red Hat that reproduced it has gotten lost. Let's assume all the users of such systems have moved on to other software or hardware by now and close this bug.
If another user runs into it, support will open a new bug.
That works. Thank you Rik.