Red Hat Bugzilla – Bug 437412
System reboots when X started on dom0
Last modified: 2010-10-22 19:14:28 EDT
Description of problem:
System reboots when entering runlevel 5
Version-Release number of selected component (if applicable):
System is fresh install from 20080306.0 RHEL5.2 server tree
Steps to Reproduce:
1. Install system with 20080306.0 5.2 beta x86_64 including Virt support
2. Allow system to start normally
3. System boots with xen kernel and reboots when X tries to start.
Last line in /var/log/messages is kernel: [drm] Initialized i915 1.8.0 20060920
on minor 0
Created attachment 297993 [details]
sosreport from affected machine
Possible dup of 435130.
Would it be possible for you to capture serial console from the host as it
crashes? (Let me know if you need help setting that up on Xen.)
And can you verify that the non-xen kernel boots successfully?
The non-Xen kernel works just fine. I need to set up another system for serial
console; once that's ready I'll paste the data here. (This is a production
DQ35JO system with no serial ports, but I have several preproduction versions
of this system (weybridge platform) that have RS-232 on board.)
Created attachment 298069 [details]
serial console output of panic
I'm attaching the serial console output of a machine that reboots when X is
started on the xen kernel.
Also, did 5.1 boot correctly on this hardware?
<gcase> sct, these boxes were certified on RHEL5.1, so I know that xen kernel
had to work in order to run our tests
Just to narrow it down, does the 5.2-beta userland with 5.1 kernel-xen boot?
The oops shows
Kernel BUG at include/asm/mach-xen/asm/maddr.h:24
invalid opcode: 0000  SMP
Pid: 7329, comm: Xorg Not tainted 2.6.18-84.el5xen #1
RIP [<ffffffff802214ea>] sys_mprotect+0x937/0xb80
and we did touch mprotect code in kernel-xen for 5.2, so it is quite possible
that that's where the regression might be, but it would be useful to confirm
it's definitely kernel-xen and not (say) a change in the X drivers.
I just --force --nodeps installed the kernel-xen from 5.1, version 53.1.14. It
works fine.
Hmm: your oops announces
Kernel 2.6.18-84.el5xen on an x86_64
but we didn't add the big Xen mprotect change until -85.el5. So it's not that...
> I just --force --nodeps installed the kernel-xen from 5.1, version 53.1.14. It
> works fine.
OK, so definitely looks like the regression is in kernel-xen. Thanks!
I have isolated the similar issue in 435130: it starts with the -65 kernel. The
-64 HV and kernel work fine and -65 fails. Tried the -65 HV with the -64 kernel and all was
OK, this definitely looks like a regression from the PV migration fix. It
doesn't look straightforward to fix, though.
Basically, mprotect(PROT_NONE) is faked on x86 hardware, since the MMU does not
have the granularity to mark present pages as being unreadable. So for such
regions, the kernel installs ptes which are not-present as far as the hardware
is concerned (_PAGE_PRESENT is clear), but which otherwise look present to the
kernel (a separate bit, _PAGE_PROTNONE, is set to let pte_present() detect that
the pte is still pointing to a real physical page.)
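
To make the two views of a PROT_NONE pte concrete, here is a minimal user-space sketch. It is not kernel code: the bit values are assumed for illustration, and make_pte(), hw_present() and pte_mk_protnone() are hypothetical helper names. Only the idea is taken from the description above: the hardware-visible present bit is cleared while a separate software bit keeps pte_present() true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout, assumed for this sketch only. */
#define PAGE_SHIFT      12
#define _PAGE_PRESENT   UINT64_C(0x001)  /* what the MMU and the hypervisor check */
#define _PAGE_PROTNONE  UINT64_C(0x080)  /* software marker, used while PRESENT is clear */

typedef struct { uint64_t pte; } pte_t;

/* Build an ordinary, hardware-present pte for a frame number. */
pte_t make_pte(uint64_t frame)
{
    assert(frame < (UINT64_C(1) << 40));   /* keep the toy frame number in range */
    pte_t p = { (frame << PAGE_SHIFT) | _PAGE_PRESENT };
    return p;
}

/* Hardware view: only the real present bit counts. */
bool hw_present(pte_t p)
{
    return (p.pte & _PAGE_PRESENT) != 0;
}

/* Kernel view: either bit means the pte still refers to a real page. */
bool pte_present(pte_t p)
{
    return (p.pte & (_PAGE_PRESENT | _PAGE_PROTNONE)) != 0;
}

/* Fake PROT_NONE: hide the page from hardware but keep the frame number. */
pte_t pte_mk_protnone(pte_t p)
{
    p.pte = (p.pte & ~_PAGE_PRESENT) | _PAGE_PROTNONE;
    return p;
}

/* Frame number stored in the pte. */
uint64_t pte_frame(pte_t p)
{
    return p.pte >> PAGE_SHIFT;
}
```

After pte_mk_protnone(), hw_present() is false while pte_present() is still true and the frame number is unchanged. That mismatch is what the hypervisor cannot see.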
However, the hypervisor has no knowledge of _PAGE_PROTNONE. So when we clear
the physical _PAGE_PRESENT bit, the hypervisor no longer expects a physical pte,
pointing to a true machine mfn, in the pte. So on guest migration, the
hypervisor will not automatically translate the pte to point to the correct mfn
on the new host.
So in 5.2, we added a fix from upstream Xen to "canonicalise" these PROTNONE
pages, turning them from mfn references to guest-relative pseudophysical pfns.
When the PROTNONE gets cleared, we restore them to mfns; if they get migrated
while still containing pfns, then on the new host the correct translation will
get applied at the time they are converted back to mfns.
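
The canonicalisation round trip can be sketched in user space as follows. This is a toy model under stated assumptions, not the 5.2 code: struct host, struct toy_pte and every toy_* function are hypothetical names, and each host's pfn-to-mfn table is a tiny array invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NR_GUEST_PAGES 4u

/* Toy per-host translation state (assumed; a real guest gets this from Xen). */
struct host {
    uint64_t p2m[NR_GUEST_PAGES];   /* pseudophysical pfn -> machine mfn */
};

/* pfn -> mfn on the given host. */
uint64_t toy_pfn_to_mfn(const struct host *h, uint64_t pfn)
{
    assert(pfn < NR_GUEST_PAGES);
    return h->p2m[pfn];
}

/* mfn -> pfn by linear search; NR_GUEST_PAGES means "no pfn for this mfn". */
uint64_t toy_mfn_to_pfn(const struct host *h, uint64_t mfn)
{
    for (uint64_t pfn = 0; pfn < NR_GUEST_PAGES; pfn++)
        if (h->p2m[pfn] == mfn)
            return pfn;
    return NR_GUEST_PAGES;
}

/* A pte reduced to what matters here. */
struct toy_pte {
    uint64_t frame;     /* mfn normally, pfn while protnone is set */
    int protnone;
};

/* mprotect(PROT_NONE): store the host-independent pfn instead of the mfn. */
void toy_set_protnone(const struct host *h, struct toy_pte *pte)
{
    pte->frame = toy_mfn_to_pfn(h, pte->frame);
    pte->protnone = 1;
}

/* PROT_NONE removed: translate back using whichever host we run on now. */
void toy_clear_protnone(const struct host *h, struct toy_pte *pte)
{
    pte->frame = toy_pfn_to_mfn(h, pte->frame);
    pte->protnone = 0;
}
```

If the pte is marked PROT_NONE on one host and cleared on another, it ends up with the second host's mfn for the same pfn, which is the behaviour the migration fix is after.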
This all works fine, UNLESS the mfns point to memory that does not have a valid
pfn translation at all, such as some ioremap()ed hardware device memory. So it
looks as if the 5.2 X server here is doing an mprotect(PROT_NONE) on such
ioremap()ed memory, which translates the mfn to pfn via
    static inline unsigned long mfn_to_pfn(unsigned long mfn)
            if (unlikely((mfn >> machine_to_phys_order) != 0))
which returns end_pfn; and then when we later mprotect(PROT_READ) again, we hit
the reverse translation in pfn_to_mfn() ---
    static inline unsigned long pfn_to_mfn(unsigned long pfn)
            BUG_ON(end_pfn && pfn >= end_pfn);
and hit this BUG_ON().
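
The failing sequence can be reproduced in a small user-space model. The table order, RAM base and end_pfn value below are invented for the example, the toy_* names are hypothetical, and the BUG_ON() is reported through a flag rather than by crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy values, assumed for illustration only. */
#define TOY_M2P_ORDER 4u   /* machine-to-phys table covers mfns 0..15 */
#define TOY_RAM_BASE  8u   /* guest RAM sits at mfns 8..15 */
#define TOY_END_PFN   8u   /* so valid guest pfns are 0..7 */

/* Forward translation: mfns without a pfn come back as TOY_END_PFN. */
uint64_t toy_mfn_to_pfn(uint64_t mfn)
{
    if ((mfn >> TOY_M2P_ORDER) != 0)
        return TOY_END_PFN;          /* beyond the table, e.g. device memory */
    if (mfn < TOY_RAM_BASE)
        return TOY_END_PFN;          /* in the table but not this guest's RAM */
    return mfn - TOY_RAM_BASE;
}

/* Reverse translation; *bug is set where the kernel would hit BUG_ON(). */
uint64_t toy_pfn_to_mfn(uint64_t pfn, bool *bug)
{
    *bug = (pfn >= TOY_END_PFN);
    return *bug ? 0 : pfn + TOY_RAM_BASE;
}

/* Naive PROT_NONE round trip: canonicalise to a pfn, then restore the mfn. */
bool toy_round_trip_bugs(uint64_t mfn)
{
    bool bug;
    uint64_t pfn = toy_mfn_to_pfn(mfn);
    uint64_t back = toy_pfn_to_mfn(pfn, &bug);

    assert(bug || back == mfn);      /* a valid translation must round-trip */
    return bug;
}
```

An mfn inside guest RAM survives the round trip, while a device mfn such as 0xfd000 (an arbitrary value chosen here) comes back as end_pfn and trips the check on the way back.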
A fix is likely to involve some form of special-casing of these pfn/mfn
translations for the case where the mfn is not within the normal translatable
page range of the kernel.
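
One possible shape for such special-casing, as a user-space sketch only. This is not the attached patch and not necessarily how upstream does it: the "canonical" flag, the RAM range and all toy_* names are assumptions made for the example. The idea is to convert only mfns that have a pfn, and to leave everything else untouched in both directions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy layout, assumed: guest RAM is mfns 8..15, i.e. pfns 0..7. */
#define TOY_RAM_BASE  8u
#define TOY_END_PFN   8u

struct toy_pte {
    uint64_t frame;     /* mfn, or pfn while canonical is set */
    bool protnone;
    bool canonical;     /* hypothetical marker: frame currently holds a pfn */
};

/* True only for mfns that have a pfn translation in this guest. */
bool toy_mfn_is_guest_ram(uint64_t mfn)
{
    return mfn >= TOY_RAM_BASE && mfn < TOY_RAM_BASE + TOY_END_PFN;
}

/* mprotect(PROT_NONE): canonicalise only translatable frames. */
void toy_set_protnone(struct toy_pte *pte)
{
    assert(!pte->protnone);
    if (toy_mfn_is_guest_ram(pte->frame)) {
        pte->frame -= TOY_RAM_BASE;      /* mfn -> pfn */
        pte->canonical = true;
    }
    pte->protnone = true;                /* device frames keep their mfn */
}

/* PROT_NONE removed: undo the conversion only if it was done. */
void toy_clear_protnone(struct toy_pte *pte)
{
    assert(pte->protnone);
    if (pte->canonical) {
        pte->frame += TOY_RAM_BASE;      /* pfn -> mfn */
        pte->canonical = false;
    }
    pte->protnone = false;
}
```

With this guard a device frame passes through PROT_NONE and back with its mfn unchanged, and RAM frames are still canonicalised for migration.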
Created attachment 299460 [details]
Fix xen mprotect(PROT_NONE) handling on ioremap()ed memory
Proposed patch, based on upstream fix.
I can confirm that it works for me on the system that made me file 435130 - Bill
It works for me as well on my weybridge qual box. The -86 kernel causes a panic
and reboot at startx; the test kernel works as expected (no panic/crash).
You can download this test kernel from http://people.redhat.com/dzickus/el5
I tried the test kernel and it worked for me on my weybridge qual.
Yongkang, can you try it out on your systems to verify that it works for
you as well?
Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client
This event sent from IssueTracker by gcase
*** Bug 435130 has been marked as a duplicate of this bug. ***
In Issue Tracker #173507, I have verified that the new -89.el5 kernel-xen has
fixed the issue.
Event posted 04-09-2008 11:23pm EDT by yongkang.you
I just confirmed that the -89 kernel doesn't have the startx issue!
Is the -89 kernel the one in snapshot4?
Status set to: Waiting on Tech
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.