Bug 437412

Summary: System reboots when X started on dom0
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.2
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: rc
Keywords: Regression
Reporter: Gary Case <gcase>
Assignee: Stephen Tweedie <sct>
QA Contact: Martin Jenner <mjenner>
CC: bburns, dzickus, edwin.zhai, fred.yang, grgustaf, keve.a.gabbert, xen-maint, yongkang.you
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 15:11:53 UTC
Bug Blocks: 435130
Attachments (description, flags):
  sosreport from affected machine (flags: none)
  serial console output of panic (flags: none)
  Fix xen mprotect(PROT_NONE) handling on ioremap()ed memory (flags: none)

Description Gary Case 2008-03-13 21:56:47 UTC
Description of problem:
System reboots when entering runlevel 5

Version-Release number of selected component (if applicable):
System is fresh install from 20080306.0 RHEL5.2 server tree

How reproducible:

Every time

Steps to Reproduce:
1. Install system with 20080306.0 5.2 beta x86_64 including Virt support
2. Allow system to start normally
3. System boots with xen kernel and reboots when X tries to start.
  
Actual results:

Reboot loop.

Expected results:

Stable system.

Additional info:

Last line in /var/log/messages is kernel: [drm] Initialized i915 1.8.0 20060920
on minor 0

Comment 1 Gary Case 2008-03-13 21:56:47 UTC
Created attachment 297993 [details]
sosreport from affected machine

Comment 2 Bill Burns 2008-03-14 10:25:24 UTC
Possible dup of 435130.

Comment 3 Stephen Tweedie 2008-03-14 14:52:32 UTC
Would it be possible for you to capture serial console from the host as it
crashes?  (Let me know if you need help setting that up on Xen.)  

And can you verify that the non-xen kernel boots successfully?

Thanks!

Comment 4 Gary Case 2008-03-14 15:09:58 UTC
The non-Xen kernel works just fine. I need to set up another system to do serial
console and once that's ready I'll paste the data into here. (This is a
production DQ35JO system with no serial ports, but I have several preproduction
versions of this system (weybridge platform) that have rs232 on board)

Comment 5 Gary Case 2008-03-14 15:45:26 UTC
Created attachment 298069 [details]
serial console output of panic

I'm attaching the serial console output of a machine that reboots when X is
started on the xen kernel.

Comment 6 Stephen Tweedie 2008-03-14 15:55:32 UTC
Also, did 5.1 boot correctly on this hardware?

Comment 7 Stephen Tweedie 2008-03-14 16:00:48 UTC
<gcase> sct, these boxes were certified on RHEL5.1, so I know that xen kernel
had to work in order to run our tests

Comment 8 Stephen Tweedie 2008-03-14 16:06:16 UTC
Just to narrow it down, does the 5.2-beta userland with 5.1 kernel-xen boot
correctly?

The oops shows

Kernel BUG at include/asm/mach-xen/asm/maddr.h:24
invalid opcode: 0000 [1] SMP 
Pid: 7329, comm: Xorg Not tainted 2.6.18-84.el5xen #1
Call Trace:
 [<ffffffff802602f1>] tracesys+0xa7/0xb2
...
RIP  [<ffffffff802214ea>] sys_mprotect+0x937/0xb80

and we did touch mprotect code in kernel-xen for 5.2, so the regression may well
be there, but it would be useful to confirm that it's definitely kernel-xen and
not (say) a change in the X drivers.

Comment 9 Gary Case 2008-03-14 16:21:54 UTC
I just --force --nodeps installed the kernel-xen from 5.1, version 53.1.14. It
works fine.

Comment 10 Stephen Tweedie 2008-03-14 16:41:54 UTC
Hmm: your oops announces

    Kernel 2.6.18-84.el5xen on an x86_64

but we didn't add the big Xen mprotect change until -85.el5.  So it's not that...

> I just --force --nodeps installed the kernel-xen from 5.1, version 53.1.14. It
> works fine.

OK, so definitely looks like the regression is in kernel-xen.  Thanks!


Comment 12 Bill Burns 2008-03-16 10:53:19 UTC
I have isolated the similar issue in 435130: it starts with the -65 kernel. The
-64 HV and kernel work fine and -65 fails. I tried the -65 HV with the -64
kernel and all was well.

Comment 13 Stephen Tweedie 2008-03-17 20:02:28 UTC
OK, this definitely looks like a regression from the PV migration fix.  It
doesn't look straightforward to fix, though.

Basically, mprotect(PROT_NONE) is faked on x86 hardware, since the MMU does not
have the granularity to mark present pages as being unreadable.  So for such
regions, the kernel installs ptes which are not-present as far as the hardware
is concerned (_PAGE_PRESENT is clear), but which otherwise look present to the
kernel (a separate bit, _PAGE_PROTNONE, is set to let pte_present() detect that
the pte is still pointing to a real physical page.)
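As a rough illustration of the two views of the same pte described above, here is a minimal user-space sketch in C. The `sketch_` names and the bit values are assumptions made for the example; the kernel's own pgtable headers are the authority on the real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; see the kernel's pgtable headers for the real ones. */
#define SKETCH_PAGE_PRESENT  0x001ULL
#define SKETCH_PAGE_PROTNONE 0x080ULL

typedef uint64_t sketch_pte_t;

/* What the MMU (and the hypervisor) checks: only the hardware present bit. */
static bool sketch_hw_present(sketch_pte_t pte)
{
	return (pte & SKETCH_PAGE_PRESENT) != 0;
}

/* What the kernel's pte_present() checks: either bit counts as present. */
static bool sketch_pte_present(sketch_pte_t pte)
{
	return (pte & (SKETCH_PAGE_PRESENT | SKETCH_PAGE_PROTNONE)) != 0;
}

/* mprotect(PROT_NONE): hide the page from hardware but keep the frame number. */
static sketch_pte_t sketch_make_protnone(sketch_pte_t pte)
{
	assert(sketch_pte_present(pte)); /* only meaningful for a mapped page */
	return (pte & ~SKETCH_PAGE_PRESENT) | SKETCH_PAGE_PROTNONE;
}
```

After `sketch_make_protnone()`, the hardware check fails while the kernel check still succeeds, and the frame number in the upper bits is untouched.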

However, the hypervisor has no knowledge of _PAGE_PROTNONE.  So when we clear
the physical _PAGE_PRESENT bit, the hypervisor no longer treats the entry as a
real pte pointing to a true machine mfn.  So on guest migration, the
hypervisor will not automatically translate the pte to point to the correct mfn
on the new host.

So in 5.2, we added a fix from upstream Xen to "canonicalise" these PROTNONE
pages, turning them from mfn references to guest-relative pseudophysical pfns. 
When the PROTNONE gets cleared, we restore them to mfns; if they get migrated
while still containing pfns, then on the new host the correct translation will
get applied at the time they are converted back to mfns.
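The canonicalisation idea can be sketched with a toy pfn-to-mfn table. Everything here (the `toy_` names, the four-entry table, the frame numbers) is hypothetical and only models the behaviour described above, not the real Xen translation code.

```c
#include <assert.h>

#define TOY_NR_PFNS 4UL

/* Hypothetical guest table: index is the pfn, value is the mfn on this host. */
static unsigned long toy_p2m[TOY_NR_PFNS] = { 100, 101, 102, 103 };

static unsigned long toy_pfn_to_mfn(unsigned long pfn)
{
	assert(pfn < TOY_NR_PFNS);
	return toy_p2m[pfn];
}

/* Reverse lookup; returns TOY_NR_PFNS when the mfn has no pfn translation. */
static unsigned long toy_mfn_to_pfn(unsigned long mfn)
{
	for (unsigned long pfn = 0; pfn < TOY_NR_PFNS; pfn++)
		if (toy_p2m[pfn] == mfn)
			return pfn;
	return TOY_NR_PFNS;
}

/* Model a migration: the same pfns are now backed by different mfns. */
static void toy_migrate(unsigned long new_base)
{
	for (unsigned long pfn = 0; pfn < TOY_NR_PFNS; pfn++)
		toy_p2m[pfn] = new_base + pfn;
}
```

A PROTNONE pte that stored the pfn (2) rather than the mfn (102) survives `toy_migrate()`: converting it back afterwards yields the new host's mfn, whereas a stored mfn would have gone stale.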

This all works fine, UNLESS the mfns point to memory that does not have a valid
pfn translation at all, such as some ioremap()ed hardware device memory.  So it
looks as if the 5.2 X server here is doing an mprotect(PROT_NONE) on such
ioremap()ed memory, which translates the mfn to pfn via

static inline unsigned long mfn_to_pfn(unsigned long mfn)
{
...
	if (unlikely((mfn >> machine_to_phys_order) != 0))
		return end_pfn;

which returns end_pfn; and then when we later mprotect(PROT_READ) again, we hit
the reverse translation in pfn_to_mfn() ---

static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
...
	BUG_ON(end_pfn && pfn >= end_pfn);

and hit this BUG_ON().
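To make the failing round trip concrete, here is a small user-space model of the two checks quoted above. The `demo_` names, the constants, and the identity mapping for in-range frames are assumptions for illustration; `assert()` stands in for the kernel's BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sizes: 1024 guest pages, m2p table covering mfns below 1 << 10. */
static const unsigned long demo_end_pfn = 1024;
static const unsigned int demo_m2p_order = 10;

/* Mirrors the quoted check: an mfn outside the m2p range maps to end_pfn. */
static unsigned long demo_mfn_to_pfn(unsigned long mfn)
{
	if ((mfn >> demo_m2p_order) != 0)
		return demo_end_pfn;
	return mfn; /* toy identity mapping for in-range frames */
}

/* The condition that the quoted BUG_ON() tests. */
static bool demo_would_bug(unsigned long pfn)
{
	return demo_end_pfn && pfn >= demo_end_pfn;
}

static unsigned long demo_pfn_to_mfn(unsigned long pfn)
{
	assert(!demo_would_bug(pfn)); /* stands in for BUG_ON() */
	return pfn; /* toy identity mapping */
}
```

An in-range frame round-trips cleanly, but a device frame such as the made-up value 0xfd000 comes back from `demo_mfn_to_pfn()` as `demo_end_pfn`, which is exactly the value the reverse translation refuses.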

A fix is likely to involve some form of special-casing of these pfn/mfn
translations for the case where the mfn is not within the normal translatable
page range of the kernel.
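One possible shape for such special-casing is sketched below. This is not taken from the attached patch or from upstream; the `fix_` names, the explicit `is_pfn` flag, and the identity mapping are all assumptions chosen to keep the example small. The point is only that a frame with no pfn translation is left as an mfn, so the reverse translation is never asked to convert it.

```c
#include <assert.h>
#include <stdbool.h>

static const unsigned long fix_end_pfn = 1024; /* hypothetical guest size */

/* A frame number plus a note of which address space it is in (a simplification). */
struct fix_frame {
	unsigned long nr;
	bool is_pfn;
};

static unsigned long fix_mfn_to_pfn(unsigned long mfn)
{
	return mfn < fix_end_pfn ? mfn : fix_end_pfn; /* toy identity mapping */
}

/* On PROT_NONE: canonicalise only frames that have a valid pfn translation. */
static struct fix_frame fix_canonicalise(unsigned long mfn)
{
	unsigned long pfn = fix_mfn_to_pfn(mfn);
	struct fix_frame f;

	if (pfn >= fix_end_pfn) {
		f.nr = mfn;      /* e.g. ioremap()ed device memory: keep the mfn */
		f.is_pfn = false;
	} else {
		f.nr = pfn;
		f.is_pfn = true;
	}
	return f;
}

/* When PROT_NONE is cleared: translate back only what was translated. */
static unsigned long fix_restore(struct fix_frame f)
{
	if (!f.is_pfn)
		return f.nr;
	assert(f.nr < fix_end_pfn); /* the BUG_ON() condition can no longer trip */
	return f.nr; /* toy identity mapping */
}
```

A real pte would have to encode the "still an mfn" case some other way than a separate flag; how the actual fix does that is not stated in this report.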


Comment 14 Stephen Tweedie 2008-03-28 13:08:48 UTC
Created attachment 299460 [details]
Fix xen mprotect(PROT_NONE) handling on ioremap()ed memory

Proposed patch, based on upstream fix.

Comment 16 Bill Burns 2008-03-28 16:36:55 UTC
I can confirm that it works for me on the system that made me file 435130 - Bill


Comment 17 Gary Case 2008-03-28 16:39:42 UTC
It works for me as well on my weybridge qual box. The -86 kernel causes a panic
and reboot at startx; the test kernel works as expected (no panic/crash).

Comment 18 Bill Burns 2008-03-28 17:29:24 UTC
Setting flags


Comment 22 Don Zickus 2008-04-09 18:44:27 UTC
in kernel-2.6.18-89.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 23 Issue Tracker 2008-04-09 20:55:57 UTC
I tried the test kernel and it worked for me on my weybridge qual.
Yongkang, can you try it out on your systems to verify that it works for
you as well?

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by gcase 
 issue 173507

Comment 25 Bill Burns 2008-04-17 20:05:32 UTC
*** Bug 435130 has been marked as a duplicate of this bug. ***

Comment 26 You, Yongkang 2008-04-18 03:48:09 UTC
Hi all,

In Issue Tracker: #173507, I have verified the new -89.el5 kernel-xen has fixed
this issue. 

----
Event posted 04-09-2008 11:23pm EDT by yongkang.you 	 
I just confirmed 89 kernel doesn't have startx issue!
Does 89 kernel snapshot4?

Status set to: Waiting on Tech
----

Comment 30 errata-xmlrpc 2008-05-21 15:11:53 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html