Description of problem:
System reboots when entering runlevel 5

Version-Release number of selected component (if applicable):
System is fresh install from 20080306.0 RHEL5.2 server tree

How reproducible:
Every time

Steps to Reproduce:
1. Install system with 20080306.0 5.2 beta x86_64 including Virt support
2. Allow system to start normally
3. System boots with xen kernel and reboots when X tries to start.

Actual results:
Reboot loop.

Expected results:
Stable system.

Additional info:
Last line in /var/log/messages is
kernel: [drm] Initialized i915 1.8.0 20060920 on minor 0
Created attachment 297993 [details] sosreport from affected machine
Possible dup of 435130.
Would it be possible for you to capture serial console from the host as it crashes? (Let me know if you need help setting that up on Xen.) And can you verify that the non-xen kernel boots successfully? Thanks!
The non-Xen kernel works just fine. I need to set up another system to capture the serial console; once that's ready I'll paste the data here. (This is a production DQ35JO system with no serial ports, but I have several preproduction versions of this system, the weybridge platform, that have rs232 on board.)
Created attachment 298069 [details] serial console output of panic I'm attaching the serial console output of a machine that reboots when X is started on the xen kernel.
Also, did 5.1 boot correctly on this hardware?
<gcase> sct, these boxes were certified on RHEL5.1, so I know that xen kernel had to work in order to run our tests
Just to narrow it down, does the 5.2-beta userland with 5.1 kernel-xen boot correctly? The oops shows

  Kernel BUG at include/asm/mach-xen/asm/maddr.h:24
  invalid opcode: 0000 [1] SMP
  Pid: 7329, comm: Xorg Not tainted 2.6.18-84.el5xen #1
  Call Trace:
   [<ffffffff802602f1>] tracesys+0xa7/0xb2
  ...
  RIP [<ffffffff802214ea>] sys_mprotect+0x937/0xb80

and we did touch mprotect code in kernel-xen for 5.2, so it is quite possible that that's where the regression might be, but it would be useful to confirm it's definitely kernel-xen and not (say) a change in the X drivers.
I just --force --nodeps installed the kernel-xen from 5.1, version 53.1.14. It works fine.
Hmm: your oops announces

  Kernel 2.6.18-84.el5xen on an x86_64

but we didn't add the big Xen mprotect change until -85.el5. So it's not that...

> I just --force --nodeps installed the kernel-xen from 5.1, version 53.1.14. It
> works fine.

OK, so definitely looks like the regression is in kernel-xen. Thanks!
I have isolated the similar issue in 435130: it starts with the -65 kernel. The -64 HV and kernel work fine, and -65 fails. I tried the -65 HV with the -64 kernel and all was well.
OK, this definitely looks like a regression from the PV migration fix. It doesn't look straightforward to fix, though.

Basically, mprotect(PROT_NONE) is faked on x86 hardware, since the MMU does not have the granularity to mark present pages as being unreadable. So for such regions, the kernel installs ptes which are not-present as far as the hardware is concerned (_PAGE_PRESENT is clear), but which otherwise look present to the kernel (a separate bit, _PAGE_PROTNONE, is set to let pte_present() detect that the pte is still pointing to a real physical page.)

However, the hypervisor has no knowledge of _PAGE_PROTNONE. So when we clear the physical _PAGE_PRESENT bit, the hypervisor no longer expects a physical pte, pointing to a true machine mfn, in the pte. So on guest migration, the hypervisor will not automatically translate the pte to point to the correct mfn on the new host.

So in 5.2, we added a fix from upstream Xen to "canonicalise" these PROTNONE pages, turning them from mfn references to guest-relative pseudophysical pfns. When the PROTNONE gets cleared, we restore them to mfns; if they get migrated while still containing pfns, then on the new host the correct translation will get applied at the time they are converted back to mfns.

This all works fine, UNLESS the mfns point to memory that does not have a valid pfn translation at all, such as some ioremap()ed hardware device memory.

So it looks as if the 5.2 X server here is doing an mprotect(PROT_NONE) on such ioremap()ed memory, which translates the mfn to pfn via

  static inline unsigned long mfn_to_pfn(unsigned long mfn)
  {
  ...
          if (unlikely((mfn >> machine_to_phys_order) != 0))
                  return end_pfn;

which returns end_pfn; and then when we later mprotect(PROT_READ) again, we hit the reverse translation in pfn_to_mfn() ---

  static inline unsigned long pfn_to_mfn(unsigned long pfn)
  {
  ...
          BUG_ON(end_pfn && pfn >= end_pfn);

and hit this BUG_ON().
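To make the failing round trip concrete, here is a small user-space sketch. It is a toy model and not kernel code: the table contents, the sizes, and every toy_* name are invented for illustration, and the in-range lookup is a plain linear search rather than the real machine-to-phys table. It only mirrors the two branches quoted above: an out-of-range mfn collapses to end_pfn, and end_pfn is then rejected on the way back.

```c
#include <stdbool.h>

/* Toy model only: all sizes and table contents are made up. */
#define TOY_END_PFN   4UL   /* the toy guest owns pfns 0..3 */
#define TOY_M2P_ORDER 4     /* toy machine-to-phys range covers mfns 0..15 */

/* Toy phys-to-machine table: toy_p2m[pfn] is the mfn backing that pfn. */
static const unsigned long toy_p2m[TOY_END_PFN] = { 9, 3, 12, 7 };

/* Mirrors the quoted branch: an mfn outside the table maps to end_pfn. */
static unsigned long toy_mfn_to_pfn(unsigned long mfn)
{
    if ((mfn >> TOY_M2P_ORDER) != 0)
        return TOY_END_PFN;
    for (unsigned long pfn = 0; pfn < TOY_END_PFN; pfn++)
        if (toy_p2m[pfn] == mfn)
            return pfn;
    return TOY_END_PFN;     /* in range, but not owned by this guest */
}

/* Returns false where the real pfn_to_mfn() would hit its BUG_ON(). */
static bool toy_pfn_to_mfn(unsigned long pfn, unsigned long *mfn)
{
    if (pfn >= TOY_END_PFN)
        return false;
    *mfn = toy_p2m[pfn];
    return true;
}

/* mprotect(PROT_NONE) followed by mprotect(PROT_READ): mfn -> pfn -> mfn. */
static bool toy_protnone_round_trip(unsigned long mfn, unsigned long *out)
{
    return toy_pfn_to_mfn(toy_mfn_to_pfn(mfn), out);
}
```

In this model, a RAM-backed mfn such as 12 survives the round trip, while a device-style mfn such as 0x100 comes back as a failure, which is the point where the real kernel BUGs.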
A fix is likely to involve some form of special-casing of these pfn/mfn translations for the case where the mfn is not within the normal translatable page range of the kernel.
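As an illustration of what such special-casing could look like, here is one hypothetical shape for it in the same toy, user-space style. This is not the attached patch and not the upstream change: the TOY_RAW_MFN tag bit, the table, and all toy_* names are assumptions made up for the sketch. The idea shown is simply to translate only frames that have a valid pfn, and to carry any other mfn through unchanged with a marker so the reverse step knows to skip translation.

```c
#include <stdint.h>

/* Toy model only: sizes, table contents and the tag bit are invented. */
#define TOY_END_PFN   4u
#define TOY_M2P_ORDER 4
#define TOY_RAW_MFN   (UINT64_C(1) << 62)  /* hypothetical "not translated" tag */

/* Toy phys-to-machine table: toy_p2m[pfn] is the mfn backing that pfn. */
static const uint64_t toy_p2m[TOY_END_PFN] = { 9, 3, 12, 7 };

/*
 * On PROT_NONE: store a pfn for guest RAM; for any frame without a
 * valid translation (e.g. device memory), keep the mfn and tag it.
 */
static uint64_t toy_canonicalise(uint64_t mfn)
{
    if ((mfn >> TOY_M2P_ORDER) == 0) {
        for (unsigned pfn = 0; pfn < TOY_END_PFN; pfn++)
            if (toy_p2m[pfn] == mfn)
                return pfn;
    }
    return mfn | TOY_RAW_MFN;
}

/*
 * When PROT_NONE is cleared: tagged frames are returned as-is (minus
 * the tag); untagged frames are pfns and go through the table.
 * Precondition: 'frame' was produced by toy_canonicalise().
 */
static uint64_t toy_restore(uint64_t frame)
{
    if (frame & TOY_RAW_MFN)
        return frame & ~TOY_RAW_MFN;
    return toy_p2m[frame];
}
```

With this arrangement, both a RAM mfn (12 in the toy table) and an untranslatable mfn (say 0x100) come back unchanged after a canonicalise/restore pair, and only the RAM frame is ever stored as a pfn.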
Created attachment 299460 [details] Fix xen mprotect(PROT_NONE) handling on ioremap()ed memory Proposed patch, based on upstream fix.
I can confirm that it works for me on the system that made me file 435130 - Bill
It works for me as well on my weybridge qual box. The -86 kernel causes a panic and reboot at startx; the test kernel works as expected (no panic/crash).
Setting flags
in kernel-2.6.18-89.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
I tried the test kernel and it worked for me on my weybridge qual. Yongkang, can you try it out on your systems to verify that it works for you as well?

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by gcase
issue 173507
*** Bug 435130 has been marked as a duplicate of this bug. ***
Hi all,

In Issue Tracker: #173507, I have verified the new -89.el5 kernel-xen has fixed this issue.

----
Event posted 04-09-2008 11:23pm EDT by yongkang.you

I just confirmed 89 kernel doesn't have startx issue! Does 89 kernel snapshot4?

Status set to: Waiting on Tech
----
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html