Bug 502826

Summary: [RHEL-5 Xen]: F-11 Xen 64-bit domU cannot be started with > 2047MB of memory
Product: Red Hat Enterprise Linux 5
Reporter: Chris Lalancette <clalance>
Component: kernel-xen
Assignee: Chris Lalancette <clalance>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Docs Contact:
Priority: low
Version: 5.4
CC: ajia, drjones, dzickus, itamar, jeremy, kernel-maint, markmc, pasik, pbonzini, virt-maint, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When booting paravirtualized guests that support gigabyte page tables (i.e. a Fedora 11 guest) on Red Hat Enterprise Linux 5.4 Xen, the domain may fail to start if more than 2047MB of memory is configured for the domain. To work around this issue, pass the "nogbpages" parameter on the guest kernel command-line.
Story Points: ---
Clone Of: 499592
Environment:
Last Closed: 2010-03-30 07:40:45 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 499592    
Bug Blocks: 513501, 526946    
Attachments:
Description                                               Flags
Patch to mask more of the PV bits                         none
Replacement patch to mask more CPUID bits for PV guests   none

Description Chris Lalancette 2009-05-27 11:29:15 UTC
+++ This bug was initially created as a clone of Bug #499592 +++

Description of problem:
An F-11 64-bit Xen domU with up to 2047MB of memory boots just fine.  However, when the same domain is given 2048MB of memory or more, it crashes immediately on bootup, before any messages are printed to the console.

The only clue that I do have is from the serial console of the dom0 (RHEL 5.3 dom0):

tap tap-39-51712: 2 getting info
mapping kernel into physical memory
about to get started...
(XEN) Unhandled page fault in domain 39 on VCPU 0 (ec=0003)
(XEN) Pagetable walk from ffff880001002008:
(XEN)  L4[0x110] = 0000000242929067 0000000000001002
(XEN)  L3[0x000] = 00000002f40af067 0000000000001006
(XEN)  L2[0x008] = 0000000365b81067 00000000000023f8 
(XEN)  L1[0x002] = 8010000242929065 0000000000001002
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 39 (vcpu#0) crashed on cpu#7:
(XEN) ----[ Xen-3.1.2-141.el5bz479754rxcopy  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    7
(XEN) RIP:    e033:[<ffffffff8100ce42>]
(XEN) RFLAGS: 0000000000000296   CONTEXT: guest
(XEN) rax: 0000000000000000   rbx: ffff880001002008   rcx: 00000000fbfde700
(XEN) rdx: 0000000000000000   rsi: 80000003474da0e3   rdi: ffff880001002008
(XEN) rbp: ffffffff81613d68   rsp: ffffffff81613d00   r8:  8000000000000163
(XEN) r9:  0000000000000005   r10: 0000000000000005   r11: ffffffff8100c5d3
(XEN) r12: 80000003474da0e3   r13: 0000000000000000   r14: 0000000040000000
(XEN) r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000340d80000   cr2: ffff880001002008
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81613d00:
(XEN)    00000000fbfde700 ffffffff8100c5d3 0000000000000003 ffffffff8100ce42
(XEN)    000000010000e030 0000000000010096 ffffffff81613d48 000000000000e02b
(XEN)    ffffffff8100ce2f ffffffff8100c5d3 ffff880001002008 0000000000000000
(XEN)    0000000040000000 ffffffff81613de8 ffffffff8167455d ffffffff8150196c
(XEN)    ffff880001002000 0000000000000008 0000000000000008 0000000100000020
(XEN)    0000000080000000 ffffffff81613db8 0000000000000000 0000000581613dd8
(XEN)    ffffffff81001880 0000000000000000 0000000080000000 ffffffff8150196c
(XEN)    0000000080000000 ffffffff81613ed8 ffffffff81397945 ffffffffff40f000
(XEN)    0000000000000008 ffff880080000000 0000000100000000 ffff880080000000
(XEN)    ffffffff81613e28 0000000000000000 0000000080000000 0000000000000008
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffffffff81613ec8 0000000000000000 0000000000003000
(XEN)    ffffffff81613f50 ffffffffffffffff ffffffff823ed000 ffffffff81613f38
(XEN)    ffffffff8164548e ffffffff00000008 ffffffff81613f48 ffffffff81613f08
(XEN)    ffffffff81806450 ffffffff81613f28 0000000000000000 ffffffff81676020
(XEN)    0000000000003000 0000000000000000 ffffffffffffffff ffffffff81613f78
(XEN)    ffffffff8163da83 ffffffff81613f78 ffffffff81678ac0 00000000018c700c
(XEN)    0000000000003000 ffffffff81788000 0000000000000800 ffffffff81613f98
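
For readers following the log above, here is a minimal sketch of how the "Pagetable walk" indexes and the fault error code can be decoded by hand. It assumes standard x86-64 4-level paging; the helper and macro names are illustrative and are not taken from Xen or Linux.

```c
#include <stdint.h>

/* Index into the page table at a given level (4 = top level, 1 = PTE level).
 * Each level consumes 9 bits of the address above the 12-bit page offset. */
static unsigned int pt_index(uint64_t vaddr, int level)
{
    return (unsigned int)((vaddr >> (12 + 9 * (level - 1))) & 0x1ff);
}

/* Architectural x86 page-fault error code bits (names are illustrative). */
#define PF_EC_PRESENT 0x1u  /* the faulting address had a present mapping */
#define PF_EC_WRITE   0x2u  /* the faulting access was a write */

/* Bit 1 of a page-table entry is the read/write permission bit. */
static int entry_is_writable(uint64_t entry)
{
    return (entry & 0x2) != 0;
}
```

For the faulting address ffff880001002008 this gives indexes 0x110, 0x000, 0x008 and 0x002, matching the L4 through L1 lines in the log. ec=0003 has both bits set (a write to a present mapping), and the L1 entry 8010000242929065 has the read/write bit clear, which is consistent with a write fault on a read-only page.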

--- Additional comment from clalance on 2009-05-08 11:17:14 EDT ---

Justin, of the Fedora Xen domU bugs that I filed during yesterday's test day, this seems to be the most important.  Can you take a look and see what's going on here?

Chris Lalancette

--- Additional comment from jforbes on 2009-05-08 15:54:23 EDT ---

Can you tell me what kernel was used in the guest? And was this pv or fv?  With an up to date install using kernel 2.6.29.2-126.fc11.x86_64 I can boot a pv guest with 2047MB, 2048MB, and 4096MB just fine.

--- Additional comment from clalance on 2009-05-10 02:46:07 EDT ---

All of the bugs that I filed were for PV guests.  And the kernel was exactly the same version as you state above; 2.6.29.2-126.fc11.x86_64.  So, I guess we should focus on other things, since this might be a difference in hardware or dom0 software.  My dom0 is 5.4 preview, with Xen kernel 2.6.18-145.el5 and xen package xen-3.0.3-83.el5jd4.  My hardware is an AMD Barcelona machine with 2 quad-core CPUs and 16GB of memory.  Anything else?  I can add dmesg and/or dmidecode to the BZ if you think it will help, but this is 100% repeatable on this hardware.

Chris Lalancette

--- Additional comment from jeremy on 2009-05-11 13:33:34 EDT ---

What function does ffffffff8100ce42 map to in the kernel?

--- Additional comment from jeremy on 2009-05-11 14:00:16 EDT ---

Justin, how much physical memory does your system have?

I'm pretty sure I've tested with >2G memory guests on a 16G host, though I can easily test again with a recent kernel.

--- Additional comment from jforbes on 2009-05-11 14:13:17 EDT ---

My test system also has 16G of memory, but it is not NUMA; it is a single Intel Q6600.

--- Additional comment from jforbes on 2009-05-14 16:13:56 EDT ---

Also tested with an AMD numa system, and newer 5.4 kernel. It seems that this might be limited to the barcelona or similar machines.

--- Additional comment from jeremy on 2009-05-14 16:20:41 EDT ---

(In reply to comment #7)
> Also tested with an AMD numa system, and newer 5.4 kernel. It seems that this
> might be limited to the barcelona or similar machines.  

There's nothing very architecturally specific about this.  Failures above certain memory thresholds happen because either a pfn gets truncated or an mfn gets truncated, most commonly because they get turned into pointers then clipped at 32 bits by some vagaries of C's type promotion rules.
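
As a generic illustration of that class of bug (a sketch only, not code from the kernel or from Xen; the function names are hypothetical), a frame number held in a 32-bit type loses its high bits if it is shifted before being widened:

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Buggy: the shift is evaluated in 32 bits, so the result wraps at 4GB
 * before it is widened to the 64-bit return type. */
static uint64_t frame_to_addr_truncating(uint32_t pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Correct: widen first, then shift. */
static uint64_t frame_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}
```

With pfn 0x100000 (the frame at exactly 4GB) the first version returns 0, while the second returns 0x100000000.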

In a dom0 kernel, I guess there's the possibility that something is failing because of some DMA issue, and the memory size is just incidental to how things get laid out in memory and stomp on each other.  But this seems unlikely given that it happens early in boot.  And it's a domU.

Also, could someone map the failing rip, ffffffff8100ce42, to a function?

--- Additional comment from clalance on 2009-05-14 17:18:04 EDT ---

Jeremy,
     Oops, sorry about that, you had asked earlier and I forgot to respond.  In the 2.6.29.2-126.fc11 kernel, the address 0xffffffff8100ce42 maps to:

(gdb) list *(0xffffffff8100ce42)
0xffffffff8100ce42 is in xen_set_pte (arch/x86/xen/mmu.c:531).
526	#ifdef CONFIG_X86_PAE
527		ptep->pte_high = pte.pte_high;
528		smp_wmb();
529		ptep->pte_low = pte.pte_low;
530	#else
531		*ptep = pte;
532	#endif
533	}
534	
535	#ifdef CONFIG_X86_PAE

Let me know if you want more information about the kernel; I have the debug information handy.

Chris Lalancette

--- Additional comment from jeremy on 2009-05-14 17:55:13 EDT ---

What's the exact instruction (x/i)?

Looks like it's writing into a pte page, but Xen doesn't think it's pinned and so just faults the write.  Or the new pte is bogus.

Any chance you could dig back through the stack trace to see what the callers are?

Good candidates:
ffffffff8100ce2f
ffffffff8100c5d3
ffffffff8167455d
ffffffff8150196c
ffffffff8150196c
ffffffff81397945

--- Additional comment from clalance on 2009-05-15 03:33:04 EDT ---

Jeremy,
     The exact instruction is:

(gdb) x/i 0xffffffff8100ce42
0xffffffff8100ce42 <xen_set_pte+71>:	mov    %r12,(%rbx)

For context, a disassembly of the whole xen_set_pte() function is:

(gdb) disass xen_set_pte
Dump of assembler code for function xen_set_pte:
0xffffffff8100cdfb <xen_set_pte+0>:	push   %rbp
0xffffffff8100cdfc <xen_set_pte+1>:	mov    %rsp,%rbp
0xffffffff8100cdff <xen_set_pte+4>:	push   %r13
0xffffffff8100ce01 <xen_set_pte+6>:	push   %r12
0xffffffff8100ce03 <xen_set_pte+8>:	push   %rbx
0xffffffff8100ce04 <xen_set_pte+9>:	sub    $0x8,%rsp
0xffffffff8100ce08 <xen_set_pte+13>:	callq  0xffffffff81011000 <mcount>
0xffffffff8100ce0d <xen_set_pte+18>:	mov    %rdi,%rbx
0xffffffff8100ce10 <xen_set_pte+21>:	mov    %rsi,%r12
0xffffffff8100ce13 <xen_set_pte+24>:	callq  0xffffffff8100cdcd <check_zero>
0xffffffff8100ce18 <xen_set_pte+29>:	incl   0x7ad166(%rip)        # 0xffffffff817b9f84 <mmu_stats+36>
0xffffffff8100ce1e <xen_set_pte+35>:	callq  0xffffffff8100cdcd <check_zero>
0xffffffff8100ce23 <xen_set_pte+40>:	mov    0x7ad162(%rip),%r13d        # 0xffffffff817b9f8c <mmu_stats+44>
0xffffffff8100ce2a <xen_set_pte+47>:	callq  0xffffffff81029a9d <paravirt_get_lazy_mode>
0xffffffff8100ce2f <xen_set_pte+52>:	dec    %eax
0xffffffff8100ce31 <xen_set_pte+54>:	sete   %al
0xffffffff8100ce34 <xen_set_pte+57>:	movzbl %al,%eax
0xffffffff8100ce37 <xen_set_pte+60>:	lea    (%rax,%r13,1),%r13d
0xffffffff8100ce3b <xen_set_pte+64>:	mov    %r13d,0x7ad14a(%rip)        # 0xffffffff817b9f8c <mmu_stats+44>
0xffffffff8100ce42 <xen_set_pte+71>:	mov    %r12,(%rbx)
0xffffffff8100ce45 <xen_set_pte+74>:	pop    %r11
0xffffffff8100ce47 <xen_set_pte+76>:	pop    %rbx
0xffffffff8100ce48 <xen_set_pte+77>:	pop    %r12
0xffffffff8100ce4a <xen_set_pte+79>:	pop    %r13
0xffffffff8100ce4c <xen_set_pte+81>:	leaveq 
0xffffffff8100ce4d <xen_set_pte+82>:	retq   

Resolving some of those symbols to function names, it looks like:
0xffffffff8100ce42 is in xen_set_pte (arch/x86/xen/mmu.c:531)
0xffffffff8100c5d3 is in pte_pfn_to_mfn (arch/x86/xen/mmu.c:451)
0xffffffff8100ce2f is in xen_set_pte (arch/x86/xen/mmu.c:524)
0xffffffff8167455d is in __raw_spin_unlock (/usr/src/debug/kernel-2.6.29/linux-2.6.29.x86_64/arch/x86/include/asm/paravirt.h:1421)
0xffffffff81001880 is at arch/x86/kernel/head_64.S:267
0xffffffff81397945 is in phys_pud_update (arch/x86/mm/init_64.c:547)
0xffffffff8164548e is in setup_arch (arch/x86/kernel/setup.c:849)
0xffffffff8163da83 is in start_kernel (init/main.c:573)

I'm not exactly sure at the moment, but it seems like the most likely candidate is from the phys_pud_update() call.

Chris Lalancette

--- Additional comment from jeremy on 2009-05-15 13:20:55 EDT ---

r12 is 80000003474da0e3, which has _PAGE_PSE set.  We don't support PSE under Xen, but some code is ignoring the fact that cpu_has_pse is false and trying to create a large mapping anyway.  If it's in mm/init_64.c, then something has set PG_LEVEL_2M in page_size_mask, or something is forgetting to test before creating the mapping.  But I don't see anything obvious, or why it should only happen sometimes.  "phys_pud_update" would be the place to look, but it probably has almost everything else inlined into it.  What does line 547 correspond to in your sources?
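
A minimal sketch of how the r12 value decodes, assuming the architectural x86-64 page-table entry layout. The PTE_* macro names and helpers are illustrative; the kernel's own names are _PAGE_PSE and friends.

```c
#include <stdint.h>

/* Architectural x86-64 page-table entry bits (illustrative names). */
#define PTE_PRESENT   (1ULL << 0)
#define PTE_RW        (1ULL << 1)
#define PTE_ACCESSED  (1ULL << 5)
#define PTE_DIRTY     (1ULL << 6)
#define PTE_PSE       (1ULL << 7)   /* large page when set above the PTE level */
#define PTE_NX        (1ULL << 63)
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* True if the entry asks for a large-page mapping. */
static int pte_is_large(uint64_t pte)
{
    return (pte & PTE_PSE) != 0;
}

/* Frame address carried in bits 12..51 of the entry. */
static uint64_t pte_frame_addr(uint64_t pte)
{
    return pte & PTE_ADDR_MASK;
}
```

For 80000003474da0e3 the low byte 0xe3 means present, writable, accessed, dirty and PSE, bit 63 is NX, and the frame address is 0x3474da000. Bit 7 requests a 2MB page in a level-2 entry and a 1GB page in a level-3 entry.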

mm/numa_32.c has some code like this that I haven't got around to addressing, but I'm not aware of anything like this on 64 bit.  The report says that it only happens on a NUMA host machine, but the NUMA-ness of the host shouldn't be visible to a domU guest.

--- Additional comment from clalance on 2009-05-18 03:20:50 EDT ---

From gdb:

(gdb) list *(0xffffffff81397945)
0xffffffff81397945 is in phys_pud_update (arch/x86/mm/init_64.c:547).
542	{
543		pud_t *pud;
544	
545		pud = (pud_t *)pgd_page_vaddr(*pgd);
546	
547		return phys_pud_init(pud, addr, end, page_size_mask);
548	}
549	
550	static void __init find_early_table_space(unsigned long end, int use_pse,
551						  int use_gbpages)

Looking at the sources directly, we can only get to phys_pud_update() from kernel_physical_mapping_init(), and we can only get there from init_memory_mapping().  In that case, it looks like the page_size_mask is being passed in based on the "mr" array in init_memory_mapping().  So that must be getting messed up.  Indeed, if I put "earlyprintk=xen" on the guest boot-line, I see this:

[root@amd1 ~]# xm create -c f11pv_x86_64
Using config file "/etc/xen/f11pv_x86_64".
Started domain f11pv_x86_64
(early) Initializing cgroup subsys cpuset
(early) Initializing cgroup subsys cpu
(early) Linux version 2.6.29.2-126.fc11.x86_64 (mockbuild.phx.redhat.com) (gcc version 4.4.0 20090427 (Red Hat 4.4.0-3) (GCC) ) #1 SMP Mon May 4 04:46:15 EDT 2009
(early) Command line:  ro root=/dev/VolGroup00/LogVol00 console=hvc0 earlyprintk=xen
(early) KERNEL supported cpus:
(early)   Intel GenuineIntel
(early)   AMD AuthenticAMD
(early)   Centaur CentaurHauls
(early) ACPI in unprivileged domain disabled
(early) BIOS-provided physical RAM map:
(early)  Xen: 0000000000000000 - 00000000000a0000 (usable)
(early)  Xen: 00000000000a0000 - 0000000000100000 (reserved)
(early)  Xen: 0000000000100000 - 0000000001fea000 (usable)
(early)  Xen: 0000000001fea000 - 00000000023ed000 (reserved)
(early)  Xen: 00000000023ed000 - 0000000080000000 (usable)
(early) console [xenboot0] enabled
(early) DMI not present or invalid.
(early) last_pfn = 0x80000 max_arch_pfn = 0x100000000
(early) init_memory_mapping: 0000000000000000-0000000080000000
(early) Using GB pages for direct mapping

That last line is probably the problem.  Indeed, if I add "nogbpages" to the guest kernel command-line, it then boots just fine.  So the reason it shows up on my Barcelona, and not an earlier AMD, is that my Barcelona supports GB pages (cpu_has_gbpages).

Chris Lalancette

--- Additional comment from jeremy on 2009-05-18 03:30:05 EDT ---

Are GB pages a distinct feature flag in cpuid?  I wonder if it's another flag Xen should be masking out.  Either way, it's easy to mask in xen_cpuid().

--- Additional comment from clalance on 2009-05-18 08:40:17 EDT ---

Yeah, it looks like it:

/* AMD-defined CPU features, CPUID level 0x80000001, word 1 */
/* Don't duplicate feature flags which are redundant with Intel! */
#define X86_FEATURE_SYSCALL	(1*32+11) /* SYSCALL/SYSRET */
#define X86_FEATURE_MP		(1*32+19) /* MP Capable. */
#define X86_FEATURE_NX		(1*32+20) /* Execute Disable */
#define X86_FEATURE_MMXEXT	(1*32+22) /* AMD MMX extensions */
#define X86_FEATURE_FXSR_OPT	(1*32+25) /* FXSAVE/FXRSTOR optimizations */
#define X86_FEATURE_GBPAGES	(1*32+26) /* "pdpe1gb" GB pages */

And looking at the AMD CPUID spec (Rev 2.28, page 15) it does look like this is a unique flag.
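
To make the encoding in those defines concrete: a feature number is word*32 + bit, so X86_FEATURE_GBPAGES is bit 26 of word 1 (CPUID level 0x80000001, per the header comment). The sketch below shows the arithmetic and one way a register value could be masked; the helper names are hypothetical and this is not the attached patch.

```c
#include <stdint.h>

#define X86_FEATURE_GBPAGES (1*32 + 26)   /* as defined in the header above */

/* Which 32-bit capability word a feature number lives in. */
static unsigned int feature_word(unsigned int feature)
{
    return feature / 32;
}

/* Which bit within that word. */
static unsigned int feature_bit(unsigned int feature)
{
    return feature % 32;
}

/* Clear one feature from the 32-bit register value that backs its word. */
static uint32_t mask_feature(uint32_t reg, unsigned int feature)
{
    return reg & ~(UINT32_C(1) << feature_bit(feature));
}
```

Masking X86_FEATURE_GBPAGES out of a register value of 0xffffffff leaves 0xfbffffff, so a guest reading the masked value would not see the GB pages capability.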

I'm attaching a patch to xen_cpuid() that disables it, and seems to fix the problem for me.  Jeremy, do you want me to post the patch to xen-devel/lkml, or will you take it from here?

Chris Lalancette

--- Additional comment from clalance on 2009-05-18 08:40:49 EDT ---

Created an attachment (id=344429)
Disable GB pages for Xen guests

--- Additional comment from jeremy on 2009-05-18 14:31:51 EDT ---

I think setup_clear_cpu_cap(X86_FEATURE_GBPAGES) might be better (probably for the other features as well).  I'll put something together.

--- Additional comment from clalance on 2009-05-19 02:49:15 EDT ---

OK, thanks Jeremy!

Chris Lalancette

--- Additional comment from jforbes on 2009-05-22 11:32:21 EDT ---

The patch from Chris has been added to F-11 kernel 2.6.29.3-155 in time to meet release.  I will replace this with the patch from Jeremy when it is available, and leave this bug open until that is done.

--- Additional comment from jeremy on 2009-05-22 12:31:28 EDT ---

Yep, that's fine.

--- Additional comment from jeremy on 2009-05-26 18:13:01 EDT ---

Note that this has been fixed in Xen since March last year, with changeset 4b157affc08f.

--- Additional comment from clalance on 2009-05-27 02:55:05 EDT ---

Jeremy,
     Hm, what do you mean?  I thought this was a bug in the pv_ops kernel, and hence there wasn't a patch yet?  Or do you mean that there is a patch to the hypervisor to work around this issue?  Also, what repository does the above c/s come from?

Thanks,
Chris Lalancette

--- Additional comment from jeremy on 2009-05-27 03:12:26 EDT ---

It is at heart a Xen bug; the hypervisor shouldn't be exposing CPU capabilities that guests cannot use.  The kernel change is a workaround to mask things that Xen is not.

Xen fixed this issue in http://xenbits.xensource.com/xen-unstable.hg change 4b157affc08f.  You seem to be using a relatively old version of Xen, which does not have this fix.

--- Additional comment from clalance on 2009-05-27 07:27:46 EDT ---

Jeremy,

Ah, OK, thanks for the clarification.  I'll clone a copy of this bug for RHEL-5 then.

Chris Lalancette

Comment 1 Chris Lalancette 2009-05-27 11:30:34 UTC
A direct link to the relevant c/s is:

http://xenbits.xensource.com/xen-unstable.hg?rev/4b157affc08f

Chris Lalancette

Comment 2 Chris Lalancette 2009-05-27 14:39:07 UTC
Created attachment 345617 [details]
Patch to mask more of the PV bits

This is the patch that I'm currently testing to fix this issue.  Essentially it masks more of the CPUID bits from PV guests when the hypervisor knows they can't work.  Importantly, this will mask out the GBpages feature from newer kernels so they won't try to use it while running on Xen.  This is a backport of c/s 17238, plus cpufeatures.h defines from 15803, 16101, 16102, and 16117.

Chris Lalancette

Comment 3 Chris Lalancette 2009-05-27 14:42:32 UTC
In order to effectively test this patch, we first need to make sure we are on a Barcelona (or newer) AMD box that has the GB pages feature.  Then we need to run 3 tests:

1)  Make sure that existing PV guests have no problems booting up.  For our purposes, this mostly means that RHEL-4 and RHEL-5 PV guests, both x86_64 and i386, boot up and run basic tests fine.

2)  Make sure that newer guests (such as F-11) can boot up with > 2047MB of memory.

3)  Start a guest on a RHEL-5.3 box.  Save the guest (xm save <guest> <file>).  Now, reboot the dom0 into the new kernel with this patch in place.  Now restore the guest, and make sure it operates properly.

Chris Lalancette

Comment 4 Chris Lalancette 2009-07-08 10:39:29 UTC
Created attachment 350912 [details]
Replacement patch to mask more CPUID bits for PV guests

This is a slightly updated patch to mask more CPUID bits.  It has one bugfix (make sure to mask out SYSCALL for all 32-bit PV guests), plus it drops a few #defines that proved not to be necessary.

Comment 6 Chris Lalancette 2009-07-08 13:14:36 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
When booting paravirtualized guests that support gigabyte page tables (such as a Fedora 11 guest) on RHEL-5.4 Xen, the domain may fail to start if more than 2047MB of memory is configured for the domain.  To work around this issue, pass "nogbpages" on the guest kernel command-line.  This limitation will be addressed in a future version of RHEL-5 Xen.

Comment 9 Ryan Lerch 2009-08-18 03:35:24 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-When booting paravirtualized guests that support gigabyte page tables (such as a Fedora 11 guest) on RHEL-5.4 Xen, the domain may fail to start if more than 2047MB of memory is configured for the domain.  To work around this issue, pass "nogbpages" on the guest kernel command-line.  This limitation will be addressed in a future version of RHEL-5 Xen.
+When booting paravirtualized guests that support gigabyte page tables (i.e. a Fedora 11 guest) on Red Hat Enterprise Linux 5.4 Xen, the domain may fail to start if more than 2047MB of memory is configured for the domain. To work around this issue, pass the "nogbpages" parameter on the guest kernel command-line.

Comment 10 Chris Lalancette 2009-08-25 09:59:52 UTC
I've uploaded a test kernel that should have a fix for this problem here:

http://people.redhat.com/clalance/virttest/

Can the reporters who are having problems please download and try out this test kernel?

Thanks,
Chris Lalancette

Comment 13 Don Zickus 2009-10-21 19:11:45 UTC
in kernel-2.6.18-170.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 15 Chris Lalancette 2009-11-06 17:56:35 UTC
*** Bug 524719 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2010-03-30 07:40:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html