Bug 499592

Summary: F-11 Xen 64-bit domU cannot be started with > 2047MB of memory
Product: Fedora
Component: kernel
Version: 11
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: low
Reporter: Chris Lalancette <clalance>
Assignee: Justin M. Forbes <jforbes>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: itamar, jeremy, kernel-maint, markmc, virt-maint, xen-maint
Doc Type: Bug Fix
Clones: 502826, 523124
Bug Blocks: 480594, 502826, 523124
Last Closed: 2009-09-25 14:46:24 UTC
Attachments: Disable GB pages for Xen guests

Description Chris Lalancette 2009-05-07 11:13:40 UTC
Description of problem:
When trying to boot an F-11 64-bit Xen domU with up to 2047MB of memory, it boots just fine.  However, when trying to boot this same domain with 2048MB of memory or more, it crashes immediately on bootup, before any messages are printed to the console.
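
For reference, a minimal PV guest config along these lines reproduces it (illustrative only; the disk path and exact option set are hypothetical, the name and memory size are from the comments below):

# /etc/xen/f11pv_x86_64 (illustrative)
name       = "f11pv_x86_64"
memory     = 2048          # 2047 boots fine; 2048 or more crashes
vcpus      = 1
bootloader = "/usr/bin/pygrub"
disk       = ['tap:aio:/var/lib/xen/images/f11pv.img,xvda,w']
extra      = "console=hvc0"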

The only clue I have is from the serial console of the dom0 (a RHEL 5.3 dom0):

tap tap-39-51712: 2 getting info
mapping kernel into physical memory
about to get started...
(XEN) Unhandled page fault in domain 39 on VCPU 0 (ec=0003)
(XEN) Pagetable walk from ffff880001002008:
(XEN)  L4[0x110] = 0000000242929067 0000000000001002
(XEN)  L3[0x000] = 00000002f40af067 0000000000001006
(XEN)  L2[0x008] = 0000000365b81067 00000000000023f8 
(XEN)  L1[0x002] = 8010000242929065 0000000000001002
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 39 (vcpu#0) crashed on cpu#7:
(XEN) ----[ Xen-3.1.2-141.el5bz479754rxcopy  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    7
(XEN) RIP:    e033:[<ffffffff8100ce42>]
(XEN) RFLAGS: 0000000000000296   CONTEXT: guest
(XEN) rax: 0000000000000000   rbx: ffff880001002008   rcx: 00000000fbfde700
(XEN) rdx: 0000000000000000   rsi: 80000003474da0e3   rdi: ffff880001002008
(XEN) rbp: ffffffff81613d68   rsp: ffffffff81613d00   r8:  8000000000000163
(XEN) r9:  0000000000000005   r10: 0000000000000005   r11: ffffffff8100c5d3
(XEN) r12: 80000003474da0e3   r13: 0000000000000000   r14: 0000000040000000
(XEN) r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000340d80000   cr2: ffff880001002008
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81613d00:
(XEN)    00000000fbfde700 ffffffff8100c5d3 0000000000000003 ffffffff8100ce42
(XEN)    000000010000e030 0000000000010096 ffffffff81613d48 000000000000e02b
(XEN)    ffffffff8100ce2f ffffffff8100c5d3 ffff880001002008 0000000000000000
(XEN)    0000000040000000 ffffffff81613de8 ffffffff8167455d ffffffff8150196c
(XEN)    ffff880001002000 0000000000000008 0000000000000008 0000000100000020
(XEN)    0000000080000000 ffffffff81613db8 0000000000000000 0000000581613dd8
(XEN)    ffffffff81001880 0000000000000000 0000000080000000 ffffffff8150196c
(XEN)    0000000080000000 ffffffff81613ed8 ffffffff81397945 ffffffffff40f000
(XEN)    0000000000000008 ffff880080000000 0000000100000000 ffff880080000000
(XEN)    ffffffff81613e28 0000000000000000 0000000080000000 0000000000000008
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffffffff81613ec8 0000000000000000 0000000000003000
(XEN)    ffffffff81613f50 ffffffffffffffff ffffffff823ed000 ffffffff81613f38
(XEN)    ffffffff8164548e ffffffff00000008 ffffffff81613f48 ffffffff81613f08
(XEN)    ffffffff81806450 ffffffff81613f28 0000000000000000 ffffffff81676020
(XEN)    0000000000003000 0000000000000000 ffffffffffffffff ffffffff81613f78
(XEN)    ffffffff8163da83 ffffffff81613f78 ffffffff81678ac0 00000000018c700c
(XEN)    0000000000003000 ffffffff81788000 0000000000000800 ffffffff81613f98

Comment 1 Chris Lalancette 2009-05-08 15:17:14 UTC
Justin, of the Fedora Xen domU bugs that I filed during yesterday's test day, this seems to be the most important.  Can you take a look and see what's going on here?

Chris Lalancette

Comment 2 Justin M. Forbes 2009-05-08 19:54:23 UTC
Can you tell me what kernel was used in the guest?  And was this PV or FV?  With an up-to-date install using kernel 2.6.29.2-126.fc11.x86_64 I can boot a PV guest with 2047MB, 2048MB, and 4096MB just fine.

Comment 3 Chris Lalancette 2009-05-10 06:46:07 UTC
All of the bugs that I filed were for PV guests.  And the kernel was exactly the same version as you state above: 2.6.29.2-126.fc11.x86_64.  So I guess we should focus on other things, since this might be a difference in hardware or dom0 software.  My dom0 is 5.4 preview, with Xen kernel 2.6.18-145.el5 and xen package xen-3.0.3-83.el5jd4.  My hardware is an AMD Barcelona machine with 2 quad-core CPUs and 16GB of memory.  Anything else?  I can add dmesg and/or dmidecode output to the BZ if you think it will help, but this is 100% repeatable on this hardware.

Chris Lalancette

Comment 4 Jeremy Fitzhardinge 2009-05-11 17:33:34 UTC
What function does ffffffff8100ce42 map to in the kernel?

Comment 5 Jeremy Fitzhardinge 2009-05-11 18:00:16 UTC
Justin, how much physical memory does your system have?

I'm pretty sure I've tested with >2G memory guests on a 16G host, though I can easily test again with a recent kernel.

Comment 6 Justin M. Forbes 2009-05-11 18:13:17 UTC
My test system also has 16G of memory, but it is not NUMA; it is a single Intel Q6600.

Comment 7 Justin M. Forbes 2009-05-14 20:13:56 UTC
Also tested with an AMD NUMA system and a newer 5.4 kernel. It seems that this might be limited to Barcelona or similar machines.

Comment 8 Jeremy Fitzhardinge 2009-05-14 20:20:41 UTC
(In reply to comment #7)
> Also tested with an AMD NUMA system and a newer 5.4 kernel. It seems that
> this might be limited to Barcelona or similar machines.

There's nothing very architecturally specific about this.  Failures above certain memory thresholds happen because either a pfn or an mfn gets truncated, most commonly because they get turned into pointers and then clipped to 32 bits by some vagaries of C's type promotion rules.
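
As a hypothetical illustration of this class of bug (standalone demo code, not code from this kernel), a frame number that needs more than 32 bits survives in an unsigned long but is silently clipped once it passes through a 32-bit intermediate:

#include <stdio.h>

int main(void)
{
	/* A frame number that needs more than 32 bits. */
	unsigned long mfn = 0x3474da0e3UL;
	/* Implicit conversion silently keeps only the low 32 bits. */
	unsigned int clipped = mfn;

	printf("mfn     = %#lx\n", mfn);      /* 0x3474da0e3 */
	printf("clipped = %#x\n", clipped);   /* 0x474da0e3 */
	return 0;
}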

In a dom0 kernel, I guess there's the possibility that something is failing because of some DMA issue, and the memory size is just incidental to how things get laid out in memory and stomp on each other.  But this seems unlikely given that it happens early in boot.  And this is a domU.

Also, could someone map the failing rip, ffffffff8100ce42, to a function?

Comment 9 Chris Lalancette 2009-05-14 21:18:04 UTC
Jeremy,
     Oops, sorry about that, you had asked earlier and I forgot to respond.  In the 2.6.29.2-126.fc11 kernel, the address 0xffffffff8100ce42 maps to:

(gdb) list *(0xffffffff8100ce42)
0xffffffff8100ce42 is in xen_set_pte (arch/x86/xen/mmu.c:531).
526	#ifdef CONFIG_X86_PAE
527		ptep->pte_high = pte.pte_high;
528		smp_wmb();
529		ptep->pte_low = pte.pte_low;
530	#else
531		*ptep = pte;
532	#endif
533	}
534	
535	#ifdef CONFIG_X86_PAE

Let me know if you want more information about the kernel; I have the debug information handy.

Chris Lalancette

Comment 10 Jeremy Fitzhardinge 2009-05-14 21:55:13 UTC
What's the exact instruction (x/i)?

Looks like it's writing into a pte page, but Xen doesn't think the page is pinned and so just faults the write.  Or the new pte is bogus.

Any chance you could dig back through the stack trace to see what the callers are?

Good candidates:
ffffffff8100ce2f
ffffffff8100c5d3
ffffffff8167455d
ffffffff8150196c
ffffffff8150196c
ffffffff81397945

Comment 11 Chris Lalancette 2009-05-15 07:33:04 UTC
Jeremy,
     The exact instruction is:

(gdb) x/i 0xffffffff8100ce42
0xffffffff8100ce42 <xen_set_pte+71>:	mov    %r12,(%rbx)

For context, a disassembly of the whole xen_set_pte() function is:

(gdb) disass xen_set_pte
Dump of assembler code for function xen_set_pte:
0xffffffff8100cdfb <xen_set_pte+0>:	push   %rbp
0xffffffff8100cdfc <xen_set_pte+1>:	mov    %rsp,%rbp
0xffffffff8100cdff <xen_set_pte+4>:	push   %r13
0xffffffff8100ce01 <xen_set_pte+6>:	push   %r12
0xffffffff8100ce03 <xen_set_pte+8>:	push   %rbx
0xffffffff8100ce04 <xen_set_pte+9>:	sub    $0x8,%rsp
0xffffffff8100ce08 <xen_set_pte+13>:	callq  0xffffffff81011000 <mcount>
0xffffffff8100ce0d <xen_set_pte+18>:	mov    %rdi,%rbx
0xffffffff8100ce10 <xen_set_pte+21>:	mov    %rsi,%r12
0xffffffff8100ce13 <xen_set_pte+24>:	callq  0xffffffff8100cdcd <check_zero>
0xffffffff8100ce18 <xen_set_pte+29>:	incl   0x7ad166(%rip)        # 0xffffffff817b9f84 <mmu_stats+36>
0xffffffff8100ce1e <xen_set_pte+35>:	callq  0xffffffff8100cdcd <check_zero>
0xffffffff8100ce23 <xen_set_pte+40>:	mov    0x7ad162(%rip),%r13d        # 0xffffffff817b9f8c <mmu_stats+44>
0xffffffff8100ce2a <xen_set_pte+47>:	callq  0xffffffff81029a9d <paravirt_get_lazy_mode>
0xffffffff8100ce2f <xen_set_pte+52>:	dec    %eax
0xffffffff8100ce31 <xen_set_pte+54>:	sete   %al
0xffffffff8100ce34 <xen_set_pte+57>:	movzbl %al,%eax
0xffffffff8100ce37 <xen_set_pte+60>:	lea    (%rax,%r13,1),%r13d
0xffffffff8100ce3b <xen_set_pte+64>:	mov    %r13d,0x7ad14a(%rip)        # 0xffffffff817b9f8c <mmu_stats+44>
0xffffffff8100ce42 <xen_set_pte+71>:	mov    %r12,(%rbx)
0xffffffff8100ce45 <xen_set_pte+74>:	pop    %r11
0xffffffff8100ce47 <xen_set_pte+76>:	pop    %rbx
0xffffffff8100ce48 <xen_set_pte+77>:	pop    %r12
0xffffffff8100ce4a <xen_set_pte+79>:	pop    %r13
0xffffffff8100ce4c <xen_set_pte+81>:	leaveq 
0xffffffff8100ce4d <xen_set_pte+82>:	retq   

Resolving some of those symbols to function names, it looks like:
0xffffffff8100ce42 is in xen_set_pte (arch/x86/xen/mmu.c:531)
0xffffffff8100c5d3 is in pte_pfn_to_mfn (arch/x86/xen/mmu.c:451)
0xffffffff8100ce2f is in xen_set_pte (arch/x86/xen/mmu.c:524)
0xffffffff8167455d is in __raw_spin_unlock (/usr/src/debug/kernel-2.6.29/linux-2.6.29.x86_64/arch/x86/include/asm/paravirt.h:1421)
0xffffffff81001880 is at arch/x86/kernel/head_64.S:267
0xffffffff81397945 is in phys_pud_update (arch/x86/mm/init_64.c:547)
0xffffffff8164548e is in setup_arch (arch/x86/kernel/setup.c:849)
0xffffffff8163da83 is in start_kernel (init/main.c:573)

I'm not exactly sure at the moment, but the most likely candidate seems to be the phys_pud_update() call.

Chris Lalancette

Comment 12 Jeremy Fitzhardinge 2009-05-15 17:20:55 UTC
r12 is 80000003474da0e3, which has _PAGE_PSE set.  We don't support PSE under Xen, but some code is ignoring the fact that cpu_has_pse is false and trying to create a large mapping anyway.  If it's in mm/init_64.c, then something has set PG_LEVEL_2M in page_size_mask, or something is forgetting to test before creating the mapping.  But I don't see anything obvious, or why it should only happen sometimes.  "phys_pud_update" would be the place to look, but it probably has almost everything else inlined into it.  What does line 547 correspond to in your sources?

mm/numa_32.c has some code like this that I haven't got around to addressing, but I'm not aware of anything like this on 64 bit.  The report says that it only happens on a NUMA host machine, but the NUMA-ness of the host shouldn't be visible to a domU guest.

Comment 13 Chris Lalancette 2009-05-18 07:20:50 UTC
From gdb:

(gdb) list *(0xffffffff81397945)
0xffffffff81397945 is in phys_pud_update (arch/x86/mm/init_64.c:547).
542	{
543		pud_t *pud;
544	
545		pud = (pud_t *)pgd_page_vaddr(*pgd);
546	
547		return phys_pud_init(pud, addr, end, page_size_mask);
548	}
549	
550	static void __init find_early_table_space(unsigned long end, int use_pse,
551						  int use_gbpages)

Looking at the sources directly, we can only get to phys_pud_update() from kernel_physical_mapping_init(), and we can only get there from init_memory_mapping().  In that case, it looks like the page_size_mask is being passed in based on the "mr" array in init_memory_mapping().  So that must be getting messed up.  Indeed, if I put "earlyprintk=xen" on the guest boot-line, I see this:

[root@amd1 ~]# xm create -c f11pv_x86_64
Using config file "/etc/xen/f11pv_x86_64".
Started domain f11pv_x86_64
(early) Initializing cgroup subsys cpuset
(early) Initializing cgroup subsys cpu
(early) Linux version 2.6.29.2-126.fc11.x86_64 (mockbuild.phx.redhat.com) (gcc version 4.4.0 20090427 (Red Hat 4.4.0-3) (GCC) ) #1 SMP Mon May 4 04:46:15 EDT 2009
(early) Command line:  ro root=/dev/VolGroup00/LogVol00 console=hvc0 earlyprintk=xen
(early) KERNEL supported cpus:
(early)   Intel GenuineIntel
(early)   AMD AuthenticAMD
(early)   Centaur CentaurHauls
(early) ACPI in unprivileged domain disabled
(early) BIOS-provided physical RAM map:
(early)  Xen: 0000000000000000 - 00000000000a0000 (usable)
(early)  Xen: 00000000000a0000 - 0000000000100000 (reserved)
(early)  Xen: 0000000000100000 - 0000000001fea000 (usable)
(early)  Xen: 0000000001fea000 - 00000000023ed000 (reserved)
(early)  Xen: 00000000023ed000 - 0000000080000000 (usable)
(early) console [xenboot0] enabled
(early) DMI not present or invalid.
(early) last_pfn = 0x80000 max_arch_pfn = 0x100000000
(early) init_memory_mapping: 0000000000000000-0000000080000000
(early) Using GB pages for direct mapping

That last line is probably the problem.  Indeed, if I add "nogbpages" to the guest kernel command-line, it then boots just fine.  So the reason it shows up on my Barcelona, and not an earlier AMD, is that my Barcelona supports GB pages (cpu_has_gbpages).
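
For reference, the GB-page decision in 2.6.29-era arch/x86/mm/init_64.c looks roughly like the following (a paraphrased sketch, not an exact quote); "nogbpages" works by clearing direct_gbpages before this runs, and the mask built here is what reaches phys_pud_init() via the "mr" array:

static void __init init_gbpages(void)
{
	if (direct_gbpages && cpu_has_gbpages)
		printk(KERN_INFO "Using GB pages for direct mapping\n");
	else
		direct_gbpages = 0;
}

/* ...and later, in init_memory_mapping(): */
	if (direct_gbpages)
		page_size_mask |= 1 << PG_LEVEL_1G;
	if (use_pse)
		page_size_mask |= 1 << PG_LEVEL_2M;

Since Xen passes the host's cpuid through, cpu_has_gbpages is true on Barcelona even in a PV guest, so PG_LEVEL_1G gets set and the kernel tries to create a 1GB (_PAGE_PSE) mapping that Xen then rejects.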

Chris Lalancette

Comment 14 Jeremy Fitzhardinge 2009-05-18 07:30:05 UTC
Are GB pages a distinct feature flag in cpuid?  I wonder if it's another flag Xen should be masking out.  Either way, it's easy to mask in xen_cpuid().

Comment 15 Chris Lalancette 2009-05-18 12:40:17 UTC
Yeah, it looks like it:

/* AMD-defined CPU features, CPUID level 0x80000001, word 1 */
/* Don't duplicate feature flags which are redundant with Intel! */
#define X86_FEATURE_SYSCALL	(1*32+11) /* SYSCALL/SYSRET */
#define X86_FEATURE_MP		(1*32+19) /* MP Capable. */
#define X86_FEATURE_NX		(1*32+20) /* Execute Disable */
#define X86_FEATURE_MMXEXT	(1*32+22) /* AMD MMX extensions */
#define X86_FEATURE_FXSR_OPT	(1*32+25) /* FXSAVE/FXRSTOR optimizations */
#define X86_FEATURE_GBPAGES	(1*32+26) /* "pdpe1gb" GB pages */

And looking at the AMD CPUID spec (Rev 2.28, page 15) it does look like this is a unique flag.

I'm attaching a patch to xen_cpuid() that disables it, and it seems to fix the problem for me.  Jeremy, do you want me to post the patch to xen-devel/lkml, or will you take it from here?

Chris Lalancette

Comment 16 Chris Lalancette 2009-05-18 12:40:49 UTC
Created attachment 344429 [details]
Disable GB pages for Xen guests
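
The attachment itself is not reproduced here.  As a rough sketch of what such a change looks like (an assumption based on comment 15, not the literal patch), xen_cpuid() in arch/x86/xen/enlighten.c would extend the feature masking it already does for leaf 1 to the AMD extended leaf:

	/* Sketch only: hide GB pages (leaf 0x80000001, EDX bit 26)
	 * from the guest.  X86_FEATURE_GBPAGES is (1*32+26), so the
	 * bit within the EDX word is X86_FEATURE_GBPAGES % 32; maskedx
	 * is the mask xen_cpuid() applies to *dx after the cpuid. */
	if (*ax == 0x80000001)
		maskedx = ~(1 << (X86_FEATURE_GBPAGES % 32));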

Comment 17 Jeremy Fitzhardinge 2009-05-18 18:31:51 UTC
I think setup_clear_cpu_cap(X86_FEATURE_GBPAGES) might be better (probably for the other features as well).  I'll put something together.
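
A minimal sketch of that alternative (its exact placement in xen_start_kernel() is an assumption):

	/* Clear the capability once during early Xen guest setup,
	 * e.g. in xen_start_kernel(), instead of filtering every
	 * cpuid invocation: */
	setup_clear_cpu_cap(X86_FEATURE_GBPAGES);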

Comment 18 Chris Lalancette 2009-05-19 06:49:15 UTC
OK, thanks Jeremy!

Chris Lalancette

Comment 19 Justin M. Forbes 2009-05-22 15:32:21 UTC
The patch from Chris has been added to F-11 kernel 2.6.29.3-155 in time to make the release.  I will replace it with the patch from Jeremy when that is available, and leave this bug open until that is done.

Comment 20 Jeremy Fitzhardinge 2009-05-22 16:31:28 UTC
Yep, that's fine.

Comment 21 Jeremy Fitzhardinge 2009-05-26 22:13:01 UTC
Note that this has been fixed in Xen since March last year, with changeset 4b157affc08f.

Comment 22 Chris Lalancette 2009-05-27 06:55:05 UTC
Jeremy,
     Hm, what do you mean?  I thought this was a bug in the pv_ops kernel, and hence there wasn't a patch yet?  Or do you mean that there is a patch to the hypervisor to work around this issue?  Also, what repository does the above c/s come from?

Thanks,
Chris Lalancette

Comment 23 Jeremy Fitzhardinge 2009-05-27 07:12:26 UTC
It is at heart a Xen bug; the hypervisor shouldn't be exposing CPU capabilities that guests cannot use.  The kernel change is a workaround that masks the features Xen is not masking.

Xen fixed this issue in http://xenbits.xensource.com/xen-unstable.hg change 4b157affc08f.  You seem to be using a relatively old version of Xen, which does not have this fix.

Comment 24 Chris Lalancette 2009-05-27 11:27:46 UTC
Jeremy,

Ah, OK, thanks for the clarification.  I'll clone a copy of this bug for RHEL-5 then.

Chris Lalancette

Comment 25 Mark McLoughlin 2009-06-03 17:09:36 UTC
Chris/Justin: I'm coming to two conflicting conclusions from reading this:

  1) It's fixed in the latest fedora kernel -> CLOSED RAWHIDE

  2) It's a hypervisor bug -> CLOSED NOTABUG

which is it? :-)

Comment 26 Chris Lalancette 2009-06-03 17:47:04 UTC
Mark,
     I don't think it is either, actually.  The underlying issue is a hypervisor issue, and I've opened up BZ 502826 to solve it for RHEL-5.  However, I think we also might want a patch in the upstream domU kernel to make it a bit more robust against this kind of hypervisor bug.  Jeremy, is that correct?  Or do you want to just declare this a hypervisor bug and do nothing in the upstream domU kernel code?

Chris Lalancette

Comment 27 Mark McLoughlin 2009-06-03 21:46:17 UTC
(In reply to comment #19)
> The patch from Chris has been added to F-11 kernel 2.6.29.3-155 in time to meet
> release.  I will replace this with the patch from Jeremy when it is available,
> and leave this bug open until that is done.

Ah - I missed that last bit - Justin is leaving this bug open until he pulls in the patch from upstream.

Comment 28 Jeremy Fitzhardinge 2009-06-03 22:48:56 UTC
(In reply to comment #25)
> Chris/Justin: I'm coming to two conflicting conclusions from reading this:
> 
>   1) It's fixed in the latest fedora kernel -> CLOSED RAWHIDE
> 
>   2) It's a hypervisor bug -> CLOSED NOTABUG
> 
> which is it? :-)  

Well, it's both.  It is a Xen bug which is best fixed in Xen, but it can be easily worked around in the guest kernel.  The bug will only hit in fairly limited circumstances (starting a 2+GB guest on a host with a CPU supporting GB pages).

(You could also work around it in current versions of the Xen tools, which allow the user to mask CPUID bits in the config file, but I don't know whether your shipping versions of the tools have that feature.)
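
For illustration, with tools new enough to support the cpuid option in the guest config file (the exact syntax is an assumption; check the xm configuration documentation for your version), the mask would clear EDX bit 26 (pdpe1gb) of leaf 0x80000001 along these lines:

# Each character is one bit, most significant first, so the sixth
# character is bit 26.  '0' forces the bit clear, 'x' leaves the
# default.
cpuid = ['0x80000001:edx=xxxxx0xxxxxxxxxxxxxxxxxxxxxxxxxx']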

Comment 29 Bug Zapper 2009-06-09 15:18:26 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 30 Justin M. Forbes 2009-09-25 14:46:24 UTC
This is a workaround for a hypervisor bug with limited impact, and is not going to be carried forward in F-12.