Bug 499592
| Field | Value |
|---|---|
| Summary | F-11 Xen 64-bit domU cannot be started with > 2047MB of memory |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 11 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | medium |
| Priority | low |
| Reporter | Chris Lalancette <clalance> |
| Assignee | Justin M. Forbes <jforbes> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | itamar, jeremy, kernel-maint, markmc, virt-maint, xen-maint |
| Doc Type | Bug Fix |
| Clone Of | |
| | 502826 523124 |
| Last Closed | 2009-09-25 14:46:24 UTC |
| Bug Blocks | 480594, 502826, 523124 |

Description
Chris Lalancette
2009-05-07 11:13:40 UTC
Justin, of the Fedora Xen domU bugs that I filed during yesterday's test day, this seems to be the most important. Can you look at it and see what's going on here?

Chris Lalancette

Can you tell me what kernel was used in the guest? And was this pv or fv? With an up-to-date install using kernel 2.6.29.2-126.fc11.x86_64 I can boot a pv guest with 2047MB, 2048MB, and 4096MB just fine.

All of the bugs that I filed were for PV guests. And the kernel was exactly the same version as you state above: 2.6.29.2-126.fc11.x86_64. So, I guess we should focus on other things, since this might be a difference in hardware or dom0 software. My dom0 is 5.4 preview, with Xen kernel 2.6.18-145.el5 and xen package xen-3.0.3-83.el5jd4. My hardware is an AMD Barcelona machine with 2 quad-core CPUs and 16GB of memory. Anything else? I can add dmesg and/or dmidecode to the BZ if you think it will help, but this is 100% repeatable on this hardware.

Chris Lalancette

What function does ffffffff8100ce42 map to in the kernel?

Justin, how much physical memory does your system have? I'm pretty sure I've tested with >2G memory guests on a 16G host, though I can easily test again with a recent kernel.

My test system also has 16G of memory, but is not NUMA; it is a single Intel Q6600.

Also tested with an AMD NUMA system, and newer 5.4 kernel. It seems that this might be limited to the Barcelona or similar machines.

(In reply to comment #7)
> Also tested with an AMD numa system, and newer 5.4 kernel. It seems that this
> might be limited to the barcelona or similar machines.

There's nothing very architecturally specific about this. Failures above certain memory thresholds happen because either a pfn gets truncated or an mfn gets truncated, most commonly because they get turned into pointers then clipped at 32 bits by some vagaries of C's type promotion rules.

In a dom0 kernel, I guess there's the possibility that something is failing because of some DMA issue, and the memory size is just incidental to how things get laid out in memory and stomp on each other. But this seems unlikely given that it happens early in boot. And it's domU.

Also, could someone map the failing rip, ffffffff8100ce42, to a function?

Jeremy,
Oops, sorry about that, you had asked earlier and I forgot to respond. In the 2.6.29.2-126.fc11 kernel, the address 0xffffffff8100ce42 maps to:

    (gdb) list *(0xffffffff8100ce42)
    0xffffffff8100ce42 is in xen_set_pte (arch/x86/xen/mmu.c:531).
    526     #ifdef CONFIG_X86_PAE
    527             ptep->pte_high = pte.pte_high;
    528             smp_wmb();
    529             ptep->pte_low = pte.pte_low;
    530     #else
    531             *ptep = pte;
    532     #endif
    533     }
    534
    535     #ifdef CONFIG_X86_PAE

Let me know if you want more information about the kernel; I have the debug information handy.

Chris Lalancette

What's the exact instruction (x/i)? Looks like it's writing into a pte page, but Xen doesn't think it's pinned and so just faults the write. Or the new pte is bogus.

Any chance you could dig back through the stack trace to see what the callers are?

Good candidates:

    ffffffff8100ce2f
    ffffffff8100c5d3
    ffffffff8167455d
    ffffffff8150196c
    ffffffff8150196c
    ffffffff81397945

Jeremy,
The exact instruction is:

    (gdb) x/i 0xffffffff8100ce42
    0xffffffff8100ce42 <xen_set_pte+71>:    mov    %r12,(%rbx)

For context, a disassembly of the whole xen_set_pte() function is:

    (gdb) disass xen_set_pte
    Dump of assembler code for function xen_set_pte:
    0xffffffff8100cdfb <xen_set_pte+0>:     push   %rbp
    0xffffffff8100cdfc <xen_set_pte+1>:     mov    %rsp,%rbp
    0xffffffff8100cdff <xen_set_pte+4>:     push   %r13
    0xffffffff8100ce01 <xen_set_pte+6>:     push   %r12
    0xffffffff8100ce03 <xen_set_pte+8>:     push   %rbx
    0xffffffff8100ce04 <xen_set_pte+9>:     sub    $0x8,%rsp
    0xffffffff8100ce08 <xen_set_pte+13>:    callq  0xffffffff81011000 <mcount>
    0xffffffff8100ce0d <xen_set_pte+18>:    mov    %rdi,%rbx
    0xffffffff8100ce10 <xen_set_pte+21>:    mov    %rsi,%r12
    0xffffffff8100ce13 <xen_set_pte+24>:    callq  0xffffffff8100cdcd <check_zero>
    0xffffffff8100ce18 <xen_set_pte+29>:    incl   0x7ad166(%rip)        # 0xffffffff817b9f84 <mmu_stats+36>
    0xffffffff8100ce1e <xen_set_pte+35>:    callq  0xffffffff8100cdcd <check_zero>
    0xffffffff8100ce23 <xen_set_pte+40>:    mov    0x7ad162(%rip),%r13d        # 0xffffffff817b9f8c <mmu_stats+44>
    0xffffffff8100ce2a <xen_set_pte+47>:    callq  0xffffffff81029a9d <paravirt_get_lazy_mode>
    0xffffffff8100ce2f <xen_set_pte+52>:    dec    %eax
    0xffffffff8100ce31 <xen_set_pte+54>:    sete   %al
    0xffffffff8100ce34 <xen_set_pte+57>:    movzbl %al,%eax
    0xffffffff8100ce37 <xen_set_pte+60>:    lea    (%rax,%r13,1),%r13d
    0xffffffff8100ce3b <xen_set_pte+64>:    mov    %r13d,0x7ad14a(%rip)        # 0xffffffff817b9f8c <mmu_stats+44>
    0xffffffff8100ce42 <xen_set_pte+71>:    mov    %r12,(%rbx)
    0xffffffff8100ce45 <xen_set_pte+74>:    pop    %r11
    0xffffffff8100ce47 <xen_set_pte+76>:    pop    %rbx
    0xffffffff8100ce48 <xen_set_pte+77>:    pop    %r12
    0xffffffff8100ce4a <xen_set_pte+79>:    pop    %r13
    0xffffffff8100ce4c <xen_set_pte+81>:    leaveq
    0xffffffff8100ce4d <xen_set_pte+82>:    retq

Resolving some of those symbols to function names, it looks like:

    0xffffffff8100ce42 is in xen_set_pte (arch/x86/xen/mmu.c:531)
    0xffffffff8100c5d3 is in pte_pfn_to_mfn (arch/x86/xen/mmu.c:451)
    0xffffffff8100ce2f is in xen_set_pte (arch/x86/xen/mmu.c:524)
    0xffffffff8167455d is in __raw_spin_unlock (/usr/src/debug/kernel-2.6.29/linux-2.6.29.x86_64/arch/x86/include/asm/paravirt.h:1421)
    0xffffffff81001880 is at arch/x86/kernel/head_64.S:267
    0xffffffff81397945 is in phys_pud_update (arch/x86/mm/init_64.c:547)
    0xffffffff8164548e is in setup_arch (arch/x86/kernel/setup.c:849)
    0xffffffff8163da83 is in start_kernel (init/main.c:573)

I'm not exactly sure at the moment, but it seems like the most likely candidate is the phys_pud_update() call.

Chris Lalancette

r12 is 80000003474da0e3, which has _PAGE_PSE set. We don't support PSE under Xen, but some code is ignoring the fact that cpu_has_pse is false and trying to create a large mapping anyway. If it's in mm/init_64.c, then something has set PG_LEVEL_2M in page_size_mask, or something is forgetting to test before creating the mapping. But I don't see anything obvious, or why it should only happen sometimes.

"phys_pud_update" would be the place to look, but it probably has almost everything else inlined into it. What does line 547 correspond to in your sources?

mm/numa_32.c has some code like this that I haven't got around to addressing, but I'm not aware of anything like this on 64-bit. The report says that it only happens on a NUMA host machine, but the NUMA-ness of the host shouldn't be visible to a domU guest.

From gdb:

    (gdb) list *(0xffffffff81397945)
    0xffffffff81397945 is in phys_pud_update (arch/x86/mm/init_64.c:547).
    542     {
    543             pud_t *pud;
    544
    545             pud = (pud_t *)pgd_page_vaddr(*pgd);
    546
    547             return phys_pud_init(pud, addr, end, page_size_mask);
    548     }
    549
    550     static void __init find_early_table_space(unsigned long end, int use_pse,
    551                                               int use_gbpages)

Looking at the sources directly, we can only get to phys_pud_update() from kernel_physical_mapping_init(), and we can only get there from init_memory_mapping().

In that case, it looks like the page_size_mask is being passed in based on the "mr" array in init_memory_mapping(). So that must be getting messed up. Indeed, if I put "earlyprintk=xen" on the guest boot-line, I see this:

    [root@amd1 ~]# xm create -c f11pv_x86_64
    Using config file "/etc/xen/f11pv_x86_64".
    Started domain f11pv_x86_64
    (early) Initializing cgroup subsys cpuset
    (early) Initializing cgroup subsys cpu
    (early) Linux version 2.6.29.2-126.fc11.x86_64 (mockbuild.phx.redhat.com) (gcc version 4.4.0 20090427 (Red Hat 4.4.0-3) (GCC) ) #1 SMP Mon May 4 04:46:15 EDT 2009
    (early) Command line: ro root=/dev/VolGroup00/LogVol00 console=hvc0 earlyprintk=xen
    (early) KERNEL supported cpus:
    (early)   Intel GenuineIntel
    (early)   AMD AuthenticAMD
    (early)   Centaur CentaurHauls
    (early) ACPI in unprivileged domain disabled
    (early) BIOS-provided physical RAM map:
    (early)  Xen: 0000000000000000 - 00000000000a0000 (usable)
    (early)  Xen: 00000000000a0000 - 0000000000100000 (reserved)
    (early)  Xen: 0000000000100000 - 0000000001fea000 (usable)
    (early)  Xen: 0000000001fea000 - 00000000023ed000 (reserved)
    (early)  Xen: 00000000023ed000 - 0000000080000000 (usable)
    (early) console [xenboot0] enabled
    (early) DMI not present or invalid.
    (early) last_pfn = 0x80000 max_arch_pfn = 0x100000000
    (early) init_memory_mapping: 0000000000000000-0000000080000000
    (early) Using GB pages for direct mapping

That last line is probably the problem. Indeed, if I add "nogbpages" to the guest kernel command-line, it then boots just fine. So the reason it shows up on my Barcelona, and not an earlier AMD, is that my Barcelona supports GB pages (cpu_has_gbpages).

Chris Lalancette

Are GB pages a distinct feature flag in cpuid? I wonder if it's another flag Xen should be masking out. Either way, it's easy to mask in xen_cpuid().

Yeah, it looks like it:

    /* AMD-defined CPU features, CPUID level 0x80000001, word 1 */
    /* Don't duplicate feature flags which are redundant with Intel! */
    #define X86_FEATURE_SYSCALL     (1*32+11) /* SYSCALL/SYSRET */
    #define X86_FEATURE_MP          (1*32+19) /* MP Capable. */
    #define X86_FEATURE_NX          (1*32+20) /* Execute Disable */
    #define X86_FEATURE_MMXEXT      (1*32+22) /* AMD MMX extensions */
    #define X86_FEATURE_FXSR_OPT    (1*32+25) /* FXSAVE/FXRSTOR optimizations */
    #define X86_FEATURE_GBPAGES     (1*32+26) /* "pdpe1gb" GB pages */

And looking at the AMD CPUID spec (Rev 2.28, page 15) it does look like this is a unique flag. I'm attaching a patch to xen_cpuid() that disables it, and seems to fix the problem for me. Jeremy, do you want me to post the patch to xen-devel/lkml, or will you take it from here?

Chris Lalancette

Created attachment 344429
Disable GB pages for Xen guests

I think setup_clear_cpu_cap(X86_FEATURE_GBPAGES) might be better (probably for the other features as well). I'll put something together.

OK, thanks Jeremy!

Chris Lalancette

The patch from Chris has been added to F-11 kernel 2.6.29.3-155 in time to meet release. I will replace this with the patch from Jeremy when it is available, and leave this bug open until that is done.

Yep, that's fine. Note that this has been fixed in Xen since March last year, with changeset 4b157affc08f.

Jeremy,
Hm, what do you mean? I thought this was a bug in the pv_ops kernel, and hence there wasn't a patch yet? Or do you mean that there is a patch to the hypervisor to work around this issue? Also, what repository does the above c/s come from?

Thanks,
Chris Lalancette

It is at heart a Xen bug; the hypervisor shouldn't be exposing CPU capabilities that guests cannot use. The kernel change is a workaround to mask things that Xen is not. Xen fixed this issue in http://xenbits.xensource.com/xen-unstable.hg change 4b157affc08f. You seem to be using a relatively old version of Xen, which does not have this fix.

Jeremy,
Ah, OK, thanks for the clarification. I'll clone a copy of this bug for RHEL-5 then.

Chris Lalancette

Chris/Justin: I'm coming to two conflicting conclusions from reading this:

1) It's fixed in the latest fedora kernel -> CLOSED RAWHIDE

2) It's a hypervisor bug -> CLOSED NOTABUG

which is it? :-)

Mark,
I don't think it is either, actually. The underlying issue is a hypervisor issue, and I've opened up BZ 502826 to solve it for RHEL-5. However, I think we also might want a patch in the upstream domU kernel to make it a bit more robust against this kind of hypervisor bug. Jeremy, is that correct? Or do you want to just declare this a hypervisor bug and do nothing in the upstream domU kernel code?

Chris Lalancette

(In reply to comment #19)
> The patch from Chris has been added to F-11 kernel 2.6.29.3-155 in time to meet
> release. I will replace this with the patch from Jeremy when it is available,
> and leave this bug open until that is done.

Ah - I missed that last bit - Justin is leaving this bug open until he pulls in the patch from upstream.

(In reply to comment #25)
> Chris/Justin: I'm coming to two conflicting conclusions from reading this:
>
> 1) It's fixed in the latest fedora kernel -> CLOSED RAWHIDE
>
> 2) It's a hypervisor bug -> CLOSED NOTABUG
>
> which is it? :-)

Well, it's both. It is a Xen bug which is best fixed in Xen, but it can be easily worked around in the guest kernel. The bug will only hit in fairly limited circumstances (starting a 2+GB guest on a host with a CPU supporting GB pages).

(You could also work around it in current versions of the Xen tools which allow the user to mask CPUID bits in the config file, but I don't know if your shipping versions of the tools have that feature.)

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle. Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

This is a workaround for a hypervisor bug with limited impact, and is not going to be carried forward in F-12.