Bug 508120
Summary: | 2.6.31-rc1 xen domU crashes early during boot | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Kalev Lember <kalevlember> |
Component: | kernel | Assignee: | Justin M. Forbes <jforbes> |
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | rawhide | CC: | atodorov, clalance, itamar, jeremy, jforbes, kernel-maint, kevin, markmc, mitchb, mschmidt, orion, pasik, rwilliam, virt-maint |
Hardware: | i686 | ||
OS: | Linux | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2009-09-05 18:52:23 UTC | Type: | --- |
Bug Blocks: | 498968 | ||
I'm seeing the same problem with an x86_64 Rawhide DomU and a RHEL5 Dom0. I believe it's the same problem Orion Poplawski reported recently on the fedora-xen mailing list:
https://www.redhat.com/archives/fedora-xen/2009-August/msg00008.html

I got the stack trace as Mark McLoughlin suggested:

    michich@hammerfall ~$ sudo /usr/lib64/xen/bin/xenctx -s /tmp/System.map-2.6.31-rc5-git2 12
    rip: ffffffff817290a1 xen_start_kernel+0x10
    rsp: ffffffff8171df90
    rax: 00000000 rbx: 00000000 rcx: 00000000 rdx: 00000000
    rsi: ffffffff82fc3000 rdi: ffffffff82fc3000 rbp: ffffffff8171dff8
     r8: 00000000  r9: 00000000 r10: 00000000 r11: 00000000
    r12: 00000000 r13: 00000000 r14: 00000000 r15: 00000000
     cs: 0000e033  ds: 00000000  fs: 00000000  gs: 00000000
    Stack:
     0000000000000000 0000000000000000 0000000000000000 ffffffff817290a1
     000000010000e030 0000000000010096 ffffffff8171dfd8 000000000000e02b
     0000000000000000 0000000000000000 0000000000000000 0000000000000000
     0000000000000000 0000000000000000
    Code:
    bd 93 ff c9 c3 55 48 89 e5 53 48 83 ec 18 48 8b 3d 27 c0 33 00 <65> 48 8b 04 25 28 00 00 00 48 89
    Call Trace:
     [<ffffffff817290a1>] xen_start_kernel+0x10 <--
     [<ffffffff817290a1>] xen_start_kernel+0x10

Then I disassembled xen_start_kernel():

    0000000000000000 <xen_start_kernel>:
       0: 55                      push %rbp
       1: 48 89 e5                mov %rsp,%rbp
       4: 53                      push %rbx
       5: 48 83 ec 18             sub $0x18,%rsp
       9: 48 8b 3d 00 00 00 00    mov 0x0(%rip),%rdi # 10 <xen_start_kernel+0x10>
      10: 65 48 8b 04 25 28 00    mov %gs:0x28,%rax ***CRASHES HERE***
      17: 00 00
      19: 48 89 45 e8             mov %rax,-0x18(%rbp)
      1d: 31 c0                   xor %eax,%eax
      ...

At first these last three instructions confused me, because they did not seem to correspond to anything in the C source, but then I realized they set up the canary for stack smashing detection. So I recompiled the kernel without CONFIG_CC_STACKPROTECTOR and I got much farther with the boot (it hung after loading some drivers, I'll investigate more).
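For anyone unfamiliar with the canary sequence, here is a rough user-space sketch in C of what the compiler-generated prologue and epilogue amount to. It is only an illustration under stated assumptions: `struct fake_percpu`, `guarded_copy` and the return codes are made-up names, and the real code is emitted by gcc and reads the guard through %gs rather than through a C pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-CPU area that %gs points at on x86_64. */
struct fake_percpu {
    char pad[0x28];              /* guard lives at offset 0x28, cf. mov %gs:0x28,%rax */
    unsigned long stack_canary;
};

/*
 * Hand-written imitation of a stack-protected function.
 * Returns  0 on success,
 *         -1 if the per-CPU area is not set up yet (the real kernel
 *            simply faults on the load instead of returning),
 *         -2 if the saved copy no longer matches the guard.
 */
int guarded_copy(const struct fake_percpu *gs_base,
                 char *dst, size_t dst_len,
                 const char *src, size_t n)
{
    unsigned long saved;

    if (gs_base == NULL)
        return -1;

    /* "Prologue": load the guard and keep a copy in the frame. */
    saved = gs_base->stack_canary;

    /* Function body: a bounded copy. */
    if (n > dst_len)
        n = dst_len;
    memcpy(dst, src, n);

    /* "Epilogue": compare the saved copy against the guard again. */
    if (saved != gs_base->stack_canary)
        return -2;
    return 0;
}
```

The point of the sketch is the first check: the prologue dereferences the per-CPU base before the function body has run at all, so the very first protected function called during boot needs that base to be valid already.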
I guess xen_start_kernel() (and possibly more of Xen DomU startup code) should be compiled with -fno-stack-protector.

Wow, excellent analysis, thanks. I was just starting to see this myself, but hadn't yet had time to look into it. We'll have to take this up with upstream and see what they have to say about it.

Chris Lalancette

Yeah, seeing this also on a machine in fedora infrastructure. ;(

Thanks for the report and analysis.

I guess there's a keyword to prevent gcc from adding stack-smashing to particular functions or files...

Erm...

Created attachment 357697 [details]
Make sure load_percpu_segment doesn't have stack-protector enabled
Created attachment 357698 [details]
Setup percpu segments before calling stack-protected functions
Do those two help?

Jeremy, yes, these patches help. The kernel starts booting with them applied.

Just a suggestion: the usual way (as seen in other Makefiles) to disable the stack protection for selected source files seems to be:

    nostackp := $(call cc-option, -fno-stack-protector)
    CFLAGS_somefile.o := $(nostackp)

And the kernel still hangs for me later during boot, but that's a different bug.

Ah, I couldn't find another instance of stack-protector being disabled.

Have you reported the other bug, or is it something purely local?

Ingo has these queued up in linux-2.6-tip.git/x86/urgent:
http://git.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commitdiff;h=ce2eef33d3
http://git.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commitdiff;h=5416c26635

Just need to re-test and close this when they make their way to rawhide

Are you sure they work? M A Young still reports crashes when they're applied.

Michal says it still hangs later on, but that it's a different issue

M A Young is probably testing Dom0, maybe yet another issue?

Dunno, that's why I said we need to re-test :-)

I've described the other bug in http://lkml.org/lkml/2009/8/21/71

Should be fixed in 2.6.31-0.173.rc7.git2, which has the two x86-tip patches plus the framebuffer fix from LKML.

2.6.31-0.173.rc7.git2 boots successfully under Xen on x86_64, but i686 still fails. Probably because load_percpu_segment(0); is under #ifdef CONFIG_X86_64 in xen_start_kernel().

Seems to work here under an x86_64 guest. Thanks.

(In reply to comment #15)
> 2.6.31-0.173.rc7.git2 boots successfully under Xen on x86_64, but i686 still
> fails. Probably because load_percpu_segment(0); is under #ifdef CONFIG_X86_64
> in xen_start_kernel().

32 bit is trickier because it needs a specifically set-up GDT entry and its own segment register. Doing this setup properly ends up calling functions with stack-protector prologs which assume the segment register is already set up.

I need to work out 1) how native does this setup, and/or 2) refactor the segment register setup so that it can avoid functions with stack-protector code.

*** Bug 519342 has been marked as a duplicate of this bug. ***

FYI: 2.6.31-0.174.rc7.git2 still fails on i386. My dom0 is recent RHEL5 and domU is F12-Alpha

I also tried the latest rawhide tree (2.6.31-0.174.rc7.git2.fc12.i686) with virt-install on my F11 + Xen 3.4.1 + 2.6.31-rc6 pv_ops dom0 setup, and it still crashes.

What compiler are people using? Using F11's gcc-4.4.1-2.fc11.x86_64, it says:

    /home/jeremy/git/linux/arch/x86/Makefile:80: stack protector enabled but no compiler support

Jeremy, Rawhide builds currently use gcc-4.4.1-6.x86_64. You can find this information in Koji build logs, e.g.:
http://kojipkgs.fedoraproject.org/packages/kernel/2.6.31/0.185.rc7.git6.fc12/data/logs/x86_64/
root.log tells you the versions of the packages used in the build. build.log has the build warnings. There was no such stack protector warning in this case.

Where can I get this version of gcc? "yum update --enablerepo=rawhide gcc" doesn't get me anything more recent than gcc-4.4.1-2.fc11.x86_64.

Or does 32-bit stackprotector not work in the x86-64 version of the compiler?

> Where can I get this version of gcc? "yum update --enablerepo=rawhide gcc"
> doesn't get me anything more recent than gcc-4.4.1-2.fc11.x86_64.

Works for me, yum can see the newer version. But the gcc from Rawhide depends on newer glibc, so I do not recommend doing it.

> Or does 32-bit stackprotector not work in the x86-64 version of the compiler?

Bug in the stack protector detection for ARCH=i386 builds on x86_64. I've sent a patch to LKML and CCed you. Koji always builds packages using native arch toolchain, so it is not affected.

Created attachment 359557 [details]
Set up kernel GDT early to make -fstack-protector work under Xen
This patch should comprehensively fix -fstack-protector under Xen for both 32 and 64-bit. Please test.
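As background to the earlier remark that 32-bit needs a specifically set-up GDT entry, the following is a minimal sketch in C of how a base and limit are packed into an 8-byte x86 segment descriptor. The helper names `pack_gdt_descriptor` and `descriptor_base` are hypothetical and not taken from the patch; the sketch shows only the bit layout, and a paravirtualized guest would have to ask the hypervisor to install such an entry rather than write the table directly.

```c
#include <stdint.h>

/*
 * Pack a legacy x86 segment descriptor (hypothetical helper).
 *   bits  0..15  limit[15:0]
 *   bits 16..39  base[23:0]
 *   bits 40..47  access byte
 *   bits 48..51  limit[19:16]
 *   bits 52..55  flags (granularity, size, ...)
 *   bits 56..63  base[31:24]
 */
uint64_t pack_gdt_descriptor(uint32_t base, uint32_t limit,
                             uint8_t access, uint8_t flags)
{
    uint64_t d = 0;

    d |= (uint64_t)(limit & 0xffffu);
    d |= (uint64_t)(base & 0xffffffu) << 16;
    d |= (uint64_t)access << 40;
    d |= (uint64_t)((limit >> 16) & 0xfu) << 48;
    d |= (uint64_t)(flags & 0xfu) << 52;
    d |= (uint64_t)(base >> 24) << 56;
    return d;
}

/* Recover the 32-bit base from a packed descriptor (hypothetical helper). */
uint32_t descriptor_base(uint64_t d)
{
    uint32_t low  = (uint32_t)((d >> 16) & 0xffffffu);
    uint32_t high = (uint32_t)((d >> 56) & 0xffu);
    return low | (high << 24);
}
```

For example, a flat 4 GiB read/write data segment (base 0, limit 0xfffff, access 0x92, flags 0xc) packs to 0x00cf92000000ffff. A segment used for %gs-relative canary loads would instead carry a non-zero base, which is why the entry has to exist before any stack-protected function runs.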
Someone please add that bugfix patch to next rawhide kernel build so we get people to test it..

The patch has been applied and should be available in the next rawhide kernel build.

2.6.31-0.203.rc8.git2.fc12 boots successfully as Xen domU. I've tested both i686.PAE and x86_64.

Seems to boot now. virt-install started f12/rawhide Xen domU installation OK, on F11 host with Xen 3.4.1-3 + pv_ops dom0 kernel + libvirt from F11 updates testing.

Installation went fine, and the installed domU seems to have 2.6.31-0.203.rc8.git2.fc12.i686.PAE kernel running.

There's a traceback on domU dmesg though.. the domU still runs fine.

    Write protecting the kernel text: 4352k
    Write protecting the kernel read-only data: 1800k

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.31-0.203.rc8.git2.fc12.i686.PAE #1
    ---------------------------------------------
    init/1 is trying to acquire lock:
     (&input_pool.lock){+.+...}, at: [<c043b30e>] __wake_up+0x2b/0x61

    but task is already holding lock:
     (&input_pool.lock){+.+...}, at: [<c068e21b>] account+0x30/0xf0

    other info that might help us debug this:
    2 locks held by init/1:
     #0: (&p->cred_guard_mutex){+.+.+.}, at: [<c0508756>] do_execve+0xa4/0x2ee
     #1: (&input_pool.lock){+.+...}, at: [<c068e21b>] account+0x30/0xf0

    stack backtrace:
    Pid: 1, comm: init Not tainted 2.6.31-0.203.rc8.git2.fc12.i686.PAE #1
    Call Trace:
     [<c08387c0>] ? printk+0x22/0x3a
     [<c0478b59>] __lock_acquire+0x7e9/0xb25
     [<c0478f4c>] lock_acquire+0xb7/0xeb
     [<c043b30e>] ? __wake_up+0x2b/0x61
     [<c043b30e>] ? __wake_up+0x2b/0x61
     [<c083b4f7>] _spin_lock_irqsave+0x45/0x89
     [<c043b30e>] ? __wake_up+0x2b/0x61
     [<c043b30e>] __wake_up+0x2b/0x61
     [<c068e2a0>] account+0xb5/0xf0
     [<c068e3ef>] extract_entropy+0x3e/0xac
     [<c0406b0b>] ? xen_restore_fl_direct_end+0x0/0x1
     [<c04799d7>] ? lock_release+0x186/0x19f
     [<c068e56e>] get_random_bytes+0x29/0x3e
     [<c053bbd1>] load_elf_binary+0xab9/0x106c
     [<c050732d>] search_binary_handler+0xd7/0x27b
     [<c053b118>] ? load_elf_binary+0x0/0x106c
     [<c0539c76>] load_script+0x1a6/0x1c8
     [<c0507323>] ? search_binary_handler+0xcd/0x27b
     [<c0406199>] ? xen_force_evtchn_callback+0x1d/0x34
     [<c0507323>] ? search_binary_handler+0xcd/0x27b
     [<c0406b14>] ? check_events+0x8/0xc
     [<c0406b0b>] ? xen_restore_fl_direct_end+0x0/0x1
     [<c04799d7>] ? lock_release+0x186/0x19f
     [<c050732d>] search_binary_handler+0xd7/0x27b
     [<c0539ad0>] ? load_script+0x0/0x1c8
     [<c050888b>] do_execve+0x1d9/0x2ee
     [<c0408359>] sys_execve+0x39/0x6e
     [<c0409ad0>] syscall_call+0x7/0xb
     [<c04f00d8>] ? sys_swapon+0x348/0xa98
     [<c040d76b>] ? kernel_execve+0x27/0x3e
     [<c04031e0>] ? run_init_process+0x2b/0x3e
     [<c0403275>] ? init_post+0x82/0xe9
     [<c0a9b566>] ? kernel_init+0x1f6/0x211
     [<c0a9b370>] ? kernel_init+0x0/0x211
     [<c040a6bf>] ? kernel_thread_helper+0x7/0x10

(In reply to comment #29)
> Seems to boot now. virt-install started f12/rawhide Xen domU installation OK,
> on F11 host with Xen 3.4.1-3 + pv_ops dom0 kernel + libvirt from F11 updates
> testing.
>
> Installation went fine, and the installed domU seems to have
> 2.6.31-0.203.rc8.git2.fc12.i686.PAE kernel running.
>
> There's a traceback on domU dmesg though.. the domU still runs fine.

It's probably worth looking through BZ quickly to see if a bug with that trace exists already, and if not, to open a new bug about it.

Thanks for the testing,
Chris Lalancette

OK, new bug opened: https://bugzilla.redhat.com/show_bug.cgi?id=521800
Created attachment 349429 [details]
xm dmesg

I am running an i686 rawhide domU PV machine under an x86_64 Xen host. After updating from F-11's kernel-PAE-2.6.29.4-167.fc11.i686 the new kernels no longer boot. They seem to crash very early, so that I don't even get any printk() output in the console.

Right now I can reproduce the problem with kernel-PAE-2.6.31-0.28.rc1.fc12.i686, however the same issue started with 2.6.30-something, and I also verified that I get the exact same behaviour with x86_64 domU kernels.

"xm dmesg" reports the following:

    (XEN) Unhandled page fault in domain 12 on VCPU 0 (ec=0000)
    (XEN) Pagetable walk from 0000000000000014:
    (XEN) L4[0x000] = 0000000081d9d027 0000000000001b48
    (XEN) L3[0x000] = 0000000000000000 ffffffffffffffff
    (XEN) domain_crash_sync called from entry.S
    (XEN) Domain 12 (vcpu#0) crashed on cpu#3:
    (XEN) ----[ Xen-3.1.2-155.el5 x86_64 debug=n Not tainted ]----
    (XEN) CPU: 3
    (XEN) RIP: e019:[<00000000c0a8b501>]
    <snip> (full trace attached)

Examining matching vmlinux in gdb I get:

    (gdb) x/i 0x00000000c0a8b501
    0xc0a8b501 <xen_start_kernel+9>: mov %gs:0x14,%eax
    (gdb) l *0x00000000c0a8b501
    0xc0a8b501 is in xen_start_kernel (arch/x86/xen/enlighten.c:990).
    985 .emergency_restart = xen_emergency_restart,
    986 };
    987
    988 /* First C function to be called on Xen boot */
    989 asmlinkage void __init xen_start_kernel(void)
    990 {
    991 pgd_t *pgd;
    992
    993 if (!xen_start_info)
    994 return;

The host is running CentOS 5.3 with kernel-xen-2.6.18-155.el5 and xen-3.0.3-80.el5_3.3.

Xen config:

    name = "fedora-rawhide"
    uuid = "1f162091-fa31-c67c-9da1-702bcd5cb40b"
    maxmem = 512
    memory = 512
    vcpus = 1
    bootloader = "/usr/bin/pygrub"
    on_poweroff = "destroy"
    on_reboot = "restart"
    on_crash = "restart"
    vfb = [ ]
    disk = [ "phy:/dev/vg0/xen_rawhide,xvda,w" ]
    vif = [ "mac=00:16:3e:11:95:e6,bridge=xenbr0" ]
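The faulting address in the Xen log lines up with the %gs-relative load that gdb shows. A small worked sketch in C, assuming the %gs base was still zero at that point (an inference from the log, not something the trace states directly); `gs_linear_address` is a made-up helper name:

```c
#include <stdint.h>

/* Offsets of the stack-protector guard as seen in the disassembly. */
#define CANARY_OFFSET_I386   0x14u  /* mov %gs:0x14,%eax */
#define CANARY_OFFSET_X86_64 0x28u  /* mov %gs:0x28,%rax */

/* Linear address touched by a %gs-relative access (hypothetical helper). */
uint64_t gs_linear_address(uint64_t gs_base, uint32_t offset)
{
    return gs_base + offset;
}
```

With a zero base, the i686 load resolves to linear address 0x14, which is the address in "Pagetable walk from 0000000000000014". The same reasoning gives 0x28 for the x86_64 instruction.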