Bug 508120

Summary:

2.6.31-rc1 xen domU crashes early during boot

Product:

[Fedora] Fedora

Reporter:

Kalev Lember <kalevlember>

Component:

kernel

Assignee:

Justin M. Forbes <jforbes>

Status:

CLOSED RAWHIDE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Priority:

high

Version:

rawhide

CC:

atodorov, clalance, itamar, jeremy, jforbes, kernel-maint, kevin, markmc, mitchb, mschmidt, orion, pasik, rwilliam, virt-maint

Hardware:

i686

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-09-05 18:52:23 UTC

Bug Blocks:

498968

Attachments:

Description	Flags
xm dmesg	none
Make sure load_percpu_segment doesn't have stack-protector enabled	none
Setup percpu segments before calling stack-protected functions	none
Set up kernel GDT early to make -fstack-protector work under Xen	none

Description Kalev Lember 2009-06-25 17:25:26 UTC

Created attachment 349429 [details]
xm dmesg

I am running an i686 rawhide domU PV machine under an x86_64 Xen host. After updating from F-11's kernel-PAE-2.6.29.4-167.fc11.i686, the new kernels no longer boot. They seem to crash very early, so that I don't even get any printk() output in the console.
Right now I can reproduce the problem with kernel-PAE-2.6.31-0.28.rc1.fc12.i686, however the same issue started with 2.6.30-something, and I also verified that I get the exact same behaviour with x86_64 domU kernels.

"xm dmesg" reports the following:
(XEN) Unhandled page fault in domain 12 on VCPU 0 (ec=0000)
(XEN) Pagetable walk from 0000000000000014:
(XEN)  L4[0x000] = 0000000081d9d027 0000000000001b48
(XEN)  L3[0x000] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 12 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-3.1.2-155.el5  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e019:[<00000000c0a8b501>]
<snip> (full trace attached)

Examining matching vmlinux in gdb I get:
(gdb) x/i 0x00000000c0a8b501
0xc0a8b501 <xen_start_kernel+9>:        mov    %gs:0x14,%eax
(gdb) l *0x00000000c0a8b501
0xc0a8b501 is in xen_start_kernel (arch/x86/xen/enlighten.c:990).
985             .emergency_restart = xen_emergency_restart,
986     };
987
988     /* First C function to be called on Xen boot */
989     asmlinkage void __init xen_start_kernel(void)
990     {
991             pgd_t *pgd;
992
993             if (!xen_start_info)
994                     return;

The host is running CentOS 5.3 with kernel-xen-2.6.18-155.el5 and xen-3.0.3-80.el5_3.3.

Xen config:
name = "fedora-rawhide"
uuid = "1f162091-fa31-c67c-9da1-702bcd5cb40b"
maxmem = 512
memory = 512
vcpus = 1
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [  ]
disk = [ "phy:/dev/vg0/xen_rawhide,xvda,w" ]
vif = [ "mac=00:16:3e:11:95:e6,bridge=xenbr0" ]

Comment 1 Michal Schmidt 2009-08-16 19:54:04 UTC

I'm seeing the same problem with an x86_64 Rawhide DomU and a RHEL5 Dom0.
I believe it's the same problem Orion Poplawski reported recently on fedora-xen mailing list: https://www.redhat.com/archives/fedora-xen/2009-August/msg00008.html

I got the stack trace as Mark McLoughlin suggested:

michich@hammerfall ~$ sudo /usr/lib64/xen/bin/xenctx -s /tmp/System.map-2.6.31-rc5-git2  12
rip: ffffffff817290a1 xen_start_kernel+0x10
rsp: ffffffff8171df90
rax: 00000000   rbx: 00000000   rcx: 00000000   rdx: 00000000
rsi: ffffffff82fc3000   rdi: ffffffff82fc3000   rbp: ffffffff8171dff8
 r8: 00000000    r9: 00000000   r10: 00000000   r11: 00000000
r12: 00000000   r13: 00000000   r14: 00000000   r15: 00000000
 cs: 0000e033    ds: 00000000    fs: 00000000    gs: 00000000

Stack:
 0000000000000000 0000000000000000 0000000000000000 ffffffff817290a1
 000000010000e030 0000000000010096 ffffffff8171dfd8 000000000000e02b
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000

Code:
bd 93 ff c9 c3 55 48 89 e5 53 48 83 ec 18 48 8b 3d 27 c0 33 00 <65> 48 8b 04 25 28 00 00 00 48 89 

Call Trace:
  [<ffffffff817290a1>] xen_start_kernel+0x10 <--
  [<ffffffff817290a1>] xen_start_kernel+0x10


Then I disassembled xen_start_kernel():
0000000000000000 <xen_start_kernel>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   53                      push   %rbx
   5:   48 83 ec 18             sub    $0x18,%rsp
   9:   48 8b 3d 00 00 00 00    mov    0x0(%rip),%rdi        # 10 <xen_start_kernel+0x10>
  10:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax  ***CRASHES HERE***
  17:   00 00 
  19:   48 89 45 e8             mov    %rax,-0x18(%rbp)
  1d:   31 c0                   xor    %eax,%eax
...

At first these last three instructions confused me, because they did not seem to correspond to anything in the C source, but then I realized they set up the canary for stack smashing detection.
So I recompiled the kernel without CONFIG_CC_STACKPROTECTOR and I got much farther with the boot (it hung after loading some drivers, I'll investigate more).

I guess xen_start_kernel() (and possibly more of Xen DomU startup code) should be compiled with -fno-stack-protector.
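
To illustrate the failure mode described above, here is a minimal user-space sketch in C, not kernel code. It assumes a simplified model in which a pointer stands in for the %gs segment base and a NULL pointer stands in for "segment not set up yet". All names prefixed with sketch_ are hypothetical and exist only for this example. The 0x28 offset is the one seen in the x86_64 disassembly; the i686 trace in the description uses %gs:0x14 instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU area that %gs points at.
 * The padding places the canary at offset 0x28, as in the disassembly. */
struct sketch_percpu {
    unsigned char pad[0x28];
    uint64_t stack_canary;
};

/* NULL models a segment base that has not been loaded yet. */
static struct sketch_percpu *sketch_gs_base = NULL;

/* Models "mov %gs:0x28,%rax". Returns -1 where real hardware would
 * take a page fault on a near-zero address. */
static int sketch_read_canary(uint64_t *out)
{
    assert(out != NULL);
    if (sketch_gs_base == NULL)
        return -1;
    *out = sketch_gs_base->stack_canary;
    return 0;
}

/* Models a function built with -fstack-protector: the prologue saves
 * the canary, the epilogue compares it against the per-CPU value. */
static int sketch_protected_function(void)
{
    uint64_t on_entry, on_exit;

    if (sketch_read_canary(&on_entry) != 0)   /* prologue */
        return -1;                            /* would crash here */

    /* ... function body would run here ... */

    if (sketch_read_canary(&on_exit) != 0)    /* epilogue */
        return -1;
    return on_entry == on_exit ? 0 : 1;       /* 1 = stack smashed */
}

/* Models loading the per-CPU segment before any protected call. */
static void sketch_load_percpu_segment(struct sketch_percpu *area)
{
    sketch_gs_base = area;
}
```

In this model, calling sketch_protected_function() before sketch_load_percpu_segment() fails in the prologue, which corresponds to the crash at xen_start_kernel+0x10; once the segment is loaded, the same call succeeds.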

Comment 2 Chris Lalancette 2009-08-17 09:15:10 UTC

Wow, excellent analysis, thanks.  I was just starting to see this myself, but hadn't yet had time to look into it.  We'll have to take this up with upstream and see what they have to say about it.

Chris Lalancette

Comment 3 Kevin Fenzi 2009-08-17 16:22:05 UTC

Yeah, seeing this also on a machine in fedora infrastructure. ;(

Comment 4 Jeremy Fitzhardinge 2009-08-17 17:49:02 UTC

Thanks for the report and analysis.  I guess there's a keyword to prevent gcc from adding stack-smashing protection to particular functions or files...  Erm...

Comment 5 Jeremy Fitzhardinge 2009-08-17 19:31:05 UTC

Created attachment 357697 [details]
Make sure load_percpu_segment doesn't have stack-protector enabled

Comment 6 Jeremy Fitzhardinge 2009-08-17 19:31:34 UTC

Created attachment 357698 [details]
Setup percpu segments before calling stack-protected functions

Comment 7 Jeremy Fitzhardinge 2009-08-17 19:31:59 UTC

Do those two help?

Comment 8 Michal Schmidt 2009-08-18 09:32:43 UTC

Jeremy,
yes, these patches help. The kernel starts booting with them applied.

Just a suggestion: the usual way (as seen in other Makefiles) to disable the stack protection for selected source files seems to be:
nostackp := $(call cc-option, -fno-stack-protector)
CFLAGS_somefile.o := $(nostackp)
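
For comparison with the per-file Makefile approach, the following C sketch shows a per-function alternative. It assumes a much newer compiler than the gcc 4.4 discussed in this bug: the no_stack_protector function attribute is available in GCC 11 and later and in Clang, so it would not have helped here. The SKETCH_ macros and sketch_early_sum() are hypothetical names for illustration only; on compilers without the attribute the macro expands to nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Older compilers lack __has_attribute; treat every query as "no". */
#ifndef __has_attribute
# define __has_attribute(x) 0
#endif

#if __has_attribute(no_stack_protector)
# define SKETCH_NO_STACKP __attribute__((no_stack_protector))
#else
# define SKETCH_NO_STACKP /* attribute unavailable: no effect */
#endif

/* Hypothetical early-boot helper. The local char array is the kind of
 * object that normally makes -fstack-protector add a canary check. */
SKETCH_NO_STACKP
static unsigned int sketch_early_sum(const char *src, size_t n)
{
    char buf[64];
    unsigned int sum = 0;
    size_t i;

    assert(src != NULL);
    if (n > sizeof(buf))
        n = sizeof(buf);          /* never write past the buffer */
    for (i = 0; i < n; i++) {
        buf[i] = src[i];
        sum += (unsigned char)buf[i];
    }
    return sum;
}
```

The per-file CFLAGS approach remains the more portable choice, since it works with any compiler that accepts -fno-stack-protector.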

And the kernel still hangs for me later during boot, but that's a different bug.

Comment 9 Jeremy Fitzhardinge 2009-08-18 17:53:20 UTC

Ah, I couldn't find another instance of stack-protector being disabled.

Have you reported the other bug, or is it something purely local?

Comment 10 Mark McLoughlin 2009-08-21 07:31:29 UTC

Ingo has these queued up in linux-2.6-tip.git/x86/urgent:

http://git.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commitdiff;h=ce2eef33d3
http://git.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commitdiff;h=5416c26635

Just need to re-test and close this when they make their way to rawhide

Comment 11 Jeremy Fitzhardinge 2009-08-21 07:51:04 UTC

Are you sure they work?  M A Young still reports crashes when they're applied.

Comment 12 Mark McLoughlin 2009-08-21 08:36:11 UTC

Michal says it still hangs later on, but that it's a different issue

M A Young is probably testing Dom0, maybe yet another issue?

Dunno, that's why I said we need to re-test :-)

Comment 13 Michal Schmidt 2009-08-21 10:40:19 UTC

I've described the other bug in http://lkml.org/lkml/2009/8/21/71

Comment 14 Chuck Ebbert 2009-08-25 01:08:46 UTC

Should be fixed in 2.6.31-0.173.rc7.git2, which has the two x86-tip patches plus the framebuffer fix from LKML.

Comment 15 Michal Schmidt 2009-08-25 14:30:24 UTC

2.6.31-0.173.rc7.git2 boots successfully under Xen on x86_64, but i686 still fails. Probably because load_percpu_segment(0); is under #ifdef CONFIG_X86_64 in xen_start_kernel().

Comment 16 Kevin Fenzi 2009-08-25 16:02:48 UTC

Seems to work here under an x86_64 guest. Thanks.

Comment 17 Jeremy Fitzhardinge 2009-08-25 18:15:53 UTC

(In reply to comment #15)
> 2.6.31-0.173.rc7.git2 boots successfully under Xen on x86_64, but i686 still
> fails. Probably because load_percpu_segment(0); is under #ifdef CONFIG_X86_64
> in xen_start_kernel().  

32 bit is trickier because it needs a specifically set-up GDT entry and its own segment register.  Doing this setup properly ends up calling functions with stack-protector prologs which assume the segment register is already set up.  I need to work out 1) how native does this setup, and/or 2) refactor the segment register setup so that it can avoid functions with stack-protector code.

Comment 18 Alexander Todorov 2009-08-26 12:02:59 UTC

*** Bug 519342 has been marked as a duplicate of this bug. ***

Comment 19 Alexander Todorov 2009-08-26 12:07:21 UTC

FYI: 2.6.31-0.174.rc7.git2 still fails on i386. My dom0 is recent RHEL5 and domU is F12-Alpha

Comment 20 Pasi Karkkainen 2009-08-27 17:56:55 UTC

I also tried the latest rawhide tree (2.6.31-0.174.rc7.git2.fc12.i686) with virt-install on my F11 + Xen 3.4.1 + 2.6.31-rc6 pv_ops dom0 setup, and it still crashes.

Comment 21 Jeremy Fitzhardinge 2009-08-28 02:55:21 UTC

What compiler are people using?  Using F11's gcc-4.4.1-2.fc11.x86_64, it says:

/home/jeremy/git/linux/arch/x86/Makefile:80: stack protector enabled but no compiler support

Comment 22 Michal Schmidt 2009-08-28 08:28:44 UTC

Jeremy,

Rawhide builds currently use gcc-4.4.1-6.x86_64. You can find this information in Koji build logs, e.g.: http://kojipkgs.fedoraproject.org/packages/kernel/2.6.31/0.185.rc7.git6.fc12/data/logs/x86_64/
root.log tells you the versions of the packages used in the build.
build.log has the build warnings. There was no such stack protector warning in this case.

Comment 23 Jeremy Fitzhardinge 2009-08-28 18:18:30 UTC

Where can I get this version of gcc?  "yum update --enablerepo=rawhide gcc" doesn't get me anything more recent than gcc-4.4.1-2.fc11.x86_64.  Or does 32-bit stackprotector not work in the x86-64 version of the compiler?

Comment 24 Michal Schmidt 2009-08-29 16:29:23 UTC

> Where can I get this version of gcc?  "yum update --enablerepo=rawhide gcc"
> doesn't get me anything more recent than gcc-4.4.1-2.fc11.x86_64.

Works for me, yum can see the newer version. But the gcc from Rawhide depends on newer glibc, so I do not recommend doing it.

> Or does 32-bit stackprotector not work in the x86-64 version of the compiler?  

Bug in the stack protector detection for ARCH=i386 builds on x86_64. I've sent a patch to LKML and CCed you.

Koji always builds packages using native arch toolchain, so it is not affected.

Comment 25 Jeremy Fitzhardinge 2009-09-02 17:25:59 UTC

Created attachment 359557 [details]
Set up kernel GDT early to make -fstack-protector work under Xen

This patch should comprehensively fix -fstack-protector under Xen for both 32 and 64-bit.  Please test.

Comment 26 Pasi Karkkainen 2009-09-02 18:59:33 UTC

Someone please add that bugfix patch to next rawhide kernel build so we get people to test it..

Comment 27 Justin M. Forbes 2009-09-03 18:05:07 UTC

The patch has been applied and should be available in the next rawhide kernel build.

Comment 28 Michal Schmidt 2009-09-05 18:52:23 UTC

2.6.31-0.203.rc8.git2.fc12 boots successfully as Xen domU. I've tested both i686.PAE and x86_64.

Comment 29 Pasi Karkkainen 2009-09-06 11:50:36 UTC

Seems to boot now. virt-install started f12/rawhide Xen domU installation OK, on F11 host with Xen 3.4.1-3 + pv_ops dom0 kernel + libvirt from F11 updates testing.

Installation went fine, and the installed domU seems to have 2.6.31-0.203.rc8.git2.fc12.i686.PAE kernel running.

There's a traceback on domU dmesg though.. the domU still runs fine.

Write protecting the kernel text: 4352k
Write protecting the kernel read-only data: 1800k

=============================================
[ INFO: possible recursive locking detected ]
2.6.31-0.203.rc8.git2.fc12.i686.PAE #1
---------------------------------------------
init/1 is trying to acquire lock:
 (&input_pool.lock){+.+...}, at: [<c043b30e>] __wake_up+0x2b/0x61

but task is already holding lock:
 (&input_pool.lock){+.+...}, at: [<c068e21b>] account+0x30/0xf0

other info that might help us debug this:
2 locks held by init/1:
 #0:  (&p->cred_guard_mutex){+.+.+.}, at: [<c0508756>] do_execve+0xa4/0x2ee
 #1:  (&input_pool.lock){+.+...}, at: [<c068e21b>] account+0x30/0xf0

stack backtrace:
Pid: 1, comm: init Not tainted 2.6.31-0.203.rc8.git2.fc12.i686.PAE #1
Call Trace:
 [<c08387c0>] ? printk+0x22/0x3a
 [<c0478b59>] __lock_acquire+0x7e9/0xb25
 [<c0478f4c>] lock_acquire+0xb7/0xeb
 [<c043b30e>] ? __wake_up+0x2b/0x61
 [<c043b30e>] ? __wake_up+0x2b/0x61
 [<c083b4f7>] _spin_lock_irqsave+0x45/0x89
 [<c043b30e>] ? __wake_up+0x2b/0x61
 [<c043b30e>] __wake_up+0x2b/0x61
 [<c068e2a0>] account+0xb5/0xf0
 [<c068e3ef>] extract_entropy+0x3e/0xac
 [<c0406b0b>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c04799d7>] ? lock_release+0x186/0x19f
 [<c068e56e>] get_random_bytes+0x29/0x3e
 [<c053bbd1>] load_elf_binary+0xab9/0x106c
 [<c050732d>] search_binary_handler+0xd7/0x27b
 [<c053b118>] ? load_elf_binary+0x0/0x106c
 [<c0539c76>] load_script+0x1a6/0x1c8
 [<c0507323>] ? search_binary_handler+0xcd/0x27b
 [<c0406199>] ? xen_force_evtchn_callback+0x1d/0x34
 [<c0507323>] ? search_binary_handler+0xcd/0x27b
 [<c0406b14>] ? check_events+0x8/0xc
 [<c0406b0b>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c04799d7>] ? lock_release+0x186/0x19f
 [<c050732d>] search_binary_handler+0xd7/0x27b
 [<c0539ad0>] ? load_script+0x0/0x1c8
 [<c050888b>] do_execve+0x1d9/0x2ee
 [<c0408359>] sys_execve+0x39/0x6e
 [<c0409ad0>] syscall_call+0x7/0xb
 [<c04f00d8>] ? sys_swapon+0x348/0xa98
 [<c040d76b>] ? kernel_execve+0x27/0x3e
 [<c04031e0>] ? run_init_process+0x2b/0x3e
 [<c0403275>] ? init_post+0x82/0xe9
 [<c0a9b566>] ? kernel_init+0x1f6/0x211
 [<c0a9b370>] ? kernel_init+0x0/0x211
 [<c040a6bf>] ? kernel_thread_helper+0x7/0x10

Comment 30 Chris Lalancette 2009-09-07 08:24:39 UTC

(In reply to comment #29)
> Seems to boot now. virt-install started f12/rawhide Xen domU installation OK,
> on F11 host with Xen 3.4.1-3 + pv_ops dom0 kernel + libvirt from F11 updates
> testing.
> 
> Installation went fine, and the installed domU seems to have
> 2.6.31-0.203.rc8.git2.fc12.i686.PAE kernel running.
> 
> There's a traceback on domU dmesg though.. the domU still runs fine.

It's probably worth looking through BZ quickly to see if a bug with that trace exists already, and if not, to open a new bug about it.

Thanks for the testing,
Chris Lalancette

Comment 31 Pasi Karkkainen 2009-09-08 11:41:33 UTC

OK, new bug opened: https://bugzilla.redhat.com/show_bug.cgi?id=521800