The kernel-5.6.0-0.rc0.git1.1.fc32.x86_64 kernel does not boot in my QEMU virtual machine; it panics on kernel stack corruption:
[ 0.510887] Performance Events: unsupported p6 CPU model 60 no PMU driver, software events only.
[ 0.511589] rcu: Hierarchical SRCU implementation.
[ 0.512873] NMI watchdog: Perf NMI watchdog permanently disabled
[ 0.513747] smp: Bringing up secondary CPUs ...
[ 0.514741] x86: Booting SMP configuration:
[ 0.515510] .... node #0, CPUs: #1
[ 0.062278] kvm-clock: cpu 1, msr 141a01041, secondary cpu clock
[ 0.062278] smpboot: CPU 1 Converting physical 0 to logical die 1
[ 0.062278] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1b9/0x1c0
[ 0.062278] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-0.rc0.git1.1.fc32.x86_64 #1
[ 0.062278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.
[ 0.062278] Call Trace:
[ 0.062278] dump_stack+0x8b/0xc8
[ 0.062278] panic+0x10d/0x302
[ 0.062278] ? start_secondary+0x1b9/0x1c0
[ 0.062278] __stack_chk_fail+0x15/0x20
[ 0.062278] start_secondary+0x1b9/0x1c0
[ 0.062278] secondary_startup_64+0xb6/0xc0
[ 0.062278] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1b9/0x1c0 ]---
5.5.0-0.rc6.git3.1.fc32.x86_64 kernel boots fine. I start the machine with:
qemu-system-x86_64 -machine accel=kvm -cpu Haswell-noTSX -hda /home/petr/virtual/fedora-32.x86.disk -boot c -net nic,model=virtio,macaddr=00:50:54:00:0f:00 -net tap,ifname=tapfedora32x86,script=no -vga std -m 8192 -smp 4 -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0 -monitor stdio -display curses
Tried a 5.5.y built with gcc10 on rawhide with CC_HAS_SANE_STACKPROTECTOR off, and it seems to work OK in the QEMU test.
Narrowed it down to CONFIG_STACKPROTECTOR_STRONG: with that turned off, the rawhide gcc10-built 5.6.0-0.rc0.git1.1.fc32.x86_64 works for me.
Today I learned about earlycon=efifb and can confirm that the failure is the same on real hardware
*** Bug 1797413 has been marked as a duplicate of this bug. ***
Adding Jakub to the CC as this is exclusive to GCC 10 and works fine in F31.
Given that the start_secondary function calls boot_init_stack_canary, I'd say that is clearly a kernel bug: any function in which the stack canary can change between its start and end (in the kernel's case, boot_init_stack_canary and anything that calls it) needs to have the stack protector disabled, either through the compiler command-line options (-fno-stack-protector) or e.g. using the optimize attribute __attribute__((optimize ("no-stack-protector"))) (though it seems that only works with GCC 7 or later).
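As a rough illustration of the attribute route mentioned above, the sketch below wraps it in a macro. The names NO_STACK_PROTECTOR_SKETCH, sketch_canary and reseed_canary_sketch are hypothetical and are not kernel identifiers; the version guard simply follows the "GCC 7 or later" remark. On other compilers the macro expands to nothing, and -fno-stack-protector on the command line would be needed instead.

```c
#include <stdint.h>

/*
 * Hypothetical helper macro (not a kernel identifier): apply the optimize
 * attribute only where it is expected to be understood, i.e. GCC 7 or later.
 * Elsewhere it expands to nothing.
 */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 7)
#define NO_STACK_PROTECTOR_SKETCH __attribute__((optimize("no-stack-protector")))
#else
#define NO_STACK_PROTECTOR_SKETCH
#endif

/* Illustrative stand-in for the canary; the real one is per-CPU kernel state. */
static uint64_t sketch_canary = 1;

/*
 * A function that replaces the canary should not itself be instrumented, so
 * it carries the attribute. The local buffer is the kind of object that
 * would otherwise make -fstack-protector-strong instrument this function.
 */
NO_STACK_PROTECTOR_SKETCH
static uint64_t reseed_canary_sketch(uint64_t seed)
{
    volatile unsigned char scratch[16];
    unsigned int i;

    /* Spread the seed bytes over the buffer; scratch[0] is the low byte. */
    for (i = 0; i < sizeof(scratch); i++)
        scratch[i] = (unsigned char)(seed >> (8 * (i % 8)));

    /* Install the new canary: the seed with its low byte cleared. */
    sketch_canary = seed ^ scratch[0];
    return sketch_canary;
}
```

The GCC documentation describes the optimize attribute as intended for debugging rather than production use, so this is only a way to experiment with the idea, not a recommendation for the actual fix.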
In the past you could just be lucky that nothing has been inlined into the start_secondary function that would trigger the use of stack canary in there.
If somebody attaches preprocessed smpboot.i and full gcc command line used to compile it, I can have a quick look at what changed in the inlining decisions or what are the other reasons why it now has a stack canary.
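The failure mode described in this comment can be modelled in user space. The sketch below is not kernel code: model_canary, model_init_canary and model_protected_call are hypothetical names, and the prologue/epilogue comparison is written out by hand to mimic what the compiler's instrumentation does.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU canary value. */
static uint64_t model_canary = 0x1111;

/* Stand-in for boot_init_stack_canary(): installs a new canary value. */
static void model_init_canary(uint64_t fresh)
{
    model_canary = fresh;
}

/*
 * Models an instrumented function. Returns 1 if the epilogue check would
 * pass and 0 if it would fail (the real check calls __stack_chk_fail).
 */
static int model_protected_call(int reinit_inside)
{
    /* Prologue: the instrumentation copies the canary into the frame. */
    uint64_t saved = model_canary;

    /* Body: optionally change the canary while the frame is still live. */
    if (reinit_inside)
        model_init_canary(saved ^ 0xdeadbeefULL);

    /* Epilogue: the saved copy is compared with the current canary. */
    return saved == model_canary;
}
```

Calling model_protected_call(0) passes the check, while model_protected_call(1) fails it even though nothing overwrote the stack. That corresponds to the situation in the trace, where the epilogue of start_secondary ends up in __stack_chk_fail.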
I can't boot VMs with kernels after 5.5.7-200.fc31: 5.6.0-0.rc3.git0.1 and 5.6.0-0.rc4.git0.1 hang and die.
This is under Xen 4.4 hypervisor.
(In reply to Terje Røsten from comment #7)
> I can't boot VMs with kernels after 5.5.7-200.fc31: 5.6.0-0.rc3.git0.1 and
> 5.6.0-0.rc4.git0.1 hang and die.
> This is under Xen 4.4 hypervisor.
The same stack trace? I believe you are experiencing a different bug, because Fedora 31 does not use GCC 10 for building the kernel.
Created attachment 1673338 [details]
pre-processed source file
I see the same on openSUSE kernel-default (5.5.11-5). The command line used for the file is:
gcc -Wp,-MD,arch/x86/kernel/.smpboot.o.d -nostdinc -isystem /usr/local/lib64/gcc/x86_64-pc-linux-gnu/10.0.1/include -I../arch/x86/include -I./arch/x86/include/generated -I../include -I./include -I../arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I../include/uapi -I./include/generated/uapi -include ../include/linux/kconfig.h -include ../include/linux/compiler_types.h -D__KERNEL__ -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Wno-format-security -std=gnu89 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -DCONFIG_X86_X32_ABI -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_AVX512=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -fno-jump-tables -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -Wframe-larger-than=2048 -fstack-protector-strong -Wno-unused-but-set-variable -Wimplicit-fallthrough -Wno-unused-const-variable -fno-var-tracking-assignments -g -gdwarf-4 -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -fno-inline-functions-called-once -flive-patching=inline-clone -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-stringop-truncation -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -fmacro-prefix-map=../= -fcf-protection=none -Wno-packed-not-aligned -I ../arch/x86/kernel -I ./arch/x86/kernel -DKBUILD_BASENAME='"smpboot"' -DKBUILD_MODNAME='"smpboot"' -c smpboot.i
The significant difference is that now with GCC 10 we do not inline:
I can see usage of the %gs:xyz segment register to access some data, but I don't see how the register itself is modified.
Created attachment 1673339 [details]
Assembly for start_secondary with GCC 9
Created attachment 1673340 [details]
Assembly for start_secondary with GCC 10
https://lkml.org/lkml/2020/3/17/746 contains details on what exactly is going on.
So did I understand correctly that the fix was made on Mar 17, and it is still not in 4.19.121, released on May 6?