Bug 1796780 - kernel-5.6.0-0.rc0.git1.1.fc32.x86_64 panics on boot: Kernel stack is corrupted in: start_secondary+0x1b9/0x1c0
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Duplicates: 1797413
Reported: 2020-01-31 08:19 UTC by Petr Pisar
Modified: 2021-06-12 15:26 UTC (History)

Last Closed: 2021-06-12 15:26:53 UTC
Type: Bug

Attachments
pre-processed source file (2.49 MB, text/plain), 2020-03-25 09:07 UTC, Martin Liška
Assembly for start_secondary with GCC 9 (4.10 KB, text/plain), 2020-03-25 09:07 UTC, Martin Liška
Assembly for start_secondary with GCC 10 (2.95 KB, text/plain), 2020-03-25 09:08 UTC, Martin Liška

Description Petr Pisar 2020-01-31 08:19:45 UTC
kernel-5.6.0-0.rc0.git1.1.fc32.x86_64 kernel does not boot in my QEMU virtual machine because it panics on a kernel stack corruption:

[    0.510887] Performance Events: unsupported p6 CPU model 60 no PMU driver, software events only.
[    0.511589] rcu: Hierarchical SRCU implementation.
[    0.512873] NMI watchdog: Perf NMI watchdog permanently disabled
[    0.513747] smp: Bringing up secondary CPUs ...
[    0.514741] x86: Booting SMP configuration:
[    0.515510] .... node  #0, CPUs:      #1
[    0.062278] kvm-clock: cpu 1, msr 141a01041, secondary cpu clock
[    0.062278] smpboot: CPU 1 Converting physical 0 to logical die 1
[    0.062278] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1b9/0x1c0
[    0.062278] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-0.rc0.git1.1.fc32.x86_64 #1
[    0.062278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[    0.062278] Call Trace:
[    0.062278]  dump_stack+0x8b/0xc8
[    0.062278]  panic+0x10d/0x302
[    0.062278]  ? start_secondary+0x1b9/0x1c0
[    0.062278]  __stack_chk_fail+0x15/0x20
[    0.062278]  start_secondary+0x1b9/0x1c0
[    0.062278]  secondary_startup_64+0xb6/0xc0
[    0.062278] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1b9/0x1c0 ]---

5.5.0-0.rc6.git3.1.fc32.x86_64 kernel boots fine. I start the machine with:

qemu-system-x86_64 -machine accel=kvm -cpu Haswell-noTSX -hda /home/petr/virtual/fedora-32.x86.disk -boot c -net nic,model=virtio,macaddr=00:50:54:00:0f:00 -net tap,ifname=tapfedora32x86,script=no -vga std -m 8192 -smp 4 -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0 -monitor stdio -display curses
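
A note for readers of the panic message above: kernel backtraces print each frame as symbol+offset/size, with both numbers in hexadecimal. The following is a minimal C sketch of that arithmetic, not kernel code; the struct and the helpers parse_trace_entry and bytes_before_end are hypothetical names introduced only for illustration. For start_secondary+0x1b9/0x1c0 the reported address lies 7 bytes before the end of the function, which is consistent with a stack-protector check in the function's epilogue.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one "symbol+0xOFFSET/0xSIZE" trace entry. */
struct trace_entry {
    char symbol[64];
    unsigned long offset;  /* bytes from the start of the function */
    unsigned long size;    /* total size of the function in bytes */
};

/*
 * Hypothetical helper: split a trace entry into its parts.
 * Returns 0 on success and -1 if the text does not match the format.
 */
static int parse_trace_entry(const char *text, struct trace_entry *out)
{
    assert(text != NULL && out != NULL);

    /* %63[^+] reads the symbol name up to '+'; %lx accepts a 0x prefix. */
    if (sscanf(text, "%63[^+]+%lx/%lx",
               out->symbol, &out->offset, &out->size) != 3)
        return -1;

    /* An offset past the end of the function would make no sense. */
    if (out->offset > out->size)
        return -1;

    return 0;
}

/* Distance from the reported address to the end of the function. */
static unsigned long bytes_before_end(const struct trace_entry *e)
{
    return e->size - e->offset;
}
```

Calling parse_trace_entry("start_secondary+0x1b9/0x1c0", &e) and then bytes_before_end(&e) yields 7.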

Comment 1 Yanko Kaneti 2020-01-31 09:24:52 UTC
Tried a 5.5.y kernel built with gcc10 on rawhide and with CC_HAS_SANE_STACKPROTECTOR off, and it seems to work OK in the QEMU test.

Comment 2 Yanko Kaneti 2020-01-31 13:03:15 UTC
Narrowed it down to CONFIG_STACKPROTECTOR_STRONG: with that turned off, a rawhide gcc10-built 5.6.0-0.rc0.git1.1.fc32.x86_64 works for me.

Comment 3 Yanko Kaneti 2020-02-03 09:16:56 UTC
Today I learned about earlycon=efifb and can confirm that the failure is the same on real hardware

Comment 4 Yanko Kaneti 2020-02-03 09:41:33 UTC
*** Bug 1797413 has been marked as a duplicate of this bug. ***

Comment 5 Justin M. Forbes 2020-02-04 21:17:01 UTC
Adding Jakub to the CC as this is exclusive to GCC 10 and works fine in F31.

Comment 6 Jakub Jelinek 2020-02-05 15:08:39 UTC
Given that the start_secondary function calls boot_init_stack_canary, I'd say that is a clear kernel bug: any function for which the stack canary can change between its start and its end (in the kernel's case, boot_init_stack_canary and anything that calls it) needs to have the stack protector disabled, either with the compiler command-line option -fno-stack-protector or with the optimize attribute, e.g. __attribute__((optimize ("no-stack-protector"))) (though that seems to work only with GCC 7 or later).
In the past you could just be lucky that nothing was inlined into start_secondary that would trigger the use of a stack canary there.
If somebody attaches the preprocessed smpboot.i and the full gcc command line used to compile it, I can have a quick look at what changed in the inlining decisions, or what other reasons there are for it now having a stack canary.
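
The mechanism described in this comment can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it does not use the real compiler instrumentation or any kernel code, and model_canary, model_init_canary, model_protected_call and model_demo are hypothetical names. The model copies the canary at function entry, optionally replaces the global canary in the middle (standing in for boot_init_stack_canary), and compares at exit, as an instrumented function's epilogue would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only: stands in for the canary value the kernel keeps. */
static uint64_t model_canary = 0x1111111111111111ULL;

/* Stands in for boot_init_stack_canary(): installs a new canary value. */
static void model_init_canary(uint64_t fresh)
{
    model_canary = fresh;
}

/*
 * Models a function built with the stack protector enabled.
 * The prologue copies the canary into the frame; the epilogue compares
 * that copy with the current canary. Returns true if the check passes.
 */
static bool model_protected_call(bool reinit_canary_inside)
{
    uint64_t saved = model_canary;              /* prologue: save canary */

    if (reinit_canary_inside)
        model_init_canary(saved ^ 0xdeadbeefULL); /* canary changes mid-function */

    return saved == model_canary;               /* epilogue: compare */
}

/* Small self-check showing both outcomes. */
static void model_demo(void)
{
    assert(model_protected_call(false));  /* canary untouched: check passes */
    assert(!model_protected_call(true));  /* canary replaced: check fails */
}
```

In the model, the second case fails although nothing overwrote the stack; the saved copy simply no longer matches the new canary. That is why a function in this position needs the check disabled, or must never reach its epilogue check.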

Comment 7 Terje Røsten 2020-03-05 14:10:41 UTC
I can't boot VMs with kernels after 5.5.7-200.fc31: 5.6.0-0.rc3.git0.1 and 5.6.0-0.rc4.git0.1 hang and die.
This is under Xen 4.4 hypervisor.

Comment 8 Petr Pisar 2020-03-06 09:54:07 UTC
(In reply to Terje Røsten from comment #7)
> I can't boot VMs with kernels after 5.5.7-200.fc31: 5.6.0-0.rc3.git0.1 and
> 5.6.0-0.rc4.git0.1 hang and die.
> This is under Xen 4.4 hypervisor.

Is it the same stack trace? I believe you are experiencing a different bug, because Fedora 31 does not use GCC 10 to build the kernel.

Comment 9 Martin Liška 2020-03-25 09:07:11 UTC
Created attachment 1673338
pre-processed source file

I see the same on openSUSE kernel-default (5.5.11-5). The command line used for the file is:

gcc -Wp,-MD,arch/x86/kernel/.smpboot.o.d  -nostdinc -isystem /usr/local/lib64/gcc/x86_64-pc-linux-gnu/10.0.1/include -I../arch/x86/include -I./arch/x86/include/generated -I../include -I./include -I../arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I../include/uapi -I./include/generated/uapi -include ../include/linux/kconfig.h -include ../include/linux/compiler_types.h -D__KERNEL__ -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Wno-format-security -std=gnu89 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -DCONFIG_X86_X32_ABI -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_AVX512=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -fno-jump-tables -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -Wframe-larger-than=2048 -fstack-protector-strong -Wno-unused-but-set-variable -Wimplicit-fallthrough -Wno-unused-const-variable -fno-var-tracking-assignments -g -gdwarf-4 -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -fno-inline-functions-called-once -flive-patching=inline-clone -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-stringop-truncation -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -fmacro-prefix-map=../= -fcf-protection=none -Wno-packed-not-aligned -I ../arch/x86/kernel -I ./arch/x86/kernel    -DKBUILD_BASENAME='"smpboot"' -DKBUILD_MODNAME='"smpboot"' -c smpboot.i

The significant difference is that now with GCC 10 we do not inline:
call	smp_callin

I can see the %gs segment register being used (%gs:xyz) to access some data, but I don't see how the register itself is modified.

Comment 10 Martin Liška 2020-03-25 09:07:48 UTC
Created attachment 1673339
Assembly for start_secondary with GCC 9

Comment 11 Martin Liška 2020-03-25 09:08:05 UTC
Created attachment 1673340
Assembly for start_secondary with GCC 10

Comment 12 Jakub Jelinek 2020-03-25 09:31:23 UTC
https://lkml.org/lkml/2020/3/17/746 contains details on what exactly is going on.

Comment 13 Sami Farin 2020-05-07 10:07:51 UTC
So did I understand correctly that the fix was made on Mar 17, and it is still not in 4.19.121, released on May 6?

