Bug 1796780

Summary: kernel-5.6.0-0.rc0.git1.1.fc32.x86_64 panics on boot: Kernel stack is corrupted in: start_secondary+0x1b9/0x1c0
Product: [Fedora] Fedora
Component: kernel
Version: rawhide
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Petr Pisar <ppisar>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: airlied, atu, bskeggs, extras-qa, hdegoede, hvtaifwkbgefbaei, ichavero, itamar, jakub, jarodwilson, jeremy, jforbes, jglisse, john.j5live, jonathan, josef, jpazdziora, j, kernel-maint, linville, masami256, mchehab, mikhail.v.gavrilov, mjg59, mliska, omosnace, pbrobinson, rjones, steved, terje.rosten, vashirov, yaneti
Last Closed: 2021-06-12 15:26:53 UTC
Attachments (description, flags):
pre-processed source file, none
Assembly for start_secondary with GCC 9, none
Assembly for start_secondary with GCC 10, none

Description Petr Pisar 2020-01-31 08:19:45 UTC
kernel-5.6.0-0.rc0.git1.1.fc32.x86_64 kernel does not boot in my QEMU virtual machine because it panics on a kernel stack corruption:

[    0.510887] Performance Events: unsupported p6 CPU model 60 no PMU driver, software events only.
[    0.511589] rcu: Hierarchical SRCU implementation.
[    0.512873] NMI watchdog: Perf NMI watchdog permanently disabled
[    0.513747] smp: Bringing up secondary CPUs ...
[    0.514741] x86: Booting SMP configuration:
[    0.515510] .... node  #0, CPUs:      #1
[    0.062278] kvm-clock: cpu 1, msr 141a01041, secondary cpu clock
[    0.062278] smpboot: CPU 1 Converting physical 0 to logical die 1
[    0.062278] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1b9/0x1c0
[    0.062278] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-0.rc0.git1.1.fc32.x86_64 #1
[    0.062278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[    0.062278] Call Trace:
[    0.062278]  dump_stack+0x8b/0xc8
[    0.062278]  panic+0x10d/0x302
[    0.062278]  ? start_secondary+0x1b9/0x1c0
[    0.062278]  __stack_chk_fail+0x15/0x20
[    0.062278]  start_secondary+0x1b9/0x1c0
[    0.062278]  secondary_startup_64+0xb6/0xc0
[    0.062278] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary+0x1b9/0x1c0 ]---

5.5.0-0.rc6.git3.1.fc32.x86_64 kernel boots fine. I start the machine with:

qemu-system-x86_64 -machine accel=kvm -cpu Haswell-noTSX -hda /home/petr/virtual/fedora-32.x86.disk -boot c -net nic,model=virtio,macaddr=00:50:54:00:0f:00 -net tap,ifname=tapfedora32x86,script=no -vga std -m 8192 -smp 4 -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0 -monitor stdio -display curses

Comment 1 Yanko Kaneti 2020-01-31 09:24:52 UTC
Tried a 5.5.y kernel built with gcc10 on rawhide and with CC_HAS_SANE_STACKPROTECTOR off, and it seems to work OK in the QEMU test.

Comment 2 Yanko Kaneti 2020-01-31 13:03:15 UTC
Narrowed it down to CONFIG_STACKPROTECTOR_STRONG: with that turned off, a rawhide gcc10-built 5.6.0-0.rc0.git1.1.fc32.x86_64 works for me.

Comment 3 Yanko Kaneti 2020-02-03 09:16:56 UTC
Today I learned about earlycon=efifb and can confirm that the failure is the same on real hardware

Comment 4 Yanko Kaneti 2020-02-03 09:41:33 UTC
*** Bug 1797413 has been marked as a duplicate of this bug. ***

Comment 5 Justin M. Forbes 2020-02-04 21:17:01 UTC
Adding Jakub to the CC as this is exclusive to GCC 10 and works fine in F31.

Comment 6 Jakub Jelinek 2020-02-05 15:08:39 UTC
Given that the start_secondary function calls boot_init_stack_canary, I'd say that is a clear kernel bug. Any function for which the stack canary can change between its start and end
(in the kernel's case, boot_init_stack_canary and anything that calls it) needs to have the stack protector disabled, either through compiler command-line options (-fno-stack-protector) or, e.g., with the optimize attribute
__attribute__((optimize ("no-stack-protector"))) (though it seems that only works with GCC 7 or later).
In the past you could just be lucky that nothing was inlined into the start_secondary function that would trigger the use of a stack canary there.
If somebody attaches the preprocessed smpboot.i and the full gcc command line used to compile it, I can have a quick look at what changed in the inlining decisions, or what other reasons there are for it now having a stack canary.
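
For readers less familiar with the mechanism described in this comment, here is a minimal userspace sketch in C. It is a model, not kernel code: model_canary, model_boot_init_stack_canary and the two model_*_caller functions are hypothetical names invented for illustration, and the "prologue" and "epilogue" are written out by hand rather than emitted by the compiler. The MODEL_NO_STACK_PROTECTOR macro wraps the optimize attribute quoted above and is assumed to be meaningful only on GCC 7 or later; elsewhere it expands to nothing.

```c
#include <stdbool.h>
#include <stdint.h>

/* The attribute quoted in the comment above; treated as GCC-only here
 * and defined away on other compilers (an assumption of this sketch). */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 7)
#define MODEL_NO_STACK_PROTECTOR __attribute__((optimize("no-stack-protector")))
#else
#define MODEL_NO_STACK_PROTECTOR
#endif

/* Hypothetical stand-in for the canary value the compiler reads. */
static uint64_t model_canary = 0x1111111111111111ULL;

/* Hypothetical stand-in for boot_init_stack_canary(): it installs a new
 * canary value. Adding a non-zero constant always changes the value. */
static void model_boot_init_stack_canary(void)
{
    model_canary += 0x9e3779b97f4a7c15ULL;
}

/* Models a function compiled WITH the stack protector: the prologue saves
 * a copy of the canary, the epilogue compares it with the current value.
 * Returns true when the epilogue check would fail, i.e. when the real
 * code would end up in __stack_chk_fail(). */
static bool model_protected_caller_trips_check(void)
{
    uint64_t saved = model_canary;      /* hand-written "prologue" */
    model_boot_init_stack_canary();     /* canary changes mid-function */
    return saved != model_canary;       /* hand-written "epilogue" check */
}

/* Models the same function compiled WITHOUT the stack protector: no copy
 * is saved, so there is nothing to compare and the check cannot fail. */
MODEL_NO_STACK_PROTECTOR
static bool model_unprotected_caller_trips_check(void)
{
    model_boot_init_stack_canary();
    return false;
}
```

In this model the protected caller always reports a failed check although nothing on its stack was overwritten, which matches the false "Kernel stack is corrupted" report in the panic; the unprotected caller never does.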

Comment 7 Terje Røsten 2020-03-05 14:10:41 UTC
I can't boot VMs with kernels after 5.5.7-200.fc31: 5.6.0-0.rc3.git0.1 and 5.6.0-0.rc4.git0.1 hangs and dies.
This is under Xen 4.4 hypervisor.

Comment 8 Petr Pisar 2020-03-06 09:54:07 UTC
(In reply to Terje Røsten from comment #7)
> I can't boot VMs with kernels after 5.5.7-200.fc31: 5.6.0-0.rc3.git0.1 and
> 5.6.0-0.rc4.git0.1 hangs and dies.
> This is under Xen 4.4 hypervisor.

The same stack trace? I believe you are experiencing a different bug because Fedora 31 does not use GCC 10 for building the kernel.

Comment 9 Martin Liška 2020-03-25 09:07:11 UTC
Created attachment 1673338 [details]
pre-processed source file

I see the same on openSUSE kernel-default (5.5.11-5). The command line used for the file is:

gcc -Wp,-MD,arch/x86/kernel/.smpboot.o.d  -nostdinc -isystem /usr/local/lib64/gcc/x86_64-pc-linux-gnu/10.0.1/include -I../arch/x86/include -I./arch/x86/include/generated -I../include -I./include -I../arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I../include/uapi -I./include/generated/uapi -include ../include/linux/kconfig.h -include ../include/linux/compiler_types.h -D__KERNEL__ -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Wno-format-security -std=gnu89 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -DCONFIG_X86_X32_ABI -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_AVX512=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -fno-jump-tables -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -Wframe-larger-than=2048 -fstack-protector-strong -Wno-unused-but-set-variable -Wimplicit-fallthrough -Wno-unused-const-variable -fno-var-tracking-assignments -g -gdwarf-4 -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -fno-inline-functions-called-once -flive-patching=inline-clone -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-stringop-truncation -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -fmacro-prefix-map=../= -fcf-protection=none -Wno-packed-not-aligned -I ../arch/x86/kernel -I ./arch/x86/kernel    -DKBUILD_BASENAME='"smpboot"' -DKBUILD_MODNAME='"smpboot"' -c smpboot.i

The significant difference is that now with GCC 10 we do not inline:
call	smp_callin

I can see the %gs:xyz segment register being used to access some data, but I don't see how the register itself is modified.

Comment 10 Martin Liška 2020-03-25 09:07:48 UTC
Created attachment 1673339 [details]
Assembly for start_secondary with GCC 9

Comment 11 Martin Liška 2020-03-25 09:08:05 UTC
Created attachment 1673340 [details]
Assembly for start_secondary with GCC 10

Comment 12 Jakub Jelinek 2020-03-25 09:31:23 UTC
https://lkml.org/lkml/2020/3/17/746 contains details on what exactly is going on.

Comment 13 Sami Farin 2020-05-07 10:07:51 UTC
So did I understand correctly that the fix was made on Mar 17, and it is still not in 4.19.121, released on May 6?