Bug 407281

Summary: gcc produces unbootable kernels
Product: [Fedora] Fedora
Reporter: Pierre Ossman <pierre-bugzilla>
Component: gcc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: urgent
Version: rawhide
CC: jwboyer, lkundrak, marc.c.dionne, paul, quentin, selinux, zing
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2007-12-03 13:51:36 UTC
Attachments:
  pre-processed hrtimer.c (flags: none)
  Extracted assembly code for hrtimer_forward, gcc and cpp 4.1.2-33 (flags: none)
  Extracted assembly code for hrtimer_forward, gcc and cpp 4.1.2-34 (flags: none)

Description Pierre Ossman 2007-12-01 14:47:38 UTC
The latest GCC in Fedora rawhide contains a serious bug (or provokes a latent
one in the kernel) that makes every kernel built with it unbootable. The kernel
just locks up halfway through init. Kernels that previously worked fine now all
show the same symptom. Even RH's own kernels exhibit this: the kernel built
Nov 24th works, the one built Nov 26th doesn't. gcc was updated on the 26th,
14 hours earlier.

The last message printed is:

isapnp: Scanning for PnP cards...

For comparison, the working kernel continues past that message as follows:

isapnp: Scanning for PnP cards...
Switched to high resolution mode on CPU 0
isapnp: No Plug & Play device found

Please revert the last changes until this can be resolved.

Comment 1 Jakub Jelinek 2007-12-01 15:18:40 UTC
Reverting is the wrong thing to do; there were dozens of changes.
It would be much better if somebody who can reproduce this did a binary search
between objects compiled with the older and newer gcc, tracked this down to at
least one single .o file, and then attached here the preprocessed source for
that file and the exact options passed to gcc to compile it.
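
The binary search between objects suggested in this comment could be sketched
roughly as below. This is only an illustration under stated assumptions: the
names first_bad_object, prefix_fails_fn and simulated_prefix_fails are
hypothetical, the real "test" is a manual relink-and-reboot step rather than a
function call, and it assumes a kernel built entirely with the old gcc boots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: link a kernel in which objects [0, k) were compiled
 * with the new gcc and objects [k, n) with the old one, boot it, and return
 * nonzero if the boot fails. In practice this is done by hand. */
typedef int (*prefix_fails_fn)(size_t k, void *ctx);

/* Returns the index of the first object whose new-gcc build breaks the boot,
 * or n if even the all-new kernel boots. Needs about log2(n) test boots. */
size_t first_bad_object(size_t n, prefix_fails_fn prefix_fails, void *ctx)
{
    size_t lo = 0;  /* invariant: a kernel with prefix lo from new gcc boots */
    size_t hi = n;  /* invariant: a kernel with prefix hi from new gcc fails */

    assert(prefix_fails != NULL);
    if (!prefix_fails(n, ctx))
        return n;   /* nothing to find: the all-new kernel boots */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (prefix_fails(mid, ctx))
            hi = mid;
        else
            lo = mid;
    }
    /* Prefix lo boots and prefix lo + 1 fails, so object lo is the culprit. */
    return lo;
}

/* Stand-in for the manual boot test, for trying the search out: ctx points
 * at the index of the one object assumed to be miscompiled. */
int simulated_prefix_fails(size_t k, void *ctx)
{
    size_t bad = *(size_t *)ctx;
    return k > bad;
}
```

If more than one object is affected, a search of this shape still converges on
the lowest-numbered one; the remaining objects would need another pass.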

Comment 2 Pierre Ossman 2007-12-01 16:23:28 UTC
I'm working on pinpointing what it breaks right now. To understand how, I'd need
the old gcc. Any binary RPM still stowed away somewhere?


Comment 3 Jakub Jelinek 2007-12-01 16:57:19 UTC
gcc-4.1.2-33 in F8 GA is the predecessor of -34.

Comment 4 Marc Dionne 2007-12-01 19:24:15 UTC
I have the same issue, and have narrowed the boot failure down to the
kernel/hrtimer.o object file.  If I replace it with one compiled with
gcc-4.1.2-33, the system boots normally, but once in X things are VERY
sluggish, so my guess is that other files are affected, maybe in the scheduler.

If it helps I can provide the bad and good hrtimer.o object files.

Comment 5 Marc Dionne 2007-12-01 20:18:07 UTC
Created attachment 274801 [details]
pre-processed hrtimer.c

Comment 6 Marc Dionne 2007-12-01 20:22:45 UTC
Attached pre-processed hrtimer.i file

From looking at the differences between the 4.1.2-33 and 4.1.2-34 versions of
the object file, it looks like the differences are mainly in the hrtimer_forward
function.

gcc command line is as follows:

  gcc -m32 -Wp,-MD,kernel/.hrtimer.o.d -nostdinc \
    -isystem /usr/lib/gcc/i386-redhat-linux/4.1.2/include \
    -D__KERNEL__ -Iinclude -include include/linux/autoconf.h \
    -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
    -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration \
    -save-temps -Os -pipe -msoft-float -mregparm=3 -freg-struct-return \
    -mpreferred-stack-boundary=2 -march=k8 -mtune=generic -ffreestanding \
    -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 \
    -Iinclude/asm-x86/mach-generic -Iinclude/asm-x86/mach-default \
    -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -fno-stack-protector \
    -Wdeclaration-after-statement -Wno-pointer-sign -D"KBUILD_STR(s)=#s" \
    -D"KBUILD_BASENAME=KBUILD_STR(hrtimer)" -D"KBUILD_MODNAME=KBUILD_STR(hrtimer)" \
    -c -o kernel/.tmp_hrtimer.o kernel/hrtimer.c


Comment 7 Pierre Ossman 2007-12-01 20:25:52 UTC
For me the place it gets stuck is in tick_setup_sched_timer() in
kernel/time/tick-sched.c. It loops forever trying to calibrate the timer tick.

The problem seems to be rather generic in that gcc somehow interferes with the
kernel's ability to properly calibrate time sources. Unfortunately I haven't
kept up with all the new timing stuff, so I don't know which base value is
being set up incorrectly.

Comment 8 Jakub Jelinek 2007-12-01 20:55:17 UTC
If it is really hrtimer_forward that is miscompiled (or buggy), then most
probably the relevant change is either http://gcc.gnu.org/PR33723 or
http://gcc.gnu.org/PR34146 since the function uses compound literals heavily.
That said, so far I don't see anything obviously wrong in the generated code, so
I'd appreciate any help in finding which exact function is problematic.
If you leave out -g, then there doesn't seem to be any difference in .L* label
assignment between the two compilers, so it is easy to do a binary search
between functions.
Just compile with -S instead of -g -c (all other options kept unchanged) with
both compilers into hrtimer.s{33,34}, and then each time copy some functions
from hrtimer.s33 and the rest of the functions from hrtimer.s34 into a new
hrtimer.s, assemble it and test.  Thanks.


Comment 9 Marc Dionne 2007-12-01 23:07:00 UTC
I can confirm that replacing just the hrtimer_forward function's assembler code
with the one produced by 4.1.2-34 makes a bootable kernel non-bootable.

I'll attach the 4.1.2-33 and 4.1.2-34 versions of that function's assembler code.

Comment 10 Marc Dionne 2007-12-01 23:09:56 UTC
Created attachment 274881 [details]
Extracted assembly code for hrtimer_forward, gcc and cpp 4.1.2-33

Comment 11 Marc Dionne 2007-12-01 23:10:48 UTC
Created attachment 274891 [details]
Extracted assembly code for hrtimer_forward, gcc and cpp 4.1.2-34

Comment 12 Jakub Jelinek 2007-12-02 01:05:20 UTC
typedef union { long long int s; } U;
typedef struct { U u; } S;

void foo (S *s, long long int x, unsigned long int y)
{
  s->u = ({ (U) { .s = s->u.s + x * y }; });
}

is a minimal testcase. gcc 4.3 doesn't exhibit this bug, so it is either a bug
in the backport or some latent bug.  Will debug either tomorrow or on Monday.
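
One way to exercise the testcase at run time is to compare it against the same
arithmetic written as a plain assignment. The following is a sketch, not
something taken from the thread: foo_reference, foo_matches_reference and
check_foo are hypothetical helpers, the noinline attribute is added only so
the call is not folded away, and the thread does not say which inputs (if any)
make the 4.1.2-34 miscompilation visible. It should be built with the same
options as the kernel object (at least -m32 -Os) to be meaningful.

```c
#include <assert.h>

typedef union { long long int s; } U;
typedef struct { U u; } S;

/* Body as in the testcase: a GNU C statement expression wrapping a compound
 * literal. noinline is an addition to keep the compiler from folding the call. */
__attribute__((noinline))
void foo (S *s, long long int x, unsigned long int y)
{
  s->u = ({ (U) { .s = s->u.s + x * y }; });
}

/* Reference version (hypothetical): the same arithmetic without the compound
 * literal or the statement expression. */
__attribute__((noinline))
void foo_reference (S *s, long long int x, unsigned long int y)
{
  s->u.s = s->u.s + x * y;
}

/* Returns 1 if both versions produce the same value for the given inputs. */
int foo_matches_reference (long long int start, long long int x,
                           unsigned long int y)
{
  S a, b;
  a.u.s = start;
  b.u.s = start;
  foo (&a, x, y);
  foo_reference (&b, x, y);
  return a.u.s == b.u.s;
}

/* Aborts if the two versions disagree. The inputs are chosen so that the
 * product does not fit in 32 bits. */
void check_foo (void)
{
  assert (foo_matches_reference (5, 0x100000001LL, 3UL));
}
```

Calling check_foo() from main() on a correctly working compiler should return
silently. A mismatch would point at the compound-literal path, but a pass does
not prove the compiler is unaffected.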

Comment 13 Jakub Jelinek 2007-12-03 09:55:58 UTC
Please retry with gcc-4.1.2-35, I've backported there two bugfixes from the
trunk that should cure this.

Comment 14 Pierre Ossman 2007-12-03 13:45:51 UTC
-35 works nicely. I'm now running 2.6.24-rc3 built with the updated gcc.

Good work :)

Comment 15 Josh Boyer 2007-12-03 16:31:47 UTC
*** Bug 408101 has been marked as a duplicate of this bug. ***

Comment 16 Lubomir Kundrak 2007-12-04 07:38:20 UTC
*** Bug 406831 has been marked as a duplicate of this bug. ***