Bug 1108219 - 4.9.0-6 broke build of postgresql on AArch64
Summary: 4.9.0-6 broke build of postgresql on AArch64
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: gcc
Version: rawhide
Hardware: aarch64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ARM64, F-ExcludeArch-aarch64
Reported: 2014-06-11 14:55 UTC by Marcin Juszkiewicz
Modified: 2014-06-24 15:55 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-24 15:55:08 UTC
Type: Bug
Embargoed:


Attachments
archive with problematic file (30.55 KB, application/x-xz)
2014-06-13 12:18 UTC, Marcin Juszkiewicz
preprocessed source (385.93 KB, text/x-csrc)
2014-06-13 12:23 UTC, Marcin Juszkiewicz
rh1108219.c (6.62 KB, text/plain)
2014-06-13 13:47 UTC, Jakub Jelinek

Description Marcin Juszkiewicz 2014-06-11 14:55:54 UTC
Description of problem:

During the mass rebuild, postgresql 9.3.4-6.fc21 failed to build on AArch64. The previous version built fine on 5 June.

I built it with gcc 4.9.0-6 and it failed; the same happened with 4.9.0-8. To check, I did a 4.9.0-8 build and then replaced ONLY the gcc-related packages with 4.9.0-5 (which was current when the previous koji build went fine). Package got 

Version-Release number of selected component (if applicable):

4.9.0-6 4.9.0-8

How reproducible:

always

Steps to Reproduce:
1. build postgresql on aarch64

Actual results:

parallel group (2 tests):  create_view create_index
     create_index             ... FAILED (test process exited with exit code 2)
     create_view              ... ok
parallel group (11 tests, in groups of 5):  create_aggregate triggers create_function_3 constraints create_cast vacuum create_table_like typed_table inherit drop_if_exists updatable_views
     create_aggregate         ... FAILED (test process exited with exit code 2)
     create_function_3        ... FAILED (test process exited with exit code 2)
     create_cast              ... FAILED (test process exited with exit code 2)
     constraints              ... FAILED (test process exited with exit code 2)
     triggers                 ... FAILED (test process exited with exit code 2)
     inherit                  ... FAILED (test process exited with exit code 2)
     create_table_like        ... FAILED (test process exited with exit code 2)
     typed_table              ... FAILED (test process exited with exit code 2)
     vacuum                   ... FAILED (test process exited with exit code 2)
     drop_if_exists           ... FAILED (test process exited with exit code 2)
     updatable_views          ... FAILED (test process exited with exit code 2)
test sanity_check             ... FAILED (test process exited with exit code 2)
test errors                   ... FAILED (test process exited with exit code 2)
test select                   ... FAILED (test process exited with exit code 2)
parallel group (20 tests, in groups of 5):  select_into select_having select_implicit select_distinct select_distinct_on union join subselect case aggregates transactions arrays random portals btree_index namespace hash_index prepared_xacts update delete
     select_into              ... FAILED (test process exited with exit code 2)
     select_distinct          ... FAILED (test process exited with exit code 2)
     select_distinct_on       ... FAILED (test process exited with exit code 2)
     select_implicit          ... FAILED (test process exited with exit code 2)
     select_having            ... FAILED (test process exited with exit code 2)
     subselect                ... FAILED (test process exited with exit code 2)
     union                    ... FAILED (test process exited with exit code 2)
     case                     ... FAILED (test process exited with exit code 2)
     join                     ... FAILED (test process exited with exit code 2)
     aggregates               ... FAILED (test process exited with exit code 2)
     transactions             ... FAILED (test process exited with exit code 2)
     random                   ... failed (ignored) (test process exited with exit code 2)
     portals                  ... FAILED (test process exited with exit code 2)
     arrays                   ... FAILED (test process exited with exit code 2)
     btree_index              ... FAILED (test process exited with exit code 2)
     hash_index               ... FAILED (test process exited with exit code 2)
     update                   ... FAILED (test process exited with exit code 2)
     namespace                ... FAILED (test process exited with exit code 2)
     prepared_xacts           ... FAILED (test process exited with exit code 2)
     delete                   ... FAILED (test process exited with exit code 2)
parallel group (4 tests):  collate privileges security_label matview
     privileges               ... FAILED (test process exited with exit code 2)
     security_label           ... FAILED (test process exited with exit code 2)
     collate                  ... FAILED (test process exited with exit code 2)
     matview                  ... FAILED (test process exited with exit code 2)
parallel group (3 tests):  alter_generic psql misc
     alter_generic            ... FAILED (test process exited with exit code 2)
     misc                     ... FAILED (test process exited with exit code 2)
     psql                     ... FAILED (test process exited with exit code 2)
test rules                    ... FAILED (test process exited with exit code 2)
test event_trigger            ... FAILED (test process exited with exit code 2)
parallel group (16 tests, in groups of 5):  portals_p2 cluster foreign_key select_views dependency combocid bitmapops tsdicts tsearch guc foreign_data functional_deps advisory_lock window xmlmap json
     select_views             ... FAILED (test process exited with exit code 2)
     portals_p2               ... FAILED (test process exited with exit code 2)
     foreign_key              ... FAILED (test process exited with exit code 2)
     cluster                  ... FAILED (test process exited with exit code 2)
     dependency               ... FAILED (test process exited with exit code 2)
     guc                      ... FAILED (test process exited with exit code 2)
     bitmapops                ... FAILED (test process exited with exit code 2)
     combocid                 ... FAILED (test process exited with exit code 2)
     tsearch                  ... FAILED (test process exited with exit code 2)
     tsdicts                  ... FAILED (test process exited with exit code 2)
     foreign_data             ... FAILED (test process exited with exit code 2)
     window                   ... FAILED (test process exited with exit code 2)
     xmlmap                   ... FAILED (test process exited with exit code 2)
     functional_deps          ... FAILED (test process exited with exit code 2)
     advisory_lock            ... FAILED (test process exited with exit code 2)
     json                     ... FAILED (test process exited with exit code 2)
parallel group (19 tests, in groups of 5):  limit plancache temp copy2 plpgsql prepare conversion without_oid domain rangefuncs sequence rowtypes truncate polymorphism alter_table returning xml largeobject with
     plancache                ... ok
     limit                    ... ok
     plpgsql                  ... ok
     copy2                    ... ok
     temp                     ... ok
     domain                   ... ok
     rangefuncs               ... ok
     prepare                  ... ok
     without_oid              ... ok
     conversion               ... ok
     truncate                 ... ok
     alter_table              ... ok
     sequence                 ... ok
     polymorphism             ... ok
     rowtypes                 ... FAILED
     returning                ... ok
     largeobject              ... ok
     with                     ... FAILED
     xml                      ... ok
test stats                    ... ok
============== shutting down postmaster               ==============
======================================================
 62 of 136 tests failed, 1 of these failures ignored. 
======================================================

Expected results:

     create_index             ... ok
     create_view              ... ok
parallel group (11 tests, in groups of 5):  create_aggregate create_cast create_function_3 constraints triggers drop_if_exists typed_table vacuum create_table_like inherit updatable_views
     create_aggregate         ... ok
     create_function_3        ... ok
     create_cast              ... ok
     constraints              ... ok
     triggers                 ... ok
     inherit                  ... ok
     create_table_like        ... ok
     typed_table              ... ok
     vacuum                   ... ok
     drop_if_exists           ... ok
     updatable_views          ... ok
test sanity_check             ... ok
test errors                   ... ok
test select                   ... ok
parallel group (20 tests, in groups of 5):  select_distinct_on select_distinct select_having select_implicit select_into case union subselect aggregates join btree_index random transactions portals arrays delete namespace update hash_index prepared_xacts
     select_into              ... ok
     select_distinct          ... ok
     select_distinct_on       ... ok
     select_implicit          ... ok
     select_having            ... ok
     subselect                ... ok
     union                    ... ok
     case                     ... ok
     join                     ... ok
     aggregates               ... ok
     transactions             ... ok
     random                   ... ok
     portals                  ... ok
     arrays                   ... ok
     btree_index              ... ok
     hash_index               ... ok
     update                   ... ok
     namespace                ... ok
     prepared_xacts           ... ok
     delete                   ... ok
parallel group (4 tests):  security_label collate privileges matview
     privileges               ... ok
     security_label           ... ok
     collate                  ... ok
     matview                  ... ok
parallel group (3 tests):  psql alter_generic misc
     alter_generic            ... ok
     misc                     ... ok
     psql                     ... ok
test rules                    ... ok
test event_trigger            ... ok
parallel group (16 tests, in groups of 5):  portals_p2 dependency cluster select_views foreign_key combocid tsdicts guc tsearch bitmapops advisory_lock xmlmap functional_deps window foreign_data json
     select_views             ... ok
     portals_p2               ... ok
     foreign_key              ... ok
     cluster                  ... ok
     dependency               ... ok
     guc                      ... ok
     bitmapops                ... ok
     combocid                 ... ok
     tsearch                  ... ok
     tsdicts                  ... ok
     foreign_data             ... ok
     window                   ... ok
     xmlmap                   ... ok
     functional_deps          ... ok
     advisory_lock            ... ok
     json                     ... ok
parallel group (19 tests, in groups of 5):  limit plancache temp copy2 plpgsql prepare conversion without_oid domain rangefuncs sequence rowtypes truncate polymorphism alter_table returning xml with largeobject
     plancache                ... ok
     limit                    ... ok
     plpgsql                  ... ok
     copy2                    ... ok
     temp                     ... ok
     domain                   ... ok
     rangefuncs               ... ok
     prepare                  ... ok
     without_oid              ... ok
     conversion               ... ok
     truncate                 ... ok
     alter_table              ... ok
     sequence                 ... ok
     polymorphism             ... ok
     rowtypes                 ... ok
     returning                ... ok
     largeobject              ... ok
     with                     ... ok
     xml                      ... ok
test stats                    ... ok
============== shutting down postmaster               ==============
=======================
 All 136 tests passed. 
=======================

Additional info:

Comment 1 Marcin Juszkiewicz 2014-06-11 14:57:41 UTC
create_index generates this kernel message (which indicates a bad pointer):

[26985.027982] postgres[24006]: unhandled level 3 translation fault (11) at 0x13cc70b4, esr 0x92000007
[26985.037013] pgd = fffffe016dc20000
[26985.040399] [13cc70b4] *pgd=0000004156ea0003, *pmd=0000004156ea0003, *pte=0000000000000000

[26985.050155] CPU: 1 PID: 24006 Comm: postgres Tainted: GF            3.13.0-0.rc7.33.sa2.aarch64 #1
[26985.059072] task: fffffe03ff74cd00 ti: fffffe03a5100000 task.ti: fffffe03a5100000
[26985.066526] PC is at 0x4a5a2c
[26985.069476] LR is at 0x4a58f8
[26985.072431] pc : [<00000000004a5a2c>] lr : [<00000000004a58f8>] pstate: 80000000
[26985.079787] sp : 000003ffcd808a60
[26985.083086] x29: 000003ffcd808a60 x28: 000003ffcd809020 
[26985.088390] x27: 000003ffa2b5ce48 x26: 0000000000000000 
[26985.093696] x25: 000003ffa2b5ce51 x24: 000003ffcd809020 
[26985.098999] x23: 0000000009e63834 x22: 000003ffcd809050 
[26985.104306] x21: 0000000000000022 x20: 0000000000000001 
[26985.109608] x19: 0000000000000001 x18: 0000000000002000 
[26985.114915] x17: 000003ffaa067c80 x16: 0000000000911040 
[26985.120218] x15: 00000000ffffffff x14: 0000000000000020 
[26985.125525] x13: 2074432020202020 x12: 2020202020202020 
[26985.130828] x11: 2020202020202020 x10: 2020202072656243 
[26985.136135] x9 : 0000000000100001 x8 : 00000087000000e0 
[26985.141442] x7 : 0000000000000008 x6 : 000000000000006d 
[26985.146744] x5 : 0000000000000000 x4 : 0000000009e61280 
[26985.152051] x3 : 000003ffa2b5ce52 x2 : 0000000009e63835
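
For reference, the esr value in the first line of the message can be decoded by hand. The sketch below splits it into the ESR_ELx fields defined by the ARMv8 architecture (exception class, instruction length bit, syndrome). `decode_esr` is a made-up helper name, not something taken from the kernel or from this report.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into its architectural fields."""
    ec = (esr >> 26) & 0x3F       # exception class, bits [31:26]
    il = (esr >> 25) & 0x1        # instruction length bit, bit [25]
    iss = esr & 0x1FFFFFF         # instruction-specific syndrome, bits [24:0]
    fields = {"ec": ec, "il": il, "iss": iss}
    if ec in (0x24, 0x25):        # data abort from a lower / the same exception level
        fields["wnr"] = (iss >> 6) & 0x1   # 1 = write access, 0 = read access
        fields["dfsc"] = iss & 0x3F        # data fault status code
    return fields


# Value copied from the kernel message above.
info = decode_esr(0x92000007)
print({name: hex(value) for name, value in info.items()})
```

For 0x92000007 this gives exception class 0x24 (data abort from a lower exception level), a read access, and fault status 0x07, which is the level 3 translation fault named in the kernel message.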

Comment 2 Jakub Jelinek 2014-06-11 15:14:11 UTC
If you have access to an aarch64 box or chroot, can you please bisect which *.o file it is? Try mixing *.o files from the build with gcc-4.9.0-5.fc21 with *.o files from the build with gcc-4.9.0-6.fc21, and ideally narrow it down to one file such that if all *.o files come from 4.9.0-6.fc21 but that one comes from -5.fc21 it works, and if all *.o files come from -5.fc21 but that one comes from -6.fc21 it doesn't work.

If you get to this state, please attach the preprocessed source here and mention all gcc command line options used to compile it; I can then find out what changed using a cross-compiler.

Thanks.

Comment 3 Marcin Juszkiewicz 2014-06-11 15:19:11 UTC
I have an AArch64 machine under my desk. My plan for tomorrow is to bisect the gcc 4.9.0-5 -> 4.9.0-6 update to find out exactly when it started failing.

Comment 4 Jakub Jelinek 2014-06-11 15:26:24 UTC
If you mean bisecting redhat/gcc-4_9-branch, that is unlikely to help: there have been exactly 2 svn commits, one which added -fsanitize=float-cast-overflow (very unlikely to be related) and one which backported about 46 fixes from the upstream gcc-4_9-branch.
I'd say bisecting the *.o files is faster; then one can e.g. try to reproduce with the upstream 4.9 branch, or use -fdump-tree-all -fdump-rtl-all to find out where the output starts to differ and from that guess the problematic change, etc.

Comment 5 Jeff Law 2014-06-11 15:32:15 UTC
It also wouldn't be a terrible idea to wait for Jakub to import the next build of gcc.  We're still seeing a fair number of codegen bug reports whose fixes are being backported to the release (and presumably vendor) branch.

Comment 6 Marcin Juszkiewicz 2014-06-11 15:39:56 UTC
So by *.o you mean /usr/lib/gcc/*/*/*.o files?

Comment 7 Jakub Jelinek 2014-06-11 16:00:09 UTC
No, I meant: build postgresql with gcc-4.9.0-5.fc21 and make a backup copy of the build tree, then build postgresql with gcc-4.9.0-6.fc21 and make a backup copy of that build tree too.  Then divide the *.o files in the postgresql tree into approximately two halves.  For the first half, copy (+ touch) them from the backup tree built with 4.9.0-5.fc21; for the second half, copy (+ touch) them from the backup tree built with 4.9.0-6.fc21.  Relink (hopefully just make will do, but please verify that no *.o files are rebuilt in that step) and retest.  If the result works, the problematic file is presumably in the first half; if it doesn't, the problematic file is presumably in the second half.  Then divide the problematic half into two parts of approximately the same size and continue until you have narrowed it down to one file, then verify that it really is just that single file.

If nothing needs to be recompiled, each step will be just editing some list file containing names of the *.o files, copying/touching/relinking and retesting, so it shouldn't be that slow.
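
The halving procedure described above could be scripted roughly as follows. This is only a sketch: `find_bad_object` and the `fails_with_bad` hook are hypothetical names, the hook has to do the real copy/touch/relink/retest work, and the search assumes that exactly one object file is responsible.

```python
from typing import Callable, Sequence, Set


def find_bad_object(objects: Sequence[str],
                    fails_with_bad: Callable[[Set[str]], bool]) -> str:
    """Return the one object file whose 'bad' build breaks the test suite.

    fails_with_bad(subset) is a caller-supplied hook (hypothetical): it must
    take `subset` from the tree built with the newer gcc and every other
    object from the tree built with the older gcc, relink without
    recompiling, run the tests, and return True if they fail.
    """
    candidates = list(objects)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        if fails_with_bad(set(half)):
            candidates = half                    # culprit is in this half
        else:
            candidates = candidates[len(half):]  # culprit is in the other half
    culprit = candidates[0]
    # Cross-check both directions: the culprit alone from the bad build must
    # fail, and everything except the culprit from the bad build must pass.
    assert fails_with_bad({culprit})
    assert not fails_with_bad(set(objects) - {culprit})
    return culprit


# Simulated run with made-up file names; a real hook would relink and retest.
demo_objects = ["a.o", "b.o", "c.o", "d.o", "e.o"]
demo_result = find_bad_object(demo_objects, lambda subset: "c.o" in subset)
print(demo_result)
```

Each step halves the candidate list, so a tree with a few hundred object files needs fewer than ten relink-and-retest rounds plus the two final checks.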

Comment 8 Jeff Law 2014-06-12 18:44:40 UTC
Note postgresql has a copy of the Spencer regex library which we know GCC has been miscompiling on PPC & s390.  I've just backported the fix for that bug into the upstream gcc-4.9 release branch and when Jakub does the next resync & koji build Fedora will pick up that fix.

Marcin, could you try just compiling the bits in the regex subdirectory with gcc-4.8 or without optimization and see if that improves the test results?

Comment 9 Marcin Juszkiewicz 2014-06-13 06:20:11 UTC
Updated to the latest gcc. Still fails:

======================================================
 38 of 136 tests failed, 1 of these failures ignored. 
======================================================

The differences that caused some tests to fail can be viewed in the
file "/builddir/build/BUILD/postgresql-9.3.4/src/test/regress/regression.diffs".  A copy of the test summary that you see
above is saved in the file "/builddir/build/BUILD/postgresql-9.3.4/src/test/regress/regression.out".

GNUmakefile:138: recipe for target 'check' failed
make: *** [check] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.0iwiQ5 (%build)


RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.0iwiQ5 (%build)
<mock-chroot>[mockbuild@pinkypie /]$ gcc --version
gcc (GCC) 4.9.0 20140612 (Red Hat 4.9.0-9)
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Comment 10 Jakub Jelinek 2014-06-13 06:29:57 UTC
Since you said it regressed between 4.9.0-{5,6}.fc21, that is kind of expected; the jump threading bug Jeff fixed was older than that.

Anyway, have you succeeded with the binary search to find the problematic *.o file? (Of course, brute force can be replaced with a guess based on what you see in the debugger; then you can just try replacing that single file.)

Comment 11 Marcin Juszkiewicz 2014-06-13 10:45:42 UTC
OK. I took the upstream gcc git, extracted all commits between 20140518 (-5 in Fedora) and 20140529 (-6 in Fedora), and created a quilt patchset from them.

This gave me 52 patches (I had to drop 2 of them as they were already in Fedora). I chrooted into mock with gcc 4.9.0-5 and started bisecting.

0029 = daily bump to 20140523 == fail
0021 = daily bump to 20140522 == works
0026 = PR target/61208 [1] == works

1. https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=a4fbabf41b62765a5da3a8e5394b4b3c7441315e

So the problem is in one of two patches:

7766b1d93 2014-05-22 Vladimir Makarov <vmakarov>
a790cfefd gcc/

But as 7766b1d93 is rs6000-related, I suspect a790cfefd is the faulty one.

- https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=7766b1d931a85f4d6c887dd0164a94ee7b29be51
- https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=a790cfefdfbbc4e5aa23f11a163ecb51c84fd128
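
A search over an ordered patch series like the one above can be sketched as a plain binary search for the first prefix that fails. The names `first_bad_patch` and `fails_with_prefix` are hypothetical; the hook would have to apply the first k patches, rebuild gcc, rebuild postgresql and run its tests. The sketch assumes that once a patch breaks things, every longer prefix stays broken.

```python
from typing import Callable


def first_bad_patch(count: int, fails_with_prefix: Callable[[int], bool]) -> int:
    """Return the 1-based number of the first patch that makes the build fail.

    fails_with_prefix(k) is a caller-supplied hook (hypothetical) that applies
    patches 1..k and returns True if the resulting compiler produces a failing
    build. Prefix 0 (no patches) is assumed good, the full series bad.
    """
    good, bad = 0, count
    while bad - good > 1:
        mid = (good + bad) // 2
        if fails_with_prefix(mid):
            bad = mid     # the first bad patch is at or before mid
        else:
            good = mid    # the first bad patch is after mid
    return bad


# Simulated series of 52 patches where patch 28 is, purely for illustration,
# the first one that breaks the build.
tested = []


def simulated(k: int) -> bool:
    tested.append(k)
    return k >= 28


demo_first_bad = first_bad_patch(52, simulated)
print(demo_first_bad, tested)
```

With 52 patches the loop needs at most six rebuilds, which matters when every probe means rebuilding the compiler.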

Comment 12 Jakub Jelinek 2014-06-13 10:53:34 UTC
The PR60969 fix caused, I think, 2 regressions, but neither of those is wrong-code.  In any case, we really need the problematic file (and to find the problematic function in it; hopefully, if the PR60969 fix doesn't cause too many changes in every function, one could do that by comparing the assembly before and after the change), otherwise Vlad can't work on a fix.

Comment 13 Marcin Juszkiewicz 2014-06-13 12:15:08 UTC
Looks like src/backend/access/spgist/spgtextproc.o is to blame.

Command to compile:

gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -DLINUX_OOM_SCORE_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2   -c -o spgtextproc.o spgtextproc.c

Comment 14 Jakub Jelinek 2014-06-13 12:16:23 UTC
Can you preprocess it and attach spgtextproc.i?  Just add -save-temps to the above command.

Comment 15 Marcin Juszkiewicz 2014-06-13 12:18:57 UTC
Created attachment 908547
archive with problematic file

Tarball contents:

t/spgtextproc.c

t/broken/ - built with gcc 4.9.0-5.0029

t/broken/spgtextproc.o
t/broken/spgtextproc.o.s

t/worked/ - built with gcc 4.9.0-5.0026

t/worked/spgtextproc.o
t/worked/spgtextproc.o.s

The assembly listings were produced with "objdump -d".

Comment 16 Marcin Juszkiewicz 2014-06-13 12:23:05 UTC
Created attachment 908550
preprocessed source

Comment 17 Jakub Jelinek 2014-06-13 13:47:32 UTC
Created attachment 908587
rh1108219.c

If my cross-compiler doesn't behave too differently from the native one, it seems the only differences are due to different register allocator decisions in the spg_text_choose function. Does that sound likely from what you see in the debugger?  If yes, can you find out how many times that function is called before things go wrong and, ideally, in which iteration it does something wrong and, if possible, what?

I'm attaching a delta-reduced source for spg_text_choose. If the problem is indeed there, it would be best to turn this into a self-contained executable testcase: stub the palloc and pg_detoast_datum_packed functions in a different *.c file (it is enough if they set and return only whatever spg_text_choose needs), and add a stub main function in that different *.c file that calls spg_text_choose with the parameters where it misbehaves (of course, when it dereferences pointers, they must point to something, etc.).

Apart from reshuffling a few registers (x22->x24, x23->x22, x24->x23, w1->w21)
and some hopefully unimportant insn scheduling changes, I see two hunks that look different:

-       ldr     x3, [x1,w2,sxtw 3]
-       uxtb    w3, w3
+       ldrb    w3, [x1,x0]

and

+       ldrb    w3, [x1,x0]
        add     w4, w7, w5
        asr     w4, w4, 1
-       ldr     x3, [x1,w4,sxtw 3]
-       uxtb    w3, w3

Any help in finding out what goes wrong in that function and if it is really that function would be appreciated.  E.g. try to add __attribute__((optimize (0))) on that function and see if the problems go away...

Comment 18 Vladimir Makarov 2014-06-13 14:32:17 UTC
The patch in question itself should be safe but I guess it just triggered some hidden bug.  I'll investigate this further.

Comment 19 Vladimir Makarov 2014-06-13 14:35:10 UTC
(In reply to Jakub Jelinek from comment #17)
> Created attachment 908587

> Apart from reshuffling a few registers (x22->x24, x23->x22, x24->x23,
> w1->w21)
> and some hopefully unimportant insn scheduling changes, I see two hunks that
> look different:
> 
> -       ldr     x3, [x1,w2,sxtw 3]
> -       uxtb    w3, w3
> +       ldrb    w3, [x1,x0]
> 
> and
> 
> +       ldrb    w3, [x1,x0]
>         add     w4, w7, w5
>         asr     w4, w4, 1
> -       ldr     x3, [x1,w4,sxtw 3]
> -       uxtb    w3, w3
> 
> Any help in finding out what goes wrong in that function and if it is really
> that function would be appreciated.  E.g. try to add __attribute__((optimize
> (0))) on that function and see if the problems go away...

I found this code suspicious as well. It is the result of equivalent memory substitution, and I think some part of the address was lost.
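
To illustrate why the address matters in the hunks quoted above, here is a small model of the two load sequences over a little-endian byte buffer. It is a sketch under assumptions: the function names are invented, and the thread does not say what x0 held in the broken build, so the unscaled offset at the end is only one hypothetical way a part of the address could go missing.

```python
import struct


def wide_load_then_truncate(mem: bytes, base: int, index: int) -> int:
    """Model of `ldr x3, [x1,w2,sxtw 3]` followed by `uxtb w3, w3`."""
    # 64-bit little-endian load from base + index * 8 ...
    (value,) = struct.unpack_from("<Q", mem, base + index * 8)
    # ... then keep only the low byte.
    return value & 0xFF


def byte_load(mem: bytes, base: int, offset: int) -> int:
    """Model of `ldrb w3, [x1,x0]`: one byte at base + offset, no scaling."""
    return mem[base + offset]


# Four 64-bit elements whose low bytes are 'A', 'B', 'C', 'D' (made-up data).
mem = struct.pack("<4Q", 0x41, 0x42, 0x43, 0x44)
index = 2

old_result = wide_load_then_truncate(mem, 0, index)
scaled_result = byte_load(mem, 0, index * 8)   # offset still carries the *8
unscaled_result = byte_load(mem, 0, index)     # hypothetical: scaling lost

print(hex(old_result), hex(scaled_result), hex(unscaled_result))
```

On a little-endian target, the single-byte load is a valid replacement for the wide load plus truncation only as long as the offset register still carries the scaled index; otherwise it reads an unrelated byte.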

Comment 20 Vladimir Makarov 2014-06-13 15:22:55 UTC
The bug is in the address decomposition code in rtlanal.c.  It was written by Richard Sandiford and is outside the code base I maintain.  I'll try to make a patch, but I am not sure when it will be approved.  I hope it will be fixed next week.

Comment 21 Vladimir Makarov 2014-06-16 21:56:58 UTC
I found another solution inside the LRA code base.

I committed the patch to gcc-4.9-branch.

Comment 22 Jakub Jelinek 2014-06-24 15:44:34 UTC
Is the problem fixed with gcc-4.9.0-12.fc21?

Comment 23 Marcin Juszkiewicz 2014-06-24 15:55:08 UTC
We built postgresql with gcc 4.9.0-10.fc21 just fine.

http://arm.koji.fedoraproject.org/koji/buildinfo?buildID=203514

