1544349 – GCC -O1 -ftree-slp-vectorize miscompilation [ppc64][ppc64le]

Bug 1544349 - GCC -O1 -ftree-slp-vectorize miscompilation [ppc64][ppc64le]

Summary: GCC -O1 -ftree-slp-vectorize miscompilation [ppc64][ppc64le]

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	gcc
Sub Component:
Version:	28
Hardware:	ppc64le
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	PPCTracker 1543753 1600395
TreeView+	depends on / blocked

Reported:	2018-02-12 08:59 UTC by Petr Kubat
Modified:	2018-07-12 12:31 UTC (History)
CC List:	20 users (show)
Fixed In Version:
Clone Of:
Clones:	1600395 (view as bug list)
Environment:
Last Closed:	2018-07-12 12:31:09 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
full stack from the abort (30.77 KB, text/plain) 2018-02-13 10:49 UTC, Petr Kubat	no flags	Details
reproducer (1.60 KB, application/x-xz) 2018-07-04 08:00 UTC, Pavel Raiskup	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
GNU Compiler Collection	85698	0	None	None	None	2018-07-12 11:24:23 UTC
Red Hat Bugzilla	1547495	0	unspecified	CLOSED	Incorrect branching in postgresql on ppc64 when compiled with -O3	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1580850	1	None	None	None	2021-01-20 06:05:38 UTC

Internal Links: 1547495 1580850

Description Petr Kubat 2018-02-12 08:59:58 UTC

$ initdb

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /tmp/tmp.5J9relRIz8 ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... sysv
creating configuration files ... ok
running bootstrap script ... 2018-02-12 03:52:25.425 EST [23048] FATAL:  out of memory
2018-02-12 03:52:25.425 EST [23048] PANIC:  cannot abort transaction 1, it was already committed
child process was terminated by signal 6: Aborted
initdb: removing contents of data directory "/tmp/tmp.5J9relRIz8"

This can also be seen (sans logs) in the test suite for the build:
https://kojipkgs.fedoraproject.org//work/tasks/9065/24839065/build.log

Comment 1 Petr Kubat 2018-02-12 09:13:36 UTC

Forgot to specify that 'latest' in this case is the version currently in master branch (postgresql 10.2) which is not built in koji yet (due to the test case failure).

Comment 2 Petr Kubat 2018-02-12 13:59:43 UTC

Seems like an issue with latest gcc.

I have tried rebuilding latest master with an older gcc from f27 repositories (thanks Pavel for the suggestion!) and initdb works in that build.

Comment 3 Tom Lane 2018-02-12 14:24:13 UTC

Hm, so either it's a gcc bug, or PG is doing something that's not quite kosher but previous compiler versions have let us get away with.

It'd be useful to get a stack trace from the point of the error.  You could patch an abort() call into the FATAL stanza in src/backend/utils/error/elog.c, say right before the proc_exit(1) call (at line 543 in 10.2), to get a core dump there.

Comment 4 Petr Kubat 2018-02-13 10:49:35 UTC

Created attachment 1395270 [details]
full stack from the abort

Comment 5 Petr Kubat 2018-02-13 10:52:36 UTC

While getting the stack trace I tried modifying the optimisation compile flags since we are using -O3 (for ppc64 specifically) which was introduced via bug 1051075. Dropping it down to the default -O2 made initdb succeed.

Comment 6 Tom Lane 2018-02-13 15:20:01 UTC

Hm.  The stack trace says that the misbehavior is in fd.c, probably something along the lines of doubling the size of the VfdCache array each time through, whether or not it was full.  However, all that code is really old --- depending on what you want to count as a change, no part of AllocateVfd has changed in between 10 and 20 years, according to "git blame".  I thought possibly something we'd changed recently in PG was broken, but now it's hard to avoid the conclusion that this is a gcc bug.

Comment 7 Petr Kubat 2018-02-16 14:00:21 UTC

(In reply to Tom Lane from comment #6)
> Hm.  The stack trace says that the misbehavior is in fd.c, probably
> something along the lines of doubling the size of the VfdCache array each
> time through, whether or not it was full.

After looking into the issue some more I can say that is precisely what happens.

Each time the AllocateVfd function is called the cache size is doubled. Never is it considered that the cache is not yet full so postgres runs out of memory to use very quickly.

When using -O2 instead, postgres behaves correctly and only doubles the cache when it is needed so by the time the more optimized version would have already aborted with out-of-memory, the less optimized version's cache is only 128 items long.

Comment 8 Tom Lane 2018-02-16 14:45:10 UTC

Seems like the short term answer is to revert to -O2 for the ppc64 build, and file a gcc bug to get the problem fixed.

BTW, so far as I know, nobody in the PG dev community really tests builds with -O3.  Even though this particular issue seems to be 100% the compiler's fault, I wonder whether there are any dubious coding practices in there that might be exposed with optimization levels above -O2.  I can't view the bug you mention in #c5, so I don't know why the higher level was introduced in the first place.

Comment 9 Fedora End Of Life 2018-02-20 15:26:06 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 10 Petr Kubat 2018-02-21 12:53:55 UTC

(In reply to Tom Lane from comment #8)
> I can't view the bug
> you mention in #c5, so I don't know why the higher level was introduced in
> the first place.

Ah right, it is marked as private for some reason, sorry about that. The higher optimization flag was introduced as postgresql's performance would benefit from it. Pavel might know more details about it since I was not around at the time

I have tried creating a reproducer for the issue from postgresql's code and attached it to the gcc bug report (bug 1547495). Interestingly it reproduces even on other configurations while postgresql only fails on ppc4 with gcc 8.

Comment 11 Pavel Raiskup 2018-02-21 13:39:25 UTC

The -O3 was per customer request, for limited set of (important) packages
where some benchmarks claimed that it is worth it.  Later the
%_performance_build toggle macro was invented in redhat-rpm-macros in RHEL,
but we didn't adopt early enough.

Brent (cc) might have more info, but I think the minimal reproducer in bug
1547495 is a good start atm.

Comment 12 Pavel Raiskup 2018-03-01 17:08:41 UTC

Even after gcc fix, -O3 build fails on tests (cross-arch I guess), I'm
going to debug soon ... but disabling the -O3 for now.

Comment 13 Pavel Raiskup 2018-06-19 10:49:02 UTC

I've got my VM killed, waiting for new one :/  But so far I can tell this:

- the remaining -O3 issue is only in btree_gist contrib module, as far as
  the testing in RPM goes...
- it's reproducible on 64-bit ppc, not on x86_64
- the problem is in gbt_var_bin_union(), btree_utils_var.c
- combination of -fstack-progector and -O3 triggers this
- gcc 8.0+

Comment 14 Pavel Raiskup 2018-07-04 07:58:08 UTC

This really seems to be a GCC 8 bug, triggered by options
'-O1 -fstack-protector -ftree-slp-vectorize'.

Since with '-O0' I can not reproduce this issue, I tried to diff the
outputs from:
  gcc -O1 -Q --help=optimizers
  gcc -O0 -Q --help=optimizers

But even if I used '-O0' with all the missing options from -O1, I did
not reproduce this.  So there's a potential for another bug -- seems like
--help=optimizer doesn't tell a complete truth.

There's actually also slightly related bug (a bit complicating the
reproducibility of this GCC issue), I'll report upstream.

Comment 15 Pavel Raiskup 2018-07-04 08:00:29 UTC

Created attachment 1456403 [details]
reproducer

(not really that much) minimal reproducer:

// grab ppc64le machine

$ tar xf reproduce.tar.xz
$ cd reproduce/
$ make
...
diff -ruN expected output
--- expected    2018-07-04 03:48:53.000000000 -0400
+++ output      2018-07-04 03:49:47.753905002 -0400
@@ -6,5 +6,5 @@
 comparing 10 12
 comparing 11 12
 is upper
- => new_lower  10
+ => new_lower  11
  => new_upper  12

Comment 16 Pavel Raiskup 2018-07-04 09:55:39 UTC

(In reply to Pavel Raiskup from comment #14)
> There's actually also slightly related bug (a bit complicating the
> reproducibility of this GCC issue), I'll report upstream.

I mean, there's also a bug in PostgreSQL:
https://www.postgresql.org/message-id/18451583.ZEuM8ThfqI%40nb.usersys.redhat.com

Comment 17 Jakub Jelinek 2018-07-12 09:06:59 UTC

Started with http://gcc.gnu.org/r256656
No change in -fdump-tree-optimized, so it is a rs6000 backend bug.

The assembly difference (good to bad) is:
-       lxvd2x 0,0,9
-       xxpermdi 0,0,0,2
-       xxpermdi 0,0,0,2
+       lvx 0,0,9
        li 9,-16
        addi 10,1,64
-       stxvd2x 0,10,9
+       stxvd2x 32,10,9

Comment 18 Jakub Jelinek 2018-07-12 11:24:23 UTC

This got already fixed in http://gcc.gnu.org/r260329, so should work fine in f29 already.

Comment 19 Pavel Raiskup 2018-07-12 12:31:09 UTC

Ah, OK, I can confirm that:
https://koji.fedoraproject.org/koji/taskinfo?taskID=28157095

Note You need to log in before you can comment on or make changes to this bug.

aoliva
bbaude
dan
davejohansen
devrim
dmalcolm
fweimer
hhorak
jakub
jmlich83
jstanek
jwakely
law
mpolacek
msebor
nickc
pkubat
ppisar
praiskup
tgl