Bug 1544349
Summary: | GCC -O1 -ftree-slp-vectorize miscompilation [ppc64][ppc64le]
---|---
Product: | [Fedora] Fedora
Reporter: | Petr Kubat <pkubat>
Component: | gcc
Assignee: | Jakub Jelinek <jakub>
Status: | CLOSED RAWHIDE
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | unspecified
Priority: | unspecified
Version: | 28
CC: | aoliva, bbaude, dan, davejohansen, devrim, dmalcolm, fweimer, hhorak, jakub, jmlich83, jstanek, jwakely, law, mpolacek, msebor, nickc, pkubat, ppisar, praiskup, tgl
Target Milestone: | ---
Target Release: | ---
Hardware: | ppc64le
OS: | Unspecified
Doc Type: | If docs needed, set a value
Story Points: | ---
: | 1600395
Last Closed: | 2018-07-12 12:31:09 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
Category: | ---
oVirt Team: | ---
Cloudforms Team: | ---
Bug Blocks: | 1071880, 1543753, 1600395
Description
Petr Kubat
2018-02-12 08:59:58 UTC
Forgot to specify that 'latest' in this case is the version currently in master branch (postgresql 10.2) which is not built in koji yet (due to the test case failure).

Seems like an issue with latest gcc. I have tried rebuilding latest master with an older gcc from f27 repositories (thanks Pavel for the suggestion!) and initdb works in that build.

Hm, so either it's a gcc bug, or PG is doing something that's not quite kosher but previous compiler versions have let us get away with. It'd be useful to get a stack trace from the point of the error. You could patch an abort() call into the FATAL stanza in src/backend/utils/error/elog.c, say right before the proc_exit(1) call (at line 543 in 10.2), to get a core dump there.

Created attachment 1395270
full stack from the abort
While getting the stack trace I tried modifying the optimisation compile flags since we are using -O3 (for ppc64 specifically) which was introduced via bug 1051075. Dropping it down to the default -O2 made initdb succeed.

Hm. The stack trace says that the misbehavior is in fd.c, probably something along the lines of doubling the size of the VfdCache array each time through, whether or not it was full. However, all that code is really old --- depending on what you want to count as a change, no part of AllocateVfd has changed in between 10 and 20 years, according to "git blame". I thought possibly something we'd changed recently in PG was broken, but now it's hard to avoid the conclusion that this is a gcc bug.

(In reply to Tom Lane from comment #6)
> Hm. The stack trace says that the misbehavior is in fd.c, probably
> something along the lines of doubling the size of the VfdCache array each
> time through, whether or not it was full.

After looking into the issue some more I can say that is precisely what happens. Each time the AllocateVfd function is called the cache size is doubled. Never is it considered that the cache is not yet full so postgres runs out of memory to use very quickly. When using -O2 instead, postgres behaves correctly and only doubles the cache when it is needed so by the time the more optimized version would have already aborted with out-of-memory, the less optimized version's cache is only 128 items long.

Seems like the short term answer is to revert to -O2 for the ppc64 build, and file a gcc bug to get the problem fixed.

BTW, so far as I know, nobody in the PG dev community really tests builds with -O3. Even though this particular issue seems to be 100% the compiler's fault, I wonder whether there are any dubious coding practices in there that might be exposed with optimization levels above -O2. I can't view the bug you mention in #c5, so I don't know why the higher level was introduced in the first place.
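The growth behaviour discussed above (double the cache only when no free slot is left) can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the code from fd.c: the names `SlotCache`, `cache_init` and `cache_alloc`, the free-list layout and the minimum size of 32 are all hypothetical and chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical slot cache for illustration; this is NOT PostgreSQL's VfdCache. */
typedef struct {
    int   *next_free; /* next_free[i] = index of the next free slot, 0 = end of list */
    size_t size;      /* number of slots; slot 0 only serves as the free-list head */
} SlotCache;

/* Create a cache holding only the list-head slot. Returns 0 on success, -1 on OOM. */
int cache_init(SlotCache *c)
{
    c->next_free = calloc(1, sizeof *c->next_free);
    if (c->next_free == NULL)
        return -1;
    c->size = 1;
    return 0;
}

/* Hand out a free slot index (>= 1). Returns 0 if memory runs out. */
size_t cache_alloc(SlotCache *c)
{
    assert(c != NULL && c->next_free != NULL);

    /* Grow ONLY when the free list is empty. The miscompiled build behaved
     * as if this condition were true on every call. */
    if (c->next_free[0] == 0) {
        size_t new_size = c->size * 2;
        if (new_size < 32)
            new_size = 32; /* assumed minimum size for this sketch */

        int *p = realloc(c->next_free, new_size * sizeof *p);
        if (p == NULL)
            return 0;
        c->next_free = p;

        /* Chain the newly added slots into the free list. */
        for (size_t i = c->size; i < new_size; i++)
            p[i] = (int)(i + 1);
        p[new_size - 1] = 0;
        p[0] = (int)c->size;
        c->size = new_size;
    }

    /* Pop the first free slot off the list. */
    size_t slot = (size_t)c->next_free[0];
    c->next_free[0] = c->next_free[slot];
    return slot;
}
```

With this layout the first call grows the array to 32 slots, and no further growth happens until all 31 usable slots have been handed out. A build that doubles on every call would instead grow exponentially with the number of allocations, which matches the out-of-memory symptom described in the comments.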
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle. Changing version to '28'.

(In reply to Tom Lane from comment #8)
> I can't view the bug
> you mention in #c5, so I don't know why the higher level was introduced in
> the first place.

Ah right, it is marked as private for some reason, sorry about that. The higher optimization flag was introduced as postgresql's performance would benefit from it. Pavel might know more details about it since I was not around at the time.

I have tried creating a reproducer for the issue from postgresql's code and attached it to the gcc bug report (bug 1547495). Interestingly it reproduces even on other configurations while postgresql only fails on ppc64 with gcc 8.

The -O3 was per customer request, for a limited set of (important) packages where some benchmarks claimed that it is worth it. Later the %_performance_build toggle macro was invented in redhat-rpm-macros in RHEL, but we didn't adopt it early enough. Brent (cc) might have more info, but I think the minimal reproducer in bug 1547495 is a good start atm.

Even after the gcc fix, the -O3 build fails on tests (cross-arch I guess), I'm going to debug soon ... but disabling the -O3 for now.

I've got my VM killed, waiting for a new one :/ But so far I can tell this:
- the remaining -O3 issue is only in the btree_gist contrib module, as far as the testing in RPM goes...
- it's reproducible on 64-bit ppc, not on x86_64
- the problem is in gbt_var_bin_union(), btree_utils_var.c
- the combination of -fstack-protector and -O3 triggers this
- gcc 8.0+

This really seems to be a GCC 8 bug, triggered by options '-O1 -fstack-protector -ftree-slp-vectorize'. Since with '-O0' I can not reproduce this issue, I tried to diff the outputs from:
gcc -O1 -Q --help=optimizers
gcc -O0 -Q --help=optimizers
But even if I used '-O0' with all the missing options from -O1, I did not reproduce this.
So there's a potential for another bug -- seems like --help=optimizers doesn't tell a complete truth.

There's actually also slightly related bug (a bit complicating the reproducibility of this GCC issue), I'll report upstream.

Created attachment 1456403
reproducer
(not really that much) minimal reproducer:
// grab ppc64le machine
$ tar xf reproduce.tar.xz
$ cd reproduce/
$ make
...
diff -ruN expected output
--- expected 2018-07-04 03:48:53.000000000 -0400
+++ output 2018-07-04 03:49:47.753905002 -0400
@@ -6,5 +6,5 @@
comparing 10 12
comparing 11 12
is upper
- => new_lower 10
+ => new_lower 11
=> new_upper 12
(In reply to Pavel Raiskup from comment #14)
> There's actually also slightly related bug (a bit complicating the
> reproducibility of this GCC issue), I'll report upstream.

I mean, there's also a bug in PostgreSQL:
https://www.postgresql.org/message-id/18451583.ZEuM8ThfqI%40nb.usersys.redhat.com

Started with http://gcc.gnu.org/r256656
No change in -fdump-tree-optimized, so it is a rs6000 backend bug. The assembly difference (good to bad) is:
- lxvd2x 0,0,9
- xxpermdi 0,0,0,2
- xxpermdi 0,0,0,2
+ lvx 0,0,9
 li 9,-16
 addi 10,1,64
- stxvd2x 0,10,9
+ stxvd2x 32,10,9

This got already fixed in http://gcc.gnu.org/r260329, so should work fine in f29 already.

Ah, OK, I can confirm that:
https://koji.fedoraproject.org/koji/taskinfo?taskID=28157095
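One possible reading of the assembly difference quoted above can be modelled in plain C. The sketch below is an assumption, not something the thread confirms: it takes the commonly described little-endian behaviour, where lxvd2x and stxvd2x each exchange the two 64-bit halves of the vector, xxpermdi x,x,x,2 exchanges them explicitly, and lvx loads in memory order. Alignment effects of lvx are ignored, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 128-bit vector register modelled as two 64-bit doublewords. */
typedef struct { uint64_t dw[2]; } Vec128;

/* Model of xxpermdi x,x,x,2: exchange the two doublewords. */
static Vec128 swap_dw(Vec128 v)
{
    Vec128 r = {{ v.dw[1], v.dw[0] }};
    return r;
}

/* ASSUMED model of lxvd2x on little-endian: load with the halves exchanged. */
static Vec128 model_lxvd2x(const uint64_t mem[2])
{
    Vec128 v = {{ mem[0], mem[1] }};
    return swap_dw(v);
}

/* ASSUMED model of lvx: load in memory order (alignment ignored here). */
static Vec128 model_lvx(const uint64_t mem[2])
{
    Vec128 v = {{ mem[0], mem[1] }};
    return v;
}

/* ASSUMED model of stxvd2x on little-endian: store with the halves exchanged. */
static void model_stxvd2x(Vec128 v, uint64_t mem[2])
{
    v = swap_dw(v);
    mem[0] = v.dw[0];
    mem[1] = v.dw[1];
}

/* "Good" sequence: lxvd2x, xxpermdi, xxpermdi, stxvd2x (four exchanges in total). */
void copy_good(const uint64_t src[2], uint64_t dst[2])
{
    assert(src != NULL && dst != NULL);
    Vec128 v = model_lxvd2x(src);
    v = swap_dw(v);
    v = swap_dw(v);
    model_stxvd2x(v, dst);
}

/* "Bad" sequence: lvx, stxvd2x (a single exchange in total). */
void copy_bad(const uint64_t src[2], uint64_t dst[2])
{
    assert(src != NULL && dst != NULL);
    Vec128 v = model_lvx(src);
    model_stxvd2x(v, dst);
}
```

Under this model the good sequence copies the 16 bytes unchanged, because the exchanges cancel in pairs, while the bad sequence leaves the two 64-bit halves swapped in the destination. Whether this is the exact mechanism behind the fix in r260329 is not stated in the thread.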