postgresql-9.4.1-1.fc22 fails to build in F22:

    ============== creating temporary installation ==============
    ============== initializing database system ==============
    pg_regress: initdb failed
    Examine /builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/log/initdb.log for the reason.
    Command was: "/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/./tmp_check/install//usr/bin/initdb" -D "/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/./tmp_check/data" -L "/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/./tmp_check/install//usr/share/pgsql" --noclean --nosync > "/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/log/initdb.log" 2>&1
    GNUmakefile:120: recipe for target 'check' failed
    make[1]: Leaving directory '/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl'
    make[1]: *** [check] Error 2
    make: *** [check-plperl-recurse] Error 2
    Makefile:35: recipe for target 'check-plperl-recurse' failed
    make: Leaving directory '/builddir/build/BUILD/postgresql-9.4.1/src/pl'
    + test_failure=1
    + set +x
    === make failure: src/pl/plperl/regression.diffs ===
    + mv src/Makefile.global src/Makefile.global.save
    + cp src/Makefile.global.python3 src/Makefile.global
    RPM build errors:
    cp: error writing 'src/Makefile.global': No space left on device
    cp: failed to extend 'src/Makefile.global': No space left on device
    error: Bad exit status from /var/tmp/rpm-tmp.Usxj2a (%build)
        Bad exit status from /var/tmp/rpm-tmp.Usxj2a (%build)
    Child return code was: 1
    EXCEPTION: Command failed. See logs for output.

Difference between working and failing build root:

    package            working            failing
    perl-Encode        2:2.68-1.fc22   >  2:2.70-1.fc22
    libgcc             4.9.2-5.fc22    >  5.0.0-0.7.fc22
    libgomp            4.9.2-5.fc22    >  5.0.0-0.7.fc22
    shared-mime-info   1.4-1.fc22      >  1.4-2.fc22
    libstdc++          4.9.2-5.fc22    >  5.0.0-0.7.fc22
    gcc-c++            4.9.2-5.fc22    >  5.0.0-0.7.fc22
    gcc                4.9.2-5.fc22    >  5.0.0-0.7.fc22
    isl                                >  0.14-3.fc22
    cpp                4.9.2-5.fc22    >  5.0.0-0.7.fc22
    libstdc++-devel    4.9.2-5.fc22    >  5.0.0-0.7.fc22
Thanks for the report. Reproduced: initdb (or rather its postgres sub-process) eats a lot of space in the PGDATA dir when built with gcc-5.0.0 (not entirely sure gcc is the real trigger) together with -O2. With -O1/-O0 initdb works fine. I'll try to figure out how to debug this properly, or at least produce a minimal example.

The postgres process aborts while initdb is feeding it the bki file:

    #0  0x00007ffff7330187 in raise () from /lib64/libc.so.6
    #1  0x00007ffff7331dea in abort () from /lib64/libc.so.6
    #2  0x000000000073a1c9 in errfinish (dummy=dummy@entry=0) at elog.c:569
    #3  0x000000000073bbb0 in elog_finish (elevel=elevel@entry=22, fmt=fmt@entry=0x772b70 "cannot abort transaction %u, it was already committed") at elog.c:1362
    #4  0x00000000004b05f3 in RecordTransactionAbort (isSubXact=isSubXact@entry=0 '\000') at xact.c:1467
    #5  0x00000000004b06b4 in AbortTransaction () at xact.c:2415
    #6  0x00000000004b3955 in AbortOutOfAnyTransaction () at xact.c:4000
    #7  0x00000000007447c9 in ShutdownPostgres (code=<optimized out>, arg=<optimized out>) at postinit.c:1058
    #8  0x000000000064994d in shmem_exit (code=code@entry=1) at ipc.c:230
    #9  0x0000000000649a35 in proc_exit_prepare (code=code@entry=1) at ipc.c:187
    #10 0x0000000000649aa8 in proc_exit (code=code@entry=1) at ipc.c:102
    #11 0x000000000073a1f5 in errfinish (dummy=<optimized out>) at elog.c:555
    #12 0x0000000000660ef8 in mdextend (reln=0xc0cd90, forknum=FSM_FORKNUM, blocknum=<optimized out>, buffer=0xc16660 "", skipFsync=<optimized out>) at md.c:527
    #13 0x00000000006471c4 in fsm_extend (fsm_nblocks=1055795, rel=0xbefb80) at freespace.c:587
    #14 fsm_readbuf (rel=rel@entry=0xbefb80, addr=..., addr@entry=..., extend=extend@entry=1 '\001') at freespace.c:525
    #15 0x00000000006472ee in fsm_set_and_search (rel=rel@entry=0xbefb80, addr=..., slot=slot@entry=3518, newValue=<optimized out>, minValue=minValue@entry=6 '\006') at freespace.c:615
    #16 0x000000000064778d in RecordAndGetPageWithFreeSpace (rel=rel@entry=0xbefb80, oldPage=oldPage@entry=4294967295, oldSpaceAvail=oldSpaceAvail@entry=0, spaceNeeded=spaceNeeded@entry=176) at freespace.c:159
    #17 0x000000000048f1b2 in RelationGetBufferForTuple (relation=relation@entry=0xbefb80, len=176, otherBuffer=otherBuffer@entry=0, options=options@entry=0, bistate=bistate@entry=0x0, vmbuffer=vmbuffer@entry=0x7fffffffdb0c, vmbuffer_other=0x0) at hio.c:414
    #18 0x0000000000488c9a in heap_insert (relation=0xbefb80, tup=tup@entry=0xc07800, cid=<optimized out>, options=options@entry=0, bistate=bistate@entry=0x0) at heapam.c:2082
    #19 0x00000000004899be in simple_heap_insert (relation=<optimized out>, tup=tup@entry=0xc07800) at heapam.c:2572
    #20 0x00000000004cf211 in InsertOneTuple (objectid=1242) at bootstrap.c:799
    #21 0x00000000004cdcd9 in boot_yyparse () at bootparse.y:277
    #22 0x00000000004ce83f in BootstrapModeMain () at bootstrap.c:491
    #23 AuxiliaryProcessMain (argc=5, argc@entry=6, argv=0xba2998, argv@entry=0xba2990) at bootstrap.c:411
    #24 0x00000000004609db in main (argc=6, argv=0xba2990) at main.c:219

Pavel
Judging from the stack trace, I'd say that something is busted in the logic that determines where in a relation (aka table, file) there is a page with enough free space to insert a new tuple. For some reason it's repeatedly deciding it can't find enough space and then extending the relation by another page. This probably points to a compiler bug or overenthusiastic optimization manifesting somewhere in the FSM (free space map) logic. I'm a bit busy right now but am willing to help out if you can't isolate it quickly.
I dug into this a bit in a rawhide mock installation. It appears that your stack trace above is telling the truth that RelationGetBufferForTuple is passing oldPage=oldPage@entry=4294967295 to RecordAndGetPageWithFreeSpace. The latter then goes nuts extending the free space map out to such a high block number. (So it's not really an infinite loop, but it is consuming unreasonable amounts of disk space.)

Now the thing is that the logic in RelationGetBufferForTuple() looks like this:

    while (targetBlock != InvalidBlockNumber)
    {
        ... do a bunch of stuff that does not change targetBlock ...

        targetBlock = RecordAndGetPageWithFreeSpace(relation,
                                                    targetBlock,
                                                    pageFreeSpace,
                                                    len + saveFreeSpace);
    }

It is therefore impossible on its face that this code ever passes 4294967295 (a/k/a InvalidBlockNumber) to RecordAndGetPageWithFreeSpace. And yet it is doing that: I put a test for oldPage == InvalidBlockNumber into RecordAndGetPageWithFreeSpace, and it fired.

I think we can safely classify this as a gcc bug, and a pretty bad one too.
Thanks, Tom, for looking at it. Yes, I agree: clear gcc bug. I'm trying to cut this down to a minimal example, and then I'll reassign the bug to gcc.
Created attachment 991396
Small reproducer

OK, it can definitely be minimized further, but for gcc purposes the attached example should be good enough. There is a main() calling reproduce() from a different module, which in turn calls other functions from yet another module. Check the reproduce function (reproducer.c), simplified:

    block = invalid;
    while (block != invalid) {
        // should never run
        printf("but this is run with -O2\n");
    }

=== wrong behavior ===

    $ ./configure CFLAGS="-O2 -g3" >/dev/null
    $ make >/dev/null
    $ ./hello
    equals(a,b): 0, a = 4294967295, b = 4294967295

=== correct (expected) behavior ===

    $ ./configure CFLAGS="-O2 -g3" >/dev/null
    $ make >/dev/null
    $ ./hello
    # should be silent

=== with -O0 program behaves correctly ===

    $ ./configure CFLAGS="-O0 -g3" >/dev/null
    $ make >/dev/null
    $ ./hello
    # should be silent
Reduced:

    int i;

    __attribute__ ((noinline))
    unsigned int foo (void)
    {
      return 0;
    }

    int
    main ()
    {
      unsigned int u = -1;
      if (u == -1)
        {
          unsigned int n = foo ();
          if (n > 0)
            u = n - 1;
        }

      while (u != -1)
        {
          asm ("" : "+g" (u));
          u = -1;
          i = 1;
        }

      if (i)
        __builtin_abort ();
    }
Ok with -O and -O2 -fno-tree-vrp; fails with -O2.
Tracking this upstream now.
Confirmed that postgresql-9.4.1-1.fc23 builds (including passing its self-tests) with gcc-5.0.0-0.13.fc23.x86_64, where it did not with gcc-5.0.0-0.12.fc23.x86_64. Thanks for the quick turnaround!