Bug 1190978

Summary: gcc 5.0.0 causes FTBFS of postgresql
Product: [Fedora] Fedora
Reporter: Petr Pisar <ppisar>
Component: gcc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: rawhide
CC: davejohansen, devrim, hhorak, hof, jakub, jmlich, jstanek, law, mpolacek, praiskup, tgl
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
URL: http://koji.fedoraproject.org/koji/taskinfo?taskID=8841005
Fixed In Version: gcc-5.0.0-0.13.fc22
Doc Type: Bug Fix
Last Closed: 2015-02-15 15:50:19 UTC
Type: Bug
Attachments: Small reproducer (flags: none)

Description Petr Pisar 2015-02-10 07:47:43 UTC
postgresql-9.4.1-1.fc22 fails to build in F22:

============== creating temporary installation        ==============
============== initializing database system           ==============
pg_regress: initdb failed
Examine /builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/log/initdb.log for the reason.
Command was: "/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/./tmp_check/install//usr/bin/initdb" -D "/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/./tmp_check/data" -L "/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/./tmp_check/install//usr/share/pgsql" --noclean --nosync > "/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl/log/initdb.log" 2>&1
GNUmakefile:120: recipe for target 'check' failed
make[1]: Leaving directory '/builddir/build/BUILD/postgresql-9.4.1/src/pl/plperl'
make[1]: *** [check] Error 2
make: *** [check-plperl-recurse] Error 2
Makefile:35: recipe for target 'check-plperl-recurse' failed
make: Leaving directory '/builddir/build/BUILD/postgresql-9.4.1/src/pl'
+ test_failure=1
+ set +x
=== make failure: src/pl/plperl/regression.diffs ===
+ mv src/Makefile.global src/Makefile.global.save
+ cp src/Makefile.global.python3 src/Makefile.global
RPM build errors:
cp: error writing 'src/Makefile.global': No space left on device
cp: failed to extend 'src/Makefile.global': No space left on device
error: Bad exit status from /var/tmp/rpm-tmp.Usxj2a (%build)
    Bad exit status from /var/tmp/rpm-tmp.Usxj2a (%build)
Child return code was: 1
EXCEPTION: Command failed. See logs for output.

Difference between working and failing build root:

	perl-Encode      2:2.68-1.fc22  >  2:2.70-1.fc22
	libgcc           4.9.2-5.fc22   >  5.0.0-0.7.fc22
	libgomp          4.9.2-5.fc22   >  5.0.0-0.7.fc22
	shared-mime-info 1.4-1.fc22     >  1.4-2.fc22
	libstdc++        4.9.2-5.fc22   >  5.0.0-0.7.fc22
	gcc-c++          4.9.2-5.fc22   >  5.0.0-0.7.fc22
	gcc              4.9.2-5.fc22   >  5.0.0-0.7.fc22
	isl              (none)         >  0.14-3.fc22
	cpp              4.9.2-5.fc22   >  5.0.0-0.7.fc22
	libstdc++-devel  4.9.2-5.fc22   >  5.0.0-0.7.fc22

Comment 1 Pavel Raiskup 2015-02-11 14:45:28 UTC
Thanks for the report.  Reproduced: initdb (more precisely its postgres
sub-process) eats a lot of space in the PGDATA dir when built with gcc-5.0.0
(not entirely sure gcc is the real trigger) together with -O2.  With -O1/-O0
initdb works fine.

I'll try to figure out how to debug this properly, or at least produce a
minimal example.  The postgres process aborts while initdb is feeding it the
bki file:

#0  0x00007ffff7330187 in raise () from /lib64/libc.so.6
#1  0x00007ffff7331dea in abort () from /lib64/libc.so.6
#2  0x000000000073a1c9 in errfinish (dummy=dummy@entry=0) at elog.c:569
#3  0x000000000073bbb0 in elog_finish (elevel=elevel@entry=22, fmt=fmt@entry=0x772b70 "cannot abort transaction %u, it was already committed") at elog.c:1362
#4  0x00000000004b05f3 in RecordTransactionAbort (isSubXact=isSubXact@entry=0 '\000') at xact.c:1467
#5  0x00000000004b06b4 in AbortTransaction () at xact.c:2415
#6  0x00000000004b3955 in AbortOutOfAnyTransaction () at xact.c:4000
#7  0x00000000007447c9 in ShutdownPostgres (code=<optimized out>, arg=<optimized out>) at postinit.c:1058
#8  0x000000000064994d in shmem_exit (code=code@entry=1) at ipc.c:230
#9  0x0000000000649a35 in proc_exit_prepare (code=code@entry=1) at ipc.c:187
#10 0x0000000000649aa8 in proc_exit (code=code@entry=1) at ipc.c:102
#11 0x000000000073a1f5 in errfinish (dummy=<optimized out>) at elog.c:555
#12 0x0000000000660ef8 in mdextend (reln=0xc0cd90, forknum=FSM_FORKNUM, blocknum=<optimized out>, buffer=0xc16660 "", skipFsync=<optimized out>) at md.c:527
#13 0x00000000006471c4 in fsm_extend (fsm_nblocks=1055795, rel=0xbefb80) at freespace.c:587
#14 fsm_readbuf (rel=rel@entry=0xbefb80, addr=..., addr@entry=..., extend=extend@entry=1 '\001') at freespace.c:525
#15 0x00000000006472ee in fsm_set_and_search (rel=rel@entry=0xbefb80, addr=..., slot=slot@entry=3518, newValue=<optimized out>, minValue=minValue@entry=6 '\006') at freespace.c:615
#16 0x000000000064778d in RecordAndGetPageWithFreeSpace (rel=rel@entry=0xbefb80, oldPage=oldPage@entry=4294967295, oldSpaceAvail=oldSpaceAvail@entry=0, spaceNeeded=spaceNeeded@entry=176) at freespace.c:159
#17 0x000000000048f1b2 in RelationGetBufferForTuple (relation=relation@entry=0xbefb80, len=176, otherBuffer=otherBuffer@entry=0, options=options@entry=0, bistate=bistate@entry=0x0, vmbuffer=vmbuffer@entry=0x7fffffffdb0c, 
    vmbuffer_other=0x0) at hio.c:414
#18 0x0000000000488c9a in heap_insert (relation=0xbefb80, tup=tup@entry=0xc07800, cid=<optimized out>, options=options@entry=0, bistate=bistate@entry=0x0) at heapam.c:2082
#19 0x00000000004899be in simple_heap_insert (relation=<optimized out>, tup=tup@entry=0xc07800) at heapam.c:2572
#20 0x00000000004cf211 in InsertOneTuple (objectid=1242) at bootstrap.c:799
#21 0x00000000004cdcd9 in boot_yyparse () at bootparse.y:277
#22 0x00000000004ce83f in BootstrapModeMain () at bootstrap.c:491
#23 AuxiliaryProcessMain (argc=5, argc@entry=6, argv=0xba2998, argv@entry=0xba2990) at bootstrap.c:411
#24 0x00000000004609db in main (argc=6, argv=0xba2990) at main.c:219

Pavel

Comment 2 Tom Lane 2015-02-11 15:31:03 UTC
Judging from the stack trace, I'd say that something is busted in the logic that determines where in a relation (aka table, file) there is a page with enough free space to insert a new tuple.  For some reason it's repeatedly deciding it can't find enough space and then extending the relation by another page.  This probably points to a compiler bug or overenthusiastic optimization manifesting somewhere in the FSM (free space map) logic.

I'm a bit busy right now but am willing to help out if you can't isolate it quickly.

Comment 3 Tom Lane 2015-02-12 23:13:27 UTC
I dug into this a bit in a rawhide mock installation.  It appears that your stack trace above is telling the truth that RelationGetBufferForTuple is passing oldPage=oldPage@entry=4294967295 to RecordAndGetPageWithFreeSpace.  The latter then goes nuts extending the free space map out to such a high block number.  (So it's not really an infinite loop, but it is consuming unreasonable amounts of disk space.)

Now the thing is that the logic in RelationGetBufferForTuple() looks like this:

while (targetBlock != InvalidBlockNumber)
{
   ... do a bunch of stuff that does not change targetBlock ...

   targetBlock = RecordAndGetPageWithFreeSpace(relation,
					       targetBlock,
					       pageFreeSpace,
					       len + saveFreeSpace);
}

It is therefore impossible on its face that this code ever passes 4294967295
(a/k/a InvalidBlockNumber) to RecordAndGetPageWithFreeSpace.  And yet it is
doing that: I put a test for oldPage == InvalidBlockNumber into RecordAndGetPageWithFreeSpace, and it fired.

I think we can safely classify this as a gcc bug, and a pretty bad one too.

Comment 4 Pavel Raiskup 2015-02-13 12:29:14 UTC
Thanks Tom for looking at it.  Yes, I agree - clear gcc bug.  I'm trying to cut
out a minimal example, and then I'll switch this over to gcc.

Comment 5 Pavel Raiskup 2015-02-13 14:47:31 UTC
Created attachment 991396 [details]
Small reproducer

Ok, it can definitely be "minimized" further, but for gcc purposes the attached
example should be good enough.

main() calls reproduce() from a different module, which in turn calls other
functions from yet another module.  Check the reproduce function
(reproducer.c), simplified:

    block = invalid;
    while (block != invalid) {
        // should never run
        printf("but this is run with -O2\n");
    }

=== wrong behavior (observed) ===

    $ ./configure CFLAGS="-O2 -g3" >/dev/null
    $ make >/dev/null
    $ ./hello
    equals(a,b): 0, a = 4294967295, b = 4294967295

=== correct behavior (expected) ===

    $ ./configure CFLAGS="-O2 -g3" >/dev/null
    $ make >/dev/null
    $ ./hello # should be silent

=== with -O0 program behaves correctly ===

    $ ./configure CFLAGS="-O0 -g3" >/dev/null
    $ make >/dev/null
    $ ./hello # should be silent

Comment 6 Marek Polacek 2015-02-13 15:56:49 UTC
Reduced:

int i;

__attribute__ ((noinline))
unsigned int foo (void)
{
  return 0;
}

int
main ()
{
  unsigned int u = -1;
  if (u == -1)
    {
      unsigned int n = foo ();
      if (n > 0)
	u = n - 1;
    }

  while (u != -1)
    {
      asm ("" : "+g" (u));
      u = -1;
      i = 1;
    }

  if (i)
    __builtin_abort ();
}

Comment 7 Marek Polacek 2015-02-13 15:58:24 UTC
Ok with -O and -O2 -fno-tree-vrp; fails with -O2.

Comment 8 Jakub Jelinek 2015-02-13 16:32:43 UTC
Tracking this upstream now.

Comment 9 Tom Lane 2015-02-15 17:36:44 UTC
Confirmed that postgresql-9.4.1-1.fc23 builds (including passing its self-tests) with gcc-5.0.0-0.13.fc23.x86_64, where it did not with gcc-5.0.0-0.12.fc23.x86_64.  Thanks for the quick turnaround!