Bug 1546964 - pdftex segfaults on i686 (breaks multiple package builds)
Summary: pdftex segfaults on i686 (breaks multiple package builds)
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: texlive
Version: rawhide
Hardware: i686
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Tom "spot" Callaway
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1546913
Depends On:
Blocks: 1538855 1546676 1547112
Reported: 2018-02-20 07:50 UTC by Adam Williamson
Modified: 2018-02-23 03:30 UTC (History)
CC List: 11 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-02-22 22:01:49 UTC
Type: Bug
Embargoed:

Links
System                   ID     Private  Priority  Status  Summary  Last Updated
GNU Compiler Collection  84478  0        None      None    None     2018-02-20 22:05:33 UTC

Description Adam Williamson 2018-02-20 07:50:11 UTC
Since it was rebuilt with GCC 8 - we think - pdftex is segfaulting on i686. This is breaking at least two package builds, OpenColorIO:

https://koji.fedoraproject.org/koji/taskinfo?taskID=25176488

which is ultimately in the dependency path of a key component of KDE, Calligra, and R-htmltools:

https://koji.fedoraproject.org/koji/taskinfo?taskID=25173683

It's very likely breaking / going to break other package builds too.

Both segfault like this during doc generation:

mktexfmt [INFO]: --- remaking pdflatex with pdftex
mktexfmt: running `pdftex -ini   -jobname=pdflatex -progname=pdflatex -translate-file=cp227.tcx *pdflatex.ini' ...
sh: line 1: 29445 Segmentation fault      (core dumped) pdftex -ini -jobname=pdflatex -progname=pdflatex -translate-file=cp227.tcx *pdflatex.ini 1>&2 < /dev/null
mktexfmt [ERROR]: running `pdftex -ini   -jobname=pdflatex -progname=pdflatex -translate-file=cp227.tcx *pdflatex.ini >&2 </dev/null' return status 139
mktexfmt [ERROR]: `pdftex -ini   -jobname=pdflatex -progname=pdflatex -translate-file=cp227.tcx *pdflatex.ini >&2 </dev/null' failed (no pdflatex.fmt)

I have shelled into the OpenColorIO build process in a mock and got a backtrace out of gdb. However, that's hardly the end of the story, because this is goddamn texlive so of course it isn't:

1834	pdftex-pool.c: No such file or directory.
(gdb) thread apply all bt full

Thread 1 (Thread 0xf731d740 (LWP 110)):
#0  0x565cb88d in loadpoolstrings (spare_size=6160000) at pdftex-pool.c:1834
        l = <optimized out>
        s = 0x56824001 <error: Cannot access memory at address 0x56824001>
        g = 0
        i = 1727
        j = <optimized out>
#1  0x565651e0 in getstringsstarted () at pdftexini.c:649
        Result = <optimized out>
        k = <optimized out>
        l = <optimized out>
        g = <optimized out>
#2  0x56572d97 in mainbody () at pdftexini.c:5301
        eqtb = 0xf4255010
#3  0x5655d9f3 in main (ac=6, av=0xffffd674) at ../../../texk/web2c/lib/texmfmp.c:1013
No locals.

Yup: it's crashing in a file that, so far as Fedora debuginfo is concerned, doesn't exist. This is because pdftex-pool.c is *generated on the fly during the build process*, because there is not enough whisky in the goddamn world.

This is where loadpoolstrings *ultimately* actually gets defined, as best as I can tell:

https://tug.org/svn/pdftex/branches/stable/source/src/texk/web2c/web2c/makecpool.c?view=markup

around line 81. Naturally it uses variables with single-character names. See above note in re whisky.

This is about as far as I've got with this mess so far. Updates as and when the whisky resupply truck arrives.

Comment 1 Adam Williamson 2018-02-20 07:58:27 UTC
Kevin, Rex: this is why I still can't rebuild OpenColorIO, which is why we can't rebuild Calligra, which is why Calligra has dependency problems and isn't in the live images ATM.

Comment 2 Tom Hughes 2018-02-20 08:23:10 UTC
It broke gdal as well (https://koji.fedoraproject.org/koji/buildinfo?buildID=1044737) because the pdftex segv caused the noarch packages to not match between architectures.

Comment 3 Tom Hughes 2018-02-20 08:24:24 UTC
It looks like armv7hl is broken as well so it's likely something specific to 32 bit platforms.

Comment 4 Adam Williamson 2018-02-20 08:51:21 UTC
Yeah, I think this hoopy makecpool nuttiness is overflowing something. It's basically stuffing the contents of some 'pool' files (which I think are *also* dynamically generated? or something) into this 'poolfilearr' array, then iterating over it. 's' in loadpoolstrings is meant to be the next string read out of poolfilearr, but at some point it ends up pointing somewhere it shouldn't (because something is too big, or something?): gdb can't read the memory at that address, hence the 'Cannot access memory' error, and everything blows up. But that's as far as my limited C skills take me, especially at this time of night.

It makes fuzzy sense to me that this would happen on 32-bit but not 64-bit (poolfilearr is probably bigger on 64-bit, or something), but not *specific* sense yet.
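
For anyone trying to picture the data structure being discussed here, the following is a minimal sketch of the general shape, not the real generated pdftex-pool.c: a table of string literals terminated by a NULL entry, and a loader that copies each string into a buffer. The names sketch_pool and sketch_load_pool, the sample strings, and the explicit size check are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the generated table: a few string
   literals followed by a NULL sentinel that ends the iteration. */
const char *const sketch_pool[] = { "mu", "", "fil", NULL };

/* Copy every string of a NULL-terminated table into dst, back to back.
   Returns the number of bytes copied, or (size_t)-1 if dst is too small.
   This is an illustrative loader, not the pdftex implementation. */
size_t sketch_load_pool(const char *const *table,
                        unsigned char *dst, size_t dst_size)
{
    size_t used = 0;

    for (size_t j = 0; table[j] != NULL; j++) {
        size_t len = strlen(table[j]);

        /* used never exceeds dst_size, so this subtraction cannot wrap. */
        if (len > dst_size - used)
            return (size_t)-1;

        memcpy(dst + used, table[j], len);
        used += len;
    }
    return used;
}
```

In a loop of this kind every read stays inside one string literal and the table itself is read-only static data, so the array contents alone should not be able to push the pointer out of bounds when the code is compiled correctly.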

Comment 5 Adam Williamson 2018-02-20 09:26:13 UTC
If it helps anyone fiddling with this, here's how you can get to interact with it:

1. Grab https://kojipkgs.fedoraproject.org//work/tasks/6436/25176436/OpenColorIO-1.1.0-3.fc28.src.rpm
2. Try to build it in a fedora-rawhide-i386 mock: 'mock -r fedora-rawhide-i386 --rebuild OpenColorIO-1.1.0-3.fc28.src.rpm', it should fail
3. Shell into the mock: 'mock -r fedora-rawhide-i386 --shell' (or you may want to use 'mock -r fedora-rawhide-i386 --enable-network --shell' so you can use the network in the mock shell, e.g. to install debuginfo packages)
4. cd /builddir/build/BUILD/OpenColorIO-1.1.0/build/
5. pdftex -ini   -jobname=pdflatex -progname=pdflatex -translate-file=cp227.tcx *pdflatex.ini

That last command should trigger the crash each time you run it. You can use 'mock -r fedora-rawhide-i386 --install (package)' from outside the mock to install packages; if you do 'mock -r fedora-rawhide-i386 --install dnf' then shell into the mock with --enable-network you can use dnf from within the mock to install debuginfo packages and stuff.

Comment 6 Tom Hughes 2018-02-20 09:44:51 UTC
So it looks from the stack trace like s is bogus, but that is just tracking the generated strings in poolfilearr which is all just static data generated when the C file is created and I can't see any obvious bug.

The valgrind trace is much the same as the crash:

==19== Invalid read of size 1
==19==    at 0x17E88D: loadpoolstrings (pdftex-pool.c:1834)
==19==    by 0x1181DF: getstringsstarted (pdftexini.c:649)
==19==    by 0x125D96: mainbody (pdftexini.c:5301)
==19==    by 0x1109F2: main (texmfmp.c:1013)
==19==  Address 0x271000 is not stack'd, malloc'd or (recently) free'd

So far I'm leaning towards a compiler bug...

Comment 7 Jakub Jelinek 2018-02-20 12:00:10 UTC
Reduced testcase for -m32 -O2:
long poolptr;
unsigned char *strpool;
static const char *poolfilearr[] = {
  "mu",
#define A "",
#define B A A A A A A A A A A
#define C B B B B B B B B B B
#define D C C C C C C C C C C
  D C C C C C C C B B B A
 ((void *)0) 
};

__attribute__((noipa)) long
makestring (void)
{
  return 0;
}

__attribute__((noipa))
int
loadpoolstrings (long spare_size)
{
  const char *s;
  long g = 0;
  int i = 0, j = 0;
  while ((s = poolfilearr[j++]))
    {
      int l = __builtin_strlen (s);
      i += l;
      if (i >= spare_size) return 0;
      while (l-- > 0) strpool[poolptr++] = *s++;
      g = makestring ();
    }
  return g;
}

int
main ()
{
  poolptr = 0;
  strpool = __builtin_malloc (1000);
  asm volatile ("" : : : "memory");
  volatile int r = loadpoolstrings (1000);
  __builtin_free (strpool);
  return 0;
}

Looking into this.
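
As a rough sanity check on what the reduced testcase above is expected to do when compiled correctly, the sketch below rebuilds the same table and counts its contents. The helper names (testcase_table, testcase_string_count, testcase_total_bytes) are hypothetical and NULL stands in for the ((void *)0) of the original; this is only an aid to reading the testcase, not part of the GCC report.

```c
#include <stddef.h>
#include <string.h>

/* Same macro scheme as the testcase: each A contributes one empty string. */
#define A "",
#define B A A A A A A A A A A
#define C B B B B B B B B B B
#define D C C C C C C C C C C

static const char *const testcase_table[] = {
  "mu",
  D C C C C C C C B B B A
  NULL
};

#undef A
#undef B
#undef C
#undef D

/* Number of strings before the NULL sentinel. */
size_t testcase_string_count(void)
{
    size_t n = 0;
    while (testcase_table[n] != NULL)
        n++;
    return n;
}

/* Total number of characters the copy loop would have to move. */
size_t testcase_total_bytes(void)
{
    size_t total = 0;
    for (size_t j = 0; testcase_table[j] != NULL; j++)
        total += strlen(testcase_table[j]);
    return total;
}
```

The macros expand to 1731 empty strings, so with "mu" the table holds 1732 strings and only 2 bytes of text. The running total in loadpoolstrings therefore never approaches the spare_size of 1000, and a correctly compiled loop should simply run to the NULL sentinel and return 0.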

Comment 8 Rex Dieter 2018-02-20 13:56:57 UTC
Should be possible to do a bootstrapped OpenColorIO build (without docs or whatever it uses pdftex for).  I'll look into that.

Comment 9 Rex Dieter 2018-02-20 14:36:50 UTC
A bootstrapped OpenColorIO build is underway (looks promising, several arches have completed already). I opened bug #1547112 to track this one so that docs can be re-enabled someday.

Comment 10 Tom Hughes 2018-02-20 18:02:50 UTC
*** Bug 1546913 has been marked as a duplicate of this bug. ***

Comment 11 Dominik 'Rathann' Mierzejewski 2018-02-21 11:04:19 UTC
It looks like gcc-8.0.1-0.15 contains the fix, so once it's done building, pdftex can be fixed by rebuilding. Big thanks to Jakub, Adam and everyone else involved!

Comment 12 Adam Williamson 2018-02-21 16:18:35 UTC
The fun part is that we ship pdftex out of texlive, so we'll get to rebuild the whole of texlive...which is always fun! Especially with a new GCC. It looks like dtardon did manage to build it on 2018-02-15, though, so maybe it'll be OK.

Comment 13 Sergio Basto 2018-02-21 20:58:51 UTC
gcc-8.0.1-0.15.fc28 has finished building. Who is going to rebuild pdftex?

Comment 14 Adam Williamson 2018-02-21 21:50:05 UTC
I've got it.

Here's a fun note: I think texlive's own test suite actually caught this bug, in this build:

https://koji.fedoraproject.org/koji/taskinfo?taskID=24899896

at least, two pdftex tests failed on i686 in that build:

FAIL: pdftexdir/wprob.test
FAIL: pdftexdir/pdfimage.test

along with several others...then dtardon just turned off the failing tests and built it again :/

I understand why, though - he wanted to get it rebuilt for a poppler soname bump. Just very unfortunate that the poppler soname bump and the GCC 8 update coincided like that :(

I am going to rebuild it with the failing tests still turned off first, then fire a build with the tests turned back *on* and see how many still fail; perhaps we have other issues with GCC to fix here.

Comment 15 Jakub Jelinek 2018-02-22 08:41:53 UTC
This should be fixed in gcc-8.0.1-0.15.fc{28,29}, texlive needs to be rebuilt.

Comment 16 Adam Williamson 2018-02-22 10:53:14 UTC
It already is rebuilt. Well, for fc29. The fc28 build ran but failed due to a koji issue; puiterwijk has fixed that and I've re-fired it.

Comment 17 Sergio Basto 2018-02-22 16:53:12 UTC
BTW, to summarize my thoughts: IMHO, if we disable the tests, we should do it the same way gpgme does [1] (%bcond_without check) and disable the whole check step.

GCC 8 has make check || :, and about 90% of the build time is spent in the tests, so by disabling the tests entirely we could save a lot of hours.


[1]
https://src.fedoraproject.org/rpms/gpgme/blob/master/f/gpgme.spec

Comment 18 Jakub Jelinek 2018-02-22 17:01:49 UTC
While GCC uses make check || :, I have scripts that record the test results from the build.log files and compare them regularly, and know what FAILs are blockers and what aren't that important.  Especially for the compiler it is a very bad idea to skip the tests.

Comment 19 Adam Williamson 2018-02-22 22:01:49 UTC
F28 and F29 builds are done now.

Comment 20 Sergio Basto 2018-02-23 03:30:12 UTC
(In reply to Jakub Jelinek from comment #18)
> While GCC uses make check || :, I have scripts that record the test results
> from the build.log files and compare them regularly, and know what FAILs are
> blockers and what aren't that important.  Especially for the compiler it is
> a very bad idea to skip the tests.

I'm just asking you to be reasonable: when we are going to ignore the test results anyway, we may as well disable the tests entirely and save about 20 hours of build time (or at least a huge amount of time).
I'm not asking to disable tests forever, just when we need to save time. Note that gpgme has its tests enabled, but we can disable them when needed (for example, in a soname bump emergency).

Best regards,

