432146 – nss miscompiles on x86_64 with gcc 4.3.0

Bug 432146 - nss miscompiles on x86_64 with gcc 4.3.0

Summary: nss miscompiles on x86_64 with gcc 4.3.0

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	gcc
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-02-09 02:46 UTC by Kai Engert (:kaie) (inactive account)
Modified:	2008-02-19 07:56 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-02-11 08:51:08 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Proposed patch (1.89 KB, patch) 2008-02-09 06:34 UTC, Wan-Teh Chang	no flags	Details \| Diff
preprocessed output (587.39 KB, text/plain) 2008-02-10 01:15 UTC, Matthew Gregan [:kinetik]	no flags	Details
Proposed patch v2 (by Jakub Jelinek) (4.90 KB, patch) 2008-02-14 07:10 UTC, Wan-Teh Chang	no flags	Details \| Diff
Patch v3 (4.52 KB, patch) 2008-02-15 01:36 UTC, Kai Engert (:kaie) (inactive account)	no flags	Details \| Diff
Patch v4 (4.51 KB, patch) 2008-02-16 05:08 UTC, Kai Engert (:kaie) (inactive account)	no flags	Details \| Diff
Show Obsolete (4) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
GNU Compiler Collection	35160	0	None	None	None	Never

Description Kai Engert (:kaie) (inactive account) 2008-02-09 02:46:06 UTC

A while ago this upstream bug got filed:
  https://bugzilla.mozilla.org/show_bug.cgi?id=407866

Apparently some asm opt no longer works when being built with gcc 4.3.0

One of the NSS developers claimed this must be fixed in gcc.

Now, this bug is still present on latest Rawhide, and we have a mass rebuild
with gcc 4.3.0 of everything scheduled.

Until a better solution is known, I plan to use a workaround that is proposed in
the upstream bug, which disables the optimization.

Comment 1 Wan-Teh Chang 2008-02-09 06:34:27 UTC

Created attachment 294453 [details]
Proposed patch

I found that compiling mozilla/security/nss/lib/freebl/mpi/mp_comba.c
without optimizations works.  But even -O1 breaks it.  I tried enabling
the optimizations in -O1 one by one, but couldn't identify which one
breaks it.  Perhaps it's a combination of optimizations that breaks,
or something else that I don't understand.

Finally, since mp_comba.c is third-party code (TomFastMath), I downloaded
the latest version (TomFastMath 0.12) and compared the inline assembly
code.  One of the inline assembly code blocks, SQRADDDB, is different.
I plugged in the version from TomFastMath 0.12, and it fixes this problem.
I don't understand inline assembly, so I don't know if this is a GCC 4.3.0
pre-release bug, or a bug of the original SQRADDDB.

I looked in the changes.txt file in TomFastMath 0.12:
http://libtom.org/?page=changes&newsitems=5&whatfile=tfm

but didn't spot anything that's related to this change.
We will need to ask Tom St. Denis.

Comment 2 Wan-Teh Chang 2008-02-09 15:18:50 UTC

Comment on attachment 294453 [details]
Proposed patch

Our mp_comba.c is based on TFM v0.03.  I found that the
change in this patch was made in TFM v0.04, in CVS rev. 1.10
of fp_sqr_comba.c:

http://libtom.org/cvs/cvsweb.pl/libtom/tomsfastmath/Attic/fp_sqr_comba.c

http://libtom.org/cvs/cvsweb.pl/libtom/tomsfastmath/Attic/fp_sqr_comba.c.diff?r
1=1.9;r2=1.10

The CVS commit comment just says "fixes", so it's not clear
what the bug was.  It's also difficult to determine if it is
safe to just take this change without the other changes between
TFM v0.03 and v0.04. The only change that may be relevant is
CVS rev. 1.5, made 11 days before this change.	The long gap
between the two changes suggests that they should be independent.

Comment 3 Matthew Gregan [:kinetik] 2008-02-09 15:54:41 UTC

I tried to narrow down the problem a bit further by adding the newer
implementation of SQRADDDB in parallel and swapping uses between the older and
newer implementation.

The single use that seems to be causing the problem in this case is the "output
25" section of s_mp_sqr_comba_16.  Switching SQRADDDB over to the new version
just in this section allows shsignlib to complete successfully.

Comment 4 Matthew Gregan [:kinetik] 2008-02-09 16:36:55 UTC

Disassembly of SQRADDB in "output 25" section, GCC 4.1.2:
    3f8c:       4c 01 c3                add    %r8,%rbx
    3f8f:       49 11 f9                adc    %rdi,%r9
    3f92:       48 11 f0                adc    %rsi,%rax
    3f95:       4c 01 c3                add    %r8,%rbx
    3f98:       49 11 f9                adc    %rdi,%r9
    3f9b:       48 11 f0                adc    %rsi,%rax

Disassembly of SQRADDB in "output 25" section, GCC 4.3:
    47fe:       49 01 c8                add    %rcx,%r8
    4801:       48 11 ce                adc    %rcx,%rsi
    4804:       48 11 ef                adc    %rbp,%rdi
    4807:       49 01 c8                add    %rcx,%r8
    480a:       48 11 ce                adc    %rcx,%rsi
    480d:       48 11 ef                adc    %rbp,%rdi

The reuse of %rcx looks wrong to me.  Looking through the rest of the
disassembly, SQRADDB is compiled the same way (reuse of %rcx), not just in the
"output 25" section.

Looking at the disassembly of mp_comba.c where I replaced the SQRADDB in the
"output 25" section with the newer implementation, the code emitted is different
(as expected) and looks right:

    4ab3:       4d 01 c8                add    %r9,%r8
    4ab6:       48 11 ee                adc    %rbp,%rsi
    4ab9:       48 11 cf                adc    %rcx,%rdi
    4abc:       4d 01 c8                add    %r9,%r8
    4abf:       48 11 ee                adc    %rbp,%rsi
    4ac2:       48 11 cf                adc    %rcx,%rdi

...but the interesting thing (which seems to explain why shsignlib works), is
that all of the other SQRADDB invocations (using the older implementation) now
have the following code emitted:

    4a25:       4c 01 d7                add    %r10,%rdi
    4a28:       49 11 e8                adc    %rbp,%r8
    4a2b:       48 11 ce                adc    %rcx,%rsi
    4a2e:       4c 01 d7                add    %r10,%rdi
    4a31:       49 11 e8                adc    %rbp,%r8
    4a34:       48 11 ce                adc    %rcx,%rsi

Comment 5 Wan-Teh Chang 2008-02-09 19:04:47 UTC

As I reported in comment 1, mp_comba.c works if compiled without
optimizations.  I tried to narrow down which compiler optimization
breaks it.

I first tried adding -fno-strict-aliasing.  It didn't help.

I then looked at the current GCC 4.3.0 manual and got the list
of optimization flags that -O turns on:

          -fauto-inc-dec \
          -fcprop-registers \
          -fdce \
          -fdefer-pop \
          -fdelayed-branch \
          -fdse \
          -fguess-branch-probability \
          -fif-conversion2 \
          -fif-conversion \
          -finline-small-functions \
          -fipa-pure-const \
          -fipa-reference \
          -fmerge-constants \
          -fsplit-wide-types \
          -ftree-ccp \
          -ftree-ch \
          -ftree-copyrename \
          -ftree-dce \
          -ftree-dominator-opts \
          -ftree-dse \
          -ftree-fre \
          -ftree-sra \
          -ftree-ter \
          -funit-at-a-time \
          -fomit-frame-pointer \  (on some platforms only)

(See http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options.)

-O breaks mp_comba.c, but if I use that list of optimization
flags, with or without -fomit-frame-pointer, mp_comba.c works.
This shows that -O is not equivalent to that list of optimization
flags.

Comment 6 Jakub Jelinek 2008-02-09 20:13:18 UTC

That's weird, because the patch in #c1 is buggy.  As all the 3 input only operands
are used after the 3 input/output registers have been modified, you need an
earlyclobber on all 3 (plus the input only arguments could have "g" constraint
rather than "r" I guess.

Anyway, can you please attach preprocessed source (mp_comba.i that you get e.g.
if you pass additional -save-temps option) and what exact options are used to
compile it?

Comment 7 Matthew Gregan [:kinetik] 2008-02-10 01:15:05 UTC

Created attachment 294481 [details]
preprocessed output

Compiler command line:

gcc -o
/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/nss/freebl/Linux_SINGLE_SHLIB/mp_comba.o
-c -O2 -fPIC -DLINUX1_2 -D_XOPEN_SOURCE -DLINUX2_1  -ansi -Wall
-Werror-implicit-function-declaration -Wno-switch -pipe -DLINUX -Dlinux
-D_POSIX_SOURCE -D_BSD_SOURCE -DHAVE_STRERROR -DXP_UNIX -DSHLIB_SUFFIX=\"so\"
-DSHLIB_PREFIX=\"lib\" -DSHLIB_VERSION=\"3\" -DSOFTOKEN_SHLIB_VERSION=\"3\"
-DRIJNDAEL_INCLUDE_TABLES -UDEBUG -DNDEBUG -D_REENTRANT -DNSS_ENABLE_ECC
-DUSE_UTIL_DIRECTLY -DNSS_USE_64 -DNSS_BEVAND_ARCFOUR -DMPI_AMD64
-DMP_ASSEMBLY_MULTIPLY -DNSS_USE_COMBA -DMP_CHAR_STORE_SLOW
-DMP_IS_LITTLE_ENDIAN -DMP_API_COMPATIBLE
-I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/include/nspr
-I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/include 
-I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/public/nss
-I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/private/nss
-I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/include 
-Impi -Iecl  mpi/mp_comba.c

Comment 8 Matthew Gregan [:kinetik] 2008-02-10 01:17:08 UTC

Oops, I missed that you were asking wtc for this info.  Sorry!  I hope you don't
mind me providing it instead.

Comment 9 Matthew Gregan [:kinetik] 2008-02-10 02:48:31 UTC

Going slightly further along the path of experimenting with specific
optimization flags doesn't reveal much beyond what wtc found.  According to:

% gcc -c -Q -O0 --help=optimizers > o0
% gcc -c -Q -O1 --help=optimizers > o1
% diff o0 o1 | grep enabled

...(which is hopefully more accurate than the manual), the following
optimizations are enabled at -O1:

        -fcprop-registers
        -fdefer-pop
        -fguess-branch-probability
        -fif-conversion
        -fif-conversion2
        -fipa-pure-const
        -fipa-reference
        -fmerge-constants
        -fomit-frame-pointer
        -fsplit-wide-types
        -ftree-ccp
        -ftree-ch
        -ftree-copy-prop
        -ftree-copyrename
        -ftree-dce
        -ftree-dominator-opts
        -ftree-dse
        -ftree-fre
        -ftree-salias
        -ftree-sink
        -ftree-sra
        -ftree-ter
        -funit-at-a-time

...which is a similar list to what's in the manual, with the addition of
-ftree-salias and -ftree-sink, and some flags dropped due to platform
differences, etc.

Building with -O0 and the above optimizations enabled results in a binary that
works.  Building with -O1 and the above optimizations switched off (all changed
to -fno-FOO) still results in a binary that fails.

Strangely, building with -O1 and the above optimizations off and -fweb enabled
results in a binary that works.  This is strange because, according to the
output from --help=optimizers at -O0 and -O1, -fweb is enabled in both cases anyway.

Comment 10 Matthew Gregan [:kinetik] 2008-02-10 05:26:06 UTC

All of the s_mp_sqr_comba functions have the following comment and line of code:

        /* get rid of some compiler warnings */
        sc0 = 0; sc1 = 0; sc2 = 0;

If I comment out the initialization of sc[0-2], GCC emits a warning saying that
these variables are used uninitialized, but the code emitted for SQRADDB looks
valid and shsignlib works as expected.  Both GCC 4.1.2 and 4.3 emit the same
warnings.

This explains why switching between -O0 with explicit optimization flags and -O1
with the same optimizations disabled resulted in -O0 builds working and -O1
builds failing--the warning for use of uninitialized variables (and presumably
any optimization based on detecting this) only occurs at -O1 and higher and does
not seemed to be controlled by any of the explicit optimization flags passed to
the -O0 builds.

Out of time for now, but I think the next step is to look at the code closely
and determine if sc[0-2] are really used before initialization, or if GCC
believes so incorrectly.

Comment 11 Jakub Jelinek 2008-02-11 08:51:08 UTC

Reduced, this is really a GCC bug, tracking it upstream.
Anyway, please don't apply the buggy patch in #c1.
What you could do instead is to stop using unneccessary constraints in various
inline asms and you'll then no longer hit this GCC bug.
Particularly (untested):
--- mp_comba.c.jj       2005-08-16 15:25:48.000000000 -0400
+++ mp_comba.c  2008-02-11 03:46:09.000000000 -0500
@@ -92,12 +92,12 @@ __asm__ (                               
 
 #define SQRADDSC(i, j)                                    \
 __asm__ (                                                     \
-     "movq  %6,%%rax     \n\t"                            \
-     "mulq  %7           \n\t"                            \
+     "movq  %3,%%rax     \n\t"                            \
+     "mulq  %4           \n\t"                            \
      "movq  %%rax,%0     \n\t"                            \
      "movq  %%rdx,%1     \n\t"                            \
      "xorq  %2,%2        \n\t"                            \
-     :"=r"(sc0), "=r"(sc1), "=r"(sc2): "0"(sc0), "1"(sc1), "2"(sc2), "g"(i),
"g"(j) :"%rax","%rdx","%cc");
+     :"=r"(sc0), "=r"(sc1), "=r"(sc2): "g"(i), "g"(j) :"%rax","%rdx","%cc");
 
 #define SQRADDAC(i, j)                                                         \
 __asm__ (                                                     \
@@ -116,7 +116,7 @@ __asm__ (                               
      "addq %3,%0         \n\t"                            \
      "adcq %4,%1         \n\t"                            \
      "adcq %5,%2         \n\t"                            \
-     :"=r"(c0), "=r"(c1), "=r"(c2), "=r"(sc0), "=r"(sc1), "=r"(sc2) : "0"(c0),
"1"(c1), "2"(c2), "3"(sc0), "4"(sc1), "5"(sc2) : "%cc");
+     :"=&r"(c0), "=&r"(c1), "=&r"(c2) : "0"(c0), "1"(c1), "2"(c2), "r"(sc0),
"r"(sc1), "r"(sc2) : "%cc");
 
 
 
@@ -651,8 +651,6 @@ void s_mp_mul_comba_32(const mp_int *A, 
 void s_mp_sqr_comba_4(const mp_int *A, mp_int *B)
 {
    mp_digit *a, b[8], c0, c1, c2, sc0, sc1, sc2;
-   /* get rid of some compiler warnings */
-   sc0 = 0; sc1 = 0; sc2 = 0;
 
    a = A->dp;
    COMBA_START; 
@@ -705,8 +703,6 @@ void s_mp_sqr_comba_4(const mp_int *A, m
 void s_mp_sqr_comba_8(const mp_int *A, mp_int *B)
 {
    mp_digit *a, b[16], c0, c1, c2, sc0, sc1, sc2;
-   /* get rid of some compiler warnings */
-   sc0 = 0; sc1 = 0; sc2 = 0;
 
    a = A->dp;
    COMBA_START; 
@@ -799,8 +795,6 @@ void s_mp_sqr_comba_8(const mp_int *A, m
 void s_mp_sqr_comba_16(const mp_int *A, mp_int *B)
 {
    mp_digit *a, b[32], c0, c1, c2, sc0, sc1, sc2;
-   /* get rid of some compiler warnings */
-   sc0 = 0; sc1 = 0; sc2 = 0;
 
    a = A->dp;
    COMBA_START; 
@@ -974,8 +968,6 @@ void s_mp_sqr_comba_16(const mp_int *A, 
 void s_mp_sqr_comba_32(const mp_int *A, mp_int *B)
 {
    mp_digit *a, b[64], c0, c1, c2, sc0, sc1, sc2;
-   /* get rid of some compiler warnings */
-   sc0 = 0; sc1 = 0; sc2 = 0;
 
    a = A->dp;
    COMBA_START; 

SQRADDSC doesn't care about the previous value of sc0/sc1/sc2, so there is no
need to pretend it needs it (and I believe thus you no longer need to initialize
sc0/sc1/sc2).  Similarly SQRADDDB doesn't every modify sc0/sc1/sc2, so it
shouldn't pretend it does that (but earlyclobbers are needed to tell GCC
that the input arguments are consumed after the output registers are modified.

Comment 12 Wan-Teh Chang 2008-02-13 15:50:06 UTC

Jakub, thanks for your help.  I just tested your suggested changes.

If I apply only your change to SQRADDSC, it works.
If I apply your changes to both SQRADDSC and SQRADDDB, it doesn't work.

Comment 13 Wan-Teh Chang 2008-02-14 07:10:19 UTC

Created attachment 294894 [details]
Proposed patch v2 (by Jakub Jelinek)

Jakub, I believe this is what you meant in comment 11.
In SQRADDDB, sc0, sc1, sc2 should be %6, %7, %8.

Comment 14 Jakub Jelinek 2008-02-14 08:08:42 UTC

Yeah, sorry.  Alternatively you could just have the 3 "r" constraints
before the "0"/"1"/"2" constraints and not change the numbers in the pattern string.

Comment 15 Wan-Teh Chang 2008-02-14 14:27:50 UTC

Jakub, I'm still not familiar with the asm constraint syntax.
Is this the alternative change you suggested?  Thanks.

@@ -116,7 +116,7 @@ __asm__ (                               
      "addq %3,%0         \n\t"                            \
      "adcq %4,%1         \n\t"                            \
      "adcq %5,%2         \n\t"                            \
-     :"=r"(c0), "=r"(c1), "=r"(c2), "=r"(sc0), "=r"(sc1), "=r"(sc2) : "0"(c0),
"1"(c1), "2"(c2), "3"(sc0), "4"(sc1), "5"(sc2) : "%cc");
+     :"=&r"(c0), "=&r"(c1), "=&r"(c2) : "r"(sc0), "r"(sc1), "r"(sc2), "0"(c0),
"1"(c1), "2"(c2) : "%cc");

Comment 16 Jakub Jelinek 2008-02-14 14:37:06 UTC

Yeah.  Though the #c13 patch is fine as well.

Comment 17 Kai Engert (:kaie) (inactive account) 2008-02-15 01:36:10 UTC

Created attachment 294959 [details]
Patch v3

This is patch v2 plus comment 15 incorporated.

I've successfully built an nss.rpm with that patch on a x86_64 rawhide system
using gcc 4.3.0.

I'm currently building an rpm in the fedora rawhide build system.

I took this as a motivation to finally get another open request done: the NSS
rpm build script now executes the nss test suite and requires that it passes
successfully.

Comment 18 Kai Engert (:kaie) (inactive account) 2008-02-16 05:08:09 UTC

Created attachment 295062 [details]
Patch v4

Comment 19 Kai Engert (:kaie) (inactive account) 2008-02-16 05:31:35 UTC

I plan to remove the workaround patch as soon as a fixed gcc is released for
rawhide, in order to keep the crypto in rawhide in synch with the upstream nss.

Comment 20 Jakub Jelinek 2008-02-16 10:23:20 UTC

Well, IMNSHO you want instead sync this patch to upstream.  It is not just an
workaround.

Comment 21 Kai Engert (:kaie) (inactive account) 2008-02-16 11:21:25 UTC

Jakub, may I ask you to write some arguments that support the idea to merge this
patch into upstream?

Nelson Bolyard has expressed a preference to avoid unnecessary changes, so some
arguments from you would be very helpful to convince him to accept it.

Thanks a lot.

Comment 22 Wan-Teh Chang 2008-02-17 07:13:50 UTC

Jakub, I have three questions about the inline assembly.

1. In SQRADDSC, sc2 corresponds to register %2.  We want
to set sc2 to 0, and this is done using the instruction:                       
                
      "xorq  %2,%2        \n\t"                            \

This is equivalent to the C code:
      sc2 = sc2 ^ sc2;

So in some sense sc2 is used as input.  Is it still correct
to say sc2 is output only?  (I understand that xorq instruction
is an idiom to load 0 into a register.)

2. In SQRADDDB we add earlyclobbers to the output c0/c1/c2.
What exactly is an "earlyclobber"?  What are the consequences
if we don't specify the earlyclobbers?

3. In SQRADDDB we have an alternative change to avoid changing
the numbers in the pattern string.  I'm just curious if having
"holes" in the numbers have any bad effects, and if people
usually try to avoid having holes in the numbers in the pattern
string.

Thanks!

Comment 23 Jakub Jelinek 2008-02-17 09:50:38 UTC

The arguments for changing the current code are:
1) you can remove the unnecessary workarounds where you need to initialize
sc* = 0 in order to avoid compiler warnings.  All the clearing of the vars
really has to be done in the generated code
2) by accurrately describing what the inline asm does you give the compiler more
freedom to generate better code, especially if register preassure is high.

xorq %2, %2 doesn't use the input in any way, because the integer registers are
non-trapping and so it doesn't matter which value the register had before.  Plus
the i?86/x86_64 hardware special case this as load of 0 internally anyway.
While talking about this, the pattern should be probably changed to avoid doing
the xorq and mentioning sc2 at all - instead just sc2 = 0; in C code inside of
the macro - then the compiler can schedule it separately as needed.

Earlyclobbers (see info gcc --index-search='Constraint Modifier') tell the
compiler that some input is used after the particular output has been
overwritten. SQRADDDB uses all inputs twice, the second time after changing
all 3 output registers, so you need 3 earlyclobbers.  Without this the compiler
could e.g. assign the same hard register to an output and input, provided it
has the same value.  Consider say:
  int output1 = 6, output2 = 4, input1 = 3, input2 = 6;
  asm ("addq %2, %0; addcq %3, %1" : "=r" (output1), "=r" (output2) : "r"
(input1), "r" (input2), "0" (output1), "1" (output2));
The constraints are incorrect, because the compiler could very well assign
e.g. output1 = %rdi, output2 = %rsi, input1 = %rcx, input2 = %rdi
movl 6, %rdi; movl 4, %rsi; movl 3, %rcx;
addq %rcx, %rdi; addcq %rdi, %rsi
whic won't add 6 to %rsi, but 9.  In this pattern only one earlyclobber would be
needed, "=&r" (output1) to tell gcc that output1 must not be shared with any
input regs.

"holes" really don't matter, unless you need more than 10 operands, which would
force you to use %[name] style.

Comment 24 Kai Engert (:kaie) (inactive account) 2008-02-19 07:56:44 UTC

I see that latest rawhide now contains the updated gcc-4.3.0-9, which has the
fix for this bug, according to changelog.

I tested unmodified NSS, and I confirm it works correctly.

Note You need to log in before you can comment on or make changes to this bug.