A while ago this upstream bug got filed: https://bugzilla.mozilla.org/show_bug.cgi?id=407866 Apparently some asm opt no longer works when being built with gcc 4.3.0 One of the NSS developers claimed this must be fixed in gcc. Now, this bug is still present on latest Rawhide, and we have a mass rebuild with gcc 4.3.0 of everything scheduled. Until a better solution is known, I plan to use a workaround that is proposed in the upstream bug, which disables the optimization.
Created attachment 294453 [details] Proposed patch I found that compiling mozilla/security/nss/lib/freebl/mpi/mp_comba.c without optimizations works. But even -O1 breaks it. I tried enabling the optimizations in -O1 one by one, but couldn't identify which one breaks it. Perhaps it's a combination of optimizations that breaks, or something else that I don't understand. Finally, since mp_comba.c is third-party code (TomFastMath), I downloaded the latest version (TomFastMath 0.12) and compared the inline assembly code. One of the inline assembly code blocks, SQRADDDB, is different. I plugged in the version from TomFastMath 0.12, and it fixes this problem. I don't understand inline assembly, so I don't know if this is a GCC 4.3.0 pre-release bug, or a bug of the original SQRADDDB. I looked in the changes.txt file in TomFastMath 0.12: http://libtom.org/?page=changes&newsitems=5&whatfile=tfm but didn't spot anything that's related to this change. We will need to ask Tom St. Denis.
Comment on attachment 294453 [details] Proposed patch Our mp_comba.c is based on TFM v0.03. I found that the change in this patch was made in TFM v0.04, in CVS rev. 1.10 of fp_sqr_comba.c: http://libtom.org/cvs/cvsweb.pl/libtom/tomsfastmath/Attic/fp_sqr_comba.c http://libtom.org/cvs/cvsweb.pl/libtom/tomsfastmath/Attic/fp_sqr_comba.c.diff?r 1=1.9;r2=1.10 The CVS commit comment just says "fixes", so it's not clear what the bug was. It's also difficult to determine if it is safe to just take this change without the other changes between TFM v0.03 and v0.04. The only change that may be relevant is CVS rev. 1.5, made 11 days before this change. The long gap between the two changes suggests that they should be independent.
I tried to narrow down the problem a bit further by adding the newer implementation of SQRADDDB in parallel and swapping uses between the older and newer implementation. The single use that seems to be causing the problem in this case is the "output 25" section of s_mp_sqr_comba_16. Switching SQRADDDB over to the new version just in this section allows shsignlib to complete successfully.
Disassembly of SQRADDB in "output 25" section, GCC 4.1.2: 3f8c: 4c 01 c3 add %r8,%rbx 3f8f: 49 11 f9 adc %rdi,%r9 3f92: 48 11 f0 adc %rsi,%rax 3f95: 4c 01 c3 add %r8,%rbx 3f98: 49 11 f9 adc %rdi,%r9 3f9b: 48 11 f0 adc %rsi,%rax Disassembly of SQRADDB in "output 25" section, GCC 4.3: 47fe: 49 01 c8 add %rcx,%r8 4801: 48 11 ce adc %rcx,%rsi 4804: 48 11 ef adc %rbp,%rdi 4807: 49 01 c8 add %rcx,%r8 480a: 48 11 ce adc %rcx,%rsi 480d: 48 11 ef adc %rbp,%rdi The reuse of %rcx looks wrong to me. Looking through the rest of the disassembly, SQRADDB is compiled the same way (reuse of %rcx), not just in the "output 25" section. Looking at the disassembly of mp_comba.c where I replaced the SQRADDB in the "output 25" section with the newer implementation, the code emitted is different (as expected) and looks right: 4ab3: 4d 01 c8 add %r9,%r8 4ab6: 48 11 ee adc %rbp,%rsi 4ab9: 48 11 cf adc %rcx,%rdi 4abc: 4d 01 c8 add %r9,%r8 4abf: 48 11 ee adc %rbp,%rsi 4ac2: 48 11 cf adc %rcx,%rdi ...but the interesting thing (which seems to explain why shsignlib works), is that all of the other SQRADDB invocations (using the older implementation) now have the following code emitted: 4a25: 4c 01 d7 add %r10,%rdi 4a28: 49 11 e8 adc %rbp,%r8 4a2b: 48 11 ce adc %rcx,%rsi 4a2e: 4c 01 d7 add %r10,%rdi 4a31: 49 11 e8 adc %rbp,%r8 4a34: 48 11 ce adc %rcx,%rsi
As I reported in comment 1, mp_comba.c works if compiled without optimizations. I tried to narrow down which compiler optimization breaks it. I first tried adding -fno-strict-aliasing. It didn't help. I then looked at the current GCC 4.3.0 manual and got the list of optimization flags that -O turns on: -fauto-inc-dec \ -fcprop-registers \ -fdce \ -fdefer-pop \ -fdelayed-branch \ -fdse \ -fguess-branch-probability \ -fif-conversion2 \ -fif-conversion \ -finline-small-functions \ -fipa-pure-const \ -fipa-reference \ -fmerge-constants \ -fsplit-wide-types \ -ftree-ccp \ -ftree-ch \ -ftree-copyrename \ -ftree-dce \ -ftree-dominator-opts \ -ftree-dse \ -ftree-fre \ -ftree-sra \ -ftree-ter \ -funit-at-a-time \ -fomit-frame-pointer \ (on some platforms only) (See http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options.) -O breaks mp_comba.c, but if I use that list of optimization flags, with or without -fomit-frame-pointer, mp_comba.c works. This shows that -O is not equivalent to that list of optimization flags.
That's weird, because the patch in #c1 is buggy. As all the 3 input only operands are used after the 3 input/output registers have been modified, you need an earlyclobber on all 3 (plus the input only arguments could have "g" constraint rather than "r" I guess. Anyway, can you please attach preprocessed source (mp_comba.i that you get e.g. if you pass additional -save-temps option) and what exact options are used to compile it?
Created attachment 294481 [details] preprocessed output Compiler command line: gcc -o /home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/nss/freebl/Linux_SINGLE_SHLIB/mp_comba.o -c -O2 -fPIC -DLINUX1_2 -D_XOPEN_SOURCE -DLINUX2_1 -ansi -Wall -Werror-implicit-function-declaration -Wno-switch -pipe -DLINUX -Dlinux -D_POSIX_SOURCE -D_BSD_SOURCE -DHAVE_STRERROR -DXP_UNIX -DSHLIB_SUFFIX=\"so\" -DSHLIB_PREFIX=\"lib\" -DSHLIB_VERSION=\"3\" -DSOFTOKEN_SHLIB_VERSION=\"3\" -DRIJNDAEL_INCLUDE_TABLES -UDEBUG -DNDEBUG -D_REENTRANT -DNSS_ENABLE_ECC -DUSE_UTIL_DIRECTLY -DNSS_USE_64 -DNSS_BEVAND_ARCFOUR -DMPI_AMD64 -DMP_ASSEMBLY_MULTIPLY -DNSS_USE_COMBA -DMP_CHAR_STORE_SLOW -DMP_IS_LITTLE_ENDIAN -DMP_API_COMPATIBLE -I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/include/nspr -I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/include -I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/public/nss -I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/private/nss -I/home/kinetik/work/mozilla-cvs/obj-opt-x86_64-unknown-linux-gnu/dist/include -Impi -Iecl mpi/mp_comba.c
Oops, I missed that you were asking wtc for this info. Sorry! I hope you don't mind me providing it instead.
Going slightly further along the path of experimenting with specific optimization flags doesn't reveal much beyond what wtc found. According to: % gcc -c -Q -O0 --help=optimizers > o0 % gcc -c -Q -O1 --help=optimizers > o1 % diff o0 o1 | grep enabled ...(which is hopefully more accurate than the manual), the following optimizations are enabled at -O1: -fcprop-registers -fdefer-pop -fguess-branch-probability -fif-conversion -fif-conversion2 -fipa-pure-const -fipa-reference -fmerge-constants -fomit-frame-pointer -fsplit-wide-types -ftree-ccp -ftree-ch -ftree-copy-prop -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-salias -ftree-sink -ftree-sra -ftree-ter -funit-at-a-time ...which is a similar list to what's in the manual, with the addition of -ftree-salias and -ftree-sink, and some flags dropped due to platform differences, etc. Building with -O0 and the above optimizations enabled results in a binary that works. Building with -O1 and the above optimizations switched off (all changed to -fno-FOO) still results in a binary that fails. Strangely, building with -O1 and the above optimizations off and -fweb enabled results in a binary that works. This is strange because, according to the output from --help=optimizers at -O0 and -O1, -fweb is enabled in both cases anyway.
All of the s_mp_sqr_comba functions have the following comment and line of code: /* get rid of some compiler warnings */ sc0 = 0; sc1 = 0; sc2 = 0; If I comment out the initialization of sc[0-2], GCC emits a warning saying that these variables are used uninitialized, but the code emitted for SQRADDB looks valid and shsignlib works as expected. Both GCC 4.1.2 and 4.3 emit the same warnings. This explains why switching between -O0 with explicit optimization flags and -O1 with the same optimizations disabled resulted in -O0 builds working and -O1 builds failing--the warning for use of uninitialized variables (and presumably any optimization based on detecting this) only occurs at -O1 and higher and does not seemed to be controlled by any of the explicit optimization flags passed to the -O0 builds. Out of time for now, but I think the next step is to look at the code closely and determine if sc[0-2] are really used before initialization, or if GCC believes so incorrectly.
Reduced, this is really a GCC bug, tracking it upstream. Anyway, please don't apply the buggy patch in #c1. What you could do instead is to stop using unneccessary constraints in various inline asms and you'll then no longer hit this GCC bug. Particularly (untested): --- mp_comba.c.jj 2005-08-16 15:25:48.000000000 -0400 +++ mp_comba.c 2008-02-11 03:46:09.000000000 -0500 @@ -92,12 +92,12 @@ __asm__ ( #define SQRADDSC(i, j) \ __asm__ ( \ - "movq %6,%%rax \n\t" \ - "mulq %7 \n\t" \ + "movq %3,%%rax \n\t" \ + "mulq %4 \n\t" \ "movq %%rax,%0 \n\t" \ "movq %%rdx,%1 \n\t" \ "xorq %2,%2 \n\t" \ - :"=r"(sc0), "=r"(sc1), "=r"(sc2): "0"(sc0), "1"(sc1), "2"(sc2), "g"(i), "g"(j) :"%rax","%rdx","%cc"); + :"=r"(sc0), "=r"(sc1), "=r"(sc2): "g"(i), "g"(j) :"%rax","%rdx","%cc"); #define SQRADDAC(i, j) \ __asm__ ( \ @@ -116,7 +116,7 @@ __asm__ ( "addq %3,%0 \n\t" \ "adcq %4,%1 \n\t" \ "adcq %5,%2 \n\t" \ - :"=r"(c0), "=r"(c1), "=r"(c2), "=r"(sc0), "=r"(sc1), "=r"(sc2) : "0"(c0), "1"(c1), "2"(c2), "3"(sc0), "4"(sc1), "5"(sc2) : "%cc"); + :"=&r"(c0), "=&r"(c1), "=&r"(c2) : "0"(c0), "1"(c1), "2"(c2), "r"(sc0), "r"(sc1), "r"(sc2) : "%cc"); @@ -651,8 +651,6 @@ void s_mp_mul_comba_32(const mp_int *A, void s_mp_sqr_comba_4(const mp_int *A, mp_int *B) { mp_digit *a, b[8], c0, c1, c2, sc0, sc1, sc2; - /* get rid of some compiler warnings */ - sc0 = 0; sc1 = 0; sc2 = 0; a = A->dp; COMBA_START; @@ -705,8 +703,6 @@ void s_mp_sqr_comba_4(const mp_int *A, m void s_mp_sqr_comba_8(const mp_int *A, mp_int *B) { mp_digit *a, b[16], c0, c1, c2, sc0, sc1, sc2; - /* get rid of some compiler warnings */ - sc0 = 0; sc1 = 0; sc2 = 0; a = A->dp; COMBA_START; @@ -799,8 +795,6 @@ void s_mp_sqr_comba_8(const mp_int *A, m void s_mp_sqr_comba_16(const mp_int *A, mp_int *B) { mp_digit *a, b[32], c0, c1, c2, sc0, sc1, sc2; - /* get rid of some compiler warnings */ - sc0 = 0; sc1 = 0; sc2 = 0; a = A->dp; COMBA_START; @@ -974,8 +968,6 @@ void s_mp_sqr_comba_16(const mp_int *A, void s_mp_sqr_comba_32(const mp_int *A, mp_int *B) { mp_digit *a, b[64], c0, c1, c2, sc0, sc1, sc2; - /* get rid of some compiler warnings */ - sc0 = 0; sc1 = 0; sc2 = 0; a = A->dp; COMBA_START; SQRADDSC doesn't care about the previous value of sc0/sc1/sc2, so there is no need to pretend it needs it (and I believe thus you no longer need to initialize sc0/sc1/sc2). Similarly SQRADDDB doesn't every modify sc0/sc1/sc2, so it shouldn't pretend it does that (but earlyclobbers are needed to tell GCC that the input arguments are consumed after the output registers are modified.
Jakub, thanks for your help. I just tested your suggested changes. If I apply only your change to SQRADDSC, it works. If I apply your changes to both SQRADDSC and SQRADDDB, it doesn't work.
Created attachment 294894 [details] Proposed patch v2 (by Jakub Jelinek) Jakub, I believe this is what you meant in comment 11. In SQRADDDB, sc0, sc1, sc2 should be %6, %7, %8.
Yeah, sorry. Alternatively you could just have the 3 "r" constraints before the "0"/"1"/"2" constraints and not change the numbers in the pattern string.
Jakub, I'm still not familiar with the asm constraint syntax. Is this the alternative change you suggested? Thanks. @@ -116,7 +116,7 @@ __asm__ ( "addq %3,%0 \n\t" \ "adcq %4,%1 \n\t" \ "adcq %5,%2 \n\t" \ - :"=r"(c0), "=r"(c1), "=r"(c2), "=r"(sc0), "=r"(sc1), "=r"(sc2) : "0"(c0), "1"(c1), "2"(c2), "3"(sc0), "4"(sc1), "5"(sc2) : "%cc"); + :"=&r"(c0), "=&r"(c1), "=&r"(c2) : "r"(sc0), "r"(sc1), "r"(sc2), "0"(c0), "1"(c1), "2"(c2) : "%cc");
Yeah. Though the #c13 patch is fine as well.
Created attachment 294959 [details] Patch v3 This is patch v2 plus comment 15 incorporated. I've successfully built an nss.rpm with that patch on a x86_64 rawhide system using gcc 4.3.0. I'm currently building an rpm in the fedora rawhide build system. I took this as a motivation to finally get another open request done: the NSS rpm build script now executes the nss test suite and requires that it passes successfully.
Created attachment 295062 [details] Patch v4
I plan to remove the workaround patch as soon as a fixed gcc is released for rawhide, in order to keep the crypto in rawhide in synch with the upstream nss.
Well, IMNSHO you want instead sync this patch to upstream. It is not just an workaround.
Jakub, may I ask you to write some arguments that support the idea to merge this patch into upstream? Nelson Bolyard has expressed a preference to avoid unnecessary changes, so some arguments from you would be very helpful to convince him to accept it. Thanks a lot.
Jakub, I have three questions about the inline assembly. 1. In SQRADDSC, sc2 corresponds to register %2. We want to set sc2 to 0, and this is done using the instruction: "xorq %2,%2 \n\t" \ This is equivalent to the C code: sc2 = sc2 ^ sc2; So in some sense sc2 is used as input. Is it still correct to say sc2 is output only? (I understand that xorq instruction is an idiom to load 0 into a register.) 2. In SQRADDDB we add earlyclobbers to the output c0/c1/c2. What exactly is an "earlyclobber"? What are the consequences if we don't specify the earlyclobbers? 3. In SQRADDDB we have an alternative change to avoid changing the numbers in the pattern string. I'm just curious if having "holes" in the numbers have any bad effects, and if people usually try to avoid having holes in the numbers in the pattern string. Thanks!
The arguments for changing the current code are: 1) you can remove the unnecessary workarounds where you need to initialize sc* = 0 in order to avoid compiler warnings. All the clearing of the vars really has to be done in the generated code 2) by accurrately describing what the inline asm does you give the compiler more freedom to generate better code, especially if register preassure is high. xorq %2, %2 doesn't use the input in any way, because the integer registers are non-trapping and so it doesn't matter which value the register had before. Plus the i?86/x86_64 hardware special case this as load of 0 internally anyway. While talking about this, the pattern should be probably changed to avoid doing the xorq and mentioning sc2 at all - instead just sc2 = 0; in C code inside of the macro - then the compiler can schedule it separately as needed. Earlyclobbers (see info gcc --index-search='Constraint Modifier') tell the compiler that some input is used after the particular output has been overwritten. SQRADDDB uses all inputs twice, the second time after changing all 3 output registers, so you need 3 earlyclobbers. Without this the compiler could e.g. assign the same hard register to an output and input, provided it has the same value. Consider say: int output1 = 6, output2 = 4, input1 = 3, input2 = 6; asm ("addq %2, %0; addcq %3, %1" : "=r" (output1), "=r" (output2) : "r" (input1), "r" (input2), "0" (output1), "1" (output2)); The constraints are incorrect, because the compiler could very well assign e.g. output1 = %rdi, output2 = %rsi, input1 = %rcx, input2 = %rdi movl 6, %rdi; movl 4, %rsi; movl 3, %rcx; addq %rcx, %rdi; addcq %rdi, %rsi whic won't add 6 to %rsi, but 9. In this pattern only one earlyclobber would be needed, "=&r" (output1) to tell gcc that output1 must not be shared with any input regs. "holes" really don't matter, unless you need more than 10 operands, which would force you to use %[name] style.
I see that latest rawhide now contains the updated gcc-4.3.0-9, which has the fix for this bug, according to changelog. I tested unmodified NSS, and I confirm it works correctly.