From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.7.3) Gecko/20041020 Firefox/0.10.1

Description of problem:
It appears that gcc in FC3 miscompiles numerical code. The problem appears to be with the lapack libraries and can be demonstrated with octave (which uses them):

[dima@localhost ~]$ octave
GNU Octave, version 2.1.57 (i686-pc-linux-gnu).
Copyright (C) 2004 John W. Eaton.
....
octave:1> a=rand(100);
octave:2> tic; eig(a); toc
error: dgeev failed to converge
octave:2>
---------------

Sometimes it just hangs there for a few minutes, after which I kill it. I tried to compile octave myself against the ATLAS libraries (a different, optimized blas/lapack implementation), which I compiled myself as well. The result was the same. I also tried different versions of octave. This all works on RHEL3, RH9, FC2, FC1. It is possible that the problem is actually with glibc. I was not able to recompile octave with gcc33 to check that.

Version-Release number of selected component (if applicable):
gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)

How reproducible:
Always

Steps to Reproduce:
1. start octave
2. type at the octave prompt as shown
3.

Actual Results: octave hangs or gives an error

Expected Results: "0.004" (may vary slightly) -- this is the 4 msec that it took the code to run on FC2.

Additional info: this is on athlon/xp 2000MHz/500Meg.
*** Bug 138685 has been marked as a duplicate of this bug. ***
It is me again. Recompiling both lapack and octave with FFLAGS="-O -ffloat-store" seems to solve this problem.
Then it is IMHO not a GCC bug. -ffloat-store is not the default on purpose, it is too slow and most of the software out there doesn't need it. IMHO you want to open a bug against lapack (and/or octave) and request that it be compiled in two versions on IA-32: -ffloat-store and -mfpmath=sse -msse2 (the latter for P4 & recent AMD CPUs).
The issue here is that lapack has been around for a while and is compiled by previous generations of gcc, as well as a bunch of other compilers, just fine. Suddenly it breaks, which makes me think that the default compiler options are not "safe." When the clock chimes 13 times it is not the 13th chime that is broken, it is the clock that needs fixing. I did catch lapack, but who knows what else might be broken? As for the speed part -- I heard rumors that -ffloat-store is slow, but in fact all my benchmarks now run at the same speed as always (and some, which involve iterations, like Schur decomposition, run about 20% faster because of the faster convergence). Anyway, I did not think of -ffloat-store as a fix, but rather as a workaround for a potentially more serious problem with gcc.
Beg to disagree. For code that relies on computation not being done with extra precision -ffloat-store is a must and lapack clearly relies on it. Why things worked in this particular case with < GCC 3.4 and don't work anymore is most probably because GCC is now better at optimizing and likely will have less spills to memory (and only spills to memory on the mis-designed i387 FPU round to the declared precision instead of using full long double precision). If you give me one exact routine in lapack that causes the problems you are seeing, I can look at it in detail and tell exactly what is going on. But I certainly don't intend to debug half of octave/lapack to figure it out.
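The spill effect described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the lapack code path: Python's `Fraction` stands in for a value held exactly in an x87 register (64-bit significand), `float()` stands in for the rounding to a 53-bit double that happens when the value is stored to memory, and `still_changing` is a hypothetical convergence test invented for the example.

```python
from fractions import Fraction

# Hypothetical intermediate result: 1 + 2**-60 needs 61 significant bits,
# so it fits an x87 register (64-bit significand) but not a double (53 bits).
in_register = Fraction(1) + Fraction(1, 2**60)

# float() rounds to the nearest IEEE double; here it stands in for a spill to memory.
spilled = Fraction(float(in_register))


def still_changing(value):
    """Toy convergence test: has the iterate moved away from 1?"""
    return value > 1


print(still_changing(in_register))  # True: the extra bits are still there
print(still_changing(spilled))      # False: the store rounded them away
```

The same quantity gives two different answers depending on whether the compiler happened to keep it in a register, which is why code that assumes "declared precision everywhere" can change behaviour when the optimizer gets better at avoiding spills.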
Created attachment 106477: lapack file causing the problem
Can you please also specify what exact arguments you are calling that routine with? Thanks.
All my comments are gone -- I will try again. I understand that the problem is (most likely) due to aggressive optimization. My concern was that it is _too_ aggressive and produces wrong code. The error which I posted was in the file dgeev.f (from lapack). I attached the file for reference. I am going to write to the octave mailing list about it. Perhaps it would be good to involve the lapack people as well. I did file this as a bug against lapack: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=138791 Thanks for your attention and a fast response. Sincerely, Dmitri.
Since I do not call it directly (I use octave, which calls this lapack routine), it is hard for me to tell you exactly how it is called. I am going to write to the Octave mailing list, and perhaps John Eaton (the Octave author) can give you an authoritative answer. Sincerely, Dmitri.
Created attachment 109763: Simple test code which causes LAPACK to hang
Compile with:
    f77 test_svd.f -llapack -o test_svd_fedora
I just created an attachment (sorry that I didn't put all the comments there, I thought it would bring me back here). The above line will produce a binary linked against the dynamic lapack/blas in FC3:

planck[libmwrep]> ldd test_svd_fedora
        liblapack.so.3 => /usr/lib/liblapack.so.3 (0x003d6000)
        libg2c.so.0 => /usr/lib/libg2c.so.0 (0x00db6000)
        libm.so.6 => /lib/tls/libm.so.6 (0x00d30000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x002f9000)
        libc.so.6 => /lib/tls/libc.so.6 (0x00c04000)
        libblas.so.3 => /usr/lib/libblas.so.3 (0x00101000)
        /lib/ld-linux.so.2 (0x00beb000)

If I run this exact same binary on a RedHat9.0 box, it runs fine:

kellogg[libmwrep]> ./test_svd_atlas
Entering dgesvd. If this takes more than a second or two it means it has hanged. Kill it with Ctrl-C
dgesvd finished
svals: 4. 8.32667268E-17 4.67733824E-51 1.52114348E-84

However, on a Fedora3 machine it hangs forever, with 100% cpu utilization. I hope this can be fixed soon with a new, correctly compiled lapack release. Many thanks in advance, f.
You can get lapack from rawhide (the development tree), which does not have this problem. I personally added -ffloat-store to the FFLAGS in the src.rpm and rebuilt the rpm file. Works fine:

[dima@tumbleweed bug]$ ./test_svd_fedora
Entering dgesvd. If this takes more than a second it means it has hanged. Kill it with Ctrl-C
dgesvd finished
svals: 4. 8.32667268E-17 4.67733824E-51 1.52114348E-84

I still do not understand why the fix is not in the update tree... Dmitri.
BTW, the bug for lapack is: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=138447
Dmitri, thanks for the lapack bug number. Unfortunately it is not true that the problem is solved, as is claimed in that bug. I hope they reopen it and actually fix it. Cheers, f
OK, I just fished the -28 blas/lapack RPMs out of a 'development' fedora repo, and slapped them onto my FC3 box. The problem is indeed solved by them. But they should be backported to FC3-updates, please. These corrected packages have been available since 12/21, and I would not have wasted a whole day tracking this (which I thought was a bug in my code) if only they had been made available in updates. Yum would have nicely picked them up right away. Please put these corrected -28 blas/lapack packages in the released updates repos for FC3. And many thanks to Dmitri for your help!
I wonder if this is related to: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=5900
See https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=146447 for analysis and a trivial test program. When lapack is determining hardware accuracy, it goes through the jumps and calls the trivial DLAMC3 to enforce memory storage instead of just doing the addition:

      DOUBLE PRECISION FUNCTION DLAMC3( A, B )
      DOUBLE PRECISION A, B
      DLAMC3 = A + B
      RETURN

What is interesting: compiling the test code from 146447 with the intel fortran compiler never exposes the problem, even when compiled with the most aggressive optimization options. gcc generates the following code:

gcc:
        pushl   %ebp
        movl    %esp, %ebp
        movl    12(%ebp), %edx
        fldl    (%edx)
        movl    8(%ebp), %eax
        faddl   (%eax)
        leave
        ret

while ifort:

dlamc3_:
# parameter 1: %eax
# parameter 2: %edx
..B2.1:                         # Preds ..B2.0
        movl    4(%esp), %eax           #187.32
        movl    8(%esp), %edx           #187.32
        .globl dlamc3_.
dlamc3_.:                               #
        subl    $8, %esp                #187.32
        movsd   (%eax), %xmm0           #215.6
        addsd   (%edx), %xmm0           #215.6
        movsd   %xmm0, (%esp)           #215.6
        fldl    (%esp)                  #215.6
        addl    $8, %esp                #221.6
        ret                             #221.6

Perhaps the parameters are passed differently?
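To make the accuracy-probing point concrete, here is a minimal sketch of an epsilon search run at two significand widths. It is a simplified halving loop, not the actual algorithm in lapack's DLAMC routines, and the helpers `round_sig` and `find_epsilon` are names made up for this illustration. Exact rationals are rounded to 53 bits (a double stored to memory, or the SSE2 result in the ifort listing) or to 64 bits (a sum left on the x87 stack, as in the gcc listing).

```python
from fractions import Fraction

DOUBLE_BITS = 53  # significand width of an IEEE double
X87_BITS = 64     # significand width of an x87 extended-precision register


def round_sig(x: Fraction, bits: int) -> Fraction:
    """Round a positive rational to `bits` significant binary digits (ties to even)."""
    # Estimate floor(log2(x)) from the bit lengths, then correct the estimate.
    e = x.numerator.bit_length() - x.denominator.bit_length()
    while Fraction(2) ** e > x:
        e -= 1
    while Fraction(2) ** (e + 1) <= x:
        e += 1
    ulp = Fraction(2) ** (e - bits + 1)  # spacing of representable values near x
    return round(x / ulp) * ulp          # round() on a Fraction rounds half to even


def find_epsilon(bits: int) -> Fraction:
    """Halve eps until 1 + eps/2 rounds back to 1 at the given width."""
    one = Fraction(1)
    eps = Fraction(1)
    while round_sig(one + eps / 2, bits) > one:
        eps /= 2
    return eps


print(find_epsilon(DOUBLE_BITS) == Fraction(1, 2**52))  # True
print(find_epsilon(X87_BITS) == Fraction(1, 2**63))     # True
```

If the probe sees the 64-bit result while the rest of the library works on 53-bit doubles, the measured epsilon is 2048 times too small, and a convergence tolerance derived from it may never be met. That would be consistent with the "failed to converge" error and the hangs reported above, though this sketch does not prove that is the exact failure path.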
Fedora Core 3 is now maintained by the Fedora Legacy project for security updates only. If this problem is a security issue, please reopen and reassign to the Fedora Legacy product. If it is not a security issue and hasn't been resolved in the current FC5 updates or in the FC6 test release, reopen and change the version to match. Thank you!
This did get pushed to FC3 as an update as of lapack-3.0-26.fc3. It's fixed in FC4 and forwards. Since FC3 is Fedora Legacy for security updates only, and it was fixed as an update, closing the bug.