Created attachment 651995 [details]
Test program for benchmarks

The matrix-vector and matrix-matrix multiply operations are for some reason extremely slow in ATLAS, i.e., much slower than the intrinsic Fortran function or hand-written for loops. When linked to e.g. OpenBLAS or the Intel MKL, the test case works properly.
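For context, here is a minimal sketch of the kind of comparison the attached test program makes (the attachment itself is not reproduced here, so the program name, matrix size, and timing loop are assumptions): it times the BLAS/ATLAS Fortran interface (DGEMV, DGEMM) against the intrinsic matmul for the same operands.

program bench_blas
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000          ! illustrative size
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:), x(:), y(:)
  real(dp) :: t0, t1
  external :: dgemv, dgemm

  allocate(a(n,n), b(n,n), c(n,n), x(n), y(n))
  call random_number(a)
  call random_number(b)
  call random_number(x)

  ! Matrix-vector multiply, y = A*x, through the BLAS/ATLAS Fortran interface.
  y = 0.0_dp
  call cpu_time(t0)
  call dgemv('N', n, n, 1.0_dp, a, n, x, 1, 0.0_dp, y, 1)
  call cpu_time(t1)
  print '(a,es12.4)', 'dgemv  : ', t1 - t0

  ! Matrix-matrix multiply, C = A*B, through DGEMM.
  c = 0.0_dp
  call cpu_time(t0)
  call dgemm('N', 'N', n, n, n, 1.0_dp, a, n, b, n, 0.0_dp, c, n)
  call cpu_time(t1)
  print '(a,es12.4)', 'dgemm  : ', t1 - t0

  ! Same product with the intrinsic matmul, for comparison.
  call cpu_time(t0)
  c = matmul(a, b)
  call cpu_time(t1)
  print '(a,es12.4)', 'matmul : ', t1 - t0
end program bench_blas

Linking the same source against the reference BLAS, ATLAS, or OpenBLAS (the exact link flags depend on the packaging) is what makes the library-to-library comparison possible.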
I've reproduced the problem on EL5 and EL6 as well.
Created attachment 652698 [details]
Benchmark result: packaged and custom-built Atlas 3.8

The matrix size was modified to keep the run time bearable without heavy swapping; this might put Atlas at a disadvantage.
Thank you for this benchmark. Performance is poor. The speed of the custom-built Atlas is at most twice that of the packaged Fedora 17 Atlas, so there is not much room to improve the packaging. Either this is not exactly the use case where Atlas should shine, or it truly does not keep its promise of top speed.
Indeed... Comparing the performance to OpenBLAS (bug #739398, based on the blazing-fast GotoBLAS) really blows your socks off. E.g.:

matrix-vector (no transpose for matrix), N = 10000
  blas      4.199071e+00
  atlas     2.095580e+00
  openblas  5.367314e-01
  for       9.848689e-02
  intr      9.298230e-02

matrix-matrix, N = 1000
  atlas     4.227416e+01
  blas      4.118667e+01
  for       6.518103e+00
  intr      7.458387e-01
  openblas  5.499128e-02

So in the matrix-matrix multiplication OpenBLAS is faster by THREE ORDERS OF MAGNITUDE.
Tested with the new Atlas 3.10, custom-built with AVX - at most only about a 5x speed increase. Do you remember some older, much faster version of Atlas? That would indicate the current state is indeed a solvable bug.
I'm afraid not... This is a thing I've never tried before; I only stumbled into it as part of my preparations for a programming course I'm giving.
This is the author of ATLAS. Someone at Red Hat sent me e-mail directly rather than submitting to the ATLAS tracker, and that e-mail was lost. If you still need help, submit the problem to the support tracker rather than e-mailing me directly:
http://math-atlas.sourceforge.net/faq.html#help

I will comment that something is definitely wrong. I suspect the Red Hat package is crippled in some way, or the timing isn't right. Here is ATLAS dgemm vs. the F77BLAS (first call to the F77BLAS, second to ATLAS):

./xdl3blastst -N 200 2000 200
--------------------------------- GEMM ----------------------------------
TST# A B    M    N    K ALPHA  LDA  LDB  BETA  LDC  TIME    MFLOP  SpUp  TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== ===== ======== ===== =====
   0 N N  200  200  200   1.0 2000 2000   1.0 2000  0.01   2007.6  1.00 -----
   0 N N  200  200  200   1.0 2000 2000   1.0 2000  0.00  12044.8  6.00  PASS
   1 N N  400  400  400   1.0 2000 2000   1.0 2000  0.06   2034.2  1.00 -----
   1 N N  400  400  400   1.0 2000 2000   1.0 2000  0.01  16497.2  8.11  PASS
   2 N N  600  600  600   1.0 2000 2000   1.0 2000  0.22   1985.3  1.00 -----
   2 N N  600  600  600   1.0 2000 2000   1.0 2000  0.02  18630.7  9.38  PASS
   3 N N  800  800  800   1.0 2000 2000   1.0 2000  0.56   1813.0  1.00 -----
   3 N N  800  800  800   1.0 2000 2000   1.0 2000  0.05  19869.2 10.96  PASS
   4 N N 1000 1000 1000   1.0 2000 2000   1.0 2000  1.14   1747.5  1.00 -----
   4 N N 1000 1000 1000   1.0 2000 2000   1.0 2000  0.10  20067.7 11.48  PASS
   5 N N 1200 1200 1200   1.0 2000 2000   1.0 2000  1.96   1760.2  1.00 -----
   5 N N 1200 1200 1200   1.0 2000 2000   1.0 2000  0.17  20578.7 11.69  PASS
   6 N N 1400 1400 1400   1.0 2000 2000   1.0 2000  3.09   1776.2  1.00 -----
   6 N N 1400 1400 1400   1.0 2000 2000   1.0 2000  0.27  20241.1 11.40  PASS
   7 N N 1600 1600 1600   1.0 2000 2000   1.0 2000  4.58   1790.0  1.00 -----
   7 N N 1600 1600 1600   1.0 2000 2000   1.0 2000  0.39  20822.7 11.63  PASS
   8 N N 1800 1800 1800   1.0 2000 2000   1.0 2000  6.40   1821.6  1.00 -----
   8 N N 1800 1800 1800   1.0 2000 2000   1.0 2000  0.56  20923.6 11.49  PASS
   9 N N 2000 2000 2000   1.0 2000 2000   1.0 2000  8.54   1874.0  1.00 -----
   9 N N 2000 2000 2000   1.0 2000 2000   1.0 2000  0.77  20800.6 11.10  PASS

So, ATLAS is more than 11 times the speed of the F77BLAS. Last time I checked, ATLAS 3.10.0 *is* slower than MKL on this platform, but by a modest amount (maybe 8-15% or so?). The above times are for serial ATLAS. If you link to the parallel ATLAS library, then the gap is even more ridiculous (ATLAS more than 40 times faster than the F77BLAS).

Try building ATLAS from the official tarfile, and you should get performance like this. You can build the above timer in BLDdir/bin with "make xdl3blastst".

I'm working on a new GEMM for the 3.11 series that should close this gap with MKL, but it won't be ready for a bit. In any case, ATLAS has never been slower than the F77BLAS at any time with any compiler.

Regards,
Clint
On my machine, ATLAS cblas_dgemm needs less than 2 seconds to multiply square matrices with 1851 rows. The same ATLAS dgemm (Fortran interface instead of C interface) needs 260 seconds. The intrinsic Fortran matmul needs less than 5 seconds. There is definitely a Fortran-related bug somewhere (ATLAS, benchmark, compiler...). I cannot speak Fortran, but I will do my best to learn.
You use double precision matrices and the double precision Atlas function, but single precision constants in your Fortran code. A good opportunity to show your students type casting, memory corruption, and debugging techniques. Closing as NOTABUG.
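For illustration, a minimal sketch of the mismatch (the attached test program is not reproduced here, so the program name and matrix size are made up): DGEMM expects DOUBLE PRECISION ALPHA and BETA, and because the routine is called through an implicit Fortran 77 style interface, nothing warns when default (single precision) REAL literals are passed instead.

program kind_mismatch
  implicit none
  integer, parameter :: n = 4
  double precision :: a(n,n), b(n,n), c(n,n)
  external :: dgemm

  call random_number(a)
  call random_number(b)
  c = 0.0d0

  ! Buggy form: 1.0 and 0.0 are default (single precision) REAL literals,
  ! but DGEMM expects DOUBLE PRECISION ALPHA/BETA. With an implicit
  ! interface the compiler cannot flag this, so DGEMM reads garbage scalars:
  !   call dgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c, n)

  ! Correct form: pass DOUBLE PRECISION constants.
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

  print *, 'c(1,1) =', c(1,1)
end program kind_mismatch

Declaring an explicit interface for dgemm (for example via an interface block or a BLAS95-style module) would let the compiler catch the kind mismatch at compile time.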
Yes, I noticed this over the weekend, since the calculations weren't reproducing the same results. I'll rerun the tests to see if the speed has changed.