Bug 880237 - Amazingly slow operation of dgemv and dgemm
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: atlas
Version: 18
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Frantisek Kluknavsky
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-11-26 14:55 UTC by Susi Lehtola
Modified: 2012-12-13 13:55 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-12-13 12:52:46 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Test program for benchmarks (2.16 KB, application/x-compressed-tar)
2012-11-26 14:55 UTC, Susi Lehtola
Benchmark result. Packaged and custom built Atlas 3.8. (20.82 KB, text/plain)
2012-11-27 13:20 UTC, Frantisek Kluknavsky

Description Susi Lehtola 2012-11-26 14:55:32 UTC
Created attachment 651995 [details]
Test program for benchmarks

The matrix-vector and matrix-matrix multiply operations are for some reason extremely slow in ATLAS, i.e., much slower than the intrinsic Fortran function or hand-written for loops.

When linked to e.g. OpenBLAS or the Intel MKL, the test case works properly.

Comment 1 Susi Lehtola 2012-11-26 14:56:30 UTC
I've reproduced the problem on EL5 and EL6 as well.

Comment 2 Frantisek Kluknavsky 2012-11-27 13:20:53 UTC
Created attachment 652698 [details]
Benchmark result. Packaged and custom built Atlas 3.8.

Matrix size was reduced to keep run times bearable and avoid heavy swapping. This might put ATLAS at a disadvantage.

Comment 3 Frantisek Kluknavsky 2012-11-27 13:32:49 UTC
Thank you for this benchmark. Performance is poor.
The custom-built ATLAS is at most twice the speed of the packaged Fedora 17 ATLAS, so there is not much room to improve the packaging.
Either this is not exactly the use case where ATLAS should shine, or it truly does not keep its promise of top speed.

Comment 4 Susi Lehtola 2012-11-27 14:16:42 UTC
Indeed... Compare the performance to OpenBLAS (bug #739398, based on the blazing-fast GotoBLAS), which really blows your socks off.

E.g. matrix-vector (no transpose for matrix)
N =    10000
        blas     4.199071e+00
        atlas    2.095580e+00
        openblas 5.367314e-01
        for      9.848689e-02
        intr     9.298230e-02

matrix-matrix
N =     1000
        atlas    4.227416e+01
        blas     4.118667e+01
        for      6.518103e+00
        intr     7.458387e-01
        openblas 5.499128e-02

So in matrix-matrix multiplication, OpenBLAS is faster by THREE ORDERS OF MAGNITUDE.

Comment 5 Frantisek Kluknavsky 2012-11-28 12:22:38 UTC
Tested with the new ATLAS 3.10, custom-built with AVX - at most only about a 5x speed increase.
Do you remember some older, much faster version of ATLAS? That would indicate the current state is indeed a solvable bug.

Comment 6 Susi Lehtola 2012-11-28 13:02:54 UTC
I'm afraid not... I've never tried this before; I only stumbled onto it while preparing for a programming course I'm giving.

Comment 7 Clint Whaley 2012-12-04 23:43:13 UTC
This is the author of ATLAS.  Someone at Red Hat sent me e-mail directly rather than submitting to the ATLAS tracker, and that e-mail was lost.  If you still need help, submit the problem to the support tracker rather than e-mailing me directly:
   http://math-atlas.sourceforge.net/faq.html#help

I will comment that something is definitely wrong.  I suspect the Red Hat package is crippled in some way, or the timing isn't right.  Here is ATLAS dgemm vs. the F77BLAS (first call to f77blas, second to ATLAS):
./xdl3blastst -N 200 2000 200 
--------------------------------- GEMM ----------------------------------
TST# A B    M    N    K ALPHA  LDA  LDB  BETA  LDC  TIME MFLOP SpUp  TEST
==== = = ==== ==== ==== ===== ==== ==== ===== ==== ===== ===== ==== =====
   0 N N  200  200  200   1.0 2000 2000   1.0 2000  0.01 2007.6 1.00 -----
   0 N N  200  200  200   1.0 2000 2000   1.0 2000  0.00 12044.8 6.00 PASS 
   1 N N  400  400  400   1.0 2000 2000   1.0 2000  0.06 2034.2 1.00 -----
   1 N N  400  400  400   1.0 2000 2000   1.0 2000  0.01 16497.2 8.11 PASS 
   2 N N  600  600  600   1.0 2000 2000   1.0 2000  0.22 1985.3 1.00 -----
   2 N N  600  600  600   1.0 2000 2000   1.0 2000  0.02 18630.7 9.38 PASS 
   3 N N  800  800  800   1.0 2000 2000   1.0 2000  0.56 1813.0 1.00 -----
   3 N N  800  800  800   1.0 2000 2000   1.0 2000  0.05 19869.2 10.96 PASS 
   4 N N 1000 1000 1000   1.0 2000 2000   1.0 2000  1.14 1747.5 1.00 -----
   4 N N 1000 1000 1000   1.0 2000 2000   1.0 2000  0.10 20067.7 11.48 PASS 
   5 N N 1200 1200 1200   1.0 2000 2000   1.0 2000  1.96 1760.2 1.00 -----
   5 N N 1200 1200 1200   1.0 2000 2000   1.0 2000  0.17 20578.7 11.69 PASS 
   6 N N 1400 1400 1400   1.0 2000 2000   1.0 2000  3.09 1776.2 1.00 -----
   6 N N 1400 1400 1400   1.0 2000 2000   1.0 2000  0.27 20241.1 11.40 PASS 
   7 N N 1600 1600 1600   1.0 2000 2000   1.0 2000  4.58 1790.0 1.00 -----
   7 N N 1600 1600 1600   1.0 2000 2000   1.0 2000  0.39 20822.7 11.63 PASS 
   8 N N 1800 1800 1800   1.0 2000 2000   1.0 2000  6.40 1821.6 1.00 -----
   8 N N 1800 1800 1800   1.0 2000 2000   1.0 2000  0.56 20923.6 11.49 PASS 
   9 N N 2000 2000 2000   1.0 2000 2000   1.0 2000  8.54 1874.0 1.00 -----
   9 N N 2000 2000 2000   1.0 2000 2000   1.0 2000  0.77 20800.6 11.10 PASS 

So, ATLAS is more than 11 times the speed of the F77BLAS.  Last time I checked, ATLAS 3.10.0 *is* slower than MKL on this platform, but by a modest amount (maybe 8-15% or so?).  The above times are for serial ATLAS.  If you link against the parallel ATLAS library, the gap is even more ridiculous (ATLAS more than 40 times faster than the F77BLAS).

Try building ATLAS from the official tarfile, and you should get performance like this.  You can build the above timer in BLDdir/bin with "make xdl3blastst".

I'm working on a new GEMM for the 3.11 series that should close this gap with MKL, but it won't be ready for a while.  In any case, ATLAS has never been slower than the F77BLAS at any time with any compiler.

Regards,
Clint

Comment 8 Frantisek Kluknavsky 2012-12-11 13:41:52 UTC
On my machine, ATLAS cblas_dgemm needs less than 2 seconds to multiply square matrices with 1851 rows. The same ATLAS dgemm (Fortran interface instead of C interface) needs 260 seconds. The intrinsic Fortran matmul needs less than 5 seconds.
There is definitely a Fortran-related bug somewhere (ATLAS, benchmark, compiler...). I cannot speak Fortran, but I will do my best to learn.

Comment 9 Frantisek Kluknavsky 2012-12-13 12:52:46 UTC
You use double precision matrices and the double precision ATLAS function, but single precision constants in your Fortran code. A good opportunity to show your students type casting, memory corruption and debugging techniques.

Closing as NOTABUG.
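For illustration, a minimal Python sketch (not part of the original report; it assumes a little-endian machine and zero bytes in the memory adjacent to the constant) of what dgemm sees when a 4-byte REAL literal like 1.0 is passed where an 8-byte DOUBLE PRECISION like 1.0d0 is expected:

```python
import struct

# The Fortran benchmark passed single precision literals (1.0 instead of
# 1.0d0) to dgemm. Fortran's external interface does no type checking, so
# dgemm reads 8 bytes at the address of the 4-byte constant: the REAL's
# bit pattern plus 4 bytes of whatever memory happens to follow it.
single_bits = struct.pack("<f", 1.0)   # the 4-byte REAL constant: 00 00 80 3f
adjacent = b"\x00\x00\x00\x00"         # assumed contents of the next 4 bytes
alpha_seen = struct.unpack("<d", single_bits + adjacent)[0]

print(alpha_seen)  # a denormal near zero (~5e-315), not 1.0
```

With alpha effectively zero (or garbage, depending on the adjacent bytes), dgemm computes nonsense, which matches the report that the results failed to reproduce. This also shows why the cblas_dgemm call in Comment 8 was fast and correct: C prototypes convert the constant to double at the call site.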

Comment 10 Susi Lehtola 2012-12-13 13:55:27 UTC
Yes, I noticed this over the weekend, because the calculations didn't reproduce the same results.

I'll rerun the tests to see if the speed has changed.

