Bug 1889069 - Numpy significantly slower when using FlexiBLAS instead of OpenBLAS directly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: flexiblas
Version: 33
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Iñaki Ucar
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-17 11:28 UTC by Christian Dersch
Modified: 2020-10-26 01:06 UTC (History)

Fixed In Version: flexiblas-3.0.4-1.fc33
Clone Of:
Environment:
Last Closed: 2020-10-26 01:06:18 UTC
Type: Bug
Embargoed:



Description Christian Dersch 2020-10-17 11:28:50 UTC
Description of problem:



Version-Release number of selected component (if applicable):


How reproducible: always


Steps to Reproduce:
1. Install python3-numpy on Fedora 33
2. Run python3 script https://gist.githubusercontent.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276/raw/660904cb770197c3c841ab9b7084657b1aea5f32/numpy-benchmark.py
3. Note down the execution times

# Now change backend to openblas-threads (as this one is used directly in F32)
4. Install flexiblas-openblas-threads
5. Change backend in /etc/flexiblasrc to "openblas-threads"
6. Run the Python script from step 2 again
7. Note down execution times

8. Install python3-numpy on F32, or build it on F33 directly against threaded openblas
9. Run the script from step 2 again
10. Note down execution times

11. Compare results
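
As a rough sketch of steps 4 to 7 without editing /etc/flexiblasrc: FlexiBLAS can also pick its backend from the FLEXIBLAS environment variable, so the same script can be timed under different backends from one wrapper. The helper name `run_with_backend` and the local file name `numpy-benchmark.py` are assumptions for illustration, and the backend name must match one that is actually installed on the system.

```python
import os
import subprocess
import sys
import time


def run_with_backend(backend, argv):
    """Run argv with FLEXIBLAS=<backend> in its environment.

    Returns (elapsed_seconds, captured_stdout). Hypothetical helper,
    not part of FlexiBLAS or NumPy.
    """
    env = dict(os.environ, FLEXIBLAS=backend)
    start = time.perf_counter()
    result = subprocess.run(
        argv, env=env, capture_output=True, text=True, check=True
    )
    return time.perf_counter() - start, result.stdout


# Example call (assumes the gist from step 2 was saved locally under this name):
#   seconds, output = run_with_backend(
#       "openblas-threads", [sys.executable, "numpy-benchmark.py"]
#   )
#   print(output)
```

The benchmark runs in a child process because the backend is chosen when the library is loaded, so changing the variable inside an already running interpreter would likely have no effect.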

Actual results: 
Some operations, such as eigendecomposition, are much slower with FlexiBLAS in between.

F33 with flexiblas and openblas-threads backend:
Dotted two 4096x4096 matrices in 3.36 s.
Dotted two vectors of length 524288 in 0.75 ms.
SVD of a 2048x1024 matrix in 10.47 s.
Cholesky decomposition of a 2048x2048 matrix in 1.62 s.
Eigendecomposition of a 2048x2048 matrix in 51.59 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['flexiblas', 'flexiblas']
    library_dirs = ['/usr/lib64']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/lib64']
blas_opt_info:
    libraries = ['flexiblas', 'flexiblas']
    library_dirs = ['/usr/lib64']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/lib64']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['flexiblas', 'flexiblas']
    library_dirs = ['/usr/lib64']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/lib64']
lapack_opt_info:
    libraries = ['flexiblas', 'flexiblas']
    library_dirs = ['/usr/lib64']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/lib64']
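
To check quickly which library a given NumPy build reports, the configuration dump above can be parsed. The following is a minimal sketch that assumes the text has exactly the layout shown here (a `*_info:` header followed by indented `key = value` lines); `linked_libraries` is a hypothetical helper, and newer NumPy releases may print a different layout.

```python
import ast
import re


def linked_libraries(config_text):
    """Map each '<name>_info' section to its 'libraries' list.

    Sections marked NOT AVAILABLE map to an empty list.
    """
    sections = {}
    current = None
    for line in config_text.splitlines():
        header = re.match(r"^(\w+_info):\s*$", line)
        if header:
            current = header.group(1)
            sections[current] = []
            continue
        entry = re.match(r"^\s+libraries\s*=\s*(\[.*\])\s*$", line)
        if entry and current is not None:
            # The value is a Python list literal, so literal_eval is enough.
            sections[current] = ast.literal_eval(entry.group(1))
    return sections
```

The input text could be obtained by capturing the output of `numpy.show_config()`, for example with `contextlib.redirect_stdout`. For the dump above, every available section would report `['flexiblas', 'flexiblas']`.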

F33 with python3-numpy linked against libopenblasp directly (or F32 default, very similar):
Dotted two 4096x4096 matrices in 2.95 s.
Dotted two vectors of length 524288 in 0.57 ms.
SVD of a 2048x1024 matrix in 1.70 s.
Cholesky decomposition of a 2048x2048 matrix in 0.22 s.
Eigendecomposition of a 2048x2048 matrix in 14.05 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblasp', 'openblasp']
    library_dirs = ['/usr/lib64']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/lib64']
blas_opt_info:
    libraries = ['openblasp', 'openblasp']
    library_dirs = ['/usr/lib64']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/lib64']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblasp', 'openblasp']
    library_dirs = ['/usr/lib64']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/lib64']
lapack_opt_info:
    libraries = ['openblasp', 'openblasp']
    library_dirs = ['/usr/lib64']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/usr/lib64']




Expected results: Performance is similar with and without flexiblas
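
To quantify how far the two runs are from "similar", the reported timings can be turned into slowdown factors. This sketch only re-uses the numbers quoted above; the names `parse_timings` and `slowdown` are illustrative.

```python
import re

FLEXIBLAS_RUN = """\
Dotted two 4096x4096 matrices in 3.36 s.
Dotted two vectors of length 524288 in 0.75 ms.
SVD of a 2048x1024 matrix in 10.47 s.
Cholesky decomposition of a 2048x2048 matrix in 1.62 s.
Eigendecomposition of a 2048x2048 matrix in 51.59 s.
"""

DIRECT_RUN = """\
Dotted two 4096x4096 matrices in 2.95 s.
Dotted two vectors of length 524288 in 0.57 ms.
SVD of a 2048x1024 matrix in 1.70 s.
Cholesky decomposition of a 2048x2048 matrix in 0.22 s.
Eigendecomposition of a 2048x2048 matrix in 14.05 s.
"""


def parse_timings(text):
    """Return {operation: seconds} from benchmark output lines."""
    timings = {}
    for line in text.splitlines():
        match = re.match(r"^(.*) in ([\d.]+) (ms|s)\.$", line)
        if match:
            scale = 1e-3 if match.group(3) == "ms" else 1.0
            timings[match.group(1)] = float(match.group(2)) * scale
    return timings


def slowdown(slow_text, fast_text):
    """Return {operation: slow_time / fast_time} for operations in both runs."""
    slow, fast = parse_timings(slow_text), parse_timings(fast_text)
    return {op: slow[op] / fast[op] for op in slow if op in fast}


for operation, factor in slowdown(FLEXIBLAS_RUN, DIRECT_RUN).items():
    print(f"{factor:5.2f}x  {operation}")
```

With these numbers the matrix and vector products are within roughly 1.1x to 1.3x, while SVD, Cholesky and eigendecomposition are about 6.2x, 7.4x and 3.7x slower through FlexiBLAS.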

Comment 1 Christian Dersch 2020-10-17 11:30:37 UTC
It looks like SVD and eigendecomposition run on one core only with FlexiBLAS (with both the OpenMP and the Threads backends), while other operations such as matrix multiplication still run on all cores with FlexiBLAS.
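
One way to check the single-core observation without watching a process monitor is to compare process CPU time with wall-clock time for a single call: a ratio near 1 suggests one busy core, and a ratio near the core count suggests all cores were used. This is a minimal sketch; `cpu_to_wall_ratio` is a hypothetical helper, and the NumPy call in the comment is only an example workload.

```python
import time


def cpu_to_wall_ratio(work):
    """Call work() once and return (process CPU time) / (wall-clock time).

    time.process_time() sums CPU time over all threads of the process,
    so multithreaded BLAS/LAPACK calls push the ratio above 1.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall


# Example workload (assumes NumPy is installed):
#   import numpy as np
#   a = np.random.rand(2048, 2048)
#   print(cpu_to_wall_ratio(lambda: np.linalg.eig(a)))
```

The ratio is only a coarse indicator, since other load on the machine and time spent waiting both lower it.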

Comment 2 Iñaki Ucar 2020-10-17 14:09:06 UTC
Confirmed here, thanks for the report. I've opened an issue upstream: https://github.com/mpimd-csc/flexiblas/issues/7

Comment 3 Iñaki Ucar 2020-10-21 15:33:17 UTC
The issue has been identified and a fix is underway.

Comment 4 Christian Dersch 2020-10-22 12:56:22 UTC
I tried the new release 3.0.4 (easy to rebuild; no spec changes required except the version), and it fixes the issue :)

Comment 5 Fedora Update System 2020-10-22 14:36:48 UTC
FEDORA-2020-cd5d97c1e4 has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2020-cd5d97c1e4

Comment 6 Fedora Update System 2020-10-23 23:40:12 UTC
FEDORA-2020-cd5d97c1e4 has been pushed to the Fedora 33 testing repository.
In a short time you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2020-cd5d97c1e4`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2020-cd5d97c1e4

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 7 Fedora Update System 2020-10-26 01:06:18 UTC
FEDORA-2020-cd5d97c1e4 has been pushed to the Fedora 33 stable repository.
If the problem still persists, please make note of it in this bug report.

