Bug 479935 - mvapich / scalapack: mpi error when running xznep, xcnep
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: mvapich
Version: 5.3
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: Doug Ledford
Reported: 2009-01-14 02:10 EST by Mehdi Bozzo-Rey
Modified: 2009-09-02 05:51 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-09-02 05:51:30 EDT
Description Mehdi Bozzo-Rey 2009-01-14 02:10:57 EST
Description of problem: MPI error when running two specific test cases (xznep and xcnep) that are part of the ScaLAPACK test suite


Version-Release number of selected component (if applicable):


How reproducible: always


Steps to Reproduce:
1. Install mvapich.
2. Recompile ScaLAPACK and run the test cases.
  
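Actual results: xznep and xcnep both fail with "MPI_RECV : Invalid buffer pointer". In each run the abort comes right after the N=6 tests on the 2x2 process grid have passed.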


Expected results: all test cases pass


Additional info:

[mbozzore@compute-0-11 scalapack]$ mpirun_rsh -ssh -np 4 -hostfile ./hosts GFORTRAN_UNBUFFERED_ALL=y ./xznep
ScaLAPACK QSQ^H by Schur Decomposition.
'MPI machine'

Tests of the parallel complex double precision Schur decomposition.
The following scaled residual checks will be computed:
 Residual               = ||H-QSQ^H|| / (||H|| * eps * N )
 Orthogonality residual = ||I - Q^HQ|| / ( eps * N )
The matrix A is randomly generated for each test.

An explanation of the input/output parameters follows:
TIME    : Indicates whether WALL or CPU time was used.
N       : The number of columns in the matrix A.
NB      : The size of the square blocks the matrix A is split into.
P       : The number of process rows.
Q       : The number of process columns.
THRESH  : If a residual value is less than THRESH, CHECK is flagged as PASSED
NEP time : Time in seconds to decompose the  matrix
MFLOPS  : Rate of execution

The following parameter values will be used:
  N       :             1     2     3     4     6    10    50
  NB      :             6     8    17
  P       :             1     2
  Q       :             1     2

Relative machine precision (eps) is taken to be       0.111022E-15
Routines pass computational tests if scaled residual is less than   20.000

TIME     N  NB    P    Q NEP Time   MFLOPS  CHECK
---- ----- --- ---- ---- -------- -------- ------

WALL     1   6    1    1     0.00     1.00 PASSED
WALL     1   8    1    1     0.00     6.00 PASSED
WALL     1  17    1    1     0.00     6.00 PASSED
WALL     2   6    1    1     0.00     1.62 PASSED
WALL     2   8    1    1     0.00     8.47 PASSED
WALL     2  17    1    1     0.00     9.00 PASSED
WALL     3   6    1    1     0.00     2.64 PASSED
WALL     3   8    1    1     0.00     4.23 PASSED
WALL     3  17    1    1     0.00     4.12 PASSED
WALL     4   6    1    1     0.00     7.63 PASSED
WALL     4   8    1    1     0.00     7.73 PASSED
WALL     4  17    1    1     0.00     7.89 PASSED
WALL     6   6    1    1     0.00    13.14 PASSED
WALL     6   8    1    1     0.00    13.99 PASSED
WALL     6  17    1    1     0.00    13.84 PASSED
WALL    10   6    1    1     0.00    28.71 PASSED
WALL    10   8    1    1     0.00    30.98 PASSED
WALL    10  17    1    1     0.00    33.15 PASSED
WALL    50   6    1    1     0.01   259.49 PASSED
WALL    50   8    1    1     0.01   298.96 PASSED
WALL    50  17    1    1     0.01   333.23 PASSED
WALL     1   6    2    2     0.00     0.86 PASSED
WALL     1   8    2    2     0.00     1.64 PASSED
WALL     1  17    2    2     0.00     1.64 PASSED
WALL     2   6    2    2     0.00     0.90 PASSED
WALL     2   8    2    2     0.00     2.72 PASSED
WALL     2  17    2    2     0.00     2.82 PASSED
WALL     3   6    2    2     0.00     1.50 PASSED
WALL     3   8    2    2     0.00     1.79 PASSED
WALL     3  17    2    2     0.00     1.80 PASSED
WALL     4   6    2    2     0.00     3.57 PASSED
WALL     4   8    2    2     0.00     3.65 PASSED
WALL     4  17    2    2     0.00     3.65 PASSED
WALL     6   6    2    2     0.00     7.28 PASSED
WALL     6   8    2    2     0.00     8.13 PASSED
WALL     6  17    2    2     0.00     8.38 PASSED
2 - MPI_RECV : Invalid buffer pointer
[2] [] Aborting Program!
0 - MPI_RECV : Invalid buffer pointer
[0] [] Aborting Program!
Exit code -3 signaled from compute-0-11
Killing remote processes...Abort signaled by rank 0:  Aborting program !
Abort signaled by rank 2:  Aborting program !
MPI process terminated unexpectedly
DONE
[mbozzore@compute-0-11 scalapack]$ Signal 15 received.
Signal 15 received.

[mbozzore@compute-0-11 scalapack]$ mpirun_rsh -ssh -np 4 -hostfile ./hosts GFORTRAN_UNBUFFERED_ALL=y ./xcnep
ScaLAPACK QSQ^H by Schur Decomposition.
'MPI machine'

Tests of the parallel complex single precision Schur decomposition.
The following scaled residual checks will be computed:
 Residual               = ||H-QSQ^H|| / (||H|| * eps * N )
 Orthogonality residual = ||I - Q^HQ|| / ( eps * N )
The matrix A is randomly generated for each test.

An explanation of the input/output parameters follows:
TIME    : Indicates whether WALL or CPU time was used.
N       : The number of columns in the matrix A.
NB      : The size of the square blocks the matrix A is split into.
P       : The number of process rows.
Q       : The number of process columns.
THRESH  : If a residual value is less than THRESH, CHECK is flagged as PASSED
NEP time : Time in seconds to decompose the  matrix
MFLOPS  : Rate of execution

The following parameter values will be used:
  N       :             1     2     3     4     6    10    50
  NB      :             6     8    17
  P       :             1     2
  Q       :             1     2

Relative machine precision (eps) is taken to be       0.596046E-07
Routines pass computational tests if scaled residual is less than   20.000

TIME     N  NB    P    Q NEP Time   MFLOPS  CHECK
---- ----- --- ---- ---- -------- -------- ------

WALL     1   6    1    1     0.00     0.90 PASSED
WALL     1   8    1    1     0.00     6.00 PASSED
WALL     1  17    1    1     0.00     4.50 PASSED
WALL     2   6    1    1     0.00     2.03 PASSED
WALL     2   8    1    1     0.00     9.00 PASSED
WALL     2  17    1    1     0.00     9.00 PASSED
WALL     3   6    1    1     0.00     3.04 PASSED
WALL     3   8    1    1     0.00     5.59 PASSED
WALL     3  17    1    1     0.00     5.86 PASSED
WALL     4   6    1    1     0.00    10.19 PASSED
WALL     4   8    1    1     0.00    10.47 PASSED
WALL     4  17    1    1     0.00    10.47 PASSED
WALL     6   6    1    1     0.00    16.07 PASSED
WALL     6   8    1    1     0.00    16.54 PASSED
WALL     6  17    1    1     0.00    16.00 PASSED
WALL    10   6    1    1     0.00    40.45 PASSED
WALL    10   8    1    1     0.00    42.35 PASSED
WALL    10  17    1    1     0.00    46.39 PASSED
WALL    50   6    1    1     0.01   226.91 PASSED
WALL    50   8    1    1     0.01   384.75 PASSED
WALL    50  17    1    1     0.00   455.56 PASSED
WALL     1   6    2    2     0.00     0.86 PASSED
WALL     1   8    2    2     0.00     1.64 PASSED
WALL     1  17    2    2     0.00     1.64 PASSED
WALL     2   6    2    2     0.00     1.04 PASSED
WALL     2   8    2    2     0.00     2.67 PASSED
WALL     2  17    2    2     0.00     2.77 PASSED
WALL     3   6    2    2     0.00     1.91 PASSED
WALL     3   8    2    2     0.00     2.48 PASSED
WALL     3  17    2    2     0.00     2.44 PASSED
WALL     4   6    2    2     0.00     4.70 PASSED
WALL     4   8    2    2     0.00     4.84 PASSED
WALL     4  17    2    2     0.00     4.72 PASSED
WALL     6   6    2    2     0.00     7.89 PASSED
WALL     6   8    2    2     0.00     8.86 PASSED
WALL     6  17    2    2     0.00     8.92 PASSED
2 - MPI_RECV : Invalid buffer pointer
[2] [] Aborting Program!
0 - MPI_RECV : Invalid buffer pointer
[0] [] Aborting Program!
Exit code -3 signaled from compute-0-11
Killing remote processes...Abort signaled by rank 0:  Aborting program !
Abort signaled by rank 2:  Aborting program !
MPI process terminated unexpectedly
DONE
[mbozzore@compute-0-11 scalapack]$ Signal 15 received.
Signal 15 received.
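
The two logs above can be triaged mechanically. The following is a minimal sketch, not part of the original report, that extracts the result rows and the MPI error lines from such a log and reports the first parameter combination that has no result row. The function names (summarize, first_missing) are hypothetical, and the loop order (process grid outermost, then N, then NB) is an assumption read off the order of the rows printed above, not taken from the ScaLAPACK test driver source.

```python
import re

# One result row of the xznep/xcnep table, e.g.
# "WALL     6  17    2    2     0.00     8.38 PASSED"
ROW = re.compile(
    r"^(WALL|CPU)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+\S+\s+(\w+)\s*$"
)
# One error line as printed in the logs above, e.g.
# "2 - MPI_RECV : Invalid buffer pointer"
ERR = re.compile(r"^(\d+) - (MPI_\w+) : (.+?)\s*$")


def summarize(log_text):
    """Return (rows, errors) parsed from the output of one test run."""
    rows, errors = [], []
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m:
            n, nb, p, q = (int(m.group(i)) for i in range(2, 6))
            rows.append({"N": n, "NB": nb, "P": p, "Q": q, "check": m.group(6)})
            continue
        m = ERR.match(line)
        if m:
            errors.append(
                {"rank": int(m.group(1)), "call": m.group(2), "message": m.group(3)}
            )
    return rows, errors


def first_missing(rows, n_values, nb_values, grids):
    """Return the first (N, NB, P, Q) of the assumed sweep with no result row.

    Assumed sweep order: for each (P, Q) grid, for each N, for each NB.
    Returns None when every combination has a row.
    """
    seen = {(r["N"], r["NB"], r["P"], r["Q"]) for r in rows}
    for p, q in grids:
        for n in n_values:
            for nb in nb_values:
                if (n, nb, p, q) not in seen:
                    return (n, nb, p, q)
    return None
```

With the parameter values printed in the logs (N = 1, 2, 3, 4, 6, 10, 50; NB = 6, 8, 17; grids 1x1 and 2x2), first_missing applied to either log returns (10, 6, 2, 2). Under the loop-order assumption, that suggests the abort happened while running N=10, NB=6 on the 2x2 grid, with the error reported by ranks 0 and 2.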
Comment 1 Mehdi Bozzo-Rey 2009-01-14 02:11:39 EST
Note: this problem was fixed by OSU in the latest version of mvapich1.
Comment 2 Doug Ledford 2009-04-22 20:44:52 EDT
mvapich has been upgraded to 1.1.0-3143, the same version as in OFED 1.4.1-rc3. That should resolve this issue.
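
To check whether a given machine already carries the upgraded build, the installed version string can be compared against 1.1.0-3143. The sketch below is a simplified illustration only: the helper names are hypothetical, it compares purely numeric fields split on "." and "-", and it does not reproduce rpm's own version comparison rules. How the installed version string is obtained is left out.

```python
import re


def parse_version(text):
    """Split a string such as '1.1.0-3143' into a tuple of integers.

    Simplification: every field must be numeric; anything else raises ValueError.
    """
    return tuple(int(part) for part in re.split(r"[.\-]", text))


# Version named in the comment above as the one expected to resolve the issue.
FIXED = parse_version("1.1.0-3143")


def has_fix(installed):
    """Return True if the installed version is at least the fixed version."""
    return parse_version(installed) >= FIXED
```

For example, has_fix("1.1.0-3143") is true, while a made-up older string such as "1.0.0-1" gives false.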
Comment 6 errata-xmlrpc 2009-09-02 05:51:30 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1304.html
